Finally, a T-SQL Tuesday post that isn’t all introspection. There will be some because some of the reasons I never tried certain features for detecting change in data is odd considering my love of testing/using features myself. Thank you to the wonderful Meagan Longoria for this invite this month.
For the past year, T-SQL Tuesday has become a favorite thing to get involved with and I love the technical ones as much as I do the ones that require digging into my feelings or history.
The prompt: Share a tip, technique, or lesson learned about how you’ve handled detecting data changes in your technology of choice.
The Challenge
A few months ago, I wrote a post about comparing sets of data in SQL, This focussed on a type of challenge that is often a one off challenge. The techniques lists (along with a tool like Redgate’s SQL Data Compare aren’t the point of this post, but they are related because when you do change detection from a source, it is super important to check your results occasionally. Having a copy of the complete source to compare to your destination (even if it is just checksums of the data,) is important.
The techniques that Meagan is asking about this month are more ETL related, where you check a stream of data and sync them as changes are made. Typically, you don’t want to compare all the rows in a set, but just the ones that have changed. At some frequency, give me the changes to one set of data and keep it up to date with another.
Methods to see what rows might have changed
I have used a few methods, but never tried Change Data Capture (CDC) or Change Tracking (CT) in SQL Server. I always thought they would be great, but I had a coworker who was certain that both had issues that made them not ideal for our needs. So I never really tried/used them in any capacity, and still haven’t to this day, even after this person has been gone for years.
The method I have used (and am planning on using in my new Fabric implementation is to use one or more columns in the source data to check for changes. All the data in our source tools typically has a time with the row was last modified and or a column based on the rowversion or timestamp datatype that will allow you to look for all rows that have changed since a previous time. The row last modified time method is simple and easy to understand, it it is a nightmare at times because every tool that modifies data must keep it up.
I have had a SQL Feedback item for 9 years (in the most recent iteration) asking for a mechanism to make 100% sure these values are updated. Maybe some day!
I didn’t use rowversion (which does automatically update on every change) that frequently until recent years when I was told that it is safe to use in a sequential manner. It looks like a binary value, but you will also see it translated to a `bigint by some tools (including Microsoft tools like mirroring to Fabric.) The problem with it is that it looks mysterious to users and it can’t be manipulated when anyone wants to do something tricky like slow down/repeat a copy by changing the row modified time.
So in our code we might save a variable like @row_last_modified_time and then set it to some time earlier than the previous day’s ETL (minutes, hours, just enough to feel safe we handle the overlap). Usually we just set it to the start of ETL, since we know all changes before that will have been registered.
Finally, on rare occasions I will just check all the data in all the columns for all the rows to see if rows have changed. This is by for the safest method, but it is costly in something like a relational database. In a Fabric Data Warehouse, this is the method I am looking at because it is much safer and the Lakehouse paradigm is built on scanning lots of data (where the relational engine definitely was not!)
Transforming data
Briefly it seems prudent to mention that sometimes we need to merge 2+ tables into a single table in a query. So you might have a Table1 dimension that is actually made up of Table1, Table2, and Table3. For this, the process has generally been the same, but I typically get the most recent modification time for all the tables rather than all 3.
So using the GREATEST function, the code would look like:
SELECT GREATEST(Table1.row_last_modified_time, Table2.row_last_modified_time, Table3.row_last_modified_time) AS Table1.row_last_modified_timeFROM Table1 JOIN Table2 ON ... JOIN Table3 ON ...
This makes sure we get the latest changes to rows.
Methods to detect actual change
Just because a row shows up a possible change doesn’t mean it has changed. This is especially true when building a dimension in a data warehouse. There could be 100 columns in the source tables, but you only use 2 or 3 in the So the next thing to check is whether it needs to be updated or not.
Whether or not to do this depends on a lot of things, but mostly how much it costs to make the change. For example, a row may have some data in it that it don’t really care about. So you might have a WHERE clause that states to look at rows that include rows with a changed time of the last 10 hours, but also have a set of predicates that are OR’d together that say: column1 changed or column2 changed or column3 changed, but then we don’t care if column4 – column1000 have changed because they don’t have actual user data in them.
Prior to SQL Server 2022 and the addition of IS DISTINCT FROM, this was basically a nightmare when you have columns that allowed NULL values. You would end up with:
SELECT …FROM Table1 JOIN Table2 ON … WHERE (Table1.column2 <> Table2.column2 OR (Table1.column2 is NULL AND Table2.column2 IS NOT NULL) OR (Table1.column2 is NULL AND Table2.column2 IS NOT NULL)) --And so on for every nullable column
Now you can just use:
…WHERE Table1.column2 IS DISTINCT FROM Table2.column2
I am currently in the process of generating stored procedures to do this in Microsoft Fabric, and it is so nice that they support this syntax.
The pro to this method is that you don’t update too many rows (something that can be quite a pain in a SQL Server database with a column store index). But the con is off you can’t tell when the row was last synchronized with the row that is different.
So checking the differences in your data is no longer a simple query of keys and timestamp/row last modified time, you need to look at all of the data.
The pros and cons of it all
As Meagan also asked, “have you ever been burned by this process?”, I have to admit, “semi-frequently.” And part of it could be the age old problem of having enough time to test the software you are pumping out. If you are moving millions/billions of rows from point A to B, checking that they are all still exactly the same (especially partial rows), is daunting.
But it is necessary. Since you won’t be using all the lovely ACID properties of relational databases in these process, a failure in the wrong place, coupled with a mistake with the variables that control the process, and your data could be out of sync.
In my tests with Fabric, I am leaning towards just checking all data every day, since I don’t have billions of rows, and it is far safer. You have to scan the whole table anyhow, and I only need to check a few columns for changes/new rows.
Summary
I like this month’s post a bit better than reflecting on mistakes, career chances, and whatnot (though those are good things to do regularly.) Working with synchronizing sets of data is something I have been involved with for many years, and likely will be for many years to come.
Luckily, like the feature that spawned this post from Meagan (SQL Server Mirroring to Fabric), these processes will hopefully get easier. But when you need to transform multiple tables into one, packaged tools may or may not be sufficient.




Leave a Reply