There are many occasions when we want to see the differences between two sets of data. Sometimes a whole table, a subset of a table, or even the results from a couple of queries, like in a unit test.

Maybe you want to see that two sets are exactly the same, for example domain table in DEV, PROD, or maybe even from source control. You might you have a orders table and an orders_history table and you want to see the overlap/changes over a given period of time, like for example, to clean out any useless history.

No matter what the reason, there is a query pattern that will work for you. In this blog I will demonstrate several of these techniques and why you might want to use them in different places.

One important thing. In a lot of places, this might be called “Comparing two tables”, but I want to make it clear that this is not just the physical entity you think of as a table. It can be the results of two queries, two joined tables with the same or different where clauses. Sometimes it might even be simpler for you to compare the data represented by two CTEs. All of these can be called a table in regular discussions, but clarity when sharing concepts can help wider groups of people understand the topics.

I also want to differentiate this from comparing structures in a database. This is something I have wanted to do for a long time and it is something I still want to do…just not in this blog.

Note: The code is SQL Server native code, but it should run with little modification on most SQL based platforms

Types of comparisons

In order to discuss the best method, it is useful to think of the different reasons we might want to compare two sets. The final section will have the actual best method, because it is most thorough. But of course as it is the most thorough, it is the most work to create.

I will cover:

  • Determining if two sets are exactly alike. This is the fastest, easiest method and would be idea for things like unit testing domain data.
  • Finding rows that are different between sets. Sometimes you just want to find the different rows, like for a quick overview of the data.
  • Deep comparison of rows. Tell me everything about two sets. Matching rows, mismatching rows, the columns that don’t match, and rows that exist in one set or another.
  • Comparing rows where rows may be non-distinguishable from each other.

To do this, I will use this small set of data:

USE tempdb
GO
DROP TABLE IF EXISTS A, A1;
GO
CREATE TABLE A
(
  Aid int CONSTRAINT PKA PRIMARY KEY,
  Value varchar(10)
);
CREATE TABLE A1 --named to indicate it is a later verson
(               --of the same table
  Aid int CONSTRAINT PKA1 PRIMARY KEY,
  Value varchar(10)
);
INSERT INTO A (Aid, Value)
VALUES(1,'One'), (2,'Two'), (3,'Three'),(4,'4');
INSERT INTO A1 (Aid, Value)
VALUES(1,'One'), (2,'Two'), (4,'Four'),(5,'Five');

SELECT *
FROM   A;

SELECT *
FROM   A1;

This is the output:

Aid         Value
----------- ----------
1           One
2           Two
3           Three
4           4

Aid         Value
----------- ----------
1           One
2           Two
4           Four
5           Five

Analyzing the data visually, you can see that Aid:4 is in both sets, but the Value is spelled differently. Table A has an Aid:3, and A1 has Aid:5. Using a WHERE clause I can just show Aid:1 and 2 for sets that need to be equivalent. Every row has the same value for one, with will come up a bit later.

Note: Both tables have primary key constraints. If you are comparing non-distinct sets of data, it can be a lot more complicated and in addition to the steps outlined here, you will need to do some additional counts of data to make sure they match. This will be touched on in the section on non-distinguishable sets.

Determining if two sets are exactly alike

Let’s say we want to see if the contents of table A and A1 are the same. We can check this pretty easily with a query like this, if you know the number of rows in your tables. If all three results are the same, you are golden. To do this, you can use the set operator INTERSECT. It will return the rows that are the same and return that set, so if all the rows are the same, we should get the same count:

--note, while * is not good in most code (especially that
--you will use in production), here I am using it so it
--clearly represents ALL columns. A column list is used
--to denote just using a subset of columns.

SELECT COUNT(*) as A_Count from A;
SELECT COUNT(*) as A1_Count from A1;

SELECT COUNT(*) as INTERSECT_Count
FROM
(
SELECT *
FROM   A

INTERSECT

SELECT *
FROM   A1
) as CompareSets

This returns:

A_Count
-----------
4

A1_Count
-----------
4

INTERSECT_Count
---------------
2

Now we can tell that they have the same number or rows, but they are not fully alike. Note too that in your actual tables, there may be rows where it is fine for them to be different, but still consider them the same. For example, if your table is like this:

CREATE TABLE B
(
  Bid int CONSTRAINT PKA PRIMARY KEY,
  Value varchar(10),
  RowCreatedTime datetime2(7)
);

If there exists a B1 that is a replicated copy, the RowCreatedTime may be different. I typically add a column like this that can never change, but is also set when the row is created on the database, not when whatever B represents. If I need to know when whatever the row in B represents, that might get another column. The key here is that in almost all cases, B is not created when the row that represents it (it may also be recreated later, or in another copy). So Bid and Value match, whenever they were created on their unique server.

In other cases, you may also only want to see a subset of the data. Like Customers A and B should be the same at a given point in time, no matter what Customers C and D look like.

In any case, we can reduce the column and row counts we can again use the set operator INTERSECT. If the two pared down views of the table are equivalent, we get the same number of rows in each table and the intersected rows:

SELECT COUNT(*) as A_Count FROM A WHERE Aid IN (1,2,4);
SELECT COUNT(*) as A1_Count FROM A1 WHERE Aid IN (1,2,4);

SELECT COUNT(*) AS INTERSECT_Count
FROM
(
SELECT Aid
FROM   A
WHERE Aid IN (1,2,4)

INTERSECT

SELECT Aid
FROM   A1
WHERE Aid IN (1,2,4)
) as CompareSets;

The outputs here are all 3. The first two queries return the count of rows based on the WHERE clause, and then the INTERSECT only looks at those rows and the Aid column, which matches for these 3 rows.

Note, UNION will work fine for this endeavor too, but INTERSECT will make fewer rows to count for sets with a lot of non-matching rows. Both INTERSECT and UNION will dedup sets, hence the concern with the PRIMARY KEY.

Finding rows that are different between sets

Okay, so now we are sure that A and A1 have differences. Let’s look at some of the quick ways to find some of the differences.

SET Operators

To find rows that are in one set and not in another, there is another SET operator that will will use along with INTERSECT and it is: EXCEPT (or sometimes MINUS in other SQL dialects like Oracle. They are said to be equivalent, but much like I am doing here, test, test, test before you use any code to make decisions).

SELECT *
FROM   A

EXCEPT

SELECT *
FROM A1;

This returns:

Aid         Value
----------- ----------
3           Three
4           4

As discussed previously, 3 doesn’t exist and 4 is different in the spelling of the column value. You can see the differences from A1 to A using:

SELECT *
FROM   A1

EXCEPT

SELECT *
FROM A;

Which returns:

Aid         Value
----------- ----------
4           Four
5           Five

And you can find the similar rows (as noted previously) using:

SELECT *
FROM   A1

INTERSECT

SELECT *
FROM A;

Which returns:

Aid         Value
----------- ----------
1           One
2           Two

These are fantastic operators when you need them, especially when you want to quickly find differences. However, I find most of the time, I want to know more about it than just is a set/subset of columns/rows the same or different. Like for the two EXCEPT queries, I want to know if the row is in the set or different. As the techniques continue, I will get closer and closer to being able to get more and more specifics on the comparisons.

More control with SubQueries

When you need a bit more control over the output, but only need the data from one table, it is common to use a correlated subquery. You can use joins, but when you only need the data from one table, using a subquery is best.

Note: I am only going to demonstrate using EXISTS, but the IN operator can be useful as well, when you don’t have a correlation between rows, and you are working with one column as SQL Server does not yet allow row operators like:
IN (c1, c2) = SELECT c1, c2
EXISTS is much safer with NULL outputs, though it still can be an issue in the WHERE clause of the subquery.

If I want the rows in A that don’t exist in A1 based solely on key (and you want the rest of the data in A too), you can use a query such as:

SELECT *
FROM   A
WHERE  NOT EXISTS (SELECT *
                   FROM   A1
                   WHERE  A.Aid = A1.Aid);

This returns:

Aid         Value
----------- ----------
3           Three

Because like we noted earlier, 3 is the row I left off the A1 “copy”. We can tailor the conditions to get rows that are different for one or more columns too:

SELECT *
FROM A
WHERE EXISTS (SELECT *
              FROM A1
              WHERE A.Aid = A1.Aid
              AND A.Value IS DISTINCT FROM A1.Value);

This returns the row that in A is spelled differently than in A1:

Aid         Value
----------- ----------
4           4

I love IS DISTINCT FROM and one of my earliest blogs on this site was about it. The Value column is defined as allowing NULL values, so this makes sure if one is null and not the other, we get the output we wanted).

That’s the next section.

A lot of times, people write this code using an OUTER JOIN. Like if you want to find the rows in A1 that don’t have a match in A:

SELECT *
FROM   A
        LEFT OUTER JOIN A1
          ON A.Aid = A1.Aid
WHERE  A1.Aid IS NULL;

This does give you the answer, but if you look at the output of the * version of the output:

Aid         Value      Aid         Value
----------- ---------- ----------- ----------
3           Three      NULL         NULL

You get the extra columns for A1 that you have already said are null. I am not going to dig into plans too deeply, but this pattern can give you a less optimum query plan. (Not always, in the SQL Server 2025 Developer Edition server I am using, they are the same, since the optimizer can tell that the A1.Aid values must be non-existent in the output. Just like testing for correctness, testing for performance is not a bad thing.)

Attributing the differences in sets (Using Joins)

In this section, I want to go directly to my favorite solution. It is harder to code, so the full versions is usually only done when I have a repeatable need to compare sets. But if you need to look at this data more than a quick overview, this is the easiest way to do it.

The idea is to use some form of OUTER JOIN so we get the rows from one or both sets. When we have the rows that do or do not match by primary key value, we can then to some things with the data to compare. You can use a LEFT or RIGHT OUTER JOIN for this, but I am going to show the one that gives us a full comparison of our two tables. the FULL OUTER JOIN.

To see rows that are different, we can just do this:

SELECT *
FROM   A
        FULL OUTER JOIN A1
          ON A.Aid = A1.Aid;

The raw output is kind of hard to read for tables that are much larger than what I have created here but basically all rows from A and A1 are returned, side by side, with matching rows returned on the same row:

Aid  Value      Aid   Value
---- ---------- ----- ----------
1    One        1     One
2    Two        2     Two
3    Three      NULL  NULL
4    4          4     Four
NULL NULL       5     Five

Just visually (for this small set), you can see rows will all NULL values in a row (and especially where the primary key values are), are the ones that are not in both sets. The other rows you an see some the difference in that A's Aid:4's Value value is different than A1’s Aid:4’s Value value. (That phrase was a bit silly, but I thought it was fun anyhow :))

But the real value is that we can create something like this. Where I do the join, and then since I have the rows from the two tables together I can:

  • Just show the key once (since either A or A1 will have the value)
  • Determine missing rows based on the existence of a the primary key
  • Compare the columns that are different in whatever way makes sense

In this query I have created that setup, so you can see all the things we have discussed so far, all in the one query (along with a reminder of the complete tables):

SELECT *
FROM   A;

SELECT *
FROM   A1;

SELECT --key from join
        COALESCE(A1.Aid,A.Aid) AS Aid,
       --row existence
        CASE WHEN A.Aid IS NULL THEN 'A1'
             WHEN A1.Aid IS NULL THEN 'A'
             ELSE 'Both' END AS RowExistsIn,
        --column comparison
        CASE WHEN A1.Value is distinct from A.Value
             THEN 'Different'
             ELSE 'Same'
        END Value_Comparison,
        --values for you to look at
        A.Value AS A_Value,
        A1.Value AS A1_Value
FROM   A
        FULL OUTER JOIN A1
            ON A.Aid = A1.Aid;

The output is this:

Aid         Value
----------- ----------
1           One
2           Two
3           Three
4           4

Aid         Value
----------- ----------
1           One
2           Two
4           Four
5           Five

Aid         RowExistsIn Value_Comparison A_Value    A1_Value
----------- ----------- ---------------- ---------- ----------
1           Both        Same             One        One
2           Both        Same             Two        Two
3           A           Different        Three      NULL
4           Both        Different        4          Four
5           A1          Different        NULL       Five

Using this set of data (in a CTE/VIEW), I can see anything about the differences I want. And then I can filter out the differences I want to see or not see and especially filter out the rows that are the same.

If we need to do this regularly, this code can easily be made into a procedure or view object.

Finally, you can change the join from FULL OUTER to LEFT, RIGHT or even INNER. INNER would just give you the matching rows by key, then show you the differences, and RIGHT or LEFT would ignore rows that are missing from one side of the comparison.

A quick example of non-distinguishable rows.

In every example so far, we have had a unique set thanks to a primary key. This is not something I would hope a normal database would have a large need for, but situations happen. So , what if you want to compare these two sets of data:

DROP TABLE IF EXISTS C, C1;

CREATE TABLE C
(
   Value varchar(10)
);
CREATE TABLE C1
(
   Value varchar(10)
);

INSERT INTO C(Value)
VALUES ('1'),('1'),('1'),('2'),('2'),('2');
INSERT INTO C1(Value)
VALUES ('2'),('2'),('1'),('1'),('1'),('3');
GO

SELECT *
FROM   C;

SELECT *
FROM   C1;

From the output of the query, there are a few things you can say:

C:

Value
----------
1
1
1
2
2
2

C1:

Value
----------
2
2
1
1
1
3

You could do several things, though none of them super easy. Order doesn’t matter in a table (certainly not if you don’t include a column to order on uniquely), so you can reorder the rows for comparison.

You could transform the rows and add a surrogate key… but that would need to be by group because the first difference will cause the comparison to fail. The easiest way I have found is to compare groups of like rows.

In this case (and if there were 2 or more columns this would be exactly the same, just grouping on multiple columns instead of one). Group the like data and add a count of the instances of the values.

WITH C_Flat AS (
  SELECT Value, COUNT(*) AS GroupCount
  FROM   C
  GROUP BY Value),

 C1_Flat AS (
   SELECT Value, COUNT(*) AS GroupCount
   FROM   C1
   GROUP BY Value)

SELECT COALESCE(C_Flat.Value,C1_Flat.Value) AS Value,
       CASE WHEN C_Flat.GroupCount = C1_Flat.GroupCount
            THEN 'Same' ELSE 'Different' END AS Status,
       C_Flat.GroupCount AS C_GroupCount,
       C1_Flat.GroupCount AS C1_GroupCount
FROM  C_Flat
        FULL OUTER JOIN C1_Flat
           ON C_Flat.Value = C1_Flat.Value;

The output is this:

Value       Status    C_GroupCount C1_GroupCount
----------- --------- ------------ -------------
1           Same      3            3
2           Different 3            2
3           Different NULL         1

Obviously (as I will reiterate here), I am not advocating for such a table or set of data, but it happens. It all happens.

Summary

In this blog, I went over quite a few ways you can compare two tables using SQL set operators or joins. This is by no means every way you can assemble code together that does this, but in almost every case, you will find some of these techniques involved.

5 responses to “SQL Strategies for Comparing Sets of Data”

  1. […] Louis Davidson figured out which of these was not like the others, which of these just didn’t …: […]

    Like

  2. Scott Peters Avatar

    Thank you for the article! I’ve spent a lot of time comparing data over the decades, so I feel everyone’s pain trying to track down differences in two sets of data. I created the following if automation script to dynamically perform a FULL OUTER JOIN between two tables and report the differences.

    Microsoft-SQL-Server-Scripts/Tools/Table Validation at main · smpetersgithub/Microsoft-SQL-Server-Scripts

    And for prosperity, if anyone is reading this in the future and the above ever becomes a dead link, I probably have moved it to my main GitHub here.

    smpetersgithub/AdvancedSQLPuzzles: Welcome to my GitHub repository. I hope you enjoy solving these puzzles as much as I have enjoyed creating them.

    Like

  3. Philip Avatar
    Philip

    Typos:
    “In this section, I want to go directly to my favorite solution. It is harder to code, so the full verions is usally 

    Like

  4. AI, Crazy Quoted Identifiers, and comparing column names between tables – Drsql.link Avatar

    […] the output from such a tool isn’t always super simple. So having recently written about comparing sets of data, I decided to just write a quick query, which I will show later in the […]

    Like

Leave a comment

I’m Louis

I have been at this database thing for a long long time, with no plans to stop.

This is my blog site, companion to drsql.org

Recents