The evil (and value) of DISTINCT

Part of the series “SQL Techniques You Should Know”

If there is one SQL keyword that causes more fear than any other, it’s DISTINCT. When I see it in a query, I immediately start to worry about just how much work I am in for to ensure the correctness of that query. I start scanning for comments to describe why it is there, and if none are found, I know the query is probably going to be wrong.

I have seen DISTINCT used to hide bad joins, missing grouping, and even missing WHERE clauses. I have seen developers use it as a “fix-all” for data problems.

In this blog, I will look at the proper use and distinctly dangerous uses of DISTINCT and also show how you might test your query that uses DISTINCT to see what it is actually covering up.

And in every example, I promise a comment explaining why we need DISTINCT.

The good

First off, there are valid uses of DISTINCT. One that everyone can use is to find permutations of values that are unique in a table that don’t really merit their own table. For instance, in the AndventureWorks database, the PRODUCT table has a Color column that contains various colors of the products that the fine folks at the AdventureWorks shop sell..

For example, if I want to get a list of all the colors that are actually used in the table, I could write the following query:

USE AdventureWorks2025
GO
SELECT DISTINCT Color --DISTINCT Color values use in the table
FROM Production.Product;

This construct is used often and is perfectly legitimate. It’s one of the most basic things you can do in SQL and is a great way to get a distinct list of values from a column or set of columns. The following query answers the question “What color products were ordered on certain days?”:

SELECT DISTINCT OrderDate, Color
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
               Sales.SalesOrderDetail.SalesOrderID
        JOIN Production.Product
            ON Sales.SalesOrderDetail.ProductID =
               Production.Product.ProductID
WHERE Color IS NOT NULL;

Sure, the query is sort of nonsensical, but it’s just here to illustrate what you might do with DISTINCT. This pattern of using DISTINCT to remove duplicate rows from a result set is useful in certain scenarios.

Why not GROUP BY?

Sometimes you see a query like the following, which does technically return the same data.

SELECT OrderDate, Color
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                    Sales.SalesOrderDetail.SalesOrderID
        JOIN Production.Product
            ON Sales.SalesOrderDetail.ProductID =
                Production.Product.ProductID
WHERE Color IS NOT NULL
GROUP BY OrderDate, Color;

Both queries return the same rows, and they both have the same query plan:

But GROUP BY should always be used for grouping data, and DISTINCT when you need to see unique rows. The biggest concern when using DISTINCT is what are you distincting away. When you are trying to diagnose duplicated data, you will typically start with a GROUP BY with an aggregate, usually to show how many duplicates:

SELECT OrderDate, Color, COUNT(*) --Usually at least a count
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                SalesOrderDetail.SalesOrderID
        JOIN Production.Product
            ON SalesOrderDetail.ProductID =
                Production.Product.ProductID
WHERE Color IS NOT NULL
GROUP BY OrderDate, Color
HAVING COUNT*) > 1; --and now cases of 1 or more

Use DISTINCT when you know there are duplicates that you need to output. It (plus those comments you are absolutely adding!) signals that you know there are duplicates, they are okay, and you need the set of values. Then use GROUP BY to aggregate, count, etc those duplicates.

This is why we shudder at the sight of `DISTINCT`

In the following query, see if you can easily see the issue?


SELECT OrderDate, Color
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                SalesOrderHeader.SalesOrderID
        JOIN Production.Product
            ON Product.ProductID =
                SalesOrderDetail.ProductID;

It isn’t a super complex query, so you probably see the issue, but I won’t give it away just yet. Grabbing a few rows from this output, you should be able to see pretty quickly that there is some data that is being duplicated.

OrderDate               Color
----------------------- ---------------
2022-05-30 00:00:00.000 Black
2022-05-30 00:00:00.000 Black
2022-05-30 00:00:00.000 Black
2022-05-30 00:00:00.000 Black
2022-05-30 00:00:00.000 Black

So, like any good employee, let’s assume that these are correct results, so let’s add DISTINCT so we can move on.

SELECT DISTINCT OrderDate, Color
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                SalesOrderHeader.SalesOrderID
        JOIN Production.Product
            ON Product.ProductID =
                    SalesOrderDetail.ProductID;

Did I mention that the version of the query without DISTINCT returns 3,817,239,405 rows. (It overflowed an int counting the number of rows so I had to use COUNT_BIG to count the results. The version of the query with DISTINCT? A “feels good” 10116 rows.

OrderDate               Color
----------------------- ---------------
2023-02-09 00:00:00.000 Silver
2023-04-21 00:00:00.000 Silver
2024-01-07 00:00:00.000 Silver
2024-12-04 00:00:00.000 Silver
2023-03-04 00:00:00.000 Silver
2024-01-30 00:00:00.000 Silver
2024-12-27 00:00:00.000 Silver

I got a time value and a color value that were output. Sweet. But…is it right? Not even in the slightest (there are not 3.8 billion rows in this sales data! Testing by “feels” (an upcoming blog for sure) is always the first thing we do when writing queries that return a lot of data. You can easily feel that 3,817,239,405 rows of output is wrong. But the second output feels reasonable.

But that is why we don’t stop testing by feels. If you feel like you need DISTINCT, using GROUP BY with some aggregates can be a good way to verify your results

SELECT OrderDate, Color, COUNT(*)
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                SalesOrderHeader.SalesOrderID
        JOIN Production.Product
            ON Product.ProductID =
                SalesOrderDetail.ProductID
GROUP BY OrderDate, Color
ORDER BY COUNT(*) DESC;

The output from this query outputs the same number of rows as the DISTINCT.. but look at those COUNT(*) values:

OrderDate               Color
----------------------- --------------- -----------
2025-03-30 00:00:00.000 NULL            9253295
2025-01-28 00:00:00.000 NULL            8365525
2024-12-30 00:00:00.000 NULL            8331380
2024-10-29 00:00:00.000 NULL            8263090
2025-03-30 00:00:00.000 Black           8156829
2024-06-29 00:00:00.000 NULL            7921640

Hmm, 9.25 million rows is a few more rows than we have in our Sales.SalesOrderHeader table to start with:

SELECT COUNT(*)
FROM   Sales.SalesOrderDetail;

This returns: 121317

So yeah, always sanity check any query where you see a lot of duplicated rows, especially ones that you don’t really expect. In this case, you need to look at all of the code in the query:

SELECT OrderDate, Color
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                SalesOrderHeader.SalesOrderID
        JOIN Production.Product
            ON Product.ProductID =
                SalesOrderDetail.ProductID;

Then notice this join:

JOIN Sales.SalesOrderDetail
ON SalesOrderHeader.SalesOrderID = SalesOrderHeader.SalesOrderID

This is true in every comparison that is not null (all of them in this case). So you get a Cartesian product between SalesOrderHeader and SalesOrderDetail, meaning for each of the 121317 rows of the SalesOrderDetail table (the number of rows you are expecting), you multiply the number of rows by the rows in SalesOrderHeader:

SELECT COUNT(*)
FROM   Sales.SalesOrderHeader;

That returns 31465 and 121317 * 31465 gets us back to the 3,817,239,405 number from earlier.

So yeah, not good. Fix the JOIN criteria:

SELECT OrderDate, Color
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                SalesOrderDetail.SalesOrderID --fixed!
        JOIN Production.Product
            ON Product.ProductID =
                SalesOrderDetail.ProductID

And now this returns 121,317 rows, and you can see that that matches the 121,317 rows in SalesOrderDetail.

And then you can DISTINCT that:

SELECT DISTINCT OrderDate, Color --Just need permutations of
--OrderDate and Color
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                SalesOrderDetail.SalesOrderID --fixed!
        JOIN Production.Product
            ON Product.ProductID =
                SalesOrderDetail.ProductID;

Which returns 5031 rows, and you can test that with:

SELECT OrderDate, Color, COUNT(*)
FROM   Sales.SalesOrderHeader
        JOIN Sales.SalesOrderDetail
            ON SalesOrderHeader.SalesOrderID =
                SalesOrderDetail.SalesOrderID --fixed!
        JOIN Production.Product
            ON Product.ProductID =
                    SalesOrderDetail.ProductID
GROUP BY GROUPING SETS( (OrderDate, Color),());

Which will return 5032 rows, with the extra row being:

OrderDate               Color
----------------------- --------------- -----------
NULL                    NULL            121317

Summary

Every SQL programmer should know how and why to use DISTINCT. It is a very useful tool when you need it. It should never be used as a tool to eliminate duplicate data you didn’t expect to find in query results. It is meant strictly for when you know what you are expecting. You know there is some combination of column values that you want to get to see the output, and you know why they are repeated.

It is generally not a tool to find or remove duplicates in a set. Usually, duplication is not defined on an entire row. So if you have Customer 100, Name: Bob, ShoeSize 12; you may have Customer 100, Name: Bob, ShoeSize 12.5. DISTINCT won’t help you here if you need all three columns. Finding (and dealing with) duplicates is a topic that I will cover in a future post.

Drsql.link