The myth and the truth of filtering early

An interesting phenomena has been occurring in LinkedIn that I really find interesting. It is the very repeated half truth. It is a concept that is true enough in some ways, but also wrong enough to cause confusion in people who are just learning. Transferring my primary social media use to LinkedIn has been a mixed bag, as you get a lot of noise like that (but also I feel more connected to the community there than anywhere else these days.)

The problem is that the same stuff that goes on with other social media platforms happens here too (you know, because it works). And that is that engagement matters more than correctness or value. Now while this can be fine for entertainment posts where opinions don’t mean much, on a place where technical details are presented as useful for learning… this is a problem. While I don’t want my blog to be all myth buster type posts, some of these half-truths need more than a few hundred characters to discuss over and over.

Which brings me to the point. There is a myth that goes around that you need to place filters in your SQL statements as early in the statement as possible. Most of this is due to the wild misunderstanding of how a query is executed (versus how your query is processed, which I covered last week.) The actual issue here is that the concept of filtering early is actually true, but certainly not in the way it has been taught.

Filter early is true

Filtering early in processing data of any kind is very important. If you remember books like an encyclopedia or even a dictionary, you don’t want to scan the entire encyclopedia to find details on a topic like a zoo, you go to the book with Z in it. Same with a dictionary, you want to filter by first letter. And like anything else, you need to do what you can to make the complex task easier. If you had to write a paper on all of the sports named ‘Football’, you wouldn’t write posts on all sports, and then filter out all but the football ones. (I may have the FIFA World Cup on the TV at my desk as I write this.)

For data processing, this is 100% true and absolutely necessary. It is a fundamental principle in database design and query optimization. You want to compare the least amount of data between different sets as possible, sort the smallest amount of data, and physically look through the least amount of data possible.

The biggest problem is that the neophyte writer often somehow takes this basic concept of filtering early and applies it incorrectly to SQL queries. They think that putting filters in the WHERE clause is slower than putting them in a CTE. Or that a filter in the ON clause somehow performs less well than a filter in the WHERE clause.

Assuming that people actually write these posts themselves and don’t just ask an AI to generate them (I am not against that, as Redgate’s SQL Prompt is hard at work trying to help me write this (as I write most of my posts in SQL Server Management Studio as comments first! I ignore most of the comments, but sometimes it actually tosses in some text that can be useful, yes, even when writing a blog!). But giving the benefit of the doubt, the humans writing these very similar and incorrect blogs need to understand one thing, most of the time, filter placement means nothing.

Except when it does.

A somewhat simple, but not too simple, example

The reason I typically don’t think writers understand what they are talking about is that the examples given are really simplistic. Something like the following (examples will use the AndventureWorks database, the 2025 version.

The superposition is made that this query will somehow be slower than putting the filter in the CTE:

			
SELECT *
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '2013-01-01';

That has this actual query plan:

And this STATISTICS IO output (all other values are 0. I will only include up to logical reads for all examples because none of my testing will care about cached versus non-cached data):

Table 'SalesOrderHeader'. Scan count 1, logical reads 686

Now I move the filter to earlier in the query.

			
WITH SalesOrderHeaderCTE as
(
SELECT *
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '2013-01-01'
)
SELECT *
FROM SalesOrderHeaderCTE;

		

This has the the exact same plan, and the exact same IO. Of course this is a really small query doing only one thing, but I used it for one specific reason, “it is easy enough to show”. After I have finished with the basics, I will show a few other more messy examples late in the blog. But if the CTE changes the plan of this query, we will know something interesting is happening.

The fact is, in all but the rarest cases, there will be no difference whatsoever in a plan between two queries like this, and all the CTE will do is to make your query harder to write, but what will be executed is the same.

I will note, in SQL Server, the CTE is never materialized into storage. It is just the same as a typical non-materialized/indexed view. It gets pulled into the main query and optimized as if it were written inline.

So you might even try something like this that filters the data into a temporary table:

			
DROP TABLE IF EXISTS #SalesOrderHeaderTT
SELECT *
INTO #SalesOrderHeaderTT
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '2013-01-01'
SELECT *
FROM #SalesOrderHeaderTT;

		

This gives us these two plans (not the parallelism that is employed, which could be a good or bad thing depending on the size of the query/dataset.)

And this IO profile:

Table 'SalesOrderHeader'. Scan count 13, logical reads 719

Table '#SalesOrderHeaderTT__00000000005A'. Scan count 1, logical reads 790

Look at that plan and you will see it looks quite a bit more complex because it had to create temporary storage, and save off all the data before it can be read back out. This seems like it would be a much worse practice, and it often is. I admittedly used * in my query so I could reduce that down in most cases, but I might actually need all of the data in the table for this query. For most cases, you would not want to do this process.

In fact, what you are doing is moving the filtering up in the query, then you need to process it again instead of just the one time. In this example, it is so obviously silly, but… the primary difference between this simple case and the more complex cases is that it is obvious. I just rewrote the query into a CTE, and then materialized it so it had to be stored and scanned instead of just scanned.

Are temp tables ever a good idea like this?

Definitely. This is a tool in your toolbox. Like any other tool, it needs to be applied correctly. The fact is, in an OLTP type system, it is rare you need to use temp tables in this way. But in a OLAP type system, some of the queries and aggregations you have to do are very complex.

This is where the filter early concept applies to your code. When you take control of the query execution order you need to make sure that you are getting the value from your queries.

A more typical pattern when you end up using a temp table might be to do something like this:

			
SET STATISTICS IO ON;
GO
DROP TABLE IF EXISTS #SalesOrderHeaderTT
SELECT SalesOrderId
INTO #SalesOrderHeaderTT
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '2013-01-01'
SELECT *
FROM Sales.SalesOrderHeader
       JOIN #SalesOrderHeaderTT
		 ON #SalesOrderHeaderTT.SalesOrderID 
		      = SalesOrderHeader.SalesOrderID;

		

This won’t change much, but you can apply an index like is suggested in the plan:

			
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>] 
      ON [Sales].[SalesOrderHeader] ([OrderDate])

And now if you look at the plan, loading the temp table is a lot less IO:

Table 'SalesOrderHeader'. Scan count 1, logical reads 73

And the second query is a lot less IO from the temp table:

Table 'Workfile'. Scan count 0, logical reads 0, 
Table 'Worktable'. Scan count 0, logical reads 0, 
Table 'SalesOrderHeader'. Scan count 1, logical reads 686, 
Table '#SalesOrderHeaderTT__000000000064'. Scan count 1, logical reads 51

Clearly this is not an improvement for the query over the original ones, but, if you have a table being filtered the same way over and over and an index is not an option, you can use a temp table with just keys, and even add an index to it and help your analytic queries to be faster.

After finishing up the practical advice, I will show a few more queries that are a lot more complex to show that this isn’t just a single table join, but suffice it to say that my advice is to write queries in the simplest form you can and the optmizer will likely do a better job that you will.

I did say likely. As the next section will note, the optmizer does its best, which isn’t always enough.

Filter early when you need to take control

The optimizer will almost always arrange the execution order of your query in the best way. However, sometimes it is actually needed because sometimes… the optimizer doesn’t optimize your query.

It wasn’t long ago that I had to tune a query that was mathematically right, but the plan chosen was devastatingly poor. Thanks to the live query statistics feature, I was able to see that even though this query had executed for 20 hours, it was wasn’t even 1% finished. At this point you have to do something differently, and that something differently is usually one of a few things.

Add an index that satisfies some hidden need
Simplify the query to be simpler so the optimizer will have less trouble optimizing it (often harder than it sounds!)
Add hints to force it into a better plan (indexes, join algorithms, etc)
Start breaking down the query in smaller pieces

Sometimes this is using a CTE, though as I have mentioned and have and will demonstrate, this is often meaningless, especially in a “simple” case. It really only works when your query is so complex that the optimizer would need to do unreasonable amounts of work to figure out the best plan and that usually means you are doing operations that need to be saved to disk. When intermediate results are needed,

Even though there are often reasons to say “the optimizer is being dumb, one thing that is important to understand. And as a writer/teacher I hate to say this, but it is really hard to demonstrate in a blog. Especially since a lot of times the issues happen when you have a complex view referencing a complex view … if you have tuned any queries, you get it, that gets messy. And in many of those cases you could replicate the issue using a CTE, since they are really just inline views in SQL Server.

Summary

This is not the end of the post, but the examples that follow are a bit messy. Hence I wanted to basically stop here and recap.

Most of the time, where you filter data is not important to performance. Unless it is.

Simple enough? No? This is one of the more interesting parts of writing SQL. Most of the time it works like magic. And then sometimes it feels like small battery powered hamsters are doing the work… and their batteries are very dead.

The key point though is writing your query in a single statement with no CTE/Derived Table is usually best for performance and readability. So if you can express your query like in that form, do that. If you need a CTE to shape your dataset in ways other than just filtering, that is great. CTEs are magnificant tools, not for performance, but for expressing some query that can’t be done in one pass.

A few more examples

I didn’t want to leave this will just one simple example, so I decided this should be done with a couple of examples.

Example 1: WHERE’s to CTE’s

For example 1, I decided to have SQL Prompt AI write me a query with several WHERE clause examples. You can see all the queries I used in this file on github.

Example 1 Code

This is the query I started with:

			
SELECT 
    soh.SalesOrderID,
    soh.OrderDate,
    soh.TotalDue,
    p.FirstName + ' ' + p.LastName AS CustomerName,
    sp.FirstName + ' ' + sp.LastName AS SalesPersonName,
    st.Name AS TerritoryName,
    st.[Group] AS TerritoryGroup,
    p.Suffix 
FROM Sales.SalesOrderHeader soh
INNER JOIN Sales.Customer c 
  ON soh.CustomerID = c.CustomerID
INNER JOIN Person.Person p 
  ON c.PersonID = p.BusinessEntityID
INNER JOIN Sales.SalesPerson sps 
  ON soh.SalesPersonID = sps.BusinessEntityID
INNER JOIN Person.Person sp 
  ON sps.BusinessEntityID = sp.BusinessEntityID
INNER JOIN Sales.SalesTerritory st 
  ON soh.TerritoryID = st.TerritoryID
WHERE 
    soh.OrderDate >= '2013-01-01'
    AND soh.TotalDue > 1000
    AND st.[Group] = 'North America'
    AND soh.Status = 5
    AND p.Suffix IS NOT NULL;

		

This is the plan that it created:

And the IO:

			
`Table 'Person'. Scan count 1, logical reads 4318, 
Table 'Workfile'. Scan count 0, logical reads 0, 
Table 'Worktable'. Scan count 0, logical reads 0
Table 'SalesOrderHeader'. Scan count 1, logical reads 686
Table 'SalesPerson'. Scan count 1, logical reads 2
Table 'SalesTerritory'. Scan count 1, logical reads 2
Table 'Customer'. Scan count 1, logical reads 123

		

Then, as you can see if you download the code, I went deliberately, step by step, moving filter by filter through various permutations until I got here:

			
--full filtering of Sales Order Table.    
WITH SalesOrderHeaderCTE as
(
    SELECT SalesOrderID
           ,OrderDate
           ,CustomerID
           ,SalesPersonID
           ,TerritoryID
           ,TotalDue
    FROM   Sales.SalesOrderHeader
    WHERE  TotalDue > 1000
      AND  OrderDate >= '2013-01-01'
      AND  Status = 5
),
SalesTerritoryCTE AS 
(
    SELECT TerritoryID
            ,Name
            ,[Group]
    FROM Sales.SalesTerritory
    WHERE [Group] = 'North America'
),
PersonCTE AS
(
    SELECT BusinessEntityID
           ,FirstName
           ,LastName
        ,Suffix
    FROM Person.Person
    WHERE Suffix IS NOT NULL
)
SELECT 
    soh.SalesOrderID,
    soh.OrderDate,
    soh.TotalDue,
    p.FirstName + ' ' + p.LastName AS CustomerName,
    sp.FirstName + ' ' + sp.LastName AS SalesPersonName,
    st.Name AS TerritoryName,
    st.[Group] AS TerritoryGroup,
    p.Suffix 
FROM SalesOrderHeaderCTE soh
INNER JOIN Sales.Customer c 
  ON soh.CustomerID = c.CustomerID
INNER JOIN PersonCTE  p 
  ON c.PersonID = p.BusinessEntityID
INNER JOIN Sales.SalesPerson sps 
  ON soh.SalesPersonID = sps.BusinessEntityID
INNER JOIN Person.Person sp 
  ON sps.BusinessEntityID = sp.BusinessEntityID
INNER JOIN SalesTerritoryCTE st 
  ON soh.TerritoryID = st.TerritoryID;

		

The plan and IO are duplicates of each other. In fact, all of the plances asked for the same index initially:

			
CREATE NONCLUSTERED INDEX Index1 
  ON [Sales].[SalesOrderHeader] ([Status],[OrderDate],[TotalDue]) 
  INCLUDE ([CustomerID],[SalesPersonID],[TerritoryID])

I added it. Just trying to prove a point, not dig into the values/concerns of indexing and trusting the missing index suggestions. Then the next execution it suggested:

			
CREATE NONCLUSTERED INDEX Index2 
   ON [Person].[Person] ([Suffix]) 
    INCLUDE ([FirstName],[LastName])

After adding that, it suggested yet another index. As a reminder, don’t just trust these indexes, in almost any case. If this was an actual query tuning session, I would have spent a lot more time looking at the index value, especially for an OLTP database!

I was just doing what it told me for this demo!

			
CREATE NONCLUSTERED INDEX Index3 
ON [Sales].[SalesOrderHeader] ([Status],[TerritoryID],[OrderDate],
                               [TotalDue]) 
   INCLUDE ([CustomerID],[SalesPersonID])

It stopped suggesting indexes, and executing the queries gave me the same plan for all permutations:

Looks better than before, and both indexes on the SalesOrderHeader table I created are used… but the filtering early had no effect.

Example 2: Including an aggregate

For this query, I start with the requirements that I want to see the lifetime sales for any cusotmer that had greater than $290,000 of sales in 2022. (As one does.)

Example 2 Code

So I write the following query (ok, SQL Prompt wrote the query, then I made it do what I wanted exactly as sometimes trying to describe what I want is slower than doing it.):

			
SELECT c.CustomerID,
       s.Name AS CompanyName,
       SUM(soh.TotalDue) AS TotalSales
FROM Sales.SalesOrderHeader soh
    INNER JOIN Sales.Customer c
        ON soh.CustomerID = c.CustomerID
    INNER JOIN Sales.Store AS s
        ON s.BusinessEntityID = c.StoreID
WHERE EXISTS (SELECT 1
               FROM  Sales.SalesOrderHeader sohExists
               WHERE sohExists.OrderDate >= '2022-01-01'
                 AND sohExists.OrderDate < '2023-01-01'
                 AND c.CustomerID = sohExists.CustomerID
               GROUP BY sohExists.CustomerID
                HAVING SUM(sohExists.TotalDue) > 290000)
GROUP BY c.CustomerID , s.Name              
ORDER BY TotalSales DESC;

		

Executing this, I get back 2 rows, and the following plan:

I am going to ignore the index this time, but I got this IO output from turning STATISTICS IO on.

Table 'Worktable'. Scan count 0, logical reads 0
Table 'SalesOrderHeader'. Scan count 3, logical reads 183
Table 'Store'. Scan count 0, logical reads 4
Table 'Customer'. Scan count 0, logical reads 4

In the next step, I am going to first move the text filter only into a CTE, then in the following version I move the GROUP BY and HAVING into the CTE:

			
--version 1
WITH SalesOrderHeaderCTE AS (
               SELECT CustomerId, TotalDue
               FROM  Sales.SalesOrderHeader sohExists
               WHERE sohExists.OrderDate >= '2022-01-01'
                 AND sohExists.OrderDate < '2023-01-01'
)
SELECT c.CustomerID,
       s.Name AS CompanyName,
       SUM(soh.TotalDue) AS TotalSales
FROM Sales.SalesOrderHeader soh
    INNER JOIN Sales.Customer c
        ON soh.CustomerID = c.CustomerID
    INNER JOIN Sales.Store AS s
        ON s.BusinessEntityID = c.StoreID
WHERE EXISTS ( SELECT 1
                FROM  SalesOrderHeaderCTE AS sohExists
                WHERE c.CustomerID = sohExists.CustomerID
               GROUP BY sohExists.CustomerID
                HAVING SUM(sohExists.TotalDue) > 290000)
GROUP BY c.CustomerID , s.Name              
ORDER BY TotalSales DESC;
--Version 2
WITH SalesOrderHeaderCTE AS (
               SELECT CustomerId
               FROM  Sales.SalesOrderHeader sohExists
               WHERE sohExists.OrderDate >= '2022-01-01'
                 AND sohExists.OrderDate < '2023-01-01'
               GROUP BY sohExists.CustomerID
                HAVING SUM(sohExists.TotalDue) > 290000
)
SELECT c.CustomerID,
       s.Name AS CompanyName,
       SUM(soh.TotalDue) AS TotalSales
FROM Sales.SalesOrderHeader soh
    INNER JOIN Sales.Customer c
        ON soh.CustomerID = c.CustomerID
    INNER JOIN Sales.Store AS s
        ON s.BusinessEntityID = c.StoreID
WHERE EXISTS ( SELECT 1
                FROM  SalesOrderHeaderCTE AS sohExists
                WHERE c.CustomerID = sohExists.CustomerID
               )
GROUP BY c.CustomerID , s.Name              
ORDER BY TotalSales DESC;

		

Guess what? Same plan, same IO. Filtering in either location did the exact same thing.

The end of this myth

Wouldn’t that be nice. But to reiterate, I am not saying you don’t filter data as early in the process as possible. I am just saying you don’t do that in SQL, because the optmizer’s job is to do that for you.

I equally am not saying you don’t sometimes need to rewrite your query to help the optimizer out.

But don’t think that in most cases that where you put the WHERE clause will mean anything other than more typing for you unless you are doing it for informed reasons.

3 responses to “The myth and the truth of filtering early”

Steve E

2026-06-18

On most of these LinkedIn posts, my response is “prove it” as none of them ever include an execution plan or any type of evidence

Loading…

1. Louis Davidson
  
  2026-06-18
  
  For sure. Weirdly though, sometimes that kind of post can be more persuasive than saying prove it because most people don’t want to get into the weeds.
  
  Loading…
  
What “Filtering Early” Really Means – Curated SQL

2026-06-18

[…] Louis Davidson lays out the facts: […]

Loading…

Drsql's Database Musings