Hash Partitioned Exchange Spills

This blog post contains a few demos for generating hash partitioned exchange spills. It does not attempt to explain why performance is so bad in some cases, but I think that the behavior here is simply interesting to observe. Note that all of the demos were done on SQL Server 2016 SP1 CU4. Some of this may not be reproducible on other versions.

Order Preserving Streams

First I need to say a few words about repartition and gather streams operators. Here’s an example of one:

a14_operator

These operators are easy to misunderstand. Even when they have an “order by” they do not directly do a sort in the traditional sense. Instead, they rely on the ordered input threads to produce 1 or more ordered output threads. There’s no memory grant associated with them. For an illustration of how this could work, consider a very simple example with 4 rows on two threads:

a14_order_preserving

After the first step, values 1 and 2 from thread 1 are processed. There is a switch to thread 2 that moves 3, 4, and 5, and so on. This all explained in a much better way by Paul White in his talk on parallelism at the 2013 PASS Summit:

What is an Exchange Spill?

As usual, the good stuff is hidden in extended event descriptions:

Occurs when the memory communication buffers for a query with multiple Parallelism operators become full, resulting in one of the operators writing to TempDB. If this happens multiple times in a single query plan the query performance is impacted. Use this event in conjunction with any of the *_showplan events to determine which operation in the generated plan is causing the exchange spill using the node_id field

According to Paul White, one way to get a deadlock is when the buffers are full but there aren’t any rows on one of the threads. There is a brilliant demo that involves partitioning by round robin near the end of the talk that starts here:

This blog post focuses on deadlocks that occur with hash partitioning.

The Test Query

Only one table is needed to see exchange spills caused by hash partitioning. The first column stores the ID used for the join and the second column is used to pad out the pages. The clustered index isn’t a primary key to allow for duplicate values. Table definition:

DROP TABLE IF EXISTS DEADLOCK;

CREATE TABLE DEADLOCK (
	ID BIGINT NOT NULL,
	FLUFF VARCHAR(100)
);

CREATE CLUSTERED INDEX CI__DEADLOCK ON DEADLOCK (ID);

The query that I’ll run forces a parallel merge join with varying MAXDOP:

SELECT t1.ID
FROM DEADLOCK t1
WHERE EXISTS (
       SELECT 1
       FROM DEADLOCK t2
       WHERE t1.ID = t2.ID
)
ORDER BY t1.ID
OPTION (QUERYTRACEON 8649, MERGE JOIN, MAXDOP 2);

With this query, we can force an order preserving repartition streams to be hashed against a column with as few distinct values as we like. Note that there is an element of chance to this. For some data distributions a deadlock may not always occur. The performance of the same query can vary to an extreme degree as well.

Getting a Deadlock

One way to see a deadlock is by putting 50k rows into the table with four distinct values for ID:

TRUNCATE TABLE DEADLOCK;

INSERT INTO DEADLOCK WITH (TABLOCK)
SELECT
  (RN - 1) / 12500
, REPLICATE('Z', 100)
FROM (
       SELECT TOP (50000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
       FROM master..spt_values t1
       CROSS JOIN master..spt_values t2
	   CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

UPDATE STATISTICS DEADLOCK CI__DEADLOCK WITH FULLSCAN;

Running the SELECT query from before with MAXDOP 2 seems to pretty reliably produce a deadlock. The query typically takes around 7 seconds to run at first but it usually finishes much quicker after the deadlock checker has woken up. The deadlock can be seen with the exchange_spill extended event or by the tempdb spill in the repartition streams operator:

a14_deadlock

Putting the Dead in Deadlock

Some queries have extremely variable performance. They can run for seconds, minutes, hours, or even longer than a day. They can eventually be killed by the deadlock monitor. I had one such query running for longer than 24 hours, but apparently Microsoft got embarrassed and killed SSMS:

a14_SSMS

There are many ways to see this behavior. Inserting alternating 0s and 1s seems to do the trick:

TRUNCATE TABLE DEADLOCK;

INSERT INTO DEADLOCK WITH (TABLOCK)
SELECT
  RN % 2
, REPLICATE('Z', 100)
FROM (
       SELECT TOP (100000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
       FROM master..spt_values t1
       CROSS JOIN master..spt_values t2
	   CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

UPDATE STATISTICS DEADLOCK CI__DEADLOCK WITH FULLSCAN;

The first 49000 rows or so are displayed fairly consistently in SSMS. After that the query slows to a crawl. It only used 140 ms of CPU time after five minutes of execution. I wasn’t able to get this query to finish on my machine, but other similar queries finished after many hours. The data in sys.dm_exec_query_profiles is interesting:

a14_dmv

Assuming a packet size of 4500 rows, the scan at node id 8 is just one packet away from finishing. Thread 1 for the repartition streams is finished along with thread 1 of the merge join. All 50k rows with a value of 0 have been processed by the merge join but only 49898 rows made it to the gather streams at the end of the plan. I’ve seen this kind of behavior with the performance issue that affects some parallel queries with a TOP operator.

All six rows from sys.dm_os_waiting_tasks have a wait type of CXPACKET. There are resource descriptions of WaitType=e_waitPortClose. Ultimately, it’s not clear to me why this query appears to run “forever”, but one way or another it should eventually finish.

Final Thoughts

The same behavior can be seen in 2017 RC2. I couldn’t get either of the two example queries to finish on that version. Some of my test cases don’t cause deadlocks in 2016 SP1 CU2. It appears that Microsoft has done work in this area with negative consequences for some data distributions. A theory for why this happens can be found here. Microsoft appears to have fixed this in 2016 SP1 CU6. Thanks for reading!

Dumping Call Stacks

If you’re like me, you’ve seen people posting information found through debugging SQL Server and thought “Yeah, I could be that cool”. If you want to see call stacks but failed to get anywhere with the debugger then this post might be for you. It contains step-by-step instructions for viewing SQL Server call stacks by creating a minidump.

What is a Minidump?

I will borrow a definition from Thomas Kejser’s blog post:

What is a dump? It is a file containing a snapshot of the running process – and parts or all of the memory space of that process. The snapshot also contains the call stack of every thread the process has created.

That last sentence is relevant to our interests. We can use minidumps to generate small files that contain SQL Server call stacks. Note that if you aren’t careful you can end up writing the full contents of memory to a dump file. I believe that writing the file is a single-threaded process, so this can take hours and hundreds of GBs on a real server. Also SQL Server is frozen while the dump is happening, so don’t do it in production.

Dumps are most useful when you want to examine a long running, very simple query or a query that seems to be stuck at a certain point. They aren’t useful to see a full call stack of a non-simple query or to see something that happens just once during execution, like a specific task during query compilation. All that you can get is a snapshot and the snapshot may not contain the information that you’re looking for.

There are multiple ways to generate and read dump files. For this blog post I’ll be using sqldumper.exe and WinDbg.

Using sqldumper.exe

First go to the directory that contains sqldumper.exe. On my machine for SQL Server 2016 the directory is C:\Program Files\Microsoft SQL Server\130\Shared . Open an admin command prompt and point it to that directory. Here are the arguments that we need to look at call stacks:

C:\Program Files\Microsoft SQL Server\130\Shared>sqldumper
Usage: sqldumper [ProcessID [ThreadId [Flags[:MiniDumpFlags] [SqlInfoPtr [DumpDir

There are many ways to get the ProcessID for SQL Server. One way is to run the following SQL query:

SELECT SERVERPROPERTY('PROCESSID');

At the time of writing this post I have a process ID of 2364.

If ThreadId is set to 0 you’ll get information about all threads, including system threads that you might not be interested in. Sometimes you can get everything that you’re interested in by looking at a single thread for serial queries, or at a single thread if a parallel query appears to be throwing too much work at that thread. There’s probably a better way to write this query, but the query that I use to find the thread that I’m interested in is below:

SELECT th.os_thread_id, wta.wait_type
FROM sys.dm_os_threads th
INNER JOIN sys.dm_os_workers w
	ON th.worker_address = w.worker_address
INNER JOIN sys.dm_os_tasks tk
	ON w.task_address = tk.task_address
INNER JOIN sys.dm_os_waiting_tasks wta
	ON wta.waiting_task_address = tk.task_address
where tk.session_id = 56;

wait_type is there to give a clue about which threads are interesting. More columns can be added to the query as needed.

The Flags parameter controls which information is written to the dump file. There are many options and some of them write the full contents of memory to disk. The most useful ones that I’ve found are 0x0120 (dump minimal information about all threads) and 0x0100 (dump information about a single specified thread).

I always set SqlInfoPtr to 0 and don’t care to know what it does.

DumpDir is where the dump files are written to. Point it to your preferred place for leaving dumps, like Erik’s doorstep.

Getting the Call Stack

If your dump is successful then you’ll end up with a .mdmp file. This can be opened with your favorite debugging program. I use WinDbg because there are instructions for it and I generally don’t know what I’m doing. Open the program and go to File -> Open Crash Dump. Open your file and type the magic command:

~*kn

After a short while you’ll see call stack information:

a13_magic

As far as I can tell, you have to close and reopen the program in order to open a new crash dump file. This is unfortunate, but it makes viewing the call stack that much sweeter.

Insert Example

Let’s start with a simple example. How do call stacks differ when inserting into a heap vs a clustered index? For the source data I’ll put 2 GB of pages into a clustered index:

DROP TABLE IF EXISTS dbo.source_ci;

create table dbo.source_ci (
	ID BIGINT NOT NULL,
	FILLER VARCHAR(7777) NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO source_ci WITH (TABLOCK)
SELECT TOP (250000)
	ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
	, REPLICATE('Z', 7777)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

Here’s the definition for the target table with a clustered index:

DROP TABLE IF EXISTS dbo.target_ci;

create table dbo.target_ci (
	ID BIGINT NOT NULL,
	FILLER VARCHAR(7777) NOT NULL,
	PRIMARY KEY (ID)
);

And here’s the query that I want to get a dump of:

INSERT INTO dbo.target_ci WITH (TABLOCK)
SELECT *
FROM dbo.source_ci WITH (TABLOCK);

I previously wrote the source data to source_ci in order to make the insert query into target_ci as simple as possible. I want to maximize the chances that the call stack associated with the step of inserting rows into the table is present in the dump file. If the query to select the data for the insert is too complicated then I might not get what I’m looking for when I take a snapshot.

I started the insert, identified the thread of interest (8324), and dumped information for the single thread:

C:\Program Files\Microsoft SQL Server\130\Shared>sqldumper 2364 8324 0x0100 0 c:\sql_dumps
Parsed parameters:
    ProcessID = 2364
    ThreadId = 8324
    Flags = 0x100
    MiniDumpFlags = 0x1160
    SqlInfoPtr = 0x0000000000000000
    DumpDir = c:\sql_dumps
    ExceptionRecordPtr = 0x0000000000000000
    ContextPtr = 0x0000000000000000
    ExtraFile = <NULL>
    PatternForExtraFiles = <NULL>
    InstanceName = <NULL>
    ServiceName = <NULL>
Remote process didn't specify a dump file name
Target suspended
Callback type 11 not used
Callback type 15 not used
Callback type 7 not used
MiniDump completed: c:\sql_dumps\SQLDmpr0024.mdmp
Total Buffer pool data pages filtered out: 0 KB
Total Hekaton data pages filtered out: 0 KB
Total Free memory (from non top level allocators) filtered out: 0 KB
Total top level free memory filtered out: 0 KB
Total Log pool memory filtered out: 0 KB
Location of module 'dbghelp.dll' : 'C:\Program Files\Microsoft SQL Server\130\Shared\dbghelp.dll'
File version of module 'C:\Program Files\Microsoft SQL Server\130\Shared\dbghelp.dll' : '6.12:2.633'
Product version of module 'C:\Program Files\Microsoft SQL Server\130\Shared\dbghelp.dll' : '6.12:2.633'
Location of module 'sqldumper.exe' : 'C:\Program Files\Microsoft SQL Server\130\Shared\SqlDumper.exe'
File version of module 'C:\Program Files\Microsoft SQL Server\130\Shared\SqlDumper.exe' : '2015.130:1601.5'
Product version of module 'C:\Program Files\Microsoft SQL Server\130\Shared\SqlDumper.exe' : '13.0:1601.5'
Watson Invoke: No

I did the same thing with a heap target table (with a MAXDOP 1 hint), and diffed the call stacks:

a13_call_stacks

Some of the function names are the same, which makes a lot of sense. We’re reading from the same source table. Of course there are differences as well. For example, for the heap we see sqlmin!CHeapBuild::InsertRow+0x151 and for the clustered index we see sqlmin!CIndBuild::InsertRow+0xd84. That’s pretty neat.

Stuck Query Example

For the next example I’ll use my favorite query that never finishes. First we need to create a few tables:

DROP TABLE IF EXISTS dbo.TestDriver;

CREATE TABLE dbo.TestDriver
(
    n integer NOT NULL,
    n2 integer NOT NULL
);

-- 100k rows
INSERT dbo.TestDriver WITH (TABLOCK) (n, n2)
SELECT TOP (100000)
CHECKSUM(sv1.number, NEWID()), CHECKSUM(sv1.number, NEWID())
FROM master.dbo.spt_values AS SV1
CROSS JOIN master.dbo.spt_values AS SV2
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.TestCCI;

CREATE TABLE dbo.TestCCI
(
    n integer NOT NULL,
    INDEX ccsi CLUSTERED COLUMNSTORE
);

-- 10 M rows
INSERT dbo.TestCCI WITH (TABLOCK) (n)
SELECT TOP (10 * 1000 * 1000)
CHECKSUM(sv1.number, NEWID())
FROM master.dbo.spt_values AS SV1
CROSS JOIN master.dbo.spt_values AS SV2
CROSS JOIN master.dbo.spt_values AS SV3
OPTION (MAXDOP 1);

The following query seemingly runs forever:

SELECT CA.x
FROM
(
    SELECT TOP (1) n2
    FROM dbo.TestDriver
    ORDER BY n ASC, n2 DESC
) AS T1 (id2)
CROSS APPLY
(
    SELECT COUNT_BIG(*)
    FROM dbo.Test AS T2
    WHERE T2.n <= T1.id2
) AS CA (x);

Viewing a call stack could be helpful once you’ve exhausted the usual ways of trying to figure out why the query isn’t finishing. If I run the query to get thread information I see that os_thread_id 4176 has a wait type of CXPACKET (zzzzz) and os_thread_id 3076 has a wait type of HTBUILD. Time to take a dump:

sqldumper 2364 3076 0x0100 0 c:\sql_dumps

After running the magic command:

ntdll!NtSignalAndWaitForSingleObject+0x14
KERNELBASE!SetHandleCount+0x1f850
sqldk!SOS_Scheduler::Switch+0x106
sqldk!SOS_Scheduler::SuspendNonPreemptive+0xd3
sqlmin!EventInternal<SuspendQueueSLock>::Wait+0x1e7
sqlmin!CSyncPoint::WaitAtNthGate+0x1ac
sqlmin!CSyncPoint::Wait+0x13e
sqlmin!CBpSpillProcessor::Main+0xf8
sqlmin!CBpQScanHashAggNew::BpGetNextBatch+0x52
sqlmin!CQScanBatchHelper::GetRow+0x97
sqlmin!CQScanNLJoinTrivialNew::GetRow+0x12c
sqlmin!CQScanProfileNew::GetRowImp<0>+0x11d
sqlmin!CQScanXProducerNew::GetRowHelper+0x63
sqlmin!CQScanXProducerNew::GetRow+0x15
sqlmin!FnProducerOpen+0x5b
sqlmin!FnProducerThread+0x7a9
sqlmin!SubprocEntrypoint+0x10ab
sqldk!SOS_Task::Param::Execute+0x231
sqldk!SOS_Scheduler::RunTask+0xaa
sqldk!SOS_Scheduler::ProcessTasks+0x3cd
sqldk!SchedulerManager::WorkerEntryPoint+0x2a1
sqldk!SystemThread::RunWorker+0x8f
sqldk!SystemThreadDispatcher::ProcessWorker+0x2de
sqldk!SchedulerManager::ThreadEntryPoint+0x1d8
kernel32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

Now you can write even more detailed connect items that will never get fixed!

Final Thoughts

Now you can impress your friends with call stacks, as long as you have easily impressed friends. Thanks for reading!

Dynamic Data Unmasking

Dynamic data masking is a SQL Server 2016 feature to mask sensitive data at the column level from non-privileged users. Hiding SSNs is a common example in the documentation. However, the documentation also gives the following warning:

The purpose of dynamic data masking is to limit exposure of sensitive data, preventing users who should not have access to the data from viewing it. Dynamic data masking does not aim to prevent database users from connecting directly to the database and running exhaustive queries that expose pieces of the sensitive data.

How bad can it be? This post explores how quickly a table of SSNs can be unmasked by a non-privileged user.

Simple Demo

Let’s use a table structure very similar to the example in the documentation:

DROP TABLE IF EXISTS dbo.People;

CREATE TABLE dbo.People (
	PersonID bigint PRIMARY KEY,
	FirstName varchar(100) NOT NULL,
	LastName varchar(100) NOT NULL,
	SSN varchar(11)
		MASKED WITH (FUNCTION = 'default()') NULL
);

INSERT INTO dbo.People
VALUES (1, 'Pablo', 'Blanco','123-45-6789');

Here’s what the data looks like for a privileged user, such as a user with sa:

a12_sa_results

However, if I login with my lowly erik SQL Server login I can no longer see Pablo Blanco’s SSN:

a12_erik_results

Test Data

To make things more interesting let’s load a million rows into the table. SSNs will be randomized but I didn’t bother randomizing the first and last names.

DROP TABLE IF EXISTS dbo.People;

CREATE TABLE dbo.People (
	PersonID bigint PRIMARY KEY,
	FirstName varchar(100) NOT NULL,
	LastName varchar(100) NOT NULL,
	SSN varchar(11)
		MASKED WITH (FUNCTION = 'default()') NULL
);

INSERT INTO dbo.People WITH (TABLOCK)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('A', 10)
, REPLICATE('Z', 12)
, RIGHT('000' + CAST(ABS(CHECKSUM(NewId())) AS VARCHAR(11)), 3)
	 + '-' + RIGHT('00' + CAST(ABS(CHECKSUM(NewId())) AS VARCHAR(11)), 2)
	 + '-' + RIGHT('0000' + CAST(ABS(CHECKSUM(NewId())) AS VARCHAR(11)), 4)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

How quickly can the malicious end user erik decode all of the data? Does he really require a set of exhaustive queries? To make things somewhat realistic, setting trace flags and creating objects is off limits. Only temp tables can be created, since all users can do that.

Decoding the SSN Format

The WHERE clause of queries can be used to infer information about the data. For example, the following query is protected by data masking because all of the action is in the SELECT clause:

SELECT PersonId
, FirstName
, LastName
, CASE LEFT(SSN, 1)
	WHEN '0' THEN '0'
	WHEN '1' THEN '1'
	WHEN '2' THEN '2'
	WHEN '3' THEN '3'
	WHEN '4' THEN '4'
	WHEN '5' THEN '5'
	WHEN '6' THEN '6'
	WHEN '7' THEN '7'
	WHEN '8' THEN '8'
	WHEN '9' THEN '9'
	ELSE NULL
  END D1
FROM dbo.People;

However, the following query will only return the subset of rows with 1 as the first digit in their SSNs:

SELECT PersonId
, FirstName
, LastName
FROM dbo.People
WHERE LEFT(SSN, 1) = 1;

With 90 queries we could get all of the information that we need, but that’s too much work. First we need to verify the format of the SSN in the column. Perhaps it has dashes and perhaps it doesn’t. Let’s say that our malicious end user gets lucky and both of the following queries return a count of one million rows:

SELECT COUNT(*)
FROM dbo.People
WHERE LEN(SSN) = 11;

SELECT COUNT(*)
FROM dbo.People
WHERE LEN(REPLACE(SSN, '-', '')) = 9;

It’s a reasonable assumption that the SSN is in a XXX-XX-XXXX format, even though the data mask doesn’t tell us that directly.

Looping to Victory

Armed with our new knowledge, we can create a single SQL query that decodes all of the SSNs. The strategy is to define a single CTE with all ten digits and to use one CROSS APPLY for each digit in the SSN. Each CROSS APPLY only references the SSN column in the WHERE clause and returns the matching prefix of the SSN that we’ve found so far. Here’s a snippet of the code:

SELECT
	 p.PersonID
       , d9.real_ssn
FROM dbo.People p
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d0.DIGIT + '%'
) d1 (prefix)
CROSS APPLY (
       SELECT TOP 1 d1.prefix + d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d1.prefix + d0.DIGIT + '%'
) d2 (prefix)

In the d1 derived table the first digit is found. That digit is passed to the d2 derived table and the first two digits are returned from d2. This continues all the way to d9 which has the full SSN. The full query is below:

DROP TABLE IF EXISTS #t;

WITH DIGITS (DIGIT)
AS
(
	SELECT *
	FROM (
		VALUES ('0'), ('1'), ('2'), ('3'), ('4')
	       , ('5'), ('6'), ('7'), ('8'), ('9')
	) v(x)
)
SELECT
	 p.PersonID
       , p.FirstName
       , p.LastName
       , d9.real_ssn
	into #t
FROM dbo.People p
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d0.DIGIT + '%'
) d1 (prefix)
CROSS APPLY (
       SELECT TOP 1 d1.prefix + d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d1.prefix + d0.DIGIT + '%'
) d2 (prefix)
CROSS APPLY (
       SELECT TOP 1 d2.prefix + d0.DIGIT + '-'
       FROM DIGITS d0
       WHERE p.SSN LIKE d2.prefix + d0.DIGIT + '%'
) d3 (prefix)
CROSS APPLY (
       SELECT TOP 1 d3.prefix + d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d3.prefix + d0.DIGIT + '%'
) d4 (prefix)
CROSS APPLY (
       SELECT TOP 1 d4.prefix + d0.DIGIT + '-'
       FROM DIGITS d0
       WHERE p.SSN LIKE d4.prefix + d0.DIGIT + '%'
) d5 (prefix)
CROSS APPLY (
       SELECT TOP 1 d5.prefix + d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d5.prefix + d0.DIGIT + '%'
) d6 (prefix)
CROSS APPLY (
       SELECT TOP 1 d6.prefix + d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d6.prefix + d0.DIGIT + '%'
) d7 (prefix)
CROSS APPLY (
       SELECT TOP 1 d7.prefix + d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d7.prefix + d0.DIGIT + '%'
) d8 (prefix)
CROSS APPLY (
       SELECT TOP 1 d8.prefix + d0.DIGIT
       FROM DIGITS d0
       WHERE p.SSN LIKE d8.prefix + d0.DIGIT + '%'
) d9 (real_ssn);

On my machine, this query takes an average of 5952 ms to finish. Here’s a sample of the results:

a12_sample_results

Not bad to unmask one million SSNs.

Looping Even Faster to Victory

The LIKE operator is a bit heavy for what we’re doing. Another way to approach the problem is to have each derived table just focus on a single digit and to concatenate them all together at the end. I found SUBSTRING to be the fastest way to do this. The full query is below:

DROP TABLE IF EXISTS #t;

WITH DIGITS (DIGIT)
AS
(
	SELECT *
	FROM (
		VALUES ('0'), ('1'), ('2'), ('3'), ('4')
	       , ('5'), ('6'), ('7'), ('8'), ('9')
	) v(x)
)
SELECT
	 p.PersonID
       , p.FirstName
       , p.LastName
       , d1.DIGIT + d2.DIGIT + d3.DIGIT
		+ '-' + d4.DIGIT + d5.DIGIT
		+ '-' + d6.DIGIT + d7.DIGIT
		+ d8.DIGIT + d9.DIGIT AS real_ssn
	   into #t
FROM dbo.People p
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 1, 1) = d0.DIGIT
) d1 (DIGIT)
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 2, 1) = d0.DIGIT
) d2 (DIGIT)
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 3, 1) = d0.DIGIT
) d3 (DIGIT)
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 5, 1) = d0.DIGIT
) d4 (DIGIT)
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 6, 1) = d0.DIGIT
) d5 (DIGIT)
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 8, 1) = d0.DIGIT
) d6 (DIGIT)
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 9, 1) = d0.DIGIT
) d7 (DIGIT)
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 10, 1) = d0.DIGIT
) d8 (DIGIT)
CROSS APPLY (
       SELECT TOP 1 d0.DIGIT
       FROM DIGITS d0
       WHERE SUBSTRING(p.SSN, 11, 1) = d0.DIGIT
) d9 (DIGIT);

This query runs in an average on 1833 ms on my machine. The query plan looks as you might expect. Each cross apply is implemented as a parallel nested loop join against a constant scan of 10 values. On average each constant scan operator produces roughly 5.5 million rows. This makes sense, since for each loop we’ll need to check an average of 5.5 values before finding a match, assuming perfectly distributed random digits. Here’s a representative part of the plan:

a12_query1

Letting SQL Server do the Work

With nine digits we end up reading almost 50 million values from the constant scan operators. That’s a lot of work. Can we write a simpler query and let SQL Server do the work for us? We know that SSNs are always numeric, so if we had a table full of all billion possible SSNs then we could join to that and just keep the value from the table. Populating a temp table with a billion rows will take too long, but we can simply split up the SSN into its natural three parts and join to those tables. One way to do this is below:

SELECT TOP (100)
	RIGHT('0' + CAST(t.RN AS VARCHAR(10)), 2) NUM
INTO #t_100
FROM
(
	SELECT -1 + ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

SELECT TOP (1000)
	RIGHT('00' + CAST(t.RN AS VARCHAR(10)), 3) NUM
INTO #t_1000
FROM
(
	SELECT -1 + ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

SELECT TOP (10000)
	RIGHT('000' + CAST(t.RN AS VARCHAR(10)), 4) NUM
INTO #t_10000
FROM
(
	SELECT -1 + ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

DROP TABLE IF EXISTS #t;

SELECT
	 p.PersonID
       , p.FirstName
       , p.LastName
       , t1000.NUM
		+ '-' + t100.NUM
		+ '-' + t10000.NUM AS SSN
	into #t
FROM dbo.People p
LEFT OUTER JOIN #t_1000 t1000
	ON SUBSTRING(p.SSN, 1, 3) = t1000.NUM
LEFT OUTER JOIN #t_100 t100
	ON SUBSTRING(p.SSN, 5, 2) = t100.NUM
LEFT OUTER JOIN #t_10000 t10000
	ON SUBSTRING(p.SSN, 8, 4) = t10000.NUM;

The query now runs in an average of 822 ms. Note that I didn’t try very hard to optimize the inserts into the temp tables because they finish almost instantly. Taking a look at the plan, we see a lot of repartition stream operators because the column for the hash join is different for each query:

a12_repartition

Can we go faster?

Batch Mode to the Rescue

With parallel batch mode hash joins we don’t need to repartition the streams of the larger outer result set. I changed the query to only look at the table with 10000 rows to get more consistent and even parallel row distribution on the temp tables. I also added a clustered index on the temp table for the same reason. In addition to that, maybe we can expect joins to be faster with INT join columns as opposed to VARCHAR. With the canonical #BATCH_MODE_PLZ temp table to make the query eligible for batch mode, the query now looks like this:

SELECT TOP (100000)
    ISNULL(CAST(RN AS INT), 0) NUM
INTO #t_10000
FROM
(
    SELECT -1 + ROW_NUMBER()
        OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) t;

CREATE CLUSTERED INDEX CI ON #t_10000 (NUM);
 
CREATE TABLE #BATCH_MODE_PLZ (
    I INT
    , INDEX C CLUSTERED COLUMNSTORE
);
 
DROP TABLE IF EXISTS #t;
 
SELECT
     p.PersonID
       , p.FirstName
       , p.LastName
       , t1000.NUM
        + '-' + t100.NUM
        + '-' + t10000.NUM AS SSN
    into #t
FROM dbo.People p
LEFT OUTER JOIN #t_10000 t1000
    ON CAST(SUBSTRING(p.SSN, 1, 3) AS INT) = t1000.NUM
LEFT OUTER JOIN #t_10000 t100
    ON CAST(SUBSTRING(p.SSN, 5, 2) AS INT) = t100.NUM
LEFT OUTER JOIN #t_10000 t10000
    ON CAST(SUBSTRING(p.SSN, 8, 4) AS INT) = t10000.NUM
LEFT OUTER JOIN #BATCH_MODE_PLZ ON 1 = 0;

The query now runs in an average of 330 ms. The repartition stream operators are no longer present:

a12_no_repart

It wasn’t clear to me how to speed this query up further. The probe residuals in the hash joins are one target:

a12_probe

These appear because SQL Server cannot guarantee that hash collisions won’t occur. Paul White points out the following:

If the join is on a single column typed as TINYINT, SMALLINT or INTEGER and if both columns are constrained to be NOT NULL, the hash function is ‘perfect’ – meaning there is no chance of a hash collision, and the query processor does not have to check the values again to ensure they really match.

Unfortunately, the probe residual remains even with the right temp table definition and adding explicit casts and non-null guarantees to the SUBSTRING expression. Perhaps the type information is lost in the plan and cannot be taken advantage of.

Final Thoughts

I don’t think that there’s really anything new here. This was mostly done for fun. Decoding a million SSNs in half a second is a good trick and a good reminder to be very careful with expectations around how much security data masking really gives you. Thanks for reading!

The Trillion Row Table

I loaded one trillion rows into a nonpartitioned table just to see what would happen. Spoiler: bad things happen.

Hardware and Table Plan

I did all of my testing on my home desktop which has an i5-4670 CPU (quad core), 5 GB of RAM for SQL Server, and a Samsung SSD 850 EVO 1 TB. Obviously not ideal to do testing like this but it was enough to get the job done. In order to fit a trillion rows into a table I had to use a CCI. Rowstore tables simply don’t offer good enough compression. For example, consider a page compressed heap with nothing but NULL values. One million rows takes up 10888 KB of disk space:

DROP TABLE IF EXISTS dbo.TEST_HEAP;
CREATE TABLE dbo.TEST_HEAP (
ID BIGINT
) WITH (DATA_COMPRESSION = PAGE);

INSERT INTO dbo.TEST_HEAP WITH (TABLOCK)
SELECT TOP (1000000) NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

EXEC sp_spaceused 'TEST_HEAP';

Therefore, such a table with a trillion rows would require around 10 TB of disk space. That won’t fit on a 1 TB HDD, but the better compression of CCIs gives us some options. Ultimately I decided on building completely full rowgroups of 1048576 rows with each rowgroup only storing a single value. Estimating space for the final table is difficult, but we can hopefully get an upper bound using the following code:

DROP TABLE IF EXISTS dbo.TRIAL_BALLOON;
CREATE TABLE dbo.TRIAL_BALLOON (
ID BIGINT,
INDEX CCI_TRIAL_BALLOON CLUSTERED COLUMNSTORE
) 

INSERT INTO dbo.TRIAL_BALLOON WITH (TABLOCK)
SELECT TOP (10 * 1048576) NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

EXEC sp_spaceused 'TRIAL_BALLOON';

The table has 776 KB reserved space and 104 KB space for data. In the worst case we might expect the final table to require 75 GB space on disk.

Serial Population Strategy

My strategy for populating the table was to run the same code in four different SQL Server sessions. Each session grabs the next ID from a sequence and does a MAXDOP 1 insert of 1048576 rows into the CCI. With only four concurrent sessions I didn’t expect to run into any locking issues such as the mysterious ROWGROUP_FLUSH wait event. The sequence definition is about as simple as it gets:

DROP SEQUENCE IF EXISTS dbo.CCI_Sequence;
CREATE SEQUENCE dbo.CCI_Sequence AS BIGINT
START WITH 1
INCREMENT BY 1
NO CACHE;

Here’s the code that I used to add rowgroups to the table:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU=0;

DECLARE @i BIGINT

SET NOCOUNT ON;

IF EXISTS (SELECT 1 FROM ##stop_table)
BEGIN
	SET @i = 9999999999999;
END
ELSE
BEGIN
	SELECT @i = NEXT VALUE FOR dbo.CCI_Sequence;
END; 

WHILE @i <= 953674
BEGIN
	WITH NUM (n) AS (
	SELECT n
	FROM
	(
	VALUES
	 (@i),(@i),(@i),(@i),(@i),(@i),(@i),(@i)
	,(@i),(@i),(@i),(@i),(@i),(@i),(@i),(@i)
	,(@i),(@i),(@i),(@i),(@i),(@i),(@i),(@i)
	,(@i),(@i),(@i),(@i),(@i),(@i),(@i),(@i)
	) v(n)
	)
	INSERT INTO dbo.BIG_DATA
	SELECT n1.n
	FROM NUM n1
	CROSS JOIN NUM n2
	CROSS JOIN NUM n3
	CROSS JOIN NUM n4
	OPTION (MAXDOP 1);

	IF EXISTS (SELECT 1 FROM ##stop_table)
	BEGIN
		SET @i = 9999999999999;
	END
	ELSE
	BEGIN
		SELECT @i = NEXT VALUE FOR dbo.CCI_Sequence;
	END;
END;

The affinity stuff was to get the work spread out evenly over my four schedulers. Each session was assigned to a different CPU from 0-3. It was also important to run the following in a new session after all four sessions started working:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU=AUTO;

It wasn't clear to me why that was needed to get good throughput. Perhaps part of the work of building a compressed rowgroup is offloaded to a system process?

The references to the ##stop_table temp table are just a way to pause the work as needed without skipping numbers in the sequence. I think that the code to generate 1048576 rows is fairly optimized. I did try to optimize it since this code was going to be run over 950000 times, but I still suspect that there was a better way to do it that I missed.

The jobs finished after about 2 days of running on 4 CPUs. That’s a rate of around 86 million rows per core per minute which I was pretty happy with. After the sessions finished I needed to do one final, very nervous, insert into the table:

INSERT INTO dbo.BIG_DATA
SELECT TOP (331776) 953675
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

Running sp_spaceused has never felt so satisfying:

a11_trillion

Under a GB for one trillion rows. Not bad.

Parallel Population Strategy

Taking advantage of natural parallelism within SQL Server may also be a viable method to populate a trillion row table (especially on an actual server), but I didn’t test it fully. In order to get perfect rowgroups, you need round robin parallelism with a driving table that’s a multiple of the query’s MAXDOP. For example, here’s a suitable plan:

a11_CCI_parallel_insert

The Constant Scan contains exactly four rows because I’m running MAXDOP 4. The source CCI has exactly 1048576 rows. The parallel insert happens without a repartition streams so we end up with exactly 1048576 rows on each thread and in each compressed rowgroup. Below is one way to generate such a plan:

DROP TABLE IF EXISTS dbo.SOURCE_CCI;
CREATE TABLE dbo.SOURCE_CCI (
ID BIGINT,
INDEX CCI_SOURCE_CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.SOURCE_CCI WITH (TABLOCK)
SELECT TOP (1048576) 0
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.TRIAL_BALLOON;
CREATE TABLE dbo.TRIAL_BALLOON (
ID BIGINT,
INDEX CCI_TRIAL_BALLOON CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.TRIAL_BALLOON WITH (TABLOCK)
SELECT  driver.n
FROM (
SELECT TOP (4) v.n
FROM (
	VALUES
		(1),(2),(3),(4)
	) v(n)
) driver
INNER JOIN dbo.SOURCE_CCI sc ON sc.ID < driver.n
OPTION (MAXDOP 4, NO_PERFORMANCE_SPOOL);

With nothing else running on my machine I estimate that this method would take about 2 days to complete. The problem is that if one core is busy with something else, such as watching terrible Youtube videos, then the entire insert could be slowed down.

Updating Stats

To do anything interesting on a table we want statistics. Gathering statistics for extremely compressed data can be challenging in SQL Server. That is because the target sampled rate is based on the total size of the table as opposed to the number of rows in the table. Consider an 8 billion row table built in the same way as the one trillion row table above. SQL Server generates the following query to gather sampled stats against the table:

SELECT StatMan([SC0])
FROM (
SELECT TOP 100 PERCENT [ID] AS [SC0]
FROM [dbo].[BIG_DATA] WITH (READUNCOMMITTED)
ORDER BY [SC0]
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1)

You may notice the lack of TABLESAMPLE as well as the MAXDOP 1 hint. To gather sampled stats SQL Server will get all eight billion rows from the table, sort them, and build the statistics object using the eight billion rows. On my machine, this took almost 3 hours to complete and tempdb grew to 85 GB.

There is a trick to get more reasonable sampled stats. All that’s needed is to increase the table size while keeping the same data. Soft deletion of compressed rowgroups is a good way to accomplish this. First find a data distribution that doesn’t compress well in CCIs. Here’s one example:

SELECT 9999999 + SUM(RN) OVER (ORDER BY RN)
FROM (
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

Each rowgroup has a size of 8392 KB, so adding 100 rowgroups will add 839200 KB to the table. Deleting all of the newly added rows can take a little while and will log quite a bit to the transaction log, but the table size won’t change. Gathering sampled stats after the insert and delete took just a few seconds. The sample size was about 1% of the table. After a REORG the fully deleted rowgroups will be marked as TOMBSTONE and cleaned up by a background process.

For the one trillion row table I decided to roll the dice and go for sampled stats without any tricks. I gave tempdb a maximum size of 280 GB in order to not completely fill my hard drive. The stats update took 3 hours and 44 minutes. Surprisingly, the stat update grew tempdb to its maximum size but it didn’t fail. Perhaps I got very lucky. Here is the hard earned stats object:

a11_stats_object

Expected Query Performance

I expected reasonably fast query performance for queries designed to take advantage of rowgroup elimination. After all, the table was built in a way such that every compressed rowgroup only has a single value. I can get a count of rowgroups along with some other metadata in 20 seconds using the DMVs:

SELECT COUNT(*), MIN(css.min_data_id), MAX(css.max_data_id)
FROM sys.objects o
INNER JOIN sys.columns c ON o.object_id = c.object_id
INNER JOIN sys.partitions p ON o.object_id = p.object_id
INNER JOIN sys.column_store_segments css
    ON p.hobt_id = css.hobt_id
    AND css.column_id = c.column_id
WHERE o.name = 'BIG_DATA'
AND c.name = 'ID';

Getting the relevant segment_ids for a particular filter finishes in under a second:

SELECT segment_id
FROM sys.objects o
INNER JOIN sys.columns c ON o.object_id = c.object_id
INNER JOIN sys.partitions p ON o.object_id = p.object_id
INNER JOIN sys.column_store_segments css
    ON p.hobt_id = css.hobt_id
    AND css.column_id = c.column_id
WHERE o.name = 'BIG_DATA'
AND c.name = 'ID'
AND 500000 BETWEEN css.min_data_id AND css.max_data_id;

I’m dealing with DMVs so I would expect SQL Server to be able to do rowgroup elimination in a much more efficient way than the above queries. Therefore, 30 seconds seemed like a reasonable upper bound for the following query:

SELECT COUNT(*)
FROM dbo.BIG_DATA
WHERE ID = 500000;

The query takes over 15 minutes to complete despite reading only a single segment:

Table ‘BIG_DATA’. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6, lob physical reads 1, lob read-ahead reads 0.
Table ‘BIG_DATA’. Segment reads 1, segment skipped 953674.

SQL Server Execution Times:
CPU time = 905328 ms, elapsed time = 917478 ms.

The Evil Wait Event

The query had a max wait time of 915628 ms for QUERY_TASK_ENQUEUE_MUTEX. This is suspiciously close to the elapsed time of 917478 ms. Unfortunately this is a very unpopular wait event in the industry. The wait event library has almost no information about it as well.

I call this the evil wait event because while it’s happening queries cannot be canceled through SSMS and many unrelated queries won’t even run. Most of the time no useful work can be done on the instance. I can’t read Russian so I’m not sure what the wait event is about. After I restarted SQL Server the wait event no longer appeared as consistently, but query performance did not improve as far as I could tell.

Other Queries

I ran a few other tests queries as well, although I was limited in what I could do by the evil wait event. The following query is the only one that I found that wasn’t affected:

SELECT TOP 1 ID
FROM dbo.BIG_DATA;

For reasons I don’t understand the query still took a long time:

Table ‘BIG_DATA’. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6, lob physical reads 1, lob read-ahead reads 0.
Table ‘BIG_DATA’. Segment reads 1, segment skipped 0.

SQL Server Execution Times:
CPU time = 791625 ms, elapsed time = 811202 ms.

Counting the rows in the table took almost half an hour:

SELECT COUNT_BIG(*)
FROM dbo.BIG_DATA;

Statistics output:

Table ‘BIG_DATA’. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 5722050, lob physical reads 819345, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 3734515 ms, elapsed time = 1635348 ms.

A query to sum every ID in the table took over 40 minutes:

SELECT SUM(ID)
FROM dbo.BIG_DATA;

Statistics output:

SQL Server Execution Times:
CPU time = 4810422 ms, elapsed time = 2433689 ms.

As expected performance gets worse without aggregate pushdown. Picking a simple query that’s not supported:

SELECT MAX(ID / 1)
FROM dbo.BIG_DATA;

This query took over an hour:

SQL Server Execution Times:
CPU time = 10218343 ms, elapsed time = 3755976 ms.

Queries against some of the CCIs DMVs don’t do very well either. A simple count took almost ten minutes:

SELECT COUNT(*)
FROM sys.dm_db_column_store_row_group_physical_stats
where OBJECT_ID = OBJECT_ID('BIG_DATA');

All of the work appears to be in the COLUMNSTORE_ROW_GROUPS table-valued function but I didn’t dig any more into it.

If you’re interested in how a query performs let me know in the comments. I will try anything that’s somewhat reasonable.

Final Thoughts

Now I can add working with trillion row tables to my resume. The compression for the one trillion row table was very impressive but everything else was decidedly less impressive. The very long QUERY_TASK_ENQUEUE_MUTEX wait times for nearly all queries were especially disappointing. I plan to do more testing at a later date with a partitioned table to see if that helps at all. Thanks for reading!

Rowgroup Elimination

Rowgroup elimination is a performance optimization based on compressed rowgroup metadata that can allow rowgroups to be skipped during query execution. It’s likely that all of the metadata used for the optimization is exposed in the sys.column_store_segments DMV. This blog post explores some of the less well known rules and limitations for rowgroup elimination.

Test Data

To keep things very simple we’ll build 100 rowgroups with exactly 1 million rows in each of them. ID and ID2 increase from 1 to 10000000 and ID_NULL is always NULL. Code to create and populate the table:

DROP TABLE IF EXISTS dbo.MILLIONAIRE_CCI;

CREATE TABLE dbo.MILLIONAIRE_CCI (
	ID BIGINT NULL,
	ID2 BIGINT NULL,
	ID_NULL BIGINT NULL,
	INDEX CCI_MILLIONAIRE_CCI CLUSTERED COLUMNSTORE
);

DECLARE @loop INT = 0;
BEGIN
	SET NOCOUNT ON;
	WHILE @loop < 100
	BEGIN
		INSERT INTO dbo.MILLIONAIRE_CCI WITH (TABLOCK)
		SELECT t.RN, t.RN, NULL
		FROM (
			SELECT TOP (1000000)
				(1000000 * @loop)
				+ ROW_NUMBER()
					OVER (ORDER BY (SELECT NULL)) RN
			FROM master..spt_values t1
			CROSS JOIN master..spt_values t2
			ORDER BY RN
		) t
		OPTION (MAXDOP 1);

		SET @loop = @loop + 1;
	END;
END;

We can expect very good rowgroup elimination on the ID and ID2 columns based on how we built them. That can be verified by calculating the REFF or by looking at sys.column_store_segments:

a10_not_null_DMV

Code to generate the above result set:

SELECT css.min_data_id, css.max_data_id, css.has_nulls
FROM sys.objects o
INNER JOIN sys.columns c ON o.object_id = c.object_id
INNER JOIN sys.partitions p ON o.object_id = p.object_id
INNER JOIN sys.column_store_segments css
    ON p.hobt_id = css.hobt_id
    AND css.column_id = c.column_id
INNER JOIN sys.dm_db_column_store_row_group_physical_stats s
    ON o.object_id = s.object_id
    AND css.segment_id = s.row_group_id
    AND s.partition_number = p.partition_number
WHERE o.name = 'MILLIONAIRE_CCI'
AND c.name = 'ID'
AND s.[state] = 3
ORDER BY css.min_data_id, css.segment_id;

Many of the test queries below select a single aggregate value. This isn’t done for any special reason other than to limit the size of the result set. The easiest way to see how many rowgroups were skipped is to use SET STATISTICS IO ON and that requires that the results be returned to the client.

Single Column Filtering

Consider the following query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IN (1000000, 2000001);

Based on how we built the segments for the ID column we might expect that only two segments will need to be read: segment 1 with ID values of 1-1000000 and segment 3 with ID values of 2000001-3000000. As usual, SQL Server does not care about our expectations:

Table ‘MILLIONAIRE_CCI’. Segment reads 3, segment skipped 97.

Why did the storage engine scan two segments instead of three? Running another test makes the problem more clear:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IN (1, 100000000);

For this query we end up scanning the entire table:

Table ‘MILLIONAIRE_CCI’. Segment reads 100, segment skipped 0.

It seems as if the query optimizer reduces the predicate against the filtered column to be a range of IDs. That range of IDs is used for rowgroup elimination. In some cases it’s possible to write a WHERE clause that won’t return any rows but still isn’t eligible for rowgroup elimination. The storage engine is not able to skip any segments while executing the below query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID < 0 OR ID > 100000000;

There isn’t an issue when the where clause is filtering on a contiguous range. For example, the following query skips 98 segments as expected:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 2000000;

There also isn’t an issue when filtering down to multiple values as long as those values are sufficiently close together, as shown with the first example. I also wasn’t able to find any liminations around the number of values in the IN clause. The query below reads 1 segment and skips 99 as we might hope:

SELECT MAX(l.ID)
FROM dbo.MILLIONAIRE_CCI l
WHERE l.ID IN (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50
, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
, 61, 62, 63, 64
);

If we add one more filter value then the query optimizer changes the plan to use a join:

a10_part1_to_join

The above query is eligible for rowgroup elimination but it follows slightly different ruless as covered later in this post.

It is possible to disable the transformation to a join by using the undocumented query hint QueryRuleOff SelToLSJ. With 976 entries in the IN clause I still get rowgroup elimination as expected. With 977 entries nothing was pushed to the scan at all, and we get a truly horrible plan:

a10_terrible_plan

This doesn’t appear to be a columnstore limitation. The same behavior can be observed with a clusted rowstore index.

Getting back on track, the internal calculation around which rowgroups to skip isn’t always as simple as calculating the minimum and maximum in the range and using those values to do elimination. It’s possible to end up with no rowgroup elimination even when the maximum and minimum ID in the WHERE clause are close to each other. Consider the following query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 2
OR ID BETWEEN 2 AND 3;

The storage engine only has to read a single segment. We can see in the query plan that the optimizer was able to simplify the expression into something that happens to qualify for rowgroup elimination:

a10_part1_rewrite

Now consider the following query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 2
OR ID BETWEEN 3 AND 4;

It would be helpful if the query optimizer changed the predicate to ID BETWEEN 1 AND 4 when doing calculations around which rowgroups can be skipped. This does not happen, and as a result all 100 rowgroups are scanned. Rowgroup elimination won’t be available when the WHERE clause is a sufficiently complicated mix of AND and OR conditions, even when filtering on just one column.

NULLs

Information about NULLs is stored internally and can be used for rowgroup elimination. SQL Server knows that none of the compressed segments for the ID column contain NULL, so the storage engine can skip all 100 segments for the following query:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IS NULL;

Naturally, reversing the filter for this query will require the storage engine to scan the entire table.

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IS NOT NULL;

We might expect that query to skip all segments if we change the filter column to ID_NULL. All rows in the rowgroups for ID_NULL are NULL and SQL Server ought to be aware of that fact. However, the storage engine still scans the entire table even for the query below:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID_NULL IS NOT NULL;

The DMV for ID_NULL doesn’t look as we might expect:

a10_NULL_DMV

sys.column_store_dictionaries has a value of 0 for the entry_count column. It seems likely that the fact that the segments only contain NULL can be deduced from information already tracked by SQL Server. Rowgroup elimination for IS NOT NULL may have not been added because it was thought to be too unlikely of a use case.

Filters on Multiple Columns

To state it simply, rowgroup elimination can work quite well with AND predicates on different columns. It will not work with OR predicates on different columns unless the query optimizer can simplify the expression to something that’s eligible for rowgroup elimination.

The following queries are all able to skip 99 rowgroups:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1 AND ID2 = 1;

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 2
AND ID2 BETWEEN 3 AND 4;

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 100000000
AND ID2 BETWEEN 1000001 AND 2000000;

This query skips all 100 rowgroups:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1 AND ID2 = 1000001;

The storage engine doesn’t take the union of rowgroups that could be relevant. It instead takes the intersection, so adding AND predicates won’t increase the number of segments scanned, unless perhaps if you do something very unreasonable. The following query scans one rowgroup as expected:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID BETWEEN 1 AND 100000000
AND ID2 BETWEEN 1000001 AND 2000000
AND ID > ID2;

The final part of the WHERE clause is implemented in a filter operator. The rest of the WHERE clause remains eligible for rowgroup elimination.

Now let’s try a simple query with an OR predicate:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1 OR ID2 = 1;

We might hope that the storage engine is able to deduce that only the first segment is relevant. Instead, rowgroup elimination isn’t even attempted. The predicate is implemented as a filter:

a10_FILTER

The only situation with OR filters that I’ve found to work with rowgroup elimination is when the optimizer can eliminate one of them. For example, the following query scans 5 segments because the optimizer is able to eliminate the condition on the ID2 column:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID IN (1, 5000000) OR ID2 BETWEEN 1 AND 0;

Joins

The query optimizer is able to copy predicates when filtering and joining on the same column. The copied predicates are eligible for rowgroup elimination. Consider the query below:

SELECT MAX(l1.ID)
FROM dbo.MILLIONAIRE_CCI l1
INNER JOIN dbo.MILLIONAIRE_CCI l2 ON l1.ID = l2.ID
WHERE l1.ID BETWEEN 1 AND 1000000;

Only two segments are read because the filter on ID can be applied to both sides of the join. The same behavior can be observed when forcing a merge join. Loop join is a bit different. As covered in the post on CCI string aggregation, rowgroup elimination does not occur on the inner side of a loop. Consider the following query:

SELECT MAX(l1.ID)
FROM dbo.MILLIONAIRE_CCI l1
INNER JOIN dbo.MILLIONAIRE_CCI l2 ON l1.ID = l2.ID
WHERE l1.ID BETWEEN 1 AND 1000
OPTION (LOOP JOIN, NO_PERFORMANCE_SPOOL);

The inner side is scanned 1000 times and the outer side is scanned once. The filter on ID allows all segments to be skipped besides one. So we should read 1001 segments and skip 1001 * 100 – 1001 = 99099 segments. This is what happens:

Table ‘MILLIONAIRE_CCI’. Segment reads 1001, segment skipped 99099.

More segments will be read depending on how many rowgroups the filter crosses. Suppose that we include rows with an ID that’s between 999501 and 1000500:

SELECT MAX(l1.ID)
FROM dbo.MILLIONAIRE_CCI l1
INNER JOIN dbo.MILLIONAIRE_CCI l2 ON l1.ID = l2.ID
WHERE l1.ID BETWEEN 999501 AND 1000500
OPTION (LOOP JOIN, NO_PERFORMANCE_SPOOL);

Now each scan on both the inner and outer side will need to read two segments:

Table ‘MILLIONAIRE_CCI’. Segment reads 2002, segment skipped 98098.

It’s possible to get rowgroup elimination even when filtering and joining on different columns. Consider the following query that joins on ID but filters on ID2:

SELECT MAX(l1.ID)
FROM dbo.MILLIONAIRE_CCI l1
INNER JOIN dbo.MILLIONAIRE_CCI l2 ON l1.ID = l2.ID
WHERE l1.ID2 BETWEEN 1 AND 1000000;

We still get rowgroup elimination against both sides of the join:

Table ‘MILLIONAIRE_CCI’. Segment reads 2, segment skipped 198.

The key is the optimized bitmap:

a10_opt_bitmap

That allows rowgroup elimination to happen on both sides. Bitmap optimization can only occur with hash joins, so queries written in this way that do a merge or loop join won’t be able to take advantage of rowgroup elimination against both tables.

Less Reasonable Queries

Below is a set of sometimes unreasonable queries to test some of the limits around rowgroup elimilation. It was surprising how often the queries remained eligible for rowgroup elimination. For example, local variables seem to cause no issues, even without PEO. The following query reads just one segment:

DECLARE @ID_FILTER BIGINT = 1;
SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = @ID_FILTER;

Data type conversions on the filtered expression don’t make the query ineligible for rowgroup elimination:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID = '1';

Casting on the filtered column is going to prevent rowgroup elimination. As will “optimizer tricks” like adding zero to the column:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID + 0 = 1;

We read all rowgroups:

Table ‘MILLIONAIRE_CCI’. Segment reads 100, segment skipped 0.

The query below is eligible for rowgroup elimination:

SELECT MAX(ID)
FROM dbo.MILLIONAIRE_CCI
WHERE ID <= CEILING(RAND());

Using scalar UDFs in queries is a terrible idea, but let’s create one for testing purposes:

CREATE OR ALTER FUNCTION dbo.CHEAP_UDF() RETURNS BIGINT
AS
BEGIN
	RETURN 1;
END;

As you might expect, the following query runs without parallelism and cannot skip any segments:

SELECT MAX(l.ID)
FROM dbo.MILLIONAIRE_CCI l
WHERE l.ID = dbo.CHEAP_UDF();

However, if we add SCHEMABINDING to the function definition then we get rowgroup elimination:

Table ‘MILLIONAIRE_CCI’. Segment reads 1, segment skipped 99.

The query below gets rowgroup elimination with and without SCHEMABINDING:

SELECT MAX(l.ID)
FROM dbo.MILLIONAIRE_CCI l
WHERE l.ID = (SELECT MAX(dbo.CHEAP_UDF()));

Query Rewrites for Better Rowgroup Elimination

In some cases it’s possible to rewrite queries to get better rowgroup elimination. This requires knowing your data and awareness of the rules around rowgroup elimination. Going back to an earlier example, the following query isn’t eligible for rowgroup elimination (without very convenient constraints):

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1 OR ID2 = 1;

It can be written to use UNION or UNION ALL. Here’s the UNION query:

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1

UNION 

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID2 = 1;

Now the storage engine skips 198 segments and only reads 2:

Table ‘MILLIONAIRE_CCI’. Segment reads 2, segment skipped 198.

In some cases it may be advantageous to avoid the sort. The query below has the same rowgroup elimination:

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID = 1

UNION ALL

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID2 = 1 AND ID <> 1;

Here’s the query plan:

a10_section_rewrite_UNION_ALL

Consider another query with a wide range of values in the IN clause, but filtered against a single column. The query below won’t be able to skip any rowgroups because we’re including the minimum and maximum value of ID in the query’s results:

SELECT *
FROM dbo.MILLIONAIRE_CCI
WHERE ID IN (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
, 100000000
);

It may be impractical to write out the query using UNION. Instead, we can essentially force a join by putting the filter values into a derived table. The optimizer is likely to move the IN list to a constant scan and do a hash join to the CCI. We can get rowgroup elimination through the bitmap filter on the hash join. Here’s one way to rewrite the query:

SELECT c.*
FROM
(
	VALUES (1), (2), (3), (4), (5)
	, (6), (7), (8), (9), (10)
	, (100000000)
) v(x)
INNER JOIN dbo.MILLIONAIRE_CCI c ON c.ID = v.x;

Here’s the plan:

a10_section_rewrite_hash

As expected, we only need to scan 2 rowgroups:

Table ‘MILLIONAIRE_CCI’. Segment reads 2, segment skipped 98.

SQL Server 2017 Changes

I ran all of the test queries against SQL Server 2017 RC2. I was not able to observe any differences. It may be that Microsoft did not choose to make improvements in this area, or any improvements were missed by my test cases.

Final Thoughts

Rowgroup elimination seems designed to reduce IO requirements for queries that filter against contiguous ranges against a column, like filtering against a single month of data from a table, or when joining to the CCI through a hash join. It’s possible to write queries for which rowgroup elimination does not occur, even though SQL Server in theory has all of the information that it would need to perform rowgroup elimination. From a practical point of the view, the biggest limitation is probably around OR logic. Thanks for reading!

Aggregate Pushdown Limitations

Aggregate pushdown is an optimization for aggregate queries against columnstore tables that was introduced in SQL Server 2016. Some aggregate computations can be pushed to the scan node instead of calculated in an aggregate node. I found the documentation to be a little light on details so I tried to find as many restrictions around the functionality as I could through testing.

Test Data

Most of my testing was done against a simple CCI with a single compressed rowgroup:

DROP TABLE IF EXISTS dbo.AP_1_RG;
CREATE TABLE dbo.AP_1_RG (
	ID1 bigint NULL,
	ID2 bigint NULL,
	AGG_COLUMN BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.AP_1_RG WITH (TABLOCK)
SELECT
t.RN % 8000
, t.RN % 8000
, 0
FROM
(
	SELECT TOP (1048576)
	ROW_NUMBER() OVER
		(ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

I found this table structure and data to be convenient for most of the test cases, but all tests can be reproduced with different data or a different table structure.

Restrictions without GROUP BY

Aggregate pushdown is supported both with and without a GROUP BY clause. I found it convenient to test those two cases separately. Below is a list of restrictions that I found or verified during testing.

Data Type of Aggregate Column

The documentation says:

Any datatype <= 64 bits is supported. For example, bigint is supported as its size is 8 bytes but decimal (38,6) is not because its size is 17 bytes. Also, no string types are supported.

However, this isn’t quite accurate. Float data types are not supported. numeric(10,0) is supported despite requiring 9 bytes for storage. Here’s a full table of results:

a9_data_type_table

* bit is not supported for aggregates in general
** not all data types were tested

I would summarize support as all date and time data types are supported except datetimeoffset. All exact numeric data types are supported if they are under 10 bytes. Approximate numerics, strings, and other data types are not supported.

Data Type Conversions

Aggregate pushdown does not appear to be supported if there is any data type conversion in any part of the aggregate expression. Both implicit and explicit data type conversions can cause issues, although it ultimately depends on the rules for data type precedence and if the query optimizer determines if a conversion is needed. For example, the following queries are eligible for pushdown:

SELECT MAX(ID1 + 1)
FROM dbo.AP_1_RG;

SELECT MAX(ID1 + CAST(1 AS BIGINT))
FROM dbo.AP_1_RG;

SELECT MAX(ID1 + CAST(1 AS INT))
FROM dbo.AP_1_RG;

SELECT SUM(ID1 + 1)
FROM dbo.AP_1_RG;

SELECT SUM(1 * ID1)
FROM dbo.AP_1_RG;

SELECT MAX(CAST(ID1 AS BIGINT))
FROM dbo.AP_1_RG;

However, the following queries are not:

SELECT MAX(CAST(ID1 AS INT))
FROM dbo.AP_1_RG;

SELECT SUM(1.5 * ID1)
FROM dbo.AP_1_RG;

SELECT SUM(ID1 + CAST(1 AS BIGINT))
FROM dbo.AP_1_RG;

Sometimes the compute scalar appears with the conversion even when we might not expect it, like for the last query:

a9_compute_scalar

Unsupported Operators

Division and modulus prevent aggregate pushdown even when they wouldn’t change the result or if there isn’t a data type conversion in the plan. There are likely other unsupported operators as well. The following queries are eligible for pushdown:

SELECT MAX(ID1 + 1)
FROM dbo.AP_1_RG;

SELECT MAX(ID1 - 1)
FROM dbo.AP_1_RG;

SELECT MAX(ID1 * 2)
FROM dbo.AP_1_RG;

The following queries are not eligible for pushdown:

SELECT MAX(ID1 / 1)
FROM dbo.AP_1_RG;

SELECT MAX(ID1 % 9999999999999)
FROM dbo.AP_1_RG;

Aggregate Cannot be Applied to Scan

The aggregate expression must be applied directly to the scan. If there’s a filter between the aggregate and the scan then it won’t work. A compute scalar node between the aggregate and the scan can be okay.

Filter expressions involving OR on different columns tend to be calculated as a filter. This means that the following query isn’t eligible for pushdown:

SELECT MIN(ID1)
FROM dbo.AP_1_RG
WHERE ID1 > 0 OR ID2 > 0;

It’s likely that this restriction will affect many queries with joins to other tables.

Local Variables Without PEO

Aggregate pushdown is not available if there is a local variable in the aggregate expression unless the optimizer is able to embed the literal parameter value in the query, such as with a RECOMPILE hint. For example, the first query is not eligible for pushdown but the second query is:

DECLARE @var BIGINT = 0;
-- no
SELECT MIN(ID1 + @var)
FROM dbo.AP_1_RG;

-- yes
SELECT MIN(ID1 + @var)
FROM dbo.AP_1_RG
OPTION (RECOMPILE);

Trivial Plans

Simple queries that use SUM, AVG, COUNT, or COUNT_BIG against very small tables may get a trivial plan. In SQL Server 2016 that trivial plan will not be eligible for batch mode so aggregate pushdown will not occur. Consider the following CCI with 10000 rows in a compressed rowgroup:

DROP TABLE IF EXISTS AGG_PUSHDOWN_FEW_ROWS;

CREATE TABLE dbo.AGG_PUSHDOWN_FEW_ROWS (
ID BIGINT NOT NULL,
INDEX CCI2 CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.AGG_PUSHDOWN_FEW_ROWS WITH (TABLOCK)
SELECT t.RN
FROM
(
	SELECT TOP (10000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

ALTER INDEX CCI2 ON AGG_PUSHDOWN_FEW_ROWS
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

With the default cost threshold for parallelism value of 5 I get a trivial plan:

a9_trivial_plan

If I decrease the CTFP value to 0 I get batch mode along with aggregate pushdown. As far as I can tell, queries with MAX or MIN do not have this issue.

Lack of Hash Match

For some queries the query optimizer may cost a stream aggregate as a cheaper alternative to a hash match aggregate. Aggregate pushdown is not available with a stream aggregate. If I truncate the AGG_PUSHDOWN_FEW_ROWS table and load 3002 rows into it I get a stream aggregate and no pushdown. With 3003 rows I get a hash match and pushdown. The tipping point depends on the aggregate function, MAXDOP, the data loaded into the table, and so on. The undocumented query hint QUERYRULEOFF GbAggToStrm can be used to encourage a hash match aggregate. If the aggregated column is a string this gets more complicated.

Non-trivial CASE statements

Trivial CASE statements inside the aggregate are still eligible for pushdown:

SELECT SUM(CASE WHEN 1 = 1 THEN ID1 ELSE ID2 END)
FROM dbo.AP_1_RG;

However, many other CASE statements are not. Here is one example:

SELECT SUM(CASE WHEN ID1 < ID2 THEN ID1 ELSE ID2 END)
FROM dbo.AP_1_RG;

Basic Restrictions

For completeness I’ll list a few more of the more obvious restrictions on pushdown. Many of these are documented by Microsoft. The CCI must contain at least one compressed rowgroup. Delta stores are not eligible for pushdown. There are only six aggregate functions supported: COUNT, COUNT_BIG, MIN, MAX, SUM, and AVG. COUNT (DISTINCT col_name) is not supported.

If a query does not get batch mode due to TF 9453 or for other reasons it will not get aggregate pushdown. Undocumented trace flag 9354 directly disables aggregate pushdown.

Restrictions with GROUP BY

Queries with a GROUP BY have many, if not, all of the same restrictions on the aggregate expressions. As far as I can tell there are no restrictions on the data type of the GROUP BY columns. More than one GROUP BY column is supported as well. However, there are a few restrictions which only apply to queries with a GROUP BY clause.

Non-direct Column References

The columns in the GROUP BY need to be columns. Adding scalars and other nonsense appears to make the query ineligible. The following queries are not eligible:

SELECT ID1 + 0, SUM(AGG_COLUMN)
FROM dbo.AP_1_RG
GROUP BY ID1 + 0;

SELECT ID1 + ID2, SUM(AGG_COLUMN)
FROM dbo.AP_1_RG
GROUP BY ID1 + ID2;

This one is eligible:

SELECT ID1, ID2, SUM(AGG_COLUMN)
FROM dbo.AP_1_RG
GROUP BY ID1, ID2;

Segment Not Compressed Enough?

A rowgroup appears to be ineligible for aggregate pushdown if the GROUP BY column has a segment size which is too large. This can be observed with the following test data:

DROP TABLE IF EXISTS AP_3_RG;
CREATE TABLE dbo.AP_3_RG (
ID1 bigint NULL,
AGG_COLUMN BIGINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.AP_3_RG WITH (TABLOCK)
SELECT
t.RN % 16000
, 0
FROM
(
	SELECT TOP (1 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

INSERT INTO dbo.AP_3_RG WITH (TABLOCK)
SELECT
t.RN % 17000
, 0
FROM
(
	SELECT TOP (1 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

INSERT INTO dbo.AP_3_RG WITH (TABLOCK)
SELECT
t.RN % 16000
, 0
FROM
(
	SELECT TOP (1 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

Rows from the second rowgroup are not locally aggregated for the following query:

SELECT ID1, SUM(AGG_COLUMN)
FROM AP_3_RG
GROUP BY ID1
OPTION (MAXDOP 1);

The following query has 2097152 locally aggregated rows which correspond to the first and third rowgroups. From the actual plan:

a9_local_agg

Poking around in the sys.column_store_segments and sys.column_store_dictionaries DMVs doesn’t reveal any interesting differences other than the rowgroup with more distinct values has a much larger size (2097736 bytes versus 128576 bytes ). We can go deeper with the undocumented DBCC CSINDEX:

DBCC TRACEON (3604);

DBCC CSINDEX (
7, -- DB_ID
72057613148028928, -- hobt_id
2, -- 1 + column_id
0, -- segment_id
1, -- 1 for segment
0 -- print option
);

Among other differences, for the 16000 distinct value segment we see:

Bitpack Data Header:

Bitpack Entry Size = 16
Bitpack Unit Count = 0
Bitpack MinId = 3
Bitpack DataSize = 0

But for the 17000 distinct value segment we see:

Bitpack Data Header:
Bitpack Entry Size = 16
Bitpack Unit Count = 262144
Bitpack MinId = 3
Bitpack DataSize = 2097152

Perhaps bitpack compressed data is not eligible for aggregate pushdown?

It’s important to note that segments are not compressed independently in SQL Server, despite the data being stored at a column level. For the AP_1_RG table we’re eligible for pushdown when aggregating by ID1 or ID2. However, if I truncate the table and change the data slightly:

TRUNCATE TABLE dbo.AP_1_RG;

INSERT INTO dbo.AP_1_RG WITH (TABLOCK)
SELECT
t.RN % 8000
, t.RN % 8001 -- was previously 8000
, 0
FROM
(
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

Now both columns are no longer eligible for aggregate pushdown. A REBUILD operation on the table does not help. Tricking SQL Server into assigning more memory to the columnstore compression also does not help.

Pushdown Surprises

During testing I was surprised by a few queries that supported aggregate pushdown. The most surprising was that the following query can get aggregate pushdown:

SELECT MAX(ID + ID2)
FROM dbo.AP_1_RG;

I have no idea how it works, but it does. For a few others, TABLESAMPLE does not prevent aggregate pushdown from happening. In addition, GROUP BY CUBE, GROUPING SETS, and ROLLUP are supported as well.

Changes with SQL Server 2017

I ran the same series of tests on SQL Server 2017 RC1. The only difference I observed was that I could no longer get a trivial plan without batch mode aggregation. This change was announced here by Microsoft.

Final Thoughts

As you can see, there are many undocumented restrictions around aggregate pushdown. These limits may be changed or go away as Microsoft continues to update SQL Server. Pushdown with GROUP BY is supported for some queries against some tables, but eligibility appears to be based on how the data is compressed, which cannot be predicted ahead of time. In my opinion, this makes it rather difficult to count on in practice. Thanks for reading!

Copying Rowstore Tables

Sometimes there’s a need to copy tables from one database to another. A large INSERT to copy all of the rows at once may not be desired or possible. For example, if the target database has a recovery model of full we may need to avoid filling the log or long rollbacks if the process needs to be canceled. If the database has a recovery model of simple and there’s a lot of other activity going on we may need to avoid filling the log due to a lengthy transaction. Minimal logging won’t help with that. There may be a desire to throttle each loop with a WAITFOR DELAY command, and so on.

Think of the tables as being copied by a background process. We want to copy them in chunks while efficiently using the server’s resources. It’s not a race to copy all of the data for a particular table as quickly as possible. Also, there many be many tables to move and they could have different structures for their clustered indexes.

Test Data

For the test table I deliberately picked a table far too large to fit into the buffer cache. This wasn’t hard because I configured SQL Server to have the minimum required memory of 1 GB. All tests were conducted with a recovery model of simple. To make things interesting, but not too interesting, I gave the SOURCE_TABLE a two column clustered index:

DROP TABLE IF EXISTS dbo.SOURCE_TABLE;
CREATE TABLE dbo.SOURCE_TABLE (
	ID1 BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PAGE_TURNER VARCHAR(170) NOT NULL,
	PRIMARY KEY (ID1, ID2)
);

Aiming for a target size of 20 GB, I inserted 100 million rows:

CREATE TABLE #t (ID BIGINT NOT NULL, PRIMARY KEY (ID));

INSERT INTO #t WITH (TABLOCK)
SELECT TOP (10000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN  master..spt_values t2;

INSERT INTO dbo.SOURCE_TABLE WITH (TABLOCK)
SELECT
  t1.ID
, t2.ID
, REPLICATE('Z', 170)
FROM #t t1
CROSS JOIN #t t2
ORDER BY t1.ID, t2.ID
OPTION (MAXDOP 1, NO_PERFORMANCE_SPOOL);

The table has exactly 20 GB worth of data pages!

a8_pages

Which of course works out to 2.5 million pages. As an aside, I didn’t want to use the temp table to do the data prep but couldn’t find a good way around it. The usual method of spt_values, TOP, and ROW_NUMBER() lead to a pretty large sort. I tried all kinds of tricks but couldn’t make it go away:

a8_unnecessary_sort

The TOP N sort for the outer result set is for the first column of the clustered index. The TOP N sort for the inner result set is for the second column of the clustered index. The final sort before the SELECT is for column 1 and column 2. Clearly, the sort isn’t needed because the data will be already sorted in that order coming out of the join. However, the query optimizer can be very stubborn in these situations.

The code samples in this post are designed to move any table that has a unique, clustered index. If you know something about the data in the table or the structure it may be possible to write more efficient code at the table level. Note that there is no attempted handling of concurrency. If the underlying data in the source table is changing then none of this code should be expected to work. The code can also handle heaps as long as you define a clustered index on them first.

To efficiently use server resources I decided that I wanted to loop in order of the clustered key. This should avoid unnecessary page splitting and lead to a cleaner result table. With the right recovery model, it should also take advantage of reduced logging for all of the inserts in SQL Server 2016. Finally, it would be nice to avoid sorting and excessive tempdb usage. Perhaps many of these jobs will run at once and we don’t want them to fail due to tempdb usage.

Straight Insert

The first thing that I tested was a single INSERT to be used as a benchmark for all of the different algorithms:

INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
SELECT *
FROM dbo.SOURCE_TABLE WITH (TABLOCK);

This isn’t a fair comparison because we can get better minimal logging for every row, avoid query plan compilations, and we’re simply doing less work overall. However, I think it’s still useful for context and to see how close we can get to the ideal case with no overhead. Here is a table of performance numbers from sys.dm_exec_sessions for our first test:

a8_table_1

Full Copy Temp Table

We need to loop over our source table and insert chunks of rows into the target table. One strategy to do this is to insert all of the clustered keys from the source table into a temp table with an IDENTITY column. This approach is easy to understand and the number of keys in the clustered index doesn’t make the SQL more complicated. Here’s one implementation:

DECLARE @total_rows_to_move BIGINT,
@batch_size INT = 1000000,
@batch_number INT = 1;

BEGIN

SET NOCOUNT ON;

DROP TABLE IF EXISTS #TARGET_TABLE_PKS;
CREATE TABLE #TARGET_TABLE_PKS (
	ID BIGINT NOT NULL IDENTITY (1, 1),
	ID1 BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO #TARGET_TABLE_PKS WITH (TABLOCK)
(ID1, ID2)
SELECT ID1, ID2
FROM dbo.SOURCE_TABLE WITH (TABLOCK)
ORDER BY ID1, ID2;

SET @total_rows_to_move = @@ROWCOUNT;

WHILE @batch_number <= CEILING(@total_rows_to_move / @batch_size)
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	INNER JOIN #TARGET_TABLE_PKS t
	ON s.ID1 = t.ID1 AND s.ID2 = t.ID2
	WHERE t.ID BETWEEN
		1 + @batch_size * (@batch_number - 1)
		AND @batch_size * @batch_number
	OPTION (RECOMPILE);

	SET @batch_number = @batch_number + 1;
END;

DROP TABLE #TARGET_TABLE_PKS;

END;

The RECOMPILE hint was included to avoid default estimates caused by the local variables. Even with that, our first attempt does not go so well. The code takes nearly 40 minutes to complete. All performance metrics are bad across the board:

a8_table_2

Looking at the query plans, SQL Server chose a merge join between the source table and the temp table:

a8_merge_join

The merge join won’t always scan every row from the hundred million row table. In fact, the first loop will scan the minimum required number of rows, one million. The next loop scans two million rows, the one after that three million, and so on. The scan can stop early but it always starts with the first row in the table. This strategy is very poorly suited for how we’re processing the table. Performance will be quadratic with the number of loops.

Fix The Join

For this code we know something that the query optimizer doesn’t. A loop join feels like a better choice than merge for this query due to how we’re processing data from the temp table. It’ll do a constant amount of work for each loop. One way to encourage a loop join is to lower the cardinality estimate for the temp table. In the code below I did this by adding a TOP operator and removing the RECOMPILE hint:

DECLARE @total_rows_to_move BIGINT,
@batch_size INT = 1000000,
@batch_number INT = 1;

BEGIN

SET NOCOUNT ON;

DROP TABLE IF EXISTS #TARGET_TABLE_PKS;
CREATE TABLE #TARGET_TABLE_PKS (
	ID BIGINT NOT NULL IDENTITY (1, 1),
	ID1 BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO #TARGET_TABLE_PKS WITH (TABLOCK)
(ID1, ID2)
SELECT ID1, ID2
FROM dbo.SOURCE_TABLE WITH (TABLOCK)
ORDER BY ID1, ID2;

SET @total_rows_to_move = @@ROWCOUNT;

WHILE @batch_number <= CEILING(@total_rows_to_move / @batch_size)
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	INNER JOIN (
		SELECT TOP (@batch_size) *
		FROM #TARGET_TABLE_PKS
		WHERE ID BETWEEN
			1 + @batch_size * (@batch_number - 1)
			AND @batch_size * @batch_number
		ORDER BY ID
	) t
	ON s.ID1 = t.ID1 AND s.ID2 = t.ID2;

	SET @batch_number = @batch_number + 1;
END;

DROP TABLE #TARGET_TABLE_PKS;

END;

Now we get a nested loop join because of the default estimate of 100 rows:

a8_loop_join

As a nice bonus, the unnecessary (from our point of view) sort operator goes away. The data already is in clustered key order due to how we built the temp table. The cardinality estimate for the insert is a bit low, but if the data is already sorted then why should it matter? We see better performance than before:

a8_table_3

With the old code we did 100 loops and read an average of 50.5 million rows from the source table. We also read all 100 million rows to build the temp table, so we read a total of 5.15 billion rows from the table. That adds up to a lot of physical reads. With this code we do 100 million index seeks but we only read a total of 200 million rows from the source table. That might be why the logical reads increased so much but physical reads are way down.

Can we improve the code further? Storing all of the clustered keys in a temp table feels like a bit much. It’s a lot of writes and the operation could fail if the table is too large. It would also be nice to not have to read all 100 million rows from the source table using joins.

Sampled Temp Table

There’s no need to store every row from the source table in the temp table. We can instead store a sample of rows and use that sample to build key ranges to perform clustered index range scans against. That should lead to a dramatic reduction in tempdb space usage and makes the joins to the source table unnecessary. One complication is that the SQL to do efficient range scans becomes more annoying to write as the number of clustered key columns increased. For a table with two clustered key columns we can do the following:

WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
)

I’m not doing this technique justice here, but the important part is that all of the filters are seek predicates instead of predicates:

a8_seek_predicates

Here’s the full code:

DECLARE @batch_size INT = 1000000,
@id1_start BIGINT,
@id2_start BIGINT;

BEGIN

SET NOCOUNT ON;

DROP TABLE IF EXISTS #TARGET_TABLE_SAMPLED_PKS;
CREATE TABLE #TARGET_TABLE_SAMPLED_PKS (
	ID1 BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID1, ID2)
);

INSERT INTO #TARGET_TABLE_SAMPLED_PKS
SELECT
  ID1
, ID2
FROM
(
	SELECT
	  ID1
	, ID2
	, ROW_NUMBER() OVER (ORDER BY ID1, ID2) RN
	FROM dbo.SOURCE_TABLE WITH (TABLOCK)
) t
WHERE RN % @batch_size = 1;

DECLARE cur CURSOR
LOCAL FAST_FORWARD
FOR
SELECT ID1, ID2
FROM #TARGET_TABLE_SAMPLED_PKS
ORDER BY ID1, ID2;

OPEN cur  

FETCH NEXT FROM cur
INTO @id1_start, @id2_start;

WHILE @@FETCH_STATUS = 0
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT TOP (@batch_size) s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start
	OR (s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY s.ID1, s.ID2;

	FETCH NEXT FROM cur
	INTO @id1_start, @id2_start;
END;

CLOSE cur;
DEALLOCATE cur;

DROP TABLE #TARGET_TABLE_SAMPLED_PKS;

END;

Performance improves yet again:

a8_table_4

However, why is the number of logical reads so high? We’ve eliminated the nested loop joins so this is an unexpected result.

Cardinality Estimates for Inserts

It turns out that the cardinality estimate for inserts can matter after all. DMLRequestSort for the insert operator is set to false. That’s bad when we’re inserting a million of rows at a time. I don’t know the details, but a bad cardinality estimate can cause the wrong internal APIs to be used for inserting the data and logging. We no longer need to reduce the cardinality estimate to get the plan that we want, so let’s try replacing the @batch_size variable with a hardcoded value of 1000000 for the TOP operator. After that change we see another big gain in performance:

a8_table_5

The table insert now has the DMLRequestSort property set to true.

No Temp Table

One issue with the code above is that we do a clustered index scan of the entire table at the beginning. Some of that data remains in the buffer cache but very little of it will be useful for the looping part of the procedure. If we can find a way to loop as we need to we may be able to take better advantage of the buffer cache. Also, why use a temp table when you don’t have to?

One way to accomplish this is with the OFFSET keyword introduced in SQL Server 2012. SQL Server always reads and throws away the number of rows specified in the OFFSET clause. It cannot smartly skip ahead in the index. To avoid some of the performance problems with OFFSET we need to use it with an anchor row. Here is an algorithm that uses OFFSET instead of a temp table:

DECLARE @batch_size INT = 1000000,
@id1_start BIGINT,
@id2_start BIGINT;

BEGIN

SET NOCOUNT ON;

SELECT @id1_start = ID1, @id2_start = ID2
FROM SOURCE_TABLE s WITH (TABLOCK)
ORDER BY ID1, ID2
	OFFSET 0 ROWS
	FETCH FIRST 1 ROW ONLY;

WHILE @id1_start IS NOT NULL
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT TOP (1000000) s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY s.ID1, s.ID2;

	SET @id1_start = NULL;
	SET @id2_start = NULL;

	SELECT @id1_start = ID1, @id2_start = ID2
	FROM SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY ID1, ID2
		OFFSET (@batch_size) ROWS
		FETCH FIRST 1 ROW ONLY;
END;

END;

OFFSET to get the row that starts the next key range to process. Performance is a bit better than before:

a8_table_6

However, the number of physical reads didn’t fall as much as it could have. The source table has 2.5 million data pages and the straight insert only needed 2.5 million physical reads to copy the data, which is hardly a coincidence.

Better Buffer Cache

There’s a subtle issue in the last algorithm. Think about the order of scans and inserts. For the Nth loop we scan one million rows from the source table using the anchor row that we found previously. Nearly all of those rows won’t be in the buffer cache so they will be physical reads. The insert happens at the same time and that certainly has an effect on what will remain in the buffer cache. Then we scan the same range again to find the next anchor row. It feels like it would be better to scan the upcoming range first, get the next anchor row, then perform the insert up to the anchor row. That way data that we want to remain in the buffer cache won’t be kicked out as much by the INSERT. The code to do this is a bit more complex:

DECLARE @batch_size INT = 1000000,
@id1_start BIGINT,
@id2_start BIGINT,
@id1_next BIGINT,
@id2_next BIGINT;

BEGIN

SET NOCOUNT ON;

SELECT @id1_start = ID1, @id2_start = ID2
FROM SOURCE_TABLE s WITH (TABLOCK)
ORDER BY ID1, ID2
	OFFSET 0 ROWS
	FETCH FIRST 1 ROW ONLY;

SELECT @id1_next = ID1, @id2_next = ID2
FROM SOURCE_TABLE s WITH (TABLOCK)
WHERE (
s.ID1 > @id1_start OR
(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
)
ORDER BY ID1, ID2
	OFFSET (@batch_size) ROWS
	FETCH FIRST 1 ROW ONLY;

WHILE @id1_start IS NOT NULL
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT TOP (1000000) s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY s.ID1, s.ID2;

	SET @id1_start = @id1_next;
	SET @id2_start = @id2_next;
	SET @id1_next = NULL;
	SET @id2_next = NULL;

	SELECT @id1_next = ID1, @id2_next = ID2
	FROM SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY ID1, ID2
		OFFSET (@batch_size) ROWS
		FETCH FIRST 1 ROW ONLY;
END;

END;

However, performance improved yet again. It’s almost like I planned this blog post out in advance!

a8_table_7

The number of physical reads is now very close to the number required by the straight insert. Note that we could have also reduced the physical read count of the previous algorithm by lowering the batch size.

The Single Source Scan Method

What if we don’t want to read the source table twice? After all, the straight insert only needs to read the source table once. We can’t let SQL Server outsmart us that much. The key to this next algorithm is that the target table starts out empty. Since we’re inserting in clustered key order, getting the next anchor for a loop is as simple as selecting the last clustered key from the target table and changing the predicate a little bit:

DECLARE @batch_size INT = 1000000,
@RC int,
@id1_start BIGINT,
@id2_start BIGINT;

BEGIN

SET NOCOUNT ON;

INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
SELECT TOP (1000000) s.*
FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
ORDER BY s.ID1, s.ID2;

set @RC = @@ROWCOUNT;

SELECT TOP (1) @id1_start = ID1, @id2_start = ID2
FROM TARGET_TABLE s WITH (TABLOCK)
ORDER BY ID1 DESC, ID2 DESC;

WHILE @RC = @batch_size
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT TOP (1000000) s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 > @id2_start)
	)
	ORDER BY s.ID1, s.ID2;

	set @RC = @@ROWCOUNT;

	SELECT TOP (1) @id1_start = ID1, @id2_start = ID2
	FROM TARGET_TABLE s WITH (TABLOCK)
	ORDER BY ID1 DESC, ID2 DESC;
END;

END;

We see a minor improvement in the number of logical reads as expected:

a8_table_8

This was the best algorithm that I was to come up with.

A Few Bad Ideas

I played around with a few other methods of moving the data and feel the need to caution the reader against them. One idea is to define an AFTER INSERT trigger on the target table and to use the special inserted table to save off the value of the last clustered key inserted into that table. This appeared to just be slower than the single scan method and who wants to use triggers when they can be avoided?

A standard cursor performs extremely poorly because it operators on one row at a time. There’s an impressively poorly-documented alternative called API cursors which can process more than one row at a time. After some struggling I was able to use API cursors to copy data from one table to the other, but there was an intermediate step that loaded data from the cursor into a hidden temp table. Also, the cardinality estimate from the insert came from an EXECUTE and was a single row. Performance was very poor.

Final Thoughts

The point of this post wasn’t to insist that one method of moving data is better than all others. In fact, many different algorithms will be likely be suitable depending on the table structure, table size, and time requirements. However, it’s important to know that seemingly minor changes can lead to large differences in performance when moving around data that cannot fit in the buffer pool. These performance problems can become magnified when moving large tables or lots of tables. Thanks for reading!