Copying Rowstore Tables

Sometimes there’s a need to copy tables from one database to another. A large INSERT to copy all of the rows at once may not be desired or possible. For example, if the target database has a recovery model of full we may need to avoid filling the log or long rollbacks if the process needs to be canceled. If the database has a recovery model of simple and there’s a lot of other activity going on we may need to avoid filling the log due to a lengthy transaction. Minimal logging won’t help with that. There may be a desire to throttle each loop with a WAITFOR DELAY command, and so on.

Think of the tables as being copied by a background process. We want to copy them in chunks while efficiently using the server’s resources. It’s not a race to copy all of the data for a particular table as quickly as possible. Also, there many be many tables to move and they could have different structures for their clustered indexes.

Test Data

For the test table I deliberately picked a table far too large to fit into the buffer cache. This wasn’t hard because I configured SQL Server to have the minimum required memory of 1 GB. All tests were conducted with a recovery model of simple. To make things interesting, but not too interesting, I gave the SOURCE_TABLE a two column clustered index:

DROP TABLE IF EXISTS dbo.SOURCE_TABLE;
CREATE TABLE dbo.SOURCE_TABLE (
	ID1 BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PAGE_TURNER VARCHAR(170) NOT NULL,
	PRIMARY KEY (ID1, ID2)
);

Aiming for a target size of 20 GB, I inserted 100 million rows:

CREATE TABLE #t (ID BIGINT NOT NULL, PRIMARY KEY (ID));

INSERT INTO #t WITH (TABLOCK)
SELECT TOP (10000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN  master..spt_values t2;

INSERT INTO dbo.SOURCE_TABLE WITH (TABLOCK)
SELECT
  t1.ID
, t2.ID
, REPLICATE('Z', 170)
FROM #t t1
CROSS JOIN #t t2
ORDER BY t1.ID, t2.ID
OPTION (MAXDOP 1, NO_PERFORMANCE_SPOOL);

The table has exactly 20 GB worth of data pages!

a8_pages

Which of course works out to 2.5 million pages. As an aside, I didn’t want to use the temp table to do the data prep but couldn’t find a good way around it. The usual method of spt_values, TOP, and ROW_NUMBER() lead to a pretty large sort. I tried all kinds of tricks but couldn’t make it go away:

a8_unnecessary_sort

The TOP N sort for the outer result set is for the first column of the clustered index. The TOP N sort for the inner result set is for the second column of the clustered index. The final sort before the SELECT is for column 1 and column 2. Clearly, the sort isn’t needed because the data will be already sorted in that order coming out of the join. However, the query optimizer can be very stubborn in these situations.

The code samples in this post are designed to move any table that has a unique, clustered index. If you know something about the data in the table or the structure it may be possible to write more efficient code at the table level. Note that there is no attempted handling of concurrency. If the underlying data in the source table is changing then none of this code should be expected to work. The code can also handle heaps as long as you define a clustered index on them first.

To efficiently use server resources I decided that I wanted to loop in order of the clustered key. This should avoid unnecessary page splitting and lead to a cleaner result table. With the right recovery model, it should also take advantage of reduced logging for all of the inserts in SQL Server 2016. Finally, it would be nice to avoid sorting and excessive tempdb usage. Perhaps many of these jobs will run at once and we don’t want them to fail due to tempdb usage.

Straight Insert

The first thing that I tested was a single INSERT to be used as a benchmark for all of the different algorithms:

INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
SELECT *
FROM dbo.SOURCE_TABLE WITH (TABLOCK);

This isn’t a fair comparison because we can get better minimal logging for every row, avoid query plan compilations, and we’re simply doing less work overall. However, I think it’s still useful for context and to see how close we can get to the ideal case with no overhead. Here is a table of performance numbers from sys.dm_exec_sessions for our first test:

a8_table_1

Full Copy Temp Table

We need to loop over our source table and insert chunks of rows into the target table. One strategy to do this is to insert all of the clustered keys from the source table into a temp table with an IDENTITY column. This approach is easy to understand and the number of keys in the clustered index doesn’t make the SQL more complicated. Here’s one implementation:

DECLARE @total_rows_to_move BIGINT,
@batch_size INT = 1000000,
@batch_number INT = 1;

BEGIN

SET NOCOUNT ON;

DROP TABLE IF EXISTS #TARGET_TABLE_PKS;
CREATE TABLE #TARGET_TABLE_PKS (
	ID BIGINT NOT NULL IDENTITY (1, 1),
	ID1 BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO #TARGET_TABLE_PKS WITH (TABLOCK)
(ID1, ID2)
SELECT ID1, ID2
FROM dbo.SOURCE_TABLE WITH (TABLOCK)
ORDER BY ID1, ID2;

SET @total_rows_to_move = @@ROWCOUNT;

WHILE @batch_number <= CEILING(@total_rows_to_move / @batch_size)
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	INNER JOIN #TARGET_TABLE_PKS t
	ON s.ID1 = t.ID1 AND s.ID2 = t.ID2
	WHERE t.ID BETWEEN
		1 + @batch_size * (@batch_number - 1)
		AND @batch_size * @batch_number
	OPTION (RECOMPILE);

	SET @batch_number = @batch_number + 1;
END;

DROP TABLE #TARGET_TABLE_PKS;

END;

The RECOMPILE hint was included to avoid default estimates caused by the local variables. Even with that, our first attempt does not go so well. The code takes nearly 40 minutes to complete. All performance metrics are bad across the board:

a8_table_2

Looking at the query plans, SQL Server chose a merge join between the source table and the temp table:

a8_merge_join

The merge join won’t always scan every row from the hundred million row table. In fact, the first loop will scan the minimum required number of rows, one million. The next loop scans two million rows, the one after that three million, and so on. The scan can stop early but it always starts with the first row in the table. This strategy is very poorly suited for how we’re processing the table. Performance will be quadratic with the number of loops.

Fix The Join

For this code we know something that the query optimizer doesn’t. A loop join feels like a better choice than merge for this query due to how we’re processing data from the temp table. It’ll do a constant amount of work for each loop. One way to encourage a loop join is to lower the cardinality estimate for the temp table. In the code below I did this by adding a TOP operator and removing the RECOMPILE hint:

DECLARE @total_rows_to_move BIGINT,
@batch_size INT = 1000000,
@batch_number INT = 1;

BEGIN

SET NOCOUNT ON;

DROP TABLE IF EXISTS #TARGET_TABLE_PKS;
CREATE TABLE #TARGET_TABLE_PKS (
	ID BIGINT NOT NULL IDENTITY (1, 1),
	ID1 BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO #TARGET_TABLE_PKS WITH (TABLOCK)
(ID1, ID2)
SELECT ID1, ID2
FROM dbo.SOURCE_TABLE WITH (TABLOCK)
ORDER BY ID1, ID2;

SET @total_rows_to_move = @@ROWCOUNT;

WHILE @batch_number <= CEILING(@total_rows_to_move / @batch_size)
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	INNER JOIN (
		SELECT TOP (@batch_size) *
		FROM #TARGET_TABLE_PKS
		WHERE ID BETWEEN
			1 + @batch_size * (@batch_number - 1)
			AND @batch_size * @batch_number
		ORDER BY ID
	) t
	ON s.ID1 = t.ID1 AND s.ID2 = t.ID2;

	SET @batch_number = @batch_number + 1;
END;

DROP TABLE #TARGET_TABLE_PKS;

END;

Now we get a nested loop join because of the default estimate of 100 rows:

a8_loop_join

As a nice bonus, the unnecessary (from our point of view) sort operator goes away. The data already is in clustered key order due to how we built the temp table. The cardinality estimate for the insert is a bit low, but if the data is already sorted then why should it matter? We see better performance than before:

a8_table_3

With the old code we did 100 loops and read an average of 50.5 million rows from the source table. We also read all 100 million rows to build the temp table, so we read a total of 5.15 billion rows from the table. That adds up to a lot of physical reads. With this code we do 100 million index seeks but we only read a total of 200 million rows from the source table. That might be why the logical reads increased so much but physical reads are way down.

Can we improve the code further? Storing all of the clustered keys in a temp table feels like a bit much. It’s a lot of writes and the operation could fail if the table is too large. It would also be nice to not have to read all 100 million rows from the source table using joins.

Sampled Temp Table

There’s no need to store every row from the source table in the temp table. We can instead store a sample of rows and use that sample to build key ranges to perform clustered index range scans against. That should lead to a dramatic reduction in tempdb space usage and makes the joins to the source table unnecessary. One complication is that the SQL to do efficient range scans becomes more annoying to write as the number of clustered key columns increased. For a table with two clustered key columns we can do the following:

WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
)

I’m not doing this technique justice here, but the important part is that all of the filters are seek predicates instead of predicates:

a8_seek_predicates

Here’s the full code:

DECLARE @batch_size INT = 1000000,
@id1_start BIGINT,
@id2_start BIGINT;

BEGIN

SET NOCOUNT ON;

DROP TABLE IF EXISTS #TARGET_TABLE_SAMPLED_PKS;
CREATE TABLE #TARGET_TABLE_SAMPLED_PKS (
	ID1 BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID1, ID2)
);

INSERT INTO #TARGET_TABLE_SAMPLED_PKS
SELECT
  ID1
, ID2
FROM
(
	SELECT
	  ID1
	, ID2
	, ROW_NUMBER() OVER (ORDER BY ID1, ID2) RN
	FROM dbo.SOURCE_TABLE WITH (TABLOCK)
) t
WHERE RN % @batch_size = 1;

DECLARE cur CURSOR
LOCAL FAST_FORWARD
FOR
SELECT ID1, ID2
FROM #TARGET_TABLE_SAMPLED_PKS
ORDER BY ID1, ID2;

OPEN cur  

FETCH NEXT FROM cur
INTO @id1_start, @id2_start;

WHILE @@FETCH_STATUS = 0
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT TOP (@batch_size) s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start
	OR (s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY s.ID1, s.ID2;

	FETCH NEXT FROM cur
	INTO @id1_start, @id2_start;
END;

CLOSE cur;
DEALLOCATE cur;

DROP TABLE #TARGET_TABLE_SAMPLED_PKS;

END;

Performance improves yet again:

a8_table_4

However, why is the number of logical reads so high? We’ve eliminated the nested loop joins so this is an unexpected result.

Cardinality Estimates for Inserts

It turns out that the cardinality estimate for inserts can matter after all. DMLRequestSort for the insert operator is set to false. That’s bad when we’re inserting a million of rows at a time. I don’t know the details, but a bad cardinality estimate can cause the wrong internal APIs to be used for inserting the data and logging. We no longer need to reduce the cardinality estimate to get the plan that we want, so let’s try replacing the @batch_size variable with a hardcoded value of 1000000 for the TOP operator. After that change we see another big gain in performance:

a8_table_5

The table insert now has the DMLRequestSort property set to true.

No Temp Table

One issue with the code above is that we do a clustered index scan of the entire table at the beginning. Some of that data remains in the buffer cache but very little of it will be useful for the looping part of the procedure. If we can find a way to loop as we need to we may be able to take better advantage of the buffer cache. Also, why use a temp table when you don’t have to?

One way to accomplish this is with the OFFSET keyword introduced in SQL Server 2012. SQL Server always reads and throws away the number of rows specified in the OFFSET clause. It cannot smartly skip ahead in the index. To avoid some of the performance problems with OFFSET we need to use it with an anchor row. Here is an algorithm that uses OFFSET instead of a temp table:

DECLARE @batch_size INT = 1000000,
@id1_start BIGINT,
@id2_start BIGINT;

BEGIN

SET NOCOUNT ON;

SELECT @id1_start = ID1, @id2_start = ID2
FROM SOURCE_TABLE s WITH (TABLOCK)
ORDER BY ID1, ID2
	OFFSET 0 ROWS
	FETCH FIRST 1 ROW ONLY;

WHILE @id1_start IS NOT NULL
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT TOP (1000000) s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY s.ID1, s.ID2;

	SET @id1_start = NULL;
	SET @id2_start = NULL;

	SELECT @id1_start = ID1, @id2_start = ID2
	FROM SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY ID1, ID2
		OFFSET (@batch_size) ROWS
		FETCH FIRST 1 ROW ONLY;
END;

END;

OFFSET to get the row that starts the next key range to process. Performance is a bit better than before:

a8_table_6

However, the number of physical reads didn’t fall as much as it could have. The source table has 2.5 million data pages and the straight insert only needed 2.5 million physical reads to copy the data, which is hardly a coincidence.

Better Buffer Cache

There’s a subtle issue in the last algorithm. Think about the order of scans and inserts. For the Nth loop we scan one million rows from the source table using the anchor row that we found previously. Nearly all of those rows won’t be in the buffer cache so they will be physical reads. The insert happens at the same time and that certainly has an effect on what will remain in the buffer cache. Then we scan the same range again to find the next anchor row. It feels like it would be better to scan the upcoming range first, get the next anchor row, then perform the insert up to the anchor row. That way data that we want to remain in the buffer cache won’t be kicked out as much by the INSERT. The code to do this is a bit more complex:

DECLARE @batch_size INT = 1000000,
@id1_start BIGINT,
@id2_start BIGINT,
@id1_next BIGINT,
@id2_next BIGINT;

BEGIN

SET NOCOUNT ON;

SELECT @id1_start = ID1, @id2_start = ID2
FROM SOURCE_TABLE s WITH (TABLOCK)
ORDER BY ID1, ID2
	OFFSET 0 ROWS
	FETCH FIRST 1 ROW ONLY;

SELECT @id1_next = ID1, @id2_next = ID2
FROM SOURCE_TABLE s WITH (TABLOCK)
WHERE (
s.ID1 > @id1_start OR
(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
)
ORDER BY ID1, ID2
	OFFSET (@batch_size) ROWS
	FETCH FIRST 1 ROW ONLY;

WHILE @id1_start IS NOT NULL
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT TOP (1000000) s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY s.ID1, s.ID2;

	SET @id1_start = @id1_next;
	SET @id2_start = @id2_next;
	SET @id1_next = NULL;
	SET @id2_next = NULL;

	SELECT @id1_next = ID1, @id2_next = ID2
	FROM SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 >= @id2_start)
	)
	ORDER BY ID1, ID2
		OFFSET (@batch_size) ROWS
		FETCH FIRST 1 ROW ONLY;
END;

END;

However, performance improved yet again. It’s almost like I planned this blog post out in advance!

a8_table_7

The number of physical reads is now very close to the number required by the straight insert. Note that we could have also reduced the physical read count of the previous algorithm by lowering the batch size.

The Single Source Scan Method

What if we don’t want to read the source table twice? After all, the straight insert only needs to read the source table once. We can’t let SQL Server outsmart us that much. The key to this next algorithm is that the target table starts out empty. Since we’re inserting in clustered key order, getting the next anchor for a loop is as simple as selecting the last clustered key from the target table and changing the predicate a little bit:

DECLARE @batch_size INT = 1000000,
@RC int,
@id1_start BIGINT,
@id2_start BIGINT;

BEGIN

SET NOCOUNT ON;

INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
SELECT TOP (1000000) s.*
FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
ORDER BY s.ID1, s.ID2;

set @RC = @@ROWCOUNT;

SELECT TOP (1) @id1_start = ID1, @id2_start = ID2
FROM TARGET_TABLE s WITH (TABLOCK)
ORDER BY ID1 DESC, ID2 DESC;

WHILE @RC = @batch_size
BEGIN
	INSERT INTO dbo.TARGET_TABLE WITH (TABLOCK)
	SELECT TOP (1000000) s.*
	FROM dbo.SOURCE_TABLE s WITH (TABLOCK)
	WHERE (
	s.ID1 > @id1_start OR
	(s.ID1 = @id1_start AND s.ID2 > @id2_start)
	)
	ORDER BY s.ID1, s.ID2;

	set @RC = @@ROWCOUNT;

	SELECT TOP (1) @id1_start = ID1, @id2_start = ID2
	FROM TARGET_TABLE s WITH (TABLOCK)
	ORDER BY ID1 DESC, ID2 DESC;
END;

END;

We see a minor improvement in the number of logical reads as expected:

a8_table_8

This was the best algorithm that I was to come up with.

A Few Bad Ideas

I played around with a few other methods of moving the data and feel the need to caution the reader against them. One idea is to define an AFTER INSERT trigger on the target table and to use the special inserted table to save off the value of the last clustered key inserted into that table. This appeared to just be slower than the single scan method and who wants to use triggers when they can be avoided?

A standard cursor performs extremely poorly because it operators on one row at a time. There’s an impressively poorly-documented alternative called API cursors which can process more than one row at a time. After some struggling I was able to use API cursors to copy data from one table to the other, but there was an intermediate step that loaded data from the cursor into a hidden temp table. Also, the cardinality estimate from the insert came from an EXECUTE and was a single row. Performance was very poor.

Final Thoughts

The point of this post wasn’t to insist that one method of moving data is better than all others. In fact, many different algorithms will be likely be suitable depending on the table structure, table size, and time requirements. However, it’s important to know that seemingly minor changes can lead to large differences in performance when moving around data that cannot fit in the buffer pool. These performance problems can become magnified when moving large tables or lots of tables. Thanks for reading!

CCI Partitioning Part 1: Rowgroup Elimination Fragmentation

This is part 1 of a series on columnstore index partitioning. The focus of this post is on rowgroup elimination.

Defining Fragmentation for CCIs

There are many different ways that a CCI can be fragmented. Microsoft considers having multiple delta rowgroups, deleted rows in compressed rowgroups, and compressed rowgroups less than the maximum size all as signs of fragmentation. I’m going to propose yet another way. A CCI can also be considered fragmented if the segments for a column that is very important for rowgroup elimination are packed in such a way that limits rowgroup elimination. That sentence was a bit much but the following table examples should illustrate the point.

The Test Data

First we’ll create a single column CCI with integers from 1 to 104857600. I’ve written the query so that the integers will be loaded in order. This is more commonly done using a loop and I’m relying slightly on undocumented behavior, but it worked fine in a few tests that I did:

DROP TABLE IF EXISTS dbo.LUCKY_CCI;

CREATE TABLE dbo.LUCKY_CCI (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.LUCKY_CCI WITH (TABLOCK)
SELECT TOP (100 * 1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
ORDER BY RN
OPTION (MAXDOP 1);

On my machine, the table takes less than 40 seconds to populate. We can see that the minimum and maximum IDs for each rowgroup are nice and neat from the sys.column_store_segments DMV:

a7_lucky

This table does not have any rowgroup elimination fragmentation. The segments are optimally constructed from a rowgroup elimination point of view. SQL Server will be able to skip every rowgroup except for one or two if I filter on a sequential range of one million integers:

SELECT COUNT(*)
FROM dbo.LUCKY_CCI
WHERE ID BETWEEN 12345678 AND 13345677;

STATISTICS IO output:

Table ‘LUCKY_CCI’. Segment reads 2, segment skipped 98.

Erik Darling was the inspiration for this next table, since he can never manage to get rowgroup elimination in his queries. The code below is designed to create segments that allow for pretty much no rowgroup elimination whatsoever:

DROP TABLE IF EXISTS dbo.UNLUCKY_CCI;

CREATE TABLE dbo.UNLUCKY_CCI (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.UNLUCKY_CCI WITH (TABLOCK)
SELECT rg.RN
FROM
(
	SELECT TOP (100) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
) driver
CROSS APPLY (
	SELECT driver.RN

	UNION ALL

	SELECT 104857601 - driver.RN

	UNION ALL

	SELECT TOP (1048574) 100 + 1048574 * (driver.RN - 1)
		+ ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
	FROM master..spt_values t2
	CROSS JOIN master..spt_values t3
) rg
ORDER BY driver.RN
OPTION (MAXDOP 1, NO_PERFORMANCE_SPOOL);

Each rowgroup will contain an ID near the minimum (1 – 100) and an ID near the maximum (104857301 – 104857400). We can see this with the same DMV as before:

a7_unlucky

This table has pretty much the worst possible rowgroup elimination fragmentation. SQL Server will not be able to skip any rowgroups when querying almost any sequential range of integers:

SELECT COUNT(*)
FROM dbo.UNLUCKY_CCI
WHERE ID BETWEEN 12345678 AND 13345677;

STATISTICS IO output:

Table ‘UNLUCKY_CCI’. Segment reads 100, segment skipped 0.

Measuring Rowgroup Elimination Fragmentation

It might be useful to develop a more granular measurement for rowgroup elimination instead of just “best” and “worst”. One way to approach this problem is to define a batch size in rows for a filter against the column and to estimate how many total table scans would need to be done if the entire table was read in a series of SELECT queries with each processing a batch of rows. For example, with a batch size of 1048576 query, 1 would include the IDs from 1 to 1048576, query 2 would include IDs from 1 + 1048576 to 1048576 * 2, and so on. I will call this number the rowgroup elimination fragmentation factor, or REFF for short. It roughly represents how many extra rowgroups are read that could have been skipped with less fragmentation. Somewhat-tested code to calculate the REFF is below:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

DECLARE @batch_size INT = 1048576;
DECLARE @global_min_id BIGINT;
DECLARE @global_max_id BIGINT;
DECLARE @segments BIGINT;

DROP TABLE IF EXISTS #MIN_AND_MAX_IDS;
CREATE TABLE #MIN_AND_MAX_IDS (
	min_data_id BIGINT,
	max_data_id BIGINT
);
CREATE CLUSTERED INDEX c ON #MIN_AND_MAX_IDS
(min_data_id, max_data_id);

INSERT INTO #MIN_AND_MAX_IDS
SELECT css.min_data_id, css.max_data_id
FROM sys.objects o
INNER JOIN sys.columns c ON o.object_id = c.object_id
INNER JOIN sys.partitions p ON o.object_id = p.object_id
INNER JOIN sys.column_store_segments css
	ON p.hobt_id = css.hobt_id
	AND css.column_id = c.column_id
INNER JOIN sys.dm_db_column_store_row_group_physical_stats s
	ON o.object_id = s.object_id
	AND css.segment_id = s.row_group_id
	AND s.partition_number = p.partition_number
WHERE o.name = 'LUCKY_CCI'
AND c.name = 'ID'
AND s.[state] = 3;

SET @segments = @@ROWCOUNT;

SELECT
  @global_min_id = MIN(min_data_id)
, @global_max_id = MAX(max_data_id)
FROM #MIN_AND_MAX_IDS;

DROP TABLE IF EXISTS #BUCKET_RANGES;
CREATE TABLE #BUCKET_RANGES (
	min_data_id BIGINT,
	max_data_id BIGINT
);

 -- allows up to 6.4 million pieces
INSERT INTO #BUCKET_RANGES
SELECT
  @global_min_id + (RN - 1) * @batch_size
, @global_min_id + RN * @batch_size - 1
FROM
(
	SELECT ROW_NUMBER() OVER
	(ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
WHERE -- number of buckets
t.RN <= CEILING((1.0 * @global_max_id
	- (@global_min_id)) / @batch_size);

 -- number of total full scans
SELECT 1.0 * SUM(cnt) / @segments
FROM (
	SELECT COUNT(*) cnt
	FROM #BUCKET_RANGES br
	LEFT OUTER JOIN #MIN_AND_MAX_IDS m
		ON m.min_data_id <= br.max_data_id
 		AND m.max_data_id >= br.min_data_id
	GROUP BY br.min_data_id
) t;

For the LUCKY_CCI table, if we pick a batch size of 1048576 we get a REFF of 1.0. This is because each SELECT query would read exactly 1 rowgroup and skip the other 99 rowgroups. Conversely, the UNLUCKY_CCI table has a REFF of 100.0. This is because every SELECT query would need to read all 100 rowgroups just to return 1048576 rows. In other words, each query does a full scan of the table and there are 100 rowgroups in the table.

What about DUIs?

Suppose our lucky table runs out of luck and starts getting DUIs. As rows are deleted and reinserted they will be packed into new rowgroups. We aren’t likely to keep our perfect rowgroups for every long. Below is code that randomly deletes and inserts a range of 100k integers. There is some bias in which numbers are processed but that’s ok for this type of testing. I ran the code with 600 loops which means that up to 60% of the rows in the table could have been moved around. There’s nothing special about 60%. That just represents my patience with the process.

DECLARE @target_loops INT = 600,
@batch_size INT = 100001,
@current_loop INT = 1,
@middle_num BIGINT,
@rows_in_target BIGINT;

BEGIN

SET NOCOUNT ON;

SELECT @rows_in_target = COUNT(*)
FROM dbo.LUCKY_CCI;

DROP TABLE IF EXISTS #num;
CREATE TABLE #num (ID INT NOT NULL);

INSERT INTO #num WITH (TABLOCK)
SELECT TOP (@batch_size) -1 + ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

WHILE @current_loop <= @target_loops
BEGIN
	SET @middle_num = CEILING(@rows_in_target * RAND());

	DELETE tgt WITH (TABLOCK)
 	FROM dbo.LUCKY_CCI tgt
 	WHERE tgt.ID BETWEEN
 	CAST(@middle_num - FLOOR(0.5*@batch_size) AS BIGINT)
	AND
	CAST(@middle_num + FLOOR(0.5*@batch_size) AS BIGINT);

 	INSERT INTO dbo.LUCKY_CCI WITH (TABLOCK)
 	SELECT @middle_num + n.ID - FLOOR(0.5 * @batch_size)
	FROM #num n
 	WHERE n.ID BETWEEN 1 AND @rows_in_target
	OPTION (MAXDOP 1);

	SET @current_loop = @current_loop + 1;
END;

END;

Our table now has a bunch of uncompressed rowgroups and many compressed rowgroups with lots of deleted rows. Let’s clean it up with a REORG:

ALTER INDEX CCI ON dbo.LUCKY_CCI
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

This table has a REFF of about 41 for a batch size of 1048576. That means that we can expect to see significantly worse rowgroup elimination than before. Running the same SELECT query as before:

Table ‘LUCKY_CCI’. Segment reads 51, segment skipped 75.

The LUCKY_CCI table clearly needs to be renamed.

The Magic of Partitioning

A partition for a CCI is a way of organizing rowgroups. Let’s create the same table but with partitions that can hold five million integers. Below is code to do that for those following along at home:

CREATE PARTITION FUNCTION functioning_part_function
(BIGINT)
AS RANGE RIGHT
FOR VALUES (
  0
, 5000000
, 10000000
, 15000000
, 20000000
, 25000000
, 30000000
, 35000000
, 40000000
, 45000000
, 50000000
, 55000000
, 60000000
, 65000000
, 70000000
, 75000000
, 80000000
, 85000000
, 90000000
, 95000000
, 100000000
, 105000000
); 

CREATE PARTITION SCHEME scheming_part_scheme
AS PARTITION functioning_part_function
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS dbo.PART_CCI;

CREATE TABLE dbo.PART_CCI (
ID BIGINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
) ON scheming_part_scheme(ID);

INSERT INTO dbo.PART_CCI WITH (TABLOCK)
SELECT TOP (100 * 1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
ORDER BY RN
OPTION (MAXDOP 1);

5 million was a somewhat arbitrary choice. The important thing is that it’s above 1048576 rows. Generally speaking, picking a partition size below 1048576 rows is a bad idea because the table’s rowgroups will never be able to reach the maximum size.

Using the sys.dm_db_column_store_row_group_physical_stats dmv we can see that rowgroups were divided among partitions as expected:

a7_part

This table has a REFF of 1.9 for a batch size of 1048576. This is perfectly reasonable because the partition size was defined as 5 * 1000000 instead of 5 * 1048576 which would be needed for a REFF of 1.0.

Now I’ll run the same code as before to do 60 million DUIs against the table. I expect the partitions to limit the damage because row cannot move out of their partitions. After the process finishes and we do a REORG, the table has a REFF of about 3.7. The rowgroups are definitely a bit messier:

a7_part_after_DUI

Running our SELECT query one last time:

SELECT COUNT(*)
FROM dbo.Part_CCI
WHERE ID BETWEEN 12345678 AND 13345677;

We see the following numbers for rowgroup elimination:

Table ‘PART_CCI’. Segment reads 4, segment skipped 3.

Only four segments are read. Many segments are skipped, including all of those in partitions not relevant to the query.

Final Thoughts

Large, unpartitioned CCIs will eventually run into issues with rowgroup elimination fragmentation if the loading process can delete data. A maintainance operation that rebuilds the table with a clustered rowstore index and rebuilds the CCI with MAXDOP = 1 is one way to restore nicely organized rowgroups, but that is a never-ending battle which grows more expensive as the table gets larger. Picking the right partitioning function can guarantee rowgroup elimination fragmentation. Thanks for reading!

Columnstore Parallel Scan Row Distribution

Parallel rowstore scans are parallel “aware”. This makes them unlike most other operators which work independently on different threads. Columnstore indexes store data in a dramatically different way than rowstore objects, so perhaps we can expect differences in how rows are distributed among threads during a parallel scan. This blog post explores some of the observable behavior around row distribution for parallel columnstore scans.

Methods of Rowstore Scan Parallel Row Distribution

This will be an extremely brief summary of how SQL Server distributes rows among threads for rowstore parallel scans. A parallel page supplier sends pages to each thread on a demand basis. Threads may end up processing different numbers of pages for many reasons. If you like, you can read more about this here and here. In addition, there is some fancy behavior for partitioned tables. The documentation describes this behavior in SQL Server 2008 and it may have changed since then, but this is sufficient to set the stage. The important part is that SQL Server will in some cases give each relevant partition its own thread. In other cases, SQL Server will assign multiple threads to a single partition.

Methods of Columnstore Scan Parallel Row Distribution

I investigated tables with only compressed rowgroups. I did not consider delta stores because real CCIs don’t have delta stores. As far as I can tell, there are at least three different methods that SQL Server can use to assign rows to threads during a parallel CCI scan.

Rowgroup Parallelism

One method of distributing rows is to assign each thread to a relevant rowgroup. Two or more threads will not read rows from the same rowgroup. This strategy can be used if the cardinality estimate from the scan is sufficiently high compared to the DOP used by the query. Rowgroup level parallelism will be used if:

Cardinality Estimate >= MAXDOP * 1048576 * 0.5

To show this, let’s build a CCI with a single column integer that runs from 1 to 1048576 * 10. Naturally, the table will have ten compressed rowgroups of the maximum size:

CREATE TABLE dbo.CCI_SHOW_RG_PARALLELISM (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_SHOW_RG_PARALLELISM WITH (TABLOCK)
SELECT TOP (1048576 * 10) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

I ran my tests with a simple query that is likely to go parallel on my machine and that doesn’t qualify for aggregate pushdown. With a cardinality estimate of exactly 1000000 and MAXDOP of 2 all of the rows are sent to one thread:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 1000000
OPTION (MAXDOP 2);

From the actual plan:

a6_RG_1

If we reduce the cardinality estimate by one row, the rows are spread out more evenly on the two threads:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 999999
OPTION (MAXDOP 2);

From the actual plan:

a6_RG_2

Note that the cardinality estimate displayed in SSMS may be misleading. The first query has a displayed cardinality estimate of 2000000 rows, but it does not use rowgroup parallelism:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 1999999 -- query plans can round
OPTION (MAXDOP 4);

From the actual plan:

a6_RG_3

But this one does:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 2000000
OPTION (MAXDOP 4);

From the actual plan:

a6_RG_4

Of course, we can get the query that takes an aggregate of 1999999 rows to use rowgroup level parallelism by bumping up the estimate:

DECLARE @filter BIGINT = 1999999;

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= @filter
OPTION (MAXDOP 4);

Here the estimate is:

0.3 * 10485800 = 3145740.0

So we get the expected parallelism strategy:

a6_RG_5

We can show that the rowgroup parallelism strategy is demand-based by deliberately slowing down the required the thread that grabs the first rowgroup in the table. Here I’m defining first by the ID that’s returned when running a SELECT TOP 1 ID query against the table. On my machine I get an ID of 9437185. The following code will add significant processing time in the CROSS APPLY part for only the row with an ID of 9437185. Every other row simply does a Constant Scan and goes on its merry way:

SELECT COUNT(*)
FROM dbo.CCI_SHOW_RG_PARALLELISM o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 9437185

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 9437185
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
OPTION (MAXDOP 2);

Thread 1 processes nine rowgroups and waits on thread 2 to process its single rowgroup:

a6_RG_6

Split Rowgroup Parallelism

If the CCI has a small number of rows and the cardinality estimate is low enough you may see a different parallel row distribution strategy employed. I couldn’t think of a good name for this, but SQL Server splits up each rowgroup into roughly equal pieces and threads can process those pieces on a demand basis. The number of pieces seems to depend on the number of rows in the rowgroup instead of MAXDOP . For MAXDOP larger than 2 this behavior can be observed if a table has one or two rowgroups. For a MAXDOP of 2 this behavior can be observed if a table has exactly one rowgroup.

The formula for the number of pieces appears to be the number of rows in the rowgroup divided by 104857, rounded down, with a minimum of 1. The maximum rowgroup size of 1048576 implies a maximum number of pieces of 10 per rowgroup. Here’s a table to show all of the possibilities:

a6_table

We can see evidence of this behavior in SQL Server with a few tests. As before I’ll need to slow down one thread. First I’ll put 943713 integers into a single compressed rowgroup:

DROP TABLE IF EXISTS dbo.TEST_CCI_SMALL;

CREATE TABLE dbo.TEST_CCI_SMALL (
ID BIGINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (943713) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

I will test with a very similar query to a previous test:

SELECT COUNT(*)
FROM dbo.TEST_CCI_SMALL o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 1

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 1
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
OPTION (QUERYTRACEON 8649, MAXDOP 4);

The parallel scan should be split into nine pieces to be divided among threads because the table has a single rowgroup with a row count of 943713. This is exactly what happens:

a6_small_1

If I truncate the table and load one fewer row, the scan is now split up into eight pieces:

a6_small_2

I can also create two compressed rowgroups of a size that should led to seven pieces per rowgroup:

DROP TABLE IF EXISTS dbo.TEST_CCI_SMALL;

CREATE TABLE dbo.TEST_CCI_SMALL (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (733999) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (733999) 733999 + ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

Thread 1 processes the row with an ID of 734000 so it only gets one piece:

a6_small_3

With more than one rowgroup, the demand-based aspect of piece distribution doesn’t quite work in the same way as with a single rowgroup. I wasn’t able to work out all of the details.

A Third Way?

What about queries that do not meet either of the two above criteria? For example, a query against a not small CCI that has a low cardinality estimate coming out of the scan? In some cases SQL Server will use rowgroup level distribution. In other cases it appears to use a combination of the two methods described above. Most of the time the behavior can be described as each thread gets assigned an entire rowgroup and threads race for pieces of the remaining rowgroups. I wasn’t able to figure out exactly how SQL Server decides which method to use, despite running many tests. However, I will show most of the behavior that I observed. First put 7 rowgroups into a CCI:

DROP TABLE IF EXISTS dbo.MYSTERIOUS_CCI;

CREATE TABLE dbo.MYSTERIOUS_CCI (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.MYSTERIOUS_CCI WITH (TABLOCK)
SELECT TOP (1048576 * 7) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

SELECT TOP 1 ID -- 6291457
FROM dbo.MYSTERIOUS_CCI;

I added local variables to my test query to lower the cardinality estimate. Otherwise I would get rowgroup distribution every time. Here’s an example query:

declare @lower_id INT = 1048576 * 2 + 1;
declare @upper_id INT = 1048576 * 7;

SELECT COUNT(*)
FROM dbo.MYSTERIOUS_CCI o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 6291457

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 6291457
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
WHERE o.ID BETWEEN @lower_id AND @upper_id
OPTION (QUERYTRACEON 8649, MAXDOP 3);

With a MAXDOP of 3 and five processed rowgroups I get rowgroup level parallelism:

a6_3rd_1

With a MAXDOP of 4 and five processed rowgroups, each thread gets a rowgroup and the other three threads race to process the remaining 20 pieces:

a6_3rd_2

With a MAXDOP of 3 and six processed rowgroups we no longer get rowgroup level parallelism:

a6_3rd_3

What about Partitioning?

In tests not reproduced here, I was not able to observe any differences in how rows were distributed for parallel CCI scans when the underlying CCI was partitioned. This seems reasonable if we think about how partitioning for CCIs is different than for rowstore tables. CCI partitions are simply a collection of rowgroups containing rows relevant to the partition. If there’s a need to split up the underlying components of a table we can just assign rowgroups to different threads. For rowstore tables, we can think of each partition as a mini-table. Partitioning for rowstore adds underlying objects which can be distributed to parallel threads.

Fun with Query Plans

With a partial understanding of how SQL server distributes rows after a parallel scan, we can write queries that show some of the edge cases that can lead to poor performance. The following query is complete nonsense but it shows the point well enough. Here I’m cross joining to a few numbers from the spt_values table. The CROSS JOIN is meant to represent other downstream work done by a query:

SELECT MAX(SQRT(o.ID + t.number))
FROM dbo.MYSTERIOUS_CCI o
CROSS JOIN master..spt_values t
WHERE o.ID <= 1100000 AND
t.number BETWEEN 0 AND 4
OPTION (MAXDOP 2, NO_PERFORMANCE_SPOOL, QUERYTRACEON 8649);

The cardinality estimate and MAXDOP of 2 leads to rowgroup level parallelism being used. Unfortunately, this is very unbalanced:

a6_fun_1

And as a result, the query barely benefits from parallelism:

CPU time = 25578 ms, elapsed time = 24580 ms.

It’s a well-guarded secret that queries run faster with odd MAXDOP , so let’s try a MAXDOP of 3. Now my parallel threads no longer sleep on the job:

CPU time = 30922 ms, elapsed time = 11840 ms.

Here’s the row distribution from the scan:

a6_fun_2

Final Thoughts

This post explored a bit of the observable behavior for how rows are distributed to threads after a parallel columnstore index scan. The algorithms used can in some cases lead to row imbalance on threads which can cause performance issues downstream in the plan. The lack of repartition stream operators in batch mode can make this problem worse than it might be for rowstore scans, but I still expect issues caused by it to be relatively uncommon in practice. Thanks for reading!

Trace Flag 610 and SQL Server 2016

SQL Server recently surprised me with unexpected minimal logging. Normally this would be a cause for celebration but I was designing demos for a class. The point of that particular demo was to demonstrate a case in which minimal logging did not occur. The specific test case was inserting into a rowstore clustered index without a TABLOCK hint. Per the documentation, I should not have gotten minimal logging unless trace flag 610 was turned on. I was testing on SQL Server 2016 without trace flag 610.

Trace Flag 610

This trace flag is documented by Microsoft in The Data Loading Performance Guide. Here’s a relevant quote:

Not every row inserted in a cluster index with trace flag 610 is minimally logged. When the bulk load operation causes a new page to be allocated, all of the rows sequentially filling that new page are minimally logged. Rows inserted into pages that are allocated before the bulk load operation occurs are still fully logged, as are rows that are moved as a result of page splits during the load.

In addition, you can see a table of expected logging results if you search for “Summarizing Minimal Logging Conditions”. That table confirms that minimal logging should not happen when inserting into a clustered index without TABLOCK or trace flag 610. However, that’s exactly what I saw on SQL Server 2016.

SQL Server 2014 Testing

In a state of panic I immediately downloaded and installed SQL Server 2014 Express. Well not really, but it makes for a better blog post. For these tests I’m using a recovery model of simple.

Simple Inserts

First I’ll insert 2500 pages into a clustered rowstore table with a few different options. Table schema:

DROP TABLE IF EXISTS dbo.target_ci;

CREATE TABLE dbo.target_ci (
	ID BIGINT NOT NULL,
	FILLER VARCHAR(7700) NOT NULL,
	PRIMARY KEY (ID)
);

I’ll be checking out what’s getting logged with these queries:

SELECT SUM(database_transaction_log_bytes_used)
FROM sys.dm_tran_database_transactions
WHERE DATABASE_ID = 7;
-- you probably have a different database ID

SELECT Operation, COUNT(*)
FROM fn_dblog(NULL, NULL)
GROUP BY Operation
ORDER BY COUNT(*) DESC;

The first test is with TABLOCK, the second is without TABLOCK, and the third is without TABLOCK but with trace flag 610 enabled. I should get minimal logging for the first and the third and not for the second. Here’s the code that I ran:

CHECKPOINT;
BEGIN TRANSACTION;

INSERT INTO dbo.target_ci WITH (TABLOCK)
SELECT TOP (2500)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 7700)
FROM master..spt_values;

-- pause here

COMMIT TRANSACTION;
TRUNCATE TABLE dbo.target_ci;
CHECKPOINT;

BEGIN TRANSACTION;

INSERT INTO dbo.target_ci
SELECT TOP (2500)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 7700)
FROM master..spt_values;

-- pause here

COMMIT TRANSACTION;
TRUNCATE TABLE dbo.target_ci;
CHECKPOINT;

DBCC TRACEON(610);

BEGIN TRANSACTION;

INSERT INTO dbo.target_ci
SELECT TOP (2500)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 7700)
FROM master..spt_values;

-- pause here

COMMIT TRANSACTION;
TRUNCATE TABLE dbo.target_ci;
CHECKPOINT;

According the transaction log DMV, the first test logged 78,464 bytes, the second test logged 22,157,848 bytes, and the third logged 2,140,124 bytes. These results were expected. Below is a table of results from the undocumented fn_dblog TVF:

a5_2014_min_logging_table

I omitted out the (hopefully) irrelevant data. I don’t know what any of it means, but we can clearly see differences between the three tests. Based on this simple testing, SQL Server 2014 appears to match the documentation.

Page Splits

We can also do a simple test to show a case where trace flag 610 doesn’t help. First insert 2500 rows with odd primary keys from 1 to 4999:

CREATE TABLE dbo.NO_NEW_PAGES (
	ID BIGINT NOT NULL,
	DATA VARCHAR(3800),
	PRIMARY KEY (ID)
);

INSERT INTO dbo.NO_NEW_PAGES WITH (TABLOCK)
SELECT TOP (2500)
  -1 + 2 * ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 3800)
FROM master..spt_values;

Then insert 2499 rows with even primary keys from 2 to 4998 with and without TF 610:

DBCC TRACEON(610);

BEGIN TRANSACTION;

INSERT INTO dbo.NO_NEW_PAGES
SELECT TOP (2499)
  2 * ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 3800)
FROM master..spt_values;

-- pause here

ROLLBACK TRANSACTION;

DBCC TRACEOFF(610);

BEGIN TRANSACTION;

INSERT INTO dbo.NO_NEW_PAGES
SELECT TOP (2499)
  2 * ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 3800)
FROM master..spt_values;

-- pause here

COMMIT TRANSACTION;

I expect a lot of page splitting because I can now fit two rows per page. TF 610 shouldn’t handle this scenario well. I logged 16,575,608 bytes with the trace flag and 9,796,224 bytes without the trace flag according to the transaction log DMV.

Nonclustered Indexes

Trace flag 610 can help with nonclustered indexes. Below is a test that inserts 2500 rows with and without trace flag 610 with an index on a new column:

CREATE TABLE dbo.IX_TEST (
	ID BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	DATA VARCHAR(3800),
	PRIMARY KEY (ID)
);

CREATE INDEX IX ON IX_TEST (ID2);

DBCC TRACEON(610);

BEGIN TRANSACTION;

INSERT INTO dbo.IX_TEST
SELECT TOP (2500)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 3800)
FROM master..spt_values;

-- pause here

ROLLBACK TRANSACTION;

DBCC TRACEOFF(610);

BEGIN TRANSACTION;

INSERT INTO dbo.IX_TEST
SELECT TOP (2500)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 3800)
FROM master..spt_values;

-- pause here

COMMIT TRANSACTION;

There was an order of magnitude difference. With trace flag 610 I logged 1,129,728 bytes and without trace flag 610 I logged 10,120,144 bytes.

SQL Server 2016 Testing

I ran the same tests again on SQL Server 2016 SP1-CU3. With TABLOCK I logged 111,568 bytes according to the transaction log DMV. For both other tests I logged 2,125,376 bytes. Here’s a table of results from fn_dblog:

a5_2016_min_logging_table

Despite what the table says, note that the results with and without the trace flag weren’t exactly the same. However, even when running the same test sometimes the number of logged records per operation varies a little bit. My conclusion was that SQL Server is doing the same thing behind the scenes.

For both the page splitting and the nonclustered index test I got the same results as in SQL Server 2014. This shows that trace flag 610 does have some effect in SQL Server 2016, but not the one you’d expect from reading the documentation.

Estimated Final Thoughts

It appears that Microsoft has added additional minimal logging optimizations for clustered rowstore tables in SQL Server 2016. At least one of these optimizations was previously locked behind trace flag 610. I could not find any mention of this change in the documentation. If you carried over trace flag 610 when upgrading to SQL Server 2016, you should consider retesting that trace flag with your workload. It’s possible to construct a workload that benefits from trace flag 610 in SQL Server 2014 but is harmed by the same trace flag in SQL Server 2016.

Actual Final Thoughts

As of August 1, Microsoft documented that trace flag 610 is a no-op in SQL Server 2016. I revisited my results and found issues with the page splitting and nonclustered index tests. Starting with an empty table, inserting rows, and rolling back that transaction can affect the amount written to the log for future transactions. I don’t know why, but it’s necessary to drop and recreate the table before changing the trace flag to get clean results. It appears to be safe to turn off trace flag 610 in SQL Server 2016. Thanks for reading!

CCIs and String Aggregation

String aggregation is a not uncommon problem in SQL Server. By string aggregation I mean grouping related rows together and concatenating a string column for that group in a certain order into a single column. How will do CCIs do with this type of query?

The Data Set

A simple table with three columns is sufficient for testing. ITEM_ID is the id for the rows that should be aggregated together. LINE_ID stores the concatenation order within the group. COMMENT is the string column to be concatenated. The table will have 104857600 rows total with 16 rows per ITEM_ID.

CREATE TABLE dbo.CCI_FOR_STRING_AGG (
	ITEM_ID BIGINT NOT NULL,
	LINE_ID BIGINT NOT NULL,
	COMMENT VARCHAR(10) NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

DECLARE @loop INT = 0
SET NOCOUNT ON;
WHILE @loop < 100
BEGIN
	INSERT INTO dbo.CCI_FOR_STRING_AGG WITH (TABLOCK)
	SELECT
	t.RN / 16
	, 1 + t.RN % 16
	, CHAR(65 + t.RN % 16)
	FROM
	(
		SELECT TOP (1048576)
		(1048576 * @loop) - 1 +
		ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
		FROM master..spt_values t1
		CROSS JOIN  master..spt_values t2
	) t
	OPTION (MAXDOP 1);

	SET @loop = @loop + 1;
END;

On my machine this codes takes around a minute and a half and the final table size is around 225 MB. I’m inserting batches of 1048576 rows with MAXDOP 1 to get nice, clean rowgroups.

The Test Query

Let’s concatenate the strings at an ITEM_ID level for all ITEM_IDs with an id between 3276800 and 3342335. All of the data is stored in a single rowgroup and the CCI was built in such a way that all other rowgroups can be skipped with that filter. This should represent the best case for CCI performance. A common way to concatenate a column is with the FOR XML PATH method:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

Note that I’m not bothering with the additional arguments for the XML part, but you should whenever using this method in production.

The query plan looks reasonable at first glance:

a4 FOR XML estimated

However, the query takes nearly six minutes to complete. That’s not the lightning fast CCI query performance that we were promised!

Spool Problems

We can see from the actual execution plan that SQL Server scanned all 104.8 million rows from the CCI on a single thread:

a4 actual threads

Those rows were then sent into an index spool which was read from all four threads in the nested loop. The CCI scan and the build of the index spool took the most time in the plan so it makes sense why CPU time is less than elapsed time, even with a MAXDOP 4 query:

CPU time = 305126 ms, elapsed time = 359176 ms.

Thinking back to how parallel nested loop joins work, it seems unavoidable that the index spool was built with just one thread. The rows from the outer result set are sent to the inner query one row at a time on different threads. All of the threads run independently at MAXDOP 1. The eager spool for the index is a blocking operation and the blocking occurs even when the subquery isn’t needed. Consider the following query in which the ITEM_ID = 3400000 filter means that rows from the FOR XML PATH part of the APPLY will never be needed:

SELECT o.ITEM_ID
, ac.ALL_COMMENTS
FROM
(
	SELECT TOP (9223372036854775807) b.ITEM_ID
	FROM dbo.CCI_FOR_STRING_AGG b
	WHERE b.ITEM_ID BETWEEN 3276800 AND 3342335
	GROUP BY b.ITEM_ID
) o
OUTER APPLY (
	SELECT 'TEST'

	UNION ALL

	SELECT STUFF(
		(
			SELECT ','+ i.COMMENT
			FROM dbo.CCI_FOR_STRING_AGG i
			WHERE o.ITEM_ID = i.ITEM_ID
			ORDER BY i.LINE_ID
			FOR XML PATH('')
		)
	,1 ,1, '')
	WHERE o.ITEM_ID =  3400000
)  ac (ALL_COMMENTS)
OPTION (MAXDOP 4, QUERYTRACEON 8649);

The index spool is still built for this query on one thread even though the startup expression predicate condition is never met. It seems unlikely that we’ll be able to do anything about the fact that the index spool is built with one thread and that it blocks execution for the query. However, we know that we only need 1048576 rows from the CCI to return the query’s results. Right now the query takes 104.8 million rows and throws them into the spool. Can we reduce the number of rows put into the spool? The most obvious approach is to simply copy the filter on ITEM_ID into the subquery:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		and i.ITEM_ID BETWEEN 3276800 AND 3342335
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

This doesn’t have the desired effect:

a4 bad filter with spool

The filter is moved after the index build. We’ll still get all of the rows from the table put into the spool, but no rows will come out unless ITEM_ID is between 3276800 and 3342335. This is not helpful. We can get more strict with the query optimizer by adding a superfluous TOP to the subquery. That should force SQL Server to filter on ITEM_ID before sending rows to the spool because otherwise the TOP restriction may not be respected. One implementation:

SELECT
o.ITEM_ID
, STUFF(
(
SELECT ','+ i.COMMENT
FROM
(
SELECT TOP (9223372036854775807)
a.ITEM_ID, a.LINE_ID, a.COMMENT
FROM dbo.CCI_FOR_STRING_AGG a
WHERE a.ITEM_ID BETWEEN 3276800 AND 3342335
) i
WHERE o.ITEM_ID = i.ITEM_ID
ORDER BY i.LINE_ID
FOR XML PATH('')
)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

SQL Server continues to outsmart us:

a4 no spool

As you can see in the plan above, the spool has completely disappeared. I was not able to find a way, even with undocumented black magic, to reduce the number of rows going into the spool. Perhaps it is a fundamental restriction regarding index spools. In fact, we can use the undocumented trace flag 8615 to see that spools are not even considered at any part in the query plan for the new query. On the left is the previous query with the spool with an example highlighted. On the right is the new query. The text shown here is just for illustration purposes, but we can see the spool on the left:

a4 TF diff

The important point is that for this query we appear to be stuck.

Rowgroup Elimination Problems

We can try our luck without the spool by relying on rowgroup elimation alone. The spool can’t be eliminated with the NO_PERFORMANCE_SPOOL hint, but another option (other than the TOP trick above) is to use the undocumented QUERYRULEOFF syntax to disable the optimizer rule for building spools:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID
OPTION (QUERYRULEOFF BuildSpool);

The spool is gone but we don’t get rowgroup elimination:

a4 no RG elimination

We can get rowgroup elimination even without hardcoded filters with a bitmap from a hash join. Why don’t we get it with a nested loop join? Surely SQL Server ought to be able to apply rowgroup elimination for each outer row as it’s processed by the inner part of the nested loop join. We can explore this question further with tests that don’t do string aggregation. The query below gets rowgroup elimination but the constant filters are passed down to the CCI predicate:

SELECT *
FROM (VALUES (1), (2), (3), (4)) AS v(x)
INNER JOIN dbo.CCI_FOR_STRING_AGG a ON v.x = a.ITEM_ID
OPTION (FORCE ORDER, LOOP JOIN);

If we throw the four values into a temp table:

SELECT x INTO #t
FROM (VALUES (1), (2), (3), (4)) AS v(x)

The join predicate is applied in the nested loop operator. We don’t get any rowgroup elimination. Perhaps TOP can help us:

SELECT *
FROM #t v
CROSS APPLY (
	SELECT TOP (9223372036854775807) *
	FROM dbo.CCI_FOR_STRING_AGG a
	WHERE v.x = a.ITEM_ID
) a
OPTION (LOOP JOIN);

Sadly not:

a4 top failure

The join predicate is applied in the filter operation. We still don’t get any rowgroup elimination. Frustratingly, we get the behavior that we’re after if we replace the CCI with a heap:

a4 heap success

I don’t really see an advantage in pushing down the predicate with a heap. Perhaps Microsoft did not program the optimization that we’re looking for into the query optimizer. After all, this is likely to be a relatively uncommon case. This query is simple enough in that we filter directly against the CCI. In theory, we can give our query a fighting chance by adding a redundant filter on ITEM_ID to the subquery. Here’s the query that I’ll run:

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM dbo.CCI_FOR_STRING_AGG i
		WHERE o.ITEM_ID = i.ITEM_ID
		and i.ITEM_ID BETWEEN 3276800 AND 3342335
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID
OPTION (QUERYRULEOFF BuildSpool);

Unfortunately performance is even worse than before. The query finished in around 51 minutes on my machine. It was a proper parallel query and we skipped quite a few rowgroups:

Table ‘CCI_FOR_STRING_AGG’. Segment reads 65537, segment skipped 6488163.
CPU time = 10028282 ms, elapsed time = 3083875 ms.

Skipping trillions of rows is impressive but we still read over 68 billion rows from the CCI. That’s not going to be fast. We can’t improve performance further with this method since there aren’t any nonclustered indexes on the CCI, so there’s nothing better to seek against in the inner part of the loop.

Temp Tables

We can use the humble temp table to avoid some of the problems with the index spool. We’re able to insert into it in parallel, build the index in parallel, and we can limit the temp table to only the 1048576 rows that are needed for the query. The following code finishes in less than two seconds on my machine:

CREATE TABLE #CCI_FOR_STRING_AGG_SPOOL (
	ITEM_ID BIGINT NOT NULL,
	LINE_ID BIGINT NOT NULL,
	COMMENT VARCHAR(10) NULL
);

INSERT INTO #CCI_FOR_STRING_AGG_SPOOL WITH (TABLOCK)
SELECT a.ITEM_ID, a.LINE_ID, a.COMMENT
FROM dbo.CCI_FOR_STRING_AGG a
WHERE a.ITEM_ID BETWEEN 3276800 AND 3342335;

CREATE CLUSTERED INDEX CI_CCI_FOR_STRING_AGG_SPOOL
ON #CCI_FOR_STRING_AGG_SPOOL (ITEM_ID, LINE_ID);

SELECT
  o.ITEM_ID
, STUFF(
	(
		SELECT ','+ i.COMMENT
		FROM #CCI_FOR_STRING_AGG_SPOOL i
		WHERE o.ITEM_ID = i.ITEM_ID
		ORDER BY i.LINE_ID
		FOR XML PATH('')
	)
,1 ,1, '') ALL_COMMENTS
FROM dbo.CCI_FOR_STRING_AGG o
WHERE o.ITEM_ID BETWEEN 3276800 AND 3342335
GROUP BY o.ITEM_ID;

The query plan for the last query:

a4 temp table

With this approach we’re not really getting much from defining the underlying table as a CCI. If the table had more columns that weren’t needed then we could skip those columns with column elimination.

Other Solutions

An obvious solution is simply to add a covering nonclustered index to the table. Of course, that comes with the admission that columnstore currently doesn’t have anything to offer with this type of query. Other solutions which are likely to be more columnstore friendly include using a CLR function to do the aggregation and the SQL Server 2017 STRING_AGG function.

Note that recursion is not likely to be a good solution here. Every recursive query that I’ve seen involves nested loops. The dreaded index spool returns:

a4 recursion issue

Final Thoughts

Note that the poor performance of these queries isn’t a problem specific to columnstore. Rowstore tables can have the similar performance issues with string aggregation if there isn’t a sufficient covering index. However, if CCIs are used as a replacement for all nonclustered indexes, then the performance of queries that require the table to be on the inner side of a nested loop join may significantly degrade. Some of the optimizations that make querying CCIs fast appear to be difficult to realize for these types of queries. The same string aggregation query when run against a table with a rowstore clustered index on ITEM_ID and LINE_ID finished in 1 second. That is an almost 360X improvement over the CCI. Thanks for reading!