November 2017 – ORDER BY (SELECT NULL)

Containment for the Common Man

Posted on November 28, 2017November 28, 2017 by Joe Obbish

Lots of smart people have written about join containment, but none of the explanations really made sense to me. I felt like a student memorizing definitions for a test. Sure, I could tell you the definitions of base and simple containment, but what practical difference does it make when it comes to cardinality estimation? The concept finally clicked when working on an Oracle query of all things, and as a result I wrote this blog post. All testing was done on SQL Server 2017 with a CE version of 140.

A Note on Join Cardinality

Join cardinality calculations are incredibly complex in SQL Server. You can get a small taste of that complexity here. I’ve chosen the example data in this blog post to avoid most of the complexity. The formulas and concepts described in this post can’t be used to model join cardinality generally, but I hope that they serve as a good illustration of containment.

Demo Tables

All of the demo tables have identical structures with similar data. The first column, UNIQUE_ID, stores unique integers in the range specified in the table name. For example, TA_1_TO_1000000 is a table that stores integers from 1 to 1000000. The second column, MOD_FILTER, stores integers from 1 to 100 cycling through all rows. The purpose of this column is to make filtering cardinality estimates simple to calculate and predict. For example, MOD_FILTER BETWEEN 1 AND 50 will return 50% of the rows from the table. Full statistics are gathered on all columns, and there are four tables in all.

DROP TABLE IF EXISTS dbo.TA_1_TO_1000000;

CREATE TABLE dbo.TA_1_TO_1000000 (
	UNIQUE_ID BIGINT NOT NULL,
	MOD_FILTER BIGINT NOT NULL
);

INSERT INTO dbo.TA_1_TO_1000000
	WITH (TABLOCK)
SELECT t.RN
, 1 + t.RN % 100
FROM
(
	SELECT TOP (1000000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.TA_1_TO_1000000 (UNIQUE_ID)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.TA_1_TO_1000000 (MOD_FILTER)
	WITH FULLSCAN;

DROP TABLE IF EXISTS dbo.TB_1_TO_1000000;

CREATE TABLE dbo.TB_1_TO_1000000 (
	UNIQUE_ID BIGINT NOT NULL,
	MOD_FILTER BIGINT NOT NULL
);

INSERT INTO dbo.TB_1_TO_1000000
	WITH (TABLOCK)
SELECT t.RN
, 1 + t.RN % 100
FROM
(
	SELECT TOP (1000000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.TB_1_TO_1000000 (UNIQUE_ID)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.TB_1_TO_1000000 (MOD_FILTER)
	WITH FULLSCAN;

DROP TABLE IF EXISTS dbo.TC_1_TO_100000;

CREATE TABLE dbo.TC_1_TO_100000 (
	UNIQUE_ID BIGINT NOT NULL,
	MOD_FILTER BIGINT NOT NULL
);

INSERT INTO dbo.TC_1_TO_100000
	WITH (TABLOCK)
SELECT t.RN
, 1 + t.RN % 100
FROM
(
	SELECT TOP (100000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.TC_1_TO_100000 (UNIQUE_ID)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.TC_1_TO_100000 (MOD_FILTER)
	WITH FULLSCAN;

DROP TABLE IF EXISTS dbo.TD_500001_TO_1500000;

CREATE TABLE dbo.TD_500001_TO_1500000 (
	UNIQUE_ID BIGINT NOT NULL,
	MOD_FILTER BIGINT NOT NULL
);

INSERT INTO dbo.TD_500001_TO_1500000
	WITH (TABLOCK)
SELECT t.RN
, 1 + t.RN % 100
FROM
(
	SELECT TOP (1000000) 500000 + ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.TD_500001_TO_1500000 (UNIQUE_ID)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.TD_500001_TO_1500000 (MOD_FILTER)
	WITH FULLSCAN;

The statistics objects are perfect in that they fully describe the data. Here’s the statistics output for the UNIQUE_ID column:

a21_T2_perfect_stats_1

And here’s the output for the MOD_FILTER column:

a21_T2_perfect_stats_2

This only happened because the table was populated with very simple data that fits well within the framework for generating histograms in SQL Server. Gathering statistics, even with FULLSCAN, will often not perfectly represent the data in the column.

A Simple Model of Join Cardinality Estimation

Consider the following simple query:

SELECT *
FROM TB_1_TO_1000000 b
INNER JOIN dbo.TD_500001_TO_1500000 d
	ON b.UNIQUE_ID = d.UNIQUE_ID;

We know that exactly 500000 rows will be returned, but how might SQL Server estimate the number of rows to be returned? Let’s look at the histograms and try to align their steps:

a21_ex1_not_aligned

That doesn’t exactly work, but we can split up the histogram steps so they align. The assumption of uniformity within the step isn’t even needed here because we know that there aren’t missing any integer values. The histograms below are equivalent to the original ones:

a21_ex1_aligned

Now the RANGE_HI_KEY values align. For the step with a high value of 500001 we can expect only one row to match between tables. For the step with a high value of 1000000 we can expect 499998 + 1 rows to match. This brings the total row estimate to 500000, which happens to match what I get in SQL Server 2017 with the new CE. Remember, what we’re doing here isn’t how the query optimizer does the calculation. This is just a simple model that will be useful later.

Now consider the two queries below:

SELECT *
FROM TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50;

SELECT *
FROM TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 51 AND 100;

We know that the first query will return 500k rows and the second query will return 0 rows. However, can SQL Server know that? Each statistics object only contains information about its own column. There’s no correlation between the UNIQUE_ID and MOD_FILTER columns, so there isn’t a way for SQL Server to know that the queries will return different estimates. The query optimizer can create an estimate based on the filters on the WHERE clause and on the histograms of the join columns, but there’s no foolproof way to do that calculation. The presence of the filters introduces uncertainty into the estimate, even with statistics that perfectly describe the data for each column. The containment assumption is all about the modeling assumption that SQL Server has to make to resolve that uncertainty.

Base Containment

Base containment is the assumption that the filter predicates are independent from the join selectivity. The estimate for the join should be obtained by multiplying together the selectivity from both filters and the join. The query optimizer uses base containment starting with CE model version 120, also known as the new CE introduced in SQL Server 2014. It can be used with the legacy CE if trace flag 2301 is turned on. The best reference for trace flag 2301 is a blog post from 2006 which is no longer published.

Let’s go back to this example query:

SELECT *
FROM TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50;

The selectivity for the filter on MOD_FILTER is 0.5 for both tables. This is because there are 100 unique values for MOD_FILTER between 1 and 100 and each value matches 1% of the table. We can see this by getting an estimated query plan on just TA_1_TO_1000000:

a21_ex2_filter_selectivity

The table has 1 million rows, so the estimate is 500000 = 0.5 * 1000000.

That leaves the join selectivity. We put the same data into both tables:

a21_ex2_same_histograms

We don’t need highlighters to see that the join selectivity is 1.0.

Putting it all together, the cardinality estimate under base containment for this query should be 1000000 * 1.0 * 0.5 * 0.5 = 250000. This is indeed the estimate:

a21_ex2_base_estimate

Of course, this doesn’t match the actual number of rows which is 500000. But it’s easy to change the filter predicates so that the estimated number of rows and the actual number of rows match.

Simple Containment

Simple containment is the assumption that the filter predicates are not independent. The estimate for the join should be obtained by applying the filter selectivities to the join histograms and joining based on the adjusted histograms. The query optimizer uses simple containment within the legacy CE. Simple containment can be used in the new CE via trace flag or USE HINT.

Let’s go back to the same example query:

SELECT *
FROM TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

We know that the filter selectivity for both tables is 0.5. How can that be used to adjust the histograms? The simplest method would be to just multiply RANGE_ROWS, EQ_ROWS, and DISTINCT_RANGE_ROWS by the filter selectivity. After doing so we’re left with two still identical histograms:

a21_ex2_simple_histograms

It might seem odd to work with fractions of a row, but as long as everything is rounded at the end why should it matter? With two identical, aligned histograms it seems reasonable to expect a cardinality estimate of 0.5 + 499999 + 0.5 = 500000. This is exactly what we get in SQL Server:

a21_ex2_simple_estimate

The actual row estimate matches the estimated row estimate because the filters are perfectly correlated. Every row left after filtering still has a matching row in the other table.

Just One Filter

What happens if we filter on just a single table? For example:

SELECT *
FROM dbo.TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 30;

SELECT *
FROM dbo.TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.MOD_FILTER BETWEEN 1 AND 30
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

For base containment, we know that the filter selectivity is 0.3 and the join selectivity is 1.0. We can expect a cardinality estimate of 1000000 * 1.0 * 0.3 = 300000 rows.

For simple containment we need to multiply the histogram for TA_1_TO_1000000 by 0.3. Here’s what the two histograms look like after factoring in filter selectivity:

a21_ex3_simple_histograms

What should the estimate be? One approach would be to assume that everything matches between the aligned steps. So we end up with 0.3 rows from the step with a RANGE_HI_KEY of 1 and 299999.4 + 0.3 rows from the step with a RANGE_HI_KEY of 1000000. The combined estimate is 300000 rows, which matches the base containment estimate. Why shouldn’t they match? Without filters on both tables there’s no concept of correlation. If it helps you can imagine a filter of 1 = 1 on TB_1_TO_1000000. For base containment multiplying by 1.0 won’t change the estimate and for simple containment multiplying by 1 won’t change the histogram. That just leaves a single filter selectivity of 0.3 for TA_1_TO_1000000 and both estimates should be the same.

For both queries the estimated number of rows in SQL Server is 300000. Our calculations match the SQL Server query optimizer exactly for this query.

Filtering on the Join Column

What happens if we filter on the join columns of both tables? For example:

SELECT *
FROM dbo.TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.UNIQUE_ID BETWEEN 1 AND 200000
AND b.UNIQUE_ID BETWEEN 1 AND 200000;

SELECT *
FROM dbo.TA_1_TO_1000000 a
INNER JOIN dbo.TB_1_TO_1000000 b
	ON a.UNIQUE_ID = b.UNIQUE_ID
WHERE a.UNIQUE_ID BETWEEN 1 AND 200000
AND b.UNIQUE_ID BETWEEN 1 AND 200000
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

Think back to why we need containment in the first place. When there are filters on columns that aren’t the join columns then we need to make an assumption as to how the selectivities all interact with each other. With a filter on the join column we can just adjust the histogram of the join column directly. There isn’t any uncertainty. Here’s what the histograms could look like:

a21_ex4_histograms

In which case, it seems obvious that the final estimate should be 200000 rows. Simple containment does not result in a different estimate here.

Removing Rows

So far the examples have been very simple. We’ve joined tables that contain the exact same data. What if one table has fewer rows than the other table? Consider the following pair of queries:

SELECT *
FROM dbo.TC_1_TO_100000 c
INNER JOIN dbo.TB_1_TO_1000000 b
	ON c.UNIQUE_ID = b.UNIQUE_ID
WHERE c.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50;

SELECT *
FROM dbo.TC_1_TO_100000 c
INNER JOIN dbo.TB_1_TO_1000000 b
	ON c.UNIQUE_ID = b.UNIQUE_ID
WHERE c.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 50
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

It’s important to call out here that TC_1_TO_100000 has just 100000 rows instead of one million. For base containment, we know that the selectivity will be 0.5 for both tables. What about join selectivity? The histogram steps of course aren’t aligned:

a21_ex5_initial_histograms

The data is densely packed, so we can use the same trick as before to split the histogram for the larger table:

a21_ex5_base_aligned_histograms

Every row in histogram for the smaller table has a match in the histogram of the larger table. From the point of view of the smaller table the join selectivity is 1.0. Multiplying together all three selectivities gives a final row estimate of 100000 * 1.0 * 0.5 * 0.5 = 25000. This matches the row estimate within SQL Server exactly.

For simple containment we need to apply the filter selectivities of 0.5 to both tables. We also need to align the histograms by splitting the larger histogram. Both will be done in one step:

a21_ex5_simple_histograms

Every row in the smaller histogram once again matches. Our final estimate is 0.5 + 49999 + 0.5 = 50000 which exactly matches the SQL Server query optimizer.

Unmatched Rows

What happens if the tables have the same number of rows but they clearly don’t contain the same data? Consider the following pair of queries:

SELECT *
FROM dbo.TD_500001_TO_1500000 d
INNER JOIN dbo.TB_1_TO_1000000 b
	ON d.UNIQUE_ID = b.UNIQUE_ID
WHERE d.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 10;

SELECT *
FROM dbo.TD_500001_TO_1500000 d
INNER JOIN dbo.TB_1_TO_1000000 b
	ON d.UNIQUE_ID = b.UNIQUE_ID
WHERE d.MOD_FILTER BETWEEN 1 AND 50
AND b.MOD_FILTER BETWEEN 1 AND 10
OPTION (
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

The filter predicate for TB_1_TO_1000000 is 0.1 and the filter predicate for TD_500001_TO_1500000 is 0.5. Here are our starting histograms:

a21_ex6_base_initial_histograms

The little man who lives inside the cardinality estimator needs to slice them up so they align. His work is complete:

a21_ex6_base_sliced_histograms

The top histogram has 500000 unmatched rows in the step with a RANGE_HI_KEY of 1500000, so the join selectivity is 500000 / 1000000 = 0.5. Putting all three selectivities together, the cardinality estimate with base containment should be 1000000 * 0.5 * 0.1 * 0.5 = 25000. This exactly matches SQL Server.

You know the drill for simple containment. We need to multiply each sliced histogram by its filter selectivity:

a21_ex6_simple_sliced_histograms

That’s pretty messy. I’m going to assume that every row has a match between the two shared steps, so the estimate should be 0.1 + 49999.8 + 0.1 = 50000. The number of estimated rows reported by SQL Server is 50000.4 :

a21_ex6_simple_estimate

What happened? Did the little man only measure once before cutting? This is one of those examples where there’s other complicated stuff going on under the hood, so the predicted row estimate doesn’t match up exactly. Interestingly, the estimate with the legacy cardinality estimator is exactly 50000.

An Approximate Formula

Define T1_CARDINALITY as the number of rows in the first joined table.
Define T1_FILTER_SELECTIVITY as the filter selectivity of the filter predicates of the first table. This number ranges from 0.0 to 1.0, with 1.0 for filters that remove no rows.
Define T2_CARDINALITY as the number of rows in the second joined table.
Define T2_FILTER_SELECTIVITY as the filter selectivity of the filter predicates of the second table. This number ranges from 0.0 to 1.0, with 1.0 for filters that remove no rows.
Define JOIN_SELECTIVITY as the selectivity of the two histograms of the joined columns from the point of view of the smaller table. This number ranges from 0.0 to 1.0, with 1.0 meaning that all rows in the smaller table have a match in the larger table.

Based on the tests above, we can model the cardinality estimates for base and simple containment as follows:

Base containment = JOIN_SELECTIVITY * LEAST(T1_CARDINALITY, T2_CARDINALITY) * T1_FILTER_SELECTIVITY * T2_FILTER_SELECTIVITY
Simple containment = JOIN_SELECTIVITY * LEAST(T1_FILTER_SELECTIVITY * T1_CARDINALITY, T2_FILTER_SELECTIVITY * T2_CARDINALITY)

Remember that this isn’t how SQL Server actually does it. However, I think that it shows the difference between base containment and simple containment quite well. For simple containment the filters are applied to the histograms and for base containment all of the selectivities are independent.

A Mathematical Proof?

So far simple containment has always had a higher cardinality estimate than base containment. Looking at the formulas it certainly feels like simple should have a higher estimate. Can we prove that the estimate will always be higher using the above formulas? It’s been quite a few years so I apologize for the proof below, but I believe that it gets the job done.

Definitions:

JS = JOIN_SELECTIVITY
C1 = T1_CARDINALITY
F1 = T1_FILTER_SELECTIVITY
C2 = T2_CARDINALITY
F2 = T2_FILTER_SELECTIVITY

Attempt a proof by contradiction, so assume the opposite of what we want to prove:

JS * LEAST(C1, C2) * F1 * F2 > JS * LEAST(F1 * C1, F2 * C2)

We know that JS > 0, F1 > 0, and F2 > 0, so:

LEAST(C1, C2) > LEAST(C1 / F2, C2 / F1)

The left hand expression can only evalute to C1 or C2. Let’s assume that it evaluates to C1, so C1 <= C2. We know that F1 <= 1, so C2 <= C2 / F1. C1 / F2 > C1, so the only hope of the inequality above being true is if C1 > C2 / F1. Putting it all together:

C1 <= C2 <= C2 / F1 < C1

That is clearly impossible. Very similar logic holds if the left hand expression evaluates to C2 (just flip 1 with c in the above), so we know that the equation that we started out with is not true. Therefore:

JS * LEAST(C1, C2) * F1 * F2 <= JS * LEAST(F1 * C1, F2 * C2)

In other words:

BASE CONTAINMENT <= SIMPLE CONTAINMENT

Here’s my public domain celebration picture:

a21_anniversary-157248_960_720

The details of this stuff within SQL Server are very complicated, so this doesn’t mean that there doesn’t exist a query that has a larger cardinality estimate with base containment. However, it seems to be a safe assumption that in general simple containment will result in a larger or equal estimate compared to base containment.

Why Does Any of This Matter?

I almost created a kind of real life example here, but I ran out of time so you’re eating Zs for dinner again as usual. Let’s introduce a table to cause some trouble:

DROP TABLE IF EXISTS dbo.ROWGOAL_TROUBLES;

CREATE TABLE dbo.ROWGOAL_TROUBLES (
	UNIQUE_EVEN_ID BIGINT NOT NULL,
	PAGE_FILLER VARCHAR(1000) NOT NULL
);

INSERT INTO dbo.ROWGOAL_TROUBLES
	WITH (TABLOCK)
SELECT 2 * t.RN
, REPLICATE('Z', 1000)
FROM
(
	SELECT TOP (50000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) / 100 RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);

Consider the following business critical query that I run all the time:

SELECT *
FROM dbo.TA_1_TO_1000000 t1
INNER JOIN dbo.TB_1_TO_1000000 t2
	ON t1.UNIQUE_ID = t2.UNIQUE_ID
WHERE t1.MOD_FILTER = 1
AND t2.MOD_FILTER = 1
AND NOT EXISTS (
	SELECT 1
	FROM dbo.ROWGOAL_TROUBLES rt
	WHERE rt.UNIQUE_EVEN_ID = t1.UNIQUE_ID
)
OPTION (MAXDOP 1);

The plan doesn’t look so hot:

a21_bad_row_goal

There are unmatched rows in the ROWGOAL_TROUBLES table, so we know that the scan on the inner side of the nested loop is going to read a lot of rows. The query took about 60 seconds to finish on my machine and read 499775000 rows from the ROWGOAL_TROUBLES table. Why did this plan seem attractive to SQL Server? The query optimizer thought that only 100 rows would be returned after the join of TA_1_TO_1000000 to TB_1_TO_1000000. The filters are perfectly correlated so 10000 rows will be returned in reality. With perfectly correlated filters we can expect a better estimate if we use simple containment:

SELECT *
FROM dbo.TA_1_TO_1000000 t1
INNER JOIN dbo.TB_1_TO_1000000 t2
	ON t1.UNIQUE_ID = t2.UNIQUE_ID
WHERE t1.MOD_FILTER = 1
AND t2.MOD_FILTER = 1
AND NOT EXISTS (
	SELECT 1
	FROM dbo.ROWGOAL_TROUBLES rt
	WHERE rt.UNIQUE_EVEN_ID = t1.UNIQUE_ID
)
OPTION (
MAXDOP 1,
USE HINT ('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
);

With a better estimate of 10000 rows comes a better query plan:

a21_no_row_goal

The query finishes in under a second on my machine.

Final Thoughts

Hopefully this blog post gives you a better understanding of the difference between base and simple containment. Read some of the other explanations out there if this wasn’t helpful. Containment is a tricky subject and you never know what it’ll take for it to make sense to you. Thanks for reading!

A Columnstore Compression Magic Trick

Posted on November 21, 2017November 28, 2017 by Joe Obbish

Columnstore compression is complicated, and in some cases, surprising.

The Setup

The source data for the CCI has enough rows to fit six perfect rowgroups. The ID column is just sequential integers from 1 to 6291456. The ID2 column is the ID column modulo 20001. Code to load the data into a temp table:

<span 				data-mce-type="bookmark" 				id="mce_SELREST_end" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>
DROP TABLE IF EXISTS #STG_DATA;
CREATE TABLE #STG_DATA (
	ID BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO #STG_DATA WITH (TABLOCK)
SELECT t.RN, t.RN % 20001
FROM
(
	SELECT TOP (6 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

Here’s the table definition for the target CCI:

DROP TABLE IF EXISTS dbo.TARGET_CCI;
CREATE TABLE dbo.TARGET_CCI (
	ID2 BIGINT NOT NULL,
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

The reversal of column order is important for the demo below.

Changing MAXDOP

First let’s load the ID2 column the temp table into the CCI. The order of data can matter for compression so I have a superfluous TOP expression to force SQL Server to read the data in clustered key order.

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 1);

The insert query takes 2765 ms of CPU time and 2771 ms of elapsed time on my machine. According to sys.dm_db_column_store_row_group_physical_stats each rowgroup has a size of 2098320 bytes:

a20_maxdop_1_rg_dmv

Now let’s move on to a parallel insert query with MAXDOP 2. The purpose of the second column in the CCI is to make the insert go parallel on my machine. It’s possible that you’ll need to use trace flag 8649 or some other trick to get a parallel insert. Here’s the code that I ran:

TRUNCATE TABLE dbo.TARGET_CCI;

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 2);

The insert query now takes 3594 ms of CPU time and 2112 ms of elapsed time on my machine. The size of each rowgroup did not change. It’s still 2098320 bytes. Even though this is a parallel query there’s no element of randomness in this case. In the query plan we can see that the source table was scanned in a serial zone and round robin distribution is to used to distribute exactly half of the rows to each parallel thread.

a20_parallel_insert

This seems like a reasonable plan given that TOP forces a serial zone and we need to preserve order. It’s possible to rewrite the query to encourage a parallel scan of the source table, but that would introduce an order-preserving gather streams operator.

I’m not satisfied with the runtime yet, so I’m going to bump up MAXDOP to 3:

TRUNCATE TABLE dbo.TARGET_CCI;

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 3);

The insert query now takes 114172 ms of CPU time and 39208 ms of elapsed time to execute. However, each rowgroup now is just 54496 bytes.

a20_maxdop_3_rg_dmv

The INSERT took significantly longer than before, but we have 38X better compression compared to the table after the MAXDOP 2 query. What happened?

Revealing the Magic Trick

An interesting pattern for compressed data sizes appears when working with repeated integers for a single rowgroup. The query that I tested with was roughly of the following format:

INSERT INTO dbo.CCI
SELECT t.RN % @MOD_NUM
FROM
(
	SELECT TOP (@ROWS_INSERTED)
		ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

Below is a contour plot that shows how the compressed size for a single rowgroup varies as the number of rows and the modulus value changes:

a20_contour_size

Values that are repeated 64 or more times seem to be compressed much better than other values. This pattern definitely doesn’t always hold as you add more columns to the table which is why I made the ID2 column the first column in the target CCI. Why is this pattern relevant to the previous example?

Consider the MAXDOP 1 insert query. With a full rowgroup of 1048576 rows a value will be repeated at most 1048576/20001 = 53 times in each rowgroup. It doesn’t cross the threshold of 64 so we end up with a compressed size of 2098320 bytes.

Now consider the MAXDOP 2 insert query. The ordered data from the scan is distributed using round robin distribution on two threads. For the first 20001 rows from the scan, thread 0 gets all even values and thread 1 gets all odd values. For the next 20001 rows, thread 0 gets all odd values and thread 1 gets all even values. This occurs because 20001 isn’t divisible by 2. For all six compressed rowgroups we end up with the same data distribution as we had when doing MAXDOP 1 inserts. It makes sense that the compressed size remained at 2098320 bytes.

Now consider the MAXDOP 3 insert query. The query still uses round robin distribution but there are now three threads. 20001 is divisible by 3 so thread 0 only ends up with 6667 unique values from 0, 3, … to 19999. Thread 1 also ends up with 6667 unique values from 1, 4, … to 20000. Thread 2 follows a similar pattern. Each compressed rowgroup only has 6667 unique values instead of 20001. Each value shows up at least 157 times in the rowgroup, so all of the data qualifies for much better compression.

Final Thoughts

This has absolutely no practical value. Thanks for reading!

ROWGROUP_FLUSH Deadlocks

Posted on November 14, 2017November 12, 2017 by Joe Obbish

We recently observed many ROWGROUP_FLUSH deadlocks while doing concurrent inserts into CCIs. I’m not really a concurrency kind of guy but I figured that I should blog about this just so other people with the same problem can find some information about it.

Deadlock Reproduction

The schedulers of the involved sessions are important in some way, especially when going for a simple reproduction. It’s easiest to just make all new sessions go the same CPU:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 0;

Obviously you should never do that in production. After affinity has been addressed I recommend creating a nearly empty source table and a new CCI table:

DROP TABLE IF EXISTS dbo.CCI_DEADLOCKED;
CREATE TABLE dbo.CCI_DEADLOCKED (
	COL VARCHAR(1500),
	INDEX CCI CLUSTERED COLUMNSTORE
);

CREATE TABLE ##SOURCE_IDS (ID BIGINT NOT NULL);

INSERT INTO ##SOURCE_IDS WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

One way to see the deadlock is to quickly kick off two inserts into the CCI_DEADLOCKED table from different sessions. Inserting a larger amount of data means that you’ll have more time to kick off the second session before the first completes, but a longer rollback time on the first session. On my machine inserting 1048576 rows of VARCHAR(1500) data seems like a reasonable compromise:

INSERT INTO dbo.CCI_DEADLOCKED
SELECT REPLICATE('Z', 1500)
FROM ##SOURCE_IDS
OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);

The second session waits on the first with a LCK_M_IX wait event. The first session loads all of its rows into the delta store, then deadlocks and rolls them all back. You can see this happen in near real time by looking at sys.dm_db_column_store_row_group_physical_stats:

a19_disappearing_delta_store_rows

Here’s the deadlock XML for those who are interested in that kind of thing:

<?xml version="1.0" encoding="UTF-8"?>
<deadlock>
   <victim-list>
      <victimProcess id="process250c6d59c28" />
   </victim-list>
   <process-list>
      <process id="process250c6d59c28" taskpriority="0" logused="0" waitresource="HOBT: 5:72057594084917248 [ROWGROUP_FLUSH]" waittime="3635" ownerId="8986700" transactionname="CloseDeltastore" lasttranstarted="2017-11-12T16:51:53.260" XDES="0x250b572c490" lockMode="X" schedulerid="1" kpid="6288" status="suspended" spid="56" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2017-11-12T16:51:42.827" lastbatchcompleted="2017-11-12T16:51:42.827" lastattention="1900-01-01T00:00:00.827" clientapp="Microsoft SQL Server Management Studio - Query" hostname="???????" hostpid="4404" loginname="???????\?" isolationlevel="read committed (2)" xactid="8775106" currentdb="5" lockTimeout="4294967295" clientoption1="671090784" clientoption2="390200">
         <executionStack>
            <frame procname="adhoc" line="1" stmtend="238" sqlhandle="0x020000004d878e20b5b8d311601f91ebfeb1174f657907d00000000000000000000000000000000000000000">unknown</frame>
         </executionStack>
         <inputbuf>INSERT INTO dbo.CCI_DEADLOCKED  SELECT REPLICATE('Z', 1500)  FROM ##SOURCE_IDS  OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);</inputbuf>
      </process>
      <process id="process250bacf04e8" taskpriority="0" logused="168" waitresource="HOBT: 5:72057594085179392 " waittime="13628" ownerId="8785145" transactionname="INSERT" lasttranstarted="2017-11-12T16:51:43.267" XDES="0x2508f1ac040" lockMode="IX" schedulerid="1" kpid="6856" status="suspended" spid="54" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2017-11-12T16:51:43.267" lastbatchcompleted="2017-11-12T16:51:43.250" lastattention="1900-01-01T00:00:00.250" clientapp="Microsoft SQL Server Management Studio - Query" hostname="???????" hostpid="4404" loginname="???????\?" isolationlevel="read committed (2)" xactid="8785145" currentdb="5" lockTimeout="4294967295" clientoption1="671090784" clientoption2="390200">
         <executionStack>
            <frame procname="adhoc" line="1" stmtend="238" sqlhandle="0x020000004d878e20b5b8d311601f91ebfeb1174f657907d00000000000000000000000000000000000000000">unknown</frame>
         </executionStack>
         <inputbuf>INSERT INTO dbo.CCI_DEADLOCKED  SELECT REPLICATE('Z', 1500)  FROM ##SOURCE_IDS  OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);</inputbuf>
      </process>
   </process-list>
   <resource-list>
      <hobtlock hobtid="72057594084917248" subresource="ROWGROUP_FLUSH" dbid="5" objectname="D1.dbo.CCI_DEADLOCKED" indexname="CCI" id="lock250b52ab400" mode="S" associatedObjectId="72057594084917248">
         <owner-list>
            <owner id="process250bacf04e8" mode="S" />
         </owner-list>
         <waiter-list>
            <waiter id="process250c6d59c28" mode="X" requestType="wait" />
         </waiter-list>
      </hobtlock>
      <hobtlock hobtid="72057594085179392" subresource="FULL" dbid="5" objectname="D1.dbo.CCI_DEADLOCKED" indexname="CCI" id="lock250b5b8a280" mode="X" associatedObjectId="72057594085179392">
         <owner-list>
            <owner id="process250c6d59c28" mode="X" />
         </owner-list>
         <waiter-list>
            <waiter id="process250bacf04e8" mode="IX" requestType="convert" />
         </waiter-list>
      </hobtlock>
   </resource-list>
</deadlock>

SSMS can’t produce a deadlock graph for this type of deadlock. Below is the non-copy-and-pastable error message from it:

Failed to initialize deadlock control.
There is an error in XML document (1, 2497).
Instance validation error: ‘ROWGROUP_FLUSH’ is not a valid value for hobtlockSubresource.

Plan Explorer from SentryOne can help us:

a19_deadlock_graph

If you’re following along at home don’t forget to reset your affinity to whatever you had it before. The most common option:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = AUTO;

The Workarounds

We’ve only observed this deadlock with multiple concurrent sessions insert to the delta store for the same target CCI due to server memory pressure or very low cardinality estimates (less than 251 rows). The correct mitigation depends on why you’re seeing the issue in the first place. If you’re seeing it due to low cardinality estimates then fix your estimates, or at the very least get them above 250 rows. If you’re seeing them because the memory grant for the CCI build times out after 25 seconds then lower concurrency or increase server memory.

The problem can also be avoided by not doing concurrent inserts in the first place. In some cases a parallel insert may be a reasonable alterative. There’s also some evidence that the deadlock is only seen when the number of rows for insert is very close to 1048576, but we weren’t able to make any definitive conclusions around that.

Final Thoughts

Don’t despair if you run into a ROWGROUP_FLUSH deadlock! There’s probably something you can do in the application to avoid it. If you feel that you shouldn’t have to take such measures feel free to vote for my connect item here.

Surprise Delta Stores

Posted on November 7, 2017February 6, 2018 by Joe Obbish

This post contains all of the possible causes for delta store creation that I’ve found. I cannot say with certainty that it’s a complete list, but some of them may be new or unexpected to the reader.

Why Care about Delta Stores?

Microsoft and many others will be quick to tell you that loading data into CCIs is much faster when you can bypass the delta store. In SQL Server 2016 and beyond, delta stores are uncompressed rowstore mini-tables that serve as a temporary holding data until the data can be compressed into columnar format. They’re good when you have a trickle of data to load into a CCI, but bad in all possible ways for a data warehouse workload.

Reviewing the Documentation

I briefly reviewed the documentation written by Microsoft concerning the appearance of delta stores. Here’s a quote:

Rows go to the deltastore when they are:
Inserted with the INSERT INTO VALUES statement.
At the end of a bulk load and they number less than 102,400.
Updated. Each update is implemented as a delete and an insert.

There are also a few mentions of how partitioning can lead to the creation of multiple delta stores from a single insert. It seems as if the document is incomplete or a little misleading, but I admit that I didn’t exhaustively review everything. After all, Microsoft hides columnstore documentation all over the place.

Test Data

The source data for the CCI inserts is fairly uninteresting. I put four rowgroups worth of rows into a rowstore table with a BIGINT column and a randomly generated VARCHAR(16) value.

DROP TABLE IF EXISTS dbo.STAGING_TABLE;

CREATE TABLE dbo.STAGING_TABLE (
	ID BIGINT NOT NULL,
	STR1 VARCHAR(16) NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO dbo.STAGING_TABLE WITH (TABLOCK)
SELECT TOP (4 * 1048576)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, LEFT(CAST(NEWID() AS VARCHAR(36)), 16)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

The columns for the table definition for the CCI were chosen to cover all of the demos except for the partitioning one. Your fact table definitions probably don’t look like this.

DROP TABLE IF EXISTS dbo.DELTA_STORE_DUMPING_GROUND;
CREATE TABLE dbo.DELTA_STORE_DUMPING_GROUND (
	ID BIGINT NULL,
	STR1 VARCHAR(100) NULL,
	STR2 VARCHAR(100) NULL,
	STR3 VARCHAR(100) NULL,
	STR1_MAX VARCHAR(MAX) NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

Not Enough Rows For Bulk Load

The first reason for delta creation is well known and understood on SQL Server 2016. If you insert fewer than 102400 rows then SQL Server will not attempt to skip the delta store. This behavior is by design. The following query does not do a bulk load:

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (102399) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

We can see the delta store that was just created with the following query:

SELECT
    row_group_id
  , state_desc
  , total_rows
--, trim_reason_desc
--, deleted_rows
--, partition_number
FROM sys.dm_db_column_store_row_group_physical_stats rg
INNER JOIN sys.tables t ON rg.OBJECT_ID = t.OBJECT_ID
WHERE t.name = 'DELTA_STORE_DUMPING_GROUND';

The results:

a18_dmv_1

The other examples in this post use similar queries to get information about the newly added rowgroups to the table. They will be omitted for brevity. Simply inserting one row results in the delta store getting skipped:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (102400) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

Now the rowgroup is compressed:

a18_dmv_2

The rules change slightly in SQL Server 2017 with support of VARCHAR(MAX) and other LOB columns in columnstore. The delta store can be skipped with an insert of as few as 251 rows. Whether or not you write to the delta store depends on the amount of data being written. Below is one query that still writes to the delta store:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (STR1_MAX)
SELECT TOP (251) REPLICATE(STR1, 40)
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

Once again you can see the delta store:

a18_dmv_3

Things are different if we increase the length of the inserted data. The query below writes to a compressed rowgroup and bypasses the delta store:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (STR1_MAX)
SELECT TOP (251) REPLICATE(STR1, 500)
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

The resulting rowgroup is compressed:

a18_dmv_4

Removing just a single row from the insert brings us back to the delta store.

Inserting to Multiple Partitions

If a MAXDOP 1 INSERT query writes to multiple partitions then it could possibly write to multiple delta stores. The number of rows written to each partition is important as opposed to the total number of rows written to the table. Below I define a simple table with 2 partitions:

CREATE PARTITION FUNCTION CLUNKY_SYNTAX_1
(BIGINT)
AS RANGE LEFT
FOR VALUES (
  0
, 2000000
); 

CREATE PARTITION SCHEME CLUNKY_SYNTAX_2
AS PARTITION CLUNKY_SYNTAX_1
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS dbo.PARTITIONED_DELTA_STORE;
CREATE TABLE dbo.PARTITIONED_DELTA_STORE (
ID BIGINT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
) ON CLUNKY_SYNTAX_2 (ID);

The insert writes 200k rows to the CCI which you might expect to bypass the delta store, but since the rows are evenly spread over two partitions we end up with two delta stores:

INSERT INTO dbo.PARTITIONED_DELTA_STORE (ID)
SELECT ID
FROM dbo.STAGING_TABLE
WHERE ID BETWEEN 1900001 AND 2100000
OPTION (MAXDOP 1);

a18_dmv_5

With MAXDOP 8 INSERT queries and the maximum number of partitions defined on a table, it is possible to get 120000 delta stores. I don’t recommend doing this.

Bulk Insert Leftovers

Often applications will not insert an exact multiple of 1048576 rows. That means that rows can be left over after a few rowgroups worth of inserted rows are compressed. Those leftover rows can go into a delta store. Consider the following insert query that inserts 100000 rows more than 1048576:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (1048576 + 100000) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

As expected, the final result is one compressed rowgroup of 1048576 rows and one delta store of 100k rows.

a18_dmv_6

If we inserted just a few thousand more rows than we’d end up with two compressed rowgroups.

Updates

UPDATE queries always write to the delta store. There are many other reasons to avoid UPDATES to CCIs if the application makes it possible to do so.

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (1048576) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

At first there’s just a single compressed rowgroup:

a18_dmv_7

Now run the UPDATE query and go make coffee:

UPDATE DELTA_STORE_DUMPING_GROUND
SET ID = ID;

Our table doesn’t look so hot:

a18_dmv_8

In SQL Server 2016 the Tuple Mover will not clean up this table. Another row needs to be inserted into the table before the rowgroup is marked as CLOSED.

Parallel Insert

Many parallel queries have an element of randomess around how rows are distributed to parallel threads. Rows are not moved between threads after they flow to the part of the plan that performs the insert into the CCI. It’s possible to end up with a number of new delta stores equal to the number of parallel threads for the query. Let’s start with a parallel insert that moves 4 * 1048576 rows into the CCI:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
WITH (TABLOCK) (ID)
SELECT ID
FROM dbo.STAGING_TABLE
OPTION (MAXDOP 4);

It’s possible to end up without any delta stores and the results of the query against sys.dm_db_column_store_row_group_physical_stats will vary, but generally you’ll get at least one:

a18_dmv_9

If we have unnaturally high beauty standards for our rowgroups we can rewrite the query to effectively force rows to be evenly distributed on all threads. The query below does this with a join to a derived table:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
WITH (TABLOCK) (ID)
SELECT stg.ID
FROM
(
	VALUES
	(0 * 1048576 + 1, 1 * 1048576),
	(1 * 1048576 + 1, 2 * 1048576),
	(2 * 1048576 + 1, 3 * 1048576),
	(3 * 1048576 + 1, 4 * 1048576)
)
v (start_id, end_id)
INNER JOIN dbo.STAGING_TABLE stg ON
	stg.ID BETWEEN v.start_id and v.end_id
OPTION (MAXDOP 4);

Perfection:

a18_dmv_10

I know that you were looking forward to another image of a tiny table, but here’s the important part of the query plan for those who like that sort of thing:

a18_parallel_query_plan_1

Getting perfect rowgroups can also be accomplished by adding the TOP operator to the original query, but that adds a serial zone to the plan:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
WITH (TABLOCK) (ID)
SELECT TOP (9999999999999999) stg.ID
FROM dbo.STAGING_TABLE stg
OPTION (MAXDOP 4);

The key here is the parallelism operator in the plan uses a round robin method for distributing rows:

a18_parallel_query_plan_2

Dictionary Pressure

In SQL Server 2016 the maximum size for a column dictionary is 16 MB. This limit is raised in SQL Server 2017 for VARCHAR(MAX) and similar columns. I’m not going to get into the details of dictionaries here but it suffices to say that columns with too many unique string columns can experience dictionary pressure. Dictionary pressure leads to compressed rows that are less than the perfect size of 1048576 rows. Let’s insert the STR1 column into the CCI this time:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (STR1)
SELECT TOP (1048576) stg.STR1
FROM dbo.STAGING_TABLE stg
ORDER BY ID
OPTION (MAXDOP 1);

Due to dictionary pressure there’s now a delta store with about 73000 rows:

a18_dmv_11

We can see that the dictionary size for the column is close to the limit with the query below:

SELECT csd.entry_count, csd.on_disk_size
FROM sys.column_store_dictionaries csd
INNER JOIN sys.partitions p
    ON csd.partition_id = p.partition_id
INNER JOIN sys.tables t
    ON p.OBJECT_ID = t.OBJECT_ID
WHERE t.name = 'DELTA_STORE_DUMPING_GROUND'
AND csd.column_id = 2;

Here are the results:

a18_dict

Rowgroup Memory Pressure

The memory grant for CCI compression for an INSERT is calculated based on DOP and column definitions of target columns in the target table. The memory grant can be insufficient to get a full 1048576 rows into a compressed rowgroup depending on the table definition and the characteristics of the data getting loaded into the table. Consider an example in which data is loaded into three columns of the CCI:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
(STR1, STR2, STR3)
SELECT TOP (1048576)
  LEFT(STR1, 10)
, LEFT(STR1, 5)
, LEFT(STR1, 6)
FROM
dbo.STAGING_TABLE stg
ORDER BY ID
OPTION (MAXDOP 1);

With the above syntax the memory grant is calculated from just the STR1, STR2, and STR3 columns. The memory grant of 171152 KB isn’t enough to avoid a delta store:

a18_dmv_12

Note that you may not see the same results on your machine due to the randomness of the source data. For my table and source data set, adding a single column and inserting NULL into it bumps the memory grant up enough to avoid memory pressure:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

ALTER TABLE dbo.DELTA_STORE_DUMPING_GROUND
ADD MORE_MEMORY_PLZ VARCHAR(1) NULL;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND
(STR1, STR2, STR3, MORE_MEMORY_PLZ)
SELECT TOP (1048576)
  LEFT(STR1, 10)
, LEFT(STR1, 5)
, LEFT(STR1, 6)
, NULL
FROM
dbo.STAGING_TABLE stg
ORDER BY ID
OPTION (MAXDOP 1);

The compressed rowgroup contains 1048576 rows now that memory pressure has been addressed.

a18_dmv_13

Cardinality Estimate Less Than 251 Rows

SQL Server won’t even ask for a memory grant if the cardinality estimate is less than 251 rows. Perhaps this is because the memory grant would be wasted unless at least 102400 rows were inserted into the table. There’s no second chance at a memory grant here, so it’s possible to insert millions of rows to delta stores. A TOP expression with a variable will default to a cardinality estimate of 100 rows, so this works nicely to show the behavior:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

DECLARE @top_rows BIGINT = 1048576;
INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (@top_rows) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1);

Despite inserting 1048576 rows we aren’t able to bypass the delta store:

a18_dmv_14

The same behavior can be observed with a cardinality estimate of 250 rows. The OPTIMIZE FOR query hint is used to control the estimate.

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

DECLARE @top_rows BIGINT = 1048576;
INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (@top_rows) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1, OPTIMIZE FOR (@top_rows = 250));

However, if I bump up the estimate by one more row a memory grant is given to the query and the delta store is bypassed:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

DECLARE @top_rows BIGINT = 1048576;
INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (@top_rows) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1, OPTIMIZE FOR (@top_rows = 251));

a18_dmv_15

Under this scenario we’ve observed deadlocks when multiple sessions insert into delta stores from the same target table.

Extreme Server Memory Pressure

Memory grants for queries that insert into CCIs have a hardcoded timeout of 25 seconds. After 25 seconds they execute with required serial memory and always write to the delta store. In the query below I simulate memory pressure with a MAX_GRANT_PERCENT hint of 0:

TRUNCATE TABLE dbo.DELTA_STORE_DUMPING_GROUND;

INSERT INTO dbo.DELTA_STORE_DUMPING_GROUND (ID)
SELECT TOP (1048576) ID
FROM dbo.STAGING_TABLE
ORDER BY ID
OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);

The query always writes to the delta store. It cannot compress data without a memory grant.

a18_dmv_15

Under this scenario we’ve observed deadlocks when multiple sessions insert into delta stores from the same target table.

Final Thoughts

It took forever to do the formatting for this one, so I hope that someone finds it useful.