Columnstore Parallel Scan Row Distribution

Parallel rowstore scans are parallel “aware”. This makes them unlike most other operators which work independently on different threads. Columnstore indexes store data in a dramatically different way than rowstore objects, so perhaps we can expect differences in how rows are distributed among threads during a parallel scan. This blog post explores some of the observable behavior around row distribution for parallel columnstore scans.

Methods of Rowstore Scan Parallel Row Distribution

This will be an extremely brief summary of how SQL Server distributes rows among threads for rowstore parallel scans. A parallel page supplier sends pages to each thread on a demand basis. Threads may end up processing different numbers of pages for many reasons. If you like, you can read more about this here and here. In addition, there is some fancy behavior for partitioned tables. The documentation describes this behavior in SQL Server 2008 and it may have changed since then, but this is sufficient to set the stage. The important part is that SQL Server will in some cases give each relevant partition its own thread. In other cases, SQL Server will assign multiple threads to a single partition.

Methods of Columnstore Scan Parallel Row Distribution

I investigated tables with only compressed rowgroups. I did not consider delta stores because real CCIs don’t have delta stores. As far as I can tell, there are at least three different methods that SQL Server can use to assign rows to threads during a parallel CCI scan.

Rowgroup Parallelism

One method of distributing rows is to assign each thread to a relevant rowgroup. Two or more threads will not read rows from the same rowgroup. This strategy can be used if the cardinality estimate from the scan is sufficiently high compared to the DOP used by the query. Rowgroup level parallelism will be used if:

Cardinality Estimate >= MAXDOP * 1048576 * 0.5

To show this, let’s build a CCI with a single column integer that runs from 1 to 1048576 * 10. Naturally, the table will have ten compressed rowgroups of the maximum size:

CREATE TABLE dbo.CCI_SHOW_RG_PARALLELISM (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_SHOW_RG_PARALLELISM WITH (TABLOCK)
SELECT TOP (1048576 * 10) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

I ran my tests with a simple query that is likely to go parallel on my machine and that doesn’t qualify for aggregate pushdown. With a cardinality estimate of exactly 1000000 and MAXDOP of 2 all of the rows are sent to one thread:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 1000000
OPTION (MAXDOP 2);

From the actual plan:

a6_RG_1

If we reduce the cardinality estimate by one row, the rows are spread out more evenly on the two threads:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 999999
OPTION (MAXDOP 2);

From the actual plan:

a6_RG_2

Note that the cardinality estimate displayed in SSMS may be misleading. The first query has a displayed cardinality estimate of 2000000 rows, but it does not use rowgroup parallelism:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 1999999 -- query plans can round
OPTION (MAXDOP 4);

From the actual plan:

a6_RG_3

But this one does:

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= 2000000
OPTION (MAXDOP 4);

From the actual plan:

a6_RG_4

Of course, we can get the query that takes an aggregate of 1999999 rows to use rowgroup level parallelism by bumping up the estimate:

DECLARE @filter BIGINT = 1999999;

SELECT MAX(SQRT(ID))
FROM dbo.CCI_SHOW_RG_PARALLELISM
WHERE ID <= @filter
OPTION (MAXDOP 4);

Here the estimate is:

0.3 * 10485800 = 3145740.0

So we get the expected parallelism strategy:

a6_RG_5

We can show that the rowgroup parallelism strategy is demand-based by deliberately slowing down the required the thread that grabs the first rowgroup in the table. Here I’m defining first by the ID that’s returned when running a SELECT TOP 1 ID query against the table. On my machine I get an ID of 9437185. The following code will add significant processing time in the CROSS APPLY part for only the row with an ID of 9437185. Every other row simply does a Constant Scan and goes on its merry way:

SELECT COUNT(*)
FROM dbo.CCI_SHOW_RG_PARALLELISM o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 9437185

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 9437185
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
OPTION (MAXDOP 2);

Thread 1 processes nine rowgroups and waits on thread 2 to process its single rowgroup:

a6_RG_6

Split Rowgroup Parallelism

If the CCI has a small number of rows and the cardinality estimate is low enough you may see a different parallel row distribution strategy employed. I couldn’t think of a good name for this, but SQL Server splits up each rowgroup into roughly equal pieces and threads can process those pieces on a demand basis. The number of pieces seems to depend on the number of rows in the rowgroup instead of MAXDOP . For MAXDOP larger than 2 this behavior can be observed if a table has one or two rowgroups. For a MAXDOP of 2 this behavior can be observed if a table has exactly one rowgroup.

The formula for the number of pieces appears to be the number of rows in the rowgroup divided by 104857, rounded down, with a minimum of 1. The maximum rowgroup size of 1048576 implies a maximum number of pieces of 10 per rowgroup. Here’s a table to show all of the possibilities:

a6_table

We can see evidence of this behavior in SQL Server with a few tests. As before I’ll need to slow down one thread. First I’ll put 943713 integers into a single compressed rowgroup:

DROP TABLE IF EXISTS dbo.TEST_CCI_SMALL;

CREATE TABLE dbo.TEST_CCI_SMALL (
ID BIGINT NOT NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (943713) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

I will test with a very similar query to a previous test:

SELECT COUNT(*)
FROM dbo.TEST_CCI_SMALL o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 1

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 1
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
OPTION (QUERYTRACEON 8649, MAXDOP 4);

The parallel scan should be split into nine pieces to be divided among threads because the table has a single rowgroup with a row count of 943713. This is exactly what happens:

a6_small_1

If I truncate the table and load one fewer row, the scan is now split up into eight pieces:

a6_small_2

I can also create two compressed rowgroups of a size that should led to seven pieces per rowgroup:

DROP TABLE IF EXISTS dbo.TEST_CCI_SMALL;

CREATE TABLE dbo.TEST_CCI_SMALL (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (733999) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

INSERT INTO dbo.TEST_CCI_SMALL WITH (TABLOCK)
SELECT TOP (733999) 733999 + ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

Thread 1 processes the row with an ID of 734000 so it only gets one piece:

a6_small_3

With more than one rowgroup, the demand-based aspect of piece distribution doesn’t quite work in the same way as with a single rowgroup. I wasn’t able to work out all of the details.

A Third Way?

What about queries that do not meet either of the two above criteria? For example, a query against a not small CCI that has a low cardinality estimate coming out of the scan? In some cases SQL Server will use rowgroup level distribution. In other cases it appears to use a combination of the two methods described above. Most of the time the behavior can be described as each thread gets assigned an entire rowgroup and threads race for pieces of the remaining rowgroups. I wasn’t able to figure out exactly how SQL Server decides which method to use, despite running many tests. However, I will show most of the behavior that I observed. First put 7 rowgroups into a CCI:

DROP TABLE IF EXISTS dbo.MYSTERIOUS_CCI;

CREATE TABLE dbo.MYSTERIOUS_CCI (
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.MYSTERIOUS_CCI WITH (TABLOCK)
SELECT TOP (1048576 * 7) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

SELECT TOP 1 ID -- 6291457
FROM dbo.MYSTERIOUS_CCI;

I added local variables to my test query to lower the cardinality estimate. Otherwise I would get rowgroup distribution every time. Here’s an example query:

declare @lower_id INT = 1048576 * 2 + 1;
declare @upper_id INT = 1048576 * 7;

SELECT COUNT(*)
FROM dbo.MYSTERIOUS_CCI o
CROSS APPLY (
	SELECT TOP 1 1 c
	FROM
	(
		SELECT 1 c
		WHERE o.ID <> 6291457

		UNION ALL

		SELECT 1 c
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
		CROSS JOIN master..spt_values t3
		WHERE o.ID = 6291457
		ORDER BY (SELECT NULL)
			OFFSET 100000000 ROWS
			FETCH FIRST 1 ROW ONLY
	) t
) t2
WHERE o.ID BETWEEN @lower_id AND @upper_id
OPTION (QUERYTRACEON 8649, MAXDOP 3);

With a MAXDOP of 3 and five processed rowgroups I get rowgroup level parallelism:

a6_3rd_1

With a MAXDOP of 4 and five processed rowgroups, each thread gets a rowgroup and the other three threads race to process the remaining 20 pieces:

a6_3rd_2

With a MAXDOP of 3 and six processed rowgroups we no longer get rowgroup level parallelism:

a6_3rd_3

What about Partitioning?

In tests not reproduced here, I was not able to observe any differences in how rows were distributed for parallel CCI scans when the underlying CCI was partitioned. This seems reasonable if we think about how partitioning for CCIs is different than for rowstore tables. CCI partitions are simply a collection of rowgroups containing rows relevant to the partition. If there’s a need to split up the underlying components of a table we can just assign rowgroups to different threads. For rowstore tables, we can think of each partition as a mini-table. Partitioning for rowstore adds underlying objects which can be distributed to parallel threads.

Fun with Query Plans

With a partial understanding of how SQL server distributes rows after a parallel scan, we can write queries that show some of the edge cases that can lead to poor performance. The following query is complete nonsense but it shows the point well enough. Here I’m cross joining to a few numbers from the spt_values table. The CROSS JOIN is meant to represent other downstream work done by a query:

SELECT MAX(SQRT(o.ID + t.number))
FROM dbo.MYSTERIOUS_CCI o
CROSS JOIN master..spt_values t
WHERE o.ID <= 1100000 AND
t.number BETWEEN 0 AND 4
OPTION (MAXDOP 2, NO_PERFORMANCE_SPOOL, QUERYTRACEON 8649);

The cardinality estimate and MAXDOP of 2 leads to rowgroup level parallelism being used. Unfortunately, this is very unbalanced:

a6_fun_1

And as a result, the query barely benefits from parallelism:

CPU time = 25578 ms, elapsed time = 24580 ms.

It’s a well-guarded secret that queries run faster with odd MAXDOP , so let’s try a MAXDOP of 3. Now my parallel threads no longer sleep on the job:

CPU time = 30922 ms, elapsed time = 11840 ms.

Here’s the row distribution from the scan:

a6_fun_2

Final Thoughts

This post explored a bit of the observable behavior for how rows are distributed to threads after a parallel columnstore index scan. The algorithms used can in some cases lead to row imbalance on threads which can cause performance issues downstream in the plan. The lack of repartition stream operators in batch mode can make this problem worse than it might be for rowstore scans, but I still expect issues caused by it to be relatively uncommon in practice. Thanks for reading!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s