What To Do When Wait Stats Don’t Help

Some SQL Server workloads are slow even though there aren’t any hints in the wait stats that suggest ways to make them go faster. This blog post works through a columnstore example of such a workload.

The Test Setup

The test server has 96 physical cores. The test workload has many concurrent writers to the same columnstore target table. The source data is a simple staging table with ten columns. Each column contains sequential integers from 1 to 10485760. Here is T-SQL to create and populate the tables:

DROP TABLE IF EXISTS dbo.RS_SOURCE_10_COLUMN_SIMPLE;

CREATE TABLE dbo.RS_SOURCE_10_COLUMN_SIMPLE (
    ID1 BIGINT,
    ID2 BIGINT,
    ID3 BIGINT,
    ID4 BIGINT,
    ID5 BIGINT,
    ID6 BIGINT,
    ID7 BIGINT,
    ID8 BIGINT,
    ID9 BIGINT,
    ID10 BIGINT
);

INSERT INTO dbo.RS_SOURCE_10_COLUMN_SIMPLE
	WITH (TABLOCK)
SELECT q.ID, q.ID, q.ID, q.ID, q.ID
, q.ID, q.ID, q.ID, q.ID, q.ID
FROM
(
       SELECT TOP (10485760) ROW_NUMBER()
             OVER (ORDER BY (SELECT NULL)) ID
       FROM master..spt_values t1
       CROSS JOIN master..spt_values t2
       CROSS JOIN master..spt_values t3
) q
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.CCI_TARGET_10_COLUMN;

CREATE TABLE dbo.CCI_TARGET_10_COLUMN (
    ID1 BIGINT,
    ID2 BIGINT,
    ID3 BIGINT,
    ID4 BIGINT,
    ID5 BIGINT,
    ID6 BIGINT,
    ID7 BIGINT,
    ID8 BIGINT,
    ID9 BIGINT,
    ID10 BIGINT,
    INDEX CCI0 CLUSTERED COLUMNSTORE
);

Concurrency is controlled by changing the number of stored procedures kicked off by a SQLCMD batch file. Each stored procedure does a single MAXDOP 1 insert from the staging table into the columnstore table. This test is different from the previous columnstore insert test documented here in that all of the sessions write to the same table and the source data gets significantly worse compression. In addition, each stored procedure only does a single insert, so the total overall work done increases as the number of concurrent sessions increases. If I run a single insert like the following:

INSERT INTO dbo.CCI_TARGET_10_COLUMN
SELECT *
FROM dbo.RS_SOURCE_10_COLUMN_SIMPLE
OPTION (MAXDOP 1);

That takes about 25 seconds to complete on the test server. It would be ideal if the average run time of the stored procedure stays at around 25 seconds as concurrency increases.

Test Results

We are not in an ideal world. The average run time of each stored procedure execution quickly increases with the number of concurrent sessions:

a34_scale_table

It may be slightly unreasonable to expect 96 concurrent sessions to scale perfectly well and to end up with an average run time of 24 seconds. With that said, the observed overhead is almost 4X at that scale. Let’s start by looking at wait stats:

a34_waits_for_LPIM_96

Most of the entries are related to memory. SOS_PHYS_PAGE_CACHE shows up because LPIM is enabled. This might be a helpful clue, but the waits only add up to about 10 seconds of waits per second. On a server with 96 cores that alone cannot explain a 4X degradation in scalability.

There is nothing helpful in latches:

a34_latches_for_LPIM_96

A number of spinlock collisions occurred during the workload:

a34_spins_for_LPIM_96

This is entirely a judgment call on my part, but based on my experience on this server and with similar workloads I don’t find those spinlock numbers to be at all remarkable. With that said, one avenue of investigation would be to dig into those spinlocks more.

From my point of view, the most important clue is that the server experiences high CPU throughout the workload according to SQL Server (using sys.dm_resource_governor_resource_pools), resource monitor, and perfmon. Perfmon also reports high privileged time:

a34_perfmon

SQL Server appears to be doing something which requires significantly more CPU at higher concurrency. Therefore, it makes perfect sense that the wait stats and latch stats alone aren’t very helpful. It’s reasonable to expect most wait types and latch classes to result in low CPU during their waits as opposed to high CPU.

ETW Tracing

ETW tracing can be used to diagnose CPU issues on servers. I’m still pretty new at it, but I use PerfView. In particular, what I really want to see is sampled call stacks to get an idea as to where CPU is being spent. I took a ten second sample during the 96 concurrent session workload and grouped by module name:

a34_perfview_LPIM_modules

The circled modules are typically what I see in SQL Server call stacks. SQL Server is responsible for the majority of CPU work on the server, but the majority of the work is done in the ntoskrnl and hal modules. I have it on good authority that CPU spent in that context will be tracked as privileged time in perfmon, so this roughly lines up with the perfmon graph previously shown.

A different view of the data reveals that most of the work is done as part of kernalbase!MapuserPhysicalPages:

a34_perfview_LPIM_just_my_app

In addition, we can see methods that call kernalbase!MapuserPhysicalPages:

a34_perfview_LPIM_methods_call_mapPhysicalPages

Presenting the data in a different way, we can see what was called by sqldk!SOS_MemoryWorkSpace::FreePage:

a34_perfview_LPIM_callee_free_Page

It appears that most of the CPU during the workload is spent returning memory from SQL Server to the OS. Backing that up, the memory commit for SQL Server was observed to decrease during the workload through resource monitor, even though there wasn’t any pressure from the OS and the total commit size was only at about 50% of max server memory.

For this workload, we are lucky in that important call stack names can be tied back to something in SQL Server. For example, sqldk!SOS_MemoryWorkSpace::FreePage would appear to be associated with freeing pages and that can be tracked from within SQL Server using extended events.

Extending The Investigation

The page_allocated and page_freed debug extended events are helpful in understanding how memory is allocated and freed during a query’s execution. First I started with a histogram target on memory_clerk_name for the page_freed event page_freed filtered to a single session. Here’s what that looks like:

a34_histogram

Nearly all of the free page events are for the MEMORYCLERK_SQLQERESERVATIONS clerk. I believe this clerk is responsible for granted query memory so this makes sense. Next we can turn on causality tracking and include the page_allocated event as well. Here’s a screenshot of some of the data for a simple CCI insert that processes three rowgroups in total:

a34_XE_picture

There are many different patterns in the data. Sometimes there is a series of free events and sometimes there is a series of allocation events. They can also alternate as shown in the picture above. The number of pages freed or allocated is usually one plus a power of two. I saved the data to a table and calculated a running total of current memory allocations for the query. The running total increased whenever pages were allocated and decreased whenever pages were freed. I’m really phoning it in with this graph, but here are the results:

a34_running_total_graph

The x-axis is time and the y-axis is total pages currently allocated to the query. This query compresses three rowgroups in total. That’s why the pattern repeats three times. Other than that, the only thing I was able to tie back to the compression algorithm was that pages are rapidly freed as segments are finished:

a34_segment_deallocations

In summary, page allocations and page free events rapidly occur, sometimes in an alternating pattern. SQL Server will often free a number of pages just to immediately request allocations for a similar number of pages. If all of the free page events result in returned memory to the OS then the reason for the scalability bottleneck becomes clear. When running the full workaround with 96 concurrent sessions, a total of 341965 page freed operations were performed. Those events freed about 71.3 million pages in total. That amounts to about 584 GB of memory returned to the OS in total, based on the previous assumptions.

It seemed bizarre to me for SQL Server to return so much memory to the OS during columnstore compression even though it would very likely immediately need more memory for compression. A colleague suggested that this could be a deliberate design decision by Microsoft to reduce the likelihood of out-of-memory errors when allocating memory above the max server memory setting:

“Starting with SQL Server 2012 (11.x), SQL Server might allocate more memory than the value specified in the max server memory setting. This behavior may occur when the Total Server Memory (KB) value has already reached the Target Server Memory (KB) setting (as specified by max server memory). If there is insufficient contiguous free memory to meet the demand of multi-page memory requests (more than 8 KB) because of memory fragmentation, SQL Server can perform over-commitment instead of rejecting the memory request.

This behavior is typically observed during the following operations:

Large Columnstore index queries.
Columnstore index (re)builds, which use large volumes of memory to perform Hash and Sort operations.”

What is Normal?

All of the above observations around memory allocations and freed memory are interesting but some additional context is needed. I suspect that it doesn’t, but it’s possible that a traditional row mode query would show the same pattern of allocated and freed pages. To investigate that, I created a row mode query with a memory consuming operator that processes a result set of rows, releases the rows, and continues to do so in a loop. The query plan looks like this:

a34_query_plan

Note the Hash Match (Aggregate) operator on the inner side of the nested loop. For each row in the DRIVE_ME table, SQL Server will send nearly all of the rows from DISTINCT_ONE_COL into the hash match operator. The reason that I went for this type of query is that it felt like the most similar pattern to columnstore compression in that compression processes up to 1048576 rows per thread in what’s effectively a loop. For those following along at home here is T-SQL to generate that query plan:

drop table if exists DRIVE_ME;

create table DRIVE_ME (ID INT);

INSERT INTO DRIVE_ME VALUES
(1), (2), (3), (4), (5), (6), (7),
(8), (9), (10), (11), (12), (13), (14);

drop table if exists DISTINCT_ONE_COL;

create table DISTINCT_ONE_COL (ID INT);

INSERT INTO DISTINCT_ONE_COL WITH (TABLOCK)
SELECT TOP (10000000) ROW_NUMBER()
      OVER (ORDER BY (SELECT NULL)) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1)

GO

CREATE OR ALTER PROCEDURE
dbo.REPEATED_ROW_MODE_ALLOC AS
BEGIN
SET NOCOUNT ON;
SELECT ca.cnt
FROM DRIVE_ME d
OUTER APPLY
(
	SELECT COUNT(DISTINCT ID)
	FROM DISTINCT_ONE_COL
	WHERE ID > d.ID
) ca (cnt)
OPTION (MIN_GRANT_PERCENT = 2.5)
END;

On the test server executing a single query took around 90 seconds. That single query execution frees around 760000 memory pages in total. Memory is allocated and freed in cycles. There isn’t a pattern where allocation and free events trade off between each other.

The workload does very well at scale. With 96 concurrent sessions the average stored procedure execution remains at around 90 seconds. We’re freeing the same amount of pages in total as the columnstore insert workload without seeing a drop in scalability. We can again use PerfView to analyze the workload by modules:

a34_perfview_rowstore_modules

This time most of the CPU is spent on SQL Server managed code. Drilling down, the operation that uses the most CPU appears to involving inserting rows into a hash table that’s in a query’s memory grant:

a34_perfview_rowstore_biggest_cpu

There isn’t necessarily anything wrong with this. After all, something needs to use the most CPU during the workload and we do achieve very good scalability. To me, the most important part is how call stacks are different when looking at everything called by sqldk!SOS_MemoryWorkSpace::FreePage:

a34_perfview_rowstore_callee_free_Page

Within SQLOS we call “FreeBlock” instead of “DecommitBlock”. The windows OS calls are significantly different compared to columnstore compression and I wasn’t able to observe the memory commit for SQL Server changing during the workload. I can’t say for sure, but it looks like when memory is freed for memory-consuming row mode operators it is returned to SQL Server as opposed to returned to the OS. That makes a big difference in scalability.

I wanted to test with batch mode memory consuming operators, but couldn’t figure out a query pattern that resulted in the number of processed rows getting constantly reset for an operator that didn’t involve UDFs. I will say that I observed memory getting returned to the OS during batch mode queries, but I did not immediately see the performance bottleneck around freeing the pages. It’s possible that with more work a demo could be created to show this bottleneck. It’s also possible that the different pattern of freed pages means that it won’t realistically happen. In my limited testing I did not see batch mode operators free pages before they had done all of their work, which is significantly different from the columnstore compression pattern of allocating and freeing pages.

Changing the Memory Model

We know that the problem has to do with memory deallocation, so testing with the conventional and large pages memory models may be fruitful. The workload significantly degraded without lock pages in memory. The average insert time was about 368 seconds with 96 concurrent sessions. 91% of the workload time can be tied back to memory related waits:

a34_convential_waits

At first glance, this workload has a problem with memory allocation as opposed to freeing memory pages. It would be possible to use techniques within SQL Server to further drill into the scalability bottleneck, but I will not do that analysis here.

Enabling large pages for the buffer pool via TF 834 does not resolve this bottleneck. The average insert completion time is nearly identical to the LPIM test and I did not observe any significant differences in PerfView results. TF 834 is helpful with some columnstore workloads but it does not help with this one. Keep in mind that TF 834 is not supported with columnstore according to Microsoft.

Waiting for a Wait Type

I’m not a cloud guy, but I wonder if it would even be possible to troubleshoot an issue like this if the investigator doesn’t have access to the underlying OS or the ability to install ETW tracing tools. It would be helpful if Microsoft was able to capture and present this bottleneck in SQL Server. For example, a new wait type could be defined named something like “RESERVED_MEMORY_FREE_EXT” with the timer starting after sqldk!SOS_MemoryWorkSpace::DeCommitBlock and ending after the OS owned code finishes its work. A defined wait type would make it much more obvious that there’s a problem and allow affected end users to easily search for guidance or workarounds for the issue. This would be a cpu busy external wait. I don’t know if defining a wait like this is in line with Microsoft’s general philosophy for designing wait types, but it’s just an idea.

If the bottleneck in the OS truly can’t be solved then perhaps it could be protected by a spinlock within SQL Server. That spinlock might not accomplish anything other than letting someone who only has access to SQL Server know that there’s a problem. Again I don’t know if that’s an appropriate use of a spinlock. I’m out of my depth here when it comes to suggestions like these, but I do think there is a need to classify this bottleneck. Of course, Microsoft making code changes in SQL Server or even in Windows Server to resolve the bottleneck would be preferred.

Final Thoughts

This blog post covered the investigation of a slow workload for which wait stats were not helpful in diagnosing the problem. In summary:

  1. ETW tracing and extended events revealed a bottleneck related to returning memory to the OS from SQL Server. This bottleneck can occur when loading data into columnstore tables at reasonably high concurrency.
  2. Lock pages in memory can have a dramatic effect on workload performance even when paging to disk is not occurring. This must be a gross oversimplification, but with LPIM SQL Server uses different OS calls to manage memory and the difference in those calls can lead to scalability issues with and without LPIM. For example, highly concurrent workloads that call HASHBYTES have been observed to get better throughput without LPIM.
  3. For columnstore workloads on SQL Server 2017, I recommend the use of LPIM unless there is a very good reason not to use it. I have seen LPIM cause performance issues on SQL Server 2016 but have not fully sourced those issues. LPIM still may be the right starting point on SQL Server 2016.
  4. The investigation did not have a happy ending in that it isn’t clear how to resolve the bottleneck except by lowering concurrency. However, the work done here does provide a helpful foundation for reaching out to a VM or Windows admin or for opening a support ticket with Microsoft.

Thanks for reading!

Can Rowstore Compression Beat Columnstore Compression?

Columnstore has quite a few different tricks for compressing data. This blog post explores if it’s possible for a rowstore table to beat columnstore compression, even in the best case scenario for the columnstore table (no delta stores and rowgropus of the maximum size).

Page Compression and String Data

Everybody knows that strings aren’t the best fit for columnstore tables. Let’s start by puting sequential integers from 1 to 1048576 into a page compressed table:

DROP TABLE IF EXISTS dbo.RS_VARCHAR33;
GO
CREATE TABLE dbo.RS_VARCHAR33 (
	ID1 VARCHAR(33)
) WITH (DATA_COMPRESSION = PAGE);

INSERT INTO dbo.RS_VARCHAR33 WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

According to sp_spaceused, the table takes up 11464 KB of space. Next load the same data into a columnstore table:

DROP TABLE IF EXISTS dbo.CCI_VARCHAR33;
GO
CREATE TABLE dbo.CCI_VARCHAR33 (
	ID1 VARCHAR(33),
	INDEX C CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_VARCHAR33 WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

The CCI takes up 14408 KB of space. The majority of the space is used for the dictionary. The follow query returns 11495348 bytes:

SELECT csd.on_disk_size
FROM sys.column_store_dictionaries csd
INNER JOIN sys.partitions sp
	ON csd.partition_id = sp.partition_id
WHERE sp.[object_id] = OBJECT_ID('CCI_VARCHAR33');

Data types matter here. If I change the CCI to have a VARCHAR(32) column instead then the table only requires 1800 KB of space. A dictionary is not created in this case. A takeaway is that SQL Server may choose to create a dictionary based on the length of the defined string column. Data sets with many unique strings may require a relatively large amount of space for the dictionary, even to the point where page compressed data can have overall a lower footprint on the database.

Page Compression and Non-string Data

It’s certainly more difficult to come up with a demo that works without string columns, but consider how the page compression algorithm works. Data can be compressed on page basis, which includes both multiple rows and multiple columns. That means that page compression can achieve a higher compression ratio when a row has identical values in different columns. Columnstore is only able to compress on an individual column basis and you won’t directly see better compression with repeated values in different columns for a single row (as far as I know).

The table defined and populated below requires 11912 KB of disk space:

DROP TABLE IF EXISTS dbo.RS_4_COLUMN;
GO
CREATE TABLE dbo.RS_4_COLUMN (
	ID1 BIGINT,
	ID2 BIGINT,
	ID3 BIGINT,
	ID4 BIGINT
) WITH (DATA_COMPRESSION = PAGE);

INSERT INTO dbo.RS_4_COLUMN WITH (TABLOCK)
SELECT ID, ID, ID, ID
FROM (
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) / 10 ID
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

If I rerun the code with five columns, the table now takes up 13000 KB of space. That’s only an increase of 9% in space to hold 25% more data. The same data loaded into a columnstore table with four columns takes 11272 KB of space. Adding one column results in a total size of 14088 KB, which is almost exactly a 25% increase. For this data set with five columns, page compression across rows is more efficient than standard columnstore compression.

Row Compression and Non-string Data

Row compression doesn’t allow for compression benefits from storing the same value in multiple columns. Is it possible to beat columnstore compression with row compression without string columns? You betcha!

DROP TABLE IF EXISTS #BATCH_MODE;
CREATE TABLE #BATCH_MODE (
	ID INT,
	INDEX C CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS dbo.RS_2_COLUMN_ROW_COMPRESSED;
GO
CREATE TABLE dbo.RS_2_COLUMN_ROW_COMPRESSED (
	ID1 BIGINT,
	ID2 BIGINT
) WITH (DATA_COMPRESSION = ROW);

INSERT INTO dbo.RS_2_COLUMN_ROW_COMPRESSED WITH (TABLOCK)
SELECT
  SUM(q.ID) OVER (ORDER BY q.ID)
, SUM(q.ID) OVER (ORDER BY q.ID)
FROM
(
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) ID
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) q
LEFT OUTER JOIN #BATCH_MODE ON 1 = 0
OPTION (MAXDOP 1);

EXEC sp_spaceused 'RS_2_COLUMN_ROW_COMPRESSED';

DROP TABLE IF EXISTS #BATCH_MODE;
CREATE TABLE #BATCH_MODE (
	ID INT,
	INDEX C CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS dbo.CCI_2_COLUMN;
GO
CREATE TABLE dbo.CCI_2_COLUMN (
	ID1 BIGINT,
	ID2 BIGINT,
	INDEX C CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_2_COLUMN WITH (TABLOCK)
SELECT
  SUM(q.ID) OVER (ORDER BY q.ID)
, SUM(q.ID) OVER (ORDER BY q.ID)
FROM
(
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) ID
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) q
LEFT OUTER JOIN #BATCH_MODE ON 1 = 0
OPTION (MAXDOP 1);

EXEC sp_spaceused 'CCI_2_COLUMN';

In this example, the rowstore table has a total size of 15496 KB but the columnstore has a total size of 16840 KB. I’m honestly not sure why this happens. I did try to pick the unfriendliest data set to columnstore compression that I could: no repeated values and no obvious patterns in data. Perhaps there’s some additional overhead of compression that pushes it over the row compressed data.

Final Thoughts

Columnstore doesn’t guarantee better compression than rowstore, even with perfectly sized rowgroups. Rowstore can provide better compression ratios for string columns, tables with repeated values across columns, and in other uncommon scenarios. A summary of test results is here:

a33_summary

The differences are small per rowgroup, but they can add up as more columns and more rows are added to the tables. I’ve seen tables in the real world that ended up bigger as columnstore compared to rowstore, which is what inspired me to look for some of these examples. Thanks for reading!

The Return of RESERVED_MEMORY_ALLOCATION_EXT

It is possible to see a scalability bottleneck in the form of high average wait time for the RESERVED_MEMORY_ALLOCATION_EXT wait if a highly concurrent workload is run on a server that consumes memory with batch mode operators. I believe that the severity of the bottleneck depends on unknown factors in the server’s initial memory state and the rate of memory actually used by queries to run batch mode operations. This blog post shares a reproduction of the issue along with a call to action.

The Test Query

Data prep for the test query is very simple. We throw the first ten million sequential integers into two single column BIGINT tables. Additionally, an empty CCI is created to add eligibility for batch mode as desired:

DROP TABLE IF EXISTS DUMMY_CCI;
CREATE TABLE DUMMY_CCI (
	ID INT,
	INDEX CCI CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS HASH_JOIN_SCALE_TEST_1;
CREATE TABLE HASH_JOIN_SCALE_TEST_1 (
	ID BIGINT NOT NULL
);

INSERT INTO HASH_JOIN_SCALE_TEST_1 WITH (TABLOCK)
SELECT TOP (10000000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS HASH_JOIN_SCALE_TEST_2;
CREATE TABLE HASH_JOIN_SCALE_TEST_2 (
ID BIGINT NOT NULL
);

INSERT INTO HASH_JOIN_SCALE_TEST_2 WITH (TABLOCK)
SELECT TOP (10000000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

The SELECT query that we’re going to test with joins them together and returns a count of distinct IDs:

SELECT COUNT(DISTINCT t1.ID)
FROM dbo.HASH_JOIN_SCALE_TEST_1 t1
LEFT OUTER JOIN dbo.HASH_JOIN_SCALE_TEST_2 t2
	ON t1.ID = t2.ID
OPTION (MAXDOP 1);

This is certainly an odd thing to do, but the point of this query is to create a simple test case that uses a bit of memory to create hash tables. Without any indexes on the table we’re very likely to get a hash join. In row mode the query takes about 17 seconds on the test server and uses 1257528 KB of memory.

As far as I can tell this query is a good fit for batch mode. The join is on BIGINT columns and everything except for the rowstore scans can execute in batch mode:

a28_query_plan

Even with the hidden conversions of 20 million rows from row mode to batch mode the query finishes in about 5 seconds. It uses significantly less memory compared to the row mode query: 386112 KB. All of this seems very reasonable.

Testing at Scale

The situation becomes less reasonable once there are many concurrent queries running at once. We go from the batch mode query running in a third of the time of row mode to sometimes running slower. Testing was done with 96 concurrent queries each running queries until 576 total queries were completed. Below is a graph of OS CPU and OS privileged time CPU based on perfmon data:

a28_cpu_graphs

The first third of the graph corresponds to the batch mode workload. CPU never hits 100% and is extremely variable during the run. We also see a high amount of privileged time in the OS. Within SQL Server, we use a total of 222,400,512 KB of query memory. The total wait time for RESERVED_MEMORY_ALLOCATION_EXT can vary, but for this one it was around 11.5 million ms.

The second third of the graph is the part that resets the server memory state. As stated in the introduction, this scalability bottleneck doesn’t always happen and seems to have some kind of dependency on the server’s initial memory state. Here we do a SELECT COUNT(*) on a 1 TB table to add pressure to the rowstore buffer pool.

The final third of the graph corresponds to the row mode workload. CPU jumps to near 100% very quickly and doesn’t waver much during the run. There is very little privileged time in the OS. Within SQL Server, we use a total of 724,336,128 KB of query memory. The wait time for RESERVED_MEMORY_ALLOCATION_EXT is always significantly reduced compared to batch mode. For this run it was around 107k ms.

In conclusion, we see about 1% of memory wait time with the row mode workload compared to the batch mode workload. Usually the row mode workload finishes a little faster than the batch mode one, despite needing over 3X as much total memory and the per-query efficiency advantage of batch mode.

Speculation on the Problem

There may be a hint in the memory management architecture guide:

Starting with SQL Server 2012, SQL Server might allocate more memory than the value specified in the max server memory setting. This behavior may occur when the Total Server Memory (KB) value has already reached the Target Server Memory (KB) setting (as specified by max server memory). If there is insufficient contiguous free memory to meet the demand of multi-page memory requests (more than 8 KB) because of memory fragmentation, SQL Server can perform over-commitment instead of rejecting the memory request.

This behavior is typically observed during the following operations: Large Columnstore index queries.

There are observable differences in how memory is allocated for the same operator when switching from batch mode to row mode. The number of wait events for RESERVED_MEMORY_ALLOCATION_EXT is significantly higher for row mode compared to batch mode. In addition, it remains nearly constant when TF 834 is enabled:

a28_memory_waits

TF 834 appears to fully resolve the scalability issue with batch mode, just like last time. It offers a dramatic improvement in workload completion time:

a28_summary_of_timings

Perhaps some batch mode operators require contiguous free memory for their allocations but the same operation done through row mode does not have that restriction. I can’t go any further than this because I don’t know anything about OS internals or memory management, but it could explain why we see such different behavior between row mode and batch mode.

Use Your Voice

Since SQL Server 2012, Microsoft has listed TF 834 as incompatible with columnstore. There are lots of scary consequences:

  • A non-yielding scheduler error and associated memory dumps in the SQL Server Error log.
  • Columnstore queries trigger severe performance issues.
  • A SQL Server instance triggers access violations when you execute Columnstore queries.
  • You encounter the following error when you run sp_createstats

However, I’ve been told that many of the SQL Server columnstore benchmarks for the TPC-H benchmarks use TF 834. It seems to be a natural fit for columnstore, and there are some workloads (seen in the lab and reproduced in the last two blog posts) where it resolves a key scalability bottleneck which is difficult to address through other means. I have created a feedback item on UserVoice requesting that Microsoft support TF 834 with columnstore.

Final Thoughts

Please provide feedback on the UserVoice item if you are able to do so. Thanks for reading!

Large CCI ETLs Cannot Scale Without TF 834

Large servers may experience a scalability bottleneck related to the RESERVED_MEMORY_ALLOCATION_EXT wait event during loading of columnstore tables. This blog post shares a reproduction of the issue and discusses some test results.

The Test Server

Testing was done on a four socket bare metal server with 24 cores per socket. The server had 1 TB of RAM and storage was provided by a SAN. Within SQL Server, we were able to read data at a peak rate of about 5.5 GB/s. Hyperthreading was disabled, but there aren’t any other nonstandard OS configuration settings that I’m aware of.

SQL Server 2016 SP1 CU7 was installed on Windows Server 2016. Most default settings were retained for this testing, including allowing auto soft-NUMA to break up the 96 schedulers into 12 groups of 8, with 3 per memory node. Max server memory was set to around 800000 MB. The user database had 24 data files, indirect checkpoints weren’t used, and the memory model in use was conventional. No interesting trace flags were enabled, except perhaps for TF 4199. I did grow max server memory to close to the maximum before starting any of the tests.

Testing Code and Method

The workload test loaded the same data from a SQL Server table into 576 CCI target tables. For my testing I used a source table of one million rows. Each session grabs a number from a sequence, loads data into the table corresponding to the sequence number, and continues to do that until there are no more tables to process. This may seem like an odd test to run, but think of it as an abstract representation of a columnstore ETL workload which loads data into partitioned CCIs.

The first step is to define the source table. I created a 50 column table that stored only 0s in all of its rows. There’s nothing particularly special about this choice other that it has a very low disk footprint as a CCI because it compresses so well. On a non-busy system it took around 7.3 seconds to insert one million rows into a CCI. Below is the table definition:

DROP TABLE IF EXISTS CCI_SOURCE;

CREATE TABLE CCI_SOURCE (
ID1 SMALLINT,
ID2 SMALLINT,
ID3 SMALLINT,
ID4 SMALLINT,
ID5 SMALLINT,
ID6 SMALLINT,
ID7 SMALLINT,
ID8 SMALLINT,
ID9 SMALLINT,
ID10 SMALLINT,
ID11 SMALLINT,
ID12 SMALLINT,
ID13 SMALLINT,
ID14 SMALLINT,
ID15 SMALLINT,
ID16 SMALLINT,
ID17 SMALLINT,
ID18 SMALLINT,
ID19 SMALLINT,
ID20 SMALLINT,
ID21 SMALLINT,
ID22 SMALLINT,
ID23 SMALLINT,
ID24 SMALLINT,
ID25 SMALLINT,
ID26 SMALLINT,
ID27 SMALLINT,
ID28 SMALLINT,
ID29 SMALLINT,
ID30 SMALLINT,
ID31 SMALLINT,
ID32 SMALLINT,
ID33 SMALLINT,
ID34 SMALLINT,
ID35 SMALLINT,
ID36 SMALLINT,
ID37 SMALLINT,
ID38 SMALLINT,
ID39 SMALLINT,
ID40 SMALLINT,
ID41 SMALLINT,
ID42 SMALLINT,
ID43 SMALLINT,
ID44 SMALLINT,
ID45 SMALLINT,
ID46 SMALLINT,
ID47 SMALLINT,
ID48 SMALLINT,
ID49 SMALLINT,
ID50 SMALLINT
);

INSERT INTO CCI_SOURCE WITH (TABLOCK)
SELECT
 ID,ID,ID,ID,ID,ID,ID,ID,ID,ID
,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID
,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID
,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID
,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID
FROM (
	SELECT TOP (1000000) 0 ID
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

Next we need a stored procedure that can save off previous test results (as desired) and reset the server for the next test. If a test name is passed in then results are saved to CCI_TEST_RESULTS and CCI_TEST_WAIT_STATS from the previous run. The procedure always recreates all of the target CCI tables, resets the sequence, clears the buffer pool, and does a few other things.

CREATE OR ALTER PROCEDURE [dbo].[CCI_TEST_RESET] (@PreviousTestName NVARCHAR(100) = NULL, @DebugMe INT = 0)
AS
BEGIN
	DECLARE
	@tablename SYSNAME,
	@table_number INT = 1,
	@SQLToExecute NVARCHAR(4000);

	/*
CREATE TABLE CCI_TEST_WAIT_STATS (
	TEST_NAME NVARCHAR(100),
	WAIT_TYPE NVARCHAR(60),
	WAITING_TASKS_COUNT BIGINT,
	WAIT_TIME_MS BIGINT,
	MAX_WAIT_TIME_MS BIGINT,
	SIGNAL_WAIT_TIME_MS BIGINT
);

CREATE TABLE CCI_TEST_RESULTS (
	TEST_NAME NVARCHAR(100),
	TOTAL_SESSION_COUNT SMALLINT,
	TEST_DURATION INT,
	TOTAL_WORK_TIME INT,
	BEST_TABLE_TIME INT,
	WORST_TABLE_TIME INT,
	TOTAL_TABLES_PROCESSED SMALLINT,
	MIN_TABLES_PROCESSED SMALLINT,
	MAX_TABLES_PROCESSED SMALLINT
);
	*/

	SET NOCOUNT ON;

	IF @DebugMe = 0
	BEGIN
		DROP SEQUENCE IF EXISTS CCI_PARALLEL_TEST_SEQ;
		CREATE SEQUENCE CCI_PARALLEL_TEST_SEQ
			AS SMALLINT
			START WITH 1
			INCREMENT BY 1
			CACHE 600;

		IF @PreviousTestName  N''
		BEGIN
			INSERT INTO CCI_TEST_RESULTS WITH (TABLOCK)
			SELECT @PreviousTestName,
			  COUNT(*) TOTAL_SESSION_COUNT
			, DATEDIFF(MILLISECOND, MIN(MIN_START_TIME), MAX(MAX_END_TIME)) TEST_DURATION
			, SUM(TOTAL_SESSION_TIME) TOTAL_WORK_TIME
			, MIN(MIN_TABLE_TIME) BEST_TABLE_TIME
			, MAX(MAX_TABLE_TABLE) WORST_TABLE_TIME
			, SUM(CNT) TOTAL_TABLES_PROCESSED
			, MIN(CNT) MIN_TABLES_PROCESSED
			, MAX(CNT) MAX_TABLES_PROCESSED
			FROM (
				SELECT
				  SESSION_ID
				, COUNT(*) CNT
				, SUM(DATEDIFF(MILLISECOND, START_TIME, END_TIME)) TOTAL_SESSION_TIME
				, MIN(DATEDIFF(MILLISECOND, START_TIME, END_TIME)) MIN_TABLE_TIME
				, MAX(DATEDIFF(MILLISECOND, START_TIME, END_TIME)) MAX_TABLE_TABLE
				, MIN(START_TIME) MIN_START_TIME
				, MAX(END_TIME) MAX_END_TIME
				FROM CCI_TEST_LOGGING_TABLE
				GROUP BY SESSION_ID
			) t;

			INSERT INTO CCI_TEST_WAIT_STATS WITH (TABLOCK)
			SELECT @PreviousTestName, * FROM sys.dm_os_wait_stats
			WHERE wait_type IN (
				'RESERVED_MEMORY_ALLOCATION_EXT'
				, 'SOS_SCHEDULER_YIELD'
				, 'PAGEIOLATCH_EX'
				, 'MEMORY_ALLOCATION_EXT'
				, 'PAGELATCH_UP'
				, 'PAGEIOLATCH_SH'
				, 'WRITELOG'
				, 'LATCH_EX'
				, 'PAGELATCH_EX'
				, 'PAGELATCH_SH'
				, 'CMEMTHREAD'
				, 'LATCH_SH'
			);

		END;

		DROP TABLE IF EXISTS CCI_TEST_LOGGING_TABLE;
		CREATE TABLE CCI_TEST_LOGGING_TABLE (
			SESSION_ID INT,
			TABLE_NUMBER INT,
			START_TIME DATETIME,
			END_TIME DATETIME
		);

	END;

	WHILE @table_number BETWEEN 0 AND 576
	BEGIN
		SET @tablename = N'CCI_PARALLEL_RPT_TARGET_' + CAST(@table_number AS NVARCHAR(3));
		SET @SQLToExecute= N'DROP TABLE IF EXISTS ' + QUOTENAME(@tablename);
		IF @DebugMe = 1
		BEGIN
			PRINT @SQLToExecute;
		END
		ELSE
		BEGIN
			EXEC (@SQLToExecute);
		END;

		SET @SQLToExecute = N'SELECT * INTO ' + QUOTENAME(@tablename) +
		' FROM CCI_SOURCE WHERE 1 = 0';
		IF @DebugMe = 1
		BEGIN
			PRINT @SQLToExecute;
		END
		ELSE
		BEGIN
			EXEC (@SQLToExecute);
		END;

		SET @SQLToExecute = N'CREATE CLUSTERED COLUMNSTORE INDEX CCI ON ' + QUOTENAME(@tablename);
		IF @DebugMe = 1
		BEGIN
			PRINT @SQLToExecute;
		END
		ELSE
		BEGIN
			EXEC (@SQLToExecute);
		END;

		SET @table_number = @table_number + 1;
	END;

	IF @DebugMe = 0
	BEGIN
		DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
		DBCC FREEPROCCACHE WITH NO_INFOMSGS;
		DBCC FREESYSTEMCACHE('ALL');

		SELECT COUNT(*) FROM CCI_SOURCE; -- read into cache

		DBCC SQLPERF ("sys.dm_os_wait_stats", CLEAR) WITH NO_INFOMSGS;
		DBCC SQLPERF ("sys.dm_os_spinlock_stats", CLEAR) WITH NO_INFOMSGS;
	END;
END;

Finally we need a stored procedure that can be called to do as many CCI inserts as it can as quickly as possible. I used the following code:

CREATE OR ALTER PROCEDURE [dbo].[CCI_RUN_INSERTS] (@DebugMe INT = 0)
AS
BEGIN
	DECLARE @table_number INT,
	@tablename SYSNAME,
	@SQLToExecute NVARCHAR(4000),
	@start_loop_time DATETIME;

	SET NOCOUNT ON;

	SELECT @table_number = NEXT VALUE FOR CCI_PARALLEL_TEST_SEQ;

	WHILE @table_number BETWEEN 0 AND 576
	BEGIN
		SET @start_loop_time = GETDATE();

		SET @tablename = N'CCI_PARALLEL_RPT_TARGET_' + CAST(@table_number AS NVARCHAR(3));
		SET @SQLToExecute= N'
		INSERT INTO ' + QUOTENAME(@tablename) + N' WITH (TABLOCK)
		SELECT * FROM CCI_SOURCE WITH (TABLOCK)
		OPTION (MAXDOP 1)';

		IF @DebugMe = 1
		BEGIN
			PRINT @SQLToExecute;
		END
		ELSE
		BEGIN
			EXEC (@SQLToExecute);

			INSERT INTO CCI_TEST_LOGGING_TABLE VALUES (@@SPID, @table_number, @start_loop_time, GETDATE());
		END;

		SELECT @table_number = NEXT VALUE FOR CCI_PARALLEL_TEST_SEQ;
	END;
END;

To vary the number of concurrent queries I used sqlcmd. Each set of 12 sessions had the following format:

START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
START /B sqlcmd -d your_db_name -S your_server_name -Q "EXEC dbo.[CCI_RUN_INSERTS]" > nul
ping 192.2.0.1 -n 1 -w 1 > nul

My workflow was to perform this test was to add the desired number of calls to the stored procedure in the .bat file, run the test, and to check the results to make sure that the test was a good one. If it was I saved off results using the reset procedure. If not I ran the reset procedure and tried again.

The ping command is there to add a small delay between sets of queries. I found that adding such a delay led to less doubling up on schedulers. I picked 12 because that’s the number of soft-NUMA nodes. Sometimes tests would get quite a bit of SOS_SCHEDULER_YIELD waits which would mean that they couldn’t be accurately compared to other tests. My fix was to just run the test again. It required a bit of patience but I never had to run a test more than once. SOS waits weren’t eliminated but I’d say they fell to acceptable levels:

a27_SOS_CPU_wait_table

The right way to avoid SOS waits for testing like this (which might require every user session to go on its own scheduler) would be to set up 96 manual soft-NUMA nodes. But who has time for that?

Test Results

12 tests with different numbers of active queries were run. I picked numbers that somewhat evenly divided into 576 to try to keep work balanced between threads. The blue line in the chart below measures how long each test took from start to finish and the orange line represents how fast the test could have completed if there was no contention on the server whatsoever:

a27_No_T834_compare_chart

Naturally we can’t expect the two lines to match perfectly, but improvements in runtime stop after going past 32 threads. In fact, the workload takes longer with 96 threads compared to 32 threads.

The dominant wait event for the higher thread count runs is RESERVED_MEMORY_ALLOCATION_EXT. Below is a chart of all of the wait events worth mentioning for the 96 thread run:

a27_no_T834_wait_table

The total number of wait events for RESERVED_MEMORY_ALLOCATION_EXT is very consistent between runs. However, the average wait time significantly increases as the number of concurrent queries increases:

a27_memory_waits_chart

In fact, it could be said that that nearly all worker time past 32 threads is spent waiting on memory. The final column in the chart below is the total time spent in SQL Server minus wait time for RESERVED_MEMORY_ALLOCATION_EXT. The values in that column are remarkably consistent.

a27_No_TF_834_summary

In my experience, when we get into a situation with high memory waits caused by too much concurrent CCI activity all queries on the server that use a memory grant can be affected. For example, I’ve seen sp_whoisactive run for longer than 90 seconds.

It needs to be stated that not all CCIs will suffer from this scalability problem. I was able to achieve good scalability with some artificial tables, but all of the real target tables that I tested have excessive memory waits at high concurrency. Perhaps tables which require more CPU to compress naturally spread out their memory requests and the underlying OS is better able to keep up.

Test Results With Trace Flag 834

Microsoft strongly recommends against using trace flag 834 with columnstore tables. There’s even an article dedicated to that warning. I was desperate so I tried it anyway. The difference was night and day:

a27_T834_compare_chart

Enabling trace flag 834 does at least three things: SQL Server grows to max server memory on startup, AWE memory is used for memory access, and large pages are used for the buffer pool. I didn’t see gains when using LPIM (which uses AWE memory) or by growing memory to max before running tests with a conventional memory model, so I suspect that the large pages are making a key difference. Wait time for RESERVED_MEMORY_ALLOCATION_EXT is under a second for all tests.

Scalability is diminished a bit for the 96 thread run. There are some wait events that creep up:

TF_834_96_thread_wait_stats

All of the wait events can be troubleshooted in conventional ways except possibly for CMEMTHREAD. There’s no longer a large amount of time spent outside of SQL Server in the OS.

Other Workarounds

Without trace flag 834 this can be a difficult problem to work around. The two main strategies are to spread out CCI insert activity as much as possible during the ETL and to reduce memory usage of queries which run at the same time as the CCI inserts. For example, consider a query that inserts into a CCI that also performs a large hash join. If that hash join can be moved to somewhere else in the process then you might come out ahead in reducing contention on memory.

Other than that, there’s some evidence that virtualized servers are not a good fit for this type of workload. Large virtual guests experience the memory waits at an increased rate, but it isn’t yet clear if the problem can be avoided through some change in VM configuration.

Final Thoughts

It’s hard not to conclude that TF 834 is necessary to get scalability for columnstore ETLs on very large servers. Hopefully Microsoft will make TF 834 compatible with columnstore one day in the future. Thanks for reading!

Columnstore Bitmap Filters

Microsoft has introduced a few improvements to bitmap filters with batch mode. I don’t really define any terms in this post, so if you don’t have a solid grasp of the fundamentals you should consider this blog post by Paul White required reading.

Test Data

Let’s start with a few simple examples that don’t involve columnstore. All of the test queries are simple joins between a dimension and a fact table on a DATETIME column. I’ll use the same dimension table for all of them:

DROP TABLE IF EXISTS dbo.DimDim;

CREATE TABLE dbo.DimDim (
	dimDate DATETIME,
	PRIMARY KEY (dimDate)
);

INSERT INTO dbo.DimDim VALUES
('20170101'),
('20170102'),
('20170103'),
('20170104'),
('20170105'),
('20170106'),
('20170107'),
('20170108'),
('20170109'),
('20170110'),
('20171201'),
('20171202'),
('20171203'),
('20171204'),
('20171205'),
('20171206'),
('20171207'),
('20171208'),
('20171209'),
('20171210');

Twenty rows in total with 10 dates in January and 10 dates in December. Obviously this isn’t a proper dimension table, but it makes the demos a little simpler. The fact tables have about a million rows in all twelve months of 2017 for a total of about 12 million rows.

Rowstore Heaps

Let’s start with a rowstore heap for the fact table. Code to create and populate the table is below:

DROP TABLE IF EXISTS dbo.FactHeapNoPart;

CREATE TABLE dbo.FactHeapNoPart (
	factDate DATETIME,
	justTheFacts VARCHAR(100)
);

DECLARE @month INT = 1;
SET NOCOUNT ON;

WHILE @month <= 12
BEGIN
	INSERT INTO dbo.FactHeapNoPart
		WITH (TABLOCK)
	SELECT DATEADD(DAY, t.RN / 38500
	, DATEADD(MONTH, @month, '20161201'))
	, REPLICATE('FACTS', 20)
	FROM
	(
		SELECT TOP (1048576) ROW_NUMBER()
			OVER (ORDER BY (SELECT NULL)) RN
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
	) t
	OPTION (MAXDOP 1);

	SET @month = @month + 1;
END;

The first demo query is forced to run in serial:

SELECT *
FROM dbo.DimDim dd
INNER JOIN dbo.FactHeapNoPart f
	ON dd.dimDate = f.factDate
OPTION (MAXDOP 1);

With a serial query there is no visible bitmap operator as expected:

a23_rowstore_heap_serial

On my machine, the query requires 190656 logical reads and 2234 ms of CPU time to execute.

Bumping the query up to MAXDOP 2 results in a bitmap operator:

a23_rowstore_heap_parallel

The bitmap operator isn’t pushed in row due to the data type, but only 769998 rows are sent to the hash match as a result. CPU time required to execute the query falls to 1844 and logical reads stays the same. We did the same amount of IO, which seems perfectly reasonable here.

Rowstore Clustered Indexes

It’s not a very common plan type, but what happens with a hash join on the clustered key of a rowstore table? The code below creates a fact table with the same data as before but now we have a clustered index:

DROP TABLE IF EXISTS dbo.FactClustNoPart;

CREATE TABLE dbo.FactClustNoPart (
	factDate DATETIME,
	justTheFacts VARCHAR(100)
);

CREATE CLUSTERED INDEX CI ON FactClustNoPart (factDate);

DECLARE @month INT = 1;
SET NOCOUNT ON;

WHILE @month <= 12
BEGIN
	INSERT INTO dbo.FactClustNoPart
		WITH (TABLOCK)
	SELECT DATEADD(DAY, t.RN / 38500
	, DATEADD(MONTH, @month, '20161201'))
	, REPLICATE('FACTS', 20)
	FROM
	(
		SELECT TOP (1048576) ROW_NUMBER()
			OVER (ORDER BY (SELECT NULL)) RN
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
	) t
	OPTION (MAXDOP 1);

	SET @month = @month + 1;
END;

All rows are again read from the fact table. The bitmap operator isn’t quite as effective as before. 798498 rows are sent to the join but only 769998 remain after the join.

Could SQL Server do better? Building the hash table is a blocking operator and it should be trivial to keep track of the minimum and maximum join key while building it. In theory, one could imagine a plan with a clustered index seek on the fact table instead of a clustered index scan that takes advantage of the minimum and maximum values found during the hash build phase. This optimization would result in less overall IO. Perhaps this wasn’t implemented because this isn’t a very common plan shape.

Rowstore Partitioned Heaps

Now let’s move onto to a partitioned rowstore heap for the fact table. There is one partition per month and 12 partitions end up with data. Code to create and populate the table is below:

CREATE PARTITION FUNCTION hate_this_syntax_fun
(DATETIME)
AS RANGE RIGHT
FOR VALUES (
  '20161231'
, '20170101'
, '20170201'
, '20170301'
, '20170401'
, '20170501'
, '20170601'
, '20170701'
, '20170801'
, '20170901'
, '20171001'
, '20171101'
, '20171201'
, '20180101'
);

CREATE PARTITION SCHEME hate_this_syntax_scheme
AS PARTITION hate_this_syntax_fun
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS dbo.FactHeapPart;

CREATE TABLE dbo.FactHeapPart (
	factDate DATETIME,
	justTheFacts VARCHAR(100)
) ON hate_this_syntax_scheme (factDate);
set statistics io, time on;

DECLARE @month INT = 1;
SET NOCOUNT ON;

WHILE @month <= 12
BEGIN
	INSERT INTO dbo.FactHeapPart
		WITH (TABLOCK)
	SELECT t2.factDate, t2.justTheFacts
	FROM
	(
		SELECT CAST(
			DATEADD(DAY, CAST(t.RN / 38500 AS INT)
			, DATEADD(MONTH, @month, '20161201')
			) AS DATETIME) factDate
		, REPLICATE('FACTS', 20) justTheFacts
		FROM
		(
			SELECT TOP (1048576) ROW_NUMBER()
				OVER (ORDER BY (SELECT NULL)) RN
			FROM master..spt_values t1
			CROSS JOIN master..spt_values t2
		) t
	) t2
	--WHERE $PARTITION.hate_this_syntax_fun(t2.factDate) = @month + 2
	OPTION (MAXDOP 1);

	SET @month = @month + 1;
END;

The code above takes a bit longer to execute than I would have liked. There’s a sort before inserting data even with a filter on the partitioning function. I’m not sure why I couldn’t make the sort go away. I expect that it has something to do with data types.

Running the same query as before:

SELECT *
FROM dbo.DimDim dd
INNER JOIN dbo.FactHeapPart f
	ON dd.dimDate = f.factDate
OPTION (MAXDOP 2);

The plan looks the same as before. The bitmap sends 790371 rows to the hash join. One thing to note is that all partitions are read from the table:

a23_all_partitions

We know that based on the data in the dimension table that SQL Server only needs to read two partitions from the fact table. Could the query optimizer in theory do better than it did? Consider the fact that a partitioned table has at most 15000 partitions. All of the partition values cannot overlap and they don’t change without a DDL operation. When building the hash table the query optimizer could keep track of which partitions have at least one row in them. By the end of the hash build we’ll know exactly which partitions could contain data, so the rest of the partitions could be skipped during the probe phase.

Perhaps this isn’t implemented because it’s important for the hash build to be independent of the probe. Maybe there’s no guarantee available at the right time that the bitmap operator will be pushed all the way down to the scan as opposed to a repartition streams operator. Perhaps this isn’t a common case and the optimization isn’t worth the effort. After all, how often do you join on the partitioning column instead of filtering by it?

It’s worth noting that the theoretical optimization described above still isn’t as good as the collocated join optimization blogged about by Paul White.

Columnstore

Now let’s build a columnstore index with 1048576 rows per rowgroup:

DROP TABLE IF EXISTS dbo.FactCCINoPart;

CREATE TABLE dbo.FactCCINoPart (
	factDate DATETIME,
	justTheFacts VARCHAR(100),
	INDEX CCI CLUSTERED COLUMNSTORE
);

DECLARE @month INT = 1;
SET NOCOUNT ON;

WHILE @month <= 12
BEGIN
	INSERT INTO dbo.FactCCINoPart
		WITH (TABLOCK)
	SELECT t2.factDate, t2.justTheFacts
	FROM
	(
		SELECT CAST(
			DATEADD(DAY, CAST(t.RN / 38500 AS INT)
			, DATEADD(MONTH, @month, '20161201')
			) AS DATETIME) factDate
		, REPLICATE('FACTS', 20) justTheFacts
		FROM
		(
			SELECT TOP (1048576) ROW_NUMBER()
				OVER (ORDER BY (SELECT NULL)) RN
			FROM master..spt_values t1
			CROSS JOIN master..spt_values t2
		) t
	) t2
	OPTION (MAXDOP 1);

	SET @month = @month + 1;
END;

Let’s return to our serial join query:

SELECT *
FROM dbo.DimDim dd
INNER JOIN dbo.FactCCINoPart f
	ON dd.dimDate = f.factDate
OPTION (MAXDOP 1);

A few interesting things happen. The first is that we get rowgroup elimination even though the dates in the dimension table are spread very far apart:

Table ‘FactCCINoPart’. Segment reads 2, segment skipped 10.

The following simple query doesn’t get rowgroup elimination:

SELECT *
FROM dbo.FactCCINoPart f
WHERE f.factDate IN ('20170101', '20171231')

You can read more about that limitation here. It’s fair to say that the bitmap filter does a better job than expected with rowgroup elimination. According to extended events this is known as an expression filter bitmap. The extended event has a few undocumented properties about the bitmap:

a23_cs bitmap_xe

I watched the extended events fly by a few times but it wasn’t clear to me what was going on internally. One possible implementation would be for the hash build to compare each build row to the rowgroup low and high values to figure out which rowgroups could never possibly return matched rows. I strongly suspect that is not the implementation that Microsoft chose. Perhaps they take advantage of the small expected size of the bitmap filter to send information about all of the build rows to do elimination. I don’t know a lot about computer science, but the usual structure of bitmap would not be sufficient because you can’t use such a bitmap to make any determination about inequality comparisons. It can only tell you if an individual row can’t match.

The second interesting thing is that we get an optimized bitmap even for a serial plan:

a23_CCI_opt_bitmap

Optimized Bitmaps

At some point optimized bitmaps were limited to parallel plans. I suspect that restriction was relaxed with the availability of batch mode in a plan but I don’t know for sure. The demos below show optimized bitmaps both improving and degrading performance. The script below creates three tables and takes about three minutes to run on my machine:

DROP TABLE IF EXISTS #t;

SELECT TOP (3000000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) RN into #t
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

-- rowstore dim with an index
-- that includes DimForeignKey
DROP TABLE IF EXISTS dbo.RSJoinIndex;
CREATE TABLE dbo.RSJoinIndex (
	IrrelevantKey BIGINT,
	SelectKey BIGINT,
	DimForeignKey BIGINT,
	FilterIndexed INT,
	PageFiller VARCHAR(3000),
	PRIMARY KEY (IrrelevantKey)
);

INSERT INTO dbo.RSJoinIndex WITH (TABLOCK)
SELECT t.RN, 1, t.RN, 1, REPLICATE('Z', 3000)
FROM #t t
OPTION (MAXDOP 1);

CREATE INDEX IX_RSJoinIndex ON dbo.RSJoinIndex
	(FilterIndexed) INCLUDE (DimForeignKey);
CREATE STATISTICS S1 ON dbo.RSJoinIndex (DimForeignKey)
	WITH FULLSCAN;

-- rowstore dim with an index
-- that does not include DimForeignKey
DROP TABLE IF EXISTS dbo.RSNoJoinIndex;
CREATE TABLE dbo.RSNoJoinIndex (
	IrrelevantKey BIGINT,
	SelectKey BIGINT,
	DimForeignKey BIGINT,
	FilterIndexed INT,
	PageFiller VARCHAR(3000),
	PRIMARY KEY (IrrelevantKey)
);

INSERT INTO dbo.RSNoJoinIndex WITH (TABLOCK)
SELECT t.RN, 1, t.RN, 1, REPLICATE('Z', 3000)
FROM #t t
OPTION (MAXDOP 1);

CREATE INDEX IX_RSNoJoinIndex ON dbo.RSNoJoinIndex
	(FilterIndexed);
CREATE STATISTICS S1 ON dbo.RSNoJoinIndex (DimForeignKey)
	WITH FULLSCAN;

-- CCI fact table
DROP TABLE IF EXISTS dbo.CCIFact;
CREATE TABLE dbo.CCIFact (
	DimForeignKey BIGINT,
	FilterCol BIGINT,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCIFact WITH (TABLOCK)
SELECT t.RN , 1 + t.RN % 1000
FROM #t t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.CCIFact (DimForeignKey)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.CCIFact (FilterCol)
	WITH FULLSCAN;

All tables have three million rows and the join between them results in 3000 rows. RSJoinIndex is a rowstore table with 3 million rows that contains the join column in a nonclustered index. RSNoJoinIndex is a rowstore table without the join column in a nonclustered index. CCIFact is a columnstore table with three million rows.

Optimized Bitmaps Gone Right

Consider the following query:

SELECT table0.SelectKey
FROM dbo.RSNoJoinIndex AS table0
INNER JOIN dbo.CCIFact AS table1
ON table0.DimForeignKey = table1.DimForeignKey
WHERE table0.FilterIndexed = 1
  AND table1.FilterCol = 1 -- 0.1% of the data
OPTION (MAXDOP 1);

The RSNoJoinIndex table has an index with a key column of FilterIndexed and an included column of DimForeignKey. Here’s the plan that I get:

a23_final_good_bitmap

The CCI is used as the build for the hash join. The highlighted filter is an optimized bitmap that’s applied after the index seek but before the key lookup to get the SelectKey column. The bitmap reduces the number of required key lookups from 3000000 to just 3000. The query executes and requires 21874 logical reads from the rowstore table and 437 ms of CPU time on my machine.

Overall this is a reasonable plan and an effective application of a bitmap filter.

Optimized Bitmaps Gone Wrong

Let’s query the table without the join column in the index:

SELECT table0.SelectKey
FROM dbo.RSNoJoinIndex AS table0
INNER JOIN dbo.CCIFact AS table1
ON table0.DimForeignKey = table1.DimForeignKey
WHERE table0.FilterIndexed = 1
  AND table1.FilterCol = 1 -- 0.1% of the data
OPTION (MAXDOP 1);

Here’s the plan:

a23_final_bad_bitmap

The position of the bitmap has changed so that it’s evaluated after the key lookup. That makes sense because the key lookup returns the column to be filtered against. However, the bitmap filter still reduces the estimated number of key lookups from 3000000 to 3000. This is impossible. The filter can only be applied after the key lookup, so it does not make sense for the bitmap to reduce the number of estimated executions of the key lookup.

Performance is significantly worse with the query now requiring 12199107 logical reads from the rowstore table and 13406 CPU time overall. We can see that the query did three million key lookups:

a23_3M_Keylookups

I could only get this type of plan to appear with a columnstore index somewhere present in the query. It can be triggered even with two rowstore tables as long as the query is batch mode eligible. I’ve filed a connect item for this issue. Please vote if you have a moment.

Final Thoughts

Batch mode and columnstore have brought some interesting (and as far as I can tell, undocumented) improvements to bitmap filters. Thanks for reading!

A Columnstore Compression Magic Trick

Columnstore compression is complicated, and in some cases, surprising.

The Setup

The source data for the CCI has enough rows to fit six perfect rowgroups. The ID column is just sequential integers from 1 to 6291456. The ID2 column is the ID column modulo 20001. Code to load the data into a temp table:

 

<span 				data-mce-type="bookmark" 				id="mce_SELREST_end" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>
DROP TABLE IF EXISTS #STG_DATA;
CREATE TABLE #STG_DATA (
	ID BIGINT NOT NULL,
	ID2 BIGINT NOT NULL,
	PRIMARY KEY (ID)
);

INSERT INTO #STG_DATA WITH (TABLOCK)
SELECT t.RN, t.RN % 20001
FROM
(
	SELECT TOP (6 * 1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

Here’s the table definition for the target CCI:

DROP TABLE IF EXISTS dbo.TARGET_CCI;
CREATE TABLE dbo.TARGET_CCI (
	ID2 BIGINT NOT NULL,
	ID BIGINT NOT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

The reversal of column order is important for the demo below.

Changing MAXDOP

First let’s load the ID2 column the temp table into the CCI. The order of data can matter for compression so I have a superfluous TOP expression to force SQL Server to read the data in clustered key order.

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 1);

The insert query takes 2765 ms of CPU time and 2771 ms of elapsed time on my machine. According to sys.dm_db_column_store_row_group_physical_stats each rowgroup has a size of 2098320 bytes:

a20_maxdop_1_rg_dmv

Now let’s move on to a parallel insert query with MAXDOP 2. The purpose of the second column in the CCI is to make the insert go parallel on my machine. It’s possible that you’ll need to use trace flag 8649 or some other trick to get a parallel insert. Here’s the code that I ran:

TRUNCATE TABLE dbo.TARGET_CCI;

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 2);

The insert query now takes 3594 ms of CPU time and 2112 ms of elapsed time on my machine. The size of each rowgroup did not change. It’s still 2098320 bytes. Even though this is a parallel query there’s no element of randomness in this case. In the query plan we can see that the source table was scanned in a serial zone and round robin distribution is to used to distribute exactly half of the rows to each parallel thread.

a20_parallel_insert

This seems like a reasonable plan given that TOP forces a serial zone and we need to preserve order. It’s possible to rewrite the query to encourage a parallel scan of the source table, but that would introduce an order-preserving gather streams operator.

I’m not satisfied with the runtime yet, so I’m going to bump up MAXDOP to 3:

TRUNCATE TABLE dbo.TARGET_CCI;

INSERT INTO dbo.TARGET_CCI WITH (TABLOCK)
SELECT TOP (9876543210) ID2, 0
FROM #STG_DATA
ORDER BY ID
OPTION (MAXDOP 3);

The insert query now takes 114172 ms of CPU time and 39208 ms of elapsed time to execute. However, each rowgroup now is just 54496 bytes.

a20_maxdop_3_rg_dmv

The INSERT took significantly longer than before, but we have 38X better compression compared to the table after the MAXDOP 2 query. What happened?

Revealing the Magic Trick

An interesting pattern for compressed data sizes appears when working with repeated integers for a single rowgroup. The query that I tested with was roughly of the following format:

INSERT INTO dbo.CCI
SELECT t.RN % @MOD_NUM
FROM
(
	SELECT TOP (@ROWS_INSERTED)
		ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

Below is a contour plot that shows how the compressed size for a single rowgroup varies as the number of rows and the modulus value changes:

a20_contour_size

Values that are repeated 64 or more times seem to be compressed much better than other values. This pattern definitely doesn’t always hold as you add more columns to the table which is why I made the ID2 column the first column in the target CCI. Why is this pattern relevant to the previous example?

Consider the MAXDOP 1 insert query. With a full rowgroup of 1048576 rows a value will be repeated at most 1048576/20001 = 53 times in each rowgroup. It doesn’t cross the threshold of 64 so we end up with a compressed size of 2098320 bytes.

Now consider the MAXDOP 2 insert query. The ordered data from the scan is distributed using round robin distribution on two threads. For the first 20001 rows from the scan, thread 0 gets all even values and thread 1 gets all odd values. For the next 20001 rows, thread 0 gets all odd values and thread 1 gets all even values. This occurs because 20001 isn’t divisible by 2. For all six compressed rowgroups we end up with the same data distribution as we had when doing MAXDOP 1 inserts. It makes sense that the compressed size remained at 2098320 bytes.

Now consider the MAXDOP 3 insert query. The query still uses round robin distribution but there are now three threads. 20001 is divisible by 3 so thread 0 only ends up with 6667 unique values from 0, 3, … to 19999. Thread 1 also ends up with 6667 unique values from 1, 4, … to 20000. Thread 2 follows a similar pattern. Each compressed rowgroup only has 6667 unique values instead of 20001. Each value shows up at least 157 times in the rowgroup, so all of the data qualifies for much better compression.

Final Thoughts

This has absolutely no practical value. Thanks for reading!

ROWGROUP_FLUSH Deadlocks

We recently observed many ROWGROUP_FLUSH deadlocks while doing concurrent inserts into CCIs. I’m not really a concurrency kind of guy but I figured that I should blog about this just so other people with the same problem can find some information about it.

Deadlock Reproduction

The schedulers of the involved sessions are important in some way, especially when going for a simple reproduction. It’s easiest to just make all new sessions go the same CPU:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 0;

Obviously you should never do that in production. After affinity has been addressed I recommend creating a nearly empty source table and a new CCI table:

DROP TABLE IF EXISTS dbo.CCI_DEADLOCKED;
CREATE TABLE dbo.CCI_DEADLOCKED (
	COL VARCHAR(1500),
	INDEX CCI CLUSTERED COLUMNSTORE
);

CREATE TABLE ##SOURCE_IDS (ID BIGINT NOT NULL);

INSERT INTO ##SOURCE_IDS WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

One way to see the deadlock is to quickly kick off two inserts into the CCI_DEADLOCKED table from different sessions. Inserting a larger amount of data means that you’ll have more time to kick off the second session before the first completes, but a longer rollback time on the first session. On my machine inserting 1048576 rows of VARCHAR(1500) data seems like a reasonable compromise:

INSERT INTO dbo.CCI_DEADLOCKED
SELECT REPLICATE('Z', 1500)
FROM ##SOURCE_IDS
OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);

The second session waits on the first with a LCK_M_IX wait event. The first session loads all of its rows into the delta store, then deadlocks and rolls them all back. You can see this happen in near real time by looking at  sys.dm_db_column_store_row_group_physical_stats:

a19_disappearing_delta_store_rows

Here’s the deadlock XML for those who are interested in that kind of thing:

<?xml version="1.0" encoding="UTF-8"?>
<deadlock>
   <victim-list>
      <victimProcess id="process250c6d59c28" />
   </victim-list>
   <process-list>
      <process id="process250c6d59c28" taskpriority="0" logused="0" waitresource="HOBT: 5:72057594084917248 [ROWGROUP_FLUSH]" waittime="3635" ownerId="8986700" transactionname="CloseDeltastore" lasttranstarted="2017-11-12T16:51:53.260" XDES="0x250b572c490" lockMode="X" schedulerid="1" kpid="6288" status="suspended" spid="56" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2017-11-12T16:51:42.827" lastbatchcompleted="2017-11-12T16:51:42.827" lastattention="1900-01-01T00:00:00.827" clientapp="Microsoft SQL Server Management Studio - Query" hostname="???????" hostpid="4404" loginname="???????\?" isolationlevel="read committed (2)" xactid="8775106" currentdb="5" lockTimeout="4294967295" clientoption1="671090784" clientoption2="390200">
         <executionStack>
            <frame procname="adhoc" line="1" stmtend="238" sqlhandle="0x020000004d878e20b5b8d311601f91ebfeb1174f657907d00000000000000000000000000000000000000000">unknown</frame>
         </executionStack>
         <inputbuf>INSERT INTO dbo.CCI_DEADLOCKED  SELECT REPLICATE('Z', 1500)  FROM ##SOURCE_IDS  OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);</inputbuf>
      </process>
      <process id="process250bacf04e8" taskpriority="0" logused="168" waitresource="HOBT: 5:72057594085179392 " waittime="13628" ownerId="8785145" transactionname="INSERT" lasttranstarted="2017-11-12T16:51:43.267" XDES="0x2508f1ac040" lockMode="IX" schedulerid="1" kpid="6856" status="suspended" spid="54" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2017-11-12T16:51:43.267" lastbatchcompleted="2017-11-12T16:51:43.250" lastattention="1900-01-01T00:00:00.250" clientapp="Microsoft SQL Server Management Studio - Query" hostname="???????" hostpid="4404" loginname="???????\?" isolationlevel="read committed (2)" xactid="8785145" currentdb="5" lockTimeout="4294967295" clientoption1="671090784" clientoption2="390200">
         <executionStack>
            <frame procname="adhoc" line="1" stmtend="238" sqlhandle="0x020000004d878e20b5b8d311601f91ebfeb1174f657907d00000000000000000000000000000000000000000">unknown</frame>
         </executionStack>
         <inputbuf>INSERT INTO dbo.CCI_DEADLOCKED  SELECT REPLICATE('Z', 1500)  FROM ##SOURCE_IDS  OPTION (MAXDOP 1, MAX_GRANT_PERCENT = 0);</inputbuf>
      </process>
   </process-list>
   <resource-list>
      <hobtlock hobtid="72057594084917248" subresource="ROWGROUP_FLUSH" dbid="5" objectname="D1.dbo.CCI_DEADLOCKED" indexname="CCI" id="lock250b52ab400" mode="S" associatedObjectId="72057594084917248">
         <owner-list>
            <owner id="process250bacf04e8" mode="S" />
         </owner-list>
         <waiter-list>
            <waiter id="process250c6d59c28" mode="X" requestType="wait" />
         </waiter-list>
      </hobtlock>
      <hobtlock hobtid="72057594085179392" subresource="FULL" dbid="5" objectname="D1.dbo.CCI_DEADLOCKED" indexname="CCI" id="lock250b5b8a280" mode="X" associatedObjectId="72057594085179392">
         <owner-list>
            <owner id="process250c6d59c28" mode="X" />
         </owner-list>
         <waiter-list>
            <waiter id="process250bacf04e8" mode="IX" requestType="convert" />
         </waiter-list>
      </hobtlock>
   </resource-list>
</deadlock>

SSMS can’t produce a deadlock graph for this type of deadlock. Below is the non-copy-and-pastable error message from it:

Failed to initialize deadlock control.
There is an error in XML document (1, 2497).
Instance validation error: ‘ROWGROUP_FLUSH’ is not a valid value for hobtlockSubresource.

Plan Explorer from SentryOne can help us:

a19_deadlock_graph

If you’re following along at home don’t forget to reset your affinity to whatever you had it before. The most common option:

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = AUTO;

The Workarounds

We’ve only observed this deadlock with multiple concurrent sessions insert to the delta store for the same target CCI due to server memory pressure or very low cardinality estimates (less than 251 rows). The correct mitigation depends on why you’re seeing the issue in the first place. If you’re seeing it due to low cardinality estimates then fix your estimates, or at the very least get them above 250 rows. If you’re seeing them because the memory grant for the CCI build times out after 25 seconds then lower concurrency or increase server memory.

The problem can also be avoided by not doing concurrent inserts in the first place. In some cases a parallel insert may be a reasonable alterative. There’s also some evidence that the deadlock is only seen when the number of rows for insert is very close to 1048576, but we weren’t able to make any definitive conclusions around that.

Final Thoughts

Don’t despair if you run into a ROWGROUP_FLUSH deadlock! There’s probably something you can do in the application to avoid it. If you feel that you shouldn’t have to take such measures feel free to vote for my connect item here.