sql-server-2017 – ORDER BY (SELECT NULL)

SQL Server 2017 CU14 Change to CCI Insert Memory Time-out Behavior

Posted on April 16, 2019 by Joe Obbish

A KB article was recently published that suggested that memory time-out behavior was changed for CCI inserts:

When you try to create a Columnstore Index on a table in Microsoft SQL Server 2016 or 2017, the request may fail after 25 seconds with memory time-out error 8645, depending on how many memory consuming queries are running concurrently and/or how much memory is available at the time of the request. The memory time-out error may occur even when you configure longer memory request time-out at an instance level or at a resource pool level.

I’ve never seen that error occur for serial inserts. Serial inserts time out after 25 seconds and execute with minimum required memory. They write directly to the delta store instead of attempting compression. So it looked like this change affects parallel inserts but the details weren’t at all clear to me.

Time out for time-outs

Why should we care about memory grant timeouts for CCI insert queries? Simply put, lots of bad things can happen when those queries can time out, both for serial and for parallel inserts. For serial insert queries, I’ve observed deadlocks, extremely poor performance along with long SLEEP_TASK waits, and extremely long rollbacks. For parallel insert queries, I’ve observed queries that run seemingly forever, poor performance of the SELECT part, and error 8645. You probably don’t want any of that occurring in production. It would be very helpful if it was possible to extend the 25 second time-out for queries that insert into columnstore tables.

Time out for time-outs

I’ve found it extremely useful to keep around a 2017 RTM environment with no CUs to figure when issues related to columnstore were introduced into the product. First we’ll take a look at the behavior on 2017 RTM to see error 8645 in action. I want a relatively simple parallel insert query that will run for longer than 25 seconds and will take a large memory grant. I decided on the following query:

INSERT INTO dbo.TARGET_CCI_1 WITH (TABLOCK)
SELECT ca.SLOW, ca.INFLATE_GRANT
FROM (VALUES (0), (1), (2), (3)) v(x)
CROSS APPLY (
	SELECT TOP (1048576) sc.SLOW, sc.INFLATE_GRANT
	FROM dbo.SLOW_TO_COMPRESS sc
	WHERE sc.FOR_SEEKING = v.x
) ca
OPTION (MAXDOP 2);

I get the properties that I’m after with a few undocumented tricks. The inflate grant column is a VARCHAR(8000) column. That data type significantly increases the memory grant for parallel inserts even with all NULLs. For the SLOW column, I’m loading integers evenly distributed between 0 and 7999. That data pattern can take longer than expected to compress. The worst case is with around 16000 distinct evenly distributed integers. If you’d like to understand why check out this answer by Paul White. Finally, the CROSS APPLY pattern means that I’ll get demand-based parallelism with each nested loop execution reading exactly enough rows from SLOW_TO_COMPRESS to fill up one rowgroup. It may be helpful to look at the query plan:

a38_query_plan

The query takes about 40 seconds to execute on my machine. If you’d like to follow along at home, set max server memory to 8000 MB and run the following code:

DROP TABLE IF EXISTS dbo.SLOW_TO_COMPRESS;

CREATE TABLE dbo.SLOW_TO_COMPRESS (
	FOR_SEEKING INT NULL,
	SLOW BIGINT NULL,
	INFLATE_GRANT VARCHAR(8000) NULL
);

CREATE CLUSTERED INDEX CI ON dbo.SLOW_TO_COMPRESS
(FOR_SEEKING);

INSERT INTO dbo.SLOW_TO_COMPRESS WITH (TABLOCK)
SELECT q.RN / 1048576, RN % 10000, NULL
FROM
(
	SELECT TOP (4 * 1048576) -1 + ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

GO

DECLARE @table_id INT = 1,
@sql NVARCHAR(4000);

WHILE @table_id <= 6
BEGIN
	SET @sql = N'DROP TABLE IF EXISTS dbo.TARGET_CCI_'
	 + CAST(@table_id AS NVARCHAR(2))
	 + N'; CREATE TABLE dbo.TARGET_CCI_'
	 + CAST(@table_id AS NVARCHAR(2))
	 + N'(
		SLOW BIGINT NULL,
		INFLATE_GRANT VARCHAR(8000) NULL,
		INDEX CCI1 CLUSTERED COLUMNSTORE
	)';

	EXEC sp_executesql @sql;

	SET @table_id = @table_id + 1;
END;

GO

CREATE OR ALTER PROCEDURE dbo.INSERT_INTO_TARGET_CCI
(@table_id INT)
AS
BEGIN

DECLARE @sql NVARCHAR(4000) = N'INSERT INTO dbo.TARGET_CCI_'
+ CAST(@table_id AS NVARCHAR(2))
+ N' WITH (TABLOCK)
SELECT ca.SLOW, ca.INFLATE_GRANT
FROM (VALUES (0), (1), (2), (3)) v(x)
CROSS APPLY (
	SELECT TOP (1048576) sc.SLOW, sc.INFLATE_GRANT
	FROM SLOW_TO_COMPRESS sc
	WHERE sc.FOR_SEEKING = v.x
) ca
OPTION (MAXDOP 2)';

EXEC sp_executesql @sql;

END;

Error code 8645

Batch files that call sqlcmd are a convenient way to kick off lots of queries. For example:

START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 1;" > nul
START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 2;" > nul
START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 3;" > nul
START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 4;" > nul
START /B sqlcmd -d TEST -S .\SQL2017RTM -Q "EXEC dbo.INSERT_INTO_TARGET_CCI 5;" > nul

Note that I do not have Resource Governor enabled. If I kick off five queries at once using the batch file I don’t get an error. After 25 seconds two of the five queries are able to execute with the same memory grant as others:

a38_memory_grant_RTM

It does make me uncomfortable to see query memory grants exceed the target memory for the semaphore by so much, but at least it’s not over max server memory:

a38_semaphore_RTM

I ran the same test but kicked off a sixth query in SSMS. After 25 seconds I saw the following error for the sixth query:

Msg 8645, Level 17, State 1, Line 1
A timeout occurred while waiting for memory resources to execute the query in resource pool ‘default’ (2). Rerun the query.

I want my lawyer

On SQL Server 2017 CU14 I ran a variety of tests by changing the memory time-out settings at the Resource Governor query level or at the instance level. I tried different Resource Governor pools and even serial queries. I still saw a timeout of 25 seconds no matter what I did. I contacted the attorney that I keep on retainer to help me interpret SQL Server KB articles. Let’s review the relevant text again:

When you try to create a Columnstore Index on a table in Microsoft SQL Server 2016 or 2017, the request may fail after 25 seconds with memory time-out error 8645, depending on how many memory consuming queries are running concurrently and/or how much memory is available at the time of the request. The memory time-out error may occur even when you configure longer memory request time-out at an instance level or at a resource pool level.

He pointed out that the article doesn’t actually say that the time-out is now configurable. Just that it wasn’t configurable in the past. The symptom may strictly describe error 8645. So perhaps the adjustment was very narrow and has to do with avoiding that error only. Fair enough. I ran the same test that say error 8645 on RTM and the sixth query still hit error 8645.

Two of these things aren’t like the others

Let’s kick off five queries on CU14 and take another look at sys.dm_exec_query_memory_grants:

a38_memory_grant_new

That’s odd. The two queries that hit the 25 second timeout have lower values for max_used_memory_kb than the queries that didn’t time out, even though the memory grants are the same. Looking at sys.dm_db_column_store_row_group_physical_stats for one of the tables with the lower memory grant:

a38_rowgroup_stats

All rows were written to delta stores even though each thread got over 2 million rows. The query still takes its required memory grant but it doesn’t use most of the part reserved for columnstore compression. My best guess is that this is the change described in the KB article. A superficial glance suggests that the internal implementation is the same one used for writing to a delta store in serial:

a38_call_stacks

I think that I can understand the argument for making this change. However, I see intermittent intra-query parallel deadlocks when queries time out in this way:

Msg 1205, Level 13, State 78, Line 1
Transaction (Process ID 61) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

I’ve also seen this error as well when going too far above target memory:

Msg 701, Level 17, State 123, Line 1
There is insufficient system memory in resource pool ‘default’ to run this query.

Personally, I’d like to see clearly defined behavior that focus on stability. When writing code that does parallel inserts into columnstore indexes it may be desirable to catch errors 8645, 1205, and 701 and to retry the insert after rolling back, perhaps at MAXDOP 1.

Final thoughts

Kb articles for SQL Server fixes can sometimes be misleading because they may focus on how the problem was described in a support ticket even if the scope of the fix or the solution have little to nothing to do with said description. After reading KB article 4480641 you could easily think that error code 8645 no longer occurs with parallel columnstore inserts or that it’s possible to override the 25 second memory timeout for columnstore inserts. Neither one of those is true. Parallel inserts into columnstore tables can still exceed target memory, but they write to multiple delta stores if they hit their 25 second timeout. Thanks for reading!

Extra Documentation for KB Article 4462481

Posted on March 14, 2019 by Joe Obbish

SQL Server KB 4462481 is a bit light on details:

Assume that you use Microsoft SQL Server data warehousing. When you run parallel insert queries to insert data into a clustered columnstore index, the query operation freezes.

To quote one of the Top Men of SQL Server product support:

Hope this helps!

How to freeze your query

Consider a query execution that meets all of the following criteria:

A parallel INSERT INTO… SELECT into a columnstore table is performed
The SELECT part of the query contains a batch mode hash join
The query can’t immediate get a memory grant, hits the 25 second memory grant timeout and executes with required memory

The query may appear to get stuck. It no longer uses CPU and the parallel worker threads appear to wait on each other. Let’s go through an example on SQL Server 2017 RTM. First create all of the following tables:

CREATE PARTITION FUNCTION PART_FUN_REP_SGM
(BIGINT)
AS RANGE LEFT
FOR VALUES (
  1, 2, 3
); 

CREATE PARTITION SCHEME PART_SCHEME_REP_SGM
AS PARTITION PART_FUN_REP_SGM
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS dbo.SOURCE_DATA_FOR_CCI;

CREATE TABLE dbo.SOURCE_DATA_FOR_CCI (
	PART_KEY BIGINT NOT NULL,
	ID BIGINT NOT NULL,
	FAKE_COLUMN VARCHAR(4000) NULL
) ON PART_SCHEME_REP_SGM (PART_KEY);

INSERT INTO dbo.SOURCE_DATA_FOR_CCI WITH (TABLOCK)
SELECT TOP (1048576) 1, ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) % 16000, NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

INSERT INTO dbo.SOURCE_DATA_FOR_CCI WITH (TABLOCK)
SELECT TOP (1048576) 2, ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) % 16000, NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.LARGE_BATCH_MODE_MEMORY_REQUEST;

CREATE TABLE dbo.LARGE_BATCH_MODE_MEMORY_REQUEST (
	ID VARCHAR(4000),
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.LARGE_BATCH_MODE_MEMORY_REQUEST
	WITH (TABLOCK)
SELECT TOP (2 * 1048576) CAST(ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) AS VARCHAR(8000))
	+ 'DYEL'
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

DROP TABLE IF EXISTS dbo.CCI_SLOW_TO_COMPRESS_TARGET_1;

CREATE TABLE dbo.CCI_SLOW_TO_COMPRESS_TARGET_1 (
	ID BIGINT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS dbo.CCI_SLOW_TO_COMPRESS_TARGET_2;

CREATE TABLE dbo.CCI_SLOW_TO_COMPRESS_TARGET_2 (
	ID BIGINT NULL,
	INDEX CCI CLUSTERED COLUMNSTORE
);

Consider the following query:

INSERT INTO dbo.CCI_SLOW_TO_COMPRESS_TARGET_1
	WITH (TABLOCK)
SELECT LEFT(t1.ID, 1)
FROM LARGE_BATCH_MODE_MEMORY_REQUEST t1
INNER JOIN LARGE_BATCH_MODE_MEMORY_REQUEST t2
	ON t1.ID = t2.ID

UNION ALL

SELECT ID
FROM dbo.SOURCE_DATA_FOR_CCI
OPTION (MAXDOP 2);

Here’s what the plan looks like:

a37_plan

I set Max Server Memory to 8000 MB and executed two queries with a maximum allowed memory grant of 60% via Resource Governor. The first query finished in about 40 seconds. The second query hasn’t finished after 30 minutes. During those 30 minutes the second query has only used 1184 ms of CPU time. The COLUMNSTORE_BUILD_THROTTLE wait type shows up in sys.dm_os_waiting_tasks:

a37_wait

Execution context id 2 is waiting on execution context id 1 with a wait type of HTDELETE. Execution context id 1 has a wait type of COLUMNSTORE_BUILD_THROTTLE. I don’t think that this wait is supposed to show up for parallel inserts. It can show up by design when creating or rebuilding a columnstore index in parallel:

When a columnstore index is being built, the memory grant estimate is based on a segment size of one million rows. The first segment of the index is built using a single thread so the real, required per-thread memory grant is found. Then the memory grants are given per thread and the other segments are built multi-threaded. Although all the threads for the operation are allocated at the start of the build, only one thread is used for the first segment and all the others incur a COLUMNSTORE_BUILD_THROTTLE wait.

The important point is that a wait type of COLUMNSTORE_BUILD_THROTTLE means that worker thread is waiting on another thread to do something. But so does a wait time of HTDELETE. There are only two worker threads and both of them are waiting on another thread to do something. As a result, the query makes no progress. I’ve let similar queries run for 72 hours before giving up and killing them.

How to unfreeze your query

Upgrading to SQL Server 2017 CU11 or higher fixes the issue. With that said, if you’re seeing this issue that means that you have columnstore insert queries waiting at least 25 seconds for a memory grant. That should be considered to be a problem even if the queries didn’t get stuck. If you find yourself in this situation, I strongly consider increasing memory available to SQL Server or limiting memory grants via Resource Governor or some other method.

Final Thoughts

If you have parallel insert queries into CCis that appear to be stuck with no CPU usage and long COLUMNSTORE_BUILD_THROTTLE wait times check your version of SQL Server. Upgrading to at least CU11 will resolve the immediate issue, but consider a more complete fix will include addressing the RESOURCE_SEMAPHORE waits that must be present on your system. Thanks for reading!

Resource Governor MAXDOP Setting Can Lead to Poor Plan Choices

Posted on October 11, 2018October 11, 2018 by Joe Obbish

Resource Governor can be used to enforce a hard cap on query MAXDOP, unlike the sp_configure setting. However, query plan compilation does not take such a MAXDOP limit into account. As a result, limiting MAXDOP through Resource Governor can lead to unexpected degradations in performance due to suboptimal query plan choices.

Create Your Tables

We start with the not often seen here three table demo. I’d rather not explain how I came up with this sample data, so I’m not going to. I did my testing on a server with max server memory set to 10000 MB. The following tables take about half a minute to create and populate and only take up about 1.5 GB of space:

DROP TABLE IF EXISTS dbo.SMALL;
CREATE TABLE dbo.SMALL (ID_U NUMERIC(18, 0));

INSERT INTO dbo.SMALL WITH (TABLOCK)
SELECT TOP (100) 5 * ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.MEDIUM;
CREATE TABLE dbo.MEDIUM (ID_A NUMERIC(18, 0));

INSERT INTO dbo.MEDIUM WITH (TABLOCK)
SELECT TOP (600000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.LARGE;
CREATE TABLE dbo.LARGE (
	ID_A NUMERIC(18, 0),
	ID_U NUMERIC(18, 0),
	FILLER VARCHAR(100)
);

INSERT INTO dbo.LARGE WITH (TABLOCK)
SELECT 2 * ( RN / 4), RN % 500, REPLICATE('Z', 100)
FROM
(
	SELECT TOP (8000000) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
	CROSS JOIN master..spt_values t3
) q
OPTION (MAXDOP 1)

CREATE INDEX IA ON LARGE (ID_A);
CREATE INDEX IU ON LARGE (ID_U);

The Long-Awaited Demo

I thought up the theory behind this demo on a car ride back from a SQL Saturday, but wasn’t able to immediately figure out a way to get the query plan that I wanted. I ended up finally seeing it in a totally different context and am now happy to share it with you. Consider the following query:

SELECT LARGE.ID_U
FROM dbo.SMALL
INNER JOIN dbo.LARGE ON SMALL.ID_U = LARGE.ID_U
INNER JOIN dbo.MEDIUM ON LARGE.ID_A = MEDIUM.ID_A
OPTION (MAXDOP 1);

The MAXDOP 1 hints results in a serial plan with two hash joins:

a36_MAXDOP1_plan

This is a perfectly reasonable plan given the size and structure of the tables. There are no bitmap filters because row mode bitmap filters are only supported for parallel plans. Batch mode is not considered for this query because I’m testing on SQL Server 2017 and there isn’t a columnstore index on any of the tables referenced in the query. On my machine a single query execution uses 2422 of CPU time and 2431 ms of elapsed time.

A parallel plan at MAXDOP 4 is able to run more quickly but with a much higher CPU time. A single execution of the MAXDOP 4 query uses 5875 ms of CPU time and 1617 ms of elapsed time. There are multiple bitmap filters present. I zoomed in on the most interesting part of the plan because I haven’t figured out how images work with WordPress yet:

a36_MAXDOP4_plan

Instead of doing a scan of the LARGE table, SQL Server instead chooses an index intersection plan. The cost of the additional hash join is reduced by multiple bitmap filters. There are only 2648396 and 891852 rows processed on the build and probe side instead of 8 million for each side, which is a significant gain.

Worse Than A Teen Running for Governor

Some end users really can’t be trusted with the power to run parallel plans. I thought about making a joke about an “erik” end user but I would never subject my readers to the same joke twice. After enforcing a MAXDOP of 1 at the Resource Governor level, you will probably not be shocked to learn that the query with the explicit MAXDOP 1 hint gets the same query plan as before and runs with the same amount of CPU and elapsed time.

If you skipped or forget the opening paragraph, you may be surprised to learn that the query with a MAXDOP 4 hint also gets the same query plan as before. The actual execution plan even has the parallel racing arrows. However, the query cannot execute in parallel. The parallelism and bitmap operators are skipped by the query processor and all of the rows are processed on one thread:

a36_MAXDOP4_with_RG

I uploaded the query plan here if you want to look at it. This type of scenario can happen even without Resource Governor. For example, a compiled parallel query may be downgraded all the way to MAXDOP 1 if it can’t get enough parallel threads.

The query performs significantly worse than before, which hopefully is not a surprise. A single execution took 12860 ms of CPU time and 13078 ms of elapsed time. Nearly all of the query’s time is spent on the hash join for the index intersection, with a tempdb spill and the processing of additional rows both playing a role. The tempdb spill occurs because SQL Server expected the build side of the hash join to be reduced to 1213170 rows. The bitmap filtering does not occur so 8 million rows were sent to the build side instead.

In this case, adding a MAXDOP 1 hint to the query will improve performance by about 5X. Larger differences in run times can be easily seen on servers with more memory than my desktop.

Final Thoughts

If you’re using using Resource Governor to limit MAXDOP to 1, consider adding explicit MAXDOP 1 hints at the query level if you truly need the best possible performance. The MAXDOP 1 hint may at first appear to be redundant, but it gives the query optimizer additional information which can result in totally different, and sometimes significantly more efficient, query plans. I expect that this problem could be avoided if query plan caching worked on a Resource Governor workload group level. Perhaps that is one of those ideas that sounds simple on paper but would be difficult for Microsoft to implement. Batch mode for row store can somewhat mitigate this problem because batch mode bitmap filters operate even under MAXDOP 1, but you can still get classic row mode bitmaps even on SQL Server 2019. Thanks for reading!

What To Do When Wait Stats Don’t Help

Posted on July 20, 2018May 25, 2019 by Joe Obbish

Some SQL Server workloads are slow even though there aren’t any hints in the wait stats that suggest ways to make them go faster. This blog post works through a columnstore example of such a workload.

The Test Setup

The test server has 96 physical cores. The test workload has many concurrent writers to the same columnstore target table. The source data is a simple staging table with ten columns. Each column contains sequential integers from 1 to 10485760. Here is T-SQL to create and populate the tables:

DROP TABLE IF EXISTS dbo.RS_SOURCE_10_COLUMN_SIMPLE;

CREATE TABLE dbo.RS_SOURCE_10_COLUMN_SIMPLE (
    ID1 BIGINT,
    ID2 BIGINT,
    ID3 BIGINT,
    ID4 BIGINT,
    ID5 BIGINT,
    ID6 BIGINT,
    ID7 BIGINT,
    ID8 BIGINT,
    ID9 BIGINT,
    ID10 BIGINT
);

INSERT INTO dbo.RS_SOURCE_10_COLUMN_SIMPLE
	WITH (TABLOCK)
SELECT q.ID, q.ID, q.ID, q.ID, q.ID
, q.ID, q.ID, q.ID, q.ID, q.ID
FROM
(
       SELECT TOP (10485760) ROW_NUMBER()
             OVER (ORDER BY (SELECT NULL)) ID
       FROM master..spt_values t1
       CROSS JOIN master..spt_values t2
       CROSS JOIN master..spt_values t3
) q
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.CCI_TARGET_10_COLUMN;

CREATE TABLE dbo.CCI_TARGET_10_COLUMN (
    ID1 BIGINT,
    ID2 BIGINT,
    ID3 BIGINT,
    ID4 BIGINT,
    ID5 BIGINT,
    ID6 BIGINT,
    ID7 BIGINT,
    ID8 BIGINT,
    ID9 BIGINT,
    ID10 BIGINT,
    INDEX CCI0 CLUSTERED COLUMNSTORE
);

Concurrency is controlled by changing the number of stored procedures kicked off by a SQLCMD batch file. Each stored procedure does a single MAXDOP 1 insert from the staging table into the columnstore table. This test is different from the previous columnstore insert test documented here in that all of the sessions write to the same table and the source data gets significantly worse compression. In addition, each stored procedure only does a single insert, so the total overall work done increases as the number of concurrent sessions increases. If I run a single insert like the following:

INSERT INTO dbo.CCI_TARGET_10_COLUMN
SELECT *
FROM dbo.RS_SOURCE_10_COLUMN_SIMPLE
OPTION (MAXDOP 1);

That takes about 25 seconds to complete on the test server. It would be ideal if the average run time of the stored procedure stays at around 25 seconds as concurrency increases.

Test Results

We are not in an ideal world. The average run time of each stored procedure execution quickly increases with the number of concurrent sessions:

a34_scale_table

It may be slightly unreasonable to expect 96 concurrent sessions to scale perfectly well and to end up with an average run time of 24 seconds. With that said, the observed overhead is almost 4X at that scale. Let’s start by looking at wait stats:

a34_waits_for_LPIM_96

Most of the entries are related to memory. SOS_PHYS_PAGE_CACHE shows up because LPIM is enabled. This might be a helpful clue, but the waits only add up to about 10 seconds of waits per second. On a server with 96 cores that alone cannot explain a 4X degradation in scalability.

There is nothing helpful in latches:

a34_latches_for_LPIM_96

A number of spinlock collisions occurred during the workload:

a34_spins_for_LPIM_96

This is entirely a judgment call on my part, but based on my experience on this server and with similar workloads I don’t find those spinlock numbers to be at all remarkable. With that said, one avenue of investigation would be to dig into those spinlocks more.

From my point of view, the most important clue is that the server experiences high CPU throughout the workload according to SQL Server (using sys.dm_resource_governor_resource_pools), resource monitor, and perfmon. Perfmon also reports high privileged time:

a34_perfmon

SQL Server appears to be doing something which requires significantly more CPU at higher concurrency. Therefore, it makes perfect sense that the wait stats and latch stats alone aren’t very helpful. It’s reasonable to expect most wait types and latch classes to result in low CPU during their waits as opposed to high CPU.

ETW Tracing

ETW tracing can be used to diagnose CPU issues on servers. I’m still pretty new at it, but I use PerfView. In particular, what I really want to see is sampled call stacks to get an idea as to where CPU is being spent. I took a ten second sample during the 96 concurrent session workload and grouped by module name:

a34_perfview_LPIM_modules

The circled modules are typically what I see in SQL Server call stacks. SQL Server is responsible for the majority of CPU work on the server, but the majority of the work is done in the ntoskrnl and hal modules. I have it on good authority that CPU spent in that context will be tracked as privileged time in perfmon, so this roughly lines up with the perfmon graph previously shown.

A different view of the data reveals that most of the work is done as part of kernalbase!MapuserPhysicalPages:

a34_perfview_LPIM_just_my_app

In addition, we can see methods that call kernalbase!MapuserPhysicalPages:

a34_perfview_LPIM_methods_call_mapPhysicalPages

Presenting the data in a different way, we can see what was called by sqldk!SOS_MemoryWorkSpace::FreePage:

a34_perfview_LPIM_callee_free_Page

It appears that most of the CPU during the workload is spent returning memory from SQL Server to the OS. Backing that up, the memory commit for SQL Server was observed to decrease during the workload through resource monitor, even though there wasn’t any pressure from the OS and the total commit size was only at about 50% of max server memory.

For this workload, we are lucky in that important call stack names can be tied back to something in SQL Server. For example, sqldk!SOS_MemoryWorkSpace::FreePage would appear to be associated with freeing pages and that can be tracked from within SQL Server using extended events.

Extending The Investigation

The page_allocated and page_freed debug extended events are helpful in understanding how memory is allocated and freed during a query’s execution. First I started with a histogram target on memory_clerk_name for the page_freed event page_freed filtered to a single session. Here’s what that looks like:

a34_histogram

Nearly all of the free page events are for the MEMORYCLERK_SQLQERESERVATIONS clerk. I believe this clerk is responsible for granted query memory so this makes sense. Next we can turn on causality tracking and include the page_allocated event as well. Here’s a screenshot of some of the data for a simple CCI insert that processes three rowgroups in total:

a34_XE_picture

There are many different patterns in the data. Sometimes there is a series of free events and sometimes there is a series of allocation events. They can also alternate as shown in the picture above. The number of pages freed or allocated is usually one plus a power of two. I saved the data to a table and calculated a running total of current memory allocations for the query. The running total increased whenever pages were allocated and decreased whenever pages were freed. I’m really phoning it in with this graph, but here are the results:

a34_running_total_graph

The x-axis is time and the y-axis is total pages currently allocated to the query. This query compresses three rowgroups in total. That’s why the pattern repeats three times. Other than that, the only thing I was able to tie back to the compression algorithm was that pages are rapidly freed as segments are finished:

a34_segment_deallocations

In summary, page allocations and page free events rapidly occur, sometimes in an alternating pattern. SQL Server will often free a number of pages just to immediately request allocations for a similar number of pages. If all of the free page events result in returned memory to the OS then the reason for the scalability bottleneck becomes clear. When running the full workaround with 96 concurrent sessions, a total of 341965 page freed operations were performed. Those events freed about 71.3 million pages in total. That amounts to about 584 GB of memory returned to the OS in total, based on the previous assumptions.

It seemed bizarre to me for SQL Server to return so much memory to the OS during columnstore compression even though it would very likely immediately need more memory for compression. A colleague suggested that this could be a deliberate design decision by Microsoft to reduce the likelihood of out-of-memory errors when allocating memory above the max server memory setting:

“Starting with SQL Server 2012 (11.x), SQL Server might allocate more memory than the value specified in the max server memory setting. This behavior may occur when the Total Server Memory (KB) value has already reached the Target Server Memory (KB) setting (as specified by max server memory). If there is insufficient contiguous free memory to meet the demand of multi-page memory requests (more than 8 KB) because of memory fragmentation, SQL Server can perform over-commitment instead of rejecting the memory request.

This behavior is typically observed during the following operations:

Large Columnstore index queries.
Columnstore index (re)builds, which use large volumes of memory to perform Hash and Sort operations.”

What is Normal?

All of the above observations around memory allocations and freed memory are interesting but some additional context is needed. I suspect that it doesn’t, but it’s possible that a traditional row mode query would show the same pattern of allocated and freed pages. To investigate that, I created a row mode query with a memory consuming operator that processes a result set of rows, releases the rows, and continues to do so in a loop. The query plan looks like this:

a34_query_plan

Note the Hash Match (Aggregate) operator on the inner side of the nested loop. For each row in the DRIVE_ME table, SQL Server will send nearly all of the rows from DISTINCT_ONE_COL into the hash match operator. The reason that I went for this type of query is that it felt like the most similar pattern to columnstore compression in that compression processes up to 1048576 rows per thread in what’s effectively a loop. For those following along at home here is T-SQL to generate that query plan:

drop table if exists DRIVE_ME;

create table DRIVE_ME (ID INT);

INSERT INTO DRIVE_ME VALUES
(1), (2), (3), (4), (5), (6), (7),
(8), (9), (10), (11), (12), (13), (14);

drop table if exists DISTINCT_ONE_COL;

create table DISTINCT_ONE_COL (ID INT);

INSERT INTO DISTINCT_ONE_COL WITH (TABLOCK)
SELECT TOP (10000000) ROW_NUMBER()
      OVER (ORDER BY (SELECT NULL)) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1)

GO

CREATE OR ALTER PROCEDURE
dbo.REPEATED_ROW_MODE_ALLOC AS
BEGIN
SET NOCOUNT ON;
SELECT ca.cnt
FROM DRIVE_ME d
OUTER APPLY
(
	SELECT COUNT(DISTINCT ID)
	FROM DISTINCT_ONE_COL
	WHERE ID > d.ID
) ca (cnt)
OPTION (MIN_GRANT_PERCENT = 2.5)
END;

On the test server executing a single query took around 90 seconds. That single query execution frees around 760000 memory pages in total. Memory is allocated and freed in cycles. There isn’t a pattern where allocation and free events trade off between each other.

The workload does very well at scale. With 96 concurrent sessions the average stored procedure execution remains at around 90 seconds. We’re freeing the same amount of pages in total as the columnstore insert workload without seeing a drop in scalability. We can again use PerfView to analyze the workload by modules:

a34_perfview_rowstore_modules

This time most of the CPU is spent on SQL Server managed code. Drilling down, the operation that uses the most CPU appears to involving inserting rows into a hash table that’s in a query’s memory grant:

a34_perfview_rowstore_biggest_cpu

There isn’t necessarily anything wrong with this. After all, something needs to use the most CPU during the workload and we do achieve very good scalability. To me, the most important part is how call stacks are different when looking at everything called by sqldk!SOS_MemoryWorkSpace::FreePage:

a34_perfview_rowstore_callee_free_Page

Within SQLOS we call “FreeBlock” instead of “DecommitBlock”. The windows OS calls are significantly different compared to columnstore compression and I wasn’t able to observe the memory commit for SQL Server changing during the workload. I can’t say for sure, but it looks like when memory is freed for memory-consuming row mode operators it is returned to SQL Server as opposed to returned to the OS. That makes a big difference in scalability.

I wanted to test with batch mode memory consuming operators, but couldn’t figure out a query pattern that resulted in the number of processed rows getting constantly reset for an operator that didn’t involve UDFs. I will say that I observed memory getting returned to the OS during batch mode queries, but I did not immediately see the performance bottleneck around freeing the pages. It’s possible that with more work a demo could be created to show this bottleneck. It’s also possible that the different pattern of freed pages means that it won’t realistically happen. In my limited testing I did not see batch mode operators free pages before they had done all of their work, which is significantly different from the columnstore compression pattern of allocating and freeing pages.

Changing the Memory Model

We know that the problem has to do with memory deallocation, so testing with the conventional and large pages memory models may be fruitful. The workload significantly degraded without lock pages in memory. The average insert time was about 368 seconds with 96 concurrent sessions. 91% of the workload time can be tied back to memory related waits:

a34_convential_waits

At first glance, this workload has a problem with memory allocation as opposed to freeing memory pages. It would be possible to use techniques within SQL Server to further drill into the scalability bottleneck, but I will not do that analysis here.

Enabling large pages for the buffer pool via TF 834 does not resolve this bottleneck. The average insert completion time is nearly identical to the LPIM test and I did not observe any significant differences in PerfView results. TF 834 is helpful with some columnstore workloads but it does not help with this one. Keep in mind that TF 834 is not supported with columnstore according to Microsoft.

Waiting for a Wait Type

I’m not a cloud guy, but I wonder if it would even be possible to troubleshoot an issue like this if the investigator doesn’t have access to the underlying OS or the ability to install ETW tracing tools. It would be helpful if Microsoft was able to capture and present this bottleneck in SQL Server. For example, a new wait type could be defined named something like “RESERVED_MEMORY_FREE_EXT” with the timer starting after sqldk!SOS_MemoryWorkSpace::DeCommitBlock and ending after the OS owned code finishes its work. A defined wait type would make it much more obvious that there’s a problem and allow affected end users to easily search for guidance or workarounds for the issue. This would be a cpu busy external wait. I don’t know if defining a wait like this is in line with Microsoft’s general philosophy for designing wait types, but it’s just an idea.

If the bottleneck in the OS truly can’t be solved then perhaps it could be protected by a spinlock within SQL Server. That spinlock might not accomplish anything other than letting someone who only has access to SQL Server know that there’s a problem. Again I don’t know if that’s an appropriate use of a spinlock. I’m out of my depth here when it comes to suggestions like these, but I do think there is a need to classify this bottleneck. Of course, Microsoft making code changes in SQL Server or even in Windows Server to resolve the bottleneck would be preferred.

Final Thoughts

This blog post covered the investigation of a slow workload for which wait stats were not helpful in diagnosing the problem. In summary:

ETW tracing and extended events revealed a bottleneck related to returning memory to the OS from SQL Server. This bottleneck can occur when loading data into columnstore tables at reasonably high concurrency.
Lock pages in memory can have a dramatic effect on workload performance even when paging to disk is not occurring. This must be a gross oversimplification, but with LPIM SQL Server uses different OS calls to manage memory and the difference in those calls can lead to scalability issues with and without LPIM. For example, highly concurrent workloads that call HASHBYTES have been observed to get better throughput with LPIM.
For columnstore workloads on SQL Server 2017, I recommend the use of LPIM unless there is a very good reason not to use it. I have seen LPIM cause performance issues on SQL Server 2016 but have not fully sourced those issues. LPIM still may be the right starting point on SQL Server 2016.
The investigation did not have a happy ending in that it isn’t clear how to resolve the bottleneck except by lowering concurrency. However, the work done here does provide a helpful foundation for reaching out to a VM or Windows admin or for opening a support ticket with Microsoft.

Thanks for reading!

Can Rowstore Compression Beat Columnstore Compression?

Posted on June 28, 2018 by Joe Obbish

Columnstore has quite a few different tricks for compressing data. This blog post explores if it’s possible for a rowstore table to beat columnstore compression, even in the best case scenario for the columnstore table (no delta stores and rowgropus of the maximum size).

Page Compression and String Data

Everybody knows that strings aren’t the best fit for columnstore tables. Let’s start by puting sequential integers from 1 to 1048576 into a page compressed table:

DROP TABLE IF EXISTS dbo.RS_VARCHAR33;
GO
CREATE TABLE dbo.RS_VARCHAR33 (
	ID1 VARCHAR(33)
) WITH (DATA_COMPRESSION = PAGE);

INSERT INTO dbo.RS_VARCHAR33 WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

According to sp_spaceused, the table takes up 11464 KB of space. Next load the same data into a columnstore table:

DROP TABLE IF EXISTS dbo.CCI_VARCHAR33;
GO
CREATE TABLE dbo.CCI_VARCHAR33 (
	ID1 VARCHAR(33),
	INDEX C CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_VARCHAR33 WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) ID
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

The CCI takes up 14408 KB of space. The majority of the space is used for the dictionary. The follow query returns 11495348 bytes:

SELECT csd.on_disk_size
FROM sys.column_store_dictionaries csd
INNER JOIN sys.partitions sp
	ON csd.partition_id = sp.partition_id
WHERE sp.[object_id] = OBJECT_ID('CCI_VARCHAR33');

Data types matter here. If I change the CCI to have a VARCHAR(32) column instead then the table only requires 1800 KB of space. A dictionary is not created in this case. A takeaway is that SQL Server may choose to create a dictionary based on the length of the defined string column. Data sets with many unique strings may require a relatively large amount of space for the dictionary, even to the point where page compressed data can have overall a lower footprint on the database.

Page Compression and Non-string Data

It’s certainly more difficult to come up with a demo that works without string columns, but consider how the page compression algorithm works. Data can be compressed on page basis, which includes both multiple rows and multiple columns. That means that page compression can achieve a higher compression ratio when a row has identical values in different columns. Columnstore is only able to compress on an individual column basis and you won’t directly see better compression with repeated values in different columns for a single row (as far as I know).

The table defined and populated below requires 11912 KB of disk space:

DROP TABLE IF EXISTS dbo.RS_4_COLUMN;
GO
CREATE TABLE dbo.RS_4_COLUMN (
	ID1 BIGINT,
	ID2 BIGINT,
	ID3 BIGINT,
	ID4 BIGINT
) WITH (DATA_COMPRESSION = PAGE);

INSERT INTO dbo.RS_4_COLUMN WITH (TABLOCK)
SELECT ID, ID, ID, ID
FROM (
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) / 10 ID
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

If I rerun the code with five columns, the table now takes up 13000 KB of space. That’s only an increase of 9% in space to hold 25% more data. The same data loaded into a columnstore table with four columns takes 11272 KB of space. Adding one column results in a total size of 14088 KB, which is almost exactly a 25% increase. For this data set with five columns, page compression across rows is more efficient than standard columnstore compression.

Row Compression and Non-string Data

Row compression doesn’t allow for compression benefits from storing the same value in multiple columns. Is it possible to beat columnstore compression with row compression without string columns? You betcha!

DROP TABLE IF EXISTS #BATCH_MODE;
CREATE TABLE #BATCH_MODE (
	ID INT,
	INDEX C CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS dbo.RS_2_COLUMN_ROW_COMPRESSED;
GO
CREATE TABLE dbo.RS_2_COLUMN_ROW_COMPRESSED (
	ID1 BIGINT,
	ID2 BIGINT
) WITH (DATA_COMPRESSION = ROW);

INSERT INTO dbo.RS_2_COLUMN_ROW_COMPRESSED WITH (TABLOCK)
SELECT
  SUM(q.ID) OVER (ORDER BY q.ID)
, SUM(q.ID) OVER (ORDER BY q.ID)
FROM
(
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) ID
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) q
LEFT OUTER JOIN #BATCH_MODE ON 1 = 0
OPTION (MAXDOP 1);

EXEC sp_spaceused 'RS_2_COLUMN_ROW_COMPRESSED';

DROP TABLE IF EXISTS #BATCH_MODE;
CREATE TABLE #BATCH_MODE (
	ID INT,
	INDEX C CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS dbo.CCI_2_COLUMN;
GO
CREATE TABLE dbo.CCI_2_COLUMN (
	ID1 BIGINT,
	ID2 BIGINT,
	INDEX C CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_2_COLUMN WITH (TABLOCK)
SELECT
  SUM(q.ID) OVER (ORDER BY q.ID)
, SUM(q.ID) OVER (ORDER BY q.ID)
FROM
(
	SELECT TOP (1048576) ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) ID
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) q
LEFT OUTER JOIN #BATCH_MODE ON 1 = 0
OPTION (MAXDOP 1);

EXEC sp_spaceused 'CCI_2_COLUMN';

In this example, the rowstore table has a total size of 15496 KB but the columnstore has a total size of 16840 KB. I’m honestly not sure why this happens. I did try to pick the unfriendliest data set to columnstore compression that I could: no repeated values and no obvious patterns in data. Perhaps there’s some additional overhead of compression that pushes it over the row compressed data.

Final Thoughts

Columnstore doesn’t guarantee better compression than rowstore, even with perfectly sized rowgroups. Rowstore can provide better compression ratios for string columns, tables with repeated values across columns, and in other uncommon scenarios. A summary of test results is here:

a33_summary

The differences are small per rowgroup, but they can add up as more columns and more rows are added to the tables. I’ve seen tables in the real world that ended up bigger as columnstore compared to rowstore, which is what inspired me to look for some of these examples. Thanks for reading!

The NESTING_TRANSACTION_FULL Latch

Posted on June 15, 2018June 21, 2018 by Joe Obbish

This blog posts investigates the NESTING_TRANSACTION_FULL latch. This latch class can be a bottleneck in extreme ETL workloads. In case you need a quick definition of a latch:

A latch is a lightweight synchronization object that is used by various SQL Server components. A latch is primarily used to synchronize database pages. Each latch is associated with a single allocation unit. A latch wait occurs when a latch request cannot be granted immediately, because the latch is held by another thread in a conflicting mode. Unlike locks, a latch is released immediately after the operation, even in write operations. Latches are grouped into classes based on components and usage. Zero or more latches of a particular class can exist at any point in time in an instance of SQL Server.

Why is the Latch Needed?

Paul Randal has a good explanation here. My experience with it is isolated to parallel SELECT INTO (introduced in SQL Server 2014) and parallel insert into heaps and columnstore (introduced in SQL Server 2016). Each worker thread of the parallel insert has a subtransaction, but only the main transaction can modify the transaction log. Whenever a worker thread needs to modify the transaction log it needs to take an exclusive latch on a subresource under the NESTING_TRANSACTION_FULL latch class. Only one worker thread can hold the latch for the transaction at a time, so this can lead to contention. This is layperson’s explanation based on observed behavior, so please forgive any inaccuracies.

The Test Server

For testing performance and scalability of parallel inserts I prefer to use hardware with a large number of physical cores per socket. I have access to a virtualized test server that’s 4 X 24. I wanted to make tests be as fair as possible, so I decided to only use a single memory node of the server. It seemed logical to pick the node with the least amount of system processes. Here’s a query to view many of them:

SELECT scheduler_id, command
FROM sys.dm_exec_requests r
INNER JOIN sys.dm_exec_sessions s
	ON r.session_id = s.session_id
WHERE s.is_user_process = 0
AND scheduler_id IS NOT NULL
ORDER BY scheduler_id;

Here is a partial result set:

a32_system_processes

I omitted a few processes such as the LOG WRITERS which are on memory node 0. Memory node 1, which covers schedulers 24 to 47, seems like the best choice. All memory nodes have a LAZY WRITER so that can’t be avoided. I’m not doing any full text work that I’m aware of so that just leaves SYSTEM_HEALTH_MONITOR. We have the best, most healthy systems, so I’m sure that process doesn’t do a lot of work.

Resource governor is always enabled on this server, so I can send new sessions to schedulers on memory node 1 with the following commands:

ALTER RESOURCE POOL [default] WITH
	(AFFINITY SCHEDULER = (24 to 47));

ALTER RESOURCE GOVERNOR RECONFIGURE;

The Table

For testing I wanted to insert a moderate amount of data while varying MAXDOP. I needed to read enough data for parallel table scans to be effective up to MAXDOP 24 and to write enough data so that parallel insert could be effective. At the same time, I didn’t want to write too much data because that could make running dozens of tests impractical.

I settled on a two column table with an odd partitioning scheme. All of the data is loaded into a single partition so we can run parallel inserts which result in all rows getting sent to a single thread as needed. Most tests spread rows over all parallel worker threads. The FILLER column is there to give the table enough pages to make parallel scans effective. It’s also helpful to run a slower insert query as needed. Other than that, there’s nothing special about the definition or data of the table and it can be changed as desired.

CREATE PARTITION FUNCTION PF_throwaway_1
(BIGINT)
AS RANGE LEFT
FOR VALUES (0, 1, 2, 3); 

CREATE PARTITION SCHEME PS_throwaway_1
AS PARTITION PF_throwaway_1
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS BASE_TABLE;

CREATE TABLE dbo.BASE_TABLE (
	ID BIGINT,
	FILLER VARCHAR(1000)
) ON PS_throwaway_1 (ID);

INSERT INTO dbo.BASE_TABLE WITH (TABLOCK)
SELECT 1, REPLICATE('Z', 100)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

The tempdb database seemed like a good target for writes because I already have 96 data files. There also may be some writing optimizations for tempdb which could be helpful.

A typical test query for observing latch waits looks like something like this:

SELECT ID, REPLICATE('Z', 1000) COL INTO #t
FROM BASE_TABLE
WHERE ID = 1
OPTION (MAXDOP 1);

With MAXDOP varying all the way from 1 to 24.

How Many Latches are Taken?

sys.dm_os_latch_stats cannot be used to figure out how many total latches were taken:

sys.dm_os_latch_stats does not track latch requests that were granted immediately, or that failed without waiting.

The only way that I know to do that is through extended events. The latch_acquired debug event filtered on the TRAN_NESTING_FULL class is helpful. For data storage targets I used event_counter and histogram. I imagine that these extended events can have a lot of overhead, but I’m doing my testing on a non-production server.

Let’s start with a query that’s not eligible for parallel insert:

DROP TABLE IF EXISTS #t;

SELECT ID, REPLICATE('Z', 1000) COL INTO #t
FROM BASE_TABLE
OPTION (MAXDOP 1);

I expected that no latches will be taken on NESTING_TRANSACTION_FULL. That is indeed what happens.

What about a query that runs at MAXDOP 2 but for which all of the rows are sent to a single worker thread? With four partitions and MAXDOP 2, each worker will be assigned a partition. The worker moves onto the next partition after it reads all of the rows. Skewed data in partitioned tables can lead to parallelism imbalances which can cause problems in real workloads here. It can also be used to our advantage in testing which is what I’m doing here.

DROP TABLE IF EXISTS #t;

SELECT ID, REPLICATE('Z', 1000) COL INTO #t
FROM BASE_TABLE
OPTION (MAXDOP 2);

There is a total of 347435 latch_acquired events for NESTING_TRANSACTION_FULL. The latch was acquired and released 347435 times. If I run a query with rows spread over all parallel workers, such as this one:

DROP TABLE IF EXISTS #t;

SELECT ID, REPLICATE('Z', 1000) COL INTO #t
FROM BASE_TABLE
WHERE ID = 1
OPTION (MAXDOP 2);

I get the same number of latch_acquired events.

I happened to notice that the query writes 347321 log records to the transaction log. That number is suspiciously close to the number of latches that were acquired. I can get the callstacks around the latch_acquired event by using the technique described here by Jonathan Kehayias. The top three buckets in the histogram each have 115192, 115189, and 115187 events. The callstacks seem to correspond to changing a PFS, GAM, or IAM page. They are reproduced below:

sqlmin.dll!XeSqlPkg::latch_acquired::Publish+0x1a9
sqlmin.dll!LatchBase::RecordAcquire+0x191
sqlmin.dll!LatchBase::AcquireInternal+0x499
sqlmin.dll!ParNestedXdes::GenerateLogRec+0x98
sqlmin.dll!PFSPageRef::ModifyPFSRow+0x68b
sqlmin.dll!ChangeExtStateInPFS+0x2a9
sqlmin.dll!AllocationReq::AllocateExtent+0x33e
sqlmin.dll!AllocationReq::AllocatePages+0x123b
sqlmin.dll!AllocationReq::Allocate+0xf3
sqlmin.dll!ExtentAllocator::PreAllocateExtents+0x457
sqlmin.dll!ExtentAllocatorSingleAlloc::PreAllocate+0x72
sqlmin.dll!ExtentAllocatorSingleAlloc::AllocateExtents+0x26c
sqlmin.dll!CBulkAllocator::AllocateExtent+0x226
sqlmin.dll!CBulkAllocator::AllocatePageId+0xe4
sqlmin.dll!CBulkAllocator::AllocateLinkedAndFormattedLeafPage+0xc1
sqlmin.dll!CHeapBuild::AllocateNextHeapPage+0x1f
sqlmin.dll!CHeapBuild::InsertRow+0x1b1
sqlmin.dll!RowsetBulk::InsertRow+0x23a9
sqlmin.dll!CValRow::SetDataX+0x5b
sqlTsEs.dll!CDefaultCollation::IHashW+0x227
sqlmin.dll!CQScanUpdateNew::GetRow+0x516
sqlmin.dll!CQScanXProducerNew::GetRowHelper+0x386
sqlmin.dll!CQScanXProducerNew::GetRow+0x15
sqlmin.dll!FnProducerOpen+0x5b

sqlmin.dll!XeSqlPkg::latch_acquired::Publish+0x1a9
sqlmin.dll!LatchBase::RecordAcquire+0x191
sqlmin.dll!LatchBase::AcquireInternal+0x499
sqlmin.dll!ParNestedXdes::GenerateLogRec+0x98
sqlmin.dll!PageRef::ModifyBits+0x3e0
sqlmin.dll!ModifyGAMBitAfterNewExtentFound+0xac
sqlmin.dll!AllocExtentFromGAMPage+0x8ef
sqlmin.dll!AllocationReq::AllocateExtent+0x1bb
sqlmin.dll!AllocationReq::AllocatePages+0x123b
sqlmin.dll!AllocationReq::Allocate+0xf3
sqlmin.dll!ExtentAllocator::PreAllocateExtents+0x457
sqlmin.dll!ExtentAllocatorSingleAlloc::PreAllocate+0x72
sqlmin.dll!ExtentAllocatorSingleAlloc::AllocateExtents+0x26c
sqlmin.dll!CBulkAllocator::AllocateExtent+0x226
sqlmin.dll!CBulkAllocator::AllocatePageId+0xe4
sqlmin.dll!CBulkAllocator::AllocateLinkedAndFormattedLeafPage+0xc1
sqlmin.dll!CHeapBuild::AllocateNextHeapPage+0x1f
sqlmin.dll!CHeapBuild::InsertRow+0x1b1
sqlmin.dll!RowsetBulk::InsertRow+0x23a9
sqlmin.dll!CValRow::SetDataX+0x5b
sqlTsEs.dll!CDefaultCollation::IHashW+0x227
sqlmin.dll!CQScanUpdateNew::GetRow+0x516
sqlmin.dll!CQScanXProducerNew::GetRowHelper+0x386
sqlmin.dll!CQScanXProducerNew::GetRow+0x15

sqlmin.dll!XeSqlPkg::latch_acquired::Publish+0x1a9
sqlmin.dll!LatchBase::RecordAcquire+0x191
sqlmin.dll!LatchBase::AcquireInternal+0x499
sqlmin.dll!ParNestedXdes::GenerateLogRec+0x98
sqlmin.dll!PageRef::ModifyBits+0x3e0
sqlmin.dll!ChangeExtStateInIAM+0x2ac
sqlmin.dll!AllocationReq::AllocatePages+0x17e5
sqlmin.dll!AllocationReq::Allocate+0xf3
sqlmin.dll!ExtentAllocator::PreAllocateExtents+0x457
sqlmin.dll!ExtentAllocatorSingleAlloc::PreAllocate+0x72
sqlmin.dll!ExtentAllocatorSingleAlloc::AllocateExtents+0x26c
sqlmin.dll!CBulkAllocator::AllocateExtent+0x226
sqlmin.dll!CBulkAllocator::AllocatePageId+0xe4
sqlmin.dll!CBulkAllocator::AllocateLinkedAndFormattedLeafPage+0xc1
sqlmin.dll!CHeapBuild::AllocateNextHeapPage+0x1f
sqlmin.dll!CHeapBuild::InsertRow+0x1b1
sqlmin.dll!RowsetBulk::InsertRow+0x23a9
sqlmin.dll!CValRow::SetDataX+0x5b
sqlTsEs.dll!CDefaultCollation::IHashW+0x227
sqlmin.dll!CQScanUpdateNew::GetRow+0x516
sqlmin.dll!CQScanXProducerNew::GetRowHelper+0x386
sqlmin.dll!CQScanXProducerNew::GetRow+0x15
sqlmin.dll!FnProducerOpen+0x5b
sqlmin.dll!FnProducerThread+0x80b

The callstack with the next most events only has 472, so I consider those three to be the important ones.

The data for the temp table takes up 7373280 KB of space. That’s about 115208 extents, and multiplying that again by 3 is again suspiciously close to the previous two numbers. It seems reasonable to conclude that the number of NESTING_TRANSACTION_FULL latches required for a minimally logged parallel insert into a heap will be approximately equal to 3 times the number of extents needed for the new data. Note that this is an approximation, and there are slight changes to the latch acquire count as MAXDOP changes.

Slowing Down the Insert

I changed the insert query to the following:

DROP TABLE IF EXISTS #t;
SELECT ID
, CASE WHEN CHARINDEX('NO U', FILLER) = 0
THEN REPLICATE('Z', 1000)
ELSE NULL END COL INTO #t
FROM BASE_TABLE
WHERE ID = 1
OPTION (MAXDOP 1);

The point is to require more CPU to insert the same volume of data as before. Here is a table showing elapsed time along with wait information for the latch:

a32_slow_no_serial

I didn’t include CPU time because SET STATISTICS TIME ON was wildly inaccurate for queries with higher DOP. Elapsed time decreases from MAXDOP 1 to MAXDOP 8 but starts to increase after MAXDOP 8. The total wait time dramatically increases as well. In addition, nearly all latch acquires at MAXDOP 16 or MAXDOP 24 had to be waited on.

We know that only one worker can get the exclusive latch for the transaction at a time. Let’s use a greatly simplified model for what each parallel worker does for this query. It reads a row, does processing for a row, and goes on to the next one. Once it has enough rows to write out a log record it tries to acquire the latch. If no one else has the latch in exclusive mode it can get the latch, update some structure in the parent transaction, release the latch, and continue reading rows. If another worker has the latch in exclusive mode then it adds itself to the FIFO wait queue for the latch subresource and suspends itself. When the resource is available the worker status changes from SUSPENDED to RUNNABLE. When it changes again from RUNNABLE to RUNNING it acquires the latch, updates some structure in the parent transaction, releases the latch, and continues working until it either needs to suspend again or hits the end of its 4 ms quantum. When it hits the end of its 4 ms quantum it will immediately select itself to run again because there are no other runnable workers on the scheduler.

So what determines the level of contention? One important factor is the number of workers that are contending over the same subresource. For this latch and type of query (rows are pretty evenly distributed between worker threads), this is simply MAXDOP. There’s a tipping point for this query where adding more workers is simply counterproductive.

For years I’ve seen people in the community state that running queries at MAXDOP that’s too high can be harmful. I’ve always been after simple demos that show why that can happen. The NESTING_TRANSACTION_FULL latch is an excellent example of why some queries run longer if MAXDOP is increased too far. There’s simply too much contention over a shared resource.

Speeding up the Insert

Let’s go back to the original query which is able to insert data at a faster rate:

DROP TABLE IF EXISTS #t;
SELECT ID, REPLICATE('Z', 1000) COL INTO #t
FROM BASE_TABLE
WHERE ID = 1
OPTION (MAXDOP 2);

Here is a table showing elapsed time along with wait information for the latch:

a32_fast_no_serial

We see a similar pattern to the previous query. However, run times are fairly close at MAXDOP 16 and 24. How can that be? Based on the MAXDOP 1 run times we know that this query only has to do about 50% of the CPU work compared to the other query.

Consider the rate of latch acquisition. Let’s suppose that we need to take around 360000 latches for both queries and suppose that their parallel workers never need to wait on anything. Based on the MAXDOP 1 runtime for this query we can work at a rate of 360000/(9565/4) = 150 latches per 4 ms quantum per worker. For the slower query, we can only work at a rate of 360000/(17101/4) = 84 latch acquires per 4 ms quantum. Of course, the assumption that none of the workers for these parallel queries will wait is wrong. We can see high wait times at high MAXDOP. The key is to think about what each worker does between waits. It’s true that the first query needs to do more CPU work overall. However, at MAXDOP 24 we can have up to 23 workers in the wait queue for the latch. It seems unlikely that a worker will be able to acquire many latches in a row without waiting. At high MAXDOP workers will often need to suspend themselves. As long as the amount of work between log records is significantly less than the 4 ms quantum then there won’t be a run time difference between the queries. The query with CHARINDEX will do useful CPU work while the query without it will wait. That’s why the query without CHARINDEX has more aggregate wait time at MAXDOP 24. The workers are able to enter a SUSPENDED state faster than the workers for the other queries, but that isn’t going to make the query complete any faster.

Adding a Busy Scheduler

In the previous tests we only had a single user query running on the 24 schedulers available to us. That isn’t a realistic real world scenario. There will often be other queries competing for CPU resources on the same schedulers. Now I’ll add a single MAXDOP 1 query which won’t finish its work for many hours. It’s designed to burn CPU as efficiently as possible in that there are very few possible waits. The worker thread should be able to use its full 4 ms quantum almost always. Here’s the query that I used:

SELECT TOP (1) t1.high + t2.high + t3.high + t4.high
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
CROSS JOIN master..spt_values t4
ORDER BY t1.high + t2.high + t3.high + t4.high
OPTION (MAXDOP 1, NO_PERFORMANCE_SPOOL)

I believe the query was assigned to scheduler 27. It stayed there until I cancelled the query.

What happens to the performance of the parallel SELECT INTO when one of the schedulers for its workers shares a scheduler with the MAXDOP 1 query? To make that happen I kept track of where the parallel enumerator was and made sure that 8 of the parallel workers were always assigned to the soft-NUMA node which contained the scheduler of the MAXDOP 1 query. I don’t love my readers enough to configure soft NUMA so I only have test results for MAXDOP 8, 16, and 24 (auto soft-NUMA splits a virtualized 24 scheduler memory node into 3 groups of 8 schedulers). Here are the test results:

a32_fast_serial

Query run times have dramatically increased. Why did that happen? Recall that latches can be requested at a high rate for the workers of our parallel workers. If a worker requests a latch that it can’t get it suspends itself. However, what happens if there’s another worker on that same scheduler that can use its full 4 ms quantum, such as our MAXDOP 1 query. Even if the latch resource is available after 1 ms, the worker for the parallel worker won’t be able to run for a minimum of 4 ms. It needs to wait for the MAXDOP 1 worker to yield the scheduler. The FIFO nature of the latch queue is harmful here from a query runtime point of view. A scheduling delay between RUNNABLE and RUNNING for a single worker can cause all other parallel worker threads to wait.

It is interesting that run time gets better with higher MAXDOP. This is the opposite pattern of before. Let’s look at the number of latch acquires bucketed by scheduler_id for the MAXDOP 8 query:

a32_fast_interference_schedulers_MAXDOP8

The worker sharing a scheduler with the MAXDOP 1 query does get about 10% fewer latches. However, the FIFO nature of the queue means that work cannot balance well between schedulers, even though the parallel scan and insert operators distribute rows on a demand basis. Based on the wait event counts, it’s fair to guess that the worker had to wait on the latch about 33/36 = 91.6% of the time. If the worker on scheduler 27 suspends itself it’ll need to wait 4 ms before it can start running again. That gives us a minimum run time for the query of 39961*4*(33/36) = 146523, which is fairly close to the true elapsed time of 141162.

Now consider the latch acquire distribution for the MAXDOP 16 query:

a32_fast_interference_schedulers_MAXDOP16

The latches are spread fairly evenly over worker threads. Scheduler 27 only had 20968 latch acquires, so we can calculate our guess for a run time floor as 20968*4*(35/36) = 81542 ms. This is close to the true elapsed time of 82245 ms.

The most important takeaway from this section is that query runtime for parallel inserts or parallel SELECT INTO can dramatically increase if there’s any other work happening on those schedulers. Increasing MAXDOP can apparently be helpful in working around scheduler contention, but it will make latch contention worse. I’ve never seen a practical example where that strategy works out.

Adding More CPU Work

Keeping the busy scheduler, let’s go back to the query with CHARINDEX:

DROP TABLE IF EXISTS #t;
SELECT ID
, CASE WHEN CHARINDEX('NO U', FILLER) = 0
THEN REPLICATE('Z', 1000)
ELSE NULL END COL INTO #t
FROM BASE_TABLE
WHERE ID = 1
OPTION (MAXDOP 1);

Here are the test results at MAXDOP 8, 16, and 24:

a32_slow_serial

I was surprised to see faster query execution times for the query that effectively needs to do more work. I ran the tests a few times and always saw the same pattern. It certainly makes sense that this query will spend less time on latch waits than the other one, but the exact mechanisms behind faster run times aren’t clear to me yet. Here’s the count of latches split by worker for the CHARINDEX query:

a32_slow_interference_schedulers_MAXDOP8_compare

Here’s the count for a different test run of the query without CHARINDEX:

a32_fast_interference_schedulers_MAXDOP8_compare

The CHARINDEX query has fewer latch acquires on scheduler 27. That allows the other schedulers to get more latches and to do more work. That explains the difference in run time, but I don’t understand why it happens. Perhaps the worker for the CHARINDEX query on scheduler 27 is able to exhaust its 4 ms quantum more often than the other query which allows the other workers to cycle through the latch while the MAXDOP 1 query is on scheduler 27. I may investigate this more another time.

Real World Problems

Adding a single MAXDOP 1 query to the workload in some cases made the parallel query take almost 30 times as long. Is this possible in the real world?

Some parallel inserts generate data to be inserted at a fast rate. Consider the results of a parallel batch mode hash join, for example. Multiple exclusive latches on NESTING_TRANSACTION_FULL seem to be required for every transaction log record that is generated. This can slow down queries and limit overall scalability. The FIFO nature of the queue is especially problematic when there are high signal waits (delays from changing worker status from RUNNABLE to RUNNING). A worker from another query on a scheduler can lead to waits for every parallel worker for an insert or SELECT INTO. There are many reasons for high signal waits: spinlock pressure, some external waits, nonyielding schedulers, and workers that choose not to yield for reasons only known to them. Query performance can degrade even with just a simple query that exhausts its 4 ms quantum every time, as was shown in this blog post.

Lowering MAXDOP can help, but it may not be enough for some workloads. The high rate of exclusive NESTING_TRANSACTION_FULL latch requests feels like a scalability problem that only Microsoft can solve. It would be great if each subtraction was able to update the log and the subtractions could be grouped as needed, such as for query rollback. I can’t speak to the complexities in making such a code change.

Final Thoughts

The NESTING_TRANSACTION_FULL latch can be a scalability and performance bottleneck on systems that do many concurrent parallel insert queries or parallel SELECT INTO queries. If you see this bottleneck for your workload, there are a few things that might be helpful to keep in mind:

The number of latches taken is proportional to the number of log records needed for the transaction. With minimal logging, the number of exclusive latches taken is about equal to 3 * number of extents.
Latch contention gets worse as MAXDOP increases.
Latch contention gets worse as the rate of latch requests increases. In other words, queries that can generate their data to be inserted very efficiently are more impacted.
Delays keeping parallel workers from moving to RUNNING from the RUNNABLE queue for even one parallel worker can have a disastrous effect on query performance. There are many possible scenarios in which this can happen.

Thanks for reading!

HASHBYTES Scalability

Posted on May 31, 2018August 28, 2018 by Joe Obbish

This blog post explores the scalability of the HASHBYTES function in SQL Server 2017 CU7.

Test Query

Start by putting 11 million rows into a heap. I don’t think that the value for the ID column particularly matters.

DROP TABLE IF EXISTS HB_DEMO;

CREATE TABLE HB_DEMO (ID VARCHAR(10));

INSERT INTO HB_DEMO WITH (TABLOCK)
SELECT TOP (11000000) '0'
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

The test query that I’ll be running is the following:

SELECT MAX(HASHBYTES('SHA2_256', R1.ID))
FROM HB_DEMO R1
OPTION (MAXDOP 1);

The purpose of the MAX aggregate is to limit the size of the result set. This is a cheap aggregate because it can be implemented as a stream aggregate. The operator can simply keep the maximum value that it’s found so far, compare the next value to the max, and update the maximum value when necessary. On my test server, the query takes about 20 seconds. If I run the query without the HASHBYTES call it takes about 3 seconds. That matches intuitively what I would expect. Reading 11 million rows from a small table out of the buffer pool should be less expensive than calculating 11 million hashes.

From my naive point of view, I would expect this query to scale well as the number of concurrent queries increases. It doesn’t seem like there should be contention over any shared resources, so as long as every query gets on its own scheduler I wouldn’t expect a large degradation in overall run time as the number of queries increases.

A Trillion Spins

Putting that theory to the test, I kicked off 96 concurrent queries using SQLCMD. The server has 96 CPUs and I verified that each session went to its own scheduler. The overall run time was about 8 minutes instead of the expected 20 seconds. That’s a 24X performance degradation at scale. Both resource monitor and SQL Server claim that CPU is used at a very high rate. There are effectively no latch waits. The only notable wait is 4684777 ms of MEMORY_ALLOCATION_EXT, which seems like a lot, but it only works out to 10% of the workload time because I’m running 96 concurrent queries. As is, I can only account for 15% of the workload time through waits and expected CPU utilization of the queries. However, spinlocks have quite a story to tell:

a31_spins

That’s a trillion spins on SOS_LARGEPAGE_ALLOCATOR in about eight minutes. It’s very hard to say that a specific number of spins, backoffs, or anything else related to spinlocks is “bad”. I view this as a sign of a potential problem because I can’t account for the workload time in any other way, the number of spins is so far ahead of second place, and I’ve run a lot of extreme workloads on this server and have never seen spins or backoffs approach levels like this in such a short period of time.

You may be wondering due to the LARGEPAGE part of the name if I have trace flag 834 enabled. I do not. SQL Server can use large pages for some internal structures even without TF 834. It just can’t use them for the buffer pool. This only works when LPIM is enabled, so a natural troubleshooting step is to disable LPIM.

A More Conventional Approach

With the conventional memory mode, the SOS_LARGEPAGE_ALLOCATOR spinlock disappears. This might seem like a good thing, but the workload still has the same scalability problems as before. The overall run still takes about eight minutes. MEMORY_ALLOCATION_EXT is by far the most prevalent wait:

a31_waits_96

However, wait times only account for about a third of overall workload time. This does seem like an important clue, but I’m unable to source the full workload time to anything visible within SQL Server.

Sometimes SQL Server Doesn’t Tell the Whole Truth

This feels like a good use case for ETW tracing in windows. Resource monitor suggests that SQL Server is using a lot of CPU time, but I don’t know how to account for that CPU time within SQL Server. PerfView can be used to analyze call stacks and to see how much CPU time is spent on different parts of the code. This blog post isn’t a tutorial on how to use PerfView with SQL Server. I’m very inexperienced with the program and can barely manage to get results. This might be because I refused to watch the tutorial video.

I wanted the first test to be a baseline. I gathered data over a handful of seconds while running just a single query in SQL Server. To be clear, this testing was done with the conventional memory manager. Here is a screenshot of the CPU stacks:

a31_perfview_1

If I’m interpreting the data correctly, 29% of CPU time was spent on bcrypt (I assume this is a windows assembly that HASHBYTES calls) and 49% of the time was spent in some way getting the hash value. That’s lower than I expected based on earlier testing results, but I am tracing a 96 core server and some of those other 95 cores will be doing things even if I’m not running user queries on them.

The second test was with 96 concurrent queries. I again gathered data over a handful of seconds, but didn’t take care to sample the exact same number of seconds. Here is a screenshot of the CPU stacks for the 96 query test:

a31_perfview_96

95% of the work done the server involves the HASHBYTES calls. That makes sense, but within those calls there’s a large amount of time spent allocating and deallocating pages. That wasn’t expected, especially because none of my queries even ask for a memory grant. However, I can finally account for the unexpected CPU time. Perhaps the MEMORY_ALLOCATION_EXT wait event is more important than I realized. It may be useful to try to look at page allocations within SQL Server.

Unreasonable Extended Events

The only way I know to track page allocations is through extended events. There’s an extended event called page_allocated with the following description: “Occurs when memory page is allocated”. I can’t imagine ever enabling this on a production server, but I’m on a development server and I don’t care about overhead. I created the event session with a scary sounding name, turned it on, ran a single query on an 11 million row table, and turned it off. That alone generated over 1.2 GB of data. The total number of events logged was 11003143, which almost perfectly matches with one allocation per row in the table. I can group by memory clerk to figure out what is doing the allocations:

a31_XE

I’m able to find that clerk in one of the memory DMVs:

SELECT *
FROM sys.dm_os_memory_clerks c
where c.type = 'MEMORYCLERK_SOSNODE';

However, I’m unable to figure out how to do anything useful with those DMVs. It seems odd to me that memory clerks have one row per soft NUMA node instead of per hard NUMA node, but I can’t say anything more on that subject.

Putting everything together that we’ve seen so far, each call to the HASHBYTES function requires a call to something in the Windows assembly bcrypt. To execute the code in Windows, the MEMORYCLERK_SOSNODE memory clerk within SQL Server needs to allocate and deallocate a page. Allocating and deallocating one page per row may not be a problem when running just one query, but it can lead to contention when running lots of queries. That contention may present itself within SQL Server as the MEMORY_ALLOCATION_EXT external wait type. On servers with LPIM enabled, this could be a large page and can result in trillions of spins on the SOS_LARGEPAGE_ALLOCATOR spinlock.

It may not be a coincidence that in total the workload does over a billion hashes and the total number of wait events for MEMORY_ALLOCATION_EXT is close to that.

What About Call Stacks?

We can get call stacks through Extended Events to further validate the theory. There are two possible triggers: page allocation and an external wait on MEMORY_ALLOCATION_EXT. For both triggers I see the same pattern in the call stack:

sqldk.dll!MemoryClerkInternal::AllocatePagesWithFailureMode+0x2ec
sqldk.dll!MemoryClerkInternal::AllocatePages+0x28
sqldk.dll!TVarPageMgr<0>::PviNewVarPage+0x36
sqldk.dll!TVarPageMgr<0>::PbAllocate+0x1e2
sqldk.dll!CMemObj::Alloc+0x47
sqllang.dll!CSECHash::GetHashValue+0x307
sqllang.dll!GetHashValue+0x25
sqllang.dll!HashBytes+0xb4
sqllang.dll!BytHashBytesByt+0xe3

I interpret that to mean that SQL Server needs to allocate a page for each execution of the HASHBYTES function. It isn’t conclusive evidence but it does match everything else that’s been observed so far.

Final Thoughts

I’m not able to find a workaround for this scalability problem. It’s unfortunate that SQL Server reports full CPU utilization when it experiences this contention. Without LPIM and a complete understanding of expected run time for queries running on the system, an administrator may underestimate the importance of the MEMORY_ALLOCATION_EXT wait times. I was able to observe contention even when running just a few queries at a time, both on the test server and on my laptop. It is even more difficult to observe the problem within SQL Server when running just a few queries. I don’t understand everything that’s involved, but it’s hard not to conclude that HASHBYTES could see significantly improved scalability if it wasn’t necessary to allocate and deallocate a page for every execution of the function. Thanks for reading!

A Row Goal Riddle

Posted on March 30, 2018March 30, 2018 by Joe Obbish

I’m the kind of guy who sometimes likes to look at execution plans with different hints applied to see how the plan shape and query operator costs change. Sometimes paying attention to operator costs can provide valuable clues as to why the query optimizer selected a plan that you didn’t like. Sometimes it doesn’t. My least favorite scenario is when I see inconsistent operator costs. This blog post covers a reproduction of such a case involving the dreaded and feared row goal.

The Data

For the data in my demo tables, I wanted a query of the following form:

SELECT *
FROM dbo.OUTER_HEAP o
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.INNER_CI i
    WHERE i.ID = o.ID
);

that returned zero rows. However, I wanted the query optimizer to think that a decent number of rows would be returned. This was more difficult to do than expected. I could have just hacked stats to easily accomplish what I wanted, but did not do so out of stubbornness and pride. I eventually found a data set through trial and error with a final cardinality estimate of 16935.2 rows, or 1.7% of the rows in the outer table:

DROP TABLE IF EXISTS dbo.OUTER_HEAP;
CREATE TABLE dbo.OUTER_HEAP (
	ID VARCHAR(10)
);

INSERT INTO dbo.OUTER_HEAP WITH (TABLOCK)
SELECT TOP (1000000) 1500000
	+ ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.OUTER_HEAP (ID)
WITH FULLSCAN;

DROP TABLE IF EXISTS dbo.INNER_CI;
CREATE TABLE dbo.INNER_CI (
	ID VARCHAR(10),
	PRIMARY KEY (ID)
);

INSERT INTO dbo.INNER_CI WITH (TABLOCK)
SELECT TOP (2000000) 500000
	+ ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.INNER_CI (ID)
WITH FULLSCAN;

The actual query of interest is the following one:

SELECT TOP (1) 1
FROM dbo.OUTER_HEAP o
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.INNER_CI i
    WHERE i.ID = o.ID
);

Such a query might be used as part of an ETL process as some kind of data validation step. The most interesting case to me is when the query returns no rows, but the query optimizer (for whatever reason) thinks otherwise.

Row Goal Problems

On my machine I get the following plan for the TOP (1) query:

a29_row_goal_nl

The plan has a total cost of 0.198516 optimizer units. If you know anything about row goals (this may be a good time for the reader to visit Paul White’s series on row goals) you might not be terribly surprised by the outcome. The row goal introduced by the TOP (1) makes the nested loop join very attractive in comparison to other plan choices. The cost of the scan on OUTER_HEAP is discounted from 2.93662 units to 0.0036173 units and the optimizer only expects to do 116 clustered index seeks against INNER_CI before it finds a row that does not match. However, we know our data and we know that all rows match. Therefore, the query will not return any rows and it must scan all rows in OUTER_HEAP if I execute the query. On my machine the query takes about two seconds:

CPU time = 1953 ms, elapsed time = 1963 ms.

If we’re going to read most of the rows anyway why not try a HASH JOIN hint? At least that plan won’t have a million clustered index seeks:

a29_HJ_PLAN

The new plan runs in MAXDOP 4 on my machine (although for not very long due to CPU cooling issues) and has a total cost of 19.9924 optimizer units. Query execution finishes in significantly less time:

CPU time = 1187 ms, elapsed time = 372 ms.

The Riddle

Can we do better than an ugly join hint? Microsoft blessed us with USE HINT functionality in SQL Server 2016 SP1, so let’s go ahead and try that to see if we can improve the performance of the query. To understand the full effect of the hint let’s get estimated plans for OPTION clauses of (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'), LOOP JOIN) and (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'), HASH JOIN). In theory, that will make it easy to compare the two queries:

a29_compare_plans

Something looks very wrong here. The loop join plan has a significantly lower cost than the hash join plan! In fact, the loop join plan has a total cost of 0.0167621 optimizer units. Why would disabling row goals for such a plan cause a decrease in total query cost?

I uploaded the estimated plans here for those who wish to examine them without going through the trouble of creating tables.

Let’s Try Adding Trace Flags

First let’s try the query with just a USE HINT and no join hint. I get the hash join plan shape that I was after with a total cost of 19.9924 optimizer units. That’s definitely a good thing, but why did the optimizer pick that plan? The plan with a loop join is quite a bargain at 0.0167621 optimizer units. The optimization level is FULL, but that doesn’t mean that every possible plan choice was evaluated. Perhaps the answer is as simple as the optimizer did not consider a nested loop join plan while evaluating possible plan choices.

There are a few different ways to see which optimizer rules were considered during query plan compilation. We can add trace flags 3604 and 8619 at the query level to get information about the rules that were applied during optimization. All of these trace flags are undocumented, don’t use them in production, blah blah blah. For clarity, the query now looks like this:

SELECT TOP (1) 1
FROM dbo.OUTER_HEAP o
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.INNER_CI i
    WHERE i.ID = o.ID
) OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL')
, QUERYTRACEON 3604
, QUERYTRACEON 8619
);

Among other rules, we can see a few associated with nested loops:

Apply Rule: LeftSideJNtoIdxLookup - LOJN/LSJ/LASJ -> IDX LOOKUP
Apply Rule: LASJNtoNL - LASJN -> NL

So the optimizer did consider nested loop joins at some point but it went with the hash join plan instead. That’s interesting.

Let’s Try More Trace Flags

A logical next step is to try to get more operator costing information. Perhaps the cost of the nested loop join plan when considered during optimization is different from the value displayed in the query plan in SSMS. As the optimizer applies different rules and does different transformations the overall cost can change, so I suppose this isn’t a totally unreasonable theory. My first attempt at getting cost information for multiple plan options is by looking at the final memo for the query. This can be done with trace flag 8615.

For clarity, the query now looks like this:

SELECT TOP (1) 1
FROM dbo.OUTER_HEAP o
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.INNER_CI i
    WHERE i.ID = o.ID
) OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL')
, QUERYTRACEON 3604
, QUERYTRACEON 8615
-- , QUERYTRACEON 8619
);

Group 8 is the only one with any costing information for joins:

Group 8: Card=16935.2 (Max=1.1e+006, Min=0)
   11 PhyOp_HashJoinx_jtLeftAntiSemiJoin 6.9 7.10 5.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 26.5569 (Distance = 1)
   4 PhyOp_HashJoinx_jtLeftAntiSemiJoin 6.4 7.3 5.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 38.4812 (Distance = 1)
   1 LogOp_RightAntiSemiJoin 7 6 5 (Distance = 1)
   0 LogOp_LeftAntiSemiJoin 6 7 5 (Distance = 0)

This is rather disappointing. There’s only costing information for a few hash join plans. We know that other join types were considered. Perhaps they were discarded from the memo. The final memo just doesn’t have enough information to answer our question.

We Must Not Be Using Enough Trace Flags

There’s a trace flag that adds memo arguments to trace flag 8619: 8620. Perhaps that will give us additional clues. For clarity, the query now looks like this:

SELECT TOP (1) 1
FROM dbo.OUTER_HEAP o
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.INNER_CI i
    WHERE i.ID = o.ID
) OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL')
, QUERYTRACEON 3604
, QUERYTRACEON 8615
, QUERYTRACEON 8619
, QUERYTRACEON 8620
);

The output is again disappointing. I’ll skip covering it and jump to trace flag 8621. That one adds additional information about the rules and how they interact with memos. With this trace flag we find more information about the plans with merge or loop joins:

Rule Result: group=8 1 LogOp_RightAntiSemiJoin 7 6 5 (Distance = 1)
Rule Result: group=8 2 PhyOp_MergeJoin 1-M x_jtRightAntiSemiJoin 7 6 5 (Distance = 2)
Rule Result: group=8 3 PhyOp_HashJoinx_jtRightAntiSemiJoin 7 6 5 (Distance = 2)
Rule Result: group=8 4 PhyOp_HashJoinx_jtLeftAntiSemiJoin 6 7 5 (Distance = 1)
Rule Result: group=8 5 PhyOp_Applyx_jtLeftAntiSemiJoin 6 16 (Distance = 1)
Rule Result: group=8 6 PhyOp_Applyx_jtLeftAntiSemiJoin 6 16 (Distance = 1)
Rule Result: group=8 7 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 17 5 (Distance = 1)
Rule Result: group=8 8 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 7 5 (Distance = 1)
Rule Result: group=8 9 PhyOp_MergeJoin 1-M x_jtRightAntiSemiJoin 7 6 5 (Distance = 2)
Rule Result: group=8 10 PhyOp_HashJoinx_jtRightAntiSemiJoin 7 6 5 (Distance = 2)
Rule Result: group=8 11 PhyOp_HashJoinx_jtLeftAntiSemiJoin 6 7 5 (Distance = 1)
Rule Result: group=8 12 PhyOp_Applyx_jtLeftAntiSemiJoin 6 16 (Distance = 1)
Rule Result: group=8 13 PhyOp_Applyx_jtLeftAntiSemiJoin 6 16 (Distance = 1)
Rule Result: group=8 14 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 17 5 (Distance = 1)
Rule Result: group=8 15 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 17 5 (Distance = 1)
Rule Result: group=8 16 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 7 5 (Distance = 1)
Rule Result: group=8 17 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 7 5 (Distance = 1)

However, group 8 in the final memo still looks the same as before:

Group 8: Card=16935.2 (Max=1.1e+006, Min=0)
   11 PhyOp_HashJoinx_jtLeftAntiSemiJoin 6.9 7.10 5.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 26.5569 (Distance = 1)
   4 PhyOp_HashJoinx_jtLeftAntiSemiJoin 6.4 7.3 5.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 38.4812 (Distance = 1)
   1 LogOp_RightAntiSemiJoin 7 6 5 (Distance = 1)
   0 LogOp_LeftAntiSemiJoin 6 7 5 (Distance = 0)

My interpretation is that costing information for the other join types was discarded from the memo. For example, at one point there was an item 5 in group 8 which contained information relevant to costing for one of the nested loop join plans. All of that information is not present in the final plan because the hash join plan was cheaper.

Pray to the Trace Flag Gods

There is an undisclosed trace flag which does not allow items to be discarded from the memo. Obviously this is a dangerous thing to do. However, with that trace flag group 8 finally reveals all of its secrets:

Group 8: Card=16935.2 (Max=1.1e+006, Min=0)
   17 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 7 5.0 (Distance = 1)
   16 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 7 5.0 (Distance = 1)
   15 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 17 5.0 (Distance = 1)
   14 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 17 5.0 (Distance = 1)
   13 PhyOp_Applyx_jtLeftAntiSemiJoin 6 16 (Distance = 1)
   12 PhyOp_Applyx_jtLeftAntiSemiJoin 6 16 (Distance = 1)
   11 PhyOp_HashJoinx_jtLeftAntiSemiJoin 6.9 7.10 5.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 26.5569 (Distance = 1)
   10 PhyOp_HashJoinx_jtRightAntiSemiJoin 7.8 6 5 (Distance = 2)
   9 PhyOp_MergeJoin 1-M x_jtRightAntiSemiJoin 7.6 6 5 (Distance = 2)
   8 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 7 5.0 (Distance = 1)
   7 PhyOp_LoopsJoinx_jtLeftAntiSemiJoin 6 17 5.0 (Distance = 1)
   6 PhyOp_Applyx_jtLeftAntiSemiJoin 6 16 (Distance = 1)
   5 PhyOp_Applyx_jtLeftAntiSemiJoin 6 16 (Distance = 1)
   4 PhyOp_HashJoinx_jtLeftAntiSemiJoin 6.4 7.3 5.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 38.4812 (Distance = 1)
   3 PhyOp_HashJoinx_jtRightAntiSemiJoin 7.3 6.4 5.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 48.5597 (Distance = 2)
   2 PhyOp_MergeJoin 1-M x_jtRightAntiSemiJoin 7.1 6.2 5.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 106.564 (Distance = 2)
   1 LogOp_RightAntiSemiJoin 7 6 5 (Distance = 1)
   0 LogOp_LeftAntiSemiJoin 6 7 5 (Distance = 0)

Let’s consider item 5 because it’s one of the more reasonable loop join plan options. Item 5 doesn’t give us direct costing information but it directs us to look to group 16:

Group 16: Card=1 (Max=1, Min=0)
   2 PhyOp_Range 1 ASC  5.0  Cost(RowGoal 0,ReW 0,ReB 999999,Dist 1e+006,Total 1e+006)= 165.607 (Distance = 2)
   1 PhyOp_Range 1 ASC  5.0  Cost(RowGoal 0,ReW 0,ReB 999999,Dist 1e+006,Total 1e+006 s)= 165.607 (Distance = 2)
   0 LogOp_SelectIdx 15 5 (Distance = 1)

We can see a cost of 165.607 for the clustered index seeks on the inner side of the join. If that was the cost used when comparing plans then it explains why the optimizer opted for a hash join over the loop join. We might try looking at a query plan for the following:

DECLARE @TOP INT = 1;

SELECT TOP (@TOP) 1
FROM dbo.OUTER_HEAP o
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.INNER_CI i
    WHERE i.ID = o.ID
)
OPTION (OPTIMIZE FOR (@TOP = 987654321)
, LOOP JOIN
, MAXDOP 1);

With the above syntax we will get the same query results as before but the effective TOP (1) cannot change the cost of the query. Here are the operator details for the clustered index seek on INNER_CI:

a29_seek_costs

It’s an exact match. The total cost of the query is 172.727 optimizer units, which is significantly more than the price tag of 19.9924 units for the hash join plan.

Improving Displayed Costing

So is there really a problem with the displayed cost of the plan with hints matching (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'), LOOP JOIN)? After all, the USE HINT seemed to give the correct plan shape without the join hints. Let’s compare the TOP plan side by side with it:

a29_compare_nl_plans

I also uploaded the estimated plans here.

In my view, the costs of the USE HINT plan are nonsensical. The scan and the seeks have matching operator costs of 0.0167621 units. Bizarrely, this also matches the final query cost. 0.0167621 + 0.0167621 = 0.0167621. It’s totally unclear where these numbers come from, and if a TOP (1) can aggressively discount all of the operators in a plan then it seems like the DISABLE_OPTIMIZER_ROWGOAL hint will not have its intended practical impact. Certainly a TOP (1) will not discount the costs of blocking operators, so plans without them (such as loop joins) will be costed cheaper than plans with them (like hash joins).

I would prefer to see matching query costs for both subtrees. Of course, the TOP (1) can influence the costs of other parts of the plan that depend on this subtree, so the estimate needs to obey the TOP expression. I just feel that it shouldn’t affect the cost of the immediate subtree.

If you’re curious where the 0.016762 number comes from, I suspect that it has something to do with the TOP (1) capping the cost of the nested loop join. The following can be found in the final memo:

Group 11: Card=1 (Max=1, Min=0)
   1 PhyOp_Top 8.2 9.0 10.0  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 0.016762 (Distance = 1)

Group 8: Card=16935.2 (Max=1.1e+006, Min=0)
   2 PhyOp_Applyx_jtLeftAntiSemiJoin 6.4 15.2  Cost(RowGoal 0,ReW 0,ReB 0,Dist 0,Total 0)= 172.723 (Distance = 1)

In addition, the query cost slightly increases when I change the query to return the first 2 rows. You could argue that query plan costs are just numbers and there’s no reason to care about them as long as you get the right plan, but that’s not quite true. I’m sure this isn’t an exhaustive list, but query plan costs can affect at least the following:

If the query cost is too low it won’t be eligible for a parallel plan.
The number of seconds that a query will wait for a memory grant before throwing an error.
The amount of steps taken by the optimizer during query plan creation.
If the query is sent to a small query resource semaphore.
If the query is not able to run due to the query governor cost limit.

Final Thoughts

Congrats if you made it to the end. Thanks for reading!

The Return of RESERVED_MEMORY_ALLOCATION_EXT

Posted on March 13, 2018 by Joe Obbish

It is possible to see a scalability bottleneck in the form of high average wait time for the RESERVED_MEMORY_ALLOCATION_EXT wait if a highly concurrent workload is run on a server that consumes memory with batch mode operators. I believe that the severity of the bottleneck depends on unknown factors in the server’s initial memory state and the rate of memory actually used by queries to run batch mode operations. This blog post shares a reproduction of the issue along with a call to action.

The Test Query

Data prep for the test query is very simple. We throw the first ten million sequential integers into two single column BIGINT tables. Additionally, an empty CCI is created to add eligibility for batch mode as desired:

DROP TABLE IF EXISTS DUMMY_CCI;
CREATE TABLE DUMMY_CCI (
	ID INT,
	INDEX CCI CLUSTERED COLUMNSTORE
);

DROP TABLE IF EXISTS HASH_JOIN_SCALE_TEST_1;
CREATE TABLE HASH_JOIN_SCALE_TEST_1 (
	ID BIGINT NOT NULL
);

INSERT INTO HASH_JOIN_SCALE_TEST_1 WITH (TABLOCK)
SELECT TOP (10000000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS HASH_JOIN_SCALE_TEST_2;
CREATE TABLE HASH_JOIN_SCALE_TEST_2 (
ID BIGINT NOT NULL
);

INSERT INTO HASH_JOIN_SCALE_TEST_2 WITH (TABLOCK)
SELECT TOP (10000000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 1);

The SELECT query that we’re going to test with joins them together and returns a count of distinct IDs:

SELECT COUNT(DISTINCT t1.ID)
FROM dbo.HASH_JOIN_SCALE_TEST_1 t1
LEFT OUTER JOIN dbo.HASH_JOIN_SCALE_TEST_2 t2
	ON t1.ID = t2.ID
OPTION (MAXDOP 1);

This is certainly an odd thing to do, but the point of this query is to create a simple test case that uses a bit of memory to create hash tables. Without any indexes on the table we’re very likely to get a hash join. In row mode the query takes about 17 seconds on the test server and uses 1257528 KB of memory.

As far as I can tell this query is a good fit for batch mode. The join is on BIGINT columns and everything except for the rowstore scans can execute in batch mode:

a28_query_plan

Even with the hidden conversions of 20 million rows from row mode to batch mode the query finishes in about 5 seconds. It uses significantly less memory compared to the row mode query: 386112 KB. All of this seems very reasonable.

Testing at Scale

The situation becomes less reasonable once there are many concurrent queries running at once. We go from the batch mode query running in a third of the time of row mode to sometimes running slower. Testing was done with 96 concurrent queries each running queries until 576 total queries were completed. Below is a graph of OS CPU and OS privileged time CPU based on perfmon data:

a28_cpu_graphs

The first third of the graph corresponds to the batch mode workload. CPU never hits 100% and is extremely variable during the run. We also see a high amount of privileged time in the OS. Within SQL Server, we use a total of 222,400,512 KB of query memory. The total wait time for RESERVED_MEMORY_ALLOCATION_EXT can vary, but for this one it was around 11.5 million ms.

The second third of the graph is the part that resets the server memory state. As stated in the introduction, this scalability bottleneck doesn’t always happen and seems to have some kind of dependency on the server’s initial memory state. Here we do a SELECT COUNT(*) on a 1 TB table to add pressure to the rowstore buffer pool.

The final third of the graph corresponds to the row mode workload. CPU jumps to near 100% very quickly and doesn’t waver much during the run. There is very little privileged time in the OS. Within SQL Server, we use a total of 724,336,128 KB of query memory. The wait time for RESERVED_MEMORY_ALLOCATION_EXT is always significantly reduced compared to batch mode. For this run it was around 107k ms.

In conclusion, we see about 1% of memory wait time with the row mode workload compared to the batch mode workload. Usually the row mode workload finishes a little faster than the batch mode one, despite needing over 3X as much total memory and the per-query efficiency advantage of batch mode.

Speculation on the Problem

There may be a hint in the memory management architecture guide:

Starting with SQL Server 2012, SQL Server might allocate more memory than the value specified in the max server memory setting. This behavior may occur when the Total Server Memory (KB) value has already reached the Target Server Memory (KB) setting (as specified by max server memory). If there is insufficient contiguous free memory to meet the demand of multi-page memory requests (more than 8 KB) because of memory fragmentation, SQL Server can perform over-commitment instead of rejecting the memory request.
…
This behavior is typically observed during the following operations: Large Columnstore index queries.

There are observable differences in how memory is allocated for the same operator when switching from batch mode to row mode. The number of wait events for RESERVED_MEMORY_ALLOCATION_EXT is significantly higher for row mode compared to batch mode. In addition, it remains nearly constant when TF 834 is enabled:

a28_memory_waits

TF 834 appears to fully resolve the scalability issue with batch mode, just like last time. It offers a dramatic improvement in workload completion time:

a28_summary_of_timings

Perhaps some batch mode operators require contiguous free memory for their allocations but the same operation done through row mode does not have that restriction. I can’t go any further than this because I don’t know anything about OS internals or memory management, but it could explain why we see such different behavior between row mode and batch mode.

Use Your Voice

Since SQL Server 2012, Microsoft has listed TF 834 as incompatible with columnstore. There are lots of scary consequences:

A non-yielding scheduler error and associated memory dumps in the SQL Server Error log.

Columnstore queries trigger severe performance issues.

A SQL Server instance triggers access violations when you execute Columnstore queries.

You encounter the following error when you run sp_createstats

However, I’ve been told that many of the SQL Server columnstore benchmarks for the TPC-H benchmarks use TF 834. It seems to be a natural fit for columnstore, and there are some workloads (seen in the lab and reproduced in the last two blog posts) where it resolves a key scalability bottleneck which is difficult to address through other means. I have created a feedback item on UserVoice requesting that Microsoft support TF 834 with columnstore.

Final Thoughts

Please provide feedback on the UserVoice item if you are able to do so. Thanks for reading!

A Method to Find Trace Flags

Posted on January 16, 2018January 16, 2018 by Joe Obbish

Trace flags are often-hidden configuration switches within SQL Server that can be used to change advanced configuration options, enable or disable fixes, or provide additional diagnostic information. Microsoft recently (on the time scale of the product) published a list of supported trace flags. The community has found other, undocumented trace flags through various means. The best repository that I’m aware of is maintained by Konstantin Ktaranov here and published here. This blog post contains a stored procedure that can find trace flags. The code contained in this blog post is extremely dangerous and should never be run in production. Be prepared to say goodbye to any server that you run it on.

Hunting Technique

Microsoft gives us a few clues about the range of valid trace flags. All of the documented trace flags (other than the one to make trace flags globally scoped, -1) are positive integers less than 11025. Looking through the list, we can see that similar trace flags are often grouped together. However, the most important hint is that DBCC TRACEON and DBCC TRACEOFF threw an error for any trace flag that’s above some hardcoded maximum. In SQL Server 2017 CU2 the largest allowed value is 11498. What if we had a stored procedure which could take a query to run, turn on one trace flag at a time from 1 to 11498, save off the XML from the estimated plan, turn off the trace flag, go to the next one, and compare all of the generated XML? That would give us a way to check for changes to the XML for everything in the known range of possible trace flags.

The technique described above will not work for trace flags that can only be activated at server startup. You’ll need something like Brent’s method here for that. It also won’t work well for trace flags which need other trace flags to be enabled at the same time. Some trace flags have an effect which cannot be observed through an estimated plan, such as some server settings and aggregate pushdown for CCIs. Still, with the right test query and a little patience you can get some interesting results.

Prepare for the Hunt

The stored procedure requires a few tables to exist and be populated. There is a set of three tables that exclude trace flags in different ways:

CREATE TABLE dbo.STACK_DUMP_TFS (TF INT, PRIMARY KEY (TF));

CREATE TABLE dbo.TFS_TO_EXCLUDE (TF INT, PRIMARY KEY (TF));

CREATE TABLE dbo.KNOWN_TFS (TF INT, PRIMARY KEY (TF));

The first table, STACK_DUMP_TFS, contains trace flags that cause stack dumps which you never want to enable. I found two such trace flags during my testing, but there could be others depending on server configuration and the query that you’re testing. The TFS_TO_EXCLUDE table contains trace flags that you don’t want to enable for one reason or another. Maybe you want to enable a trace flag throughout the lifetime of a test or you’ve found a previously unknown trace flag and you don’t want to continue to show up your test results. The KNOWN_TFS table contains all trace flags found in Konstantin’s trace flag guide. Of course, you can populate the tables with whatever you want, but I put code to populate the tables with the dataset that I use on pastebin.

Finally, the stored procedure logs what it finds to the following table:

CREATE TABLE dbo.LOG_TF_XML_COMPARE (
	TEST_NAME VARCHAR(100) NOT NULL,
	TF_SCOPE VARCHAR(10) NOT NULL,
	TF_NUMBER INTEGER NOT NULL,
	TF_IS_KNOWN INT,
	LOG_TIME DATETIME,
	QUERY_PLAN_XML XML,
	QUERY_ERROR_TEXT NVARCHAR(4000),
	PRIMARY KEY (TEST_NAME, TF_SCOPE, TF_NUMBER)
);

The structure of the table will make more sense after I go through an example later on in this post.

Set Your Bait

The stored procedure has a few parameters to define the search:

@test_name VARCHAR(100),
@query_text NVARCHAR(3900),
@tf_scope VARCHAR(10) = 'GLOBAL',
@first_TF_to_search INT = 1,
@last_TF_to_search INT = 11498,
@skip_known_TFs INT = 1

The @test_name parameter controls what is logged to the TEST_NAME column of LOG_TF_XML_COMPARE. The stored procedure deletes all existing data from the table with a matching @test_name and @tf_scope.

The @query_text parameter contains the query for which an estimated plan to be generated. Currently, the procedure only supports single statement queries, so you can’t define a variable or create a table in the query text and do something else after.

The @tf_scope parameter controls if the trace flags are enabled at the session, global, or query level with QUERYTRACEON. Allowed inputs are 'GLOBAL', 'SESSION', and 'QUERY'. The query level can only be used if @query_text contains a '{{QUERYTRACEON}}' or '{{QUERYHINT}}' placeholder. '{{QUERYTRACEON}}' should be used as a substitute for QUERYTRACEON in the query and '{{QUERYHINT}}' should be used to create the OPTION part of a query, so you’d only use '{{QUERYTRACEON}}' if you need to specify other hints at the query level.

@first_TF_to_search is the first trace flag to search with a default value of 1. Trace flags are always searched in ascending order.

@last_TF_to_search is the last trace flag to search with a default value of 11498. I haven’t done any testing on lower versions of SQL Server, so it’s possible that you’ll see errors when trying trace flags with a higher value on some product versions.

If @skip_known_TFs is set to 0 then trace flags in the KNOWN_TFS table will be skipped during the stored procedure run. Trace flags in the STACK_DUMP_TFS and TFS_TO_EXCLUDE tables are always skipped.

Begin the Hunt

The procedure enables a trace flag, generates a cached plan, does some cleanup on the plan, and saves it into the logging table if it’s different from the plan without any trace flags. The full code of the procedure is below:

-- THIS CODE DOES BAD THINGS!!!!
CREATE OR ALTER PROCEDURE [dbo].[FIND_TRACE_FLAGS] (
@test_name VARCHAR(100),
@query_text NVARCHAR(3900),
@tf_scope VARCHAR(10) = 'GLOBAL',
@first_TF_to_search INT = 1,
@last_TF_to_search INT = 11498,
@skip_known_TFs INT = 1
)
AS
BEGIN
DECLARE
	@plan_handle VARBINARY(64),
	@plan_xml XML,
	@query_error nvarchar(4000),
	@TF INT,
	@trace_sql VARCHAR(1000),
	@standard_plan_xml_as_string NVARCHAR(MAX),
	@TF_is_known INT,
	@query_text_to_run_w_placeholders NVARCHAR(4000),
	@query_text_to_run NVARCHAR(4000)

;
SET NOCOUNT ON;

IF @tf_scope NOT IN ('GLOBAL', 'SESSION', 'QUERY')
BEGIN
	THROW 50001, 'Fix @tf_scope', 1;
	RETURN;
END;

IF @tf_scope = 'QUERY' AND @query_text NOT LIKE '%{{QUERYTRACEON}}%' AND @query_text NOT LIKE '%{{QUERYHINT}}%'
BEGIN
	THROW 50001, '@query_text needs {{QUERYTRACEON}} or {{QUERYHINT}} placeholder', 1;
	RETURN;
END;

IF @first_TF_to_search NOT BETWEEN 1 AND 11498 -- max that doesn't error out in DBCC traceon is 11498
BEGIN
	THROW 50001, 'Fix @first_TF_to_search', 1;
	RETURN;
END;

IF @last_TF_to_search NOT BETWEEN 1 AND 11498 -- max that doesn't error out in DBCC traceon is 11498
BEGIN
	THROW 50001, 'Fix @@last_TF_to_search', 1;
	RETURN;
END;

IF @tf_scope = 'QUERY' AND @query_text NOT LIKE '%{{QUERYTRACEON}}%' AND @query_text NOT LIKE '%{{QUERYHINT}}%'
BEGIN
	THROW 50001, '@query_text needs {{QUERYTRACEON}} or {{QUERYHINT}} placeholder', 1;
END;

DBCC TRACEON(8757) with NO_INFOMSGS; -- disable trivial plans

DELETE FROM dbo.LOG_TF_XML_COMPARE WITH (TABLOCK)
WHERE TEST_NAME = @test_name AND TF_SCOPE = @tf_scope
OPTION (RECOMPILE);

DBCC FREEPROCCACHE with NO_INFOMSGS;

SET @query_text_to_run_w_placeholders = N'SET NOEXEC ON; /* FIND_ME */' + @query_text;

IF @tf_scope <> 'QUERY'
BEGIN
	SET @query_text_to_run = @query_text_to_run_w_placeholders
END
ELSE
BEGIN
	SET @query_text_to_run = REPLACE(REPLACE(@query_text_to_run_w_placeholders, N'{{QUERYTRACEON}}', N''), N'{{QUERYHINT}}', N'');
END;

BEGIN TRY
	EXEC (@query_text_to_run);
END TRY
BEGIN CATCH
	SET @query_error = ERROR_MESSAGE();
END CATCH;

IF @query_error IS NOT NULL
BEGIN
	THROW 50001, @query_error, 1;
	RETURN;
END;

SELECT /* HIDE_ME */ @plan_handle = ecp.plan_handle
, @plan_xml = eqp.query_plan
FROM sys.dm_exec_cached_plans ecp
CROSS APPLY sys.dm_exec_sql_text(ecp.plan_handle) est
CROSS APPLY sys.dm_exec_query_plan (ecp.plan_handle) eqp
WHERE est.text LIKE '%/* FIND_ME */%'
AND est.text NOT LIKE '%/* HIDE_ME */%';

DBCC FREEPROCCACHE (@plan_handle) WITH NO_INFOMSGS;

SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple[1]
');

SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/@StatementCompId
');			

SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/@CompileTime
');

SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/@CompileCPU
');

SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/@CompileMemory
');

SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/TraceFlags
');

SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/OptimizerHardwareDependentProperties/@MaxCompileMemory
');

IF @tf_scope = 'QUERY' 	-- only wipe out text and other stuff when QUERYTRACEON is used
BEGIN
	SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
	delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/@StatementText
	');

	SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
	delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/@QueryHash
	');

	SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
	delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/@QueryPlanHash
	');
END;

INSERT INTO dbo.LOG_TF_XML_COMPARE VALUES (@test_name, @tf_scope, 0, 0, GETDATE(), @plan_xml, @query_error);

SET @standard_plan_xml_as_string = CAST(@plan_xml AS NVARCHAR(MAX));

SET @TF = @first_TF_to_search;
WHILE @TF <= @last_TF_to_search
BEGIN
	SET @query_error = NULL;
	SET @plan_handle = NULL;
	SET @plan_xml = NULL;
	SET @TF_is_known = 0;

	IF @TF = 8757 OR EXISTS (SELECT 1 FROM dbo.STACK_DUMP_TFS WHERE TF = @TF) OR EXISTS (SELECT 1 FROM dbo.TFS_TO_EXCLUDE WHERE TF = @TF)
	BEGIN
		SET @TF = @TF + 1;
		CONTINUE;
	END;

	IF EXISTS (SELECT 1 FROM dbo.KNOWN_TFS WHERE TF = @TF)
	BEGIN
		SET @TF_is_known = 1;
	END;

	IF @TF_is_known = 1 AND @skip_known_TFs = 1
	BEGIN
		SET @TF = @TF + 1;
		CONTINUE;
	END;

	-- set trace flag at right level
	IF @tf_scope = 'GLOBAL'
	BEGIN
		SET @trace_sql = 'DBCC TRACEON(' + CAST(@TF AS VARCHAR(5)) + ', -1) with NO_INFOMSGS';
		EXEC (@trace_sql);
	END
	ELSE IF @tf_scope = 'SESSION'
	BEGIN
		SET @trace_sql = 'DBCC TRACEON(' + CAST(@TF AS VARCHAR(5)) + ') with NO_INFOMSGS';
		EXEC (@trace_sql);
	END
	ELSE
	BEGIN
		SET @query_text_to_run = REPLACE(REPLACE(@query_text_to_run_w_placeholders, N'{{QUERYTRACEON}}', N', QUERYTRACEON ' + CAST(@TF AS NVARCHAR(5)))
				, N'{{QUERYHINT}}', N'OPTION(QUERYTRACEON ' + CAST(@TF AS NVARCHAR(5)) + N')' );
	END;

	BEGIN TRY
		EXEC (@query_text_to_run);
	END TRY
	BEGIN CATCH
		SET @query_error = ERROR_MESSAGE();
	END CATCH;

	IF @query_error IS NULL
	BEGIN
		SELECT /* HIDE_ME */ @plan_handle = ecp.plan_handle, @plan_xml = eqp.query_plan
		FROM sys.dm_exec_cached_plans ecp
		CROSS APPLY sys.dm_exec_sql_text(ecp.plan_handle) est
		CROSS APPLY sys.dm_exec_query_plan (ecp.plan_handle) eqp
		WHERE est.text LIKE '%/* FIND_ME */%'
		AND est.text NOT LIKE '%/* HIDE_ME */%';

		IF @plan_handle IS NOT NULL
		BEGIN
			DBCC FREEPROCCACHE (@plan_handle) WITH NO_INFOMSGS;

			SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
			delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple[1]
			');

			SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
			delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/@StatementCompId
			');			

			SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
			delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/@CompileTime
			');

			SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
			delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/@CompileCPU
			');

			SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
			delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/@CompileMemory
			');

			SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
			delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/TraceFlags
			');

			SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
			delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/OptimizerHardwareDependentProperties/@MaxCompileMemory
			');

			IF @tf_scope = 'QUERY' 	-- only wipe out text and other stuff when QUERYTRACEON is used
			BEGIN
				SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
				delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/@StatementText
				');

				SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
				delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/@QueryHash
				');

				SET @plan_xml.modify('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
				delete /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/@QueryPlanHash
				');
			END;

		END;
	END;

	IF @standard_plan_xml_as_string <> CAST(@plan_xml AS NVARCHAR(MAX))
	BEGIN
		INSERT INTO dbo.LOG_TF_XML_COMPARE VALUES (@test_name, @tf_scope, @TF, @TF_is_known, GETDATE(), @plan_xml, @query_error);
	END;

	-- set trace flag at right level
	IF @tf_scope = 'GLOBAL'
	BEGIN
		SET @trace_sql = 'DBCC TRACEOFF(' + CAST(@TF AS VARCHAR(5)) + ', -1) with NO_INFOMSGS';
		EXEC (@trace_sql);
	END
	ELSE IF @tf_scope = 'SESSION'
	BEGIN
		SET @trace_sql = 'DBCC TRACEOFF(' + CAST(@TF AS VARCHAR(5)) + ') with NO_INFOMSGS';
		EXEC (@trace_sql);
	END;

	SET @TF = @TF + 1;
END;

SELECT *
FROM LOG_TF_XML_COMPARE
WHERE TEST_NAME = @test_name AND TF_SCOPE = @tf_scope
ORDER BY TEST_NAME
, TF_SCOPE
, TF_NUMBER;

DBCC TRACEOFF(8757) with NO_INFOMSGS; -- enable trivial plans

END;

There are some limitations. The current version can only handle single statement queries up to 4000 characters. The trace flag to disable trivial plans is automatically set during the run to avoid some trace flags that are missed otherwise. The run time is based on the complexity of the query plan. The XML parsing isn’t very efficient but I can run one test on average every two minutes.

Running this procedure will clean the plan cache, clear any trace flags already set, and possibly do other bad things including, but not limited to, causing stack dumps or data integrity issues.

The First Kill

Looking for trace flags related to adaptive joins is a good way to test the procedure. Dima already posted about a few undocumented trace flags here so it should be easy to verify the results. I used the same table that I created as part of this blog post. Here’s the code that I ran for the test:

EXEC [dbo].[FIND_TRACE_FLAGS]
'DEMO_FOR_BLOG',
N'SELECT *
FROM dbo.MY_FIRST_CCI o
INNER JOIN dbo.SEEK_ME i ON o.JOIN_ID = i.JOIN_ID';

After 101 seconds here are the results:

I can click on the plans and diff them to get clues as to what the listed trace flags do.

Final Thoughts

Remember, this code is extremely dangerous and should never be run in production or even in an important test environment. Feel free to use it to hunt for trace flags if you wish. My only request is that you share anything that you find with the community in some form. Thanks for reading!