Columnstore Bitmap Filters

Microsoft has introduced a few improvements to bitmap filters with batch mode. I don’t really define any terms in this post, so if you don’t have a solid grasp of the fundamentals you should consider this blog post by Paul White required reading.

Test Data

Let’s start with a few simple examples that don’t involve columnstore. All of the test queries are simple joins between a dimension and a fact table on a DATETIME column. I’ll use the same dimension table for all of them:

DROP TABLE IF EXISTS dbo.DimDim;

CREATE TABLE dbo.DimDim (
	dimDate DATETIME,
	PRIMARY KEY (dimDate)
);

INSERT INTO dbo.DimDim VALUES
('20170101'),
('20170102'),
('20170103'),
('20170104'),
('20170105'),
('20170106'),
('20170107'),
('20170108'),
('20170109'),
('20170110'),
('20171201'),
('20171202'),
('20171203'),
('20171204'),
('20171205'),
('20171206'),
('20171207'),
('20171208'),
('20171209'),
('20171210');

Twenty rows in total with 10 dates in January and 10 dates in December. Obviously this isn’t a proper dimension table, but it makes the demos a little simpler. The fact tables have about a million rows in all twelve months of 2017 for a total of about 12 million rows.

Rowstore Heaps

Let’s start with a rowstore heap for the fact table. Code to create and populate the table is below:

DROP TABLE IF EXISTS dbo.FactHeapNoPart;

CREATE TABLE dbo.FactHeapNoPart (
	factDate DATETIME,
	justTheFacts VARCHAR(100)
);

DECLARE @month INT = 1;
SET NOCOUNT ON;

WHILE @month <= 12
BEGIN
	INSERT INTO dbo.FactHeapNoPart
		WITH (TABLOCK)
	SELECT DATEADD(DAY, t.RN / 38500
	, DATEADD(MONTH, @month, '20161201'))
	, REPLICATE('FACTS', 20)
	FROM
	(
		SELECT TOP (1048576) ROW_NUMBER()
			OVER (ORDER BY (SELECT NULL)) RN
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
	) t
	OPTION (MAXDOP 1);

	SET @month = @month + 1;
END;

The first demo query is forced to run in serial:

SELECT *
FROM dbo.DimDim dd
INNER JOIN dbo.FactHeapNoPart f
	ON dd.dimDate = f.factDate
OPTION (MAXDOP 1);

With a serial query there is no visible bitmap operator as expected:

a23_rowstore_heap_serial

On my machine, the query requires 190656 logical reads and 2234 ms of CPU time to execute.

Bumping the query up to MAXDOP 2 results in a bitmap operator:

a23_rowstore_heap_parallel

The bitmap operator isn’t pushed in row due to the data type, but only 769998 rows are sent to the hash match as a result. CPU time required to execute the query falls to 1844 and logical reads stays the same. We did the same amount of IO, which seems perfectly reasonable here.

Rowstore Clustered Indexes

It’s not a very common plan type, but what happens with a hash join on the clustered key of a rowstore table? The code below creates a fact table with the same data as before but now we have a clustered index:

DROP TABLE IF EXISTS dbo.FactClustNoPart;

CREATE TABLE dbo.FactClustNoPart (
	factDate DATETIME,
	justTheFacts VARCHAR(100)
);

CREATE CLUSTERED INDEX CI ON FactClustNoPart (factDate);

DECLARE @month INT = 1;
SET NOCOUNT ON;

WHILE @month <= 12
BEGIN
	INSERT INTO dbo.FactClustNoPart
		WITH (TABLOCK)
	SELECT DATEADD(DAY, t.RN / 38500
	, DATEADD(MONTH, @month, '20161201'))
	, REPLICATE('FACTS', 20)
	FROM
	(
		SELECT TOP (1048576) ROW_NUMBER()
			OVER (ORDER BY (SELECT NULL)) RN
		FROM master..spt_values t1
		CROSS JOIN master..spt_values t2
	) t
	OPTION (MAXDOP 1);

	SET @month = @month + 1;
END;

All rows are again read from the fact table. The bitmap operator isn’t quite as effective as before. 798498 rows are sent to the join but only 769998 remain after the join.

Could SQL Server do better? Building the hash table is a blocking operator and it should be trivial to keep track of the minimum and maximum join key while building it. In theory, one could imagine a plan with a clustered index seek on the fact table instead of a clustered index scan that takes advantage of the minimum and maximum values found during the hash build phase. This optimization would result in less overall IO. Perhaps this wasn’t implemented because this isn’t a very common plan shape.

Rowstore Partitioned Heaps

Now let’s move onto to a partitioned rowstore heap for the fact table. There is one partition per month and 12 partitions end up with data. Code to create and populate the table is below:

CREATE PARTITION FUNCTION hate_this_syntax_fun
(DATETIME)
AS RANGE RIGHT
FOR VALUES (
  '20161231'
, '20170101'
, '20170201'
, '20170301'
, '20170401'
, '20170501'
, '20170601'
, '20170701'
, '20170801'
, '20170901'
, '20171001'
, '20171101'
, '20171201'
, '20180101'
);

CREATE PARTITION SCHEME hate_this_syntax_scheme
AS PARTITION hate_this_syntax_fun
ALL TO ( [PRIMARY] );

DROP TABLE IF EXISTS dbo.FactHeapPart;

CREATE TABLE dbo.FactHeapPart (
	factDate DATETIME,
	justTheFacts VARCHAR(100)
) ON hate_this_syntax_scheme (factDate);
set statistics io, time on;

DECLARE @month INT = 1;
SET NOCOUNT ON;

WHILE @month <= 12
BEGIN
	INSERT INTO dbo.FactHeapPart
		WITH (TABLOCK)
	SELECT t2.factDate, t2.justTheFacts
	FROM
	(
		SELECT CAST(
			DATEADD(DAY, CAST(t.RN / 38500 AS INT)
			, DATEADD(MONTH, @month, '20161201')
			) AS DATETIME) factDate
		, REPLICATE('FACTS', 20) justTheFacts
		FROM
		(
			SELECT TOP (1048576) ROW_NUMBER()
				OVER (ORDER BY (SELECT NULL)) RN
			FROM master..spt_values t1
			CROSS JOIN master..spt_values t2
		) t
	) t2
	--WHERE $PARTITION.hate_this_syntax_fun(t2.factDate) = @month + 2
	OPTION (MAXDOP 1);

	SET @month = @month + 1;
END;

The code above takes a bit longer to execute than I would have liked. There’s a sort before inserting data even with a filter on the partitioning function. I’m not sure why I couldn’t make the sort go away. I expect that it has something to do with data types.

Running the same query as before:

SELECT *
FROM dbo.DimDim dd
INNER JOIN dbo.FactHeapPart f
	ON dd.dimDate = f.factDate
OPTION (MAXDOP 2);

The plan looks the same as before. The bitmap sends 790371 rows to the hash join. One thing to note is that all partitions are read from the table:

a23_all_partitions

We know that based on the data in the dimension table that SQL Server only needs to read two partitions from the fact table. Could the query optimizer in theory do better than it did? Consider the fact that a partitioned table has at most 15000 partitions. All of the partition values cannot overlap and they don’t change without a DDL operation. When building the hash table the query optimizer could keep track of which partitions have at least one row in them. By the end of the hash build we’ll know exactly which partitions could contain data, so the rest of the partitions could be skipped during the probe phase.

Perhaps this isn’t implemented because it’s important for the hash build to be independent of the probe. Maybe there’s no guarantee available at the right time that the bitmap operator will be pushed all the way down to the scan as opposed to a repartition streams operator. Perhaps this isn’t a common case and the optimization isn’t worth the effort. After all, how often do you join on the partitioning column instead of filtering by it?

It’s worth noting that the theoretical optimization described above still isn’t as good as the collocated join optimization blogged about by Paul White.

Columnstore

Now let’s build a columnstore index with 1048576 rows per rowgroup:

DROP TABLE IF EXISTS dbo.FactCCINoPart;

CREATE TABLE dbo.FactCCINoPart (
	factDate DATETIME,
	justTheFacts VARCHAR(100),
	INDEX CCI CLUSTERED COLUMNSTORE
);

DECLARE @month INT = 1;
SET NOCOUNT ON;

WHILE @month <= 12
BEGIN
	INSERT INTO dbo.FactCCINoPart
		WITH (TABLOCK)
	SELECT t2.factDate, t2.justTheFacts
	FROM
	(
		SELECT CAST(
			DATEADD(DAY, CAST(t.RN / 38500 AS INT)
			, DATEADD(MONTH, @month, '20161201')
			) AS DATETIME) factDate
		, REPLICATE('FACTS', 20) justTheFacts
		FROM
		(
			SELECT TOP (1048576) ROW_NUMBER()
				OVER (ORDER BY (SELECT NULL)) RN
			FROM master..spt_values t1
			CROSS JOIN master..spt_values t2
		) t
	) t2
	OPTION (MAXDOP 1);

	SET @month = @month + 1;
END;

Let’s return to our serial join query:

SELECT *
FROM dbo.DimDim dd
INNER JOIN dbo.FactCCINoPart f
	ON dd.dimDate = f.factDate
OPTION (MAXDOP 1);

A few interesting things happen. The first is that we get rowgroup elimination even though the dates in the dimension table are spread very far apart:

Table ‘FactCCINoPart’. Segment reads 2, segment skipped 10.

The following simple query doesn’t get rowgroup elimination:

SELECT *
FROM dbo.FactCCINoPart f
WHERE f.factDate IN ('20170101', '20171231')

You can read more about that limitation here. It’s fair to say that the bitmap filter does a better job than expected with rowgroup elimination. According to extended events this is known as an expression filter bitmap. The extended event has a few undocumented properties about the bitmap:

a23_cs bitmap_xe

I watched the extended events fly by a few times but it wasn’t clear to me what was going on internally. One possible implementation would be for the hash build to compare each build row to the rowgroup low and high values to figure out which rowgroups could never possibly return matched rows. I strongly suspect that is not the implementation that Microsoft chose. Perhaps they take advantage of the small expected size of the bitmap filter to send information about all of the build rows to do elimination. I don’t know a lot about computer science, but the usual structure of bitmap would not be sufficient because you can’t use such a bitmap to make any determination about inequality comparisons. It can only tell you if an individual row can’t match.

The second interesting thing is that we get an optimized bitmap even for a serial plan:

a23_CCI_opt_bitmap

Optimized Bitmaps

At some point optimized bitmaps were limited to parallel plans. I suspect that restriction was relaxed with the availability of batch mode in a plan but I don’t know for sure. The demos below show optimized bitmaps both improving and degrading performance. The script below creates three tables and takes about three minutes to run on my machine:

DROP TABLE IF EXISTS #t;

SELECT TOP (3000000) ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL)) RN into #t
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

-- rowstore dim with an index
-- that includes DimForeignKey
DROP TABLE IF EXISTS dbo.RSJoinIndex;
CREATE TABLE dbo.RSJoinIndex (
	IrrelevantKey BIGINT,
	SelectKey BIGINT,
	DimForeignKey BIGINT,
	FilterIndexed INT,
	PageFiller VARCHAR(3000),
	PRIMARY KEY (IrrelevantKey)
);

INSERT INTO dbo.RSJoinIndex WITH (TABLOCK)
SELECT t.RN, 1, t.RN, 1, REPLICATE('Z', 3000)
FROM #t t
OPTION (MAXDOP 1);

CREATE INDEX IX_RSJoinIndex ON dbo.RSJoinIndex
	(FilterIndexed) INCLUDE (DimForeignKey);
CREATE STATISTICS S1 ON dbo.RSJoinIndex (DimForeignKey)
	WITH FULLSCAN;

-- rowstore dim with an index
-- that does not include DimForeignKey
DROP TABLE IF EXISTS dbo.RSNoJoinIndex;
CREATE TABLE dbo.RSNoJoinIndex (
	IrrelevantKey BIGINT,
	SelectKey BIGINT,
	DimForeignKey BIGINT,
	FilterIndexed INT,
	PageFiller VARCHAR(3000),
	PRIMARY KEY (IrrelevantKey)
);

INSERT INTO dbo.RSNoJoinIndex WITH (TABLOCK)
SELECT t.RN, 1, t.RN, 1, REPLICATE('Z', 3000)
FROM #t t
OPTION (MAXDOP 1);

CREATE INDEX IX_RSNoJoinIndex ON dbo.RSNoJoinIndex
	(FilterIndexed);
CREATE STATISTICS S1 ON dbo.RSNoJoinIndex (DimForeignKey)
	WITH FULLSCAN;

-- CCI fact table
DROP TABLE IF EXISTS dbo.CCIFact;
CREATE TABLE dbo.CCIFact (
	DimForeignKey BIGINT,
	FilterCol BIGINT,
	INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCIFact WITH (TABLOCK)
SELECT t.RN , 1 + t.RN % 1000
FROM #t t
OPTION (MAXDOP 1);

CREATE STATISTICS S1 ON dbo.CCIFact (DimForeignKey)
	WITH FULLSCAN;
CREATE STATISTICS S2 ON dbo.CCIFact (FilterCol)
	WITH FULLSCAN;

All tables have three million rows and the join between them results in 3000 rows. RSJoinIndex is a rowstore table with 3 million rows that contains the join column in a nonclustered index. RSNoJoinIndex is a rowstore table without the join column in a nonclustered index. CCIFact is a columnstore table with three million rows.

Optimized Bitmaps Gone Right

Consider the following query:

SELECT table0.SelectKey
FROM dbo.RSNoJoinIndex AS table0
INNER JOIN dbo.CCIFact AS table1
ON table0.DimForeignKey = table1.DimForeignKey
WHERE table0.FilterIndexed = 1
  AND table1.FilterCol = 1 -- 0.1% of the data
OPTION (MAXDOP 1);

The RSNoJoinIndex table has an index with a key column of FilterIndexed and an included column of DimForeignKey. Here’s the plan that I get:

a23_final_good_bitmap

The CCI is used as the build for the hash join. The highlighted filter is an optimized bitmap that’s applied after the index seek but before the key lookup to get the SelectKey column. The bitmap reduces the number of required key lookups from 3000000 to just 3000. The query executes and requires 21874 logical reads from the rowstore table and 437 ms of CPU time on my machine.

Overall this is a reasonable plan and an effective application of a bitmap filter.

Optimized Bitmaps Gone Wrong

Let’s query the table without the join column in the index:

SELECT table0.SelectKey
FROM dbo.RSNoJoinIndex AS table0
INNER JOIN dbo.CCIFact AS table1
ON table0.DimForeignKey = table1.DimForeignKey
WHERE table0.FilterIndexed = 1
  AND table1.FilterCol = 1 -- 0.1% of the data
OPTION (MAXDOP 1);

Here’s the plan:

a23_final_bad_bitmap

The position of the bitmap has changed so that it’s evaluated after the key lookup. That makes sense because the key lookup returns the column to be filtered against. However, the bitmap filter still reduces the estimated number of key lookups from 3000000 to 3000. This is impossible. The filter can only be applied after the key lookup, so it does not make sense for the bitmap to reduce the number of estimated executions of the key lookup.

Performance is significantly worse with the query now requiring 12199107 logical reads from the rowstore table and 13406 CPU time overall. We can see that the query did three million key lookups:

a23_3M_Keylookups

I could only get this type of plan to appear with a columnstore index somewhere present in the query. It can be triggered even with two rowstore tables as long as the query is batch mode eligible. I’ve filed a connect item for this issue. Please vote if you have a moment.

Final Thoughts

Batch mode and columnstore have brought some interesting (and as far as I can tell, undocumented) improvements to bitmap filters. Thanks for reading!

Advertisements

The Referential Integrity Operator

SQL Server 2016 introduced a new query plan operator that mostly flew under the radar: foreign key references check. It doesn’t even show up in the list of operators:

a22_missing_operator

I’m not a foreign key guy, but a new operator is interesting enough for me to poke around. It’s described as the referential integrity operator in most of the docs so I’ll also describe it that way here for the rest of the post.

What Problem Does This Operator Solve?

Let’s start with a simple parent table and a simple child table that has two columns which both reference the parent table:

DROP TABLE IF EXISTS dbo.FK_PARENT_TABLE;
CREATE TABLE dbo.FK_PARENT_TABLE (
    FKey BIGINT NOT NULL,
    PRIMARY KEY (FKey)
); 

INSERT INTO FK_PARENT_TABLE VALUES (1);

DROP TABLE IF EXISTS dbo.CHILD_TABLE_BIG;
CREATE TABLE dbo.CHILD_TABLE_BIG (
	SurrogateKey BIGINT NOT NULL,
	FKey1 BIGINT NOT NULL,
	CONSTRAINT FK_CHILD_TABLE_BIG_1 FOREIGN KEY (FKey1)
		REFERENCES FK_PARENT_TABLE (FKey),
	FKey2 BIGINT NOT NULL,
	CONSTRAINT FK_CHILD_TABLE_BIG_2 FOREIGN KEY (FKey2)
		REFERENCES FK_PARENT_TABLE (FKey),
);

What happens if we need to delete a row from the parent table? SQL Server has to check that none of the children table values match the value we want to delete. The check can be seen in the execution plan for the delete query:

a22_simple_delete_2_references

In this example, we have two unindexed columns in the child table. The query optimizer does a table scan for each referencing row. CHILD_TABLE_BIG is currently empty so this isn’t a problem.

What would happen if a parent table was referenced by hundreds of child tables, such as for a date dimension table? Deleting or updating a row in the parent table would create a query plan with at least one join per incoming foreign key reference. Creating a query plan for that statement is equivalent to creating a query plan for a query containing hundreds or even thousands of joins. That query plan could take a long time to compile or could even time out. For example, I created a simple query with 2500 joins and it still hadn’t finished compiling after 15 minutes. That’s why I assume a table is limited to 253 incoming foreign key references in SQL Server 2014.

That restriction won’t be hit often but could be pretty inconvenient to work around. The referential integrity operator introduced with compatibility level 130 raises the limit from 253 to 10000. All of the joins are collapsed into a single operator which can reduce compile time and avoid errors. From a blog post by Gjorgji Gjeorgjievski:

SQL Server 2016 introduces a new Referential Integrity Operator (under compatibility level 130) that increases the limit on the number of other tables with foreign key references to a primary or unique key of a given table (incoming references), from 253 to 10,000. The new query execution operator does the referential integrity checks in place, by comparing the modified row to the rows in the referencing tables, to verify that the modification will not break the referential integrity. This results in much lower compilation times for such plans and comparable execution times.

Many of SQL Server’s query plan operators are designed to be small in scope and reusable for lots of different purposes. For example, when STRING_AGG() was released in SQL Server 2017 it wasn’t necessary for Microsoft to create a new query plan operator. That function was implemented with a combination of existing operators. It’s unfortunate that the referential integrity operator is an “all-in-one” operator. Perhaps this was the only practical way to increase the number of incoming references.

Running Out Of Stack Space

First I’ll create tables similar to before, but CHILD_TABLE_BIG will have 253 references to FK_PARENT_TABLE. The code to create those tables is too tedious and long for this blog post, and that’s saying something. After creating the tables, the next step is to set the compatibility level to 120:

ALTER DATABASE tempdb SET COMPATIBILITY_LEVEL = 120;

Now I try to create one more incoming reference:

CREATE TABLE dbo.JUST_ONE_MORE_FK (
	SurrogateKey BIGINT NOT NULL,
	FKey254 BIGINT NOT NULL,
		CONSTRAINT FK_254 FOREIGN KEY (FKey1)
		REFERENCES FK_PARENT_TABLE (FKey)
);

It worked. So did generating a plan to delete a row from the parent table, which is a bit surprising. On SQL Server 2016 or later with compatibility level 120, it’s possible to create a query plan for a delete or update against a table with more than 253 incoming references. It’s certainly possible to get into trouble with too many incoming references. For example, I get the following error when attempting to delete from a parent table with about 2000 incoming references:

Msg 8621, Level 17, State 1, Line 2015
The query processor ran out of stack space during query optimization. Please simplify the query.

Referential Integrity Operator to the Rescue

Now I’ll reset the compatibility level to 140 and reset all of the tables. Now there are only 253 incoming references to the parent table.

ALTER DATABASE tempdb SET COMPATIBILITY_LEVEL = 140;

DELETE FROM dbo.FK_PARENT_TABLE
WHERE FKey = 1;

The query plan is a long one:

a22_simple_delete_253_references

Looks like we don’t get the new operator by default with just 253 incoming references. Let’s add one more reference:

DROP TABLE IF EXISTS dbo.JUST_ONE_MORE_FK;

CREATE TABLE dbo.JUST_ONE_MORE_FK (
	SurrogateKey BIGINT NOT NULL,
	FKey254 BIGINT NOT NULL,
		CONSTRAINT FK_254 FOREIGN KEY (FKey254)
		REFERENCES FK_PARENT_TABLE (FKey)
);

Here it is:

a22_first_time_seeing_operator

This operator is made available through the RefIntegrityMaintainer optimizer rule. If I disable that optimizer rule with 254 incoming references I get the usual error:

Msg 8622, Level 16, State 1, Line 278
Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

It is possible to force a plan without the new operator (more on that later). It just isn’t as easy as disabling the new optimizer rule.

Is it possible to get this operator with 253 or fewer incoming references? Yes:

a22_ref_operator_1_fk

The foreign key references check operator is used here with just one incoming reference. This was done via a USE HINT query hint:

DELETE FROM dbo.FK_PARENT_TABLE
WHERE FKey = 1
OPTION (USE PLAN N'<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.6" Build="14.0.3008.27" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple>
          <StatementSetOptions />
          <QueryPlan >
            <MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" />
            <OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="0" EstimatedPagesCached="0" EstimatedAvailableDegreeOfParallelism="0" MaxCompileMemory="0" />
            <RelOp AvgRowSize="0" EstimateCPU="0" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="0" LogicalOp="Foreign Key References Check" NodeId="0" Parallel="false" PhysicalOp="Foreign Key References Check" EstimatedTotalSubtreeCost="0">
              <OutputList />
              <ForeignKeyReferencesCheck>
                <RelOp AvgRowSize="0" EstimateCPU="0" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="0" LogicalOp="Delete" NodeId="1" Parallel="false" PhysicalOp="Clustered Index Delete" EstimatedTotalSubtreeCost="0">
                  <OutputList>
                    <ColumnReference Database="" Schema="" Table="" Column="" />
                  </OutputList>
                  <SimpleUpdate DMLRequestSort="false">
                    <Object Database="[D1]" Schema="[dbo]" Table="[FK_PARENT_TABLE]" Index="[PK__FK_PAREN__A19DDDFB5081F0A5]" IndexKind="Clustered" Storage="RowStore" />
                    <SeekPredicateNew>
                      <SeekKeys>
                        <Prefix ScanType="EQ">
                          <RangeColumns>
                            <ColumnReference Database="" Schema="" Table="" Column="" />
                          </RangeColumns>
                          <RangeExpressions>
                            <ScalarOperator ScalarString="">
                              <Const ConstValue="" />
                            </ScalarOperator>
                          </RangeExpressions>
                        </Prefix>
                      </SeekKeys>
                    </SeekPredicateNew>
                  </SimpleUpdate>
                </RelOp>
                <ForeignKeyReferenceCheck>
                  <IndexScan Ordered="false">
                    <Object Database="" Schema="" Table="" IndexKind="Heap" Storage="RowStore" />
                  </IndexScan>
                </ForeignKeyReferenceCheck>
              </ForeignKeyReferencesCheck>
            </RelOp>
            <ParameterList>
              <ColumnReference Column="@1" ParameterDataType="tinyint" ParameterCompiledValue="(1)" />
            </ParameterList>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>'
);

I suspect that this plan shape can also be achieved through the query store or a plan guide but did not test that.

Striking Gold With Extended Events

I got stuck on something during my investigation so I begrudgingly went into the extended events area of SSMS. The only relevant event that I found was reason_many_foreign_keys_operator_not_used, but oh what an event it was!

a22_extended_events_gold

Is trace flag 9448 a previously unknown (to the community) trace flag? Sure looks like it:

a22_world_first_trace_flag

Not a very useful one, but I can cross “discover a trace flag” off my bucket list. I searched around a bit in sys.dm_xe_map_values but couldn’t find any other freebies.

Some of the other reasons are helpful as well (I had mistakenly thought that an alternate plan couldn’t be forced before seeing that). I have no idea what some of them mean (MERGE_OPERATION), but this blog post isn’t going to dig into all of the reasons why you might not get the operator. I’m mainly looking to understand the basics of when it can show up, what it does, and when it can go wrong.

Operator Mechanics

The estimated plan details for the operator and the XML give you some level of information, but it isn’t too clear what’s going on under the hood. Below is a theory that I formed through some testing:

For plan creation, loop through each incoming foreign key in object_id order. Look for a suitable index on that column. A suitable index is a non-filtered one with all of the foreign key columns existing as a left subset of the index key columns. If there are multiple suitable indexes, pick the one with the highest index_id.

The possible access paths I was able to get were nonclustered index seek, clustered index scan, and table scan. For query execution, process each row one at a time in the operator. Check all foreign keys in object_id order using the access method that was determined earlier. Something like the equivalent query is run for each row getting deleted or updated and for each incoming foreign key reference:

SELECT TOP 1 1
FROM CHILD_TABLE WITH (INDEX (?))
WHERE ID = ?
OPTION (MAXDOP 1);

If any query returns a row, immediately quit and throw an error. I’m not claiming that SQL Server actually runs a separate query for each check that it needs to do. In fact, I suspect that it takes some kind of shortcut using the storage engine. That could be why the referential integrity operator currently doesn’t support incoming foreign key references from columnstore indexes.

The important takeaways here are that this operator truly operates rows by row and isn’t very flexible when it comes to index selection. It’s easy to imagine a scenario in which thousands of even millions of table scans would be performed by this operator, so indexing key columns on the child tables are important for performance of this operator.

XML Structure

There’s some information in the show plan for this operator but I had trouble decoding it. There’s a partial matching indexes count field and I couldn’t figure out how to get that to be anything but zero. Filtering indexes, covering indexes, and indexes with the first column of a two column foreign key all didn’t work.

The other annoying thing is that all table accesses are described as IndexScan, even when the equivalent of an index seek will be used:

a22_annoying_show_plan

A seek won’t have a predicate specified. You can also tell apart scans and seeks by paying attention to the amount of detail in the XML.  Here’s what an index seek looks like:

<?xml version="1.0" encoding="UTF-8"?>
<ForeignKeyReferenceCheck>
   <IndexScan Ordered="false">
      <Object Database="[D1]" Schema="[dbo]" Table="[CHILD_TABLE_DIFFERENT_INDEXES]" Index="[IX_FKey2_INDEXED]" IndexKind="NonClustered" Storage="RowStore" />
   </IndexScan>
</ForeignKeyReferenceCheck>

Here’s what a clustered index scan looks like:

<?xml version="1.0" encoding="UTF-8"?>
<ForeignKeyReferenceCheck>
   <IndexScan Ordered="false">
      <Object Database="[D1]" Schema="[dbo]" Table="[CHILD_TABLE_DIFFERENT_INDEXES]" Index="[PK__CHILD_TA__A8D35536FD3AA097]" IndexKind="Clustered" Storage="RowStore" />
      <Predicate>
         <ScalarOperator ScalarString="[D1].[dbo].[CHILD_TABLE_DIFFERENT_INDEXES].[FKey1_NO_INDEX]=[D1].[dbo].[FK_PARENT_TABLE].[FKey]">
            <Compare CompareOp="EQ">
               <ScalarOperator>
                  <Identifier>
                     <ColumnReference Database="[D1]" Schema="[dbo]" Table="[CHILD_TABLE_DIFFERENT_INDEXES]" Column="FKey1_NO_INDEX" />
                  </Identifier>
               </ScalarOperator>
               <ScalarOperator>
                  <Identifier>
                     <ColumnReference Database="[D1]" Schema="[dbo]" Table="[FK_PARENT_TABLE]" Column="FKey" />
                  </Identifier>
               </ScalarOperator>
            </Compare>
         </ScalarOperator>
      </Predicate>
   </IndexScan>
</ForeignKeyReferenceCheck>

The operator details do tell you how many unmatched foreign keys that you have which is nice. But it might not be easy to immediately identify the problem column.

Compile Time and Best Case Performance

For the next set of tests I created a single parent table along with child tables of 100 columns each. The child tables had 1000 rows along with a foreign key and an index on each column. I forced the new operator to appear with fewer than 254 incoming references via a USE HINT query plan and I prevented the new operator from appearing with TF 9448 with more than 253 incoming references. For each test I deleted half of the parent table, 500 rows.

Here’s the code that never produces a plan with the referential integrity operator:

BEGIN TRANSACTION;
SET STATISTICS TIME ON;

DELETE FROM FK_PARENT_TABLE
WHERE Fkey BETWEEN 501 AND 1000
OPTION (RECOMPILE, QUERYTRACEON 9448);

SET STATISTICS TIME OFF;
ROLLBACK;

BEGIN TRANSACTION;
SET STATISTICS TIME ON;

DELETE FROM FK_PARENT_TABLE
WHERE Fkey BETWEEN 501 AND 1000
OPTION (RECOMPILE, QUERYTRACEON 9448, LOOP JOIN);

SET STATISTICS TIME OFF;
ROLLBACK;

Here’s part of the code to get the new operator:

BEGIN TRANSACTION;

DELETE FROM FK_PARENT_TABLE
WHERE Fkey BETWEEN 501 AND 1000
OPTION (RECOMPILE,
USE PLAN N'<?xml version="1.0" encoding="utf-16"?>
...
</ShowPlanXML>'
);

ROLLBACK;

Here is a chart of my results:

a22_compile_chart

Here are some graphs, because good blog posts have graphs:

a22_super_cool_chart

The referential integrity operator wins pretty handily here, although I wonder what all of the fuss around the former 253 incoming limit was about. I’m also impressed that the new operator performs so much better even when comparing to forced nested loop joins. The number of logical reads reported through SET STATISTICS IO is very similar but not identical.

Worst Case Performance

The big disadvantage of the operator (other than us not knowing what’s going on) is that table access and plan options are very limited. In addition, there seem to be no query rules or query hints which will affect the performance of the operator. I’m not saying that deleting rows from a parent table with unindexed children tables is a good practice, but you have more options in that situation without the new operator.

For this next test, I created a 100 column child table without 100 foreign keys but no indexes. The child table has 1045000 rows because I accidentally inserted too many rows into it.

Deleting 10 rows from the parent table with the new operator results in 50000 = 10 * 100 table scans, although the reporting isn’t quite correct:

Table ‘CHILD_TABLE_1’. Scan count 100, logical reads 117040000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 161204 ms, elapsed time = 163982 ms.

The table has 117040 data pages so the scan count should have been reported as 1000.

Deleting 500 rows from the parent table without the new operator results in a table scan on the inner side of a nested loop:

a22_row_goal

The cost of the scan is discounted due to the row goal. It is amusing to consider that the query optimizer assumes that a matching row will quickly be found when such a row would violate the assert for the foreign keys. Disabling the row goal gives me 100 index eager spools, which is good only for causing nightmares:

a22_nightmare

Forcing a MERGE JOIN results in parallel zones in the foreign key checking section which is interesting to see:

a22_merge_join

This query finishes faster than the query with the referential integrity operator:

Table ‘CHILD_TABLE_1’. Scan count 500, logical reads 11704000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘FK_PARENT_TABLE’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 165521 ms, elapsed time = 89318 ms.

I expect the elapsed time to be better on a machine with better hardware than mine. If I delete negative foreign keys then I get even better performance:

SQL Server Execution Times:
CPU time = 56515 ms, elapsed time = 57350 ms.

The easiest way to get a large performance difference is by creating a wide table with just one foreign key column. The referential integrity can only do scans but traditional plans have more options. Code below:

DROP TABLE IF EXISTS CHILD_TABLE_THREE_COL;
DROP TABLE IF EXISTS dbo.FK_PARENT_TABLE

CREATE TABLE FK_PARENT_TABLE (
    FKey BIGINT NOT NULL
    PRIMARY KEY (FKey)
); 

INSERT INTO FK_PARENT_TABLE
SELECT TOP (2000) -1 + ROW_NUMBER()
	OVER (ORDER BY (SELECT NULL))
FROM master..spt_values;

CREATE TABLE dbo.CHILD_TABLE_THREE_COL (
	FKey1 BIGINT NOT NULL,
		FOREIGN KEY (FKey1)
		REFERENCES FK_PARENT_TABLE (FKey),
	OtherKey BIGINT NOT NULL,
	FILLER VARCHAR(3500)
);

INSERT INTO CHILD_TABLE_THREE_COL WITH (TABLOCK)
SELECT TOP (1000 * 1000) RN / 1000
, 1
, REPLICATE('Z', 3500)
FROM
(
	SELECT ROW_NUMBER()
		OVER (ORDER BY (SELECT NULL)) RN
	FROM master..spt_values t1
	CROSS JOIN master..spt_values t2
) t;

CREATE INDEX COVERING ON CHILD_TABLE_THREE_COL (OtherKey, FKey1);

For this delete:

DELETE FROM FK_PARENT_TABLE
WHERE Fkey BETWEEN 1010 AND 1110;

Here are the stats for the refential integrity operator:

Table ‘CHILD_TABLE_THREE_COL’. Scan count 1, logical reads 50500101, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘FK_PARENT_TABLE’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 31062 ms, elapsed time = 31100 ms.

Here are the stats for a standard plan with an index spool:

Table ‘Worktable’. Scan count 101, logical reads 2955922, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘CHILD_TABLE_THREE_COL’. Scan count 1, logical reads 500001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘FK_PARENT_TABLE’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 2219 ms, elapsed time = 2223 ms.

The index spool is so effective here because the temporary object doesn’t include the FILLER column. Oddly, I wasn’t able to get the query optimizer to use a covering index, so this example will have to do.

MS Response

I contacted Microsoft to express my concerns about performance. They responded with the following:

a22_MSFT_response

Disabling the Operator

If the new operator is causing you some kind of problem you do have a few options. You can use undocumented trace flag 9448 to disable the feature, although I can’t recommend that for production. Columnstore indexes still aren’t supported with the operator in SQL Server 2017, so you could create an empty columnstore table with foreign keys to any relevant parent tables. You can also run your delete or update query from a database with a compatibility level of 120. It isn’t necessary for the database to contain the parent table or even any user tables.

The only performance problem that I ran into involved missing indexes, so adding those or temporarily disabling foreign keys as needed is probably the most production-friendly way of doing it.

Final Thoughts

Microsoft is doing some pretty sneaky things with the referential integrity operator. I object to the concept from an operator design standpoint but it seems to significantly reduce compile times and to provide surprisingly good performance in most cases. Thanks for reading!