I have a table (named #Rule in the sample below) composed of several range min and max pairs and a second table (#TestSet) that contains many individual values that I want to test to determine if they fall (or don't fall) within each corresponding range in Rule. (You can think along the lines of attributes of a person and then a table of rules to apply given those attributes.) I am looking for suggested query or algorithm changes that allow for 'quickly' determining if all the values in the TestSet table fall within all ranges for at least one row in the Rule table. Rule table row count could be in the 100's of thousands to 10 million. TestSet could be 100's of millions but can be chunked and processed as needed.
Looking for the fastest way to determine, for each row in TestSet, whether it is 'covered' (all values fall within each corresponding range pair) by a row in Rule. As a nice-to-have, I would also like a count of the rules each row is covered by. The sample is SQL Server. Any suggestions? Is there any way to force the query to take only a single pass over each table when joining on all the inequality clauses?
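As an illustration of the nice-to-have (not the optimized query I am after), a naive per-row rule count would look like this, using the table names defined in the sample below:

```sql
-- Sketch only: count of covering rules per TestSet row (not performance-tuned)
SELECT n.A, n.B, n.C, n.D, n.E, n.F,
       (SELECT COUNT(*)
        FROM #TempRule t
        WHERE n.A BETWEEN t.AMin AND t.AMax
          AND n.B BETWEEN t.BMin AND t.BMax
          AND n.C BETWEEN t.CMin AND t.CMax
          AND n.D BETWEEN t.DMin AND t.DMax
          AND n.E BETWEEN t.EMin AND t.EMax
          AND n.F BETWEEN t.FMin AND t.FMax) AS CoveredByCount
FROM #TestSet n;
```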
--TEST DATA SET UP 'RULES'
CREATE TABLE #TempRule (
AMin INT,
AMax INT,
BMin INT,
BMax INT,
CMin INT,
CMax INT,
DMin INT,
DMax INT,
EMin INT,
EMax INT,
FMin INT,
FMax INT
);
INSERT INTO #TempRule (AMin, AMax, BMin, BMax, CMin, CMax, DMin, DMax, EMin, EMax, FMin, FMax)
SELECT
AMin,
CASE WHEN AMin > AMax THEN AMin ELSE AMax END AS AMax,
BMin,
CASE WHEN BMin > BMax THEN BMin ELSE BMax END AS BMax,
CMin,
CASE WHEN CMin > CMax THEN CMin ELSE CMax END AS CMax,
DMin,
CASE WHEN DMin > DMax THEN DMin ELSE DMax END AS DMax,
EMin,
CASE WHEN EMin > EMax THEN EMin ELSE EMax END AS EMax,
FMin,
CASE WHEN FMin > FMax THEN FMin ELSE FMax END AS FMax
FROM (
SELECT
FLOOR(RAND(CHECKSUM(NEWID())) * 100000) + 1 AS AMin,
FLOOR(RAND(CHECKSUM(NEWID())) * 100000) + 1 AS AMax,
(FLOOR(RAND(CHECKSUM(NEWID())) * 31) * 10) + 1690 AS BMin,
(FLOOR(RAND(CHECKSUM(NEWID())) * 31) * 10) + 1690 AS BMax,
FLOOR(RAND(CHECKSUM(NEWID())) * 51) * 1000 AS CMin,
FLOOR(RAND(CHECKSUM(NEWID())) * 51) * 1000 AS CMax,
FLOOR(RAND(CHECKSUM(NEWID())) * 3) + 1 AS DMin,
FLOOR(RAND(CHECKSUM(NEWID())) * 3) + 1 AS DMax,
FLOOR(RAND(CHECKSUM(NEWID())) * 5) + 1 AS EMin,
FLOOR(RAND(CHECKSUM(NEWID())) * 5) + 1 AS EMax,
FLOOR(RAND(CHECKSUM(NEWID())) * 4) + 1 AS FMin,
FLOOR(RAND(CHECKSUM(NEWID())) * 4) + 1 AS FMax
FROM
(SELECT TOP 400000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM sys.objects AS o1 CROSS JOIN sys.objects AS o2 CROSS JOIN sys.objects AS o3) AS Numbers
) AS RandomValues;
-- Create an index that covers every column
CREATE CLUSTERED INDEX idx_TempRules_AllColumns ON #TempRule (AMin, AMax, BMin, BMax, CMin, CMax, DMin, DMax, EMin, EMax, FMin, FMax);
--TEST DATA SET UP 'Set of Values'
CREATE TABLE #TestSet (
A INT,
B INT,
C INT,
D INT,
E INT,
F INT
);
INSERT INTO #TestSet (A, B, C, D, E, F)
SELECT
FLOOR(RAND(CHECKSUM(NEWID())) * 100000) + 1 AS A,
(FLOOR(RAND(CHECKSUM(NEWID())) * 31) * 10) + 1690 AS B,
FLOOR(RAND(CHECKSUM(NEWID())) * 51) * 1000 AS C,
FLOOR(RAND(CHECKSUM(NEWID())) * 3) + 1 AS D,
FLOOR(RAND(CHECKSUM(NEWID())) * 5) + 1 AS E,
FLOOR(RAND(CHECKSUM(NEWID())) * 4) + 1 AS F
FROM
(SELECT TOP 15000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM master.dbo.spt_values AS a CROSS JOIN master.dbo.spt_values AS b) AS Numbers;
SELECT COUNT(*) FROM #TestSet
CREATE CLUSTERED INDEX idx_TestSet ON #TestSet (A, B, C, D, E, F);
CREATE TABLE #TestSetChunk (
A INT,
B INT,
C INT,
D INT,
E INT,
F INT
);
CREATE CLUSTERED INDEX idx_TestSetChunk ON #TestSetChunk (A, B, C, D, E, F);
-- Create the results temp table
CREATE TABLE #Results (
A INT,
B INT,
C INT,
D INT,
E INT,
F INT
);
-- Variables to control the loop
DECLARE @BatchSize INT = 1000000;
DECLARE @Offset INT = 0;
DECLARE @TotalRows INT;
DECLARE @ProcessedRows INT = 0;
-- Get the total number of rows in #TestSet
SELECT @TotalRows = COUNT(*) FROM #TestSet;
-- Loop to process 1 million records at a time
WHILE @Offset < @TotalRows
BEGIN
DELETE FROM #TestSetChunk;
INSERT INTO #TestSetChunk (A, B, C, D, E, F)
SELECT A, B, C, D, E, F
FROM (
SELECT A, B, C, D, E, F,
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum
FROM #TestSet
) AS n
WHERE n.RowNum BETWEEN @Offset + 1 AND @Offset + @BatchSize;
-- THIS IS the slow-running part that I am trying to optimize
INSERT INTO #Results WITH (TABLOCK) (A, B, C, D, E, F)
SELECT n.A, n.B, n.C, n.D, n.E, n.F
FROM #TestSetChunk n WITH (INDEX(idx_TestSetChunk))
WHERE NOT EXISTS (
SELECT 1
FROM #TempRule t WITH (INDEX(idx_TempRules_AllColumns))
WHERE n.A >= t.AMin AND n.A <= t.AMax
AND n.B >= t.BMin AND n.B <= t.BMax
AND n.C >= t.CMin AND n.C <= t.CMax
AND n.D >= t.DMin AND n.D <= t.DMax
AND n.E >= t.EMin AND n.E <= t.EMax
AND n.F >= t.FMin AND n.F <= t.FMax
)
ORDER BY n.A;
-- Increment the offset
SET @Offset = @Offset + @BatchSize;
SET @ProcessedRows = @ProcessedRows + @BatchSize;
-- Print progress
PRINT 'Processed ' + CAST(@ProcessedRows AS VARCHAR) + ' rows out of ' + CAST(@TotalRows AS VARCHAR);
END
-- Select results (Need to do more. Just a simplification)
SELECT count(*) FROM #Results;
Here is the execution plan, which shows table scans:
|--Table Insert(OBJECT:([tempdb].[dbo].[#Results]), SET:([tempdb].[dbo].[#Results].[A] = [tempdb].[dbo].[#TestSetChunk].[A] as [n].[A],[tempdb].[dbo].[#Results].[B] = [tempdb].[dbo].[#TestSetChunk].[B] as [n].[B],[tempdb].[dbo].[#Results].[C] = [tempdb].[dbo].[#TestSetChunk].[C] as [n].[C],[tempdb].[dbo].[#Results].[D] = [tempdb].[dbo].[#TestSetChunk].[D] as [n].[D],[tempdb].[dbo].[#Results].[E] = [tempdb].[dbo].[#TestSetChunk].[E] as [n].[E],[tempdb].[dbo].[#Results].[F] = [tempdb].[dbo].[#TestSetChunk].[F] as [n].[F]))
|--Nested Loops(Left Anti Semi Join, WHERE:([tempdb].[dbo].[#TestSetChunk].[A] as [n].[A]>=[tempdb].[dbo].[#TempRule].[AMin] as [t].[AMin] AND [tempdb].[dbo].[#TestSetChunk].[A] as [n].[A]<=[tempdb].[dbo].[#TempRule].[AMax] as [t].[AMax] AND [tempdb].[dbo].[#TestSetChunk].[B] as [n].[B]>=[tempdb].[dbo].[#TempRule].[BMin] as [t].[BMin] AND [tempdb].[dbo].[#TestSetChunk].[B] as [n].[B]<=[tempdb].[dbo].[#TempRule].[BMax] as [t].[BMax] AND [tempdb].[dbo].[#TestSetChunk].[C] as [n].[C]>=[tempdb].[dbo].[#TempRule].[CMin] as [t].[CMin] AND [tempdb].[dbo].[#TestSetChunk].[C] as [n].[C]<=[tempdb].[dbo].[#TempRule].[CMax] as [t].[CMax] AND [tempdb].[dbo].[#TestSetChunk].[D] as [n].[D]>=[tempdb].[dbo].[#TempRule].[DMin] as [t].[DMin] AND [tempdb].[dbo].[#TestSetChunk].[D] as [n].[D]<=[tempdb].[dbo].[#TempRule].[DMax] as [t].[DMax] AND [tempdb].[dbo].[#TestSetChunk].[E] as [n].[E]>=[tempdb].[dbo].[#TempRule].[EMin] as [t].[EMin] AND [tempdb].[dbo].[#TestSetChunk].[E] as [n].[E]<=[tempdb].[dbo].[#TempRule].[EMax] as [t].[EMax] AND [tempdb].[dbo].[#TestSetChunk].[F] as [n].[F]>=[tempdb].[dbo].[#TempRule].[FMin] as [t].[FMin] AND [tempdb].[dbo].[#TestSetChunk].[F] as [n].[F]<=[tempdb].[dbo].[#TempRule].[FMax] as [t].[FMax]))
|--Table Scan(OBJECT:([tempdb].[dbo].[#TestSetChunk] AS [n]))
|--Table Scan(OBJECT:([tempdb].[dbo].[#TempRule] AS [t]))
Current timings are about 23 minutes to process 400,000 values in TestSet against about 6.3 million rows in TempRule.
asked Feb 7 at 23:39 by BSchroeder

2 Answers
One potential alternate approach is to first identify all of the #TestSet rows that match one or more rules and then run a final query that selects all rows not in that list. This works best if you add an identity column (or other single-column PK) to the #TestSet table so that we can track which rows match and easily exclude them from the final result.
Using indexes, SQL Server can look up values that match a range more efficiently than it can search for ranges that match a value. For that reason, it is more efficient to search for #TestSet rows that match a #TempRule row than it is to search for #TempRule rows that match a #TestSet row. However, an index range seek can only efficiently narrow results for one range (such as AMin .. AMax). The other range conditions in #TempRule will just be checked against the retrieved data. That said, even narrowing the search by one column is an advantage.
Given the following modified #TestSet table definition:
CREATE TABLE #TestSet (
Id INT IDENTITY(1,1)
CONSTRAINT PK_TestSet PRIMARY KEY CLUSTERED,
A INT,
B INT,
C INT,
D INT,
E INT,
F INT,
INDEX IX_TestSet_AX (A) INCLUDE (B, C, D, E, F),
);
The following will find all #TestSet rows having no matching #TempRule entry.
DECLARE @MatchedSamples TABLE(Id INT PRIMARY KEY);
-- First find the test set rows that match one or more rules
INSERT INTO @MatchedSamples
SELECT DISTINCT n.Id
FROM #TempRule t
JOIN #TestSet n
ON n.A >= t.AMin AND n.A <= t.AMax -- Expect seek using IX_TestSet_AX
AND n.B >= t.BMin AND n.B <= t.BMax -- The rest is checked on the fly
AND n.C >= t.CMin AND n.C <= t.CMax
AND n.D >= t.DMin AND n.D <= t.DMax
AND n.E >= t.EMin AND n.E <= t.EMax
AND n.F >= t.FMin AND n.F <= t.FMax;
-- Now select everything not in that match list
INSERT INTO #Results
SELECT n.A, n.B, n.C, n.D, n.E, n.F
FROM #TestSet n
WHERE n.Id NOT IN (SELECT m.Id FROM @MatchedSamples m);
SELECT count(*) FROM #Results;
For simplicity, I have eliminated all of the "chunking" logic from this code. If it is really needed, perhaps it could be added back onto the INSERT INTO #Results ... step.
For the randomly generated test data, the execution was faster than the original logic, running in about 40% of the original query time. Results may be significantly different with meaningful live data.
Another alternative is to search separately for matches on each of the 6 ranges (A-F) and use INTERSECT to combine the results. The upside is that a separate index can be defined to support each match. The downside is that the extra cost of running separate queries and combining the results can offset the gains from more efficient index usage.
Given the following modified #TestSet table definition:
CREATE TABLE #TestSet (
Id INT IDENTITY(1,1)
CONSTRAINT PK_TestSet PRIMARY KEY CLUSTERED,
A INT
INDEX IX_TestSet_A,
B INT
INDEX IX_TestSet_B,
C INT
INDEX IX_TestSet_C,
D INT
INDEX IX_TestSet_D,
E INT
INDEX IX_TestSet_E,
F INT
INDEX IX_TestSet_F,
);
The following will yield the same results as before.
DECLARE @MatchedSamples TABLE(Id INT PRIMARY KEY);
-- First find the test set rows that match one or more rules
INSERT INTO @MatchedSamples
SELECT DISTINCT A.Id
FROM #TempRule t
CROSS APPLY (
-- Expect multiple seek using IX_TestSet_A .. IX_TestSet_F
SELECT n.Id FROM #TestSet n WHERE n.A >= t.AMin AND n.A <= t.AMax
INTERSECT
SELECT n.Id FROM #TestSet n WHERE n.B >= t.BMin AND n.B <= t.BMax
INTERSECT
SELECT n.Id FROM #TestSet n WHERE n.C >= t.CMin AND n.C <= t.CMax
INTERSECT
SELECT n.Id FROM #TestSet n WHERE n.D >= t.DMin AND n.D <= t.DMax
INTERSECT
SELECT n.Id FROM #TestSet n WHERE n.E >= t.EMin AND n.E <= t.EMax
INTERSECT
SELECT n.Id FROM #TestSet n WHERE n.F >= t.FMin AND n.F <= t.FMax
) A;
-- Now select everything not in that match list
INSERT INTO #Results
SELECT n.A, n.B, n.C, n.D, n.E, n.F
FROM #TestSet n
WHERE n.Id NOT IN (SELECT m.Id FROM @MatchedSamples m);
SELECT count(*) FROM #Results;
Using the randomly generated test data, the above actually ran significantly slower than both the original code and the earlier version above. However, as before, live data might yield different performance results.
See this db<>fiddle for a demo.
Instead of the covering index, you could try an index for each range. If you can give each rule a unique Id, you can then take the intersection of the per-range matches to get the matching rules. I'm not sure if it would be faster, but it can be done with seeks.
Here's a sqlfiddle, but I had to take a couple of zeroes off the sample size for it to not time out.
I tried this on a real server with 400K rows in TestSet and 6.3M in TempRule, and it took 3 minutes. It seems to do it in one pass over TestSet:
create index A on #TempRule (AMin,AMax,Id);
create index B on #TempRule (BMin,BMax,Id);
create index C on #TempRule (CMin,CMax,Id);
create index D on #TempRule (DMin,DMax,Id);
create index E on #TempRule (EMin,EMax,Id);
create index F on #TempRule (FMin,FMax,Id);
select count(*) from #TestSet t
outer apply (select top 1 * from (
select Id from #TempRule where t.A between AMin and AMax intersect
select Id from #TempRule where t.B between BMin and BMax intersect
select Id from #TempRule where t.C between CMin and CMax intersect
select Id from #TempRule where t.D between DMin and DMax intersect
select Id from #TempRule where t.E between EMin and EMax intersect
select Id from #TempRule where t.F between FMin and FMax)_
) first_matched_rule where first_matched_rule.Id is null
option (maxdop 24)
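The same intersection can also provide the rule count the question asked about as a nice-to-have: count the intersected Ids instead of taking the top 1. A sketch in the same style (I haven't timed this variant):

```sql
-- Sketch: per-row count of covering rules via the same six index seeks
select t.*, matched.RuleCount
from #TestSet t
cross apply (
    select count(*) as RuleCount from (
        select Id from #TempRule where t.A between AMin and AMax intersect
        select Id from #TempRule where t.B between BMin and BMax intersect
        select Id from #TempRule where t.C between CMin and CMax intersect
        select Id from #TempRule where t.D between DMin and DMax intersect
        select Id from #TempRule where t.E between EMin and EMax intersect
        select Id from #TempRule where t.F between FMin and FMax
    ) m
) matched;
```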
Comments:
– T N (Feb 8 at 3:24): ... pk > @lastValue to your batch logic. Another approach might be to just retrieve all selected PK values in a single pass and save those to a temp table. After that, you can then process the selected PKs in batches, retrieve the selected rows, and insert that selected data into the result table.
– T N (Feb 8 at 3:31): ... SELECT count(*) FROM #Results;, or was that just a simplification for demonstration or testing?
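Following up on the first comment's suggestion, here is a sketch of what keyset batching (pk > @lastValue) could look like, assuming #TestSet has been given an Id IDENTITY primary key as in the first answer (untested against the full data set). Unlike the ROW_NUMBER chunking in the question, this never re-sorts the whole table and does not depend on an unstable ORDER BY (SELECT NULL):

```sql
-- Hedged sketch: keyset pagination over #TestSet by Id
-- (assumes an Id IDENTITY PRIMARY KEY column exists on #TestSet)
DECLARE @LastId INT = 0;
DECLARE @BatchSize INT = 1000000;

WHILE EXISTS (SELECT 1 FROM #TestSet WHERE Id > @LastId)
BEGIN
    DELETE FROM #TestSetChunk;

    INSERT INTO #TestSetChunk (A, B, C, D, E, F)
    SELECT TOP (@BatchSize) A, B, C, D, E, F
    FROM #TestSet
    WHERE Id > @LastId
    ORDER BY Id;

    -- Advance past this batch; safe even if the identity values have gaps.
    SELECT @LastId = MAX(Id)
    FROM (
        SELECT TOP (@BatchSize) Id
        FROM #TestSet
        WHERE Id > @LastId
        ORDER BY Id
    ) AS batch;

    -- ... run the NOT EXISTS query against #TestSetChunk here, as in the question ...
END
```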