
sql - Query optimization when joining on several inequality columns - Stack Overflow


I have a table (named #TempRule in the sample below) composed of several range min and max pairs, and a second table (#TestSet) that contains many individual values that I want to test to determine whether they fall (or don't fall) within each corresponding range in the rule table. (You can think along the lines of attributes of a person and then a table of rules to apply given those attributes.) I am looking for suggested query or algorithm changes that allow for 'quickly' determining if all the values in the TestSet table fall within all ranges for at least one row in the Rule table. Rule table row count could be in the hundreds of thousands to 10 million. TestSet could be hundreds of millions but can be chunked and processed as needed.

Looking for the fastest way to get an answer for each row in TestSet as to whether it is 'covered' or not (all values fall within each corresponding range pair) by a row in Rule. It would be nice to also get a count of the rules it is 'covered' by. The sample is SQL Server. Any suggestions? Any way to force the query to take only a single pass over each table when joining on all the inequality clauses?


--TEST DATA SET UP 'RULES'
CREATE TABLE #TempRule (
    AMin INT,
    AMax INT,
    BMin INT,
    BMax INT,
    CMin INT,
    CMax INT,
    DMin INT,
    DMax INT,
    EMin INT,
    EMax INT,
    FMin INT,
    FMax INT
);


INSERT INTO #TempRule (AMin, AMax, BMin, BMax, CMin, CMax, DMin, DMax, EMin, EMax, FMin, FMax)
SELECT
    AMin,
    CASE WHEN AMin > AMax THEN AMin ELSE AMax END AS AMax,
    BMin,
    CASE WHEN BMin > BMax THEN BMin ELSE BMax END AS BMax,
    CMin,
    CASE WHEN CMin > CMax THEN CMin ELSE CMax END AS CMax,
    DMin,
    CASE WHEN DMin > DMax THEN DMin ELSE DMax END AS DMax,
    EMin,
    CASE WHEN EMin > EMax THEN EMin ELSE EMax END AS EMax,
    FMin,
    CASE WHEN FMin > FMax THEN FMin ELSE FMax END AS FMax
FROM (
    SELECT
        FLOOR(RAND(CHECKSUM(NEWID())) * 100000) + 1 AS AMin,
        FLOOR(RAND(CHECKSUM(NEWID())) * 100000) + 1 AS AMax,
        (FLOOR(RAND(CHECKSUM(NEWID())) * 31) * 10) + 1690 AS BMin,
        (FLOOR(RAND(CHECKSUM(NEWID())) * 31) * 10) + 1690 AS BMax,
        FLOOR(RAND(CHECKSUM(NEWID())) * 51) * 1000 AS CMin,
        FLOOR(RAND(CHECKSUM(NEWID())) * 51) * 1000 AS CMax,
        FLOOR(RAND(CHECKSUM(NEWID())) * 3) + 1 AS DMin,
        FLOOR(RAND(CHECKSUM(NEWID())) * 3) + 1 AS DMax,
        FLOOR(RAND(CHECKSUM(NEWID())) * 5) + 1 AS EMin,
        FLOOR(RAND(CHECKSUM(NEWID())) * 5) + 1 AS EMax,
        FLOOR(RAND(CHECKSUM(NEWID())) * 4) + 1 AS FMin,
        FLOOR(RAND(CHECKSUM(NEWID())) * 4) + 1 AS FMax
    FROM
        (SELECT TOP 400000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM sys.objects AS o1 CROSS JOIN sys.objects AS o2 CROSS JOIN sys.objects AS o3) AS Numbers
) AS RandomValues;


-- Create an index that covers every column
CREATE CLUSTERED INDEX idx_TempRules_AllColumns ON #TempRule (AMin, AMax, BMin, BMax, CMin, CMax, DMin, DMax, EMin, EMax, FMin, FMax);


--TEST DATA SET UP 'Set of Values'
CREATE TABLE #TestSet (
    A INT,
    B INT,    
    C INT,
    D INT,
    E INT,       
    F INT
);

INSERT INTO #TestSet (A, B, C, D, E, F)
SELECT
    FLOOR(RAND(CHECKSUM(NEWID())) * 100000) + 1 AS A,
    (FLOOR(RAND(CHECKSUM(NEWID())) * 31) * 10) + 1690 AS B,
    FLOOR(RAND(CHECKSUM(NEWID())) * 51) * 1000 AS C,
    FLOOR(RAND(CHECKSUM(NEWID())) * 3) + 1 AS D,
    FLOOR(RAND(CHECKSUM(NEWID())) * 5) + 1 AS E,
    FLOOR(RAND(CHECKSUM(NEWID())) * 4) + 1 AS F
FROM
    (SELECT TOP 15000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM master.dbo.spt_values AS a CROSS JOIN master.dbo.spt_values AS b) AS Numbers;

SELECT COUNT(*) FROM #TestSet
CREATE CLUSTERED INDEX idx_TestSet ON #TestSet (A, B, C, D, E, F);

CREATE TABLE #TestSetChunk (
    A INT,
    B INT,    
    C INT,
    D INT,
    E INT,       
    F INT
);
CREATE CLUSTERED INDEX idx_TestSetChunk ON #TestSetChunk (A, B, C, D, E, F);



-- Create the results temp table
CREATE TABLE #Results (
    A INT,
    B INT,
    C INT,
    D INT,
    E INT,
    F INT
);



-- Variables to control the loop
DECLARE @BatchSize INT = 1000000;
DECLARE @Offset INT = 0;
DECLARE @TotalRows INT;
DECLARE @ProcessedRows INT = 0;

-- Get the total number of rows in #TestSet
SELECT @TotalRows = COUNT(*) FROM #TestSet;

-- Loop to process 1 million records at a time
WHILE @Offset < @TotalRows
BEGIN
    
    DELETE FROM #TestSetChunk;  
    INSERT INTO #TestSetChunk (A, B, C, D, E, F)
    SELECT A, B, C, D, E, F
    FROM (
        SELECT A, B, C, D, E, F, 
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum
        FROM #TestSet
    ) AS n
    WHERE n.RowNum BETWEEN @Offset + 1 AND @Offset + @BatchSize;
    
    
    -- This is the slow-running part that I am trying to optimize
    INSERT INTO #Results WITH (TABLOCK) (A, B, C, D, E, F)
    SELECT n.A, n.B, n.C, n.D, n.E, n.F
    FROM #TestSetChunk n WITH (INDEX(idx_TestSetChunk)) 
    WHERE NOT EXISTS (
        SELECT 1
        FROM #TempRule t WITH (INDEX(idx_TempRules_AllColumns)) 
        WHERE n.A >= t.AMin AND n.A <= t.AMax
          AND n.B >= t.BMin AND n.B <= t.BMax
          AND n.C >= t.CMin AND n.C <= t.CMax
          AND n.D >= t.DMin AND n.D <= t.DMax
          AND n.E >= t.EMin AND n.E <= t.EMax
          AND n.F >= t.FMin AND n.F <= t.FMax
    )
    ORDER BY n.A;
   

    -- Increment the offset
    SET @Offset = @Offset + @BatchSize;
    SET @ProcessedRows = @ProcessedRows + @BatchSize;

    -- Print progress
    PRINT 'Processed ' + CAST(@ProcessedRows AS VARCHAR) + ' rows out of ' + CAST(@TotalRows AS VARCHAR);
    

   END



-- Select results (Need to do more.  Just a simplification)
SELECT count(*) FROM #Results;

Here is the execution plan, which shows table scans:

       |--Table Insert(OBJECT:([tempdb].[dbo].[#Results]), SET:([tempdb].[dbo].[#Results].[A] = [tempdb].[dbo].[#TestSetChunk].[A] as [n].[A],[tempdb].[dbo].[#Results].[B] = [tempdb].[dbo].[#TestSetChunk].[B] as [n].[B],[tempdb].[dbo].[#Results].[C] = [tempdb].[dbo].[#TestSetChunk].[C] as [n].[C],[tempdb].[dbo].[#Results].[D] = [tempdb].[dbo].[#TestSetChunk].[D] as [n].[D],[tempdb].[dbo].[#Results].[E] = [tempdb].[dbo].[#TestSetChunk].[E] as [n].[E],[tempdb].[dbo].[#Results].[F] = [tempdb].[dbo].[#TestSetChunk].[F] as [n].[F]))
            |--Nested Loops(Left Anti Semi Join, WHERE:([tempdb].[dbo].[#TestSetChunk].[A] as [n].[A]>=[tempdb].[dbo].[#TempRule].[AMin] as [t].[AMin] AND [tempdb].[dbo].[#TestSetChunk].[A] as [n].[A]<=[tempdb].[dbo].[#TempRule].[AMax] as [t].[AMax] AND [tempdb].[dbo].[#TestSetChunk].[B] as [n].[B]>=[tempdb].[dbo].[#TempRule].[BMin] as [t].[BMin] AND [tempdb].[dbo].[#TestSetChunk].[B] as [n].[B]<=[tempdb].[dbo].[#TempRule].[BMax] as [t].[BMax] AND [tempdb].[dbo].[#TestSetChunk].[C] as [n].[C]>=[tempdb].[dbo].[#TempRule].[CMin] as [t].[CMin] AND [tempdb].[dbo].[#TestSetChunk].[C] as [n].[C]<=[tempdb].[dbo].[#TempRule].[CMax] as [t].[CMax] AND [tempdb].[dbo].[#TestSetChunk].[D] as [n].[D]>=[tempdb].[dbo].[#TempRule].[DMin] as [t].[DMin] AND [tempdb].[dbo].[#TestSetChunk].[D] as [n].[D]<=[tempdb].[dbo].[#TempRule].[DMax] as [t].[DMax] AND [tempdb].[dbo].[#TestSetChunk].[E] as [n].[E]>=[tempdb].[dbo].[#TempRule].[EMin] as [t].[EMin] AND [tempdb].[dbo].[#TestSetChunk].[E] as [n].[E]<=[tempdb].[dbo].[#TempRule].[EMax] as [t].[EMax] AND [tempdb].[dbo].[#TestSetChunk].[F] as [n].[F]>=[tempdb].[dbo].[#TempRule].[FMin] as [t].[FMin] AND [tempdb].[dbo].[#TestSetChunk].[F] as [n].[F]<=[tempdb].[dbo].[#TempRule].[FMax] as [t].[FMax]))
                 |--Table Scan(OBJECT:([tempdb].[dbo].[#TestSetChunk] AS [n]))
                 |--Table Scan(OBJECT:([tempdb].[dbo].[#TempRule] AS [t]))

Current timings are about 23 minutes to process 400,000 values in TestSet against about 6.3 million rows in TempRule.

Asked Feb 7 at 23:39 by BSchroeder (edited 2 days ago)
  • A question needs to request a solution in a single technology, so I have removed the big-query tag since you've provided an example for SQL Server. That said, I'm not really sure what you are asking. Does the query you show run too slow? If so, provide all table and index definitions and then the execution plan, using Paste The Plan. SQL is a declarative language, and as such the structure of the query does not correlate directly to the performance of a query. You need to choose your technology, build your query, then performance tune it. – Dale K Commented Feb 8 at 3:19
  • Offhand, I can think of no good way to optimize the filter condition. SQL Server is going to have to scan the entire test set and check the values against the rules. (An N-dimensional spatial index might help, but that technology does not exist in SQL Server.) Hopefully, the rules will at least be cached in memory to aid in efficient access. Your batching loop is a problem though, because SQL Server has no way (that I am aware of) to pick up where it left off and will need to reprocess data from the start with each iteration. ... – T N Commented Feb 8 at 3:24
  • ...If your data has a single primary key column (not shown in your sample scenario), you might be able to track the last retrieved PK value and add pk > @lastValue to your batch logic (see the sketch after these comments). Another approach might be to just retrieve all selected PK values in a single pass and save those to a temp table. After that, you can process the selected PKs in batches, retrieve the selected rows, and insert that selected data into the result table. – T N Commented Feb 8 at 3:24
  • Do you really only need to SELECT count(*) FROM #Results;, or was that just a simplification for demonstration or testing? – T N Commented Feb 8 at 3:31
  • Your OFFSET thing should be applied to the table before the WHERE parts, otherwise you will get funky results – siggemannen Commented Feb 8 at 7:27
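
A minimal sketch of the keyset-batching idea from T N's comments, assuming #TestSet gets a single-column primary key named Id and #TestSetChunk gets a matching Id column (both hypothetical; neither is in the question's schema):

-- Hypothetical keyset batching: resume after the last key processed,
-- instead of re-numbering the whole table on every iteration.
DECLARE @LastId INT = 0;
DECLARE @Rows INT = 1;

WHILE @Rows > 0
BEGIN
    DELETE FROM #TestSetChunk;

    INSERT INTO #TestSetChunk (Id, A, B, C, D, E, F)
    SELECT TOP (1000000) Id, A, B, C, D, E, F
    FROM #TestSet
    WHERE Id > @LastId      -- pick up exactly where the previous batch ended
    ORDER BY Id;

    SET @Rows = @@ROWCOUNT;

    IF @Rows > 0
    BEGIN
        SELECT @LastId = MAX(Id) FROM #TestSetChunk;
        -- ... run the NOT EXISTS insert into #Results against the chunk here ...
    END
END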

2 Answers


One potential alternate approach is to first identify all of the #TestSet rows that match one or more rules and then run a final query that selects all rows not in that list.

This works best if you add an identity column (or other single column PK) to the #TestSet table so that we can track which rows match and easily exclude them from the final result.

Using indexes, SQL Server can look up values that match a range more efficiently than it can search for ranges that match a value. For that reason, it is more efficient to search for #TestSet rows that match a #TempRule row than it is to search for #TempRule rows that match a #TestSet row. However, an index range seek can only efficiently narrow results for one range (such as AMin .. AMax). The other range conditions in #TempRule will just be checked against the retrieved data. That said, even narrowing the search by one column is an advantage.

Given the following modified #TestSet table definition:

CREATE TABLE #TestSet (
    Id INT IDENTITY(1,1)
        CONSTRAINT PK_TestSet PRIMARY KEY CLUSTERED,
    A INT,
    B INT,
    C INT,
    D INT,
    E INT,
    F INT,
    INDEX IX_TestSet_AX (A) INCLUDE (B, C, D, E, F)
);

The following will find all #TestSet rows having no matching #TempRule entry.

DECLARE @MatchedSamples TABLE(Id INT PRIMARY KEY);

-- First find the test set rows that match one or more rules
INSERT INTO @MatchedSamples
SELECT DISTINCT n.Id
FROM #TempRule t
JOIN #TestSet n
    ON n.A >= t.AMin AND n.A <= t.AMax -- Expect seek using IX_TestSet_AX
    AND n.B >= t.BMin AND n.B <= t.BMax -- The rest is checked on the fly
    AND n.C >= t.CMin AND n.C <= t.CMax
    AND n.D >= t.DMin AND n.D <= t.DMax
    AND n.E >= t.EMin AND n.E <= t.EMax
    AND n.F >= t.FMin AND n.F <= t.FMax;

-- Now select everything not in that match list
INSERT INTO #Results
SELECT n.A, n.B, n.C, n.D, n.E, n.F
FROM #TestSet n
WHERE n.Id NOT IN (SELECT m.Id FROM @MatchedSamples m);
  
SELECT count(*) FROM #Results;

For simplicity, I have eliminated all of the "chunking" logic from the code above. If really needed, perhaps that could be added back onto the INSERT INTO #Results ... step.

For the randomly generated test data, the execution was faster than the original logic, running in about 40% of the original query time. Results may be significantly different with meaningful live data.
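
The question's "nice to have" (a count of the rules each row is covered by) falls out of the same join. A minimal sketch, assuming the same Id column; rows covered by zero rules simply do not appear in the output:

-- Hypothetical variant: count how many rules cover each #TestSet row
SELECT n.Id, COUNT(*) AS CoveringRules
FROM #TempRule t
JOIN #TestSet n
    ON n.A >= t.AMin AND n.A <= t.AMax
    AND n.B >= t.BMin AND n.B <= t.BMax
    AND n.C >= t.CMin AND n.C <= t.CMax
    AND n.D >= t.DMin AND n.D <= t.DMax
    AND n.E >= t.EMin AND n.E <= t.EMax
    AND n.F >= t.FMin AND n.F <= t.FMax
GROUP BY n.Id;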

Another alternative is to separately search for matches for each of the 6 ranges (A-F) and use INTERSECT to combine the results. The upside is that a separate index can be defined to support each match. The downside is that the extra cost of running separate queries and combining the results can offset the gains from more efficient index usage.

Given the following modified #TestSet table definition:

CREATE TABLE #TestSet (
    Id INT IDENTITY(1,1)
        CONSTRAINT PK_TestSet PRIMARY KEY CLUSTERED,
    A INT
        INDEX IX_TestSet_A,
    B INT
        INDEX IX_TestSet_B,
    C INT
        INDEX IX_TestSet_C,
    D INT
        INDEX IX_TestSet_D,
    E INT
        INDEX IX_TestSet_E,
    F INT
        INDEX IX_TestSet_F
);

The following will yield the same results as before.

DECLARE @MatchedSamples TABLE(Id INT PRIMARY KEY);

-- First find the test set rows that match one or more rules
INSERT INTO @MatchedSamples
SELECT DISTINCT A.Id
FROM #TempRule t
CROSS APPLY (
    -- Expect multiple seek using IX_TestSet_A .. IX_TestSet_F
    SELECT n.Id FROM #TestSet n WHERE n.A >= t.AMin AND n.A <= t.AMax
    INTERSECT
    SELECT n.Id FROM #TestSet n WHERE n.B >= t.BMin AND n.B <= t.BMax
    INTERSECT
    SELECT n.Id FROM #TestSet n WHERE n.C >= t.CMin AND n.C <= t.CMax
    INTERSECT 
    SELECT n.Id FROM #TestSet n WHERE n.D >= t.DMin AND n.D <= t.DMax
    INTERSECT
    SELECT n.Id FROM #TestSet n WHERE n.E >= t.EMin AND n.E <= t.EMax
    INTERSECT
    SELECT n.Id FROM #TestSet n WHERE n.F >= t.FMin AND n.F <= t.FMax
) A;

-- Now select everything not in that match list
INSERT INTO #Results
SELECT n.A, n.B, n.C, n.D, n.E, n.F
FROM #TestSet n
WHERE n.Id NOT IN (SELECT m.Id FROM @MatchedSamples m);
  
SELECT count(*) FROM #Results;

Using the randomly generated test data, the above actually ran significantly slower than both the original code and the earlier version above. However, as before, live data might yield different performance results.

See this db<>fiddle for a demo.

Instead of the covering index, you could try an index for each range; if you can give each rule a unique id, then you could take the intersection of the ranges to get the matching rules. Not sure if it would be faster, but it can be done with seeks.

Here's a sqlfiddle, but I had to take a couple of zeroes off the sample size to keep it from timing out.

I tried this on a real server with 400k rows in TestSet and 6.3M in TempRule, and it took 3 minutes. It seems to do it in one pass over TestSet:

-- Assumes each rule row has a unique Id column, e.g. (hypothetical):
-- ALTER TABLE #TempRule ADD Id INT IDENTITY(1,1) NOT NULL;
create index A on #TempRule (AMin,AMax,Id);
create index B on #TempRule (BMin,BMax,Id);
create index C on #TempRule (CMin,CMax,Id);
create index D on #TempRule (DMin,DMax,Id);
create index E on #TempRule (EMin,EMax,Id);
create index F on #TempRule (FMin,FMax,Id);

select count(*) from #TestSet t
outer apply (
    select top 1 * from (
        select Id from #TempRule where t.A between AMin and AMax intersect
        select Id from #TempRule where t.B between BMin and BMax intersect
        select Id from #TempRule where t.C between CMin and CMax intersect
        select Id from #TempRule where t.D between DMin and DMax intersect
        select Id from #TempRule where t.E between EMin and EMax intersect
        select Id from #TempRule where t.F between FMin and FMax
    ) matched
) first_matched_rule
where first_matched_rule.Id is null
option (maxdop 24);