
sql server - Optimizing SQL query with large dataset and multiple filters - Stack Overflow



I'm working on optimizing a SQL Server query that fetches filtered data from two large tables:

  • Articles (≈1 million rows for the last 3 months)
  • ArticleHistory (stores status changes, timestamps, and user changes, with a minimum of 10 rows per article)

The query needs to:

  • Fetch articles based on multiple filters (date range, status, user actions).
  • Select the latest or earliest status change date for specific statuses.
  • Join with multiple tables for additional metadata.
  • Use pagination (OFFSET ... FETCH NEXT).

Currently, the query runs too slow due to the large dataset and complex joins.

So far I have this code that is still too slow:

SELECT a.Id, a.CreatedAt, a.UpdatedAt, a.CustomsId, a.InternationalStatusId, 
      a.StatusId, a.UUID,  a.ArticleDescription
      -- i have more selected data here from other tables
FROM Articles AS a 
  -- i have left joins with other tables here
LEFT JOIN (
    SELECT 
        sh2.ArticleId, 
        sh2.StatusId, 
        CASE
            -- Special case: StatusId = 4 derived from 144, use MAX
            WHEN sh2.StatusId = 4 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 4 AND tsi.IsSpecial = 1
                ) THEN MAX(sh2.CreatedAt)

            -- Actual case: StatusId = 4, use MIN
            WHEN sh2.StatusId = 4 THEN MIN(sh2.CreatedAt)

            -- Special case: StatusId = 10 derived from 1410, use MIN
            WHEN sh2.StatusId = 10 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 10 AND tsi.IsSpecial = 1
                ) THEN MIN(sh2.CreatedAt)

            -- Actual case: StatusId = 10, use MAX
            WHEN sh2.StatusId = 10 THEN MAX(sh2.CreatedAt)

            -- Special case: StatusId = 14 derived from 114, use MIN
            WHEN sh2.StatusId = 14 AND EXISTS (
                    SELECT 1 
                    FROM #TempStatusIds tsi 
                    WHERE tsi.StatusId = 14 AND tsi.IsSpecial = 1
                ) THEN MIN(sh2.CreatedAt)

            -- Actual case: StatusId = 14, use MAX
            WHEN sh2.StatusId = 14 THEN MAX(sh2.CreatedAt)

            ELSE MAX(sh2.CreatedAt)
        END AS CreatedAt
    FROM ArticleHistory sh2
                
    -- One article can have multiple ArticleHistory rows with the same ArticleId,
    -- StatusId, and sometimes the same CreatedAt (as a date);
    -- only some specific rows can be used for filtering.
    WHERE sh2.ChangeComment LIKE '%Status%'
        AND sh2.StatusId = 3
        AND cast(sh2.CreatedAt as date) >= '2025-01-01'
        AND cast(sh2.CreatedAt as date) <= '2025-01-31'

    GROUP BY sh2.ArticleId, sh2.StatusId
) AS sh ON a.Id = sh.ArticleId

WHERE 1 = 1
    AND sh.StatusId = 3
    AND cast(sh.CreatedAt as date) >= '2025-01-01'
    AND cast(sh.CreatedAt as date) <= '2025-01-31'
ORDER BY a.CreatedAt DESC
OFFSET 0 ROWS
FETCH NEXT 20 ROWS ONLY

See it and sample data here:

https://dbfiddle.uk/wT0_5R3N

I've tried:

  • Indexing key columns (StatusId, ArticleId, StatusDate).
  • Using WITH (NOLOCK) to minimize locking impact.
  • Filtering ArticleHistory inside a LEFT JOIN to select relevant statuses and date ranges.

Despite this, performance is still not ideal. Are there best practices for optimizing queries with large history tables where filtering and aggregation (min/max dates) are required? Would partitioning or pre-aggregating status change data in a temporary table help?
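The pre-aggregation idea asked about above could be sketched roughly as follows. This is a sketch, not a tested solution: it reuses the table and column names from the query, `#StatusAgg` is a hypothetical temp table, and the date filter is rewritten as a half-open range so no function is applied to the indexed column:

```sql
-- Materialize one row per (ArticleId, StatusId) before the main query,
-- so the outer join hits a small pre-aggregated set instead of
-- re-scanning ArticleHistory.
SELECT sh2.ArticleId,
       sh2.StatusId,
       MIN(sh2.CreatedAt) AS MinCreatedAt,
       MAX(sh2.CreatedAt) AS MaxCreatedAt
INTO #StatusAgg
FROM ArticleHistory AS sh2
WHERE sh2.ChangeComment LIKE '%Status%'
  AND sh2.StatusId = 3
  AND sh2.CreatedAt >= '20250101'   -- inclusive lower bound
  AND sh2.CreatedAt <  '20250201'   -- exclusive upper bound
GROUP BY sh2.ArticleId, sh2.StatusId;

-- Cheap to build on a small result set; makes the join to Articles.Id fast.
CREATE CLUSTERED INDEX IX_StatusAgg ON #StatusAgg (ArticleId);
```

The main query would then join `#StatusAgg` instead of the derived table, and the MIN/MAX special-casing can be applied against the two pre-computed columns.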

asked Jan 30 at 15:44 by E.S; edited Feb 1 at 9:30 by Mark Rotteveel. 12 comments:
  • We need to see an actual SQL statement and a SMALL set of sample data. You can use dbfiddle.uk to help set this up if you want. Until we have these things, this question is not really answerable and likely to be put on hold. – Joel Coehoorn Commented Jan 30 at 15:47
  • Here’s a live SQL Fiddle where you can test it: dbfiddle.uk/jfMD3txJ – E.S Commented Jan 30 at 16:26
  • One thing I've figured out over the years separate from the question (so just a comment) is, when I have complicated nested queries like this, I will alternate keyword capitalization between nesting levels. So SELECT .. FROM (select ...) JOIN .. WHERE x = (select max(..) from ...) It helps me keep straight what code is at what level. – Joel Coehoorn Commented Jan 30 at 16:36
  • 1 Also, at first pass through, these conditional expressions are slowing you down: AND cast(sh.CreatedAt as date) >= '2025-01-01' AND cast(sh.CreatedAt as date) <= '2025-01-31'. You want to instead write it in a way that avoids mutating the stored data. This not only saves per-row processing work, but also helps make better use of indexes. – Joel Coehoorn Commented Jan 30 at 16:39
  • 1 ... called away. To finish previous comment, corrected code looks like this, with a half-open range using an exclusive (<) upper bound rather than inclusive (<=): AND sh.CreatedAt >= '20250101' AND sh.CreatedAt < '20250201' – Joel Coehoorn Commented Jan 30 at 16:54
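Putting the sargability advice from the comments together, the inner date filter and a supporting index could be sketched as below (the index name and column order are illustrative, not from the original post):

```sql
-- Sargable rewrite: compare the raw column against a half-open range
-- instead of CAST(CreatedAt AS date), so an index on CreatedAt can seek.
SELECT sh2.ArticleId, sh2.StatusId, MAX(sh2.CreatedAt) AS CreatedAt
FROM ArticleHistory AS sh2
WHERE sh2.ChangeComment LIKE '%Status%'
  AND sh2.StatusId = 3
  AND sh2.CreatedAt >= '20250101'   -- inclusive lower bound
  AND sh2.CreatedAt <  '20250201'   -- exclusive upper bound (covers all of Jan 31)
GROUP BY sh2.ArticleId, sh2.StatusId;

-- An index keyed for this predicate; INCLUDE covers the remaining columns
-- so the query can be answered from the index alone.
CREATE INDEX IX_ArticleHistory_Status_Created
    ON ArticleHistory (StatusId, CreatedAt)
    INCLUDE (ArticleId, ChangeComment);
```

The leading `%` in the LIKE pattern still forces a scan of the matching (StatusId, CreatedAt) range, but the range itself becomes a seek.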

1 Answer


You should have indexes on the join columns (Articles.Id and ArticleHistory.ArticleId).

Also replace

AND cast(sh.CreatedAt as date) >= '2025-01-01'
AND cast(sh.CreatedAt as date) <= '2025-01-31'

With

YEAR(sh.CreatedAt) = 2025 AND MONTH(sh.CreatedAt) = 1

If it is still slow, you could also create an index on the year and month.
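One way to make a year/month index possible in SQL Server is via persisted computed columns; the column and index names below are illustrative, and note that `YEAR()`/`MONTH()` predicates on the raw column would otherwise not be able to use an index:

```sql
-- YEAR() and MONTH() are deterministic, so the computed columns
-- can be PERSISTED and indexed.
ALTER TABLE ArticleHistory
    ADD CreatedYear  AS YEAR(CreatedAt)  PERSISTED,
        CreatedMonth AS MONTH(CreatedAt) PERSISTED;

CREATE INDEX IX_ArticleHistory_YearMonth
    ON ArticleHistory (CreatedYear, CreatedMonth, StatusId)
    INCLUDE (ArticleId, CreatedAt);
```

With this in place, a filter such as `WHERE CreatedYear = 2025 AND CreatedMonth = 1` can seek on the index; a plain half-open range on CreatedAt with an index on that column achieves the same without schema changes.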
