sql - Building a rolling sum table with changing data - Stack Overflow

I have a daily_reports table that keeps daily metrics. Its structure is something like (parent_id, object_id, report_date, metric_1, metric_2, ...). (object_id, report_date) is always unique, so there is only at most one report row per day for an object. In part of the application, it is queried this way for example:

SELECT object_id,
       SUM(metric_1)    AS metric_1,
       SUM(metric_2)    AS metric_2,
       ...
FROM daily_reports
WHERE parent_id IN (...)
  AND report_date BETWEEN :start_date AND :end_date
GROUP BY object_id
ORDER BY metric_1 DESC

The :start_date and :end_date are chosen by the user and can be arbitrary.

That query is slow, because there are too many object_ids to sum over.

I'm exploring ideas to make this quicker. One idea is to keep a "rolling sum" table for each object_id since the first day it appears in the daily_reports table.
Some things to keep in mind though:

  • Not every object_id appears every day. Sometimes some object_ids get no reports for arbitrary periods of time.
  • Sometimes new object_ids appear in the daily_reports.
  • There can be changes in the daily_reports for the last 30 days, but not before that.

With that in mind, the new rolling sum table needs to "carry forward" any object_ids that didn't appear in today's reports. Also, the rolling sums of the last 30 days need to be recalculated every day, since there can be changes.

Here is what I came up with, a stored procedure that builds a rolling sum table day by day:

DELIMITER //
CREATE OR REPLACE PROCEDURE update_cumul_reports(
    IN p_start_date DATE,
    IN p_end_date DATE,
    IN p_parent_id BIGINT
)
BEGIN
    DECLARE current_report_date DATE;
    DECLARE latest_cumul_date DATE;

    SET current_report_date = p_start_date;

    WHILE current_report_date <= p_end_date DO
        -- Find the latest date a rolling sum exists for
        SELECT MAX(report_date) INTO latest_cumul_date
        FROM cumulative_reports
        WHERE report_date < current_report_date
          AND parent_id = p_parent_id;

        INSERT INTO cumulative_reports
        (object_id, parent_id, report_date, metric_1, metric_2, metric_3, metric_4)
        SELECT
            -- object_ids that already exist in cumulative_reports, left join to carry forward object_ids that didn't get a daily_report
            cumul.object_id,
            COALESCE(raw.parent_id, cumul.parent_id),
            current_report_date AS report_date,
            COALESCE(cumul.metric_1, 0) + COALESCE(raw.metric_1, 0) AS metric_1,
            COALESCE(cumul.metric_2, 0) + COALESCE(raw.metric_2, 0) AS metric_2,
            ...
        FROM cumulative_reports AS cumul
                 LEFT JOIN daily_reports AS raw
                           ON cumul.object_id = raw.object_id
                               AND raw.report_date = current_report_date
                               AND raw.parent_id = p_parent_id
        WHERE cumul.report_date = latest_cumul_date
          AND cumul.parent_id = p_parent_id

        UNION ALL

        SELECT
            -- New object_ids that appeared in daily_reports for the first time today.
            raw.object_id,
            raw.parent_id,
            current_report_date,
            raw.metric_1,
            raw.metric_2,
            raw.metric_3,
            raw.metric_4
        FROM daily_reports AS raw
        WHERE raw.report_date = current_report_date
          AND raw.parent_id = p_parent_id
          AND NOT EXISTS (
            SELECT 1
            FROM cumulative_reports c
            WHERE c.report_date = latest_cumul_date
              AND c.parent_id = p_parent_id
              AND c.object_id = raw.object_id
        );

        -- Go to next date
        SET current_report_date = DATE_ADD(current_report_date, INTERVAL 1 DAY);
    END WHILE;
END;
//
DELIMITER ;

I have two indexes on the new table cumulative_reports: (parent_id, report_date, object_id) PRIMARY and (report_date, object_id).
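For reference, here is a minimal sketch of what the cumulative_reports definition could look like, inferred from the procedure above (column types are assumptions; only two metric columns shown):

```sql
-- Hypothetical DDL for the rolling-sum table. Note the metric columns
-- are widened relative to daily_reports: running totals accumulated
-- over years can overflow a plain int.
CREATE TABLE cumulative_reports (
    parent_id   bigint unsigned NOT NULL,
    report_date date            NOT NULL,
    object_id   bigint unsigned NOT NULL,
    metric_1    bigint          NOT NULL,
    metric_2    double          NOT NULL,
    -- ...
    PRIMARY KEY (parent_id, report_date, object_id),
    KEY (report_date, object_id)
) ENGINE = InnoDB;
```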

Querying the new table is very quick. Every object_id only needs two rows from the cumulative table: one for the start date and one for the end date, instead of summing over the entire period.
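Concretely, the range sum over [:start_date, :end_date] can be computed as the difference of two snapshots. A sketch, assuming the carry-forward logic guarantees a row per object_id on every date after it first appears:

```sql
-- Range sum = snapshot at :end_date minus snapshot on the day before
-- :start_date. The LEFT JOIN + COALESCE handles object_ids that first
-- appeared inside the range (no "before" snapshot exists for them).
SELECT e.object_id,
       e.metric_1 - COALESCE(s.metric_1, 0) AS metric_1,
       e.metric_2 - COALESCE(s.metric_2, 0) AS metric_2
FROM cumulative_reports AS e
LEFT JOIN cumulative_reports AS s
       ON s.parent_id   = e.parent_id
      AND s.object_id   = e.object_id
      AND s.report_date = :start_date - INTERVAL 1 DAY
WHERE e.parent_id IN (...)
  AND e.report_date = :end_date
ORDER BY metric_1 DESC;
```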

The issue is building the table gets slower and slower the more the table itself grows.

This alone is fast enough (takes two minutes):

CALL update_cumul_reports(
        '2025-01-01',
        '2025-01-31',
        :parent_id_with_most_objects
     );

But once the entire table since the beginning has been built, with all the old object_ids being carried forward, the same call takes 10-15 minutes. Not carrying forward the object_ids that get no daily_reports would make querying too complex and too slow.

Are there any ideas to improve this? Or a better way to implement the rolling sum table?
If any information is missing from my question, please ask and I'll add it.

Thanks in advance for your answers.

PS: I am using 10.6.18-MariaDB-log. I cannot use ColumnStore in my current setup.

Edit: some additional information as requested:

  • daily_reports already has a (parent_id, report_date) index.
  • daily_reports has a little under 2 billion rows and growing (estimate obtained with EXPLAIN SELECT * FROM daily_reports).
  • Intervals can be anything from a few months to a few years.
  • The original query, when selecting an interval of only two months and limiting it to one parent_id (the one with the most object_ids), takes 20 seconds to run. The requirement is it needs to be under 5 seconds.
  • For a two-month interval, daily_reports has a little under 2 million rows for only one parent_id, with around 100k distinct object_ids.
  • The query is rarely run with more than 10 parent_ids.

Edit 2: As requested:

# SHOW CREATE TABLE daily_reports
CREATE TABLE `daily_reports`
(
    `id`          bigint(20) unsigned NOT NULL AUTO_INCREMENT,
    `report_date` int(11) unsigned    NOT NULL, # Unfortunately
    `parent_id`   bigint(20) unsigned NOT NULL,
    `object_id`   bigint(20) unsigned NOT NULL,
    `metric_1`    int(11)             NOT NULL,
    `metric_2`    double(10, 2)       NOT NULL,
    # ...
    PRIMARY KEY (`id`),
    UNIQUE KEY `report_date_object_id_uidx` (`report_date`, `object_id`),
    KEY `object_id_idx` (`object_id`),
    KEY `parent_id_report_date` (`parent_id`, `report_date`)
) ENGINE = InnoDB
  DEFAULT CHARSET = utf8mb4
  COLLATE = utf8mb4_general_ci
  ROW_FORMAT = COMPRESSED

  • Table goes back 5-6 years only; no rows are ever deleted.

  • There are a little under 200 million distinct object_ids.

Edit 3: a small fiddle that exemplifies what I want: dbfiddle.uk/1hbEHiUV.


Asked Jan 31 at 12:45 by Zakaria; edited Feb 4 at 10:46.
  • What data volumes are involved, what time interval is usually chosen, and what indexes does the daily_reports table have? – mr mcwolf Commented Jan 31 at 13:19
  • seems likely to me your original query can be made fast enough. having a (parent_id, report_date) index should help significantly. beyond using the index, you may see improvement joining to specify the parents and dates, like from (select '2000-01-01' + interval seq day report_date from seq_0_to_36524 having report_date between :start_date AND :end_date) dates cross join (select null parent_id where 0 union all values (:parent1),(:parent2),(:parent3)...) parents join daily_reports using (parent_id, report_date) – ysth Commented Jan 31 at 13:21
  • If your users can choose arbitrary start and end dates, how does having a running sum table help you? The running sums are guaranteed to be outside the users' desired ranges at some point. – Chris Maurer Commented Jan 31 at 13:40
  • @mrmcwolf I updated my question with some details. @ysth yes I do have the index. I have a calendar_days table that has the needed days, joined it to the parent_id (only selected one) and the new query has pretty much the same performance as the original one. @ChrisMaurer the running sum table is to be updated every day. – Zakaria Commented Jan 31 at 15:15
  • 1 What is the PRIMARY KEY of daily_reports? (Better yet, please provide SHOW CREATE TABLE daily_reports;.) – Rick James Commented Jan 31 at 22:47

2 Answers

So this is the slow query:

SELECT object_id,
       SUM(metric_1)    AS metric_1,
       SUM(metric_2)    AS metric_2,
       ...
FROM daily_reports
WHERE parent_id IN (...)
  AND report_date BETWEEN :start_date AND :end_date
GROUP BY object_id
ORDER BY metric_1 DESC

That is, you have some metrics to sum. You could create an index on object_id, year(start_date), month(start_date) and another for the end date: object_id, year(end_date), month(end_date). This should speed up the query. Yet it may still be slow because many values have to be summed. If so, then besides the multidimensional indexes you can also create a table of the format

metric_archives(object_id, year, month, sum_metric_1, sum_metric_2, ...)

and whenever a customer defines a start_date and an end_date that span multiple months, say between 2020-01-15 and 2021-04-22, you can load the pre-aggregated sums for the full months between (2020, 2) and (2021, 3) and fetch the remainders (the end of January 2020 and the start of April 2021) from daily_reports. That way you read at most two months' worth of rows from daily_reports, using the indexes, and fill the whole months in between from the helper table.
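Combining the two sources might look like this (a sketch; metric_archives is the hypothetical helper table named above, and the year/month arithmetic is one way to express the month range as a scalar):

```sql
-- Hypothetical: whole months from metric_archives, the two partial
-- months from daily_reports, summed together per object_id.
SELECT object_id, SUM(metric_1) AS metric_1
FROM (
    SELECT object_id, sum_metric_1 AS metric_1
    FROM metric_archives
    WHERE year * 100 + month BETWEEN 202002 AND 202103   -- full months
    UNION ALL
    SELECT object_id, metric_1
    FROM daily_reports
    WHERE report_date BETWEEN '2020-01-15' AND '2020-01-31'  -- head remainder
       OR report_date BETWEEN '2021-04-01' AND '2021-04-22'  -- tail remainder
) AS combined
GROUP BY object_id;
```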

The drawback is of course that you will need to maintain metric_archives as you insert / update / delete records from daily_reports.

The technique (easily with MariaDB, not so easily with MySQL) for including empty rows where there is no data:

Use a sequence table to create all dates in the desired range, then LEFT JOIN to your data table. This will give you NULLs, which you can turn into 0s with COALESCE(...) or IFNULL(...).

https://mysql.rjweb.org/doc.php/index_cookbook_mysql#sequence and https://mariadb.com/kb/en/sequence-storage-engine/

SELECT  '2019-01-01' + INTERVAL seq-1 DAY   FROM seq_1_to_31;
+-----------------------------------+
| '2019-01-01' + INTERVAL seq-1 DAY |
+-----------------------------------+
| 2019-01-01                        |
| 2019-01-02                        |
| 2019-01-03                        |

SUM() doesn't care whether it encounters no-row, NULL, or zero; so I am unclear on why this might apply to a "rolling sum".

COUNT() and AVG(), on the other hand, do care.

Please provide a short list of data that exemplifies the Question and provide the desired output.

More

To speed up the Stored Proc, can you simply add the new value to the last existing row? (Instead of re-SUM-ming.)

Whether or not the above is viable, WHILE looping is bad for performance; it is better to have a complex SELECT that recomputes everything at once.
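MariaDB 10.6 supports window functions, so one set-based alternative is a single running-total pass per parent. This is a sketch, not a drop-in replacement: it only emits rows for days on which an object actually has a daily row, so carry-forward for gap days would still need the sequence-table technique:

```sql
-- Recompute all running totals for one parent in a single statement,
-- instead of looping day by day in the stored procedure.
INSERT INTO cumulative_reports
    (object_id, parent_id, report_date, metric_1, metric_2)
SELECT object_id,
       parent_id,
       report_date,
       SUM(metric_1) OVER w AS metric_1,
       SUM(metric_2) OVER w AS metric_2
FROM daily_reports
WHERE parent_id = :parent_id
WINDOW w AS (PARTITION BY object_id
             ORDER BY report_date
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
```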

I'm not sure, but these may be better than KEY(object_id):

INDEX(object_id, parent_id, metric_1)
INDEX(parent_id, object_id, metric_1)