I have a `daily_reports` table that keeps daily metrics. Its structure is something like `(parent_id, object_id, report_date, metric_1, metric_2, ...)`. `(object_id, report_date)` is always unique, so there is at most one report row per day for an object.
In part of the application, it is queried this way, for example:
SELECT object_id,
SUM(metric_1) AS metric_1,
SUM(metric_2) AS metric_2,
...
FROM daily_reports
WHERE parent_id IN (...)
AND report_date BETWEEN :start_date AND :end_date
GROUP BY object_id
ORDER BY metric_1 DESC
The `:start_date` and `:end_date` are chosen by the user and can be arbitrary.
That query is slow because there are too many `object_id`s to sum over.
I'm exploring ideas to make this quicker. One idea is to keep a "rolling sum" table for each `object_id` since the first day it appears in the `daily_reports` table.
Some things to keep in mind though:
- Not every `object_id` appears every day; some `object_id`s get no reports for arbitrary periods of time.
- Sometimes new `object_id`s appear in `daily_reports`.
- There can be changes in `daily_reports` for the last 30 days, but not before that.
With that in mind, the new rolling sum table needs to "carry forward" any object_ids
that didn't appear in today's reports. Also, the rolling sums of the last 30 days need to be recalculated every day, since there can be changes.
Here is what I came up with, a stored procedure that builds a rolling sum table day by day:
DELIMITER //
CREATE OR REPLACE PROCEDURE update_cumul_reports(
    IN p_start_date DATE,
    IN p_end_date DATE,
    IN p_parent_id BIGINT
)
BEGIN
    DECLARE current_report_date DATE;
    DECLARE latest_cumul_date DATE;

    SET current_report_date = p_start_date;

    WHILE current_report_date <= p_end_date DO
        -- Find the latest date a rolling sum exists for
        SELECT MAX(report_date) INTO latest_cumul_date
        FROM cumulative_reports
        WHERE report_date < current_report_date
          AND parent_id = p_parent_id;

        INSERT INTO cumulative_reports
            (object_id, parent_id, report_date, metric_1, metric_2, metric_3, metric_4)
        SELECT
            -- object_ids that already exist in cumulative_reports; the LEFT JOIN
            -- carries forward object_ids that didn't get a daily_report today
            cumul.object_id,
            COALESCE(raw.parent_id, cumul.parent_id),
            current_report_date AS report_date,
            COALESCE(cumul.metric_1, 0) + COALESCE(raw.metric_1, 0) AS metric_1,
            COALESCE(cumul.metric_2, 0) + COALESCE(raw.metric_2, 0) AS metric_2,
            ...
        FROM cumulative_reports AS cumul
        LEFT JOIN daily_reports AS raw
               ON cumul.object_id = raw.object_id
              AND raw.report_date = current_report_date
              AND raw.parent_id = p_parent_id
        WHERE cumul.report_date = latest_cumul_date
          AND cumul.parent_id = p_parent_id

        UNION ALL

        SELECT
            -- New object_ids that appeared in daily_reports for the first time today
            raw.object_id,
            raw.parent_id,
            current_report_date,
            raw.metric_1,
            raw.metric_2,
            raw.metric_3,
            raw.metric_4
        FROM daily_reports AS raw
        WHERE raw.report_date = current_report_date
          AND raw.parent_id = p_parent_id
          AND NOT EXISTS (
              SELECT 1
              FROM cumulative_reports c
              WHERE c.report_date = latest_cumul_date
                AND c.parent_id = p_parent_id
                AND c.object_id = raw.object_id
          );

        -- Go to next date
        SET current_report_date = DATE_ADD(current_report_date, INTERVAL 1 DAY);
    END WHILE;
END;
//
DELIMITER ;
I have two indexes on the new table `cumulative_reports`: the PRIMARY KEY `(parent_id, report_date, object_id)` and `(report_date, object_id)`.
Querying the new table is very quick. Every `object_id` only needs two rows from the cumulative table, one for the start date and one for the end date, instead of summing over the entire period.
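In other words, the range query against `cumulative_reports` can be sketched like this (the `:parameters` are placeholders, only two metrics are shown, and the start-side row is the rolling sum for the day just before the range):

```sql
-- Sum over [:start_date, :end_date] = rolling sum at :end_date
-- minus rolling sum at the day before :start_date.
-- Carry-forward guarantees each existing object_id has a row on :end_date;
-- objects first seen inside the range have no start-side row, hence COALESCE.
SELECT e.object_id,
       e.metric_1 - COALESCE(s.metric_1, 0) AS metric_1,
       e.metric_2 - COALESCE(s.metric_2, 0) AS metric_2
FROM cumulative_reports AS e
LEFT JOIN cumulative_reports AS s
       ON s.parent_id   = e.parent_id
      AND s.object_id   = e.object_id
      AND s.report_date = :start_date - INTERVAL 1 DAY
WHERE e.parent_id   = :parent_id
  AND e.report_date = :end_date
ORDER BY metric_1 DESC;
```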
The issue is that building the table gets slower and slower as the table itself grows.
This alone is fast enough (takes two minutes):
CALL update_cumul_reports(
'2025-01-01',
'2025-01-31',
:parent_id_with_most_objects
);
But if the entire table is built from the beginning, with all the old `object_id`s that are being carried forward present, it would take 10-15 minutes.
Not carrying forward the `object_id`s that don't get daily reports means querying becomes too complex and too slow.
Are there any ideas to improve this? Or any better way to implement the rolling sum table?
If there is any missing information from my question, please ask and I'll add it.
Thanks in advance for your answers.
PS: I am using 10.6.18-MariaDB-log. I cannot use ColumnStore in my current setup.
Edit: some additional information as requested:
- `daily_reports` already has a `(parent_id, report_date)` index.
- `daily_reports` has (obtained using `EXPLAIN SELECT * FROM daily_reports`) a little under 2 billion rows, and growing.
- Intervals can be anything from a few months to a few years.
- The original query, when selecting an interval of only two months and limiting it to one `parent_id` (the one with the most `object_id`s), takes 20 seconds to run. The requirement is that it needs to be under 5 seconds.
- For a two-month interval, `daily_reports` has a little under 2 million rows for only one `parent_id`, with around 100k distinct `object_id`s.
- The query is usually run with rarely over 10 `parent_id`s.
Edit 2: As requested:
# SHOW CREATE TABLE daily_reports
CREATE TABLE `daily_reports`
(
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`report_date` int(11) unsigned NOT NULL, # Unfortunately an INT, not a DATE
`parent_id` bigint(20) unsigned NOT NULL,
`object_id` bigint(20) unsigned NOT NULL,
`metric_1` int(11) NOT NULL,
`metric_2` double(10, 2) NOT NULL,
# ...
PRIMARY KEY (`id`),
UNIQUE KEY `report_date_object_id_uidx` (`report_date`, `object_id`),
KEY `object_id_idx` (`object_id`),
KEY `parent_id_report_date` (`parent_id`, `report_date`)
) ENGINE = InnoDB
DEFAULT CHARSET = utf8mb4
COLLATE = utf8mb4_general_ci
ROW_FORMAT = COMPRESSED
The table goes back 5-6 years only; no rows are ever deleted.
There are a little under 200 million distinct `object_id`s.
Edit 3: a small fiddle that exemplifies what I want: dbfiddle.uk/1hbEHiUV.
asked Jan 31 at 12:45 by Zakaria; edited Feb 4 at 10:46

2 Answers

So this is the slow query:
SELECT object_id,
SUM(metric_1) AS metric_1,
SUM(metric_2) AS metric_2,
...
FROM daily_reports
WHERE parent_id IN (...)
AND report_date BETWEEN :start_date AND :end_date
GROUP BY object_id
ORDER BY metric_1 DESC
that is, you have some metrics to sum. You could create an index on `object_id, year(start_date), month(start_date)` and another index for the end date: `object_id, year(end_date), month(end_date)`. This should speed up the query. Yet it may still be slow due to having to sum many values. If so, then besides the multidimensional indexes you can also create a table of the format of
`metric_archives(object_id, year, month, sum_metric_1, sum_metric_2, ...)`
and whenever a customer defines a start_date and an end_date that span multiple months, like between 2020-01-15 and 2021-04-22, you can load the metric sums between (2020, 2) and (2021, 3) from it and fetch the remainders (the end of January 2020 and the start of April 2021) from your daily_reports table. Therefore you will always load at most two months' worth of records from daily_reports, using the indexes, and fill the whole months in between from the helper table.
The drawback is of course that you will need to maintain metric_archives as you insert / update / delete records in daily_reports.
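That split could be sketched like so (assuming the proposed `metric_archives(object_id, year, month, sum_metric_1, ...)` table and the example's date bounds; only one metric shown):

```sql
-- Sketch: whole months in the middle come from metric_archives,
-- the two partial months at the ends from daily_reports.
-- year*12 + month linearises (year, month) so a single BETWEEN works.
-- (parent_id filters omitted for brevity.)
SELECT object_id, SUM(metric_1) AS metric_1
FROM (
    SELECT object_id, sum_metric_1 AS metric_1
    FROM metric_archives
    WHERE year * 12 + month BETWEEN 2020 * 12 + 2 AND 2021 * 12 + 3
    UNION ALL
    SELECT object_id, metric_1
    FROM daily_reports
    WHERE report_date BETWEEN '2020-01-15' AND '2020-01-31'
       OR report_date BETWEEN '2021-04-01' AND '2021-04-22'
) AS parts
GROUP BY object_id;
```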
The technique (easy with MariaDB, not so easy with MySQL) for including empty rows where there is no data: use a sequence table to create all dates in the desired range, then LEFT JOIN to your data table. This will give you NULLs, which you can turn into 0s with COALESCE(...) or IFNULL(...).
https://mysql.rjweb.org/doc.php/index_cookbook_mysql#sequence and https://mariadb.com/kb/en/sequence-storage-engine/
SELECT '2019-01-01' + INTERVAL seq-1 DAY FROM seq_1_to_31;
+-----------------------------------+
| '2019-01-01' + INTERVAL seq-1 DAY |
+-----------------------------------+
| 2019-01-01 |
| 2019-01-02 |
| 2019-01-03 |
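Applied to the question's schema, the gap-filling join might look something like this (a sketch for one object's daily series; `:parent_id` and `:object_id` are placeholders):

```sql
-- One output row per calendar day in January 2019,
-- zero-filled where the object got no daily report.
SELECT d.report_date,
       COALESCE(r.metric_1, 0) AS metric_1
FROM (SELECT '2019-01-01' + INTERVAL seq - 1 DAY AS report_date
      FROM seq_1_to_31) AS d
LEFT JOIN daily_reports AS r
       ON r.report_date = d.report_date
      AND r.parent_id   = :parent_id
      AND r.object_id   = :object_id;
```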
SUM() doesn't care whether it encounters no row, NULL, or zero, so I am unclear on why this might apply to a "rolling sum". COUNT() and AVG(), on the other hand, do care.
Please provide a short list of data that exemplifies the Question and provide the desired output.
More
To speed up the Stored Proc, can you simply add the new value to the last existing row? (Instead of re-SUM-ming.)
Whether or not the above is viable, WHILE looping is bad for performance; it is better to have a complex SELECT that recomputes everything at once.
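On MariaDB 10.6, which has both window functions and the SEQUENCE engine, the daily 30-day recompute could be sketched as one set-based statement instead of a loop. This is untested against the real schema, shows only metric_1, and covers only object_ids that already have a base row on the day before the range; object_ids first appearing inside the range would still need the procedure's NOT EXISTS branch:

```sql
-- Rebuild rolling sums for one parent over the last 30 days in one pass:
-- start from each object's rolling sum on the day before the range (b),
-- materialise every (object, day) pair via the seq table (d),
-- and add a window running total of that object's daily metrics (r).
-- REPLACE overwrites via the (parent_id, report_date, object_id) PRIMARY KEY.
REPLACE INTO cumulative_reports (object_id, parent_id, report_date, metric_1)
SELECT b.object_id,
       b.parent_id,
       d.report_date,
       b.metric_1 + SUM(COALESCE(r.metric_1, 0))
           OVER (PARTITION BY b.object_id ORDER BY d.report_date)
FROM cumulative_reports AS b
CROSS JOIN (SELECT :start_date + INTERVAL seq DAY AS report_date
            FROM seq_0_to_29) AS d
LEFT JOIN daily_reports AS r
       ON r.object_id   = b.object_id
      AND r.parent_id   = b.parent_id
      AND r.report_date = d.report_date
WHERE b.parent_id   = :parent_id
  AND b.report_date = :start_date - INTERVAL 1 DAY;
```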
I'm not sure, but these may be better than `KEY(object_id)`:
INDEX(object_id, parent_id, metric_1)
INDEX(parent_id, object_id, metric_1)
Comments:
- "[...] `daily_reports` table have?" – mr mcwolf, Jan 31 at 13:19
- "`from (select '2000-01-01' + interval seq day report_date from seq_0_to_36524 having report_date between :start_date AND :end_date) dates cross join (select null parent_id where 0 union all values (:parent1),(:parent2),(:parent3)...) parents join daily_reports using (parent_id, report_date)`" – ysth, Jan 31 at 13:21
- "I created a `calendar_days` table that has the needed days, joined it to the `parent_id` (only selected one), and the new query has pretty much the same performance as the original one. @ChrisMaurer the running sum table is to be updated every day." – Zakaria, Jan 31 at 15:15
- "[...] `PRIMARY KEY` of `daily_reports`? (Better yet, please provide `SHOW CREATE TABLE daily_reports;`.)" – Rick James, Jan 31 at 22:47