We use Prometheus PushGateway for collecting metrics from short-lived cronjobs. I would like to collect two metrics:
- How many times a particular cronjob finished without errors.
- How many times a particular cronjob failed with a fatal error.
Suppose I use Counter
for this. From each particular cronjob's point of view, both metrics can be either 1 or empty (I'm incrementing either the success or the failure metric during a given cronjob execution). Also, I need to push those metrics with a different groupingKey from each cronjob to avoid metric value overwrites (in case some jobs have parallel executions). I use a timestamp as my groupingKey. Now, what I'm left with are timeseries in the following format:
my_job_successful_executions{job="my-job", timestamp="2025-03-05:12:14:12"} = 1
my_job_successful_executions{job="my-job", timestamp="2025-03-05:12:29:12"} = 1
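For context, here is a minimal sketch of what each cronjob does, assuming the Python prometheus_client (the Pushgateway address and the exact grouping-key format are illustrative, and the client may append a _total suffix to the counter name):

    from datetime import datetime, timezone
    from prometheus_client import CollectorRegistry, Counter, push_to_gateway

    registry = CollectorRegistry()
    success = Counter(
        "my_job_successful_executions",
        "Number of successful executions of the cronjob",
        registry=registry,
    )

    # ... the actual job runs here; on success:
    success.inc()

    # Push with a per-run grouping key so parallel runs don't overwrite each other.
    push_to_gateway(
        "pushgateway:9091",  # assumed Pushgateway address
        job="my-job",
        registry=registry,
        grouping_key={"timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d:%H:%M:%S")},
    )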
How can I get, for example, the following information: "How many successful executions of cronjobs happened in the last hour" vs. "How many executions failed in the last hour"?
To specify a time window and use increase()
, for example, I would need a range vector.
I can sum up all the values across all the timeseries that I have:
sum(my_job_successful_executions{job="my-job"})
But that leaves me with another instant vector, which I can't apply increase()
to.
With a Gauge I would only have the last state, which won't allow me to query for the last hour / day, etc.
- Normally pushgateway is only usable if you care only about the last run of a job (and jobs aren't overlapping). Also, your groupingKey approach is not a good match for pushgateway, since it doesn't forget pushed metrics. – markalex
- Have you read Should I be using the Pushgateway? To me it sounds like your jobs are better suited as daemons, with the usual pull-based scraping. – markalex
1 Answer
Don't store metrics with labels containing a potentially unbounded number of values, such as timestamp or pid, since this may result in high-cardinality issues such as high RAM usage, data ingestion slowdown and query execution slowdown. It is better to use a long-running server like statsd, statsd_exporter or vmagent with stream aggregation to aggregate the metrics received from cron jobs before storing them in Prometheus.
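For illustration, a minimal sketch of what a cronjob could send instead of pushing to the Pushgateway, assuming a statsd_exporter listening on its default UDP port 9125 (the hostname is an assumption). The exporter accumulates the increments into a single Prometheus counter, so increase() over a time window works as usual:

    import socket

    # statsd line protocol: "<name>:<value>|c" is a counter increment.
    # statsd_exporter aggregates these and exposes one counter for Prometheus to scrape.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"my_job_successful_executions:1|c", ("statsd-exporter.example", 9125))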
An alternative is to store all the original events about cron job status in an analytical database such as ClickHouse in the form of wide events. This will allow you to calculate various stats over the stored events, with filters and groupings on any fields stored in the events.
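For example, a rough sketch using the clickhouse-driver Python package (the table name, schema and host are assumptions): each cronjob run inserts one wide event, and the "last hour" question becomes a plain GROUP BY:

    from datetime import datetime
    from clickhouse_driver import Client  # pip install clickhouse-driver

    client = Client(host="clickhouse.example")

    # One row ("wide event") per cronjob run.
    client.execute("""
        CREATE TABLE IF NOT EXISTS cron_job_events (
            job         String,
            status      Enum8('success' = 1, 'failure' = 2),
            finished_at DateTime
        ) ENGINE = MergeTree ORDER BY finished_at
    """)

    client.execute(
        "INSERT INTO cron_job_events (job, status, finished_at) VALUES",
        [("my-job", "success", datetime.utcnow())],
    )

    # "Successful vs. failed executions in the last hour":
    rows = client.execute("""
        SELECT status, count()
        FROM cron_job_events
        WHERE job = 'my-job' AND finished_at >= now() - INTERVAL 1 HOUR
        GROUP BY status
    """)
    print(rows)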