I've been trying to figure out a SQL (in postgresql) query for a cohort-type analysis at work and can't figure this one out for the life of me.
I need a snapshot count of the number of valid users at each month since each group's partnership began with our company (anizations
.service_start
), for each employer (eligibility
.anization_id
), and eligibility service type (eligibility
.service_type
- there may be more than one per employer), for every month since the employer joined into a partnership. the users' validity is determined by the user's initial eligibility start date (eligibility
.start_date
) and end date (eligibility
.end_date
), if said end_date
is available and later than the start date.
The eligibility table is laid out as follows:
ID | START_DATE | END_DATE | SERVICE_TYPE | ORGANIZATION_ID | ORG_START_DATE |
---|---|---|---|---|---|
1234 | 01/13/2023 | TYPE 1 | GROUP A | 2022-06-01 | |
5678 | 02/21/2023 | 12/31/2024 | TYPE 2 | GROUP A | 2022-09-01 |
1357 | 10/28/2022 | 12/31/2023 | TYPE 1 | GROUP B | 2023-01-01 |
2468 | 10/28/2022 | 12/31/2023 | TYPE 2 | GROUP B | 2023-08-01 |
I've been trying to figure out a SQL (in postgresql) query for a cohort-type analysis at work and can't figure this one out for the life of me.
I need a snapshot count of the number of valid users at each month since each group's partnership began with our company (anizations
.service_start
), for each employer (eligibility
.anization_id
), and eligibility service type (eligibility
.service_type
- there may be more than one per employer), for every month since the employer joined into a partnership. the users' validity is determined by the user's initial eligibility start date (eligibility
.start_date
) and end date (eligibility
.end_date
), if said end_date
is available and later than the start date.
The eligibility table is laid out as follows:
ID | START_DATE | END_DATE | SERVICE_TYPE | ORGANIZATION_ID | ORG_START_DATE |
---|---|---|---|---|---|
1234 | 01/13/2023 | TYPE 1 | GROUP A | 2022-06-01 | |
5678 | 02/21/2023 | 12/31/2024 | TYPE 2 | GROUP A | 2022-09-01 |
1357 | 10/28/2022 | 12/31/2023 | TYPE 1 | GROUP B | 2023-01-01 |
2468 | 10/28/2022 | 12/31/2023 | TYPE 2 | GROUP B | 2023-08-01 |
but the end_date column may be inaccurately saved, either null (row 1), prior to start_date, or various other issues.
What I have so far is basically counts of the validly eligible users at the start of each group-type combination:
select
o.name as employer, e.service_type,
date_trunc('month', e.start_date::date) as start_month,
sum(count(*)) over (partition by e.anization_id, e.health_plan_id, e.service_type, date_trunc('month', e.start_date::date)) as count
from eligibility e
left join anizations o on e.anization_id = o.id
where e.end_date is null or e.end_date > e.start_date
group by 2, 3, 1
order by 1, 2, 3;
meaning its basically one row per group-type combination and the count
EMPLOYER | SERVICE_TYPE | START_MONTH | COUNT |
---|---|---|---|
Group A | Type 1 | 2022-06-01 | 50 |
Group A | Type 2 | 2022-09-01 | 140 |
Group B | Type 1 | 2023-01-01 | 94 |
Group B | Type 2 | 2023-08-01 | 12 |
but i am wanting this for all months for each combination between the start_month and present day.
Share Improve this question edited Mar 2 at 8:07 Zegarek 27.4k5 gold badges24 silver badges30 bronze badges asked Mar 2 at 6:21 Eleanor BrockEleanor Brock 11 bronze badge 3 |1 Answer
Reset to default 0While the whole process still seems quite unclear, it seems you will need generate_series()
to instantiate every month of your "start to end" ranges.
Here is a query that returns "something" that may be close to your need (that we can only guess until you clarify it):
with
-- Which "employer / service" tuples will we work on?
-- When are their starts and stops?
service_by_ as
(
select
anization_id, service_type,
min(_start_date) start_date,
max(coalesce(end_date, current_timestamp)) end_date
from eligibility
group by 1, 2
),
-- Now get the list of months for the full range of each tuple.
months as
(
select
anization_id, service_type,
s.d start_date
from service_by_ sbo,
generate_series
(
date_trunc('month', start_date),
end_date,
interval '1 month'
) s(d)
)
-- Count.
select o.name as employer, e.service_type, m.start_date as start_month, count(1) as count
from months m
join eligibility e
on e.anization_id = m.anization_id
and e.service_type = m.service_type
and e.start_date < m.start_date + interval '1 month'
and (e.end_date is null or e.end_date >= m.start_date)
and e._start_date <= m.start_date -- Filter out entries with start_date < _start_date, for months before the _start_date.
join anizations o on o.id = e.anization_id
group by 1, 2, 3
order by 1, 2, 3
;
You can see it running live in a fiddle,
where I added another entry on GROUP A / Type 1, one year after the first one, to show a count greater than 1 (without it the example you gave only creates groups of 1 entry (from what I understood), which makes hard to demonstrate any interesting sum()
)
count()
? The differenthealth_plan_id
over the same employer? May you post a minimal reproducible example, for example by fiddling on a small basis that you can modify with significant examples (adding somehealth_plan_id
for example) until it works returning something showing your problem? Then paste the URL of that resulting fiddle, that first (running in the fiddle) query, its results, and the one you'd like, so that we can easily understand how to help you. – Guillaume Outters Commented Mar 2 at 8:26health_plan_id
.join anizations
should be after grouping data in tableeligibility
, or take name as max(employer). – ValNik Commented Mar 2 at 8:28GROUP BY
andORDER BY
clauses: their use makes queries more fragile and increases cognitive load on readers. Names remove the dependence on column order in select lists and can convey more information to the reader (assuming that useful names are chosen instead of names such ascol_1, col_2
). – JohnH Commented Mar 2 at 19:55