This is my sample data set in a table called "status": fiddle
create table status(id, status)as values
(1, 'in')
,(1, 'out')
,(1, 'in')
,(1, 'in')
,(2, 'in')
,(2, 'in')
,(3, 'out')
,(3, 'out')
,(3, 'in')
,(3, 'out')
,(4, 'out')
,(4, 'out');
If I want to return ID with at least 1 "in" status, how should I write it? I started with this
SELECT id, COUNT(status)
FROM status
GROUP BY id
HAVING count(*) > 1
but it is returning the actual number of status per ID, including 4
that has zero in
statuses.
id | count |
---|---|
3 | 4 |
4 | 2 |
2 | 2 |
1 | 4 |
This is my sample data set in a table called "status": fiddle
create table status(id, status)as values
(1, 'in')
,(1, 'out')
,(1, 'in')
,(1, 'in')
,(2, 'in')
,(2, 'in')
,(3, 'out')
,(3, 'out')
,(3, 'in')
,(3, 'out')
,(4, 'out')
,(4, 'out');
If I want to return ID with at least 1 "in" status, how should I write it? I started with this
SELECT id, COUNT(status)
FROM status
GROUP BY id
HAVING count(*) > 1
but it is returning the actual number of status per ID, including 4
that has zero in
statuses.
id | count |
---|---|
3 | 4 |
4 | 2 |
2 | 2 |
1 | 4 |
3 Answers
Reset to default 1To return IDs that have at least one "in"
status you may find using the distinct
keyword suitable:
SELECT DISTINCT id
FROM status
WHERE status = 'in';
It's enough to ask if any of the statuses in a group is in
, which you can do with bool_or()
:
demo at db<>fiddle
select id,count(*)
from status
group by id
having bool_or(status='in');
Logically, it's the same as asking if every(status<>'in')
- if not, it means at least one is 'in'
. While bool_and()
has the nice and readable every()
alias, bool_or()
doesn't, because any()
is reserved for an array/subquery operator. Which happens to be another way to phrase it:
select id,count(*)
from status
group by id
having 'in'=any(array_agg(status));
It's the same but runs faster by using an index-only scan, assuming you create the index:
create index on status(id,status);
Alternatively, you can use an exists()
condition:
select id,count(*)
from status as a
where exists(select from status as b
where a.id=b.id
and b.status='in')
group by id;
On a 700k test sample with 50k ids and 20%-80% split between out
s and in
s, here's how they perform (in seconds, over 25 calls):
version | avg | min | max | stddev | sum |
---|---|---|---|---|---|
having 'in'=any(array_agg(status)) |
0.139506 | 0.133707 | 0.150881 | 0.005853 | 3.487649 |
having bool_or(status='in') |
0.225468 | 0.20282 | 0.285397 | 0.02107 | 5.636694 |
having count(*)filter(where status='in')>0 |
0.229346 | 0.203776 | 0.298583 | 0.025039 | 5.733644 |
having sum(case when status='in' then 1 else 0 end)>0 |
0.234454 | 0.203856 | 0.43467 | 0.046771 | 5.861347 |
where exists(..) |
0.614826 | 0.550297 | 1.136524 | 0.120762 | 15.37065 |
Given a different distribution(e.g. more rare 'in'
status, more id
s) some of those don't hold and most switch to using the index:
demo at db<>fiddle
query_name | avg | min | max | stddev | sum |
---|---|---|---|---|---|
where exists() |
0.111665 | 0.10433 | 0.146798 | 0.009681 | 2.233304 |
having bool_or(status='in') |
0.142832 | 0.136964 | 0.15687 | 0.006704 | 2.856638 |
having count(*)filter(where status='in')>0 |
0.144093 | 0.133775 | 0.212441 | 0.017232 | 2.881863 |
having sum(case when status='in' then 1 else 0 end)>0 |
0.145145 | 0.138662 | 0.160853 | 0.007255 | 2.9029 |
having 'in'=any(array_agg(status)) |
0.216814 | 0.204083 | 0.265939 | 0.016245 | 4.336287 |
While not necessarily the best way to formulate the having
clause, the filter
clause that @Jonas Metzler has suggested, might still come handy:
select id
, count(*)
, count(*)filter(where status='in') as ins
, count(*)filter(where status='out') as outs
from status
group by id
having bool_or(status='in');
id | count | ins | outs |
---|---|---|---|
3 | 4 | 1 | 3 |
2 | 2 | 2 | 0 |
1 | 4 | 3 | 1 |
It lets you collect the three different counts in a single query, in a single pass.
If the only reason you used count(*)
was to establish you want at least one 'in'
but you only actually need the distinct id
s, not counts, a compact partial index can speed it up:
create index on status(id)where(status='in');
Now all these two queries do is read it out, with no need to search:
select id
from status
where status='in'
group by id;
select distinct id
from status
where status='in';
You can conditionally sum whether an Id has a Status of 'in' and then filter on it e.g.
create table Status (Id int, Status varchar(12));
insert into Status (Id, Status)
values
(1, 'in'),
(1, 'out'),
(1, 'in'),
(1, 'in'),
(2, 'in'),
(2, 'in'),
(3, 'out'),
(3, 'out'),
(3, 'in'),
(3, 'out');
SELECT Id, COUNT(Status)
FROM Status
GROUP BY Id
HAVING SUM(CASE WHEN Status = 'in' THEN 1 ELSE 0 END) > 0
ORDER BY Id;
id | count |
---|---|
1 | 4 |
2 | 2 |
3 | 4 |
fiddle
As noted by others (thanks @JonasMetzler) FILTER
is another, potentially better, option here e.g.
SELECT Id, COUNT(Status)
FROM Status
GROUP BY Id
HAVING COUNT(*) FILTER (WHERE Status = 'in') > 0
ORDER BY Id;
..HAVING COUNT(*) FILTER(WHERE status = 'in') > 0
, but it's unclear which outcome excatly you want. – Jonas Metzler Commented Apr 2 at 7:34status
asboolean
if there's just the options, or anenum
if there's more - in case it's not an enum already. If you have any indexes on the table, it would be helpful to see them, if you could edit them in. Do you have some stats about the input set (rowcount, rows per id, how many ins vs outs)? – Zegarek Commented Apr 2 at 11:07