I am having trouble wrapping my head around the following problem. A sample dataset (truncated):
ID | Test | Instance | Weight | Outcome |
---|---|---|---|---|
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | B | 1 | 0.5 | 2 |
1 | B | 1 | 0.5 | 2 |
2 | A | 1 | 1 | 1 |
2 | A | 2 | 1 | 2 |
2 | B | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
4 | A | 1 | 0.5 | 3 |
4 | B | 1 | 0.5 | 4 |
4 | B | 1 | 0.5 | 4 |
5 | A | 1 | 0.5 | 2 |
5 | B | 1 | 0.5 | 2 |
6 | A | 1 | 0.25 | 2 |
6 | A | 1 | 0.25 | 2 |
6 | A | 2 | 0.25 | 2 |
6 | A | 3 | 0.25 | 1 |
6 | B | 1 | 0.25 | 2 |
I am having trouble wrapping my head around the following problem. A sample dataset (truncated):
ID | Test | Instance | Weight | Outcome |
---|---|---|---|---|
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | A | 1 | 0.5 | 2 |
1 | B | 1 | 0.5 | 2 |
1 | B | 1 | 0.5 | 2 |
2 | A | 1 | 1 | 1 |
2 | A | 2 | 1 | 2 |
2 | B | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
3 | C | 1 | 1 | 2 |
4 | A | 1 | 0.5 | 3 |
4 | B | 1 | 0.5 | 4 |
4 | B | 1 | 0.5 | 4 |
5 | A | 1 | 0.5 | 2 |
5 | B | 1 | 0.5 | 2 |
6 | A | 1 | 0.25 | 2 |
6 | A | 1 | 0.25 | 2 |
6 | A | 2 | 0.25 | 2 |
6 | A | 3 | 0.25 | 1 |
6 | B | 1 | 0.25 | 2 |
My code:
SELECT
ID,
TEST,
INSTANCE,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '1' THEN ID END))) AS Outcome1,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '2' THEN ID END))) AS Outcome2,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '3' THEN ID END))) AS Outcome3,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '4' THEN ID END))) AS Outcome4
FROM
DATA
WHERE
TEST IN ('A', 'B', 'C')
GROUP BY
ID, TEST, WEIGHT
ORDER BY
ID;
which results in the following output:
ID | Test | Instance | Outcome1 | Outcome2 | Outcome3 | Outcome4 |
---|---|---|---|---|---|---|
1 | A | 1 | 0 | 0.5 | 0 | 0 |
1 | B | 1 | 0 | 0.5 | 0 | 0 |
2 | A | 1 | 1 | 0 | 0 | 0 |
2 | A | 2 | 0 | 1 | 0 | 0 |
2 | B | 1 | 0 | 1 | 0 | 0 |
3 | C | 1 | 0 | 1 | 0 | 0 |
4 | A | 1 | 0 | 0 | 0.5 | 0 |
4 | B | 1 | 0 | 0 | 0 | 0.5 |
5 | A | 1 | 0 | 0.5 | 0 | 0 |
5 | B | 1 | 0 | 0.5 | 0 | 0 |
6 | A | 1 | 0 | 0.25 | 0 | 0 |
6 | A | 2 | 0 | 0.25 | 0 | 0 |
6 | A | 3 | 0.25 | 0 | 0 | 0 |
6 | B | 1 | 0 | 0.25 | 0 | 0 |
7 | A | 1 | 0.5 | 0 | 0 | 0 |
7 | B | 1 | 0.5 | 0 | 0 | 0 |
8 | B | 1 | 1 | 0 | 0 | 0 |
9 | A | 1 | 0 | 1 | 0 | 0 |
9 | B | 1 | 0.5 | 0 | 0 | 0 |
9 | B | 2 | 0.5 | 0 | 0 | 0 |
10 | A | 1 | 0 | 0.25 | 0 | 0 |
10 | A | 2 | 0 | 0.25 | 0 | 0 |
11 | A | 1 | 0 | 1 | 0 | 0 |
12 | B | 1 | 0 | 1 | 0 | 0 |
13 | A | 1 | 0.5 | 0 | 0 | 0 |
13 | B | 1 | 0 | 0 | 0 | 0.5 |
14 | B | 1 | 0 | 0.5 | 0 | 0 |
15 | A | 1 | 1 | 0 | 0 | 0 |
15 | B | 1 | 1 | 0 | 0 | 0 |
15 | C | 1 | 0 | 0 | 1 | 0 |
However, I would like to have one row for each ID that collapses everything like the following:
ID | Outcome1 | Outcome2 | Outcome3 | Outcome4 |
---|---|---|---|---|
1 | 0 | 1 | 0 | 0 |
2 | 1 | 2 | 0 | 0 |
3 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0.5 | 0.5 |
5 | 0 | 1 | 0 | 0 |
6 | 0.25 | 0.75 | 0 | 0 |
7 | 1 | 0 | 0 | 0 |
8 | 1 | 0 | 0 | 0 |
9 | 1 | 1 | 0 | 0 |
10 | 0 | 0.5 | 0 | 0 |
11 | 0 | 1 | 0 | 0 |
12 | 0 | 1 | 0 | 0 |
13 | 0 | 0 | 0 | 0.5 |
14 | 0 | 0.5 | 0 | 0 |
15 | 2 | 0 | 1 | 0 |
I have tried to pull this without the "Test" variable while keeping it in the group by, but that reduces each ID to just one outcome so no ID has a number higher than 1 (e.g., ID 2 Outcome 2 = 1 when it should be 2, I assume because it was only looking for each unique outcome. so if they achieved outcome 1 on tests A and B, it still only counts outcome 1 once). I tried removing the "unique" identifier as well, but that muddied things too much as some IDs have 50 records for Test A and I only care about each outcome once. (e.g., if they got Outcome 1 25 times and Outcome 2 25 times on Test A, I still want it to display Outcome 1 = 1 x weight, Outcome 2 = 1 x weight). I tried to do this in Excel using pivot tables but unfortunately my dataset is actually around 1 million rows so the export didn't go well. I would love to find a way to do within SQL.
To summarize: I would like to collapse my table so that each ID has one row with all data summarized while accounting for unique test outcomes.
Thank you!
Share Improve this question edited Mar 19 at 15:09 Bonnie Brett asked Mar 18 at 21:19 Bonnie BrettBonnie Brett 133 bronze badges 5- I don't think your expected results table reflect what you're saying in your text. Which one is right? In cases such as this, you build your query piece by piece. The first piece could be a select of the unique row numbers. You could have a part that is a select of unique tests results for each unique row number. – S.Comeau Commented Mar 18 at 21:38
- 2 Please tag the RDBMS you are using (Toad is a UI and as such unlikely to be relevant to the question). – Dale K Commented Mar 19 at 3:39
- 2 A minimal reproducible example is supposed to specify both sample table data and the expected result. – jarlh Commented Mar 19 at 9:54
- Thank you S.Comeau - you are correct that my expected results didn't match but it has been fixed. If my question was not clear, I am trying to get the second table instead of the first without losing any information. – Bonnie Brett Commented Mar 19 at 13:39
- I included sample data and another column that I think is necessary for showing why I am using UNIQUE. Thank you for the reminder. – Bonnie Brett Commented Mar 19 at 15:10
1 Answer
Reset to default 0Any modern RDBMS (>= 2010?) will provide you Common Table Expressions that will help you inject the results of your first query in a second query dedicated to the grouping:
with t as
(
-- This is your query, unmodified except for the last ; removed:
SELECT
ID,
TEST,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '1' THEN ID END))) AS Outcome1,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '2' THEN ID END))) AS Outcome2,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '3' THEN ID END))) AS Outcome3,
WEIGHT * (COUNT(UNIQUE(CASE WHEN OUTCOME = '4' THEN ID END))) AS Outcome4
FROM
DATA
WHERE
TEST IN ('A', 'B', 'C')
GROUP BY
ID, TEST, WEIGHT
ORDER BY
ID
)
select
ID,
sum(Outcome1) Outcome1,
sum(Outcome2) Outcome2,
sum(Outcome3) Outcome3,
sum(Outcome4) Outcome4
from t
group by ID
order by ID;
This will help you build your solution incrementally:
you can comment the last select
and place a select * from t
to verify the results of the intermediate query, before unleashing the definitive version of the final one,
thus secure every "good" query into its own alias (even isolating interesting attempts in subqueries that you don't want to use in the final result, but keep for posterity),
and chain the steps to get your final report.
In essence, they are like temporary tables, that vanish outside of the WITH
.