Problem statement: I am working on a reporting dashboard in Tableau where each record belongs to one status_category. However, I need to display a common aggregated count (processed_count) across all categories without causing row explosion.
For example, if 1,200 transactions were processed, I want that number to appear in every category (eg - "Pending," "In Progress," etc.), even for records that didn’t have a processed status.
Approaches attempted:
- Extracted distinct status categories
- Extracted distinct grouping dimensions (that are used as filters in my dashboard)
CROSS JOIN
ed them to ensure every status appears for each unique combination of the grouping columnsLEFT JOIN
ed the result back to the main dataset
Issue: this worked correctly, but caused massive row explosion
- Original dataset: 600k rows
- After applying this approach: 2.4M rows
This impacts performance in our database and visualization tool by making queries inefficient and introducing unnecessary data bloat.
Sample output:
Region | Department | Segment | Cycle_Month | Status_Category | Processed_Count | Record_Count |
---|---|---|---|---|---|---|
North | Finance | Large | Jan | Processed in Current Cycle | 1,200 | 500 |
North | Finance | Large | Jan | Pending | 1,200 | 300 |
North | Finance | Large | Jan | In Progress | 1,200 | 400 |
Problem statement: I am working on a reporting dashboard in Tableau where each record belongs to one status_category. However, I need to display a common aggregated count (processed_count) across all categories without causing row explosion.
For example, if 1,200 transactions were processed, I want that number to appear in every category (eg - "Pending," "In Progress," etc.), even for records that didn’t have a processed status.
Approaches attempted:
- Extracted distinct status categories
- Extracted distinct grouping dimensions (that are used as filters in my dashboard)
CROSS JOIN
ed them to ensure every status appears for each unique combination of the grouping columnsLEFT JOIN
ed the result back to the main dataset
Issue: this worked correctly, but caused massive row explosion
- Original dataset: 600k rows
- After applying this approach: 2.4M rows
This impacts performance in our database and visualization tool by making queries inefficient and introducing unnecessary data bloat.
Sample output:
Region | Department | Segment | Cycle_Month | Status_Category | Processed_Count | Record_Count |
---|---|---|---|---|---|---|
North | Finance | Large | Jan | Processed in Current Cycle | 1,200 | 500 |
North | Finance | Large | Jan | Pending | 1,200 | 300 |
North | Finance | Large | Jan | In Progress | 1,200 | 400 |
Screenshot
Query:
-- Aggregate Data by Dimensions
WITH aggregated_data AS
(
SELECT
pd.cycle_month,
pd.region_id,
pd.region_name,
pd.department,
pd.segment,
pd.team_id,
pd.team_name,
pd.division_id,
pd.division_name,
pd.manager_id,
pd.manager_name,
pd.status_category,
COUNT(DISTINCT pd.record_id) AS record_count
FROM processed_data_table pd
GROUP BY ALL
),
-- Extract Unique Status Categories
distinct_statuses AS (
SELECT DISTINCT status_category
FROM aggregated_data
),
-- Extract Unique Dimension Combinations (excluding status_category)
distinct_dimensions AS (
SELECT DISTINCT
cycle_month,
region_id,
region_name,
department,
segment,
team_id,
team_name,
division_id,
division_name,
manager_id,
manager_name
FROM aggregated_data
),
-- Create a Full Status-Dimension Combination
status_dimension_mapping AS (
SELECT
d.*,
s.status_category
FROM distinct_dimensions d
CROSS JOIN distinct_statuses s
),
-- Merge Actual Record Counts While Retaining All Statuses
data_with_defaults AS (
SELECT
sdm.*,
COALESCE(ad.record_count, 0) AS record_count
FROM status_dimension_mapping sdm
LEFT JOIN aggregated_data ad
ON sdm.cycle_month = ad.cycle_month
AND sdm.region_id = ad.region_id
AND sdm.department = ad.department
AND sdm.segment = ad.segment
AND sdm.team_id = ad.team_id
AND sdm.division_id = ad.division_id
AND sdm.manager_id = ad.manager_id
AND sdm.status_category = ad.status_category
),
-- Assign Processed Count Across All Statuses
output AS (
SELECT
*,
MAX(CASE WHEN status_category = 'Processed in Current Cycle' THEN record_count END)
OVER (PARTITION BY cycle_month, region_id, department, segment,
team_id, division_id, manager_id
)
AS processed_count
FROM data_with_defaults
)
SELECT * FROM output;
Share
Improve this question
edited Mar 15 at 8:29
marc_s
756k184 gold badges1.4k silver badges1.5k bronze badges
asked Mar 15 at 0:22
blue thunderblue thunder
32 bronze badges
1
- Please share sample input, expected sample output.minimal-reproducible-example – samhita Commented Mar 15 at 9:46
1 Answer
Reset to default 0This is based on assumption due to limited information about sample input data and columns.
Based on my understanding you need a view showing count of each status for a group of dimension,alongside count of processed status for the same group of dimensions.
If the understanding correct, you can just find count of each status category and find count of processed category like below .
SELECT
pd.cycle_month,
pd.region_id,
pd.region_name,
pd.department,
pd.segment,
pd.team_id,
pd.team_name,
pd.division_id,
pd.division_name,
pd.manager_id,
pd.manager_name,
pd.status_category,
COUNT(DISTINCT pd.record_id) AS status_count,
MAX(CASE WHEN pd.status_category = 'Processed' THEN COUNT(DISTINCT pd.record_id) END)
OVER (PARTITION BY pd.cycle_month, pd.region_id, pd.department, pd.segment,
pd.team_id, pd.division_id, pd.manager_id) AS processed_count
FROM
processed_data_table pd
GROUP BY
pd.cycle_month, pd.region_id, pd.region_name, pd.department, pd.segment,
pd.team_id, pd.team_name, pd.division_id, pd.division_name,
pd.manager_id, pd.manager_name, pd.status_category;