I'm looking to get column coverage with the fewest amount of rows possible. The table is a list of products with quite a few categorisation columns. I want the rows to cover all the columns that have a 1 in them (some columns may from time to time have no 1s in), all rows and columns will either contain a 1 or 0 value.
ID | cat1 | cat2 | cat3 | cat4 |
---|---|---|---|---|
123 | 1 | 1 | 0 | 0 |
324 | 0 | 0 | 0 | 0 |
555 | 1 | 0 | 0 | 0 |
665 | 0 | 1 | 1 | 0 |
4455 | 0 | 1 | 1 | 0 |
6756 | 0 | 0 | 1 | 0 |
6563 | 0 | 0 | 0 | 1 |
22222 | 0 | 1 | 1 | 1 |
I'm looking to get column coverage with the fewest amount of rows possible. The table is a list of products with quite a few categorisation columns. I want the rows to cover all the columns that have a 1 in them (some columns may from time to time have no 1s in), all rows and columns will either contain a 1 or 0 value.
ID | cat1 | cat2 | cat3 | cat4 |
---|---|---|---|---|
123 | 1 | 1 | 0 | 0 |
324 | 0 | 0 | 0 | 0 |
555 | 1 | 0 | 0 | 0 |
665 | 0 | 1 | 1 | 0 |
4455 | 0 | 1 | 1 | 0 |
6756 | 0 | 0 | 1 | 0 |
6563 | 0 | 0 | 0 | 1 |
22222 | 0 | 1 | 1 | 1 |
The table at present, is roughly 15k rows and 35 columns. When I say the fewest, it doesn't have to be the absolute fewest, if we have a row or 2 extra on top of optimal I'm ok with that. The output from the table above would be:
ID | cat1 | cat2 | cat3 | cat4 |
---|---|---|---|---|
123 | 1 | 1 | 0 | 0 |
22222 | 0 | 1 | 1 | 1 |
the variant of SQL i'm using is T-SQL.
My thinking was to take the first row that covers the most columns and then loop through on the remaining columns required with the remaining rows and repeat until all columns are accounted for. columns with no 1 value in all the rows would be removed prior to the process.
I've tried using CTEs, but it gets very long and it's not dynamic, so is not going to be very repeatable.
I've loaded an example of the table here if it's of any use: https://tmpfiles./21721172/product_category_binary_file.csv
I have loaded some of the data into db-fiddle here https://www.db-fiddle/f/3wSz4fXaecU4gbrkQnG5da/21
Share edited Mar 4 at 14:10 honestmule asked Mar 4 at 12:10 honestmulehonestmule 294 bronze badges 11- 5 Please add a minimal reproducible example. That includes table structure, example data, desired result and attempts to solve the problem as text and not as images. Ideally you can additionally (!) create a runable fiddle for instance at dbfiddle.uk or db-fiddle – derpirscher Commented Mar 4 at 12:21
- 1 While waiting for your question to be reopened, you can check what I believe resolves your problem under a fiddle (using the old values, but this should work equally well with the new ones). – Guillaume Outters Commented Mar 4 at 14:10
- 1 @Aaron, you have duplicates (e.g. 21680) in your fiddle. And do we agree that we are on SQL Server, not MySQL as your fiddle has been created with? Anyway, after deduplication of entries, the query runs on your data. … But maybe your fiddle could be more concise (ideally all the corner cases you'd like to test should be present in the hereabove examples, so I'd say the fiddle should not contain more than a few dozen lines. But surely not 7000 entries!: the latency discourages many to try new queries). – Guillaume Outters Commented Mar 4 at 14:41
- 1 @JNevill column coverage means, when i have selected x rows, at least one row has a 1 in each column. 555 and 665 would not satisfy this condition on their own you'd need 6563 or 22222 as well – honestmule Commented Mar 4 at 14:47
- 1 Got it! Thank you for the clarification. This feels a bit like a knapsack problem, but with a little extra flair for the slotting aspect. I would solve this with a procedural language rather than SQL since it likely will need some iteration to solve effectively. Perhaps a recursive CTE might be an option, but that feels like it would get ugly quickly. – JNevill Commented Mar 4 at 14:54
1 Answer
Reset to default 1Pivoting and computing max's per category
Instead of iteratively or recursively computing row by row or category by category,
let's imagine we want to select, for each category, the "wealthiest" ids it reaches:
- wealthiest: which has many categories
- it reaches: which has the currently studied category
This will mechanically discard IDs with less categories.
Particularly, if B's categories are a subset of A's (let's say B has categories 1 and 2, and A has 1, 2, 3), for each category of B, A has the same category (thus they are in competition to be elected for that category), but as A has 1 more category (is wealthiest), it will take over B in the competition for this category.
Thus after having tried every category, B can claim none:
only A appears in the results.
Now let's transpose this reasoning to SQL:
with
cats as
(
select id, cat
from t unpivot (val for cat in (f_prodImage,f_award,f_awards,f_ballot /* list all category columns of t here */)) p
where val = 1
),
-- For each id, count how many categories it covers.
n as (select id, count(1) n from cats group by id),
-- For each category, find which ids (having this category) have the maximum count of categories.
catmax as
(
select cat, max(n) nmax
from cats join n on n.id = cats.id
group by cat
),
-- Now select every cat / id tuple, for ids that are a maximum for this category.
maxs as
(
select n.id, string_agg(catmax.cat, ', ') within group (order by catmax.cat) maxforcat
from catmax join cats on cats.cat = catmax.cat join n on n.id = cats.id and n.n = catmax.nmax
group by n.id
)
-- Now just select entries
select *
from t join maxs on maxs.id = t.id
order by len(maxforcat) desc, maxforcat;
Which can be seen on a small test dataset
or on a big, complete one.