I'm trying to bucket my data into deciles, but not in the traditional sense where the dimension is the basis of the decile.
I have 463 unique it_scores ranging from 316-900 (my dimension) with 1,296,070 trade_counts (my measure) total. Using the following code breaks my data into 10 buckets with 47 unique it_scores:
ntile(10) over (order by it_score)) as tileno
While this is definitely doing what it's supposed to, I need my buckets to be built on the basis of total trade_counts, with each bucket containing about 129.6k observations. The it_score is still the dimension but the ranges wouldn't necessarily be equal i.e. decile 10 might have a range of 316-688 with 129.6k observations while decile 9 might be 689-712 also with 129.6k observations.
How would I achieve that?
I'm trying to bucket my data into deciles, but not in the traditional sense where the dimension is the basis of the decile.
I have 463 unique it_scores ranging from 316-900 (my dimension) with 1,296,070 trade_counts (my measure) total. Using the following code breaks my data into 10 buckets with 47 unique it_scores:
ntile(10) over (order by it_score)) as tileno
While this is definitely doing what it's supposed to, I need my buckets to be built on the basis of total trade_counts, with each bucket containing about 129.6k observations. The it_score is still the dimension but the ranges wouldn't necessarily be equal i.e. decile 10 might have a range of 316-688 with 129.6k observations while decile 9 might be 689-712 also with 129.6k observations.
How would I achieve that?
Share Improve this question asked Nov 15, 2024 at 21:05 A. OliA. Oli 431 silver badge6 bronze badges 3 |1 Answer
Reset to default 0SUM(trade_count) OVER (ORDER BY it_score)
to assign deciles based on cumulative trade_counts
.
SELECT
decile,
SUM(trade_count) AS decile_trade_count
FROM
(
SELECT
it_score,
trade_count,
FLOOR(
(SUM(trade_count) OVER (ORDER BY it_score) - 1) / (SUM(trade_count) OVER ()) * 10
) + 1 AS decile
FROM table
) sub
GROUP BY decile
ORDER BY decile;
sql
andhive
and instead use the tagsalgorithm
anddistribution
and write a more precise description. This may or may not include that you want to get the standard deviation of the buckets' trade_counts as low as possible. – Thorsten Kettner Commented Nov 16, 2024 at 16:08