hive - Creating deciles in SQL - Stack Overflow

I'm trying to bucket my data into deciles, but not in the traditional sense where the dimension is the basis of the decile.

I have 463 unique it_scores ranging from 316 to 900 (my dimension), with 1,296,070 trade_counts (my measure) in total. The following code breaks my data into 10 buckets of roughly 46 to 47 unique it_scores each:

ntile(10) over (order by it_score) as tileno

While this is definitely doing what it's supposed to, I need my buckets to be built on the basis of total trade_counts, with each bucket containing about 129.6k observations. The it_score is still the dimension, but the ranges wouldn't necessarily be equal. For example, decile 10 might cover the range 316-688 with 129.6k observations, while decile 9 might cover 689-712, also with 129.6k observations.

How would I achieve that?

Asked Nov 15, 2024 at 21:05 by A. Oli. 3 comments:
  • Please read: Why should I provide a Minimal Reproducible Example, even for a very simple SQL query? – MatBailie Commented Nov 16, 2024 at 10:47
  • That may become complicated. It seems you want 10 buckets of adjacent it_scores whose average trade_count is 129.6k. Suppose it_score 316 and 317 each have a trade_count of 86.4k. Then you must decide whether it_score 316 gets its own bucket, 43.2k below average, or shares a bucket with it_score 317, giving a bucket 43.2k above average. Depending on how the counts are distributed across the other adjacent it_scores, one or the other decision may give you a more even distribution of counts across the buckets. – Thorsten Kettner Commented Nov 16, 2024 at 15:58
  • If I have understood this correctly, you are looking for an algorithm (that may then get coded in SQL). You may hence want to remove the tags sql and hive and instead use the tags algorithm and distribution and write a more precise description. This may or may not include that you want to get the standard deviation of the buckets' trade_counts as low as possible. – Thorsten Kettner Commented Nov 16, 2024 at 16:08
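
The trade-off described in the second comment can be checked with a few lines of arithmetic. This is only a sketch of that one example: the helper name deviation is hypothetical, and the figures are the ones quoted in the comment (86.4k per score, 129.6k target).

```python
TARGET = 129_600  # average trade_count per bucket, about 129.6k as in the question


def deviation(bucket_counts, target=TARGET):
    """Return how far a candidate bucket's total is from the target size."""
    return sum(bucket_counts) - target


# Figures from the comment: it_score 316 and 317 carry 86.4k trades each.
count_316 = 86_400
count_317 = 86_400

alone = deviation([count_316])              # bucket holds only it_score 316
shared = deviation([count_316, count_317])  # bucket holds 316 and 317

print(alone, shared)  # -43200 43200
```

Both options miss the target by the same amount, so in this example the choice has to be settled by how the remaining it_scores fall into the other buckets.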

1 Answer

Use SUM(trade_count) OVER (ORDER BY it_score) as a running total, and assign each it_score to a decile based on its cumulative share of all trade_counts:

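SELECT
  decile,
  SUM(trade_count) AS decile_trade_count
FROM
  (
    SELECT
      it_score,
      trade_count,
      -- Running total in it_score order, divided by the grand total and
      -- scaled to 1..10. Subtracting 1 keeps a running total that lands
      -- exactly on a boundary in the lower decile.
      FLOOR(
        (SUM(trade_count) OVER (ORDER BY it_score) - 1) / (SUM(trade_count) OVER ()) * 10
      ) + 1 AS decile
    FROM your_table  -- placeholder: replace with the actual table name
  ) sub
GROUP BY decile
ORDER BY decile;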
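
To see how the formula in the query above behaves, here is a minimal Python sketch of the same arithmetic on a small, made-up data set. The function name assign_deciles and the sample counts are hypothetical, and the sketch assumes one row per it_score. It uses exact integer division in place of FLOOR over a ratio, which gives the same bucket without floating-point rounding.

```python
from itertools import accumulate


def assign_deciles(rows, buckets=10):
    """Assign each (it_score, trade_count) row a bucket from 1 to `buckets`.

    Mirrors FLOOR((running_total - 1) / grand_total * buckets) + 1,
    computed with integer arithmetic.
    """
    ordered = sorted(rows)  # order by it_score
    total = sum(count for _, count in ordered)
    running = accumulate(count for _, count in ordered)
    result = []
    for (score, count), cum in zip(ordered, running):
        decile = (cum - 1) * buckets // total + 1
        result.append((score, count, decile))
    return result


# Hypothetical sample: 10 scores, 1,000 trades in total, heavily skewed.
sample = [
    (316, 400), (317, 100), (318, 100), (319, 100), (320, 100),
    (321, 50), (322, 50), (323, 50), (324, 25), (325, 25),
]

for score, count, decile in assign_deciles(sample):
    print(score, count, decile)
```

In this sample it_score 316 alone carries 40% of all trades, so it lands in decile 4 and deciles 1 to 3 stay empty. Each of deciles 5 to 10 ends up with 100 trades. The formula places every it_score by where its running total ends and never splits a score across buckets, so buckets are only close to equal when no single score is large relative to one tenth of the total.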