
python - How Can I Match Percentile Results in Postgres and Pandas? - Stack Overflow


I am making calculations in the database and want to validate results against Pandas' calculations.

I want to calculate the 25th, 50th and 75th percentile. The end use is a statistical calculation so percentile_cont() is the Postgres function I'm using.

My query:

SELECT percentile_cont(0.25) WITHIN GROUP (order by obs1) as q1
     , percentile_cont(0.5) WITHIN GROUP (order by obs1) as med
     , percentile_cont(0.75) WITHIN GROUP (order by obs1) as q3
FROM my_table;

It returns q1=73.99, med=74.0 and q3=74.0085 as the result.

However, when I do the calculations in Pandas, the q3 value is different. From what I found online, the consensus is that the interpolation='linear' argument in Pandas makes its calculation method match that of percentile_cont().

import pandas

# 'my_connection_info' is a placeholder for a real connection object or URI
dataframe = pandas.read_sql(sql='SELECT obs1 FROM my_table', con='my_connection_info')
dataframe = dataframe.sort_values(by='obs1')
percentiles = dataframe.quantile(q=[0.25, 0.5, 0.75],
                                 axis=0,
                                 numeric_only=False,
                                 interpolation='linear',
                                 method='single')

The results are:

        obs1
0.25  73.992
0.50  74.000
0.75  74.006

I'm confused because only the 75th percentile is off. q1 and q2 look like a rounding issue, but q3 is simply off.

When I work through the calculation by hand (25 sorted values * 0.75 = position 18.75) and look at the sorted values below, the 18th and 19th positions (index values 13 and 17) are both 74.006.

I do not see how Postgres gets the q3 result that it does, nor how to get the results from Postgres and Pandas to match for testing purposes.

Update 1 per answer suggestions from @Jose Luis Dioncio:

  1. Added ::numeric, no change to the results.
  2. The column datatype is float8
  3. Confirmed count of data; SELECT COUNT (obs1) FROM my_table; = 25 and len(dataframe) = 25.
  4. End use is dictating the use of percentile_cont(). As a test I tried percentile_disc() instead and it returned 74.009. Still strikes me as a strange result.
  5. Confirming the ordering inside the Postgres and Pandas methods is beyond me, but based on what I can observe it looks like it is working because q1 and q2 match.

What should I try next?

Data directly from my database using DBeaver interface and SELECT obs1 FROM my_table;:

      obs1
24  73.982
12  73.983
18  73.984
7   73.985
2   73.988
20  73.988
4   73.992
10  73.994
16  73.994
1   73.995
6   73.995
9   73.998
15  74.000
19  74.000
3   74.002
11  74.004
21  74.004
13  74.006
17  74.006
8   74.008
5   74.009
22  74.010
14  74.012
23  74.015
0   74.030
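
The hand calculation described earlier can be cross-checked with a few lines of plain Python. This is a minimal sketch, assuming the 25 values listed above are the complete column and that both systems use the linear-interpolation rule at zero-based position (n - 1) * q, which is how percentile_cont() and interpolation='linear' are documented to behave. The names OBS1 and percentile_linear are illustrative and not part of either library.

```python
import math

# The 25 obs1 values as posted above (assumed to be the full column).
OBS1 = [
    73.982, 73.983, 73.984, 73.985, 73.988, 73.988, 73.992, 73.994, 73.994,
    73.995, 73.995, 73.998, 74.000, 74.000, 74.002, 74.004, 74.004, 74.006,
    74.006, 74.008, 74.009, 74.010, 74.012, 74.015, 74.030,
]


def percentile_linear(values, q):
    """Linear-interpolation percentile at zero-based position (n - 1) * q."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q   # fractional zero-based position
    lo = math.floor(pos)           # index of the value just below the position
    hi = math.ceil(pos)            # index of the value just above the position
    frac = pos - lo                # how far the position sits between lo and hi
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


for q in (0.25, 0.5, 0.75):
    print(q, percentile_linear(OBS1, q))
```

With 25 values the positions are 24 * 0.25 = 6, 24 * 0.5 = 12 and 24 * 0.75 = 18, all whole numbers, so no interpolation takes place and the sketch returns 73.992, 74.0 and 74.006, the same as the Pandas output. Note that this rule uses (n - 1) * q rather than n * q = 18.75. Under these assumptions, a different result from the database would suggest that the rows feeding the query differ from the rows listed here, rather than that the formula differs.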

Asked by Python_Learner

1 Answer

The discrepancy in the 75th percentile is likely due to a mismatch in data precision, hidden decimals, or the order of the data. I recommend verifying your data first: confirm that the number of rows in PostgreSQL matches Pandas, and check the stored values at higher precision, because PostgreSQL may hold decimal places that your client does not display, which would affect interpolation. You can retrieve the values to six decimal places in PostgreSQL with this query:

SELECT obs1::numeric(10,6) FROM my_table ORDER BY obs1;
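
The same "hidden decimals" check can be done on the Python side once the column has been loaded. The following is a minimal sketch, assuming the values are expected to have three decimal places as in the posted sample; the function name values_with_extra_decimals and the sample list are hypothetical illustrations, not rows from the table.

```python
def values_with_extra_decimals(values, places=3):
    """Return the values that change when rounded to `places` decimals."""
    return [v for v in values if round(v, places) != v]


# Illustrative values only: two with three decimals, one with four.
sample = [73.992, 74.006, 74.0085]
print(values_with_extra_decimals(sample))  # [74.0085]
```

An empty result for the real column would suggest that extra decimal places are not the cause of the mismatch.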

I also recommend confirming that both systems analyze the same number of rows, that is, checking for a row count mismatch.

SELECT COUNT (obs1) FROM my_table; -- Should return 25

If it does not return 25, look for extra rows in the database that are not shown in your sample data.

Additionally, Pandas and PostgreSQL may not handle trailing decimal places in the same way, so you can cast explicitly in PostgreSQL to make the precision consistent before the percentile is computed:

SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY obs1::numeric(10,6)) FROM my_table;

If you want to avoid interpolation altogether, use percentile_disc() in PostgreSQL, like this:

SELECT percentile_disc(0.75) WITHIN GROUP (ORDER BY obs1) FROM my_table;
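
For a comparison on the Python side, here is a rough sketch of the rule that percentile_disc() is documented to follow: return the first value in the sorted set whose position equals or exceeds the requested fraction. It assumes the 25 values posted in the question are the complete column; the names OBS1 and percentile_disc_py are illustrative and this is not PostgreSQL's own implementation.

```python
import math

# The 25 obs1 values as posted in the question (assumed to be the full column).
OBS1 = [
    73.982, 73.983, 73.984, 73.985, 73.988, 73.988, 73.992, 73.994, 73.994,
    73.995, 73.995, 73.998, 74.000, 74.000, 74.002, 74.004, 74.004, 74.006,
    74.006, 74.008, 74.009, 74.010, 74.012, 74.015, 74.030,
]


def percentile_disc_py(values, q):
    """First sorted value whose one-based rank / n is at least q."""
    ordered = sorted(values)
    rank = max(math.ceil(q * len(ordered)), 1)  # one-based row number
    return ordered[rank - 1]


print(percentile_disc_py(OBS1, 0.75))
```

On the posted values, 0.75 * 25 = 18.75 rounds up to the 19th row, so the sketch returns 74.006. If the database returns something else for the same fraction, comparing the rows the query actually reads against this list would be a reasonable next step.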

Hope it works!
