I am making calculations in the database and want to validate results against Pandas' calculations.
I want to calculate the 25th, 50th and 75th percentiles. The end use is a statistical calculation, so percentile_cont() is the Postgres function I'm using.
My query:
SELECT percentile_cont(0.25) WITHIN GROUP (order by obs1) as q1
, percentile_cont(0.5) WITHIN GROUP (order by obs1) as med
, percentile_cont(0.75) WITHIN GROUP (order by obs1) as q3
FROM my_table;
It returns q1 = 73.99, med = 74.0 and q3 = 74.0085 as the result.
However, when I do the calculations in Pandas, the q3 value is different. I searched, and the consensus online is that the interpolation='linear' argument in Pandas causes the calculation method to match that of percentile_cont().
import pandas

# 'my_connection_info' is a placeholder for the real connection
dataframe = pandas.read_sql(sql='SELECT obs1 FROM my_table', con='my_connection_info')
dataframe = dataframe.sort_values(by='obs1')
percentiles = dataframe.quantile(q=[0.25, 0.5, 0.75],
                                 axis=0,
                                 numeric_only=False,
                                 interpolation='linear',
                                 method='single')
The results are:
        obs1
0.25  73.992
0.50  74.000
0.75  74.006
I'm confused because only the 75th percentile is off. q1 and med (q2) look like a rounding issue, but q3 is simply off.
When I do my calculation method (25 sorted values * 0.75 = 18.75 position) and look at the sorted values below, the 18th and 19th positions (index values 13 and 17) are both 74.006.
I do not see how Postgres gets the q3 result that it does, nor how to get the results from Postgres and Pandas to match for testing purposes.
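For reference, the hand calculation can be written out explicitly. This is only a minimal sketch: it assumes the linear method places the percentile at the 0-based position fraction * (n - 1) in the sorted values, which is how percentile_cont() and interpolation='linear' are generally understood to work. Under that assumption the 75th percentile of 25 values sits at position 0.75 * 24 = 18, not 18.75. The function name linear_percentile is made up for illustration, and the values are the 25 sample values posted below.

```python
import math

# The 25 obs1 values from the sample data in this post, in sorted order.
OBS1 = [
    73.982, 73.983, 73.984, 73.985, 73.988,
    73.988, 73.992, 73.994, 73.994, 73.995,
    73.995, 73.998, 74.000, 74.000, 74.002,
    74.004, 74.004, 74.006, 74.006, 74.008,
    74.009, 74.010, 74.012, 74.015, 74.030,
]


def linear_percentile(values, fraction):
    """Percentile by linear interpolation (hypothetical helper, not a library call)."""
    ordered = sorted(values)
    # Assumed rule: 0-based fractional position is fraction * (n - 1).
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        # The position lands exactly on a row, so no interpolation is needed.
        return ordered[lower]
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


q1 = linear_percentile(OBS1, 0.25)
med = linear_percentile(OBS1, 0.5)
q3 = linear_percentile(OBS1, 0.75)
print(q1, med, q3)
```

On these 25 values the three positions (6, 12 and 18) are whole numbers, so each result is simply one of the listed values and no interpolation takes place.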
Update 1 per answer suggestions from @Jose Luis Dioncio:
- Added ::numeric, no change to the results.
- The column datatype is float8.
- Confirmed count of data: SELECT COUNT (obs1) FROM my_table; = 25 and len(dataframe) = 25.
- End use is dictating the use of percentile_cont(). As a test I tried percentile_disc() instead and it returned 74.009. Still strikes me as a strange result.
- Confirming the ordering inside the Postgres and Pandas methods is beyond me, but based on what I can observe it looks like it is working because q1 and q2 match.
What should I try next?
Data directly from my database, using the DBeaver interface and SELECT obs1 FROM my_table;:
obs1
24 73.982
12 73.983
18 73.984
7 73.985
2 73.988
20 73.988
4 73.992
10 73.994
16 73.994
1 73.995
6 73.995
9 73.998
15 74.000
19 74.000
3 74.002
11 74.004
21 74.004
13 74.006
17 74.006
8 74.008
5 74.009
22 74.010
14 74.012
23 74.015
0 74.030
asked yesterday by Python_Learner
1 Answer
The discrepancy in the 75th percentile is likely due to a mismatch in data precision, hidden decimals, or the order of the data. I recommend verifying your data: confirm that the number of rows in PostgreSQL matches Pandas, and check the values at a fixed precision, because PostgreSQL may store values with more decimal places than are displayed, which affects interpolation. You can retrieve the values to six decimal places in PostgreSQL with this query:
SELECT obs1::numeric(10,6) FROM my_table ORDER BY obs1;
I also recommend confirming that both systems analyze the same dataset size; in other words, check for a row count mismatch:
SELECT COUNT (obs1) FROM my_table; -- Should return 25
If it does not return 25, check for extra rows in the database that are not shown in your sample data.
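A row count alone will not show whether the values themselves agree, so it may help to compare them directly. A minimal sketch, assuming you have already pulled the column into two plain Python lists (for example one from a database cursor and one from dataframe['obs1'].tolist()); the function name diff_values and the sample numbers are made up for illustration:

```python
from collections import Counter


def diff_values(db_values, df_values, ndigits=6):
    """Return the values that occur more often on one side than on the other.

    Hypothetical helper: values are rounded to `ndigits` before comparing so
    that tiny float noise does not register as a difference.
    """
    db_counts = Counter(round(v, ndigits) for v in db_values)
    df_counts = Counter(round(v, ndigits) for v in df_values)
    # Counter subtraction keeps only the positive counts.
    only_in_db = dict(db_counts - df_counts)
    only_in_df = dict(df_counts - db_counts)
    return only_in_db, only_in_df


# Made-up example: the database side has one extra 74.009.
db_values = [73.982, 73.983, 74.009, 74.009]
df_values = [73.983, 73.982, 74.009]

only_in_db, only_in_df = diff_values(db_values, df_values)
print(only_in_db)  # {74.009: 1}
print(only_in_df)  # {}
```

If both dictionaries come back empty, the two sides hold the same values at that precision, and the difference would have to come from somewhere else.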
Additionally, Pandas and PostgreSQL may not handle decimal places in the same way, with Pandas sometimes being the stricter of the two, so you can use an explicit cast in PostgreSQL to avoid rounding differences:
SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY obs1::numeric(10,6)) FROM my_table;
And if you want to avoid interpolation altogether, use percentile_disc() in PostgreSQL, like this:
SELECT percentile_disc(0.75) WITHIN GROUP (ORDER BY obs1) FROM my_table;
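To cross-check that query outside the database, here is a rough Python sketch. It assumes percentile_disc() returns the first value in the sorted set whose position reaches the requested fraction, that is, the row at 1-based rank ceil(fraction * n). The function name discrete_percentile is made up, and the list holds the 25 sample values from the question.

```python
import math

# The 25 obs1 values from the question's sample data, in sorted order.
OBS1 = [
    73.982, 73.983, 73.984, 73.985, 73.988,
    73.988, 73.992, 73.994, 73.994, 73.995,
    73.995, 73.998, 74.000, 74.000, 74.002,
    74.004, 74.004, 74.006, 74.006, 74.008,
    74.009, 74.010, 74.012, 74.015, 74.030,
]


def discrete_percentile(values, fraction):
    """Percentile without interpolation (hypothetical helper)."""
    ordered = sorted(values)
    # Assumed rule: 1-based rank of the first row whose position reaches the
    # fraction; clamp to 1 so that a fraction of 0 returns the smallest value.
    rank = max(math.ceil(fraction * len(ordered)), 1)
    return ordered[rank - 1]


q3_disc = discrete_percentile(OBS1, 0.75)
print(q3_disc)  # 74.006
```

On the sample values this gives 74.006 rather than the 74.009 reported in the update, which would be consistent with the table holding different data from the sample, though that is only a guess.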
Hope it works!