pandas - Why do `string[pyarrow]` and `pd.ArrowDtype(pa.string())` behave differently even though they look like the same dtype

I get the following unexpected behaviour when round-tripping a string[pyarrow] column via parquet:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'A': ['a', 'b', 'c']}).astype('string[pyarrow]')

Looking at the dtypes, I get string[pyarrow] as expected

>>> df.dtypes
A    string[pyarrow]
dtype: object

But if I round-trip this dataframe via parquet

>>> df.to_parquet('~/tmp/test1.parquet')
>>> df_read = pd.read_parquet('~/tmp/test1.parquet')

The dtypes change

>>> df_read.dtypes
A    string[python]
dtype: object

However, if I follow exactly the same process, but use .astype(pd.ArrowDtype(pa.string())) instead of .astype('string[pyarrow]'), the round-trip via parquet leaves the dtypes untouched!

>>> df2 = pd.DataFrame({'A': ['a', 'b', 'c']}).astype(pd.ArrowDtype(pa.string()))
>>> df2.dtypes
A    string[pyarrow]
dtype: object

>>> df2.to_parquet('~/tmp/test2.parquet')
>>> df2_read = pd.read_parquet('~/tmp/test2.parquet')
>>> df2_read.dtypes
A    string[pyarrow]
dtype: object

What explains this behaviour?
How can I tell the two seemingly identical dtypes apart?
What are the downsides to always using pd.ArrowDtype(pa.string())?

There is a related unanswered question which suggests that pd.ArrowDtype(pa.string()) doesn't have .str support.
