I get the following unexpected behaviour when round-tripping string[pyarrow] via parquet:
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({'A': ['a', 'b', 'c']}).astype('string[pyarrow]')
Looking at the dtypes, I get string[pyarrow] as expected:
>>> df.dtypes
A string[pyarrow]
dtype: object
But if I round-trip this dataframe via parquet
>>> df.to_parquet('~/tmp/test1.parquet')
>>> df_read = pd.read_parquet('~/tmp/test1.parquet')
The dtypes change
>>> df_read.dtypes
A string[python]
dtype: object
However, if I follow exactly the same process but use .astype(pd.ArrowDtype(pa.string())) instead of .astype('string[pyarrow]'), the round-trip via parquet leaves the dtypes untouched!
>>> df2 = pd.DataFrame({'A': ['a', 'b', 'c']}).astype(pd.ArrowDtype(pa.string()))
>>> df2.dtypes
A string[pyarrow]
dtype: object
>>> df2.to_parquet('~/tmp/test2.parquet')
>>> df2_read = pd.read_parquet('~/tmp/test2.parquet')
>>> df2_read.dtypes
A string[pyarrow]
dtype: object
- What explains this behaviour?
- How can I tell the two seemingly identical dtypes apart?
- What are the downsides to always using pd.ArrowDtype(pa.string())?
There is a related unanswered question which suggests that pd.ArrowDtype(pa.string()) doesn't have .str support.