pandas - Why do `string[pyarrow]` and `pd.ArrowDtype(pa.string())` behave differently even though they look like the same dtype

I get the following unexpected behaviour when round-tripping a string[pyarrow] column via parquet:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'A': ['a', 'b', 'c']}).astype('string[pyarrow]')

Looking at the dtypes, I get string[pyarrow], as expected:

>>> df.dtypes
A    string[pyarrow]
dtype: object

But if I round-trip this dataframe via parquet:

>>> df.to_parquet('~/tmp/test1.parquet')
>>> df_read = pd.read_parquet('~/tmp/test1.parquet')

the dtypes change:

>>> df_read.dtypes
A    string[python]
dtype: object

However, if I follow exactly the same process but use .astype(pd.ArrowDtype(pa.string())) instead of .astype('string[pyarrow]'), the round-trip via parquet leaves the dtypes untouched:

>>> df2 = pd.DataFrame({'A': ['a', 'b', 'c']}).astype(pd.ArrowDtype(pa.string()))
>>> df2.dtypes
A    string[pyarrow]
dtype: object

>>> df2.to_parquet('~/tmp/test2.parquet')
>>> df2_read = pd.read_parquet('~/tmp/test2.parquet')
>>> df2_read.dtypes
A    string[pyarrow]
dtype: object

  • What explains this behaviour?
  • How can I tell the two seemingly identical dtypes apart?
  • What are the downsides to always using pd.ArrowDtype(pa.string())?

There is a related unanswered question which suggests that pd.ArrowDtype(pa.string()) doesn't have .str support.