
python - how to preserve (unknown) datatypes when creating a df from existing dfs with partially differing column names


I want to create a summary file from a few dozen original files, of which I only extract the last row (df.tail(1)) of each. The original files have 1000s of columns, which might be int, float, str, or bool.

My problem is that the column names sometimes differ somewhat (which is fine, I just have to retain the information); the majority of the column names are the same. When using pd.concat on these dfs, NaNs are created for every column that does not exist in all files, which changes the datatype of the int columns.

df1 = pd.DataFrame([(2, 3, True, "Test")], columns=["x", "y", "z", "a"])
df2 = pd.DataFrame([(4, 6)], columns=["x", "b"])

df3 = pd.concat([df1, df2], ignore_index=True)

--> df3:

   x    y     z     a    b
0  2  3.0  True  Test  NaN
1  4  NaN   NaN   NaN  6.0
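Checking the dtypes makes the upcast visible: y and b were int in their source frames but come back as float64 because of the introduced NaNs, and z loses its bool dtype:

print(df3.dtypes)
# x      int64
# y    float64
# z     object
# a     object
# b    float64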

But datatype preservation is necessary for further file handling (ints have to stay ints)!

I tried adding the information from the subsequent dfs by using dicts:

df3 = df1.copy(deep=True)

res_dict = {"x": 4, "y": 6}
df3.loc[len(df3.index)] = res_dict

but this doesn't preserve the datatypes either.

The only solution I found so far is to add an all-empty row when adding the values of a new original file, accessing cells explicitly, and, if a new column has to be added, to also create that column all-empty before writing the information to the specific cell:

for loop_no, key in enumerate(res_dict.keys()):
    if loop_no == 0:
        # on the first key, append a new all-empty row to df3
        row_no = len(df3)
        df3.loc[row_no] = [""] * len(df3.columns)

    if key in df3.columns:
        df3.loc[row_no, key] = res_dict[key]
    else:
        # column does not exist yet: create it all-empty first,
        # then fill the single cell
        df3[key] = ""
        df3.loc[row_no, key] = res_dict[key]

But this is really tedious and, with the amount of data I have to assemble, not exactly efficient.

If I use convert_dtypes() as suggested, I receive a warning:

pandas\core\dtypes\cast.py:1080: RuntimeWarning: invalid value encountered in cast
  if (arr.astype(int) == arr).all()

for every column which is not "int". Can this be simply ignored?
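If the values themselves are correct and only the internal cast check is firing, one option is to silence that specific warning around the call; a minimal sketch using the standard warnings module:

import warnings

with warnings.catch_warnings():
    # suppress only RuntimeWarning (raised by convert_dtypes'
    # int-compatibility check), leaving other warnings visible
    warnings.simplefilter("ignore", RuntimeWarning)
    df3 = pd.concat([df1, df2], ignore_index=True).convert_dtypes()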

Also, I wonder whether I will run into this:

FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
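The FutureWarning itself describes the workaround: drop empty or all-NA frames before concatenating. A minimal sketch of that filter, where frames stands in for the list of single-row dfs collected from the original files:

frames = [df1, df2]  # placeholder for the collected single-row dfs
# keep only frames that are non-empty and contain at least one non-NA value
to_concat = [df for df in frames if not df.empty and not df.isna().all().all()]
df3 = pd.concat(to_concat, ignore_index=True)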



1 Answer


You can first call convert_dtypes to ensure nullable dtypes will be used:

df3 = pd.concat([df1.convert_dtypes(),
                 df2.convert_dtypes()],
                ignore_index=True
               )

# or after
df3 = pd.concat([df1, df2], ignore_index=True).convert_dtypes()

Output:

   x     y     z     a     b
0  2     3  True  Test  <NA>
1  4  <NA>  <NA>  <NA>     6
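The dtypes of this result should be the nullable pandas dtypes (Int64, boolean, string), which keep the integers intact despite the missing values:

print(df3.dtypes)
# x      Int64
# y      Int64
# z    boolean
# a     string
# b      Int64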

Note, however, that this only handles the NAs; if you have a mix of integers and floats in a column, for instance, the integers will be converted to float.

If you want to keep integers even in this case, then force everything to be objects:

df3 = pd.concat([df1.astype(object), df2.astype(object)], ignore_index=True)
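With everything as object, the column dtypes are no longer informative, but each cell keeps its original Python type, e.g.:

print(type(df3.at[0, "x"]))  # <class 'int'>  -- still an integer
print(type(df3.at[0, "z"]))  # <class 'bool'>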
