
python - how to preserve (unknown) datatypes when creating a df from existing dfs with partially differing column names


I want to create a summary file from a few dozen original files, of which I only extract the last row (df.tail(1)) of each. The original files have 1000s of columns, which might be int, float, str, or bool.

My problem is that the column names sometimes differ somewhat (which is fine, I just have to retain the information); the majority of the column names are the same. When using pd.concat on these dfs, NaNs are created for every column that does not exist in all files, which changes the datatype of the int columns.

df1 = pd.DataFrame([(2, 3, True, "Test")], columns=["x", "y", "z", "a"])
df2 = pd.DataFrame([(4, 6)], columns=["x", "b"])

df3 = pd.concat([df1, df2], ignore_index=True)

--> df3:

   x    y     z     a    b
0  2  3.0  True  Test  NaN
1  4  NaN   NaN   NaN  6.0
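Checking the dtypes makes the upcast visible: y and b were int in their source frames but come back as float64 because of the introduced NaNs, and z loses its bool dtype:

print(df3.dtypes)
# x      int64
# y    float64
# z     object
# a     object
# b    float64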

But datatype preservation is necessary for further file handling (ints have to stay ints)!

I tried adding the information from the subsequent dfs by using dicts:

df3 = df1.copy(deep=True)

res_dict = {"x": 4, "y": 6}
df3.loc[len(df3.index)] = res_dict

but this doesn't preserve the datatypes either.

The only solution I found so far is to add an all-empty row when adding the values of a new original file, accessing cells explicitly, and, if a new column has to be added, to also create that column all-empty before writing the information to the specific cell:

for loop_no, key in enumerate(res_dict.keys()):
    if loop_no == 0:
        # on the first key, append a new all-empty row to df3
        row_no = len(df3)
        df3.loc[row_no] = [""] * len(df3.columns)

    if key in df3.columns:
        df3.loc[row_no, key] = res_dict[key]
    else:
        # column does not exist yet: create it all-empty first,
        # then fill the single cell
        df3[key] = ""
        df3.loc[row_no, key] = res_dict[key]

But this is really tedious and, with the amount of data I have to assemble, not exactly efficient.

If I use convert_dtypes() as suggested, I receive a warning:

pandas\core\dtypes\cast.py:1080: RuntimeWarning: invalid value encountered in cast
  if (arr.astype(int) == arr).all()

for every column which is not "int". Can this be simply ignored?
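If the values themselves are correct and only the internal cast check is firing, one option is to silence that specific warning around the call; a minimal sketch using the standard warnings module:

import warnings

with warnings.catch_warnings():
    # suppress only RuntimeWarning (raised by convert_dtypes'
    # int-compatibility check), leaving other warnings visible
    warnings.simplefilter("ignore", RuntimeWarning)
    df3 = pd.concat([df1, df2], ignore_index=True).convert_dtypes()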

Also, I wonder whether I will run into this:

FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
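The FutureWarning itself describes the workaround: drop empty or all-NA frames before concatenating. A minimal sketch of that filter, where frames stands in for the list of single-row dfs collected from the original files:

frames = [df1, df2]  # placeholder for the collected single-row dfs
# keep only frames that are non-empty and contain at least one non-NA value
to_concat = [df for df in frames if not df.empty and not df.isna().all().all()]
df3 = pd.concat(to_concat, ignore_index=True)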



1 Answer


You can first call convert_dtypes to ensure nullable dtypes will be used:

df3 = pd.concat([df1.convert_dtypes(),
                 df2.convert_dtypes()],
                ignore_index=True
               )

# or after
df3 = pd.concat([df1, df2], ignore_index=True).convert_dtypes()

Output:

   x     y     z     a     b
0  2     3  True  Test  <NA>
1  4  <NA>  <NA>  <NA>     6
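The dtypes of this result should be the nullable pandas dtypes (Int64, boolean, string), which keep the integers intact despite the missing values:

print(df3.dtypes)
# x      Int64
# y      Int64
# z    boolean
# a     string
# b      Int64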

Note, however, that this only handles the NAs; if you have a mix of integers and floats in a column, for instance, the integers will be converted to float.

If you want to keep integers even in this case, then force everything to be objects:

df3 = pd.concat([df1.astype(object), df2.astype(object)], ignore_index=True)
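With everything as object, the column dtypes are no longer informative, but each cell keeps its original Python type, e.g.:

print(type(df3.at[0, "x"]))  # <class 'int'>  -- still an integer
print(type(df3.at[0, "z"]))  # <class 'bool'>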
