I want to create a summary file of a few dozen original files, of which I only extract the last row (df.tail(1)) of each original file with 1000s of columns, which might be int, float, str, bool.
My problem is that the column names sometimes differ (which is fine, I just have to retain the information), although the majority of column names are the same. When using pd.concat on these dfs, NaNs are created for all columns that do not exist in every file, which changes the datatype of the int columns.
df1 = pd.DataFrame([(2, 3, True, "Test")], columns=["x", "y", "z", "a"])
df2 = pd.DataFrame([(4, 6)], columns=["x", "b"])
df3 = pd.concat([df1, df2], ignore_index=True)
--> df3:
   x    y     z     a    b
0  2  3.0  True  Test  NaN
1  4  NaN   NaN   NaN  6.0
But datatype preservation is necessary for further file handling (ints have to stay ints)!
I tried adding the information from the subsequent dfs by using dicts:
df3 = df1.copy(deep=True)
res_dict = {"x": 4, "y": 6}
df3.loc[len(df3.index)] = res_dict
but this doesn't preserve datatype either.
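For illustration, a minimal sketch (same column names as above) that makes the dtype change visible after the dict-based row assignment:

```python
import pandas as pd

df1 = pd.DataFrame([(2, 3, True, "Test")], columns=["x", "y", "z", "a"])
df3 = df1.copy(deep=True)
res_dict = {"x": 4, "y": 6}
df3.loc[len(df3.index)] = res_dict

# Columns missing from the dict ("z", "a") receive NaN in the new row,
# which silently changes their dtypes (e.g. the bool column is upcast).
print(df3.dtypes)
```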
The only solution I have found so far is to add an all-empty row when adding the values of a new original file, accessing cells explicitly, and, if a new column has to be added, to first create that column all empty before writing the value to the specific cell:
for loop_no, key in enumerate(res_dict.keys()):
    if loop_no == 0:
        row_no = len(df3)
        df3.loc[row_no] = [""] * len(df3.columns)
    if key in df3.columns:
        df3.loc[row_no, key] = res_dict[key]
    else:
        df3[key] = ""
        df3.loc[row_no, key] = res_dict[key]
But this is really tedious, and with the amount of data I have to assemble, not exactly efficient.
If I use convert_dtypes() as suggested, I receive the warning

pandas\core\dtypes\cast.py:1080: RuntimeWarning: invalid value encountered in cast
  if (arr.astype(int) == arr).all()

for every column that is not "int". Can this simply be ignored?
Also I wonder if I will run into: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation
asked Mar 11 by ACM, edited Mar 17

1 Answer
You can first convert_dtypes to ensure nullable dtypes will be used:
df3 = pd.concat([df1.convert_dtypes(),
                 df2.convert_dtypes()],
                ignore_index=True)
# or after
df3 = pd.concat([df1, df2], ignore_index=True).convert_dtypes()
Output:
   x     y     z     a     b
0  2     3  True  Test  <NA>
1  4  <NA>  <NA>  <NA>     6
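A quick check of the resulting dtypes (assuming a recent pandas with nullable dtypes) confirms the integer columns stay integer:

```python
import pandas as pd

df1 = pd.DataFrame([(2, 3, True, "Test")], columns=["x", "y", "z", "a"])
df2 = pd.DataFrame([(4, 6)], columns=["x", "b"])

df3 = pd.concat([df1, df2], ignore_index=True).convert_dtypes()

# x, y and b come out as nullable Int64 rather than float64,
# so the missing entries are <NA> instead of NaN.
print(df3.dtypes)
```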
Note however that this only handles the NAs; if you have a mix of integers and floats within a column, for instance, this will convert the integers to float.
If you want to keep integers even in this case, then force everything to be objects:
df3 = pd.concat([df1.astype(object), df2.astype(object)], ignore_index=True)
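For the original many-files scenario, a sketch of the whole assembly (the file contents below are hypothetical stand-ins; in practice each part would be read from one of the original files): take tail(1) of each file, convert_dtypes it, and concatenate everything once at the end rather than growing the summary row by row:

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-ins for the original files; in practice these
# would be paths passed to pd.read_csv or similar.
files = [
    StringIO("x,y,z,a\n1,2,True,foo\n2,3,True,Test\n"),
    StringIO("x,b\n3,5\n4,6\n"),
]

# Last row of each file, normalised to nullable dtypes, concatenated once.
parts = [pd.read_csv(f).tail(1).convert_dtypes() for f in files]
summary = pd.concat(parts, ignore_index=True)
print(summary)
```

A single concat over the pre-converted parts avoids both the row-by-row loop and the repeated reindexing it causes.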