I have two Excel worksheets, one of which ("edit") is a slightly modified version of the other ("base"). I want to figure out if any columns have been added, deleted, or moved. I have loaded the worksheets into dataframes and tried to correlate the two frames, but I get an unhelpful error, which I assume is due to my being lax about checking cell value types.
import pandas as pd

base = pd.read_excel(base_path, engine="openpyxl", sheet_name=name, header=None)
edit = pd.read_excel(edit_path, engine="openpyxl", sheet_name=name, header=None)
print(base.to_string())
#=>        0     1     2     3     4     5     6     7
#=> 0    NaN  snip  blip  twig  zorp  plum  glim  frap
#=> 1    qux    10    10     9    11     9    10    10
#=> 2    baz    20    18    19    20    20    20    18
#=> 3    bat    12    11    12    11    11    12    12
#=> 4    zot    15    15    16    14    16    14    14
#=> 5    wib    11    11     9     9    10    10    11
#=> 6    fiz    16    16    18    17    18    18    16
#=> 7    woz    19    18    17    19    17    18    17
#=> 8    lug    13    12    12    12    11    12    13
#=> 9    vim    13    14    12    14    12    13    13
#=> 10   nub    18    17    18    16    16    17    18
#=> 11  sums   147   142   142   143   140   144   142
print(edit.to_string())
#=>        0     1     2     3     4      5      6     7     8    9   10   11
#=> 0    0.7  snip  blip  twig  zorp   plum  glim2  glim  frap  NaN  NaN  NaN
#=> 1    qux    10    10     9    11      9     10    10    10  NaN  NaN  NaN
#=> 2    baz    20    18    19    20     20     21    20    18  NaN  NaN  1.2
#=> 3    bat    12    11    12    11     11     12    12    12  NaN  NaN  NaN
#=> 4    zot    15    15    16    14     16     17    14    14  NaN  NaN  NaN
#=> 5    wib    11    11     9     9   61.6     10    10    11  NaN  NaN  NaN
#=> 6    fiz    16    16    18    17     18     18    19    16  NaN  NaN  NaN
#=> 7    woz    19    18    17    19     17     18    18    17  NaN  NaN  NaN
#=> 8    lug    13    12    12    12     11     12    12    13  NaN  NaN  NaN
#=> 9    vim    13    14    12    14     12     13    13    13  NaN  NaN  NaN
#=> 10   nub    18    17    18    16     16     17    17    18  NaN  NaN  NaN
#=> 11  sums   147   131   142   150  191.6    148   145   142  NaN  NaN  NaN
corr = base.corrwith(edit, axis=0)
This gives the following error:
Traceback (most recent call last):
File "/Users/phrogz/xlsxdiff/tmp.py", line 18, in <module>
corr = base.corrwith(edit, axis=0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 11311, in corrwith
ldem = left - left.mean(numeric_only=numeric_only)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 11693, in mean
result = super().mean(axis, skipna, numeric_only, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/generic.py", line 12420, in mean
return self._stat_function(
^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/generic.py", line 12377, in _stat_function
return self._reduce(
^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 11562, in _reduce
res = df._mgr.reduce(blk_func)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 1500, in reduce
nbs = blk.reduce(func)
^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 404, in reduce
result = func(self.values)
^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 11481, in blk_func
return op(values, axis=axis, skipna=skipna, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/nanops.py", line 147, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/nanops.py", line 404, in new_func
result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/nanops.py", line 719, in nanmean
the_sum = values.sum(axis, dtype=dtype_sum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/numpy/_core/_methods.py", line 53, in _sum
return umr_sum(a, axis, dtype, out, keepdims, initial, where)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Is there a way to use the DataFrame correlation calculation, or am I going to need to roll my own?
FWIW, this test data is simplified. There is not always a single header row of unique strings; there may be many rows representing the header, so I cannot use a single row to determine a unique identifier. Moreover, I next want to do the same for rows, where there are (again) possibly many columns used as row headers. And, as shown, the cell values may have changed slightly.
2 Answers
We need to read the data correctly so that headers and indices are placed properly, rather than being treated as part of the data. Pandas will then align columns by their names and calculate the correlation for corresponding pairs:
base = pd.read_excel(base_path, header=0, index_col=0)
edit = pd.read_excel(edit_path, header=0, index_col=0,
                     usecols='A:I')  # alternatively, skip drafts outside the table
print(base.corrwith(edit))
# snip 1.000000
# blip 0.999975
# twig 1.000000
# zorp 0.999990
# plum 0.955042
# glim 0.999973
# frap 1.000000
# glim2 NaN
# dtype: float64
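To get from this correlation series back to the question's actual goal (added, deleted, or moved columns), the aligned column labels can also be compared directly. The following is a small follow-up sketch of mine, not part of the original answer, and the variable names are made up:
added   = edit.columns.difference(base.columns)   # here: just 'glim2'
deleted = base.columns.difference(edit.columns)   # here: empty
moved   = [col for col in base.columns.intersection(edit.columns)
           if base.columns.get_loc(col) != edit.columns.get_loc(col)]
# note: inserting 'glim2' shifts the positions of 'glim' and 'frap', so both
# show up as "moved" even though only the insertion actually happened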
In the case of multiple rows representing the header, we can pass a list of header row indexes, like header=[0, 1]:
# add a new level and save base as Excel
(
    base
    .T
    .assign(new_level=[*'ABCDEFG'])
    .set_index('new_level', append=True)
    .T
    .to_excel('base_multi_header.xlsx')
)
# read multi header base from Excel
base = pd.read_excel('base_multi_header.xlsx', header=[0, 1], index_col=0)
When applying corrwith, we have to ensure that both frames have the same number of column levels and the same level names:
# add new level to edited data
edit = pd.read_excel('edit.xlsx', header=0, index_col=0, usecols='A:I')
edit = (
    edit
    .T
    .assign(level2=[*'ABCDEFFG'])
    .set_index('level2', append=True)
    .T
)
# be careful with level names, they should be the same for both frames;
# I used "level2" as a new level name for `edit`, which is
# as if you changed the `A2` cell of the Excel doc before reading the data;
# we have to check or just rename the levels before moving on,
# otherwise `corrwith` fails
if base.columns.names != edit.columns.names:
    edit = edit.rename_axis(
        columns=dict(zip(edit.columns.names, base.columns.names)))
# see the correlation
print(edit.corrwith(base))
# new_level
# snip A 1.000000
# blip B 0.999975
# twig C 1.000000
# zorp D 0.999990
# plum E 0.955042
# glim F 0.999973
# frap G 1.000000
# glim2 F NaN
# dtype: float64
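Not part of the answer above, but since the question also asks about doing the same for rows: assuming base and edit are the single-header frames read at the top of this answer, the same label-alignment idea should carry over by passing axis=1, which correlates each pair of rows across the columns the two frames share:
# pandas aligns the frames on their common row labels (qux, baz, ...) and
# common columns, then correlates row against row
row_corr = base.corrwith(edit, axis=1)
print(row_corr)
# rows whose cells changed (e.g. wib and sums) should come out with a
# noticeably lower correlation than the untouched rows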
One ~solution I found was to convert every string value into a unique integer. This allowed the correlation calculation to "succeed"…but it was not useful for my goal of figuring out which columns in base went with which columns in edit, as the correlation results were a single vector of either 1.0 or NaN values. :p
# Get rid of the NaNs
base.fillna(-1.0, inplace=True)
edit.fillna(-1.0, inplace=True)
# Create a map of every string in the data frames to a unique integer
def generate_string_map(*dfs):
    alldata = pd.concat(dfs, ignore_index=True)
    strings = alldata.select_dtypes(include=['object']).stack().unique()
    strings = [s for s in strings if isinstance(s, str)]
    return {o: i+100000 for i, o in enumerate(strings)}
mapping = generate_string_map(base, edit)
def hash_strings(val):
    return mapping.get(val, val)
base_nums = base.map(hash_strings)
edit_nums = edit.map(hash_strings)
print(base_nums.to_string())
#=>            0       1       2       3       4       5       6       7
#=> 0       -1.0  100000  100001  100002  100003  100004  100005  100006
#=> 1   100007.0      10      10       9      11       9      10      10
#=> 2   100008.0      20      18      19      20      20      20      18
#=> 3   100009.0      12      11      12      11      11      12      12
#=> 4   100010.0      15      15      16      14      16      14      14
#=> 5   100011.0      11      11       9       9      10      10      11
#=> 6   100012.0      16      16      18      17      18      18      16
#=> 7   100013.0      19      18      17      19      17      18      17
#=> 8   100014.0      13      12      12      12      11      12      13
#=> 9   100015.0      13      14      12      14      12      13      13
#=> 10  100016.0      18      17      18      16      16      17      18
#=> 11  100017.0     147     142     142     143     140     144     142
print(edit_nums.to_string())
#=>            0       1       2       3       4         5       6       7       8     9    10    11
#=> 0        0.7  100000  100001  100002  100003  100004.0  100018  100005  100006  -1.0  -1.0  -1.0
#=> 1   100007.0      10      10       9      11       9.0      10      10      10  -1.0  -1.0  -1.0
#=> 2   100008.0      20      18      19      20      20.0      21      20      18  -1.0  -1.0   1.2
#=> 3   100009.0      12      11      12      11      11.0      12      12      12  -1.0  -1.0  -1.0
#=> 4   100010.0      15      15      16      14      16.0      17      14      14  -1.0  -1.0  -1.0
#=> 5   100011.0      11      11       9       9      61.6      10      10      11  -1.0  -1.0  -1.0
#=> 6   100012.0      16      16      18      17      18.0      18      19      16  -1.0  -1.0  -1.0
#=> 7   100013.0      19      18      17      19      17.0      18      18      17  -1.0  -1.0  -1.0
#=> 8   100014.0      13      12      12      12      11.0      12      12      13  -1.0  -1.0  -1.0
#=> 9   100015.0      13      14      12      14      12.0      13      13      13  -1.0  -1.0  -1.0
#=> 10  100016.0      18      17      18      16      16.0      17      17      18  -1.0  -1.0  -1.0
#=> 11  100017.0     147     131     142     150     191.6     148     145     142  -1.0  -1.0  -1.0
corr = base_nums.corrwith(edit_nums, axis=0)
print(corr)
#=> 0 1.0
#=> 1 1.0
#=> 2 1.0
#=> 3 1.0
#=> 4 1.0
#=> 5 1.0
#=> 6 1.0
#=> 7 1.0
#=> 8 NaN
#=> 9 NaN
#=> 10 NaN
#=> 11 NaN
#=> dtype: float64
I then tried manually calculating the full N^2 set of correlations between each column in base and edit, and found that the correlations are still useless for my end goal: nearly every data column shows a 1.0 correlation with every other. (In hindsight that makes sense: the 100000+ codes substituted for the strings are such extreme outliers relative to the small data values that they dominate each column's variance and drag every pairwise correlation toward ±1.) :/
# Compute the correlation between each pair of columns
base_nums.reset_index(drop=True, inplace=True)
edit_nums.reset_index(drop=True, inplace=True)
corr_matrix = pd.DataFrame(index=base_nums.columns, columns=edit_nums.columns)
for bcol in base_nums.columns:
    for ecol in edit_nums.columns:
        corr_value = base_nums[bcol].corr(edit_nums[ecol])
        corr_matrix.loc[bcol, ecol] = corr_value
print(corr_matrix)
#=>            0          1          2          3          4  ...          7          8    9   10         11
#=> 0        1.0  -0.999999  -0.999999  -0.999999  -0.999999  ...  -0.999999  -0.999999  NaN  NaN   0.090865
#=> 1  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090982
#=> 2  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090994
#=> 3  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090983
#=> 4  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090974
#=> 5  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090969
#=> 6  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090976
#=> 7  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090994
I'm still interested in knowing whether pandas provides a better way to correlate columns than this, but it seems that what I really need is not to lean on a correlation heuristic to decide which columns are the same, but rather a heuristic that finds or calculates an identifier that is stable for each column and row.
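As a rough sketch of that idea (my own experiment, not an established pandas feature; the helper names below are invented): use the string cells of each column as its identifier, and match columns between the two frames on that fingerprint. The same approach should work for rows by running it on base.T and edit.T.
def column_fingerprints(df):
    # map each column label to the tuple of string cells it contains
    # (typically the header rows), ignoring the numeric data
    return {col: tuple(v for v in df[col] if isinstance(v, str))
            for col in df.columns}

def match_columns(base, edit):
    # for each base column, list the edit columns with an identical, non-empty
    # string fingerprint: an empty list suggests a deleted or renamed column,
    # edit columns that never appear are candidates for "added", and matches
    # at a different position are candidates for "moved"
    base_fp = column_fingerprints(base)
    edit_fp = column_fingerprints(edit)
    return {bcol: [ecol for ecol, efp in edit_fp.items() if efp and efp == bfp]
            for bcol, bfp in base_fp.items()}

# with the sample data above (read with header=None), this should pair base
# column 6 ('glim') with edit column 7 and leave edit column 6 ('glim2') unmatched
print(match_columns(base, edit))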