I have two Excel worksheets, one of which ("edit") is a slightly modified version of the other ("base"). I want to figure out if any columns have been added, deleted, or moved. I have loaded the worksheets into dataframes and tried to correlate the two frames, but I get an unhelpful error, which I assume is due to my being lax about checking cell value types.
import pandas as pd

base = pd.read_excel(base_path, engine="openpyxl", sheet_name=name, header=None)
edit = pd.read_excel(edit_path, engine="openpyxl", sheet_name=name, header=None)
print(base.to_string())
#=>        0     1     2     3     4     5     6     7
#=> 0    NaN  snip  blip  twig  zorp  plum  glim  frap
#=> 1    qux    10    10     9    11     9    10    10
#=> 2    baz    20    18    19    20    20    20    18
#=> 3    bat    12    11    12    11    11    12    12
#=> 4    zot    15    15    16    14    16    14    14
#=> 5    wib    11    11     9     9    10    10    11
#=> 6    fiz    16    16    18    17    18    18    16
#=> 7    woz    19    18    17    19    17    18    17
#=> 8    lug    13    12    12    12    11    12    13
#=> 9    vim    13    14    12    14    12    13    13
#=> 10   nub    18    17    18    16    16    17    18
#=> 11  sums   147   142   142   143   140   144   142
print(edit.to_string())
#=>        0     1     2     3     4      5      6     7     8    9   10   11
#=> 0    0.7  snip  blip  twig  zorp   plum  glim2  glim  frap  NaN  NaN  NaN
#=> 1    qux    10    10     9    11      9     10    10    10  NaN  NaN  NaN
#=> 2    baz    20    18    19    20     20     21    20    18  NaN  NaN  1.2
#=> 3    bat    12    11    12    11     11     12    12    12  NaN  NaN  NaN
#=> 4    zot    15    15    16    14     16     17    14    14  NaN  NaN  NaN
#=> 5    wib    11    11     9     9   61.6     10    10    11  NaN  NaN  NaN
#=> 6    fiz    16    16    18    17     18     18    19    16  NaN  NaN  NaN
#=> 7    woz    19    18    17    19     17     18    18    17  NaN  NaN  NaN
#=> 8    lug    13    12    12    12     11     12    12    13  NaN  NaN  NaN
#=> 9    vim    13    14    12    14     12     13    13    13  NaN  NaN  NaN
#=> 10   nub    18    17    18    16     16     17    17    18  NaN  NaN  NaN
#=> 11  sums   147   131   142   150  191.6    148   145   142  NaN  NaN  NaN
corr = base.corrwith(edit, axis=0)
This gives the following error:
Traceback (most recent call last):
File "/Users/phrogz/xlsxdiff/tmp.py", line 18, in <module>
corr = base.corrwith(edit, axis=0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 11311, in corrwith
ldem = left - left.mean(numeric_only=numeric_only)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 11693, in mean
result = super().mean(axis, skipna, numeric_only, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/generic.py", line 12420, in mean
return self._stat_function(
^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/generic.py", line 12377, in _stat_function
return self._reduce(
^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 11562, in _reduce
res = df._mgr.reduce(blk_func)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 1500, in reduce
nbs = blk.reduce(func)
^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 404, in reduce
result = func(self.values)
^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 11481, in blk_func
return op(values, axis=axis, skipna=skipna, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/nanops.py", line 147, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/nanops.py", line 404, in new_func
result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/pandas/core/nanops.py", line 719, in nanmean
the_sum = values.sum(axis, dtype=dtype_sum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/phrogz/.venv/lib/python3.11/site-packages/numpy/_core/_methods.py", line 53, in _sum
return umr_sum(a, axis, dtype, out, keepdims, initial, where)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Is there a way to use the DataFrame correlation calculation, or am I going to need to roll my own?
FWIW, this test data is simplified. There is not always a single header row of unique strings; there may be many rows representing the header, so I cannot use a single row to determine a unique identifier. Moreover, I next want to do the same for rows, where there are (again) possibly many columns used as row headers. And, as shown, the cell values may have changed slightly.
2 Answers
We need to read the data correctly so that headers and indices are placed properly, rather than being treated as part of the data. Pandas will then align columns by their names and calculate the correlation for corresponding pairs:
base = pd.read_excel(base_path, header=0, index_col=0)
edit = pd.read_excel(edit_path, header=0, index_col=0,
                     usecols='A:I')  # alternatively, skip drafts outside the table
print(base.corrwith(edit))
# snip 1.000000
# blip 0.999975
# twig 1.000000
# zorp 0.999990
# plum 0.955042
# glim 0.999973
# frap 1.000000
# glim2 NaN
# dtype: float64
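To get from this correlation series back to the question's actual goal (added, deleted, or moved columns), the aligned column labels can also be compared directly. The following is a small follow-up sketch of mine, not part of the original answer, and the variable names are made up:
added   = edit.columns.difference(base.columns)   # here: just 'glim2'
deleted = base.columns.difference(edit.columns)   # here: empty
moved   = [col for col in base.columns.intersection(edit.columns)
           if base.columns.get_loc(col) != edit.columns.get_loc(col)]
# note: inserting 'glim2' shifts the positions of 'glim' and 'frap', so both
# show up as "moved" even though only the insertion actually happened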
In the case of multiple rows representing the header, we can pass a list of header row indexes, like header=[0, 1]:
# add a new level and save base as Excel
(
    base
    .T
    .assign(new_level=[*'ABCDEFG'])
    .set_index('new_level', append=True)
    .T
    .to_excel('base_multi_header.xlsx')
)
# read multi header base from Excel
base = pd.read_excel('base_multi_header.xlsx', header=[0, 1], index_col=0)
When applying corrwith, we have to ensure that both frames have the same number of column levels and the same level names:
# add new level to edited data
edit = pd.read_excel('edit.xlsx', header=0, index_col=0, usecols='A:I')
edit = (
    edit
    .T
    .assign(level2=[*'ABCDEFFG'])
    .set_index('level2', append=True)
    .T
)
# be careful with level names, they should be the same for both frames;
# I used "level2" as a new level name for `edit`, which is
# as if you changed the `A2` cell of the Excel doc before reading the data;
# we have to check or just rename the levels before moving on,
# otherwise `corrwith` fails
if base.columns.names != edit.columns.names:
    edit = edit.rename_axis(
        columns=dict(zip(edit.columns.names, base.columns.names)))
# see the correlation
print(edit.corrwith(base))
# new_level
# snip A 1.000000
# blip B 0.999975
# twig C 1.000000
# zorp D 0.999990
# plum E 0.955042
# glim F 0.999973
# frap G 1.000000
# glim2 F NaN
# dtype: float64
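Not part of the answer above, but since the question also asks about doing the same for rows: assuming base and edit are the single-header frames read at the top of this answer, the same label-alignment idea should carry over by passing axis=1, which correlates each pair of rows across the columns the two frames share:
# pandas aligns the frames on their common row labels (qux, baz, ...) and
# common columns, then correlates row against row
row_corr = base.corrwith(edit, axis=1)
print(row_corr)
# rows whose cells changed (e.g. wib and sums) should come out with a
# noticeably lower correlation than the untouched rows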
One ~solution I found was to convert every string value into a unique integer. This allowed the correlation calculation to "succeed"…but it was not useful for my goal of figuring out which columns in base went with which columns in edit, as the correlation results were a single vector of either 1.0 or NaN values. :p
# Get rid of the NaNs
base.fillna(-1.0, inplace=True)
edit.fillna(-1.0, inplace=True)
# Create a map of every string in the data frames to a unique integer
def generate_string_map(*dfs):
    alldata = pd.concat(dfs, ignore_index=True)
    strings = alldata.select_dtypes(include=['object']).stack().unique()
    strings = [s for s in strings if isinstance(s, str)]
    return {o: i+100000 for i, o in enumerate(strings)}
mapping = generate_string_map(base, edit)
def hash_strings(val):
    return mapping.get(val, val)
base_nums = base.map(hash_strings)
edit_nums = edit.map(hash_strings)
print(base_nums.to_string())
#=>            0       1       2       3       4       5       6       7
#=> 0       -1.0  100000  100001  100002  100003  100004  100005  100006
#=> 1   100007.0      10      10       9      11       9      10      10
#=> 2   100008.0      20      18      19      20      20      20      18
#=> 3   100009.0      12      11      12      11      11      12      12
#=> 4   100010.0      15      15      16      14      16      14      14
#=> 5   100011.0      11      11       9       9      10      10      11
#=> 6   100012.0      16      16      18      17      18      18      16
#=> 7   100013.0      19      18      17      19      17      18      17
#=> 8   100014.0      13      12      12      12      11      12      13
#=> 9   100015.0      13      14      12      14      12      13      13
#=> 10  100016.0      18      17      18      16      16      17      18
#=> 11  100017.0     147     142     142     143     140     144     142
print(edit_nums.to_string())
#=>            0       1       2       3       4         5       6       7       8     9    10    11
#=> 0        0.7  100000  100001  100002  100003  100004.0  100018  100005  100006  -1.0  -1.0  -1.0
#=> 1   100007.0      10      10       9      11       9.0      10      10      10  -1.0  -1.0  -1.0
#=> 2   100008.0      20      18      19      20      20.0      21      20      18  -1.0  -1.0   1.2
#=> 3   100009.0      12      11      12      11      11.0      12      12      12  -1.0  -1.0  -1.0
#=> 4   100010.0      15      15      16      14      16.0      17      14      14  -1.0  -1.0  -1.0
#=> 5   100011.0      11      11       9       9      61.6      10      10      11  -1.0  -1.0  -1.0
#=> 6   100012.0      16      16      18      17      18.0      18      19      16  -1.0  -1.0  -1.0
#=> 7   100013.0      19      18      17      19      17.0      18      18      17  -1.0  -1.0  -1.0
#=> 8   100014.0      13      12      12      12      11.0      12      12      13  -1.0  -1.0  -1.0
#=> 9   100015.0      13      14      12      14      12.0      13      13      13  -1.0  -1.0  -1.0
#=> 10  100016.0      18      17      18      16      16.0      17      17      18  -1.0  -1.0  -1.0
#=> 11  100017.0     147     131     142     150     191.6     148     145     142  -1.0  -1.0  -1.0
corr = base_nums.corrwith(edit_nums, axis=0)
print(corr)
#=> 0 1.0
#=> 1 1.0
#=> 2 1.0
#=> 3 1.0
#=> 4 1.0
#=> 5 1.0
#=> 6 1.0
#=> 7 1.0
#=> 8 NaN
#=> 9 NaN
#=> 10 NaN
#=> 11 NaN
#=> dtype: float64
I then tried manually calculating the full N^2 set of correlations between each column in base and edit, and found that the correlations are still useless for my end goal: nearly every data column shows a 1.0 correlation with every other. (In hindsight that makes sense: the 100000+ codes substituted for the strings are such extreme outliers relative to the small data values that they dominate each column's variance and drag every pairwise correlation toward ±1.) :/
# Compute the correlation between each pair of columns
base_nums.reset_index(drop=True, inplace=True)
edit_nums.reset_index(drop=True, inplace=True)
corr_matrix = pd.DataFrame(index=base_nums.columns, columns=edit_nums.columns)
for bcol in base_nums.columns:
    for ecol in edit_nums.columns:
        corr_value = base_nums[bcol].corr(edit_nums[ecol])
        corr_matrix.loc[bcol, ecol] = corr_value
print(corr_matrix)
#=>            0          1          2          3          4  ...          7          8    9   10         11
#=> 0        1.0  -0.999999  -0.999999  -0.999999  -0.999999  ...  -0.999999  -0.999999  NaN  NaN   0.090865
#=> 1  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090982
#=> 2  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090994
#=> 3  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090983
#=> 4  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090974
#=> 5  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090969
#=> 6  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090976
#=> 7  -0.999999        1.0        1.0        1.0        1.0  ...        1.0        1.0  NaN  NaN  -0.090994
I'm still interested in knowing whether pandas provides a better way to correlate columns than this, but it seems that what I really need is not to lean on a correlation heuristic to decide which columns are the same, but rather a heuristic that finds or calculates an identifier that is stable for each column and row.
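As a rough sketch of that idea (my own experiment, not an established pandas feature; the helper names below are invented): use the string cells of each column as its identifier, and match columns between the two frames on that fingerprint. The same approach should work for rows by running it on base.T and edit.T.
def column_fingerprints(df):
    # map each column label to the tuple of string cells it contains
    # (typically the header rows), ignoring the numeric data
    return {col: tuple(v for v in df[col] if isinstance(v, str))
            for col in df.columns}

def match_columns(base, edit):
    # for each base column, list the edit columns with an identical, non-empty
    # string fingerprint: an empty list suggests a deleted or renamed column,
    # edit columns that never appear are candidates for "added", and matches
    # at a different position are candidates for "moved"
    base_fp = column_fingerprints(base)
    edit_fp = column_fingerprints(edit)
    return {bcol: [ecol for ecol, efp in edit_fp.items() if efp and efp == bfp]
            for bcol, bfp in base_fp.items()}

# with the sample data above (read with header=None), this should pair base
# column 6 ('glim') with edit column 7 and leave edit column 6 ('glim2') unmatched
print(match_columns(base, edit))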