python - Pandas Join Two Series Based on Conditions

I have a dataframe with employees' employment information, and I am trying to combine it with another dataframe that has their Employee ID.

df

Name             SSN
Doe, John A      XXXX-XX-1234
Doe, Jane B      XXXX-XX-9876
Test, Example    XXXX-XX-0192

Employee_Info

First_Name    Last_Name            SSN     EmployeeID
      John          Doe    999-45-1234             12
      JANE          DOE    999-45-9876             13
   Example         Test    999-45-0192             14

My desired output is:

Name             SSN          EmployeeID
Doe, John A      XXX-XX-1234          12
Doe, Jane B      XXX-XX-9876          13
Test, Example    XXX-XX-0192          14

The df dataframe actually has the SSN masked except for the last 4 characters. Here is the code I have currently:

df['SSN_Last_4'] = df['SSN'].str[-4:]
Employee_Info['SSN_Last_4'] = Employee_Info['SSN'].str[-4:]
df2 = pd.merge(df, Employee_Info, on='SSN_Last_4', how='left')

However, because some employees might share the same last 4 digits of their SSN, I also need to match on name. The caveat is that Name in df is the employee's full name (which might include a middle initial), and the case might differ between the two dataframes. My original idea was to split Name on the comma, drop the middle initial, convert all the name columns to lowercase, and join on those as well. However, I feel there may be better ways to join the data.
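The idea described above (split Name on the comma, drop the middle initial, normalize case, then join on name plus the last 4 SSN digits) can be sketched as follows. This is only a minimal sketch built on the sample rows from the question; the helper column names `last_key`, `first_key`, and `ssn4` are made up for illustration, and it assumes first names are a single word.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Doe, John A", "Doe, Jane B", "Test, Example"],
    "SSN": ["XXXX-XX-1234", "XXXX-XX-9876", "XXXX-XX-0192"],
})
Employee_Info = pd.DataFrame({
    "First_Name": ["John", "JANE", "Example"],
    "Last_Name": ["Doe", "DOE", "Test"],
    "SSN": ["999-45-1234", "999-45-9876", "999-45-0192"],
    "EmployeeID": [12, 13, 14],
})

# Hypothetical helper column names, used only as join keys.
keys = ["last_key", "first_key", "ssn4"]

# Split "Last, First M" into the last name and the remainder, then keep
# only the first word of the remainder (assumption: single-word first
# names, so anything after it is a middle initial).
parts = df["Name"].str.split(",", n=1, expand=True)
left = df.assign(
    last_key=parts[0].str.strip().str.casefold(),
    first_key=parts[1].str.strip().str.split().str[0].str.casefold(),
    ssn4=df["SSN"].str[-4:],
)

# Build the same keys on the lookup table and keep only what is needed,
# so the unmasked SSN column does not collide with the masked one.
right = Employee_Info.assign(
    last_key=Employee_Info["Last_Name"].str.strip().str.casefold(),
    first_key=Employee_Info["First_Name"].str.strip().str.casefold(),
    ssn4=Employee_Info["SSN"].str[-4:],
)[keys + ["EmployeeID"]]

result = left.merge(right, on=keys, how="left").drop(columns=keys)
print(result)
```

With `how="left"`, rows of df that find no match keep a missing EmployeeID instead of being dropped, which makes mismatches easy to spot.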

asked Mar 17 at 15:54 by Bijan

2 Answers

Another possible solution (here df1 is the question's df and df2 is Employee_Info):

pd.merge(
    df2.assign(
        First_Name=df2['First_Name'].str.upper(), 
        Last_Name=df2['Last_Name'].str.upper()), 
    pd.concat([
        df1, 
        df1['Name'].str.replace(r'\s\D$', '', regex=True)
        .str.upper().str.split(', ', expand=True)], axis=1), 
    right_on=[1, 0], left_on=['First_Name', 'Last_Name'], 
    suffixes=['', '_y'])[df1.columns.to_list() + ['EmployeeID']]

It first modifies df2 using assign to create uppercase versions of the First_Name and Last_Name columns. Then, it constructs an extended version of df1 using concat, where the Name column is processed with str.replace() to remove any trailing single-character initials and then split into separate first and last names using str.split(expand=True). The merge() function is then applied, aligning the transformed name columns (First_Name and Last_Name) with the corresponding split names from df1, using right_on=[1, 0] and left_on=['First_Name', 'Last_Name']. Finally, the output retains only the columns from df1, along with the EmployeeID column.
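To make the name handling easier to follow, the two string steps can be run on their own. This is just an illustrative breakdown using the sample names from the question; the intermediate variable names `no_initial` and `name_parts` are not part of the answer's code.

```python
import pandas as pd

# Sample names copied from the question.
df1 = pd.DataFrame({"Name": ["Doe, John A", "Doe, Jane B", "Test, Example"]})

# Step 1: remove a trailing whitespace character followed by exactly one
# non-digit character, i.e. a middle initial such as " A".
no_initial = df1["Name"].str.replace(r"\s\D$", "", regex=True)

# Step 2: uppercase, then split on ", " into two columns labelled 0 and 1.
# Column 0 holds the last name and column 1 the first name, which is why
# the merge uses right_on=[1, 0] against ['First_Name', 'Last_Name'].
name_parts = no_initial.str.upper().str.split(", ", expand=True)
print(name_parts)
```

Note that `\D` matches any non-digit character, so a name ending in a space plus a single punctuation mark would be trimmed as well.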

The following update also matches on the last four digits of the SSN, to handle the case the OP mentioned in a comment:

pd.merge(
    df2.assign(
        First_Name=df2['First_Name'].str.upper(), 
        Last_Name=df2['Last_Name'].str.upper(), 
        aux=df2['SSN'].str.extract(r'.*\-(\d{4})$', expand=False)),
    pd.concat([
        df1.assign(aux=df1['SSN'].str.extract(r'.*\-(\d{4})$', expand=False)), 
        df1['Name'].str.replace(r'\s\D$', '', regex=True)
        .str.upper().str.split(', ', expand=True)], axis=1), 
    right_on=[1, 0, 'aux'], left_on=['First_Name', 'Last_Name', 'aux'], 
    suffixes=['', '_y'])[df1.columns.to_list() + ['EmployeeID']]

Output:

            Name          SSN  EmployeeID
0    Doe, John A  999-45-1234          12
1    Doe, Jane B  999-45-9876          13
2  Test, Example  999-45-0192          14

One option is to take the last four digits of the SSN, build normalized "last, first" names from Employee_Info, extract those names from the full names in df, then merge on both keys:

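import re

# Normalized "last, first" names built from Employee_Info (right side of the merge)
employees = (Employee_Info['Last_Name']+', '+Employee_Info['First_Name']).str.casefold()
pat = f"({'|'.join(map(re.escape, employees.unique()))})"
# The same normalized names, extracted from the full names in df (left side)
employees_info = df['Name'].str.casefold().str.extract(pat, expand=False)

# Each key list must come from the dataframe on the matching side
out = df[['Name']].merge(Employee_Info,
                         left_on=[employees_info, df['SSN'].str[-4:]],
                         right_on=[employees, Employee_Info['SSN'].str[-4:]],
                         how='left'
                        )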

Output:

           key_0 key_1           Name First_Name Last_Name          SSN  EmployeeID
0      doe, john  1234    Doe, John A       John       Doe  999-45-1234          12
1      doe, jane  9876    Doe, Jane B       JANE       DOE  999-45-9876          13
2  test, example  0192  Test, Example    Example      Test  999-45-0192          14
