最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

python - Pandas Join Two Series Based on Conditions - Stack Overflow

programmeradmin2浏览0评论

I have a dataframe that has information about employee's employment info and I am trying to combine with another dataframe that has their Employee ID #.

df

Name             SSN
Doe, John A      XXXX-XX-1234
Doe, Jane B      XXXX-XX-9876
Test, Example    XXXX-XX-0192

Employee_Info

First_Name    Last_Name            SSN     EmployeeID
      John          Doe    999-45-1234             12
      JANE          DOE    999-45-9876             13
   Example         Test    999-45-0192             14

My desired output is:

Name             SSN          EmployeeID
Doe, John A      XXX-XX-1234          12
Doe, Jane B      XXX-XX-9876          13
Test, Example    XXX-XX-0192          14

The df dataframe actually has the SSN masked except for the last 4 characters. Here is the code I have currently:

df['SSN_Last_4'] = df['SSN4'].str[-4:]
Employee_Info['SSN_Last_4'] = Employee_Info['SSN'].str[-4:]
df2 = pd.merge(df, Employee_Info, on='SSN', how='left')

However because some employees might have the same last 4 digits of SSN, I need to also match based on name. However the caveat is that the Name in df is the employee fullname (which might include middle initial) and the case might be different. My original idea was to split the Name on , and drop middle initial, and then convert all the name columns to be lowercase and modify the join. However I feel that there are better methods to join the data.

I have a dataframe that has information about employee's employment info and I am trying to combine with another dataframe that has their Employee ID #.

df

Name             SSN
Doe, John A      XXXX-XX-1234
Doe, Jane B      XXXX-XX-9876
Test, Example    XXXX-XX-0192

Employee_Info

First_Name    Last_Name            SSN     EmployeeID
      John          Doe    999-45-1234             12
      JANE          DOE    999-45-9876             13
   Example         Test    999-45-0192             14

My desired output is:

Name             SSN          EmployeeID
Doe, John A      XXX-XX-1234          12
Doe, Jane B      XXX-XX-9876          13
Test, Example    XXX-XX-0192          14

The df dataframe actually has the SSN masked except for the last 4 characters. Here is the code I have currently:

df['SSN_Last_4'] = df['SSN4'].str[-4:]
Employee_Info['SSN_Last_4'] = Employee_Info['SSN'].str[-4:]
df2 = pd.merge(df, Employee_Info, on='SSN', how='left')

However because some employees might have the same last 4 digits of SSN, I need to also match based on name. However the caveat is that the Name in df is the employee fullname (which might include middle initial) and the case might be different. My original idea was to split the Name on , and drop middle initial, and then convert all the name columns to be lowercase and modify the join. However I feel that there are better methods to join the data.

Share Improve this question asked Mar 17 at 15:54 BijanBijan 8,72221 gold badges102 silver badges162 bronze badges
Add a comment  | 

2 Answers 2

Reset to default 1

Another possible solution:

pd.merge(
    df2.assign(
        First_Name=df2['First_Name'].str.upper(), 
        Last_Name=df2['Last_Name'].str.upper()), 
    pd.concat([
        df1, 
        df1['Name'].str.replace(r'\s\D$', '', regex=True)
        .str.upper().str.split(', ', expand=True)], axis=1), 
    right_on=[1, 0], left_on=['First_Name', 'Last_Name'], 
    suffixes=['', '_y'])[df1.columns.to_list() + ['EmployeeID']]

It first modifies df2 using assign to create uppercase versions of the First_Name and Last_Name columns. Then, it constructs an extended version of df1 using concat, where the Name column is processed with str.replace() to remove any trailing single-character initials and then split into separate first and last names using str.split(expand=True). The merge() function is then applied, aligning the transformed name columns (First_Name and Last_Name) with the corresponding split names from df1, using right_on=[1, 0] and left_on=['First_Name', 'Last_Name']. Finally, the output retains only the columns from df1, along with the EmployeeID column.


The following updates the solution to contemplate the case mentioned by the OP in a comment below:

pd.merge(
    df2.assign(
        First_Name=df2['First_Name'].str.upper(), 
        Last_Name=df2['Last_Name'].str.upper(), 
        aux=df2['SSN'].str.extract(r'.*\-(\d{4})$')),
    pd.concat([
        df1.assign(aux=df2['SSN'].str.extract(r'.*\-(\d{4})$')), 
        df1['Name'].str.replace(r'\s\D$', '', regex=True)
        .str.upper().str.split(', ', expand=True)], axis=1), 
    right_on=[1, 0, 'aux'], left_on=['First_Name', 'Last_Name', 'aux'], 
    suffixes=['', '_y'])[df1.columns.to_list() + ['EmployeeID']]

Output:

            Name          SSN  EmployeeID
0    Doe, John A  999-45-1234          12
1    Doe, Jane B  999-45-9876          13
2  Test, Example  999-45-0192          14

One option could be to extract the last digits of the SSN, rework the names and extract from the full names, then merge:

import re

employees = (Employee_Info['Last_Name']+', '+Employee_Info['First_Name']).str.casefold()
pat = f"({'|'.join(map(re.escape, employees.unique()))})"
employees_info = df['Name'].str.lower().str.extract(pat, expand=False)

out = df[['Name']].merge(Employee_Info,
                         left_on=[employees, df['SSN'].str[-4:]],
                         right_on=[employees_info, Employee_Info['SSN'].str[-4:]],
                         how='left'
                        )

Output:

           key_0 key_1           Name First_Name Last_Name          SSN  EmployeeID
0      doe, john  1234    Doe, John A       John       Doe  999-45-1234          12
1      doe, jane  9876    Doe, Jane B       JANE       DOE  999-45-9876          13
2  test, example  0192  Test, Example    Example      Test  999-45-0192          14
发布评论

评论列表(0)

  1. 暂无评论