python - Find the value that is different when a row is marked "left_only" or "right

There are two dataframes: current and base. current is an incremental update to base.

Whenever there is an update made to current, we want to run checks to see if it adheres to rules. If it does, current becomes base.

Updates to column Issr is not expected except empty value. So, there are two checks:

New entry must not add any value to Issr column ('' is accepted)
Existing row's value for column Issr shouldn't be updated (if updated to empty, it is fine). Update to any other column is ok.

base

SNo	Rank	Ctry	Cat	Issr	Ref
1	10	A	Book	Y	100
2	14	B	Laptop	C	101
3	15	C	Pen	J	102
4	50	D	Pen		103

There are two dataframes: current and base. current is an incremental update to base.

Whenever there is an update made to current, we want to run checks to see if it adheres to rules. If it does, current becomes base.

Updates to column Issr is not expected except empty value. So, there are two checks:

New entry must not add any value to Issr column ('' is accepted)
Existing row's value for column Issr shouldn't be updated (if updated to empty, it is fine). Update to any other column is ok.

base

SNo	Rank	Ctry	Cat	Issr	Ref
1	10	A	Book	Y	100
2	14	B	Laptop	C	101
3	15	C	Pen	J	102
4	50	D	Pen		103

current

SNo	Rank	Ctry	Cat	Issr	Ref
1	10	A	Book1	Y	100
2 (updated)	14	B	Laptop		101
3	15	C	Pen	J	102
4 (updated)	50	D	Pen	U	103
5 (new entry)	24	K	Pencil	W	101
6 (new entry)	24	RT	Pencil		201

Above current fails checks because of multiple issues:

Row# 4 in base: Existing row's column Issr got updated to a non-empty values (while existing Row# 2 update is fine as it made the value empty)
Row# 5 is a new entry in current with a non-empty value. This is violation. Row#6 is also new entry but ok since the value for Issr is empty.

import pandas as pd

base = {
    'Rank': [10,14,15,50],
    'Ctry': ['A', 'B', 'C', 'D'],
    'Cat': ['Book', 'Laptop', 'Pen', 'Pen'],
    'Issr': ['Y', 'C', 'J', ''],
    'Ref': ['100', '101', '102', '103']
}

current = {
    'Rank': [10,14,15,50, 24, 24],
    'Ctry': ['A', 'B', 'C', 'D', 'K', 'RT'],
    'Cat': ['Book', 'Laptop', 'Pen', 'Pen', 'Pencil', 'Pencil'],
    'Issr': ['Y', '', 'J', 'U', 'W', ''],
    'Ref': ['100', '101', '102', '103', '101', '201']
}

base_df = pd.DataFrame(base)
current_df = pd.DataFrame(current)


merged_df = pd.merge(current_df, base_df, how="outer", indicator=True)

print(merged_df)

Rank Ctry     Cat   Issr    Ref      _merge
10    A      Book    Y      100        both
14    B     Laptop          101   left_only     
15    C     Pen     J       102        both 
50    D     Pen     U       103   left_only   <--- how to know this got marked `left_only` as value of `issr` col is different 
24    K     Pencil  W       101   left_only   <--- Invalid - new entry has no-empty value for `Issr` col
24    RT    Pencil          201   left_only
14    B     Laptop  C       101  right_only
50    D     Pen             103  right_only

I can get left_only (indicating update/new rows in current df) but how to know because of which column/columns the row got marked as left_only?

If I get to know that pandas marked left_only as it saw a diff in Issr column, I can just check its value (empty or not), and pass/fail the job.

How to get this column info?

Share Improve this question edited Mar 24 at 16:03 asked Mar 24 at 15:15 adarsh 1,5131 gold badge11 silver badges19 bronze badges

Please post your code. E.g. 1) What columns are you merging on? 2) I assume by "outer merge function", you mean merge(..., how="outer"), but please be explicit. For related tips, see How to make good reproducible pandas examples and minimal reproducible example. – wjandrea Commented Mar 24 at 15:34
Hi Adarsh, I think you might consider adding keyword on followed by the list of columns you want to join the two dataframes on. That makes it more controllable to you. Those should be the columns that identify the entity and should uniquely identify a record (at most one). Then you can continue checking the column values which might have changed in the result set and if you get a record with left_- or right_only, you know that a record has been deleted respectively inserted (if the old on is on the left). – jottbe Commented Mar 24 at 18:12

Add a comment |

1 Answer 1

Sorted by: Reset to default 0

If you don't use on parameter for pd.merge, I can explain like below.

both is generated only when if nothing has changed in every column. And when if any of the columns has changed, left_only is generated in the row after the change, right_only is generated in the row before the change.

Here is an answer to compare these dataframes.

First, you need to set criteria columns, like Primary Key.

Let's say the criteria columns are ['Ctry', 'Cat'].

# Changing the columns' name of base_df
base_df.columns = [i + '_base' if i not in ['Ctry', 'Cat'] else i for i in base_df.columns]

# Merging on criteria columns
merged_df = pd.merge(current_df, base_df, how="outer", on=['Ctry', 'Cat'], indicator=True)

# Detecting the part that changes
merged_df['is_changed'] =\
       np.where(merged_df['_merge']=='left_only', 'new_item', 
         np.where(merged_df['Rank']!=merged_df['Rank_base'], 'Rank_changed', 
           np.where(merged_df['Issr']!=merged_df['Issr_base'], 'Issr_changed', 
            np.where(merged_df['Ref']!=merged_df['Ref_base'], 'Ref_changed', 'No_changed'))))

final_df = merged_df[current_df.columns.tolist() + ['is_changed']]

Then, you can get below.

>>> final_df

+----+--------+--------+--------+--------+-------+--------------+
|    |   Rank | Ctry   | Cat    | Issr   |   Ref | is_changed   |
|----+--------+--------+--------+--------+-------+--------------|
|  0 |     10 | A      | Book   | Y      |   100 | No_changed   |
|  1 |     14 | B      | Laptop |        |   101 | Issr_changed |
|  2 |     15 | C      | Pen    | J      |   102 | No_changed   |
|  3 |     50 | D      | Pen    | U      |   103 | Issr_changed |
|  4 |     24 | K      | Pencil | W      |   101 | new_item     |
|  5 |     24 | RT     | Pencil |        |   201 | new_item     |
+----+--------+--------+--------+--------+-------+--------------+

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

python - Find the value that is different when a row is marked "left_only" or "right_only" w

1 Answer 1

与本文相关的文章

评论列表(0)