There are two dataframes: current
and base
. current
is an incremental update to base
.
Whenever there is an update made to current
, we want to run checks to see if it adheres to rules. If it does, current
becomes base
.
Updates to column Issr
is not expected except empty value. So, there are two checks:
- New entry must not add any value to
Issr
column (''
is accepted) - Existing row's value for column
Issr
shouldn't be updated (if updated to empty, it is fine). Update to any other column is ok.
base
SNo | Rank | Ctry | Cat | Issr | Ref |
---|---|---|---|---|---|
1 | 10 | A | Book | Y | 100 |
2 | 14 | B | Laptop | C | 101 |
3 | 15 | C | Pen | J | 102 |
4 | 50 | D | Pen | 103 |
There are two dataframes: current
and base
. current
is an incremental update to base
.
Whenever there is an update made to current
, we want to run checks to see if it adheres to rules. If it does, current
becomes base
.
Updates to column Issr
is not expected except empty value. So, there are two checks:
- New entry must not add any value to
Issr
column (''
is accepted) - Existing row's value for column
Issr
shouldn't be updated (if updated to empty, it is fine). Update to any other column is ok.
base
SNo | Rank | Ctry | Cat | Issr | Ref |
---|---|---|---|---|---|
1 | 10 | A | Book | Y | 100 |
2 | 14 | B | Laptop | C | 101 |
3 | 15 | C | Pen | J | 102 |
4 | 50 | D | Pen | 103 |
current
SNo | Rank | Ctry | Cat | Issr | Ref |
---|---|---|---|---|---|
1 | 10 | A | Book1 | Y | 100 |
2 (updated) | 14 | B | Laptop | 101 | |
3 | 15 | C | Pen | J | 102 |
4 (updated) | 50 | D | Pen | U | 103 |
5 (new entry) | 24 | K | Pencil | W | 101 |
6 (new entry) | 24 | RT | Pencil | 201 |
Above current
fails checks because of multiple issues:
- Row# 4 in
base
: Existing row's columnIssr
got updated to a non-empty values (while existing Row# 2 update is fine as it made the value empty) - Row# 5 is a new entry in
current
with a non-empty value. This is violation. Row#6 is also new entry but ok since the value forIssr
is empty.
import pandas as pd
base = {
'Rank': [10,14,15,50],
'Ctry': ['A', 'B', 'C', 'D'],
'Cat': ['Book', 'Laptop', 'Pen', 'Pen'],
'Issr': ['Y', 'C', 'J', ''],
'Ref': ['100', '101', '102', '103']
}
current = {
'Rank': [10,14,15,50, 24, 24],
'Ctry': ['A', 'B', 'C', 'D', 'K', 'RT'],
'Cat': ['Book', 'Laptop', 'Pen', 'Pen', 'Pencil', 'Pencil'],
'Issr': ['Y', '', 'J', 'U', 'W', ''],
'Ref': ['100', '101', '102', '103', '101', '201']
}
base_df = pd.DataFrame(base)
current_df = pd.DataFrame(current)
merged_df = pd.merge(current_df, base_df, how="outer", indicator=True)
print(merged_df)
Rank Ctry Cat Issr Ref _merge
10 A Book Y 100 both
14 B Laptop 101 left_only
15 C Pen J 102 both
50 D Pen U 103 left_only <--- how to know this got marked `left_only` as value of `issr` col is different
24 K Pencil W 101 left_only <--- Invalid - new entry has no-empty value for `Issr` col
24 RT Pencil 201 left_only
14 B Laptop C 101 right_only
50 D Pen 103 right_only
I can get left_only
(indicating update/new rows in current
df) but how to know because of which column/columns the row got marked as left_only
?
If I get to know that pandas marked left_only
as it saw a diff in Issr
column, I can just check its value (empty or not), and pass/fail the job.
How to get this column info?
Share Improve this question edited Mar 24 at 16:03 adarsh asked Mar 24 at 15:15 adarshadarsh 1,5131 gold badge11 silver badges19 bronze badges 2 |1 Answer
Reset to default 0If you don't use on
parameter for pd.merge
, I can explain like below.
both
is generated only when if nothing has changed in every column.
And when if any of the columns has changed,
left_only
is generated in the row after the change,
right_only
is generated in the row before the change.
Here is an answer to compare these dataframes.
First, you need to set criteria columns, like Primary Key.
Let's say the criteria columns are ['Ctry', 'Cat'].
# Changing the columns' name of base_df
base_df.columns = [i + '_base' if i not in ['Ctry', 'Cat'] else i for i in base_df.columns]
# Merging on criteria columns
merged_df = pd.merge(current_df, base_df, how="outer", on=['Ctry', 'Cat'], indicator=True)
# Detecting the part that changes
merged_df['is_changed'] =\
np.where(merged_df['_merge']=='left_only', 'new_item',
np.where(merged_df['Rank']!=merged_df['Rank_base'], 'Rank_changed',
np.where(merged_df['Issr']!=merged_df['Issr_base'], 'Issr_changed',
np.where(merged_df['Ref']!=merged_df['Ref_base'], 'Ref_changed', 'No_changed'))))
final_df = merged_df[current_df.columns.tolist() + ['is_changed']]
Then, you can get below.
>>> final_df
+----+--------+--------+--------+--------+-------+--------------+
| | Rank | Ctry | Cat | Issr | Ref | is_changed |
|----+--------+--------+--------+--------+-------+--------------|
| 0 | 10 | A | Book | Y | 100 | No_changed |
| 1 | 14 | B | Laptop | | 101 | Issr_changed |
| 2 | 15 | C | Pen | J | 102 | No_changed |
| 3 | 50 | D | Pen | U | 103 | Issr_changed |
| 4 | 24 | K | Pencil | W | 101 | new_item |
| 5 | 24 | RT | Pencil | | 201 | new_item |
+----+--------+--------+--------+--------+-------+--------------+
outer merge
function", you meanmerge(..., how="outer")
, but please be explicit. For related tips, see How to make good reproducible pandas examples and minimal reproducible example. – wjandrea Commented Mar 24 at 15:34on
followed by the list of columns you want to join the two dataframes on. That makes it more controllable to you. Those should be the columns that identify the entity and should uniquely identify a record (at most one). Then you can continue checking the column values which might have changed in the result set and if you get a record with left_- or right_only, you know that a record has been deleted respectively inserted (if the old on is on the left). – jottbe Commented Mar 24 at 18:12