We are working with a hashed identifier column (`vmid`) in a Pandas DataFrame. The column has been sanitized to remove special characters, because the values are used as CSV file names. The sanitization function replaces disallowed characters with an underscore (`_`):
```python
import re

def sanitize_filename(vmid):
    # Replace characters that are not allowed in file names with '_'
    return re.sub(r'[<>:"/\\|?*]', '_', vmid)
```
or
```python
# Replace invalid characters in the file name
filename = f"{vmid.replace('/', '_')}.csv"  # store using an f-string
vmid_df = result_df[result_df['vmid'] == vmid]
vmid_df.to_csv(filename, index=False)
```
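To illustrate why the sanitized values are unrecoverable: the replacement is many-to-one, so distinct ids can collapse to the same string (the ids below are made up):

```python
# Two made-up ids that differ only in a disallowed character
print(sanitize_filename('abc/123'))  # -> abc_123
print(sanitize_filename('abc?123'))  # -> abc_123, same string: the original is lost
```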
Problem:
The original `vmid` values were anonymized before processing. After storing the sanitized version (with special characters replaced by `_`), we lost the ability to recover the original `vmid` values. We now need to join this DataFrame with another one (`vmtable.csv`), which contains the `vmid` column and its corresponding `vmcategory`.
We attempted approximate string matching with Levenshtein distance, but it often returns the wrong category because multiple candidates can be equally close.
Attempts & Limitations:
- Using Levenshtein distance: finds the closest match but can result in incorrect assignments (see the sketch after this list).
- Using direct equality comparison: fails, since the sanitized `vmid` differs from the original.
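A minimal sketch of the fuzzy-matching attempt (assuming the python-Levenshtein package; the candidate ids are made up):

```python
import Levenshtein  # pip install python-Levenshtein

def closest_vmid(sanitized, candidates):
    # Pick the candidate with the smallest edit distance to the sanitized id
    return min(candidates, key=lambda c: Levenshtein.distance(sanitized, c))

# Both made-up originals are one edit away from 'abc_123', so the tie is
# broken arbitrarily and the wrong vmcategory can be picked up in the join.
candidates = ['abc/123', 'abc?123']
print(closest_vmid('abc_123', candidates))  # -> 'abc/123' (first of the tie)
```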
We checked existing solutions such as this post and this post, but each focuses on a single matching algorithm, whereas our issue is recovering the lost mapping before the join.
One possible, if naive, solution is to store a mapping before sanitization, i.e. build a dictionary before applying `sanitize_filename()`:

```python
vmid_map = {original_vmid: sanitize_filename(original_vmid) for original_vmid in df['vmid']}
```
Later, this mapping can be inverted to restore the original `vmid` before merging or left-joining with other tables.
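A sketch of that restore-then-join step (assuming the stored files carry the sanitized id; `vmid_map`, `filename`, and the `vmid`/`vmcategory` columns are from above, the rest is assumed):

```python
import pandas as pd

# Invert the stored map so a sanitized id looks up its original form
# (assumes sanitization produced no collisions among these ids)
restore_map = {sanitized: original for original, sanitized in vmid_map.items()}

# Read a per-vmid CSV back and recover the original id
vmid_df = pd.read_csv(filename)
vmid_df['vmid'] = vmid_df['vmid'].map(restore_map)

# With the original ids restored, a plain left join works again
vmtable = pd.read_csv('vmtable.csv')
merged = vmid_df.merge(vmtable[['vmid', 'vmcategory']], on='vmid', how='left')
```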
Question:
What would be the best practice to handle this, so that we can keep safe CSV file names and later restore the original `vmid` for correct DataFrame merging?
Would a mapping strategy or an alternative encoding approach be more reliable? Any better approaches are highly appreciated!
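(For concreteness, the "alternative encoding" we have in mind is something reversible such as percent-encoding, sketched below with only the standard library; the example id is made up.)

```python
from urllib.parse import quote, unquote

def encode_vmid(vmid):
    # Percent-encode reserved characters (including '/'), so the result is
    # filesystem-safe and, unlike '_' replacement, fully reversible
    return quote(vmid, safe='')

def decode_vmid(encoded):
    return unquote(encoded)

original = 'abc/123?x'           # made-up vmid
encoded = encode_vmid(original)  # 'abc%2F123%3Fx'
assert decode_vmid(encoded) == original
```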