I am using SDV to generate mock data: I load real data from an existing CSV file, detect its metadata, fit a synthesizer, and save the sampled data to a new CSV file.
Here is my code:
import pandas as pd
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer
data = pd.read_csv('file.csv', sep=';')
metadata = Metadata.detect_from_dataframe(
    data=data,
    table_name='test'
)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(10)
synthetic_data.to_csv('synthetic_file.csv', index=False, sep=';')
My issue is that SDV does not identify hash-like columns/fields correctly. For example, a field named ID in my data contains values like:
4959478426DF15EE67AZBED5B0B99EDB848597F2
AB28A95B91DE6637DE8D7728D6C945EFFC58F029
D304CE66B9204C637C8BA1B75B2952495C66321F
But in the synthetic output, SDV generates values like:
sdv-id-sVCqLP
sdv-id-CjXnSq
sdv-id-HuiFjs
I tried explicitly setting the field type using metadata.update_column:
metadata.update_column(
    table_name='test',
    column_name='ID',
    sdtype='id',
)
But the results remained the same. SDV still replaces the hash-like values with generic synthetic identifiers. I understand that I can use a custom generator to manually create hashes, but this would break the relationship logic provided by SDV.
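To make the "custom generator" idea concrete, here is a minimal sketch of what I have in mind. It assumes the target format is a 40-character uppercase hex string, and the helper names `fake_hash` and `remap_hashes` are my own, not part of SDV. It maps each distinct value to one random hash and reuses it on repeats, but it runs entirely outside SDV.

```python
import secrets


def fake_hash(length=40):
    """Return a random uppercase hex string (hypothetical helper, not SDV API)."""
    # token_hex(n) returns 2*n hex characters, so generate enough and trim.
    return secrets.token_hex((length + 1) // 2)[:length].upper()


def remap_hashes(values, length=40):
    """Replace each distinct value with one fake hash, reused on every repeat."""
    values = list(values)
    mapping = {}
    used = set()
    for value in values:
        if value not in mapping:
            candidate = fake_hash(length)
            # Make sure two different inputs never share a fake hash.
            while candidate in used:
                candidate = fake_hash(length)
            mapping[value] = candidate
            used.add(candidate)
    return [mapping[value] for value in values]


# The first and third entries are the same value, so they get the same fake hash.
real_ids = [
    '4959478426DF15EE67AZBED5B0B99EDB848597F2',
    'AB28A95B91DE6637DE8D7728D6C945EFFC58F029',
    '4959478426DF15EE67AZBED5B0B99EDB848597F2',
]
fake_ids = remap_hashes(real_ids)
```

With pandas this could be applied as `data['ID'] = remap_hashes(data['ID'])`, but that only relabels values that already exist in a table. It does not let SDV decide how often each value should repeat in the sampled rows.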
How can I make SDV generate synthetic data for hash-like fields while preserving the repetition logic from the original dataset?