python - SDV generates incorrect values for hash-like fields when creating synthetic data from a CSV file - Stack Over

I am using SDV to generate mock data by extracting real data and metadata from an existing CSV file and then saving the mock data to a new CSV file.

Here is my code:

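import pandas as pd
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

# Load the real data (semicolon-separated) and let SDV detect the metadata
data = pd.read_csv('file.csv', sep=';')
metadata = Metadata.detect_from_dataframe(
    data=data,
    table_name='test'
)

# Fit the synthesizer on the real data and sample 10 synthetic rows
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(10)

# Write the synthetic rows using the same separator as the input
synthetic_data.to_csv('synthetic_file.csv', index=False, sep=';')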

My issue is that SDV does not identify hash-like columns/fields correctly. For example, a field named ID in my data contains values like:

4959478426DF15EE67AZBED5B0B99EDB848597F2
AB28A95B91DE6637DE8D7728D6C945EFFC58F029
D304CE66B9204C637C8BA1B75B2952495C66321F

But in the synthetic output, SDV generates values like:

sdv-id-sVCqLP
sdv-id-CjXnSq
sdv-id-HuiFjs

I tried explicitly setting the field type using metadata.update_column:

metadata.update_column(
    table_name='test',
    column_name='ID',
    sdtype='id',
)

But the results remained the same. SDV still replaces the hash-like values with generic synthetic identifiers. I understand that I can use a custom generator to manually create hashes, but this would break the relationship logic provided by SDV.

How can I make SDV generate synthetic data for hash-like fields while preserving the repetition logic from the original dataset?
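For reference, this is roughly what I mean by a custom generator that keeps the repetition pattern. It is only a minimal sketch using the standard library: the helper name build_hash_mapping and the sample values are hypothetical, and it assumes the target format is 40 uppercase hexadecimal characters, as in my examples. Each distinct real value is replaced by one random hash, so rows that shared an ID still share one afterwards.

```python
import secrets


def build_hash_mapping(values, length=40):
    """Map each distinct value to a unique random uppercase hex string.

    Hypothetical helper, not part of SDV. Repeated input values map to
    the same output, so the repetition pattern of the column is kept.
    """
    mapping = {}
    used = set()
    for value in values:
        if value in mapping:
            continue  # already assigned a replacement
        # token_hex(n) returns 2*n hex characters; trim to the requested length
        token = secrets.token_hex((length + 1) // 2)[:length].upper()
        while token in used:  # extremely unlikely, but keep replacements unique
            token = secrets.token_hex((length + 1) // 2)[:length].upper()
        used.add(token)
        mapping[value] = token
    return mapping


# Sample values for illustration only; the first and third are the same ID.
real_ids = [
    "AB28A95B91DE6637DE8D7728D6C945EFFC58F029",
    "D304CE66B9204C637C8BA1B75B2952495C66321F",
    "AB28A95B91DE6637DE8D7728D6C945EFFC58F029",
]
mapping = build_hash_mapping(real_ids)
fake_ids = [mapping[v] for v in real_ids]
```

With pandas this would be applied as data['ID'] = data['ID'].map(mapping) before fitting. My assumption (not verified) is that the replaced column would then have to be declared as sdtype='categorical' so that SDV samples from those values with similar frequencies instead of generating new identifiers. As far as I can tell from the documentation, the id sdtype also accepts a regex_format such as '[0-9A-F]{40}', but that appears to control only the format of the generated values, not how often they repeat.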
