I'm trying to read a CSV in DuckDB (Python), but one of the records has a string ID while the rest of the file has the ID field as an integer.
I'd like to force DuckDB to read the column as a string. How can I do that?
Asked Jan 30 at 16:30 by Vzzarr; edited Jan 30 at 18:25 by jarlh.
Using stations-2023-09.csv from Netherlands Stations:
SELECT * from read_csv('~/Downloads/stations-2023-09.csv') limit 5;
┌───────┬─────────┬─────────┬────────────┬──────────────────┬───┬──────────────────────┬─────────┬──────────────────────┬─────────────────┬─────────────────┐
│ id │ code │ uic │ name_short │ name_medium │ … │ slug │ country │ type │ geo_lat │ geo_lng │
│ int64 │ varchar │ int64 │ varchar │ varchar │ │ varchar │ varchar │ varchar │ double │ double │
├───────┼─────────┼─────────┼────────────┼──────────────────┼───┼──────────────────────┼─────────┼──────────────────────┼─────────────────┼─────────────────┤
│ 266 │ HT │ 8400319 │ Den Bosch │ 's-Hertogenbosch │ … │ s-hertogenbosch │ NL │ knooppuntIntercity… │ 51.69048 │ 5.29362 │
│ 269 │ HTO │ 8400320 │ Dn Bosch O │ 's-Hertogenb. O. │ … │ s-hertogenbosch-oost │ NL │ stoptreinstation │ 51.700553894043 │ 5.3183331489563 │
│ 227 │ HDE │ 8400388 │ 't Harde │ 't Harde │ … │ t-harde │ NL │ stoptreinstation │ 52.4091682 │ 5.893611 │
│ 8 │ AHBF │ 8015345 │ Aachen │ Aachen Hbf │ … │ aachen-hbf │ D │ knooppuntIntercity… │ 50.7678 │ 6.091499 │
│ 818 │ AW │ 8015199 │ Aachen W │ Aachen West │ … │ aachen-west │ D │ stoptreinstation │ 50.78036 │ 6.070715 │
├───────┴─────────┴─────────┴────────────┴──────────────────┴───┴──────────────────────┴─────────┴──────────────────────┴─────────────────┴─────────────────┤
│ 5 rows 11 columns (10 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
SELECT * from read_csv('~/Downloads/stations-2023-09.csv', types={'id': 'VARCHAR'}) limit 5;
┌─────────┬─────────┬─────────┬────────────┬───┬──────────────────────┬─────────┬──────────────────────┬─────────────────┬─────────────────┐
│ id │ code │ uic │ name_short │ … │ slug │ country │ type │ geo_lat │ geo_lng │
│ varchar │ varchar │ int64 │ varchar │ │ varchar │ varchar │ varchar │ double │ double │
├─────────┼─────────┼─────────┼────────────┼───┼──────────────────────┼─────────┼──────────────────────┼─────────────────┼─────────────────┤
│ 266 │ HT │ 8400319 │ Den Bosch │ … │ s-hertogenbosch │ NL │ knooppuntIntercity… │ 51.69048 │ 5.29362 │
│ 269 │ HTO │ 8400320 │ Dn Bosch O │ … │ s-hertogenbosch-oost │ NL │ stoptreinstation │ 51.700553894043 │ 5.3183331489563 │
│ 227 │ HDE │ 8400388 │ 't Harde │ … │ t-harde │ NL │ stoptreinstation │ 52.4091682 │ 5.893611 │
│ 8 │ AHBF │ 8015345 │ Aachen │ … │ aachen-hbf │ D │ knooppuntIntercity… │ 50.7678 │ 6.091499 │
│ 818 │ AW │ 8015199 │ Aachen W │ … │ aachen-west │ D │ stoptreinstation │ 50.78036 │ 6.070715 │
├─────────┴─────────┴─────────┴────────────┴───┴──────────────────────┴─────────┴──────────────────────┴─────────────────┴─────────────────┤
│ 5 rows 11 columns (9 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
In the second case, types={'id': 'VARCHAR'} was used to override the autodetection and cast the values in the id column to varchar instead of int64. The other columns were left alone.
UPDATE
The above works in the DuckDB CLI, but not in the Python API, where types is changed to dtype. For the Python case the working example is:
import duckdb
duckdb.read_csv('/home/aklaver/Downloads/stations-2023-09.csv', dtype={"id": "VARCHAR"}).limit(5)
┌─────────┬─────────┬─────────┬────────────┬───┬──────────────────────┬─────────┬──────────────────────┬─────────────────┬─────────────────┐
│ id │ code │ uic │ name_short │ … │ slug │ country │ type │ geo_lat │ geo_lng │
│ varchar │ varchar │ int64 │ varchar │ │ varchar │ varchar │ varchar │ double │ double │
├─────────┼─────────┼─────────┼────────────┼───┼──────────────────────┼─────────┼──────────────────────┼─────────────────┼─────────────────┤
│ 266 │ HT │ 8400319 │ Den Bosch │ … │ s-hertogenbosch │ NL │ knooppuntIntercity… │ 51.69048 │ 5.29362 │
│ 269 │ HTO │ 8400320 │ Dn Bosch O │ … │ s-hertogenbosch-oost │ NL │ stoptreinstation │ 51.700553894043 │ 5.3183331489563 │
│ 227 │ HDE │ 8400388 │ 't Harde │ … │ t-harde │ NL │ stoptreinstation │ 52.4091682 │ 5.893611 │
│ 8 │ AHBF │ 8015345 │ Aachen │ … │ aachen-hbf │ D │ knooppuntIntercity… │ 50.7678 │ 6.091499 │
│ 818 │ AW │ 8015199 │ Aachen W │ … │ aachen-west │ D │ stoptreinstation │ 50.78036 │ 6.070715 │
├─────────┴─────────┴─────────┴────────────┴───┴──────────────────────┴─────────┴──────────────────────┴─────────────────┴─────────────────┤
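│ 5 rows 11 columns (9 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘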
The DuckDB function read_csv() has a parameter all_varchar that, when set to True, makes DuckDB read every field as a string:
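import duckdb

# all_varchar=True reads every column as VARCHAR instead of detecting types
q = duckdb.read_csv(path_or_buffer='/my_path/my_csv.csv', all_varchar=True)
# duckdb.sql() can refer to the Python variable q by name
sql = duckdb.sql("""
SELECT *
FROM q
""")
sql.show(max_width=100, max_rows=1000)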
This let me work around the inconsistent type of the ID column I was trying to read. For reference: https://duckdb./docs/data/csv/overview.html#parameters
ID value not compatible with casting to integer? – Adrian Klaver, commented Jan 30 at 18:01