python - How to enforce string type when reading CSVs in DuckDB?

I'm trying to read a CSV in DuckDB (Python) but one of the records has a string ID while the rest of the file has the ID field as an integer.

I'd like to force reading it as a string, how to do it?

I'm trying to read a CSV in DuckDB (Python) but one of the records has a string ID while the rest of the file has the ID field as an integer.

I'd like to force reading it as a string, how to do it?

Share Improve this question edited Jan 30 at 18:25 jarlh 44.8k8 gold badges50 silver badges67 bronze badges asked Jan 30 at 16:30 Vzzarr 5,7204 gold badges58 silver badges100 bronze badges

1) In a CSV all the record values will be strings and read as such. What you are referring to is DuckDB using AutoDetection to infer data types from the string values and then casting the strings as those types. 2) Is the one record ID value not compatible with casting to integer? – Adrian Klaver Commented Jan 30 at 18:01
@AdrianKlaver 1) correct, that's yet another alternative and more correct way of stating things - I'm new to DuckDB so still familiarising with these concepts. But again feel free to post in a new answer. 2) In some of the intermediate systems I'm working with they mixed strings and integers :( so I had to go string as more generic – Vzzarr Commented Feb 2 at 13:41

Add a comment |

2 Answers 2

Sorted by: Reset to default 3

Using stations-2023-09.csv from here Netherlands Stations:

SELECT * from read_csv('~/Downloads/stations-2023-09.csv') limit 5;

┌───────┬─────────┬─────────┬────────────┬──────────────────┬───┬──────────────────────┬─────────┬──────────────────────┬─────────────────┬─────────────────┐
│  id   │  code   │   uic   │ name_short │   name_medium    │ … │         slug         │ country │         type         │     geo_lat     │     geo_lng     │
│ int64 │ varchar │  int64  │  varchar   │     varchar      │   │       varchar        │ varchar │       varchar        │     double      │     double      │
├───────┼─────────┼─────────┼────────────┼──────────────────┼───┼──────────────────────┼─────────┼──────────────────────┼─────────────────┼─────────────────┤
│   266 │ HT      │ 8400319 │ Den Bosch  │ 's-Hertogenbosch │ … │ s-hertogenbosch      │ NL      │ knooppuntIntercity…  │        51.69048 │         5.29362 │
│   269 │ HTO     │ 8400320 │ Dn Bosch O │ 's-Hertogenb. O. │ … │ s-hertogenbosch-oost │ NL      │ stoptreinstation     │ 51.700553894043 │ 5.3183331489563 │
│   227 │ HDE     │ 8400388 │ 't Harde   │ 't Harde         │ … │ t-harde              │ NL      │ stoptreinstation     │      52.4091682 │        5.893611 │
│     8 │ AHBF    │ 8015345 │ Aachen     │ Aachen Hbf       │ … │ aachen-hbf           │ D       │ knooppuntIntercity…  │         50.7678 │        6.091499 │
│   818 │ AW      │ 8015199 │ Aachen W   │ Aachen West      │ … │ aachen-west          │ D       │ stoptreinstation     │        50.78036 │        6.070715 │
├───────┴─────────┴─────────┴────────────┴──────────────────┴───┴──────────────────────┴─────────┴──────────────────────┴─────────────────┴─────────────────┤
│ 5 rows                                                                                                                              11 columns (10 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

SELECT * from read_csv('~/Downloads/stations-2023-09.csv', types={'id': 'VARCHAR'}) limit 5;

┌─────────┬─────────┬─────────┬────────────┬───┬──────────────────────┬─────────┬──────────────────────┬─────────────────┬─────────────────┐
│   id    │  code   │   uic   │ name_short │ … │         slug         │ country │         type         │     geo_lat     │     geo_lng     │
│ varchar │ varchar │  int64  │  varchar   │   │       varchar        │ varchar │       varchar        │     double      │     double      │
├─────────┼─────────┼─────────┼────────────┼───┼──────────────────────┼─────────┼──────────────────────┼─────────────────┼─────────────────┤
│ 266     │ HT      │ 8400319 │ Den Bosch  │ … │ s-hertogenbosch      │ NL      │ knooppuntIntercity…  │        51.69048 │         5.29362 │
│ 269     │ HTO     │ 8400320 │ Dn Bosch O │ … │ s-hertogenbosch-oost │ NL      │ stoptreinstation     │ 51.700553894043 │ 5.3183331489563 │
│ 227     │ HDE     │ 8400388 │ 't Harde   │ … │ t-harde              │ NL      │ stoptreinstation     │      52.4091682 │        5.893611 │
│ 8       │ AHBF    │ 8015345 │ Aachen     │ … │ aachen-hbf           │ D       │ knooppuntIntercity…  │         50.7678 │        6.091499 │
│ 818     │ AW      │ 8015199 │ Aachen W   │ … │ aachen-west          │ D       │ stoptreinstation     │        50.78036 │        6.070715 │
├─────────┴─────────┴─────────┴────────────┴───┴──────────────────────┴─────────┴──────────────────────┴─────────────────┴─────────────────┤
│ 5 rows                                                                                                              11 columns (9 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

In the second case types={'id': 'VARCHAR'} was used to override the autodetect and cast the values in the id column into varchar instead of int64. The other columns where left alone.

UPDATE

The above works in the DuckDB CLI, but not in the Python API. There types is changed to dtype. For the Python case the working example is:

import duckdb

duckdb.read_csv('/home/aklaver/Downloads/stations-2023-09.csv', dtype={"id": "VARCHAR"}).limit(5)

┌─────────┬─────────┬─────────┬────────────┬───┬──────────────────────┬─────────┬──────────────────────┬─────────────────┬─────────────────┐
│   id    │  code   │   uic   │ name_short │ … │         slug         │ country │         type         │     geo_lat     │     geo_lng     │
│ varchar │ varchar │  int64  │  varchar   │   │       varchar        │ varchar │       varchar        │     double      │     double      │
├─────────┼─────────┼─────────┼────────────┼───┼──────────────────────┼─────────┼──────────────────────┼─────────────────┼─────────────────┤
│ 266     │ HT      │ 8400319 │ Den Bosch  │ … │ s-hertogenbosch      │ NL      │ knooppuntIntercity…  │        51.69048 │         5.29362 │
│ 269     │ HTO     │ 8400320 │ Dn Bosch O │ … │ s-hertogenbosch-oost │ NL      │ stoptreinstation     │ 51.700553894043 │ 5.3183331489563 │
│ 227     │ HDE     │ 8400388 │ 't Harde   │ … │ t-harde              │ NL      │ stoptreinstation     │      52.4091682 │        5.893611 │
│ 8       │ AHBF    │ 8015345 │ Aachen     │ … │ aachen-hbf           │ D       │ knooppuntIntercity…  │         50.7678 │        6.091499 │
│ 818     │ AW      │ 8015199 │ Aachen W   │ … │ aachen-west          │ D       │ stoptreinstation     │        50.78036 │        6.070715 │
├─────────┴─────────┴─────────┴────────────┴───┴──────────────────────┴─────────┴──────────────────────┴─────────────────┴─────────────────┤
│ 5 rows                                                                                                              11 columns (9 shown)

The DuckDB function read_csv() has a parameter all_varchar that when se to True allows to infer all the fields as string:

import duckdb


q = duckdb.read_csv(all_varchar=True, path_or_buffer=f'/my_path/my_csv.csv')

sql = duckdb.sql("""
    SELECT * 
    FROM q
""")

sql.show(max_width=100, max_rows=1000)

allowing me to overcome the issue on inconsistent type for the ID I was trying to read, for reference: https://duckdb./docs/data/csv/overview.html#parameters

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

python - How to enforce string type when reading CSVs in DuckDB? - Stack Overflow

2 Answers 2

与本文相关的文章

评论列表(0)