I have a question regarding predicate pushdown in duckdb. I have some data where I want to do a select-query based on a list of strings. When I was working on this I realized that it seems that for WHERE x IN (a,b,...)
queries the predicate is not being pushed down into the table scan. This impacts performance quite a lot in my case since the tables I am trying to select from contains in the order of 100s of millions of rows.
Below is a toy example in python to illustrate the problem. Note that when using a WHERE x='y'
clause the filtering is pushed down into the table scan, but when using WHERE x in ('y')
, this is not the case.
Is there some way that I can rewrite this to ensure my filters get applied in the scan operation?
import duckdb
with duckdb.connect("example.db") as connection:
connection.execute(
"CREATE TABLE IF NOT EXISTS data AS SELECT * FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv';"
)
print("----------------")
print("DATA")
print("----------------")
print(connection.execute("SELECT * FROM data").fetchdf())
print()
print("----------------")
print("SELECT USING IN")
print("----------------")
result = connection.execute("""
EXPLAIN ANALYZE
SELECT * FROM DATA
WHERE kind IN ('dog', 'cat')
""").fetchone()
print(result[1])
print()
print("----------------")
print("SELECT USING =")
print("----------------")
result = connection.execute("""
EXPLAIN ANALYZE
SELECT * FROM DATA
WHERE kind = 'dog'
""").fetchone()
print(result[1])
print()
Results:
----------------
DATA
----------------
kind sound
0 dog woof
1 cat meow
2 pokemon pika
3 human hello
----------------
SELECT USING IN
----------------
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM DATA WHERE kind IN ('dog', 'cat')
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ HTTPFS HTTP Stats ││
││ ││
││ in: 0 bytes ││
││ out: 0 bytes ││
││ #HEAD: 0 ││
││ #GET: 0 ││
││ #PUT: 0 ││
││ #POST: 0 ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 0.0002s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ ((kind = 'dog') OR (kind =│
│ 'cat')) │
│ │
│ 2 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ TABLE_SCAN │
│ ──────────────────── │
│ data │
│ │
│ Projections: │
│ kind │
│ sound │
│ │
│ 4 Rows │
│ (0.00s) │
└───────────────────────────┘
----------------
SELECT USING =
----------------
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM DATA WHERE kind = 'dog'
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ HTTPFS HTTP Stats ││
││ ││
││ in: 0 bytes ││
││ out: 0 bytes ││
││ #HEAD: 0 ││
││ #GET: 0 ││
││ #PUT: 0 ││
││ #POST: 0 ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 0.0002s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ TABLE_SCAN │
│ ──────────────────── │
│ data │
│ │
│ Projections: │
│ kind │
│ sound │
│ │
│ Filters: │
│ kind='dog' AND kind IS NOT│
│ NULL │
│ │
│ 1 Rows │
│ (0.00s) │
└───────────────────────────┘