I have a quite large Dask dataframe mydataframe and a NumPy array mycodes. I want to filter the rows of mydataframe and keep only those where the column CODE is not in mycodes. I reset the index of the dataframe so that the partitions are known, as I read this was important after I got an error. I tried the following code:
is_new = ~mydataframe["CODE"].isin(mycodes).compute().values.flatten()
new_codes = mydataframe.loc[is_new, "CODE"].drop_duplicates().compute()
and variations of it. I get errors about the number of partitions, or about the length of the index I pass to the filter. Other approaches have produced other errors, sometimes assertion errors. I can't seem to do something as simple as filtering the rows of a dataframe.
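For what it's worth, the mask I end up passing to loc is a plain in-memory array, which may be part of the problem:
print(type(is_new), len(is_new))  # numpy.ndarray, one entry per row of mydataframe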
Forgive the lack of a fully concrete example, but I don't find one really necessary; the question is general: can anyone give me some indications on how to filter the rows of a large Dask dataframe, and the things I need to take into account or the limitations involved?
You can find the data I am working with for mydataframe here. I am testing with the data in the first zip file. It's a fixed-width file (fwf) and you have the design in this gist. The only relevant variable is CODE, which I read as a string. For mycodes you can try any subset of the codes.
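For reference, this is roughly how I read the file into Dask; the path, column positions, and names below are placeholders, the real layout is in the gist:
import dask.dataframe as dd
mydataframe = dd.read_fwf(
    "data.txt",                    # placeholder path to the unzipped file
    colspecs=[(0, 8), (8, 16)],    # placeholder (start, end) positions; see the gist
    names=["CODE", "VALUE"],       # placeholder column names
    dtype={"CODE": "string"},      # CODE is read as a string
)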
- Can you provide a small sample of your dataframe and codes to keep? – globglogabgalab Commented Feb 6 at 16:32
- @globglogabgalab I added a link to the data :) – miguelsxvi Commented Feb 7 at 8:41
1 Answer
Based on the information you've provided, I'd probably do something like this:
import dask.dataframe as dd
import numpy as np
# Example: Large Dask DataFrame
df = dd.read_parquet("mydata.parquet") # Replace with your actual data source
# Example: NumPy array of codes to exclude (as strings, since CODE is read as a string)
mycodes = np.array(["1001", "1002", "1003"])
# Convert NumPy array to a list for better compatibility
mycodes_list = mycodes.tolist()
# Filter rows where CODE is NOT in mycodes
filtered_df = df[~df["CODE"].isin(mycodes_list)]
# Trigger computation if needed
# This pulls the result into memory
result = filtered_df.compute()
print(result)
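If you only need to inspect a few rows rather than the full result, head is cheaper than compute (a minimal illustration):
print(filtered_df.head(10))  # by default reads only from the first partition instead of computing everything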
Regarding best practices, it's best to use compute sparingly (dask docs link).
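For example, if the filtered result is too large to hold in memory, you can skip compute entirely and keep the pipeline lazy all the way to disk (the output path is a placeholder):
filtered_df = df[~df["CODE"].isin(mycodes_list)]
filtered_df.to_parquet("filtered_output.parquet")  # executes the graph and streams the result to disk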