python - np.loadtxt skips unexpected lines when reading strings

I have a file structured as such:

IID SuperPops
FGXXXXXXXX  R6
FGXXXXXXXX  R12

The file contains 524123 lines, including the header.

I need to read the first column as an array:

len(np.loadtxt('foo.txt',usecols=0,dtype=str))
524123

But if I also try to skip the header, 10 lines suddenly go missing:

len(np.loadtxt('foo.txt',usecols=0,dtype=str,skiprows=1))
524112

This does not happen if I pass a fixed-width string dtype instead:

len(np.loadtxt('foo.txt',usecols=0,dtype='<U20',skiprows=1))
524122
len(np.loadtxt('foo.txt',usecols=0,dtype='<U10',skiprows=1))
524122

The 10 missing entries are all of length 10:

[len(elem) for elem in set(np.loadtxt('foo.txt',usecols=0,dtype='<U20',skiprows=1)) - set(np.loadtxt('foo.txt',usecols=0,dtype=str,skiprows=1))]
[10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

What could be the reason, and why is it triggered specifically by skiprows?

asked Mar 15 at 15:29 by user2834012


1 Answer


This appears to be a bug with np.loadtxt: it loads the file in blocks of 50000 lines, and applies skiprows at each block instead of only at the beginning.

At 524K lines, you're getting 11 blocks, so 11 lines are skipped instead of just the first one.
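The arithmetic can be checked with a toy model of the behaviour described above. This is only a sketch: `simulate_buggy_read` is a hypothetical helper, not NumPy's actual code, and it assumes each block first skips `skiprows` lines and then reads up to 50000 lines.

```python
def simulate_buggy_read(total_lines, skiprows=1, block_size=50000):
    """Model a reader that wrongly applies `skiprows` at the start of every block.

    Returns the number of lines kept and the 1-based numbers of the skipped lines.
    """
    position = 0          # number of lines consumed so far
    kept = 0
    skipped_lines = []
    while position < total_lines:
        # Buggy step (assumed): the skip is repeated for each block.
        n_skip = min(skiprows, total_lines - position)
        skipped_lines.extend(range(position + 1, position + n_skip + 1))
        position += n_skip
        # Then read up to one block of data lines.
        n_read = min(block_size, total_lines - position)
        kept += n_read
        position += n_read
    return kept, skipped_lines


kept, skipped_lines = simulate_buggy_read(524123)
print(kept)                # 524112, the length reported in the question
print(len(skipped_lines))  # 11 lines skipped: the header plus 10 data lines
print(skipped_lines[:3])   # [1, 50002, 100003]
```

Under this model the second skipped line is line 50002, and the total of 524112 matches what the question reports for dtype=str with skiprows=1.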

A fix for this bug was merged a month ago, a week after NumPy 2.2.3 was published, so I would expect NumPy 2.2.4 to stop displaying this bug.
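Until an upgrade is possible, one option is to sidestep np.loadtxt for this column altogether. The sketch below uses only the standard library; `read_first_column` is a hypothetical helper, and the IDs in the demo file are made up to match the shape of the file in the question.

```python
import os
import tempfile


def read_first_column(path, skip_header=1):
    """Return the first whitespace-separated field of each data line."""
    values = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if line_number < skip_header:
                continue  # skip the header exactly once, at the top of the file
            fields = line.split()
            if fields:  # ignore blank lines
                values.append(fields[0])
    return values


# Small demonstration on a throwaway file shaped like the one in the question.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("IID SuperPops\nFG00000001  R6\nFG00000002  R12\n")
    demo_path = tmp.name

ids = read_first_column(demo_path)
os.remove(demo_path)
print(ids)  # ['FG00000001', 'FG00000002']
```

If a NumPy array is needed, the resulting list can be passed to np.array. The question also shows that an explicit fixed-width dtype such as '<U10' returns the expected 524122 entries, which is another way around the problem.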

Refs:

issue on GitHub: https://github.com/numpy/numpy/issues/28315
merged PR fixing it: https://github.com/numpy/numpy/pull/28319

I figured this out by noticing that line 50002 got skipped if the file had more than 50001 lines, and then Googling "loadtxt 50000" brought me to the issue quoted above.
