
python - np.loadtxt skips unexpected lines when reading strings - Stack Overflow


I have a file structured as such:

IID SuperPops
FGXXXXXXXX  R6
FGXXXXXXXX  R12

It contains 524123 lines, including the header.

I need to read the first column as an array:

len(np.loadtxt('foo.txt',usecols=0,dtype=str))
524123

But if I also try to skip the header, 10 more lines go missing than expected (I get 524112 entries instead of 524122):

len(np.loadtxt('foo.txt',usecols=0,dtype=str,skiprows=1))
524112

This does not happen if I change the dtype to a fixed-width string type, though:

len(np.loadtxt('foo.txt',usecols=0,dtype='<U20',skiprows=1))
524122
len(np.loadtxt('foo.txt',usecols=0,dtype='<U10',skiprows=1))
524122

The 10 missing entries are all of length 10:

[len(elem) for elem in set(np.loadtxt('foo.txt',usecols=0,dtype='<U20',skiprows=1)) - set(np.loadtxt('foo.txt',usecols=0,dtype=str,skiprows=1))]
[10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

What could be the reason and why is it specifically triggered by skiprows?

Asked Mar 15 at 15:29 by user2834012

1 Answer


This appears to be a bug with np.loadtxt: it loads the file in blocks of 50000 lines, and applies skiprows at each block instead of only at the beginning.

At 524K lines, you're getting 11 blocks, so 11 lines are skipped instead of just the first one.
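The arithmetic can be checked with a small toy model. This is only a sketch of the behaviour described above, not NumPy's actual code: the function name rows_read_with_repeated_skip is made up for illustration, and the chunk size of 50000 is taken from the description above.

```python
def rows_read_with_repeated_skip(total_lines, skiprows=1, chunk_size=50000):
    """Toy model of the described bug: skiprows is re-applied before every chunk.

    Returns (rows_returned, lines_skipped). Hypothetical helper, not a NumPy API.
    """
    position = 0   # number of lines consumed from the file so far
    rows = 0       # lines that end up in the result
    skipped = 0    # lines dropped because of skiprows
    while position < total_lines:
        # The bug being modelled: the skip happens at the start of each chunk.
        skip = min(skiprows, total_lines - position)
        position += skip
        skipped += skip
        # Then up to chunk_size lines are read as data.
        take = min(chunk_size, total_lines - position)
        position += take
        rows += take
    return rows, skipped


print(rows_read_with_repeated_skip(524123))  # (524112, 11)
print(rows_read_with_repeated_skip(50002))   # (50000, 2)
```

Under this model a 524123-line file yields 524112 rows, which matches the count in the question, and a 50002-line file loses its line 50002.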

A fix for this bug was merged a month ago, a week after NumPy 2.2.3 was published, so I would expect NumPy 2.2.4 to stop displaying this bug.
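Until you can upgrade, two possible workarounds are sketched below. The first relies on the observation in the question that a fixed-width dtype such as '<U20' returns the expected 524122 entries. The second consumes the header by hand and calls np.loadtxt without skiprows, which is my own suggestion rather than something taken from the issue. The function names and the default width of 20 are assumptions made for illustration.

```python
import numpy as np


def load_ids_fixed_width(path, width=20):
    """Read column 0 with a fixed-width string dtype and skiprows=1.

    width is an assumed value; it should be at least as long as the longest ID.
    """
    return np.loadtxt(path, usecols=0, dtype=f"<U{width}", skiprows=1)


def load_ids_manual_skip(path):
    """Read column 0 with dtype=str after consuming the header line manually.

    No skiprows is passed, so there is nothing to be re-applied later in the file.
    """
    with open(path) as handle:
        next(handle, None)  # drop the header line ourselves
        return np.loadtxt(handle, usecols=0, dtype=str)
```

For example, load_ids_manual_skip('foo.txt') should return one entry per data line of the file from the question.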

Refs:

  • issue on GitHub: https://github.com/numpy/numpy/issues/28315
  • merged PR fixing it: https://github.com/numpy/numpy/pull/28319

I figured this out by noticing that line 50002 got skipped if the file had more than 50001 lines, and then Googling "loadtxt 50000" brought me to the issue quoted above.
