
Does Python read all lines of a file when numpy.genfromtxt() is executed? - Stack Overflow


I have a really large ASCII file (63 million lines or more) that I would like to read using numpy.genfromtxt(), but it is taking up a lot of memory. I want to know what Python actually does when numpy.genfromtxt() is executed. Does it read all the lines at once?

Look at the code below, for example.

import numpy as np
data = np.genfromtxt("large.file.txt")

When I execute the code above, does Python read all the contents of large.file.txt and load them into memory? If so, is there another way to read a large file line by line so that Python does not use so much memory?


asked Feb 15 at 19:09 by Redshoe
  • 1 What do you think the difference is? Whether you read it all at once or line-by-line, it still has to fit into memory. The data object will contain the file contents. (BTW, it uses buffered I/O.) – Tim Roberts Commented Feb 15 at 19:14
  • Yeah, but fitting 63 million lines at one go and fitting a single line at one go are different. – Redshoe Commented Feb 15 at 19:30
  • 1 According to the documentation, np.genfromtxt() returns an ndarray. Therefore it must load the entire file in memory at once, because all ndarrays do that. (With the exception of memory mapped ndarrays, which doesn't apply here.) – Nick ODell Commented Feb 15 at 19:45
  • For reading the file one line at a time, have you tried open()? stackoverflow/questions/11555468/… – Nick ODell Commented Feb 15 at 19:48
  • genfromtxt docs say the input must be a File, filename, list, or generator to read.. Thus it cannot use readlines to read the whole file at once. – hpaulj Commented Feb 15 at 21:25

2 Answers


It reads all the lines. It has to. That data array has to hold all of the file's data, and NumPy can't build an array with all of the file's data without reading all of the file.

That said, the implementation uses a lot more memory than the output needs. The implementation parses the requested columns of the file's data into a list of tuples before applying further processing, and a list of tuples takes a lot more memory than a NumPy array.
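
As a rough, CPython-specific illustration of that overhead (not part of the original answer), compare one million values held as a list of 1-element tuples against the same values held in an ndarray:

    import sys
    import numpy as np

    # One million parsed values as a list of 1-element tuples vs. an ndarray.
    rows = [(float(i),) for i in range(1_000_000)]
    arr = np.array(rows)

    # Count the list itself, every tuple object, and every float object.
    list_bytes = sys.getsizeof(rows) + sum(
        sys.getsizeof(t) + sys.getsizeof(t[0]) for t in rows
    )
    print(list_bytes // 10**6, "MB of Python objects")    # roughly 95 MB
    print(arr.nbytes // 10**6, "MB of raw float64 data")  # 8 MB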

If you want to use less intermediate memory, I think numpy.loadtxt is more efficient on that front - digging down into the implementation eventually hits a function that stores parsed data into an array directly, instead of using a list of tuples. numpy.loadtxt isn't as flexible as numpy.genfromtxt, but you don't seem to need the extra flexibility.

This won't make data itself take any less memory, though. Also, numpy.loadtxt does still need extra intermediate memory. It should just be less intermediate memory than numpy.genfromtxt.
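
A minimal sketch of that suggestion (the filename is the one from the question; the float32 dtype is an assumption, shown only because a narrower dtype also halves the size of data itself when reduced precision is acceptable):

    import numpy as np

    # Same read as in the question, but via loadtxt with an explicit dtype.
    # float32 is an assumption; drop the dtype argument to keep float64.
    data = np.loadtxt("large.file.txt", dtype=np.float32)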

As best I can tell from the genfromtxt source, it sets up an open file handle, fhd.

It then uses next(fhd) to skip the initial (skip_header) lines.

The main reader block is

    # Parse each line
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
        values = split_line(line)
        nbvalues = len(values)
        # Skip an empty line
        if nbvalues == 0:
            continue
        if usecols:
            try:
                values = [values[_] for _ in usecols]
            except IndexError:
                append_to_invalid((i + skip_header + 1, nbvalues))
                continue
        elif nbvalues != nbcols:
            append_to_invalid((i + skip_header + 1, nbvalues))
            continue
        # Store the values
        append_to_rows(tuple(values))
        ...

edit

Here's a demo of how you can read a csv file in blocks.

Write a modest-size csv, with a uniform dtype for simplicity:

In [35]: arr = np.arange(1000).reshape(200,5)

In [36]: np.savetxt('test.csv', arr, fmt='%6i', delimiter=',')

In [37]: ls test.csv
 Volume in drive C is Windows

 Directory of C:\Users\14256

02/15/2025  04:13 PM             7,200 test.csv
               1 File(s)          7,200 bytes
     

A sample read with readlines():

In [38]: with open('test.csv','rt') as f: print(f.readlines()[:5])
['     0,     1,     2,     3,     4\n', '     5,     6,     7,     8,     9\n', '    10,    11,    12,    13,    14\n', '    15,    16,    17,    18,    19\n', '    20,    21,    22,    23,    24\n']

Open the file in line read mode:

In [57]: f = open('test.csv','rt')

and read a block of lines:

In [58]: lines = [f.readline() for _ in range(50)]

In [59]: out = np.genfromtxt(lines, dtype='int', delimiter=','); out.shape
Out[59]: (50, 5)

In [60]: out[[0,-1]]
Out[60]: 
array([[  0,   1,   2,   3,   4],
       [245, 246, 247, 248, 249]])

Read another block, starting from where we left off:

In [61]: lines = [f.readline() for _ in range(50)]

In [62]: out = np.genfromtxt(lines, dtype='int', delimiter=','); out.shape
Out[62]: (50, 5)

In [63]: out[[0,-1]]
Out[63]: 
array([[250, 251, 252, 253, 254],
       [495, 496, 497, 498, 499]])

And yet another block:

In [64]: out = np.genfromtxt([f.readline() for _ in range(50)], dtype='int', delimiter=','); out.shape
Out[64]: (50, 5)

In [65]: out[[0,-1]]
Out[65]: 
array([[500, 501, 502, 503, 504],
       [745, 746, 747, 748, 749]])

Those separate out arrays can be used or concatenated as needed.
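
The same block reading can also be written with itertools.islice, which hands genfromtxt at most 50 lines per call; this is a sketch along the lines of the demo above, not part of the original answer:

    import itertools
    import numpy as np

    # Read 'test.csv' (written above) in blocks of 50 lines and stack them.
    blocks = []
    with open('test.csv', 'rt') as f:
        while True:
            lines = list(itertools.islice(f, 50))  # next 50 lines, or fewer at EOF
            if not lines:
                break
            blocks.append(np.genfromtxt(lines, dtype='int', delimiter=','))
    res = np.vstack(blocks)
    print(res.shape)        # (200, 5) for the 200-row file written above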

timings

Loading by blocks doesn't hurt the timings:

In [77]: %%timeit 
    ...: blocks=[]
    ...: with open('test.csv','rt') as f:
    ...:     for _ in range(4):
    ...:         lines = [f.readline() for _ in range(50)]
    ...:         out = np.loadtxt(lines, dtype='int',delimiter=',')
    ...:         blocks.append(out)
    ...: res = np.vstack(blocks)
    ...: 
    ...: 
721 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [78]: timeit np.loadtxt('test.csv', dtype='int', delimiter=',')
850 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [79]: timeit np.genfromtxt('test.csv', dtype='int', delimiter=',')
2.93 ms ± 34.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)