I have a really large ASCII file (63 million lines or more) that I would like to read using numpy.genfromtxt(). But it is taking up a lot of memory. I want to know what Python actually does when numpy.genfromtxt() is executed. Does it read all the lines at once?
Look at the code below, for example.
import numpy as np
data = np.genfromtxt("large.file.txt")
When I execute the code above, does Python read all the contents of large.file.txt and load them into memory? If so, is there another way to read a large file line by line so that Python does not use so much memory?
2 Answers
It reads all the lines. It has to. That data array has to hold all of the file's data, and NumPy can't build an array with all of the file's data without reading all of the file.
That said, the implementation uses a lot more memory than the output needs: it parses the requested columns of the file's data into a list of tuples before applying further processing, and a list of tuples takes a lot more memory than a NumPy array.
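To get a feel for the overhead, here is a rough, illustrative comparison; the exact numbers depend on your Python build, and the 100,000 rows by 5 columns shape is a stand-in, not the asker's data:
import sys
import numpy as np

n_rows, n_cols = 100_000, 5   # stand-in shape; the real file is much larger

# Roughly what genfromtxt holds internally: a list of tuples of Python floats
rows = [tuple(float(v) for v in range(n_cols)) for _ in range(n_rows)]
list_bytes = (sys.getsizeof(rows)                      # the list itself
              + sum(sys.getsizeof(t) for t in rows)    # each tuple object
              + n_rows * n_cols * sys.getsizeof(1.5))  # each float object (approximate)

arr = np.array(rows)                                   # the final float64 array
print(list_bytes, arr.nbytes)  # the intermediate list is several times larger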
If you want to use less intermediate memory, I think numpy.loadtxt is more efficient on that front - digging down into the implementation eventually hits a function that stores parsed data into an array directly, instead of using a list of tuples. numpy.loadtxt isn't as flexible as numpy.genfromtxt, but you don't seem to need the extra flexibility.
This won't make data itself take any less memory, though. Also, numpy.loadtxt does still need extra intermediate memory. It should just be less intermediate memory than numpy.genfromtxt.
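If the file really is just whitespace-separated numbers, the swap is a one-liner; this is a sketch against the asker's placeholder filename, not something guaranteed for every format:
import numpy as np

# loadtxt parses rows straight into the array, so the intermediate
# overhead is smaller, but the final array still lives fully in memory.
data = np.loadtxt("large.file.txt")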
As best I can tell from the genfromtxt source, it sets up an open file handle, fhd. Then it uses next(fhd) to skip initial lines.
The main reader block is
# Parse each line
for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
    values = split_line(line)
    nbvalues = len(values)
    # Skip an empty line
    if nbvalues == 0:
        continue
    if usecols:
        try:
            values = [values[_] for _ in usecols]
        except IndexError:
            append_to_invalid((i + skip_header + 1, nbvalues))
            continue
    elif nbvalues != nbcols:
        append_to_invalid((i + skip_header + 1, nbvalues))
        continue
    # Store the values
    append_to_rows(tuple(values))
...
edit
Here's a demo of how you can read a csv file in blocks.
Write a modest size csv. Uniform dtype for simplicity:
In [35]: arr = np.arange(1000).reshape(200,5)
In [36]: np.savetxt('test.csv', arr, fmt='%6i', delimiter=',')
In [37]: ls test.csv
Volume in drive C is Windows
Directory of C:\Users\14256
02/15/2025 04:13 PM 7,200 test.csv
1 File(s) 7,200 bytes
A sample read with readlines():
In [38]: with open('test.csv','rt') as f: print(f.readlines()[:5])
[' 0, 1, 2, 3, 4\n', ' 5, 6, 7, 8, 9\n', ' 10, 11, 12, 13, 14\n', ' 15, 16, 17, 18, 19\n', ' 20, 21, 22, 23, 24\n']
Open the file in line read mode:
In [57]: f = open('test.csv','rt')
and read a block of lines:
In [58]: lines = [f.readline() for _ in range(50)]
In [59]: out = np.genfromtxt(lines, dtype='int', delimiter=','); out.shape
Out[59]: (50, 5)
In [60]: out[[0,-1]]
Out[60]:
array([[ 0, 1, 2, 3, 4],
[245, 246, 247, 248, 249]])
Read another block, starting from where we left off:
In [61]: lines = [f.readline() for _ in range(50)]
In [62]: out = np.genfromtxt(lines, dtype='int', delimiter=','); out.shape
Out[62]: (50, 5)
In [63]: out[[0,-1]]
Out[63]:
array([[250, 251, 252, 253, 254],
[495, 496, 497, 498, 499]])
And yet another block:
In [64]: out = np.genfromtxt([f.readline() for _ in range(50)], dtype='int', delimiter=','); out.shape
Out[64]: (50, 5)
In [65]: out[[0,-1]]
Out[65]:
array([[500, 501, 502, 503, 504],
[745, 746, 747, 748, 749]])
Those separate out arrays can be used or concatenated as needed.
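To apply the same idea to the asker's 63-million-line file, the block reads can go in a loop; this is a rough sketch (the block size, dtype, and placeholder filename are assumptions), and note that stacking the blocks at the end still needs room for the full result:
import itertools
import numpy as np

blocks = []
with open('large.file.txt', 'rt') as f:
    while True:
        # islice hands genfromtxt at most 1_000_000 lines at a time,
        # without pulling the rest of the file into memory
        lines = list(itertools.islice(f, 1_000_000))
        if not lines:
            break
        blocks.append(np.genfromtxt(lines, dtype=float))
res = np.vstack(blocks)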
timings
Loading by blocks doesn't hurt the timings:
In [77]: %%timeit
...: blocks=[]
...: with open('test.csv','rt') as f:
...: for _ in range(4):
...: lines = [f.readline() for _ in range(50)]
...: out = np.loadtxt(lines, dtype='int',delimiter=',')
...: blocks.append(out)
...: res = np.vstack(blocks)
...:
...:
721 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [78]: timeit np.loadtxt('test.csv', dtype='int', delimiter=',')
850 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [79]: timeit np.genfromtxt('test.csv', dtype='int', delimiter=',')
2.93 ms ± 34.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The data object will contain the file contents. (BTW, it uses buffered I/O.) – Tim Roberts Commented Feb 15 at 19:14
The genfromtxt docs say the input must be a "File, filename, list, or generator to read." Thus it cannot use readlines to read the whole file at once. – hpaulj Commented Feb 15 at 21:25