Any suggestions as to where this huge time difference comes from when reading the same large text file with the same loop from inside a class in Python?
import timeit
fn = "some-large-file.txt"
pos = 2434735976
class TestCase:
    def __init__(self, filename):
        self.filename = filename
        self.content = ""

    def read_file(self, start):
        self.readstart = start
        with open(self.filename, "r") as f:
            f.seek(self.readstart)
            line = f.readline()
            while line:
                self.content += line.strip()
                line = f.readline()
                if line.strip().startswith('>'): return
timeit_start = timeit.default_timer()
a = TestCase(fn)
a.read_file(pos)
print(len(a.content))
timeit_stop = timeit.default_timer()
print('Elapsed time: ', timeit_stop - timeit_start)
90338456
Elapsed time: 31628.955818721
timeit_start = timeit.default_timer()
s = ''
with open(fn, "r") as f:
    f.seek(pos)
    line = f.readline()
    while line:
        s += line.strip()
        line = f.readline()
        if line.strip().startswith('>'): break
print(len(s))
timeit_stop = timeit.default_timer()
print('Elapsed time: ', timeit_stop - timeit_start)
90338456
Elapsed time: 1.233782830000564
I use Jupyter with Python 3.8.10 and IPython 8.12.2.
- Do you observe the same behavior if you switch the order of operations: first outside, then the class based one? – luk2302 Commented Mar 21 at 5:47
- Yes, actually I did it the other way around, and I tried several other test cases in Jupyter, rebooting the machine in between. What I'm going to do next is to create a standalone Python test file and run it on a different computer. – user1491229 Commented Mar 21 at 6:45
- Could you upload your file? – vassiliev Commented Mar 21 at 6:55
- Can you try running this outside of Jupyter and with an up-to-date version of Python? If what you're saying is correct (the timings) then it surely has to be some bizarre issue with Jupyter. Also, how big is the file? Obviously it's over 2GB but just how big is it in total? – Adon Bilivit Commented Mar 21 at 8:45
- I can reproduce significant performance differences under Jupyter. Unfortunately, I can't run timings in a "normal" environment right now, but I think @AdonBilivit's guess is right: I too suspect Jupyter as the culprit. – Timus Commented Mar 21 at 9:23
3 Answers
This isn't a file reading issue. One version of your code hits CPython's mutative string += optimization, while the other version goes through the unoptimized code path.
Your code relies on iteratively building up a string with +=. Since Python strings are immutable, this would ordinarily have to create an entirely new string and copy all the data into it on every iteration, which is extremely slow.
The right way to stick a bunch of strings together into one string is by building a list of strings and calling ''.join on that list. But people kept doing the += loop thing, seeing that it was painfully slow, and blaming Python for being slow. So the devs put in a hacky optimization.
If the bytecode evaluation loop sees that it's performing a + or += on two strings, it attempts to perform the operation by mutating the left-hand string.
There's a safety check on this optimization: if there are any references to the string besides the one this code is operating on, or a variable the result is about to be assigned to, the optimization aborts. You can see the code for that in unicode_concatenate.
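To see that safety check matter in practice, here's a minimal illustrative snippet (not from the question; the names and sizes are arbitrary): merely keeping a second name bound to the string being grown is enough to force the slow, copy-everything path.
import timeit

def grow(keep_alias):
    s = ""
    alias = None
    for _ in range(10_000):
        if keep_alias:
            alias = s       # second reference to s -> the refcount check fails, so += copies
        s += "x" * 100      # otherwise CPython can resize s in place
    return s, alias

# On CPython the aliased version typically runs dramatically slower.
print("no alias  :", timeit.timeit(lambda: grow(False), number=1))
print("with alias:", timeit.timeit(lambda: grow(True), number=1))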
For the purpose of this optimization, an object attribute does not count as a variable the result is about to be assigned to. (unicode_concatenate has cases for opcodes that perform variable assignment, but no case for STORE_ATTR.) Attribute access and attribute assignment can both trigger arbitrary code, and either it wasn't worth trying to figure out exactly when the optimization could be applied to attributes, or they just never implemented that handling.
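You can see which store opcode each pattern ends with using the dis module; a minimal sketch (the function and class names are made up for illustration):
import dis

def local_concat(s, x):
    s += x              # augmented assignment to a local: the result is stored with STORE_FAST
    return s

class C:
    def attr_concat(self, x):
        self.content += x   # augmented assignment to an attribute: the result is stored with STORE_ATTR
        return self.content

dis.dis(local_concat)   # look for STORE_FAST right after the add
dis.dis(C.attr_concat)  # look for STORE_ATTR instead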
So the version of your code that does s += line.strip() hits the optimized code path, while the self.content += line.strip() version does not.
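Here is a self-contained micro-benchmark (illustrative only; Holder, N and CHUNK are made-up names) that reproduces the asymmetry without any file I/O:
import timeit

class Holder:
    def __init__(self):
        self.content = ""

def concat_local(n, chunk):
    s = ""
    for _ in range(n):
        s += chunk          # target is a local -> eligible for the in-place optimization
    return s

def concat_attribute(n, chunk):
    h = Holder()
    for _ in range(n):
        h.content += chunk  # target is an attribute -> a brand-new string is built every time
    return h.content

N, CHUNK = 10_000, "x" * 100
print("local    :", timeit.timeit(lambda: concat_local(N, CHUNK), number=1))
print("attribute:", timeit.timeit(lambda: concat_attribute(N, CHUNK), number=1))
On a typical CPython install the attribute version is orders of magnitude slower, and the gap grows quadratically with N.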
For comparison, this is what the ''.join code would look like:
l = []
with open(fn, "r") as f:
    f.seek(pos)
    line = f.readline()
    while line:
        l.append(line.strip())
        line = f.readline()
        if line.strip().startswith('>'): break
s = ''.join(l)
You can avoid the string concatenation in the read_file() function, which may help with what you're seeing. It will certainly be more efficient.
class TestCase:
    def __init__(self, filename):
        self.filename = filename
        self.content = ""

    def read_file(self, start):
        with open(self.filename, "r") as f:
            f.seek(pos := start)
            for line in f:
                if line.startswith(">"):
                    # Found the next record header: seek back and read the
                    # whole record in one call, then drop all whitespace.
                    f.seek(start)
                    self.content = "".join(f.read(pos - start).split())
                    break
                pos += len(line)  # track how far we have read from start
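As a quick sanity check, it can be exercised the same way as the snippets above (file name and offset taken from the question):
import timeit

t = TestCase("some-large-file.txt")
timeit_start = timeit.default_timer()
t.read_file(2434735976)
print(len(t.content))
print('Elapsed time: ', timeit.default_timer() - timeit_start)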
I thank all contributors. To finish up the discussion, I'd like to add the completed TestCase:
import re

class TestCase:
    def __init__(self, filename):
        self.filename = filename
        self.content = ""

    def read_file(self, start):
        with open(self.filename, "r") as f:
            f.seek(start)
            line = f.readline()
            s = ""
            while line:
                s += line.strip()
                line = f.readline()
                if line.strip().startswith('>'):
                    self.content = s
                    return

    def join_read_file(self, start):
        l = []
        with open(self.filename, "r") as f:
            f.seek(start)
            line = f.readline()
            while line:
                l.append(line.strip())
                line = f.readline()
                if line.strip().startswith('>'): break
            self.content = ''.join(l)

    def sub_blockread_file(self, start, end):
        with open(self.filename, "r") as f:
            f.seek(start)
            self.content = re.sub(r"\s+", "", f.read(end - start))

    def join_blockread_file(self, start, end):
        with open(self.filename, "r") as f:
            f.seek(start)
            self.content = "".join(f.read(end - start).split())
With this class I tested the performance of the different read_file functions; one possible timing harness is sketched after the table. The results are summarized in the following table. Nothing surprising, though.
function | elapsed time (in s)
---|---
read_file | 1.0445590410001842
join_read_file | 0.6792131259999223
sub_blockread_file | 1.937005585999941
join_blockread_file | 0.39088223399994604
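For reference, a timing harness along the following lines could produce such a table. The blockread variants need an end offset that is assumed to be known in advance (e.g. the position of the next '>' header); the value below is purely illustrative.
import timeit

fn = "some-large-file.txt"
start = 2434735976                 # offset used throughout this thread
end = start + 92_000_000           # hypothetical end offset; use the real position of the next '>' header

tc = TestCase(fn)
cases = [
    ("read_file",           lambda: tc.read_file(start)),
    ("join_read_file",      lambda: tc.join_read_file(start)),
    ("sub_blockread_file",  lambda: tc.sub_blockread_file(start, end)),
    ("join_blockread_file", lambda: tc.join_blockread_file(start, end)),
]
for name, call in cases:
    t0 = timeit.default_timer()
    call()
    print(f"{name:20s} {timeit.default_timer() - t0:.6f} s ({len(tc.content)} chars)")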