Any suggestions as to where this huge time difference comes from when reading the same large text file with the same loop from inside a class in Python?
import timeit
fn = "some-large-file.txt"
pos = 2434735976
class TestCase:
    def __init__(self, filename):
        self.filename = filename
        self.content = ""

    def read_file(self, start):
        self.readstart = start
        with open(self.filename, "r") as f:
            f.seek(self.readstart)
            line = f.readline()
            while line:
                self.content += line.strip()
                line = f.readline()
                if line.strip().startswith('>'): return
timeit_start = timeit.default_timer()
a = TestCase(fn)
a.read_file(pos)
print(len(a.content))
timeit_stop = timeit.default_timer()
print('Elapsed time: ', timeit_stop - timeit_start)
90338456
Elapsed time: 31628.955818721
timeit_start = timeit.default_timer()
s = ''
with open(fn, "r") as f:
    f.seek(pos)
    line = f.readline()
    while line:
        s += line.strip()
        line = f.readline()
        if line.strip().startswith('>'): break
print(len(s))
timeit_stop = timeit.default_timer()
print('Elapsed time: ', timeit_stop - timeit_start)
90338456
Elapsed time: 1.233782830000564
I use Jupyter with Python 3.8.10 and IPython 8.12.2.
- Do you observe the same behavior if you switch the order of operations: first outside, then the class based one? – luk2302 Commented Mar 21 at 5:47
- Yes, actually I did it the other way around, and I tried several other test cases in Jupyter, rebooting the machine in between. What I'm going to do next is to create a standalone Python test file and run it on a different computer. – user1491229 Commented Mar 21 at 6:45
- Could you upload your file? – vassiliev Commented Mar 21 at 6:55
- Can you try running this outside of Jupyter and with an up-to-date version of Python? If what you're saying is correct (the timings) then it surely has to be some bizarre issue with Jupyter. Also, how big is the file? Obviously it's over 2GB but just how big is it in total? – Adon Bilivit Commented Mar 21 at 8:45
- I can reproduce significant performance differences under Jupyter. Unfortunately, I can't run timings in a "normal" environment right now, but I think @AdonBilivit's guess is right: I too suspect Jupyter as the culprit. – Timus Commented Mar 21 at 9:23
3 Answers
This isn't a file reading issue. One version of your code hits CPython's mutative string += optimization, while the other version goes through the unoptimized code path.
Your code relies on iteratively building up a string with +=. Since Python strings are immutable, this would ordinarily have to create an entirely new string and copy all the data into it on every iteration, which is extremely slow.
The right way to stick a bunch of strings together into one string is by building a list of strings and calling ''.join on that list. But people kept doing the += loop thing, seeing that it was painfully slow, and blaming Python for being slow. So the devs put in a hacky optimization.
If the bytecode evaluation loop sees that it's performing a + or += on two strings, it attempts to perform the operation by mutating the left-hand string.
There's a safety check on this optimization: if there are any references to the string besides the one this code is operating on, or a variable the result is about to be assigned to, the optimization aborts. You can see the code for that in unicode_concatenate.
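To see that safety check matter in practice, here's a minimal illustrative snippet (not from the question; the names and sizes are arbitrary): merely keeping a second name bound to the string being grown is enough to force the slow, copy-everything path.
import timeit

def grow(keep_alias):
    s = ""
    alias = None
    for _ in range(10_000):
        if keep_alias:
            alias = s       # second reference to s -> the refcount check fails, so += copies
        s += "x" * 100      # otherwise CPython can resize s in place
    return s, alias

# On CPython the aliased version typically runs dramatically slower.
print("no alias  :", timeit.timeit(lambda: grow(False), number=1))
print("with alias:", timeit.timeit(lambda: grow(True), number=1))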
For the purpose of this optimization, an object attribute does not count as a variable the result is about to be assigned to. (unicode_concatenate has cases for opcodes that perform variable assignment, but no case for STORE_ATTR.) Attribute access and attribute assignment can both trigger arbitrary code, and either it wasn't worth trying to figure out exactly when the optimization could be applied to attributes, or they just never implemented that handling.
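You can see which store opcode each pattern ends with using the dis module; a minimal sketch (the function and class names are made up for illustration):
import dis

def local_concat(s, x):
    s += x              # augmented assignment to a local: the result is stored with STORE_FAST
    return s

class C:
    def attr_concat(self, x):
        self.content += x   # augmented assignment to an attribute: the result is stored with STORE_ATTR
        return self.content

dis.dis(local_concat)   # look for STORE_FAST right after the add
dis.dis(C.attr_concat)  # look for STORE_ATTR instead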
So the version of your code that does s += line.strip() hits the optimized code path, while the self.content += line.strip() version does not.
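Here is a self-contained micro-benchmark (illustrative only; Holder, N and CHUNK are made-up names) that reproduces the asymmetry without any file I/O:
import timeit

class Holder:
    def __init__(self):
        self.content = ""

def concat_local(n, chunk):
    s = ""
    for _ in range(n):
        s += chunk          # target is a local -> eligible for the in-place optimization
    return s

def concat_attribute(n, chunk):
    h = Holder()
    for _ in range(n):
        h.content += chunk  # target is an attribute -> a brand-new string is built every time
    return h.content

N, CHUNK = 10_000, "x" * 100
print("local    :", timeit.timeit(lambda: concat_local(N, CHUNK), number=1))
print("attribute:", timeit.timeit(lambda: concat_attribute(N, CHUNK), number=1))
On a typical CPython install the attribute version is orders of magnitude slower, and the gap grows quadratically with N.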
For comparison, this is what the ''.join code would look like:
l = []
with open(fn, "r") as f:
    f.seek(pos)
    line = f.readline()
    while line:
        l.append(line.strip())
        line = f.readline()
        if line.strip().startswith('>'): break
s = ''.join(l)
You can avoid the string concatenation in the read_file() function, which may help with what you're seeing. It will certainly be more efficient.
class TestCase:
    def __init__(self, filename):
        self.filename = filename
        self.content = ""

    def read_file(self, start):
        with open(self.filename, "r") as f:
            f.seek(pos := start)
            for line in f:
                if line.startswith(">"):
                    # Found the next record header: seek back and read the
                    # whole record in one call, then drop all whitespace.
                    f.seek(start)
                    self.content = "".join(f.read(pos - start).split())
                    break
                pos += len(line)  # track how far we have read from start
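As a quick sanity check, it can be exercised the same way as the snippets above (file name and offset taken from the question):
import timeit

t = TestCase("some-large-file.txt")
timeit_start = timeit.default_timer()
t.read_file(2434735976)
print(len(t.content))
print('Elapsed time: ', timeit.default_timer() - timeit_start)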
I thank all contributors. To finish up the discussion, I'd like to add the completed TestCase:
import re

class TestCase:
    def __init__(self, filename):
        self.filename = filename
        self.content = ""

    def read_file(self, start):
        with open(self.filename, "r") as f:
            f.seek(start)
            line = f.readline()
            s = ""
            while line:
                s += line.strip()
                line = f.readline()
                if line.strip().startswith('>'):
                    self.content = s
                    return

    def join_read_file(self, start):
        l = []
        with open(self.filename, "r") as f:
            f.seek(start)
            line = f.readline()
            while line:
                l.append(line.strip())
                line = f.readline()
                if line.strip().startswith('>'): break
            self.content = ''.join(l)

    def sub_blockread_file(self, start, end):
        with open(self.filename, "r") as f:
            f.seek(start)
            self.content = re.sub(r"\s+", "", f.read(end - start))

    def join_blockread_file(self, start, end):
        with open(self.filename, "r") as f:
            f.seek(start)
            self.content = "".join(f.read(end - start).split())
With this class I tested the performance of the different read_file functions; one possible timing harness is sketched after the table. The results are summarized in the following table. Nothing surprising, though.
function | elapsed time (in s)
---|---
read_file | 1.0445590410001842
join_read_file | 0.6792131259999223
sub_blockread_file | 1.937005585999941
join_blockread_file | 0.39088223399994604
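For reference, a timing harness along the following lines could produce such a table. The blockread variants need an end offset that is assumed to be known in advance (e.g. the position of the next '>' header); the value below is purely illustrative.
import timeit

fn = "some-large-file.txt"
start = 2434735976                 # offset used throughout this thread
end = start + 92_000_000           # hypothetical end offset; use the real position of the next '>' header

tc = TestCase(fn)
cases = [
    ("read_file",           lambda: tc.read_file(start)),
    ("join_read_file",      lambda: tc.join_read_file(start)),
    ("sub_blockread_file",  lambda: tc.sub_blockread_file(start, end)),
    ("join_blockread_file", lambda: tc.join_blockread_file(start, end)),
]
for name, call in cases:
    t0 = timeit.default_timer()
    call()
    print(f"{name:20s} {timeit.default_timer() - t0:.6f} s ({len(tc.content)} chars)")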