Using Python, I want to read text from a (utf-8 encoded) text file, but at the same time need to know the start and end position of each character in the file (in bytes). As there might be multi-byte characters, this isn't a 1:1 mapping.
I can see that reading through the file character by character and keeping an offset akin to offset += len(c.encode('utf-8')) (or whatever the encoding of the file would be) would work, but this seems a bit ad-hoc, especially once (ignored) decoding errors etc. come into play. Is there a standard way / library to do that? I would imagine a list of character-offset pairs, or a str plus a list of integers which contains the offsets.
EDIT: The context in which this is applied is that the script would be reading in some (supposed) text file and return blocks of text which are "interesting" (e.g. a diff to a reference). However, the text file is untrusted (generated by student code), so it needs to be robust against any malformed input. And, yes, I could just fail as soon as there is non-utf8 input, but since this is used in a learning environment, I would like it to be best-effort.
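The character-by-character bookkeeping described above might look like this (a minimal sketch of the asker's idea, not code from the question; note that with errors='replace' the replacement character's encoded length no longer matches the bad bytes, and newline translation in text mode would skew the offsets, which is why newline='' is passed here):

```python
def naive_end_offsets(path, encoding='utf-8'):
    """Return (text, ends) where ends[i] is the byte offset just past text[i].
    Only correct while decoding succeeds strictly."""
    with open(path, encoding=encoding, newline='') as f:  # newline='' keeps \r\n intact
        text = f.read()
    ends = []
    offset = 0
    for c in text:
        offset += len(c.encode(encoding))  # byte length of this character
        ends.append(offset)
    return text, ends
```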
1 Answer
According to Methods of File Objects:
f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.
So one could open the input file in binary mode and handle the decoding manually to track the start and end bytes. Below is one way to do it with an IncrementalDecoder (may not handle all error conditions):
import codecs

# Manually writing UTF-8 characters with errors.
with open('input.txt', 'wb') as f:
    f.write('A'.encode())     # 1-byte UTF-8
    f.write('ü'.encode())     # 2-byte UTF-8
    f.write('\r\n'.encode())  # Windows line ending
    f.write('我'.encode())    # 3-byte UTF-8
    f.write(b'\xff')          # invalid UTF-8 byte (the original line was truncated here)
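The decoding half of the example is cut off above. Here is a hedged sketch of how the tracking loop could work (my own reconstruction, not necessarily the original answer's code): feed the bytes to the incremental decoder one at a time and record which byte span produced each piece of output, with errors='replace' keeping it best-effort on malformed input:

```python
import codecs

def byte_spans(path, encoding='utf-8'):
    """Yield (text, start, end) tuples: a decoded chunk of text and the
    half-open byte range [start, end) in the file that produced it.
    Malformed bytes come out as U+FFFD via errors='replace'."""
    decoder = codecs.getincrementaldecoder(encoding)(errors='replace')
    with open(path, 'rb') as f:
        start = pos = 0
        while True:
            b = f.read(1)                        # one byte at a time
            out = decoder.decode(b, final=not b) # flush on EOF
            pos += len(b)
            if out:                              # decoder emitted character(s)
                yield out, start, pos
                start = pos
            if not b:                            # EOF reached
                break
```

Note that decode() can occasionally emit more than one character at once (e.g. a replacement character for a truncated sequence immediately followed by the next valid character), which is why whole chunks are yielded with a shared byte span rather than strictly one character per tuple.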
Comments:

- `data = Path(...).read_bytes(); offsets = [(c, len(data[:i].decode())) for i, c in enumerate(data)]`? It is utf-8, if you know the string, you can compute the offsets anytime. – KamilCuk, Mar 10 at 16:17
- `\r\n` is returned as `\n`. – Mark Tolonen, Mar 10 at 16:20
- `.tell()` on the stream should work. It's considered an opaque number on a text stream but can be used to seek back to the specific character. – Mark Tolonen, Mar 10 at 16:32
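The .tell() suggestion from the last comment could be sketched as follows (my own illustration, not the commenter's code). The recorded positions are opaque cookies in text mode, so they are good for seeking back to a character but not for byte arithmetic; also, calling tell() per character is slow on large files:

```python
def char_positions(path, encoding='utf-8'):
    """Return a list of (char, cookie) pairs, where cookie is the stream
    position before the character; f.seek(cookie) returns to that character."""
    positions = []
    with open(path, encoding=encoding, errors='replace', newline='') as f:
        while True:
            pos = f.tell()     # opaque cookie, valid for seek() on this file
            c = f.read(1)
            if not c:
                break
            positions.append((c, pos))
    return positions
```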