Using Python, I want to read text from a (utf-8 encoded) text file, but at the same time need to know the start and end position of each character in the file (in bytes). As there might be multi-byte characters, this isn't a 1:1 mapping.
I can see that reading through the file character by character and keeping an offset akin to offset += len(c.encode('utf-8')) (or whatever the encoding of the file would be) would work, but this seems a bit ad-hoc, especially once (ignored) decoding errors etc. come into play. Is there a standard way / library to do that? I would imagine a list of character-offset pairs, or a str plus a list of integers which contains the offsets.
EDIT: The context in which this is applied is that the script would be reading in some (supposed) text file and return blocks of text which are "interesting" (e.g. a diff to a reference). However, the text file is untrusted (generated by student code), so it needs to be robust against any malformed input. And, yes, I could just fail as soon as there is non-utf8 input, but since this is used in a learning environment, I would like it to be best-effort.
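The character-by-character bookkeeping described above might look like this (a minimal sketch of the asker's idea, not code from the question; note that with errors='replace' the replacement character's encoded length no longer matches the bad bytes, and newline translation in text mode would skew the offsets, which is why newline='' is passed here):

```python
def naive_end_offsets(path, encoding='utf-8'):
    """Return (text, ends) where ends[i] is the byte offset just past text[i].
    Only correct while decoding succeeds strictly."""
    with open(path, encoding=encoding, newline='') as f:  # newline='' keeps \r\n intact
        text = f.read()
    ends = []
    offset = 0
    for c in text:
        offset += len(c.encode(encoding))  # byte length of this character
        ends.append(offset)
    return text, ends
```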
1 Answer
According to Methods of File Objects:
f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.
So one could open the input file in binary mode and handle the decoding manually to track the start and end bytes. Below is one way to do it with an IncrementalDecoder (may not handle all error conditions):
import codecs

# Manually writing UTF-8 characters with errors.
with open('input.txt', 'wb') as f:
    f.write('A'.encode())     # 1-byte UTF-8
    f.write('ü'.encode())     # 2-byte UTF-8
    f.write('\r\n'.encode())  # Windows line ending
    f.write('我'.encode())    # 3-byte UTF-8
    f.write(b'\xff')          # invalid UTF-8 byte (the original line was truncated here)
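The decoding half of the example is cut off above. Here is a hedged sketch of how the tracking loop could work (my own reconstruction, not necessarily the original answer's code): feed the bytes to the incremental decoder one at a time and record which byte span produced each piece of output, with errors='replace' keeping it best-effort on malformed input:

```python
import codecs

def byte_spans(path, encoding='utf-8'):
    """Yield (text, start, end) tuples: a decoded chunk of text and the
    half-open byte range [start, end) in the file that produced it.
    Malformed bytes come out as U+FFFD via errors='replace'."""
    decoder = codecs.getincrementaldecoder(encoding)(errors='replace')
    with open(path, 'rb') as f:
        start = pos = 0
        while True:
            b = f.read(1)                        # one byte at a time
            out = decoder.decode(b, final=not b) # flush on EOF
            pos += len(b)
            if out:                              # decoder emitted character(s)
                yield out, start, pos
                start = pos
            if not b:                            # EOF reached
                break
```

Note that decode() can occasionally emit more than one character at once (e.g. a replacement character for a truncated sequence immediately followed by the next valid character), which is why whole chunks are yielded with a shared byte span rather than strictly one character per tuple.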
Comments:

- `data = Path(...).read_bytes(); offsets = [(c, len(data[:i].decode())) for i, c in enumerate(data)]`? It is utf-8, if you know the string, you can compute the offsets anytime. – KamilCuk, Mar 10 at 16:17
- `\r\n` is returned as `\n`. – Mark Tolonen, Mar 10 at 16:20
- `.tell()` on the stream should work. It's considered an opaque number on a text stream but can be used to seek back to the specific character. – Mark Tolonen, Mar 10 at 16:32
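The .tell() suggestion from the last comment could be sketched as follows (my own illustration, not the commenter's code). The recorded positions are opaque cookies in text mode, so they are good for seeking back to a character but not for byte arithmetic; also, calling tell() per character is slow on large files:

```python
def char_positions(path, encoding='utf-8'):
    """Return a list of (char, cookie) pairs, where cookie is the stream
    position before the character; f.seek(cookie) returns to that character."""
    positions = []
    with open(path, encoding=encoding, errors='replace', newline='') as f:
        while True:
            pos = f.tell()     # opaque cookie, valid for seek() on this file
            c = f.read(1)
            if not c:
                break
            positions.append((c, pos))
    return positions
```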