
python - Read text from file while keeping byte offset - Stack Overflow


Using Python, I want to read text from a (utf-8 encoded) text file, but at the same time need to know the start and end position of each character in the file (in bytes). As there might be multi-byte characters, this isn't a 1:1 mapping.

I could read through the file character by character and keep a running offset, akin to offset += len(c.encode('utf-8')) (or whatever the encoding of the file is), but this seems a bit ad hoc, especially once (ignored) decoding errors come into play. Is there a standard way or library to do this? I imagine the result as a list of character-offset pairs, or a str plus a list of integers holding the offsets.
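The manual bookkeeping described above could be sketched roughly as follows. This is only an illustration under two assumptions: the input is valid UTF-8 (it decodes strictly and raises on malformed bytes), and the data is read as raw bytes so that no newline translation takes place. The function name `char_offsets` is made up for this example.

```python
def char_offsets(data: bytes):
    """Return (char, start_byte, end_byte) triples for valid UTF-8 input."""
    text = data.decode("utf-8")  # strict: raises UnicodeDecodeError on bad bytes
    triples = []
    offset = 0
    for ch in text:
        # A valid UTF-8 character re-encodes to exactly the bytes it came from.
        width = len(ch.encode("utf-8"))
        triples.append((ch, offset, offset + width))
        offset += width
    return triples


if __name__ == "__main__":
    for triple in char_offsets("Aü我".encode("utf-8")):
        print(triple)
```

The end offset is exclusive, so data[start:end] gives back the bytes of each character.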

EDIT: The context is a script that reads a (supposed) text file and returns the blocks of text that are "interesting" (e.g. a diff against a reference). However, the text file is untrusted (generated by student code), so the script needs to be robust against any malformed input. And yes, I could just fail as soon as there is non-UTF-8 input, but since this is used in a learning environment, I would like it to be best-effort.


asked Mar 10 at 15:52, edited Mar 11 at 18:26, by incaseoftrouble
  • What are the characters you are reading? Python should be fine to read multi byte unicode out of the box. – Markus Hirsimäki Commented Mar 10 at 15:59
  • Why not just read everything, and then compute the offsets for each characters after? Like: data = Path(...).read_data(); offsets = [(c, len(data[:i].decode())) for i, c in enumerate(data)]? It is utf-8, if you know the string, you can compute the offsets anytime. – KamilCuk Commented Mar 10 at 16:17
  • @KamilCuk Translation of newlines would throw that method off. In text files on Windows \r\n is returned as \n. – Mark Tolonen Commented Mar 10 at 16:20
  • Calling .tell() on the stream should work. It's considered an opaque number on a text stream but can be used to seek back to the specific character. – Mark Tolonen Commented Mar 10 at 16:32
  • @KamilCuk that would give me more or less the same (apart from newline shenanigans) but it still fails when you have e.g. decode errors in the file. Let me add context to the question. – incaseoftrouble Commented Mar 11 at 18:24

1 Answer


According to Methods of File Objects:

f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.
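As a small illustration of the binary-mode half of that statement, the sketch below writes a few bytes to a temporary file and records f.tell() after each read. The helper name `tell_after_each_read` is invented for this example; the text-mode case is deliberately left out because the number returned there is opaque.

```python
import tempfile


def tell_after_each_read(data: bytes, sizes):
    """Read chunks of the given sizes in binary mode; return f.tell() after each."""
    with tempfile.TemporaryFile() as f:  # binary read/write mode by default
        f.write(data)
        f.seek(0)
        positions = []
        for n in sizes:
            f.read(n)
            positions.append(f.tell())  # plain byte count from the file start
        return positions


if __name__ == "__main__":
    # 'A' is 1 byte, 'ü' is 2 bytes and '我' is 3 bytes in UTF-8.
    print(tell_after_each_read("Aü我".encode("utf-8"), [1, 2, 3]))
```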

So one could open the input file in binary mode and process the decoding manually to track the start and end bytes. Below is one way to do it with an IncrementalDecoder (may not handle all error conditions):

import codecs

# Manually writing UTF-8 characters with errors.
with open('input.txt', 'wb') as f:
    f.write('A'.encode())  # 1-byte UTF-8
    f.write('ü'.encode())  # 2-byte UTF-8
    f.write('\r\n'.encode())  # Windows line ending
    f.write('我'.encode())  # 3-byte UTF-8
    f.write('
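A different way to get best-effort byte spans, which is not the IncrementalDecoder approach described above, is to decode the whole file in one call with a custom error handler that records where each undecodable run starts and ends. The following is a minimal sketch under stated assumptions: the file fits in memory, every undecodable run is replaced by a single U+FFFD, and the names `decode_with_spans` and "record-spans" are hypothetical. It relies on codecs.register_error and on the start and end attributes of UnicodeDecodeError.

```python
import codecs


def decode_with_spans(data: bytes):
    """Decode UTF-8 best-effort.

    Returns (text, spans), where spans[i] is the (start, end) byte range
    in data that produced text[i]; end is exclusive.
    """
    bad = {}  # start offset of each undecodable run -> its end offset

    def record(exc):
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad[exc.start] = exc.end
        # Emit one replacement character and resume right after the bad run.
        return ("\ufffd", exc.end)

    # "record-spans" is a made-up handler name; registering again overwrites it.
    codecs.register_error("record-spans", record)
    text = data.decode("utf-8", errors="record-spans")

    spans = []
    pos = 0
    for ch in text:
        if pos in bad:
            end = bad[pos]  # replacement char: span of the bad bytes
        else:
            end = pos + len(ch.encode("utf-8"))  # valid char: its encoded width
        spans.append((pos, end))
        pos = end
    return text, spans


if __name__ == "__main__":
    raw = "Aü\r\n我".encode("utf-8") + b"\xff" + b"B"
    text, spans = decode_with_spans(raw)
    for ch, span in zip(text, spans):
        print(repr(ch), span)
```

Because the bytes are read in binary mode and decoded by hand, no newline translation happens, so \r and \n each keep their own one-byte span.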