python - lxml target interface splits data on non-ascii characters -- how can I get the whole string?

Here's a file test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<list>
  <entry>data</entry>
  <entry>Łódź</entry>
  <entry>data Łódź</entry>
</list>

and here's a simple python script to parse it into a list with lxml:

from lxml import etree

class ParseTarget:
    def __init__(self):
        self.entries = []
    def start(self, tag, attrib):
        pass
    def end(self, tag):
        pass
    def data(self, data):
        stripped = data.strip()  # avoid shadowing the built-in str
        if stripped != '':
            self.entries.append(data)
    def close(self):
        # Reset parser
        entries = self.entries
        self.entries = []
        # And return results
        return entries

target = etree.XMLParser(target=ParseTarget(),
                         # Including/removing this makes no difference
                         encoding='UTF-8')

tree = etree.parse("./test.xml", target)

# Expected value of tree:
# ['data', 'Łódź', 'data Łódź']
# Actual value of tree
# ['data', 'Łódź', 'data ', 'Łódź']
# What gives!!!?

As the comment says, I expected to end up with a list of three elements, but I get four. This is a minimal demonstration of a general problem: when an element's text contains non-ASCII characters preceded by at least one ASCII character, I get two strings instead of one, split at the point where the non-ASCII characters start.

I don't want this to happen (i.e. I want to just get a list of three strings). What should I do?

I'm using Python 3.11.2

asked Mar 31 at 20:05, edited Mar 31 at 21:51, by modallyFragile

1 Answer

You have to accumulate the text in data() and use the end handler to emit it and reset.

Explanation of Steps

With event-based parsing, the parser may deliver the text of a single element in several data() calls. Here the third <entry> (<entry>data Łódź</entry>) arrives in two pieces:

First: "data " (with a space at the end).
Second: "Łódź".

This is why the target has to collect the pieces and join them into "data Łódź" when the element ends:

from lxml import etree

class ParseTarget:
    def __init__(self):
        self.entries = []
        self.current_text = []

    def start(self, tag, attrib):
        # A new element begins: discard text collected so far
        self.current_text = []

    def end(self, tag):
        # Join every chunk received for this element into one string
        text = "".join(self.current_text)
        if text.strip():  # Skip whitespace-only text between elements
            self.entries.append(text)
        self.current_text = []  # Reset for the next element

    def data(self, data):
        # May be called several times per text node; keep every chunk,
        # including whitespace-only ones, so no spaces are lost
        self.current_text.append(data)

    def close(self):
        # Return the results and reset the target
        entries = self.entries
        self.entries = []
        return entries

parser = etree.XMLParser(target=ParseTarget(), encoding='UTF-8')

# With a parser target, parse() returns whatever close() returns
entries = etree.parse("./test.xml", parser)
print(entries)
# ['data', 'Łódź', 'data Łódź']
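
Where the parser cuts the text is not something the target controls, so it can help to check the accumulation logic without a parser at all. The sketch below is only an illustration: AccumulatingTarget and replay are hypothetical names, and the event list is a hand-written simulation of callbacks a parser might make (here with the text cut into three pieces, one of them a lone space), not output captured from lxml.

```python
class AccumulatingTarget:
    """Hypothetical parser target that joins text chunks per element."""

    def __init__(self):
        self.entries = []
        self.chunks = []

    def start(self, tag, attrib):
        self.chunks = []  # discard text seen before this element opened

    def end(self, tag):
        text = "".join(self.chunks)
        if text.strip():  # skip whitespace-only text between elements
            self.entries.append(text)
        self.chunks = []

    def data(self, data):
        self.chunks.append(data)  # may be only a fragment of the text

    def close(self):
        entries, self.entries = self.entries, []
        return entries


def replay(target, events):
    """Call the target's methods in order, the way a parser would."""
    for name, *args in events:
        getattr(target, name)(*args)
    return target.close()


# Simulated callbacks for <list><entry>data Łódź</entry></list>,
# with the entry text delivered in three pieces (an assumption).
events = [
    ("start", "list", {}),
    ("data", "\n  "),
    ("start", "entry", {}),
    ("data", "data"),
    ("data", " "),
    ("data", "Łódź"),
    ("end", "entry"),
    ("data", "\n"),
    ("end", "list"),
]
result = replay(AccumulatingTarget(), events)
print(result)  # ['data Łódź']
```

Because the chunks are joined in end() and the emptiness check is applied to the joined text, the result does not depend on how the text was split. An instance of this class should be usable as the target argument of etree.XMLParser in the same way as the target in the question.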
