
python - How do I remove escape characters from output of nltk.word_tokenize? - Stack Overflow


How do I get rid of non-printing (escaped) characters in the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and following its code examples, which say the output should consist only of words and punctuation; however, I'm still getting escape characters in the output.

Here's my code:

from __future__ import division   # needed for the book's Python 2 examples; harmless in Python 3
import nltk, re, pprint
from urllib.request import urlopen

url = ".txt"
raw = urlopen(url).read()    # urlopen(...).read() returns bytes in Python 3
raw = raw.decode('utf-8')    # convert bytes to str before tokenizing
tokens = nltk.word_tokenize(raw)
print(type(tokens))
print(len(tokens))
print(tokens[:10])

And the output, with the escapes visible in the first list item:

I've poked around online and suspect this may have to do with the fact that the book's sample code was written for Python 2, which has already caused me some encoding issues (I needed to add the decode line above to convert the downloaded bytes to a string). Am I on the right track? If not, what am I doing wrong?
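To illustrate the bytes-versus-string step I mean, here's a minimal, self-contained sketch. My guess (possibly wrong) is that the non-printing character I'm seeing is something like a byte-order mark surviving the decode; the sample bytes below are made up for illustration and are not the actual file contents:

```python
# Made-up sample: UTF-8 bytes that start with a byte-order mark (BOM),
# mimicking what a downloaded .txt file might begin with.
raw_bytes = b'\xef\xbb\xbfCrime and Punishment'

# Plain 'utf-8' keeps the BOM as the non-printing character U+FEFF ...
text = raw_bytes.decode('utf-8')
print(repr(text))    # '\ufeffCrime and Punishment'

# ... whereas 'utf-8-sig' strips it.
clean = raw_bytes.decode('utf-8-sig')
print(repr(clean))   # 'Crime and Punishment'
```

If that guess is right, the stray character would then end up glued to the first token, which would match what I'm seeing in the first list item.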

I'm using Python 3.12.1 on Windows 11.

Thanks in advance - please do let me know if I can provide any further helpful information.
