How do I get rid of non-printing (escaped) characters in the output of nltk.word_tokenize? I'm working through the book 'Natural Language Processing with Python' and following its code examples, which say the output should consist only of words and punctuation; however, I'm still seeing escape sequences in the output.
Here's my code:
from __future__ import division  # holdover from the book's Python 2 code; a no-op in Python 3
import nltk, re, pprint
from urllib.request import urlopen

url = ".txt"
raw = urlopen(url).read()    # fetch the raw bytes
raw = raw.decode('utf-8')    # decode bytes to str (the line I had to add; see below)

tokens = nltk.word_tokenize(raw)
print(type(tokens))
print(len(tokens))
print(tokens[:10])
And the output, with the escapes visible in the first list item:
I've poked around online and suspect this may have to do with the book's sample code being written for Python 2, which has already caused me some encoding issues (I needed to add the raw.decode('utf-8') line above to convert the downloaded bytes to a string). Am I on the right track? If not, what am I doing wrong?
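For reference, here's the variant I'm planning to try next. It assumes the stray character is a UTF-8 byte order mark (U+FEFF) at the start of the file, which a plain 'utf-8' decode keeps as the first character of the text but a 'utf-8-sig' decode strips (I haven't confirmed yet that a BOM is actually what I'm seeing):

from urllib.request import urlopen
import nltk

url = ".txt"
raw = urlopen(url).read()
# 'utf-8-sig' behaves like 'utf-8' but drops a leading BOM if one is present
raw = raw.decode('utf-8-sig')
# alternatively, strip it explicitly after a plain utf-8 decode:
# raw = raw.decode('utf-8').lstrip('\ufeff')
tokens = nltk.word_tokenize(raw)
print(tokens[:10])  # the first token should no longer start with '\ufeff'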
I'm using Python 3.12.1 on Windows 11.
Thanks in advance. Please let me know if I can provide any further information that would help.