
python - How do I remove escape characters from output of nltk.word_tokenize? - Stack Overflow


How do I get rid of non-printing (escaped) characters in the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and following its code examples, which say the output should consist only of words and punctuation; however, I'm still getting escape characters in the output.

Here's my code:

from __future__ import division   # needed for the book's Python 2 examples; harmless in Python 3
import nltk, re, pprint
from urllib.request import urlopen

url = ".txt"
raw = urlopen(url).read()    # urlopen(...).read() returns bytes in Python 3
raw = raw.decode('utf-8')    # convert bytes to str before tokenizing
tokens = nltk.word_tokenize(raw)
print(type(tokens))
print(len(tokens))
print(tokens[:10])

And the output, with the escapes visible in the first list item:

I've poked around online and suspect this may have to do with the fact that the book's sample code was written for Python 2, which has already caused me some encoding issues (I needed to add the decode line above to convert the downloaded bytes to a string). Am I on the right track? If not, what am I doing wrong?
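To illustrate the bytes-versus-string step I mean, here's a minimal, self-contained sketch. My guess (possibly wrong) is that the non-printing character I'm seeing is something like a byte-order mark surviving the decode; the sample bytes below are made up for illustration and are not the actual file contents:

```python
# Made-up sample: UTF-8 bytes that start with a byte-order mark (BOM),
# mimicking what a downloaded .txt file might begin with.
raw_bytes = b'\xef\xbb\xbfCrime and Punishment'

# Plain 'utf-8' keeps the BOM as the non-printing character U+FEFF ...
text = raw_bytes.decode('utf-8')
print(repr(text))    # '\ufeffCrime and Punishment'

# ... whereas 'utf-8-sig' strips it.
clean = raw_bytes.decode('utf-8-sig')
print(repr(clean))   # 'Crime and Punishment'
```

If that guess is right, the stray character would then end up glued to the first token, which would match what I'm seeing in the first list item.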

I'm using Python 3.12.1 on Windows 11.

Thanks in advance - please do let me know if I can provide any further helpful information.
