tokenize - How to detect out-of-vocabulary words in a prompt - Stack Overflow

I need to detect words that an LLM has no knowledge of, so that I can add a RAG-based definition of each such word to the prompt. For example:

"What is the best way to achieve slubalisme using the new fabridocium product?" should highlight slubalisme and fabridocium as unknown words.

What is the best way to achieve this?

What I've tried:

  • Tokenizer-based: checking whether the model's tokenizer splits the word into multiple pieces. This is not accurate, because the tokenizer also splits many known words into multiple pieces, so there are a lot of false positives.
  • Comparing against a vocabulary list: prone to spelling issues.
  • Prompting an LLM: works OK but is really inefficient.
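
To make the vocabulary-list attempt concrete, here is a minimal sketch of that approach with a fuzzy-matching step added, assuming the "spelling issue" is that misspelled known words get flagged as unknown. The names `KNOWN_WORDS` and `find_unknown_words` are hypothetical, the word list is a tiny stand-in for a real dictionary, and the `0.8` cutoff is an arbitrary starting point rather than a tuned value.

```python
import re
from difflib import get_close_matches

# Hypothetical stand-in for a real dictionary or domain word list.
KNOWN_WORDS = {
    "what", "is", "the", "best", "way", "to",
    "achieve", "using", "new", "product", "this",
}


def find_unknown_words(prompt, known_words=KNOWN_WORDS, cutoff=0.8):
    """Return words that are neither known nor close to a known word."""
    vocab = sorted(known_words)
    unknown = []
    for word in re.findall(r"[A-Za-z]+", prompt):
        lowered = word.lower()
        if lowered in known_words:
            continue  # exact match: the word is known
        if get_close_matches(lowered, vocab, n=1, cutoff=cutoff):
            continue  # near match: treat as a misspelling of a known word
        if word not in unknown:
            unknown.append(word)  # keep first-seen order, no duplicates
    return unknown


prompt = ("What is the best way to achieve slubalisme "
          "using the new fabridocium product ?")
print(find_unknown_words(prompt))
```

One trade-off of this design: a genuinely new word that happens to be spelled close to a known word would be skipped as a misspelling, so the cutoff may need adjusting for a given vocabulary.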