How to handle abbreviations in Embeddings for RAG? - Stack Overflow

This question came up while I was working for a client. Let's assume we want to build a RAG system with a knowledge base of internal chat messages, emails, etc. of a candy-producing company. Let's further assume that they use a lot of abbreviations for their products and for positions inside the company, such as stakeholders, and that these abbreviations appear only in their internal communication. A simple made-up example: instead of Snickers they may write Skrs, and they may refer to their stakeholders as TCP. This means no embedding model has seen these abbreviations before, since this data was not used to train the model.

How do embedding models in general deal with such abbreviations? Do they take them into account, or do they effectively ignore them and rely on the surrounding context? Let's take the example above:

  • "I like the new Skrs"

and

  • "I like the new TCP"

are almost identical apart from the abbreviation, but these two sentences might be relevant to two different departments. So when we put the embeddings of these two statements into a vector DB and run a similarity search on a user query such as "Did people like the new Snickers chocolate bar?", the vector DB might return both records. But the sentence "I like the new TCP" is irrelevant for that retrieval.

I know you could argue that you should do some metadata filtering in the first place and flag the topics with something like "chocolate_bar_topic" = True or False. But let's ignore this for my question.

My general questions are:

  1. Can embeddings easily handle abbreviations which they have never seen before, just by understanding them in a context?

  2. Would it make sense to preprocess the text before embedding it, for example by replacing the abbreviations or appending extra info to them? So, something like:

  • "I like the new chocolate bar" and "I like the new Stakeholder"

or

  • "I like the new Skrs (chocolate bar)" and "I like the new TCP (Stakeholder)"
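
To make the two preprocessing variants from question 2 concrete, here is a minimal sketch in Python. It assumes a hand-maintained glossary that maps each internal abbreviation to its expansion. The `GLOSSARY` dictionary and the function names `replace_abbreviations` and `annotate_abbreviations` are made up for illustration and are not part of any embedding library. Matching is case-sensitive and restricted to whole words.

```python
import re

# Hypothetical glossary; in a real system it would be maintained by the company.
GLOSSARY = {
    "Skrs": "chocolate bar",
    "TCP": "Stakeholder",
}

# Whole-word match, trying longer abbreviations first so that an abbreviation
# which is a prefix of another one cannot shadow it.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(abbr) for abbr in sorted(GLOSSARY, key=len, reverse=True))
    + r")\b"
)


def replace_abbreviations(text: str) -> str:
    """Variant 1: swap each known abbreviation for its expansion."""
    return _PATTERN.sub(lambda m: GLOSSARY[m.group(1)], text)


def annotate_abbreviations(text: str) -> str:
    """Variant 2: keep the abbreviation and append its expansion in parentheses."""
    return _PATTERN.sub(lambda m: f"{m.group(1)} ({GLOSSARY[m.group(1)]})", text)


if __name__ == "__main__":
    for sentence in ("I like the new Skrs", "I like the new TCP"):
        print(replace_abbreviations(sentence))
        print(annotate_abbreviations(sentence))
```

The output of either function would then be passed to the embedding model in place of the raw message. The second variant keeps the original token in the text, so a query that uses the abbreviation itself can still match on it.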