java - Initializing RAG using vectorstore without duplicates - Stack Overflow

I am creating a small application using Spring AI with MongoDB Atlas (a local Docker container) to store the RAG data.

I want to "seed" MongoDB with some content when the service starts. The content is a list of documents with metadata. The problem is that this content is inserted every time the application starts, and I have not found a way to prevent the insertion of duplicate data. I can't simply remove all data from the database, because I want to add data later that should persist across restarts, even when the presets are replaced with different or newer ones.

Right now I'm trying something like this:

    @Autowired
    public void init(VectorStore vectorStore) {
        List<Document> documents = List.of(
                new Document("Once there was a little Girl",
                        Map.of("type", "init", "pos", "1", "plot", "1")),
                new Document("The girls name was Mary",
                        Map.of("type", "init", "pos", "2", "plot", "1")),
                new Document("Once there was a little Boy",
                        Map.of("type", "init", "pos", "1", "plot", "2")),
                new Document("The boys name was Peter",
                        Map.of("type", "init", "pos", "2", "plot", "2")),
                new Document("Peter was a wild kid",
                        Map.of("type", "init", "pos", "3", "plot", "2"))
        );

        List<String> initIds = vectorStore.similaritySearch("type == 'init'")
                .stream().map(Document::getId).collect(Collectors.toList());
        vectorStore.delete(initIds);

        vectorStore.add(documents);
    }

This doesn't work because one document's metadata map is stored slightly differently (in MongoDB I can see that the order of the fields in its metadata map differs) and that row is not removed in the delete step. So with each start, this row is duplicated. The behaviour is quite stable: when I change the value of type from init to story, a different row escapes deletion. This drives me mad...
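One possibility worth ruling out, independent of the metadata field order: `similaritySearch(String)` treats its argument as query text to embed, not as a metadata filter, and a similarity search returns only the top-k closest matches. I believe the default top-k in Spring AI is 4, but that should be verified against the version in use. With five seed documents, that would leave exactly one undeleted on every start. The plain-Java sketch below only simulates that arithmetic; `topKSearch` and `deleteSearchHits` are hypothetical stand-ins, not Spring AI calls.

```java
import java.util.ArrayList;
import java.util.List;

public class TopKDeleteSketch {

    // Hypothetical stand-in for a similarity search: returns at most topK ids.
    static List<String> topKSearch(List<String> storedIds, int topK) {
        int limit = Math.min(topK, storedIds.size());
        return new ArrayList<>(storedIds.subList(0, limit));
    }

    // Deletes whatever the search returned and reports which ids are left over.
    static List<String> deleteSearchHits(List<String> storedIds, int topK) {
        List<String> remaining = new ArrayList<>(storedIds);
        remaining.removeAll(topKSearch(storedIds, topK));
        return remaining;
    }

    public static void main(String[] args) {
        List<String> seeded = List.of("doc1", "doc2", "doc3", "doc4", "doc5");
        // Five seed documents, at most four search hits: one document survives.
        System.out.println(deleteSearchHits(seeded, 4).size()); // prints 1
    }
}
```

If this is the cause, which row survives would depend on how close each document's embedding is to the query text, which would fit the observation that changing `init` to `story` lets a different row escape.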

I would like a way to provide initial data to the DB that may change as the service evolves, without filling the DB with additional trash that will presumably lead to problems later. (I assume that will be the case, but I'm not yet at a stage where I can verify that it will be a problem; nevertheless, it is annoying.)
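One direction that might fit this requirement is to give each seed document a deterministic id derived from its content and metadata, so that re-seeding targets the same ids instead of creating new ones. The sketch below is plain Java and makes no Spring AI calls; `SeedIds` and `seedId` are hypothetical names. It sorts the metadata keys before hashing, so the id does not depend on map ordering.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

public class SeedIds {

    // Derives a stable, name-based UUID from content plus metadata.
    // Copying into a TreeMap sorts the keys, so field order does not affect the id.
    static String seedId(String content, Map<String, Object> metadata) {
        String canonical = content + "|" + new TreeMap<String, Object>(metadata);
        return UUID.nameUUIDFromBytes(canonical.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String first = seedId("Once there was a little Girl",
                Map.of("type", "init", "pos", "1", "plot", "1"));
        String second = seedId("Once there was a little Girl",
                Map.of("plot", "1", "pos", "1", "type", "init"));
        // Same content and metadata in a different order yield the same id.
        System.out.println(first.equals(second)); // prints true
    }
}
```

Such an id could then be passed to the `Document` constructor that accepts an explicit id. Whether `vectorStore.add` overwrites an existing document with the same id, or whether the known ids have to be passed to `vectorStore.delete` first, is an assumption to check for the MongoDB Atlas store. Note also that a seed document whose text or metadata changes gets a new id, so the old version would still need to be removed explicitly.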

Has anyone solved something similar?
