
python - Issue in retrieval after adding new data in langchain-chroma vectordb #29499 - Stack Overflow


Example Code


# ----------- code to store data in vectordb ----------------
ext_to_loader = {
    '.csv': CSVLoader,
    '.json': JSONLoader,
    '.txt': TextLoader,
    '.pdf': PDFPlumberLoader,
    '.docx': Docx2txtLoader,
    '.pptx': PPTXLoader,
    '.xlsx': ExcelLoader,
    '.xls': ExcelLoader,
    'single_page_url':WebBaseLoader,
    'all_urls_from_base_url':  RecursiveUrlLoader,
    'directory': DirectoryLoader
}

def get_loader_for_extension(file_path):
    _, ext = os.path.splitext(file_path)
    loader_class = ext_to_loader.get(ext.lower())
    if loader_class:
        return loader_class(file_path)
    else:
        print(f"Unsupported file extension: {ext}")
        return None

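def normalize_documents(docs):
    # Flatten each document's page_content into a single string.
    normalized = []
    for doc in docs:
        content = doc.page_content
        if isinstance(content, str):
            normalized.append(content)
        elif isinstance(content, list):
            normalized.append('\n'.join(content))
        else:
            normalized.append('')
    return normalized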

def vectorestore_function(split_documents_with_metadata, user_vector_store_path):
    try:
        # Create vector store with metadata
        embeddings = OpenAIEmbeddings(
            model = "text-embedding-ada-002", 
            openai_api_key=OPENAI_API_KEY
        )

        vector_store = Chroma(
            embedding_function=embeddings, 
            persist_directory=user_vector_store_path
        )
        
        vector_store.add_documents(documents=split_documents_with_metadata)
        
        return vector_store
    except Exception as e:
        print(f'Error in vectorestore_function {str(e)}')

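loader = get_loader_for_extension(saved_file_path)
if loader is None:
    # get_loader_for_extension returns None for unsupported extensions
    raise ValueError(f"No loader available for {saved_file_path}")
docs = loader.load()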
normalized_docs = normalize_documents(docs)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)
split_docs = text_splitter.create_documents(normalized_docs)

split_documents_with_metadata = [
    Document(page_content=document.page_content, metadata={"user_id": user_id, "doc_id": document_id})
    for document in split_docs
]
vectorestore_function(
    split_documents_with_metadata, 
    user_vector_store_path
)
# Note: I use the same code above both to add new data and to update existing data.


# ----------- code for interaction with AI ----------------
def get_vector_store(user_vector_store_path):
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        openai_api_key=OPENAI_API_KEY
    )
    vectorstore = Chroma(
        embedding_function=embeddings,
        persist_directory=user_vector_store_path
    )
    return vectorstore

document_id_list = [str(document_id) if isinstance(document_id, int) else document_id for document_id in document_id_list]

user_vector_store_path = os.path.join(VECTOR_STORE_PATH, user_id)
vectorstore = get_vector_store(user_vector_store_path)

retriever = vectorstore.as_retriever()

current_threshold = 0.25
retrieved_docs = []  # default so the checks below still work if retrieval raises
try:
    # Configure filtering
    retriever.search_type = "similarity_score_threshold"
    retriever.search_kwargs = {
        "filter": {
            "$and": [
                {"user_id": user_id},
                {"doc_id": {"$in": document_id_list}}
            ]
        },
        "score_threshold": current_threshold,
        "k": 3
    }

    retrieved_docs = retriever.invoke(question)
except Exception as e:
    print(f'error: {str(e)}')

print(f"retrieved_docs : {retrieved_docs}")


if not retrieved_docs:
    return jsonify({'error': 'No relevant docs were retrieved.'}), 404
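
A side note on the filter: the storage code writes `doc_id` with whatever type `document_id` has, while the query code converts integer IDs to strings before filtering. To my understanding, Chroma metadata filters match on type as well as value, so routing both paths through one helper is a cheap way to rule out a mismatch. The following is only a minimal sketch. `normalize_doc_id`, `build_metadata`, and `build_filter` are hypothetical names, not part of LangChain or Chroma, and it assumes every ID can safely be stored as a string.

```python
def normalize_doc_id(document_id):
    """Return the ID as a string so writes and filters use one type."""
    return str(document_id)


def build_metadata(user_id, document_id):
    # Metadata attached to every chunk at write time.
    return {"user_id": user_id, "doc_id": normalize_doc_id(document_id)}


def build_filter(user_id, document_id_list):
    # Same filter shape as in the question, with IDs normalized the same way.
    return {
        "$and": [
            {"user_id": user_id},
            {"doc_id": {"$in": [normalize_doc_id(d) for d in document_id_list]}},
        ]
    }


# Sample values are made up for illustration.
metadata = build_metadata("u1", 42)
search_filter = build_filter("u1", [42, "43"])
```

With this, a chunk stored for document `42` carries `doc_id` as the string `"42"`, and the filter lists the same string, regardless of whether the caller passed an integer or a string.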

Error Message and Stack Trace (if applicable)

WARNING:langchain_core.vectorstores.base:No relevant docs were retrieved using the relevance score threshold 0.25
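
This warning is logged by the threshold step of the `similarity_score_threshold` search type: candidates whose relevance score is below `score_threshold` are dropped, and the warning appears when none remain. The sketch below imitates that step in plain Python to show that an empty result can mean either that no candidates were found at all or that candidates were found but scored too low. `apply_score_threshold` and the sample scores are invented for illustration; this is not LangChain's implementation.

```python
def apply_score_threshold(scored_docs, score_threshold, k):
    """Keep at most k (doc, score) pairs whose score is >= score_threshold."""
    top_k = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:k]
    return [(doc, score) for doc, score in top_k if score >= score_threshold]


# Made-up relevance scores in [0, 1]; higher means more similar.
candidates = [("old chunk", 0.31), ("new chunk", 0.12), ("other chunk", 0.05)]

# One candidate clears the 0.25 threshold.
kept = apply_score_threshold(candidates, score_threshold=0.25, k=3)

# Candidates exist, but none clears the threshold, so the result is empty.
none_kept = apply_score_threshold(candidates[1:], score_threshold=0.25, k=3)
```

Temporarily lowering the threshold, or calling `similarity_search_with_relevance_scores` on the vector store directly, should show which of the two cases applies to the newly added data.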

Description:
I’m facing an issue with my live server. When a new user is created, a new vector database is generated, and everything works fine. If I add more data, it gets stored in the vector database, but I’m unable to retrieve the newly added data.

Interestingly, this issue does not occur in my local environment—it only happens on the live server. To make the new data retrievable, I have to execute pm2 reload "id", as my application is running with PM2. However, if another user is in the middle of a conversation when I reload PM2, the socket connection gets disconnected, disrupting their session.

Tech Stack:
Flutter – Used for the mobile application
Node.js – Used for the back office
Python – Handles data extraction, vector database creation, and conversations
The file download, embedding creation, and vector database updates are handled using Celery.
The server is set up with Apache, and PM2 is used to manage the application process.

Issue:
New data is added to the vector database but cannot be retrieved until pm2 reload "id" is executed.
Reloading PM2 disconnects active socket connections, affecting ongoing user conversations.

What I Want to Achieve:
I want to ensure that the system works seamlessly when a user adds or updates data in the vector database. The new data should be immediately accessible for conversations without requiring a PM2 reload.

In the back office, I am using Socket.IO to send status updates:
socketio.emit('status', {'message': {
    "user_id": user_id,
    "document_id": document_id,
    "status": 200,
    "message": f"Document ID {document_id} processed successfully."
}}, room=room)

This message is successfully emitted, and users can start conversations after receiving it. However, I’m still facing the issue where newly added data is not retrievable until I reload PM2.

Question:
How can I ensure that the system updates the vector database dynamically without requiring a PM2 reload, while keeping active socket connections intact?
