
Python, filter vectors from Pinecone vector store based on a field saved in the metadata of these vectors - Stack Overflow


I have vectors stored in a Pinecone vector store; each vector represents the content of a PDF file:

Metadata:
hash_code: "d53d7ec8b0e66e9a83a97acda09edd3fe9867cadb42833f9bf5525cc3b89fe2d"
id: "cc54ffbe-9cba-4de9-9f30-a114e4c3c3fb"

I saved a new field in the metadata, the hash_code of the PDF content, to avoid adding the same file to the vector store again and again.
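For reference, the hash_code is computed along these lines (a minimal sketch, assuming SHA-256 over the raw file bytes; the helper name is my own):

```python
import hashlib

def content_hash(pdf_bytes: bytes) -> str:
    # Identical file contents always produce the same 64-character hex
    # digest, so it can serve as a stable deduplication key in metadata.
    return hashlib.sha256(pdf_bytes).hexdigest()
```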

To do that, I compute the hash codes of the new documents I want to add, then scan the existing vectors to see whether any of them already exist, and filter those out.

I'm using Python and tried code like the following, but haven't managed to achieve my goal yet:

First method:

def filter_existing_docs(index_name, docs):
    # Initialize the Pinecone index
    index = pinecone_client.Index(index_name)

    # Extract hash_codes from the docs list using the appropriate method for your Document objects
    hash_codes = [doc.metadata['hash_code'] for doc in docs]
    print("Hash Codes:", hash_codes)

    # Fetch by list of hash_codes (ensure hash_codes are valid ids)
    fetch_response = index.fetch(ids=hash_codes)
    print("Fetch Response:", fetch_response)

    # Get the existing hash_codes that are already in the Pinecone index
    existing_hash_codes = set(fetch_response.get('vectors', {}).keys())
    print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))

    # Filter out the docs that have already been added to Pinecone
    filtered_docs = [doc for doc in docs if doc.metadata['hash_code'] not in existing_hash_codes]
    print("2 -----------> Filtered Docs:", len(filtered_docs))

    return filtered_docs

Then I tried another approach:

def filter_existing_docs(index_name, docs):
    # Initialize the Pinecone index
    index = pinecone_client.Index(index_name)

    # Extract hash_codes from the docs list using the appropriate method for your Document objects
    hash_codes = [doc.metadata['hash_code'] for doc in docs]
    print("Hash Codes:", hash_codes)

    # Query Pinecone using `top_k` and search through the index
    query_response = index.query(
        top_k=100,  # Set a suitable `top_k` to return a reasonable number of documents
        include_metadata=True,
        # namespace=namespace
    )

    # Debug: Print the query response to see its structure
    print("Query Response:", query_response)

    # Extract the hash_codes of the existing documents in Pinecone
    existing_hash_codes = {item['metadata']['hash_code'] for item in query_response['matches']}
    print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))

    # Filter out the docs that have already been added to Pinecone based on hash_code
    filtered_docs = [doc for doc in docs if str(doc.metadata['hash_code']) not in existing_hash_codes]
    print("2 -----------> Filtered Docs:", len(filtered_docs))

    return filtered_docs

asked Jan 25 at 18:35 by zbeedatm

2 Answers
  1. Iterate through the new document hash codes
  2. Query Pinecone using each hash as a metadata filter
  3. If there are 0 results, add the corresponding file. Else, the file is already present, so skip it.
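The steps above can be sketched as follows, assuming `index` is an initialized Pinecone index and each doc exposes a `.metadata` dict with a `hash_code` key (the zero vector is a dummy, passed only because `query` requires a vector; we just check whether any match comes back):

```python
def filter_existing_docs(index, docs, dimension=1536):
    """Return only the docs whose hash_code is not yet in the index."""
    new_docs = []
    for doc in docs:
        # Step 2: query with the hash as a metadata filter.
        response = index.query(
            vector=[0.0] * dimension,
            top_k=1,
            filter={"hash_code": {"$eq": doc.metadata["hash_code"]}},
        )
        # Step 3: zero results -> the file is new, keep it; otherwise skip it.
        if not response["matches"]:
            new_docs.append(doc)
    return new_docs
```

Note this costs one query per new document, which is fine for modest batch sizes.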

You can create a naming convention for each chunk, like "doc1#hash", "doc2#hash".
You can also filter records based on id prefixes, e.g.:

    for ids in index.list(prefix='doc1#', namespace=''):
        print(ids)

You can use any prefix pattern you like, but make sure you use a consistent prefix pattern for all child records of a document.

For example, any of these delimiter styles works, as long as you apply it consistently:

    doc1#chunk1
    doc1_chunk1
    doc1___chunk1
    doc1:chunk1
    doc1chunk1

Reference: pinecone-docs/id-prefixes
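Putting the prefix convention to work, here is a sketch of a "does this document already have chunks?" check (assuming a current `pinecone` client, where `index.list(prefix=...)` yields pages of matching ids; the helper name is my own):

```python
def doc_exists(index, doc_name, namespace=""):
    # Chunk ids follow the convention '<doc_name>#<chunk>', so any id
    # returned for this prefix means the document was already upserted.
    for page in index.list(prefix=f"{doc_name}#", namespace=namespace):
        if page:
            return True
    return False
```

New chunks would then be upserted with ids like `f"{doc_name}#chunk{i}"` so the check stays reliable.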
