I have vectors stored in a Pinecone vector store; each vector represents the content of a PDF file:
Metadata: hash_code: "d53d7ec8b0e66e9a83a97acda09edd3fe9867cadb42833f9bf5525cc3b89fe2d" id: "cc54ffbe-9cba-4de9-9f30-a114e4c3c3fb"
I saved a new field in the metadata, the hash_code of the PDF content, to avoid adding the same file to the vector store again and again.
To do that, I compute the hash codes of the new documents I want to add, then scan the existing ones to find whether any of them already exist, and filter those out.
I'm using Python and tried the code below, but haven't managed to achieve my goal yet:
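For context, the hash_code shown above looks like a SHA-256 hex digest; a minimal sketch of how such a hash can be computed, assuming it is taken over the raw PDF bytes (the exact hashing scheme used here is an assumption):

```python
import hashlib

def content_hash(pdf_bytes: bytes) -> str:
    # Deterministic SHA-256 hex digest of the raw file bytes:
    # the same file always yields the same 64-character string.
    return hashlib.sha256(pdf_bytes).hexdigest()
```

Because the digest is deterministic, re-adding the same file produces the same hash_code, which is what makes metadata-based deduplication possible.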
First method:
def filter_existing_docs(index_name, docs):
    # Initialize the Pinecone index
    index = pinecone_client.Index(index_name)
    # Extract hash_codes from the docs' metadata
    hash_codes = [doc.metadata['hash_code'] for doc in docs]
    print("Hash Codes:", hash_codes)
    # Fetch by list of hash_codes (ensure hash_codes are valid ids)
    fetch_response = index.fetch(ids=hash_codes)
    print("Fetch Response:", fetch_response)
    # Get the hash_codes that are already in the Pinecone index
    existing_hash_codes = set(fetch_response.get('vectors', {}).keys())
    print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))
    # Filter out the docs that have already been added to Pinecone
    filtered_docs = [doc for doc in docs if doc.metadata['hash_code'] not in existing_hash_codes]
    print("2 -----------> Filtered Docs:", len(filtered_docs))
    return filtered_docs
Then I tried another approach:
def filter_existing_docs(index_name, docs):
    # Initialize the Pinecone index
    index = pinecone_client.Index(index_name)
    # Extract hash_codes from the docs' metadata
    hash_codes = [doc.metadata['hash_code'] for doc in docs]
    print("Hash Codes:", hash_codes)
    # Query Pinecone and search through the index
    query_response = index.query(
        top_k=100,  # Set a suitable top_k to return a reasonable number of documents
        include_metadata=True,
        # namespace=namespace
    )
    # Debug: Print the query response to see its structure
    print("Query Response:", query_response)
    # Extract the hash_codes of the existing documents in Pinecone
    existing_hash_codes = {item['metadata']['hash_code'] for item in query_response['matches']}
    print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))
    # Filter out the docs whose hash_code already exists in Pinecone
    filtered_docs = [doc for doc in docs if str(doc.metadata['hash_code']) not in existing_hash_codes]
    print("2 -----------> Filtered Docs:", len(filtered_docs))
    return filtered_docs
asked Jan 25 at 18:35 by zbeedatm
2 Answers
- Iterate through the new documents' hash codes.
- Query Pinecone using each hash as a metadata filter.
- If there are 0 results, add the corresponding file; otherwise, the file is already present, so skip it.
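The steps above can be sketched as a helper function. Note these are assumptions, not the asker's code: the field name hash_code comes from the question, while the function name and the dim parameter are hypothetical, and Pinecone's query requires a query vector even when you only care about the metadata filter, so a dummy vector of the index's dimension is passed:

```python
def find_new_hashes(index, hash_codes, dim=1536, namespace=""):
    # Return only the hashes with no matching record in the index.
    # `dim` must match the index's vector dimension (1536 is an assumption).
    new_hashes = []
    for h in hash_codes:
        res = index.query(
            vector=[0.0] * dim,  # dummy vector; only the filter matters here
            top_k=1,
            filter={"hash_code": {"$eq": h}},
            namespace=namespace,
        )
        if not res["matches"]:  # 0 results -> this file is new
            new_hashes.append(h)
    return new_hashes
```

Documents whose hashes come back from this helper are the only ones that need to be embedded and upserted.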
You can create a naming convention for each chunk, like "doc1#hash", "doc2#hash".
You can also list records based on id prefixes, e.g.:
    for ids in index.list(prefix='doc1#', namespace=''):
        print(ids)
You can use any prefix pattern you like, but make sure you use a consistent prefix pattern for all child records of a document.
For example:
doc1#chunk1
doc1_chunk1
doc1___chunk1
doc1:chunk1
doc1chunk1
Reference: pinecone-docs/id-prefixes
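A sketch of building such prefixed ids at upsert time (the chunk_ids helper and its chunk numbering are hypothetical, not part of the Pinecone API; the point is only that every record of a document shares one prefix):

```python
def chunk_ids(doc_id, num_chunks, sep="#"):
    # Build ids like "doc1#chunk1", "doc1#chunk2", ... so that all of a
    # document's records can later be found with index.list(prefix="doc1#").
    return [f"{doc_id}{sep}chunk{i}" for i in range(1, num_chunks + 1)]
```

With a consistent separator, checking whether a document already exists reduces to listing ids with its prefix and seeing whether anything comes back.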