I have vectors stored in a Pinecone vector store; each vector represents the content of a PDF file:
Metadata: hash_code: "d53d7ec8b0e66e9a83a97acda09edd3fe9867cadb42833f9bf5525cc3b89fe2d" id: "cc54ffbe-9cba-4de9-9f30-a114e4c3c3fb"
I saved a new field in the metadata, the hash_code of the PDF content, to avoid adding the same file to the vector store again and again.
To do that, I compute the hash codes of the new documents I want to add, then scan the existing ones to find whether any of them already exist, and filter those out.
I'm using Python and tried the code below, but haven't managed to achieve my goal yet:
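For context, the hash_code shown above looks like a SHA-256 hex digest; a minimal sketch of how such a hash can be computed, assuming it is taken over the raw PDF bytes (the exact hashing scheme used here is an assumption):

```python
import hashlib

def content_hash(pdf_bytes: bytes) -> str:
    # Deterministic SHA-256 hex digest of the raw file bytes:
    # the same file always yields the same 64-character string.
    return hashlib.sha256(pdf_bytes).hexdigest()
```

Because the digest is deterministic, re-adding the same file produces the same hash_code, which is what makes metadata-based deduplication possible.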
First method:
def filter_existing_docs(index_name, docs):
    # Initialize the Pinecone index
    index = pinecone_client.Index(index_name)
    # Extract hash_codes from the docs' metadata
    hash_codes = [doc.metadata['hash_code'] for doc in docs]
    print("Hash Codes:", hash_codes)
    # Fetch by list of hash_codes (ensure hash_codes are valid ids)
    fetch_response = index.fetch(ids=hash_codes)
    print("Fetch Response:", fetch_response)
    # Get the hash_codes that are already in the Pinecone index
    existing_hash_codes = set(fetch_response.get('vectors', {}).keys())
    print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))
    # Filter out the docs that have already been added to Pinecone
    filtered_docs = [doc for doc in docs if doc.metadata['hash_code'] not in existing_hash_codes]
    print("2 -----------> Filtered Docs:", len(filtered_docs))
    return filtered_docs
Then I tried another approach:
def filter_existing_docs(index_name, docs):
    # Initialize the Pinecone index
    index = pinecone_client.Index(index_name)
    # Extract hash_codes from the docs' metadata
    hash_codes = [doc.metadata['hash_code'] for doc in docs]
    print("Hash Codes:", hash_codes)
    # Query Pinecone and search through the index
    query_response = index.query(
        top_k=100,  # Set a suitable top_k to return a reasonable number of documents
        include_metadata=True,
        # namespace=namespace
    )
    # Debug: Print the query response to see its structure
    print("Query Response:", query_response)
    # Extract the hash_codes of the existing documents in Pinecone
    existing_hash_codes = {item['metadata']['hash_code'] for item in query_response['matches']}
    print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))
    # Filter out the docs whose hash_code already exists in Pinecone
    filtered_docs = [doc for doc in docs if str(doc.metadata['hash_code']) not in existing_hash_codes]
    print("2 -----------> Filtered Docs:", len(filtered_docs))
    return filtered_docs
asked Jan 25 at 18:35 by zbeedatm
2 Answers
- Iterate through the new documents' hash codes.
- Query Pinecone using each hash as a metadata filter.
- If there are 0 results, add the corresponding file; otherwise, the file is already present, so skip it.
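The steps above can be sketched as a helper function. Note these are assumptions, not the asker's code: the field name hash_code comes from the question, while the function name and the dim parameter are hypothetical, and Pinecone's query requires a query vector even when you only care about the metadata filter, so a dummy vector of the index's dimension is passed:

```python
def find_new_hashes(index, hash_codes, dim=1536, namespace=""):
    # Return only the hashes with no matching record in the index.
    # `dim` must match the index's vector dimension (1536 is an assumption).
    new_hashes = []
    for h in hash_codes:
        res = index.query(
            vector=[0.0] * dim,  # dummy vector; only the filter matters here
            top_k=1,
            filter={"hash_code": {"$eq": h}},
            namespace=namespace,
        )
        if not res["matches"]:  # 0 results -> this file is new
            new_hashes.append(h)
    return new_hashes
```

Documents whose hashes come back from this helper are the only ones that need to be embedded and upserted.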
You can create a naming convention for each chunk, like "doc1#hash", "doc2#hash".
You can also list records based on id prefixes, e.g.:
    for ids in index.list(prefix='doc1#', namespace=''):
        print(ids)
You can use any prefix pattern you like, but make sure you use a consistent prefix pattern for all child records of a document.
For example:
doc1#chunk1
doc1_chunk1
doc1___chunk1
doc1:chunk1
doc1chunk1
Reference: pinecone-docs/id-prefixes
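A sketch of building such prefixed ids at upsert time (the chunk_ids helper and its chunk numbering are hypothetical, not part of the Pinecone API; the point is only that every record of a document shares one prefix):

```python
def chunk_ids(doc_id, num_chunks, sep="#"):
    # Build ids like "doc1#chunk1", "doc1#chunk2", ... so that all of a
    # document's records can later be found with index.list(prefix="doc1#").
    return [f"{doc_id}{sep}chunk{i}" for i in range(1, num_chunks + 1)]
```

With a consistent separator, checking whether a document already exists reduces to listing ids with its prefix and seeing whether anything comes back.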