chromadb - InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 1024 - Stack Overflow

Every time I add my JSONL to a new chromadb collection, it says that my vector shape is 384 when it should be 1024.

Something seems to be going wrong with the chromadb insertion, but I can't figure out what.

I start with a JSONL file where each record has an embedding (size 1024), an id, metadata, and a document.

I have checked that every embedding in the file has length 1024.
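
A check along these lines (a sketch; the file name and keys are the ones used in the insertion code below) is how the dimensions can be confirmed:

import json

# Sketch: verify that every embedding in the JSONL file is 1024-dimensional.
with open('810_and_embeddings.jsonl', 'r') as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        dim = len(record['embedding'])
        if dim != 1024:
            print(f"Line {line_number}: embedding has {dim} dimensions")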

Then I check which chromadb collections exist (confirming there are none), create a collection, and insert my JSONL into it. But when I attempt to query, I get: InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 1024.
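
For context, Chroma's default embedding function (all-MiniLM-L6-v2) produces 384-dimensional vectors, which is the number appearing in the error; a query issued with query_texts and no precomputed embedding is embedded by that default model. A query that bypasses it by passing a precomputed vector looks roughly like this (a sketch only; query_vector is a placeholder for an embedding from the same 1024-dimensional model that produced the JSONL file):

import chromadb

# Sketch: query with a precomputed 1024-dim vector instead of query_texts,
# so Chroma's 384-dim default embedding model is never invoked.
client = chromadb.PersistentClient(path="./chromadb_store")
collection = client.get_collection(name="df_810")

# Placeholder: in practice this would come from the same 1024-dim model
# that generated the embeddings in the JSONL file.
query_vector = [0.0] * 1024

results = collection.query(
    query_embeddings=[query_vector],
    n_results=5,
)
print(results['ids'])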

I've deleted everything and restarted from scratch. What else should I try?
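
In case a stale collection created with a different dimensionality is being picked up, it can also be dropped explicitly before re-inserting (a sketch, using the client path and collection name from the code below):

import chromadb

# Sketch: remove the existing collection so the next create starts clean.
client = chromadb.PersistentClient(path="./chromadb_store")
client.delete_collection(name="df_810")  # raises an error if the collection does not exist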

This is the insertion code:

import chromadb
import json

# Initialize a persistent client that stores data under ./chromadb_store
client = chromadb.PersistentClient(path="./chromadb_store")

# Load the JSONL file
documents = []
embeddings = []
metadatas = []
ids = []

# Read JSONL and extract data
with open('810_and_embeddings.jsonl', 'r') as f:
    for line in f:
        data = json.loads(line)
        documents.append(data['document'])
        embeddings.append(data['embedding'])
        metadatas.append(data['metadata'])
        # Chroma requires string IDs, so convert before appending
        ids.append(str(data['id']))

# Create or load a collection
collection_name = "df_810"
try:
    collection = client.get_collection(name=collection_name)  # Try to load existing collection
    print(f"Collection '{collection_name}' loaded.")
except Exception as e:
    collection = client.create_collection(name=collection_name)  # If it doesn't exist, create it
    print(f"Collection '{collection_name}' created.")

# Set batch size for processing
batch_size = 100  # Adjust this based on memory constraints

# Function to insert data in batches
def insert_in_batches(documents, embeddings, metadatas, ids, batch_size):
    for i in range(0, len(documents), batch_size):
        # Get the batch slice
        batch_docs = documents[i:i + batch_size]
        batch_embeddings = embeddings[i:i + batch_size]
        batch_metadatas = metadatas[i:i + batch_size]
        batch_ids = ids[i:i + batch_size]
        
        # Add the batch to the collection
        collection.add(
            documents=batch_docs,
            embeddings=batch_embeddings,
            metadatas=batch_metadatas,
            ids=batch_ids
        )
        print(f"Inserted batch {i // batch_size + 1} of {len(documents) // batch_size + 1}")

# Insert data in batches
insert_in_batches(documents, embeddings, metadatas, ids, batch_size)

print("All documents and embeddings upserted successfully.")
