python - The FAISS indexing and the dataset indexing don't match

I'm trying to compute the recall after performing an HNSW search in FAISS. By recall, I mean the following metric:

Recall = TP / (TP + FN)

Where I consider an image as a True Positive (TP) if it appears in the top-10 search results, and as a False Negative (FN) if it does not.

However, I'm having trouble getting this to work correctly and I'm not sure what's going wrong. I'd appreciate some help.

From what I understand, FAISS generates sequential indexing numbers when indexing. So, if I load my dataset (let's take CIFAR-10 as an example) in a non-random order and index it sequentially, the nth index in FAISS should correspond to the nth index in my dataset.

However, when I compute recall using this approach, the values are strange. I’m also using a select parameter, as suggested in this Stack Overflow answer, but I'm still facing issues.

I am dealing with a dataset of 1M items, so I set the HNSW parameters carefully, following the Stack Overflow answer to this question:

Insert thousands of documents into a chroma db

The indexing code in Python is as follows:

import faiss
from tqdm import tqdm

index = faiss.IndexHNSWFlat(1280, 100, faiss.METRIC_L2)
index.hnsw.efSearch = 2000
index.hnsw.efConstruction = 800

# Add embeddings one batch at a time, in dataset order
for data in tqdm(dataloader, desc="SYSTEM : Indexing Embeddings FAISS", unit="batch", leave=False):
    index.add(data.embedding)

The retrieval code is below.

If the query filename appears in the top-10 results, TP is incremented; otherwise FN is incremented.

TP, FN = 0, 0

for data in tqdm(dataloader, unit="batch", leave=False):
    D, I = index.search(data.embedding, 10)  # I has shape (1, 10) for a single query
    filenames = [dataset[I[0][idx]].filename for idx in range(10)]
        
    if data.filename in filenames:
        TP += 1
    else:
        FN += 1

recall = TP / (TP + FN)
print(f"Recall= {round(recall, 3)}, TP = {TP}, FN = {FN}")

Currently, the dataloader's shuffle option is set to False:

dataloader = DataLoader(dataset, batch_size=1, shuffle=False)

The dataset is also set up properly. For example, it can be:

dataset = torchvision.datasets.CIFAR10

However, the output is Recall = 0 with TP = 0.

I have reviewed my code and carefully examined the output with the print function. The filename extracted from the dataloader is correctly stored in the filename variable, and the filenames in the filenames list are indeed the correct ones found by FAISS.
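Comparing integer ids instead of filenames removes one layer of indirection from this check. The sketch below assumes the queries are the indexed vectors themselves, in the same order, so the expected id for query `i` is `i`. `self_hit_rate` is a hypothetical helper, and the `I` array is a hand-made stand-in for the ids returned by `index.search`.

```python
import numpy as np

def self_hit_rate(I: np.ndarray) -> float:
    """Fraction of queries whose own position appears among the returned ids.

    I has shape (n_queries, k); row i holds the ids returned for query i.
    """
    expected = np.arange(I.shape[0])[:, None]   # column vector of query ids
    hits = (I == expected).any(axis=1)          # True where row i contains i
    return float(hits.mean())

# Hand-made example: queries 0 and 2 find themselves, query 1 does not.
I = np.array([[0, 5, 7],
              [4, 3, 9],
              [8, 2, 6]])
print(self_hit_rate(I))
```

If this id-based rate is high while the filename-based count stays at zero, the problem is probably in the filename comparison rather than in the search results.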

Just to be sure, I also fixed the seed with the following code:

import random
import numpy as np
import torch

seed = 111111
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)

However, the problem persists.

Why do the FAISS index numbers not match the sequential numbers extracted from the dataset?

Asked Feb 17 at 1:52 by No Yeah

1 Answer

To calculate recall, you need the ground truth for each query. That means computing the true top-10 neighbours of each query by brute force over the dataset. A sample of 100 or 1000 queries is enough; you don't need to do this for all N queries.
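A minimal sketch of such a brute-force ground truth, using only NumPy, might look like the following. The function name `brute_force_topk`, the random stand-in data, and the sample size are assumptions for illustration; squared L2 distance is used to match the `METRIC_L2` index in the question.

```python
import numpy as np

def brute_force_topk(xb: np.ndarray, xq: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k neighbour ids under squared L2 distance, shape (n_queries, k)."""
    # Pairwise squared distances via ||q||^2 - 2 q.b + ||b||^2
    d2 = (
        (xq ** 2).sum(axis=1, keepdims=True)
        - 2.0 * xq @ xb.T
        + (xb ** 2).sum(axis=1)
    )
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
xb = rng.random((1000, 32))                             # stand-in database vectors
sample = rng.choice(len(xb), size=100, replace=False)   # positions of sampled queries
Igt = brute_force_topk(xb, xb[sample], k=10)
print(Igt.shape)
```

This materialises a full queries-by-database distance matrix, so for a 1M-vector dataset it is only practical on a small query sample like the one suggested here.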

Try loading the data into HNSW all at once rather than batch by batch. Then get the 10 approximate nearest neighbours (ANN) and compare them with the IDs from the ground truth:

Recall@10 = |ANN top-10 from FAISS ∩ ground-truth top-10| / 10

>>> import numpy as np
>>> len(np.intersect1d(Igt, Ihnsw)) / n  # n = 10 in this case
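The one-liner above handles a single query. A sketch that averages the same quantity over several queries could look like this; `recall_at_k` is a hypothetical helper, and the id arrays are hand-made rather than real FAISS output.

```python
import numpy as np

def recall_at_k(Igt: np.ndarray, Iann: np.ndarray) -> float:
    """Mean over queries of (size of id intersection) / k.

    Igt and Iann both have shape (n_queries, k): ground-truth and ANN ids.
    """
    k = Igt.shape[1]
    per_query = [
        len(np.intersect1d(gt_row, ann_row)) / k
        for gt_row, ann_row in zip(Igt, Iann)
    ]
    return float(np.mean(per_query))

# Hand-made ids for two queries with k = 3.
Igt = np.array([[0, 1, 2], [3, 4, 5]])
Iann = np.array([[0, 2, 9], [5, 4, 3]])
print(recall_at_k(Igt, Iann))
```

The order of ids within a row does not matter here, only set membership, which is why the second query counts as a full match.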
