I'm trying to compute the recall after performing a HNSW search in FAISS. By recall, I mean the following metric:
Recall = TP / (TP + FN)
Where I consider an image as a True Positive (TP) if it appears in the top-10 search results, and as a False Negative (FN) if it does not.
However, I'm having trouble getting this to work correctly and I'm not sure what's going wrong. I'd appreciate some help.
From what I understand, FAISS generates sequential indexing numbers when indexing. So, if I load my dataset (let's take CIFAR-10 as an example) in a non-random order and index it sequentially, the nth index in FAISS should correspond to the nth index in my dataset.
However, when I compute recall using this approach, the values are strange. I’m also using a select parameter, as suggested in this Stack Overflow answer, but I'm still facing issues.
I am dealing with a dataset of 1M items, so I've set the HNSW parameters carefully, following the Stack Overflow answer to "Insert thousands of documents into a chroma db".
The indexing code in Python is as follows:
import faiss
from tqdm import tqdm

# 1280-dimensional vectors, M = 100 links per node, L2 metric
index = faiss.IndexHNSWFlat(1280, 100, faiss.METRIC_L2)
index.hnsw.efSearch = 2000
index.hnsw.efConstruction = 800
for data in tqdm(dataloader, desc="SYSTEM : Indexing Embeddings FAISS", unit="batch", leave=False):
    index.add(data.embedding)
The retrieval code is below. If the query's filename appears in the top-10 results, TP is incremented; otherwise FN is incremented.
TP, FN = 0, 0
for data in tqdm(dataloader, unit="batch", leave=False):
    D, I = index.search(data.embedding, 10)
    # I has shape (1, 10): one row of neighbor ids per query
    filenames = [dataset[I[0][idx]].filename for idx in range(10)]
    if data.filename in filenames:
        TP += 1
    else:
        FN += 1
recall = TP / (TP + FN)
print(f"Recall= {round(recall, 3)}, TP = {TP}, FN = {FN}")
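One way to narrow the problem down is to compare row numbers directly instead of filenames, which takes the dataset lookup out of the picture. The sketch below assumes that query `i` is the `i`-th vector that was added to the index and that all result rows have been stacked into one array `I` of shape `(n_queries, k)`; `self_recall` and `I_example` are hypothetical names used only for illustration.

```python
import numpy as np

def self_recall(I):
    """Fraction of queries whose own row number appears in their result row.

    Assumes query i is the i-th vector that was added to the index.
    """
    I = np.asarray(I)
    positions = np.arange(I.shape[0])[:, None]  # column vector 0..n-1
    hit = (I == positions).any(axis=1)          # True where row i contains i
    return float(hit.mean())

# Toy result matrix: rows 0 and 1 contain their own id, row 2 does not.
I_example = np.array([[0, 5, 7],
                      [3, 1, 2],
                      [9, 8, 4]])
print(self_recall(I_example))
```

If this id-based value is high while the filename-based recall stays at 0, the issue lies in translating ids to filenames rather than in the search.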
Currently, the dataloader's shuffle is set to False:
dataloader = DataLoader(dataset, batch_size=1, shuffle=False)
The dataset is also set up properly. For example, it can be:
dataset = torchvision.datasets.CIFAR10
However, the output says Recall = 0 with TP = 0.
I have reviewed my code and carefully examined the output with the print function. The filename extracted from the dataloader is correctly stored in the filename variable, and the filenames in the filenames list are indeed the correct ones found by FAISS.
Just to be sure, I also fixed the seed with the following code:
seed = 111111
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
However, the problem persists.
Why do the FAISS index numbers not match the sequential numbers extracted from the dataset?
Asked Feb 17 at 1:52 by No Yeah
1 Answer
You need to know the ground truth for each query in order to calculate recall. That means computing the true top-10 for each query from the dataset by brute force. A sample of 100 or 1000 queries is enough; you don't need to do this for all N queries you have.
Try loading the data into HNSW all at once rather than in batches. Then get the 10 approximate nearest neighbors (ANN) and compare them with the IDs you have from the ground truth:
Recall = |ANN-10 from FAISS ∩ Ground Truth-10| / 10
>>> import numpy as np
>>> len(np.intersect1d(Igt, Ihnsw)) / n  # n = 10 in this case
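The ground-truth step can be sketched with NumPy alone. The helper names `brute_force_topk` and `recall_at_k`, the random data, and the sizes below are assumptions made for illustration; on real data, `I_ann` would be the id matrix returned by the HNSW index for the same queries.

```python
import numpy as np

def brute_force_topk(xb, xq, k):
    """Exact k nearest neighbors (squared L2), returned as row ids of xb."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2
    d2 = (xq ** 2).sum(1, keepdims=True) - 2.0 * xq @ xb.T + (xb ** 2).sum(1)
    return np.argsort(d2, axis=1)[:, :k]

def recall_at_k(I_gt, I_ann):
    """Mean per-query fraction of ground-truth ids that the ANN search also found."""
    k = I_gt.shape[1]
    hits = [len(np.intersect1d(gt, ann)) for gt, ann in zip(I_gt, I_ann)]
    return float(np.mean(hits)) / k

rng = np.random.default_rng(0)
xb = rng.random((500, 8), dtype=np.float32)  # stand-in for the indexed embeddings
xq = xb[:20]                                 # a small sample of queries
I_gt = brute_force_topk(xb, xq, 10)

# With real data, pass the ids returned by the HNSW search as the second argument.
print(recall_at_k(I_gt, I_gt))
```

Brute force over a sample of queries keeps the cost manageable for a 1M-vector dataset, since only the sampled queries are compared against every stored vector.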