
Fastest way to search 5k rows inside of 100m row pair-wise dataframe


I am not sure the title describes the problem well, but I will explain it step by step.

I have a correlation matrix of genes (10k x 10k). I convert this correlation matrix to a pairwise dataframe (upper triangle), which has around 100m rows:

| gene1 | gene2 | score |
| --- | --- | --- |
| Gene3450 | Gene9123 | 0.999706 |
| Gene5219 | Gene9161 | 0.999691 |
| Gene27 | Gene6467 | 0.999646 |
| Gene3255 | Gene4865 | 0.999636 |
| Gene2512 | Gene5730 | 0.999605 |
| ... | ... | ... |


Then I have a gold-standard TERMS table of around 5k rows, whose columns are id, name, and used_genes:

| id | name | used_genes |
| --- | --- | --- |
| 1 | Complex 1 | [Gene3629, Gene8048, Gene9660, Gene4180, Gene1...] |
| 2 | Complex 2 | [Gene3944, Gene931, Gene3769, Gene7523, Gene61...] |
| 3 | Complex 3 | [Gene8236, Gene934, Gene5902, Gene165, Gene664...] |
| 4 | Complex 4 | [Gene2399, Gene2236, Gene8932, Gene6670, Gene2...] |
| 5 | Complex 5 | [Gene3860, Gene5792, Gene9214, Gene7174, Gene3...] |

What I do:

  • I iterate over each gold-standard complex row.

  • Convert the used_genes list to pairs, like geneA-geneB, geneA-geneC, etc. (see the sketch after this list).

  • Look up those complex-row gene pairs in the stacked correlation pairs.

  • If a pair exists I set the column TP=1, if not TP=0.

  • Based on the cumulative TP counts I calculate precision, recall, and the area under the curve score.
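
Put together, the per-complex logic boils down to the following hedged sketch (the stand-alone function pr_auc_for_complex is mine for illustration only; the real pipeline with all helper functions is shared further below). It assumes the pairwise dataframe is already sorted by score in descending order:

from itertools import combinations

import numpy as np
from sklearn import metrics

def pr_auc_for_complex(pairwise_df, used_genes):
    gene_set = set(used_genes)
    # Step 2: expand the complex into unordered gene pairs (geneA-geneB, geneA-geneC, ...).
    # Shown for illustration; the vectorised isin test below plays the same role.
    complex_pairs = {tuple(sorted(p)) for p in combinations(gene_set, 2)}

    # Steps 3-4: restrict to rows that touch the complex, then mark a row TP=1
    # when both of its genes belong to the complex.
    mask = pairwise_df["gene1"].isin(gene_set) | pairwise_df["gene2"].isin(gene_set)
    sub = pairwise_df[mask]
    tp = (sub["gene1"].isin(gene_set) & sub["gene2"].isin(gene_set)).to_numpy().astype(int)
    if tp.sum() == 0 or len(tp) < 2:
        return float("nan")

    # Step 5: cumulative TP counts -> precision/recall -> PR-AUC.
    cum_tp = tp.cumsum()
    precision = cum_tp / np.arange(1, len(tp) + 1)
    recall = cum_tp / cum_tp[-1]
    return metrics.auc(recall, precision)

Applied to each complex, this yields output rows like the following: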

| name | used_genes | auc_score |
| --- | --- | --- |
| Multisubunit ACTR coactivator complex | [CREBBP, KAT2B, NCOA3, EP300] | 0.001695 |
| Condensin I complex | [SMC4, NCAPH, SMC2, NCAPG, NCAPD2] | 0.009233 |
| BLOC-2 (biogenesis of lysosome-related anel...) | [HPS3, HPS5, HPS6] | 0.000529 |
| NCOR complex | [TBL1XR1, NCOR1, TBL1X, GPS2, HDAC3, CORO2A] | 0.000839 |
| BLOC-1 (biogenesis of lysosome-related anel...) | [DTNBP1, SNAPIN, BLOC1S6, BLOC1S1, BLOC1S5, BL...] | 0.002227 |

So, in the end, for each gold-standard row, I have a PR-AUC score.

I will share my function below. With the 100m-row stacked dataframe and 5k terms it takes around 25 minutes, and I am trying to find a way to reduce the time.

PS: for the PR-AUC calculation I have compiled C++ code; I just give the ordered TP numbers as input to the C++ function and get the score back, yet the runtime is the same. I guess the problem is the iteration part.

import numpy as np
from sklearn import metrics

def compute_per_complex_pr(corr_df, terms_df):

    pairwise_df = binary(corr_df)
    pairwise_df = quick_sort(pairwise_df).reset_index(drop=True)
    
    # Precompute a mapping from each gene to the row indices in the pairwise DataFrame where it appears.
    gene_to_pair_indices = {}
    for i, (gene_a, gene_b) in enumerate(zip(pairwise_df["gene1"], pairwise_df["gene2"])):
        gene_to_pair_indices.setdefault(gene_a, []).append(i)
        gene_to_pair_indices.setdefault(gene_b, []).append(i)
    
    # Initialize AUC scores (one for each complex) with NaNs.
    auc_scores = np.full(len(terms_df), np.nan)
    
    # Loop over each gene complex
    for idx, row in terms_df.iterrows():
        gene_set = set(row.used_genes)
   
        
        # Collect all row indices in the pairwise data where either gene belongs to the complex.
        candidate_indices = set()
        for gene in gene_set:
            candidate_indices.update(gene_to_pair_indices.get(gene, []))
        candidate_indices = sorted(candidate_indices)
        
        if not candidate_indices:
            continue
        
        # Select only the relevant pairwise comparisons.
        sub_df = pairwise_df.loc[candidate_indices]
        # A prediction is 1 if both genes in the pair are in the complex; otherwise 0.
        predictions = (sub_df["gene1"].isin(gene_set) & sub_df["gene2"].isin(gene_set)).astype(int)
        
        if predictions.sum() == 0:
            continue
        
        # Compute cumulative true positives and derive precision and recall.
        true_positive_cumsum = predictions.cumsum()
        precision = true_positive_cumsum / (np.arange(len(predictions)) + 1)
        recall = true_positive_cumsum / true_positive_cumsum.iloc[-1]
        
        if len(recall) < 2 or recall.iloc[-1] == 0:
            continue
        
        auc_scores[idx] = metrics.auc(recall, precision)
    
    # Add the computed AUC scores to the terms DataFrame.
    terms_df["auc_score"] = auc_scores
    return terms_df
    
    
 
def binary(corr):
    # Stack the correlation matrix into a long (gene1, gene2, score) dataframe
    # and keep only one copy of each unordered pair.
    stack = corr.stack().rename_axis(index=['gene1', 'gene2']).reset_index(name='score')
    stack = drop_mirror_pairs(stack)
    return stack
    
    
def quick_sort(df, ascending=False):
    # Sort the pairwise dataframe by score (descending by default).
    order = 1 if ascending else -1
    sorted_df = df.iloc[np.argsort(order * df["score"].values)].reset_index(drop=True)
    return sorted_df


def drop_mirror_pairs(df):
    # Normalise each pair so gene1 <= gene2, then drop duplicated (mirror) pairs.
    gene_pairs = np.sort(df[["gene1", "gene2"]].to_numpy(), axis=1)
    df.loc[:, ["gene1", "gene2"]] = gene_pairs
    df = df.loc[~df.duplicated(subset=["gene1", "gene2"], keep="first")]
    return df

To generate dummy data (correlation matrix and terms_df):

import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(0)

# -------------------------------
# Create the 10,000 x 10,000 correlation matrix
# -------------------------------
num_genes = 10000
genes = [f"Gene{i}" for i in range(num_genes)]


rand_matrix = np.random.uniform(-1, 1, (num_genes, num_genes))
corr_matrix = (rand_matrix + rand_matrix.T) / 2
np.fill_diagonal(corr_matrix, 1.0)

corr_df = pd.DataFrame(corr_matrix, index=genes, columns=genes)


num_terms = 5000
terms_list = []

for i in range(1, num_terms + 1):
    # Randomly choose a number of genes between 10 and 40 for this term
    n_genes = np.random.randint(10, 41)
    used_genes = np.random.choice(genes, size=n_genes, replace=False).tolist()
    term = {
        "id": i,
        "name": f"Complex {i}",
        "used_genes": used_genes
    }
    terms_list.append(term)

terms_df = pd.DataFrame(terms_list)

# Display sample outputs (for verification, you might want to show the first few rows)
print("Correlation Matrix Sample:")
print(corr_df.iloc[:5, :5])  # print a 5x5 sample

print("\nTerms DataFrame Sample:")
print(terms_df.head())

Then run the function: compute_per_complex_pr(corr_df, terms_df).

asked Mar 30 at 10:07 by Yasir, edited Mar 31 at 9:44
  • Can you provide a minimal reproducible example? It helps us to profile, test, and benchmark the code (especially between different answers). Without that, answers tend to compute different things, to contain bugs, and often to be sub-optimal. – Jérôme Richard Commented Mar 30 at 12:50
  • This cannot be answered; binary, quick_sort, min_complex_size, metrics are all missing. – Reinderien Commented Mar 30 at 15:17
  • Beyond that, do not add table screenshots. Add properly-formatted text tables in markdown. – Reinderien Commented Mar 30 at 15:18
  • @Reinderien yes, you're right. I have edited the function, added the helper functions, and changed the image table to a markdown table. – Yasir Commented Mar 30 at 16:35
  • @JérômeRichard sorry, my mistake again! I've added the missing function and now it is running. – Yasir Commented Mar 31 at 9:45

1 Answer


Several optimisations can be applied to compute_per_complex_pr.

First of all, pairwise_df.loc[candidate_indices] can be optimised by converting candidate_indices to a Numpy array and using iloc instead.

Actually, sorted(candidate_indices) is also a bit slow, because it operates on a set of Python integer objects, and code operating on pure-Python data structures is generally painfully slow (in CPython, the standard Python interpreter). Thus, we can convert the set to a Numpy array first and then sort it with Numpy, so the sort is significantly faster.
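
In isolation, these two changes look something like the snippet below (a toy, self-contained illustration; the tiny stand-in dataframe and the int64 dtype are my choices, not part of the original code):

import numpy as np
import pandas as pd

# Toy stand-ins for pairwise_df and the candidate index set (illustration only).
pairwise_df = pd.DataFrame({"gene1": ["Gene1", "Gene2", "Gene3", "Gene4"],
                            "gene2": ["Gene5", "Gene6", "Gene7", "Gene8"],
                            "score": [0.9, 0.8, 0.7, 0.6]})
candidate_indices = {2, 0, 3}

# Convert the Python set to a Numpy array once, sort it with Numpy,
# and index positionally with iloc instead of label-based loc.
idx = np.fromiter(candidate_indices, dtype=np.int64, count=len(candidate_indices))
idx.sort()
sub_df = pairwise_df.iloc[idx]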

The thing is, converting a set to Numpy is itself a bit slow, because even iterating over this data structure is already quite slow... One simple way to fix this issue is to use a native language (like C++ or Rust) so you never pay the cost of inherently slow pure-Python objects. An alternative solution is to use a fast bit-set package like bitarray. The bad news is that bitarray cannot directly operate on Numpy arrays, so it does not bring optimal performance (still because of the manipulation of slow pure-Python data structures), but it is at least significantly better than a naive set. We can create a bit-set of the right size with candidate_indices = bitarray(len(pairwise_df)) and then fill it rather quickly with candidate_indices[gene_to_pair_indices[gene]] = True.

The good news is that we no longer need to sort anything, because we can directly index the dataframe with a Numpy boolean array created from the bit-set. We can do that with unpackbits. However, the result is an array of uint8 items, so we use view(bool) to reinterpret the 0-1 values as booleans very cheaply. One side effect of this approach is that the array can be a bit bigger than expected, because bitarray packs bits into bytes, so the output length is a multiple of 8 (bytes being octets on all mainstream modern machines). We slice the array to keep only the relevant part.

Last but not least, it is better to operate on categorical data than on strings in this case. Indeed, strings are inherently slow because they are stored in pure-Python objects (yes, them again). They should be converted to categorical data when they are repeated many times or can be considered part of a limited, known set of strings. Categorical data are actually integers, with a table associating each integer with its string object. Besides speed, categorical data also take significantly less memory here.
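
As a quick illustration of what the categorical conversion does under the hood (a toy example, separate from the pipeline):

import pandas as pd

s = pd.Series(["Gene27", "Gene3450", "Gene27", "Gene3450", "Gene27"])
c = s.astype("category")
print(c.cat.categories)       # Index(['Gene27', 'Gene3450'], dtype='object')
print(c.cat.codes.tolist())   # [0, 1, 0, 1, 0] -- small integers replace the repeated strings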

Here is the final code:

import numpy as np
from bitarray import bitarray
from sklearn import metrics

def fast_compute_per_complex_pr(corr_df, terms_df):
    pairwise_df = binary(corr_df)
    pairwise_df = quick_sort(pairwise_df).reset_index(drop=True)
    pairwise_df['gene1'] = pairwise_df['gene1'].astype("category")
    pairwise_df['gene2'] = pairwise_df['gene2'].astype("category")
    
    # Precompute a mapping from each gene to the row indices in the pairwise DataFrame where it appears.
    gene_to_pair_indices = {}
    for i, (gene_a, gene_b) in enumerate(zip(pairwise_df["gene1"], pairwise_df["gene2"])):
        gene_to_pair_indices.setdefault(gene_a, []).append(i)
        gene_to_pair_indices.setdefault(gene_b, []).append(i)
    
    # Initialize AUC scores (one for each complex) with NaNs.
    auc_scores = np.full(len(terms_df), np.nan)
    
    # Loop over each gene complex
    for idx, row in terms_df.iterrows():
        gene_set = set(row.used_genes)

        # Collect all row indices in the pairwise data where either gene belongs to the complex.
        candidate_indices = bitarray(len(pairwise_df))
        candidate_indices.setall(False)  # bitarray(n) does not zero its bits, so clear them explicitly
        for gene in gene_set:
            if gene in gene_to_pair_indices:
                candidate_indices[gene_to_pair_indices[gene]] = True
        
        if not candidate_indices.any():
            continue
        
        # Select only the relevant pairwise comparisons.
        selected_rows = np.unpackbits(candidate_indices).view(bool)[:len(pairwise_df)]
        sub_df = pairwise_df.iloc[selected_rows]
        # A prediction is 1 if both genes in the pair are in the complex; otherwise 0.
        predictions = (sub_df["gene1"].isin(gene_set) & sub_df["gene2"].isin(gene_set)).astype(int)
        
        if predictions.sum() == 0:
            continue

        # Compute cumulative true positives and derive precision and recall.
        true_positive_cumsum = predictions.cumsum()
        precision = true_positive_cumsum / (np.arange(len(predictions)) + 1)
        recall = true_positive_cumsum / true_positive_cumsum.iloc[-1]
        
        if len(recall) < 2 or recall.iloc[-1] == 0:
            continue

        auc_scores[idx] = metrics.auc(recall, precision)
    
    # Add the computed AUC scores to the terms DataFrame.
    terms_df["auc_score"] = auc_scores
    return terms_df

This code seems to be about 2.5 times faster on my machine (an i5-9600KF CPU on Windows). Note that the code before the loop now takes 20% of the total time, since the loop itself is about 3 times faster.

To make this much faster, one way would be to parallelise the computation. That being said, I do not expect multithreading to be significantly faster because of the CPython GIL, and I do not expect multiprocessing to be a silver bullet because of data copying/pickling (which should result in much higher memory consumption). Thus, I think it would be better to first convert this part of the code to a native language like C++, and then parallelise the loop with tools like OpenMP. The result will be faster execution not only due to parallelism but also due to faster operations thanks to native execution.

One benefit of a native language is that you can directly operate on pairwise_df.iloc[selected_rows] without creating a new data structure. A skilled developer can even do the operation chunk by chunk in a SIMD-friendly way, so predictions can be computed much more efficiently. Similarly, np.unpackbits is not needed since you can directly iterate over the bit-array. This makes the operation more cache-friendly (so it can then scale better once running in parallel). Last but not least, note that most of the values in the bit-array are set to False with the current dataset (>99%). This means the data structure is rather sparse. One could replace the bit-array with something more memory-efficient at the expense of more complex code and slower sequential execution (but likely faster in parallel).
