I am not sure the title describes the problem well, so I will explain it step by step.
I have a correlation matrix of genes (10k x 10k). I convert this correlation matrix to a pairwise dataframe (upper triangle), around 100M rows:
gene1 | gene2 | score |
---|---|---|
Gene3450 | Gene9123 | 0.999706 |
Gene5219 | Gene9161 | 0.999691 |
Gene27 | Gene6467 | 0.999646 |
Gene3255 | Gene4865 | 0.999636 |
Gene2512 | Gene5730 | 0.999605 |
... | ... | ... |
Then I have a gold-standard TERMS table with around 5k rows; its columns are `id`, `name`, and `used_genes`:
id | name | used_genes |
---|---|---|
1 | Complex 1 | [Gene3629, Gene8048, Gene9660, Gene4180, Gene1...] |
2 | Complex 2 | [Gene3944, Gene931, Gene3769, Gene7523, Gene61...] |
3 | Complex 3 | [Gene8236, Gene934, Gene5902, Gene165, Gene664...] |
4 | Complex 4 | [Gene2399, Gene2236, Gene8932, Gene6670, Gene2...] |
5 | Complex 5 | [Gene3860, Gene5792, Gene9214, Gene7174, Gene3...] |
What I do:
I iterate over each gold-standard complex row.
I convert its `used_genes` list to pairs, like geneA-geneB, geneA-geneC, etc.
I check those complex-row gene pairs against the stacked correlation pairs.
If a pair exists, I set the column TP=1, otherwise TP=0.
Based on the TP counts I calculate precision, recall, and the area under the precision-recall curve (a sketch of this pair-labelling step is shown right below).
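A minimal sketch of that pair-labelling step for a single complex (illustrative only and not optimised; `label_true_positives` is a hypothetical helper, and `pairwise_df` is the stacked upper-triangle dataframe shown above, with each pair stored in sorted order):

```python
from itertools import combinations

def label_true_positives(pairwise_df, used_genes):
    # All unordered gene pairs of the complex: geneA-geneB, geneA-geneC, ...
    complex_pairs = {tuple(sorted(p)) for p in combinations(used_genes, 2)}
    # TP=1 if both genes of the stacked pair belong to the complex, else TP=0.
    return [1 if (g1, g2) in complex_pairs else 0
            for g1, g2 in zip(pairwise_df["gene1"], pairwise_df["gene2"])]
```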
name | used_genes | auc_score |
---|---|---|
Multisubunit ACTR coactivator complex | [CREBBP, KAT2B, NCOA3, EP300] | 0.001695 |
Condensin I complex | [SMC4, NCAPH, SMC2, NCAPG, NCAPD2] | 0.009233 |
BLOC-2 (biogenesis of lysosome-related anel...) | [HPS3, HPS5, HPS6] | 0.000529 |
NCOR complex | [TBL1XR1, NCOR1, TBL1X, GPS2, HDAC3, CORO2A] | 0.000839 |
BLOC-1 (biogenesis of lysosome-related anel...) | [DTNBP1, SNAPIN, BLOC1S6, BLOC1S1, BLOC1S5, BL...] | 0.002227 |
So, in the end, for each gold-standard row, I have a PR-AUC score.
I will share my function below. With the 100M-row stacked dataframe and 5k terms it takes around 25 minutes, and I am trying to find a way to reduce the time.
PS: for the PR-AUC calculation I have compiled C++ code; I just pass the ordered TP labels as input to the C++ function and it returns the score, yet the runtime is the same. I guess the problem is the iteration part.
```python
import numpy as np
import pandas as pd
from sklearn import metrics


def compute_per_complex_pr(corr_df, terms_df):
    pairwise_df = binary(corr_df)
    pairwise_df = quick_sort(pairwise_df).reset_index(drop=True)
    # Precompute a mapping from each gene to the row indices in the pairwise DataFrame where it appears.
    gene_to_pair_indices = {}
    for i, (gene_a, gene_b) in enumerate(zip(pairwise_df["gene1"], pairwise_df["gene2"])):
        gene_to_pair_indices.setdefault(gene_a, []).append(i)
        gene_to_pair_indices.setdefault(gene_b, []).append(i)
    # Initialize AUC scores (one for each complex) with NaNs.
    auc_scores = np.full(len(terms_df), np.nan)
    # Loop over each gene complex
    for idx, row in terms_df.iterrows():
        gene_set = set(row.used_genes)
        # Collect all row indices in the pairwise data where either gene belongs to the complex.
        candidate_indices = set()
        for gene in gene_set:
            candidate_indices.update(gene_to_pair_indices.get(gene, []))
        candidate_indices = sorted(candidate_indices)
        if not candidate_indices:
            continue
        # Select only the relevant pairwise comparisons.
        sub_df = pairwise_df.loc[candidate_indices]
        # A prediction is 1 if both genes in the pair are in the complex; otherwise 0.
        predictions = (sub_df["gene1"].isin(gene_set) & sub_df["gene2"].isin(gene_set)).astype(int)
        if predictions.sum() == 0:
            continue
        # Compute cumulative true positives and derive precision and recall.
        true_positive_cumsum = predictions.cumsum()
        precision = true_positive_cumsum / (np.arange(len(predictions)) + 1)
        recall = true_positive_cumsum / true_positive_cumsum.iloc[-1]
        if len(recall) < 2 or recall.iloc[-1] == 0:
            continue
        auc_scores[idx] = metrics.auc(recall, precision)
    # Add the computed AUC scores to the terms DataFrame.
    terms_df["auc_score"] = auc_scores
    return terms_df


def binary(corr):
    stack = corr.stack().rename_axis(index=['gene1', 'gene2']).reset_index(name='score')
    stack = drop_mirror_pairs(stack)
    return stack


def quick_sort(df, ascending=False):
    order = 1 if ascending else -1
    sorted_df = df.iloc[np.argsort(order * df["score"].values)].reset_index(drop=True)
    return sorted_df


def drop_mirror_pairs(df):
    gene_pairs = np.sort(df[["gene1", "gene2"]].to_numpy(), axis=1)
    df.loc[:, ["gene1", "gene2"]] = gene_pairs
    df = df.loc[~df.duplicated(subset=["gene1", "gene2"], keep="first")]
    return df
```
Dummy data (correlation matrix and `terms_df`) can be generated with:
```python
import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(0)

# -------------------------------
# Create the 10,000 x 10,000 correlation matrix
# -------------------------------
num_genes = 10000
genes = [f"Gene{i}" for i in range(num_genes)]
rand_matrix = np.random.uniform(-1, 1, (num_genes, num_genes))
corr_matrix = (rand_matrix + rand_matrix.T) / 2
np.fill_diagonal(corr_matrix, 1.0)
corr_df = pd.DataFrame(corr_matrix, index=genes, columns=genes)

num_terms = 5000
terms_list = []
for i in range(1, num_terms + 1):
    # Randomly choose a number of genes between 10 and 40 for this term
    n_genes = np.random.randint(10, 41)
    used_genes = np.random.choice(genes, size=n_genes, replace=False).tolist()
    term = {
        "id": i,
        "name": f"Complex {i}",
        "used_genes": used_genes
    }
    terms_list.append(term)
terms_df = pd.DataFrame(terms_list)

# Display sample outputs (for verification, you might want to show the first few rows)
print("Correlation Matrix Sample:")
print(corr_df.iloc[:5, :5])  # print a 5x5 sample
print("\nTerms DataFrame Sample:")
print(terms_df.head())
```
To run the function: `compute_per_complex_pr(corr_df, terms_df)`.
1 Answer
Several optimisations can be applied to `compute_per_complex_pr`.
First of all, `pairwise_df.loc[candidate_indices]` can be optimised by converting `candidate_indices` to a Numpy array and using `iloc` instead. Actually, `sorted(candidate_indices)` is also a bit slow, because it is a set of Python integer objects, and code operating on pure-Python data structures is generally painfully slow (in CPython, the standard Python interpreter). Thus, we can convert the set to a Numpy array first and then sort it with Numpy, so the sort can be significantly faster.
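A minimal, self-contained sketch of that change (toy dataframe and made-up indices; in the real function `candidate_indices` is the set built inside the loop):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for pairwise_df and the candidate set built in the loop.
pairwise_df = pd.DataFrame({"gene1": list("AABBC"), "gene2": list("BCCDD"),
                            "score": [.9, .8, .7, .6, .5]})
candidate_indices = {3, 0, 2}

# Materialise the Python set as a Numpy integer array, sort it natively,
# then select rows by position with iloc instead of label-based .loc.
candidate_indices = np.fromiter(candidate_indices, dtype=np.int64)
candidate_indices.sort()
sub_df = pairwise_df.iloc[candidate_indices]
print(sub_df)
```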
The thing is, converting a set to Numpy is a bit slow, because even iterating over this data structure is already quite slow... One simple way to fix this issue is to use a native language (like C++ or Rust) so as to never pay the cost of inherently slow pure-Python objects. An alternative solution is to use a fast bit-set package like `bitarray`. The bad news is that bitarray cannot directly operate on Numpy arrays, so it does not bring optimal performance (still because of the manipulation of slow pure-Python data structures), but it is at least significantly better than a naive `set`.

We can create a bit-set with the right size with `candidate_indices = bitarray(len(pairwise_df))` and then fill it rather quickly with `candidate_indices[gene_to_pair_indices[gene]] = True`. The good news is that we do not need to sort anything now, because we can directly index the dataframe with a Numpy boolean array created from the bit-set. We can do that with `unpackbits`. However, the result is an array of `uint8` items. We can use `view(bool)` to reinterpret the 0-1 values as booleans very cheaply. One side effect of this approach is that the array can be a bit bigger than expected, because bitarray packs bits into bytes, so the output length is a multiple of 8 (bytes are octets on all mainstream modern machines). We can slice the array to get the relevant part.
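Here is a minimal sketch of that selection step on a toy mask (the indices are made up; `n_rows` stands in for `len(pairwise_df)`):

```python
from bitarray import bitarray
import numpy as np

n_rows = 10                      # stands in for len(pairwise_df)
mask_bits = bitarray(n_rows)
mask_bits.setall(False)          # a fresh bitarray is not guaranteed to be zeroed
mask_bits[[1, 4, 7]] = True      # e.g. indices coming from gene_to_pair_indices

# unpackbits yields one uint8 per bit (padded to a multiple of 8);
# view(bool) reinterprets the 0/1 bytes, and the slice drops the padding.
selected_rows = np.unpackbits(mask_bits).view(bool)[:n_rows]
print(selected_rows)             # -> [False  True False False  True False False  True False False]
```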
Last but not least, it is better to operate on categorical data than on strings in this case. Indeed, strings are inherently slow because they are stored in pure-Python objects (yes, still them again). They should be converted to categorical data when they are repeated many times or can be considered part of a limited, known set of strings. Categorical data are actually integers with a table mapping each integer to a string object. Besides speed, categorical data also take significantly less memory here.
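As a small illustration of what the categorical conversion stores (hypothetical column; the exact memory savings depend on the data):

```python
import pandas as pd

# A hypothetical gene-name column with repeated values.
s = pd.Series(["Gene27", "Gene3450", "Gene27", "Gene9123"])
c = s.astype("category")
print(c.cat.categories.tolist())  # lookup table of unique strings: ['Gene27', 'Gene3450', 'Gene9123']
print(c.cat.codes.tolist())       # per-row integer codes into that table: [0, 1, 0, 2]
# On the 100M-row pairwise dataframe the per-row cost drops from a Python string
# object to a small integer code, which is both faster to compare and far smaller.
```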
Here is the final code:
```python
from bitarray import bitarray

# binary, quick_sort, np and metrics are the ones defined/imported in the question.
def fast_compute_per_complex_pr(corr_df, terms_df):
    pairwise_df = binary(corr_df)
    pairwise_df = quick_sort(pairwise_df).reset_index(drop=True)
    pairwise_df['gene1'] = pairwise_df['gene1'].astype("category")
    pairwise_df['gene2'] = pairwise_df['gene2'].astype("category")
    # Precompute a mapping from each gene to the row indices in the pairwise DataFrame where it appears.
    gene_to_pair_indices = {}
    for i, (gene_a, gene_b) in enumerate(zip(pairwise_df["gene1"], pairwise_df["gene2"])):
        gene_to_pair_indices.setdefault(gene_a, []).append(i)
        gene_to_pair_indices.setdefault(gene_b, []).append(i)
    # Initialize AUC scores (one for each complex) with NaNs.
    auc_scores = np.full(len(terms_df), np.nan)
    # Loop over each gene complex
    for idx, row in terms_df.iterrows():
        gene_set = set(row.used_genes)
        # Collect all row indices in the pairwise data where either gene belongs to the complex.
        candidate_indices = bitarray(len(pairwise_df))
        candidate_indices.setall(False)  # a new bitarray is not guaranteed to be zero-initialized
        for gene in gene_set:
            if gene in gene_to_pair_indices:
                candidate_indices[gene_to_pair_indices[gene]] = True
        if not candidate_indices.any():
            continue
        # Select only the relevant pairwise comparisons.
        selected_rows = np.unpackbits(candidate_indices).view(bool)[:len(pairwise_df)]
        sub_df = pairwise_df.iloc[selected_rows]
        # A prediction is 1 if both genes in the pair are in the complex; otherwise 0.
        predictions = (sub_df["gene1"].isin(gene_set) & sub_df["gene2"].isin(gene_set)).astype(int)
        if predictions.sum() == 0:
            continue
        # Compute cumulative true positives and derive precision and recall.
        true_positive_cumsum = predictions.cumsum()
        precision = true_positive_cumsum / (np.arange(len(predictions)) + 1)
        recall = true_positive_cumsum / true_positive_cumsum.iloc[-1]
        if len(recall) < 2 or recall.iloc[-1] == 0:
            continue
        auc_scores[idx] = metrics.auc(recall, precision)
    # Add the computed AUC scores to the terms DataFrame.
    terms_df["auc_score"] = auc_scores
    return terms_df
```
This code seems about 2.5 times faster on my machine (with an i5-9600KF CPU on Windows). Note that the code before the loop now takes 20% of the time, since the loop itself is about 3 times faster.
To make this much faster, one way would be to parallelise the computation. That being said, I do not expect multithreading to be significantly faster because of the CPython GIL, and I do not expect multiprocessing to be a silver bullet because of data copying/pickling (which should result in much higher memory consumption). Thus, I think it would be better to convert this part of the code to a native language like C++ first, and then parallelise the loop with tools like OpenMP. The result will be faster execution, not only due to parallelism but also thanks to faster native operations.
One benefit of a native language is that you can directly operate on `pairwise_df.iloc[selected_rows]` without creating a new data structure. A skilled developer can even do the operation chunk by chunk in a SIMD-friendly way so `predictions` can be computed much more efficiently. Similarly, `np.unpackbits` is not needed, since you can directly iterate over the bit-array. This makes the operation more cache friendly (so it can scale better once running in parallel). Last but not least, note that most of the values in the bit-array are set to False with the current dataset (>99%), so this data structure is rather sparse. One can replace the bit-array with something more memory efficient at the expense of more complex code and slower sequential execution (but likely faster execution in parallel).
`binary`, `quick_sort`, `min_complex_size`, `metrics` are all missing. – Reinderien, Mar 30 at 15:17