I want to add aliases to consecutive occurrences of the same gene name in column gene_id. If the gene_id value is unique, it should be unchanged.
Here is my example input:
import pandas as pd

df_genes_data = {'gene_id': ['g0', 'g1', 'g1', 'g2', 'g3', 'g4', 'g4', 'g4']}
df_genes = pd.DataFrame.from_dict(df_genes_data)
print(df_genes.to_string())
gene_id
0 g0
1 g1
2 g1
3 g2
4 g3
5 g4
6 g4
7 g4
and here is the desired output:
gene_id
0 g0
1 g1_TE1
2 g1_TE2
3 g2
4 g3
5 g4_TE1
6 g4_TE2
7 g4_TE3
Any ideas on how to do this? I've been looking for solutions but only found ways to count consecutive occurrences, not to label them with aliases.
EDIT:
I've tried to find gene_id values which occur more than once in my data:
rep = []
gene_list = df_genes['gene_id']
# compare each value with the one that follows it
for idx in range(0, len(gene_list) - 1):
    if gene_list[idx] == gene_list[idx + 1]:
        rep.append(gene_list[idx])
# drop repeated names
rep = list(set(rep))
print("Consecutive identical gene names are : " + str(rep))
but I have no idea how to add desired aliases to them.
Asked Mar 28 at 14:00 by emor, edited Mar 28 at 14:28
Comment: What have you tried? – Bhargav, Mar 28 at 14:02
6 Answers
Use shift + ne + cumsum to group the consecutive values, then groupby.transform('size') to identify the groups of more than one value, and groupby.cumcount to increment the name:
# Series as name for shorter reference
s = df_genes['gene_id']
# group consecutive occurrences
group = s.ne(s.shift()).cumsum()
# form group and save as "g" for efficiency
g = s.groupby(group)
# identify groups with more than 1 value
m = g.transform('size').gt(1)
# increment values
df_genes.loc[m, 'gene_id'] += '_TE'+g.cumcount().add(1).astype(str)
Output:
gene_id
0 g0
1 g1_TE1
2 g1_TE2
3 g2
4 g3
5 g4_TE1
6 g4_TE2
7 g4_TE3
Intermediates:
gene_id group m cumcount+1 suffix
0 g0 1 False 1
1 g1 2 True 1 _TE1
2 g1 2 True 2 _TE2
3 g2 3 False 1
4 g3 4 False 1
5 g4 5 True 1 _TE1
6 g4 5 True 2 _TE2
7 g4 5 True 3 _TE3
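The same steps can be wrapped in a small helper for reuse. This is only a sketch under assumptions: the function name add_run_aliases and its suffix parameter are hypothetical (they are not part of pandas), and the column is assumed to hold strings. It returns a new Series instead of modifying the frame in place.

```python
import pandas as pd


def add_run_aliases(s: pd.Series, suffix: str = "_TE") -> pd.Series:
    """Hypothetical helper: number the values that repeat consecutively."""
    # a new run starts wherever a value differs from the previous one
    run_id = s.ne(s.shift()).cumsum()
    runs = s.groupby(run_id)
    # True for rows that belong to a run of two or more values
    in_long_run = runs.transform("size").gt(1)
    # 1-based position of each row inside its run, as text
    position = runs.cumcount().add(1).astype(str)
    # keep single values unchanged, alias the rest
    return s.where(~in_long_run, s + suffix + position)


genes = pd.Series(["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"], name="gene_id")
aliased = add_run_aliases(genes)
print(aliased.tolist())
```

It could then be applied as df_genes['gene_id'] = add_run_aliases(df_genes['gene_id']).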
Another possible solution:
import numpy as np

df_genes['gene_id'] = np.where(
    # logical condition that detects whether gene_id needs editing
    (m := df_genes['gene_id'].eq(df_genes['gene_id'].shift())) |
    df_genes['gene_id'].eq(df_genes['gene_id'].shift(-1)),
    # if gene_id needs editing
    df_genes['gene_id'] + '_TE' +
    (m.cumsum() - m.cumsum().where(~m).ffill().fillna(0).astype(int) + 1)
    .astype(str),  # (m.cumsum() - ...) generates 1, 2, ... sequence
    # otherwise
    df_genes['gene_id'])
The solution uses eq with shift to identify consecutive duplicates, creates a mask m for tracking duplicates, then calculates positional indices using cumsum and resets counters at group boundaries via where + ffill. The positional integers are cast to strings with astype, and np.where conditionally appends _TE suffixes only to consecutive duplicates.
Output:
gene_id
0 g0
1 g1_TE1
2 g1_TE2
3 g2
4 g3
5 g4_TE1
6 g4_TE2
7 g4_TE3
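The counter reset in the np.where answer above may be easier to follow when each intermediate is computed separately. The sketch below only unpacks that one expression on the example data; the variable names total, at_run_start and position are made up for illustration.

```python
import pandas as pd

s = pd.Series(["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"])

# True where a value equals the one directly above it
m = s.eq(s.shift())
# running total of "repeat" rows over the whole column
total = m.cumsum()
# the total as it stood at the most recent non-repeat row (start of the current run)
at_run_start = total.where(~m).ffill().fillna(0).astype(int)
# the difference restarts at 0 for every run; +1 makes it 1-based
position = total - at_run_start + 1

steps = pd.DataFrame({"gene_id": s, "m": m, "total": total,
                      "at_run_start": at_run_start, "position": position})
print(steps)
```

On this input, position holds 1, 1, 2, 1, 1, 1, 2, 3, which is the number appended after _TE for the rows that sit in a run.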
Another option (a "run-length-encoding"-based approach) is to use itertools.groupby and itertools.chain:
import numpy as np
from itertools import groupby, chain
x = df_genes['gene_id']
df_genes['gene_id'] = x + np.array(list(chain.from_iterable([''] if (l:=len(list(g)))==1 else "_TE" + np.arange(1,l+1).astype(str) for _, g in groupby(x))))
and finally you will obtain
gene_id
0 g0
1 g1_TE1
2 g1_TE2
3 g2
4 g3
5 g4_TE1
6 g4_TE2
7 g4_TE3
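For readers who find the one-liner dense, the same run-length idea can be written as a plain loop with itertools.groupby and no NumPy. This is a sketch, not the answer's code: the function name alias_runs and its suffix parameter are hypothetical.

```python
from itertools import groupby


def alias_runs(values, suffix="_TE"):
    """Hypothetical helper: number each run of equal adjacent values."""
    out = []
    # groupby yields one (key, iterator) pair per run of adjacent equal values
    for key, run in groupby(values):
        n = sum(1 for _ in run)  # length of this run
        if n == 1:
            out.append(key)  # a single value stays unchanged
        else:
            out.extend(f"{key}{suffix}{i}" for i in range(1, n + 1))
    return out


genes = ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]
print(alias_runs(genes))
```

Because the result has one entry per input value, it can be assigned back with df_genes['gene_id'] = alias_runs(df_genes['gene_id']).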
Do the genes have to be strictly consecutive (i.e. adjacent)? If not, you can get the duplicated genes, then for each of them get all the rows that match it, and loop over them to add a suffix:
import pandas as pd

df_genes_data = {"gene_id": ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]}
df_genes = pd.DataFrame.from_dict(df_genes_data)
print(df_genes.to_string())

# names that appear more than once anywhere in the column
duplicated_genes = df_genes[df_genes["gene_id"].duplicated()]["gene_id"].unique()
for gene in duplicated_genes:
    # all rows carrying this name, adjacent or not
    df_gene = df_genes[df_genes["gene_id"] == gene]
    for i, (idx, row) in enumerate(df_gene.iterrows()):
        df_genes.loc[idx, "gene_id"] = row["gene_id"] + f"_TE{i+1}"
print(df_genes)
out:
gene_id
0 g0
1 g1_TE1
2 g1_TE2
3 g2
4 g3
5 g4_TE1
6 g4_TE2
7 g4_TE3
If they have to be strictly adjacent, then the answer would change.
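The two readings only differ when a name reappears after a gap, which the example input never does. The sketch below uses a different, made-up input to show the difference. It uses vectorized pandas calls rather than the loop above, and the names by_value and by_run are illustrative.

```python
import pandas as pd

# made-up input in which g1 reappears after a gap
s = pd.Series(["g1", "g2", "g1", "g1"])

# non-adjacent reading: every value that occurs more than once is numbered
dup = s.duplicated(keep=False)
by_value = s.where(~dup, s + "_TE" + s.groupby(s).cumcount().add(1).astype(str))

# adjacent reading: only runs of neighbouring equal values are numbered
run_id = s.ne(s.shift()).cumsum()
runs = s.groupby(run_id)
in_run = runs.transform("size").gt(1)
by_run = s.where(~in_run, s + "_TE" + runs.cumcount().add(1).astype(str))

print(by_value.tolist())
print(by_run.tolist())
```

Under the non-adjacent reading the first g1 becomes g1_TE1, while under the adjacent reading it is left unchanged because it has no equal neighbour.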
(
    df_genes.assign(
        result=lambda d: d.groupby(
            df_genes.gene_id.ne(df_genes.gene_id.shift()).fillna(True).cumsum()
        ).gene_id.transform(
            lambda x: (
                x if len(x) == 1 else x.notna().cumsum().astype(str).radd("_TE").radd(x)
            )
        )
    )
)
import pandas as pd
df = pd.DataFrame({'id': ['g0', 'g1', 'g1', 'g2', 'g3', 'g4', 'g4', 'g4']})
aa = (df['id'] != df['id'].shift()).cumsum()
bb = df.groupby(aa)['id'].transform(
    lambda x: x if len(x) == 1 else [f"{x.iloc[0]}_TE{i+1}" for i in range(len(x))]
)
print(bb)
'''
0 g0
1 g1_TE1
2 g1_TE2
3 g2
4 g3
5 g4_TE1
6 g4_TE2
7 g4_TE3
Name: id, dtype: object
'''