python - How to add aliases to consecutive occurrences in column?

I want to add aliases to consecutive occurrences of the same gene name in column gene_id. If the gene_id value is unique, it should be unchanged.

Here is my example input:

import pandas as pd

df_genes_data = {'gene_id': ['g0', 'g1', 'g1', 'g2', 'g3', 'g4', 'g4', 'g4']}
df_genes = pd.DataFrame.from_dict(df_genes_data)
print(df_genes.to_string())

  gene_id
0      g0
1      g1
2      g1
3      g2
4      g3
5      g4
6      g4
7      g4

and here is the desired output:

  gene_id
0      g0
1  g1_TE1
2  g1_TE2
3      g2
4      g3
5  g4_TE1
6  g4_TE2
7  g4_TE3

Any ideas on how to do this? I've been looking for solutions but have only found ways to count consecutive occurrences, not to label them with aliases.

EDIT:

I've tried to find gene_id values which occur more than once in a row in my data:

rep = []
gene_list = df_genes['gene_id']
for idx in range(0, len(gene_list) - 1):
    if gene_list[idx] == gene_list[idx + 1]:
        rep.append(gene_list[idx])
rep = list(set(rep))
print("Consecutive identical gene names are : " + str(rep))

but I have no idea how to add desired aliases to them.

asked Mar 28 at 14:00 by emor, edited Mar 28 at 14:28

What have you tried? – Bhargav, commented Mar 28 at 14:02

6 Answers

Use shift + ne + cumsum to group the consecutive values, then groupby.transform('size') to identify the groups with more than 1 value, and groupby.cumcount to number the rows within each group:

# Series as name for shorter reference
s = df_genes['gene_id']
# group consecutive occurrences
group = s.ne(s.shift()).cumsum()
# form group and save as "g" for efficiency
g = s.groupby(group)
# identify groups with more than 1 value
m = g.transform('size').gt(1)
# increment values
df_genes.loc[m, 'gene_id'] += '_TE'+g.cumcount().add(1).astype(str)

Output:

  gene_id
0      g0
1  g1_TE1
2  g1_TE2
3      g2
4      g3
5  g4_TE1
6  g4_TE2
7  g4_TE3

Intermediates:

  gene_id  group      m  cumcount+1 suffix
0      g0      1  False           1       
1      g1      2   True           1   _TE1
2      g1      2   True           2   _TE2
3      g2      3  False           1       
4      g3      4  False           1       
5      g4      5   True           1   _TE1
6      g4      5   True           2   _TE2
7      g4      5   True           3   _TE3
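The steps above can also be wrapped in a small reusable function. This is only a sketch: the name add_run_aliases and its tag parameter are made up for illustration, and it returns a new Series instead of modifying the DataFrame in place.

```python
import pandas as pd

def add_run_aliases(s: pd.Series, tag: str = "_TE") -> pd.Series:
    """Hypothetical helper: suffix values in runs longer than 1 with tag1, tag2, ..."""
    run_id = s.ne(s.shift()).cumsum()       # new id every time the value changes
    g = s.groupby(run_id)
    in_run = g.transform("size").gt(1)      # True for rows that belong to a run of 2+
    suffix = tag + g.cumcount().add(1).astype(str)
    # keep the original value outside runs, append the suffix inside runs
    return s.where(~in_run, s + suffix)

df_genes = pd.DataFrame({"gene_id": ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]})
result = add_run_aliases(df_genes["gene_id"])
print(result.tolist())
```

Assigning the result back with df_genes['gene_id'] = result gives the same frame as the in-place version.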

Another possible solution:

import numpy as np

df_genes['gene_id'] = np.where(
    # condition: the row equals the previous row (m) or the next row
    (m := df_genes['gene_id'].eq(df_genes['gene_id'].shift())) | 
    df_genes['gene_id'].eq(df_genes['gene_id'].shift(-1)), 

    # if gene_id needs a suffix
    df_genes['gene_id'] + '_TE' + 
    (m.cumsum() - m.cumsum().where(~m).ffill().fillna(0).astype(int) + 1)
    .astype(str), # (m.cumsum() - ...) generates the 1, 2, ... sequence per run

    # otherwise keep gene_id unchanged
    df_genes['gene_id'])

The solution uses eq to identify consecutive duplicates via shift, creates a mask m for tracking duplicates, then calculates positional indices using cumsum and resets counters at group boundaries via where + ffill. The positional integers are cast to strings with astype, and np.where conditionally appends _TE suffixes only to consecutive duplicates.

Output:

  gene_id
0      g0
1  g1_TE1
2  g1_TE2
3      g2
4      g3
5  g4_TE1
6  g4_TE2
7  g4_TE3
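To make the counter-reset step in the explanation easier to follow, here is the same arithmetic split into named intermediates. The names running, baseline and position are only illustrative and do not appear in the answer itself.

```python
import pandas as pd

s = pd.Series(["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"])

m = s.eq(s.shift())        # True where a row repeats the previous row
running = m.cumsum()       # running count of repeats: 0 0 1 1 1 1 2 3
# value of the running count at the latest non-repeat row, carried forward
baseline = running.where(~m).ffill().fillna(0).astype(int)   # 0 0 0 1 1 1 1 1
position = running - baseline + 1                            # 1 1 2 1 1 1 2 3

print(position.tolist())
```

The position is computed for every row, including unique ones; the np.where condition is what restricts the suffix to rows that actually sit in a run.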

Another option is a run-length-encoding approach using itertools.groupby and itertools.chain:

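from itertools import groupby, chain

x = df_genes['gene_id']
# per run of equal values: '' for a single value, else '_TE1', '_TE2', ...
suffixes = chain.from_iterable(
    [''] if (n := len(list(g))) == 1 else [f'_TE{i + 1}' for i in range(n)]
    for _, g in groupby(x))
df_genes['gene_id'] = x + pd.Series(list(suffixes), index=x.index)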

and finally you will obtain

  gene_id
0      g0
1  g1_TE1
2  g1_TE2
3      g2
4      g3
5  g4_TE1
6  g4_TE2
7  g4_TE3
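If the gene names are available as a plain list, the same run-length idea works without pandas or NumPy. A minimal sketch, where label_runs is a hypothetical function name chosen for this example:

```python
from itertools import groupby

def label_runs(values, tag="_TE"):
    """Hypothetical helper: suffix each value in a run of 2+ equal neighbours."""
    out = []
    for key, run in groupby(values):        # groupby yields one group per run
        n = len(list(run))
        if n == 1:
            out.append(key)                 # a single occurrence stays unchanged
        else:
            out.extend(f"{key}{tag}{i}" for i in range(1, n + 1))
    return out

genes = ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]
labelled = label_runs(genes)
print(labelled)
```

The returned list can be assigned back to the column as long as its length matches the DataFrame.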

Do the genes have to be strictly consecutive (i.e. adjacent)? If not:

You can get the duplicated genes, then for each of them get all the rows that match it, and loop over those rows to add a suffix.


import pandas as pd

df_genes_data = {"gene_id": ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]}
df_genes = pd.DataFrame.from_dict(df_genes_data)
print(df_genes.to_string())

# each repeated gene name once, so every gene is relabelled in a single pass
duplicated_genes = df_genes.loc[df_genes["gene_id"].duplicated(), "gene_id"].unique()
for gene in duplicated_genes:
    df_gene = df_genes[df_genes["gene_id"] == gene]
    for i, (idx, row) in enumerate(df_gene.iterrows()):
        df_genes.loc[idx, "gene_id"] = row["gene_id"] + f"_TE{i+1}"

print(df_genes)

out:

  gene_id
0      g0
1  g1_TE1
2  g1_TE2
3      g2
4      g3
5  g4_TE1
6  g4_TE2
7  g4_TE3

If they have to be strictly adjacent, the answer would change, because this approach also numbers repeats that are separated by other genes.
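The difference only shows up when a gene reappears after a gap, which the example data never does. The sketch below uses a made-up input and vectorized stand-ins for both behaviours (not the loop above) to compare them side by side.

```python
import pandas as pd

# made-up input: g1 appears once, then again later as an adjacent pair
s = pd.Series(["g1", "g2", "g1", "g1"])

# "repeated anywhere": number every occurrence of a value seen more than once
repeated = s.duplicated(keep=False)
by_value = s.where(~repeated, s + "_TE" + s.groupby(s).cumcount().add(1).astype(str))

# "strictly adjacent": number only rows inside a run of equal neighbours
run_id = s.ne(s.shift()).cumsum()
g = s.groupby(run_id)
in_run = g.transform("size").gt(1)
by_run = s.where(~in_run, s + "_TE" + g.cumcount().add(1).astype(str))

print(by_value.tolist())
print(by_run.tolist())
```

With the question's data both variants give the same result, so the choice only matters if non-adjacent repeats can occur.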

(
    df_genes.assign(
        # group consecutive equal values, then suffix groups longer than 1
        result=lambda d: d.groupby(
            d.gene_id.ne(d.gene_id.shift()).cumsum()
        ).gene_id.transform(
            lambda x: (
                x if len(x) == 1 else x.notna().cumsum().astype(str).radd("_TE").radd(x)
            )
        )
    )
)

import pandas as pd

df = pd.DataFrame({'id': ['g0', 'g1', 'g1', 'g2', 'g3', 'g4', 'g4', 'g4']})

aa = (df['id'] != df['id'].shift()).cumsum()

bb = df.groupby(aa)['id'].transform(
    lambda x: x if len(x) == 1 else [f"{x.iloc[0]}_TE{i+1}" for i in range(len(x))]
)
print(bb)

'''
0        g0
1    g1_TE1
2    g1_TE2
3        g2
4        g3
5    g4_TE1
6    g4_TE2
7    g4_TE3
Name: id, dtype: object
'''
