pandas - How to change the font color or highlight a string in between to asterisks in python?

My df has 19 columns but for simplicity the df will have the following columns [['Gene_name','Genes_in_same_transcription_unit']]

The column 'Gene_name' list a bunch of genes in E. coli and the corresponding entry in the column 'Genes_in_same_transcription_unit' lists the genes that are under the same promoter. An entry in this column contains multiple genes in a list that are separated by a comma and a space (example: pphA, sdsR). I want to know if there is a gene that is listed in the 'Genes_in_same_transcription_unit' that isn't in the 'Gene_name' column and for it to be highlighted or the font color for that gene, and only that gene, be in red or any other color beside black. My ultimate goal is to be able to quickly look at the column 'Genes_in_same_transcription_unit' and see if a whole operon of genes have been affected or if only 1 or 2 genes are affected and where they are in the operon (first gene, middle, or last).

Below is an example of the modified dataframe

This is what I want to happen

I have tried a bunch of stuff and nothing seems to work. Below is the closest I have gotten to getting what I want. With the code below I can make a new column called 'Highlighted_Genes' where the genes that should be in red be placed between two asterisks and then highlight what is in between the two asterisks in red but when I save the file the font color styling doesn't get saved.

# Convert Gene_name column to a set for quick lookup
gene_set = set(df['Gene_name'].dropna())  # Drop NaNs from Gene_name to avoid issues

# Function to highlight genes
def highlight_genes(row):
    genes_str = row['Genes_in_same_transcription_unit']
    
    # Handle NaN or missing values
    if pd.isna(genes_str):
        return None  # Keep NaNs as is
    
    genes_list = genes_str.split(', ')  # Split into list

    # Add asterisks to genes NOT in the Gene_name column
    highlighted_list = [f"*{gene}*" if gene not in gene_set else gene for gene in genes_list]
    
    return ', '.join(highlighted_list)  # Join back into string

# Apply function
df['Highlighted_Genes'] = df.apply(highlight_genes, axis=1)

# Function to highlight words inside asterisks
def highlight_genes(val):
    if not isinstance(val, str):  # Ensure it's a string
        return val  # Return as is (preserves NaN or other values)
    
    def replace_func(match):
        return f'<span style="color: red;">{match.group(1)}</span>'
    
    # Replace text between * * with a red-colored span tag
    highlighted_text = re.sub(r'\*(.*?)\*', replace_func, val)
    
    return highlighted_text

# Apply the function using Styler
df_styled = df.style.format({'Highlighted_Genes': lambda x: highlight_genes(x)})

# Display in Jupyter Notebook
df_styled

df_styled.to_excel('highlighted_genes.xlsx')

My df has 19 columns but for simplicity the df will have the following columns [['Gene_name','Genes_in_same_transcription_unit']]

Below is an example of the modified dataframe

This is what I want to happen

# Convert Gene_name column to a set for quick lookup
gene_set = set(df['Gene_name'].dropna())  # Drop NaNs from Gene_name to avoid issues

# Function to highlight genes
def highlight_genes(row):
    genes_str = row['Genes_in_same_transcription_unit']
    
    # Handle NaN or missing values
    if pd.isna(genes_str):
        return None  # Keep NaNs as is
    
    genes_list = genes_str.split(', ')  # Split into list

    # Add asterisks to genes NOT in the Gene_name column
    highlighted_list = [f"*{gene}*" if gene not in gene_set else gene for gene in genes_list]
    
    return ', '.join(highlighted_list)  # Join back into string

# Apply function
df['Highlighted_Genes'] = df.apply(highlight_genes, axis=1)

# Function to highlight words inside asterisks
def highlight_genes(val):
    if not isinstance(val, str):  # Ensure it's a string
        return val  # Return as is (preserves NaN or other values)
    
    def replace_func(match):
        return f'<span style="color: red;">{match.group(1)}</span>'
    
    # Replace text between * * with a red-colored span tag
    highlighted_text = re.sub(r'\*(.*?)\*', replace_func, val)
    
    return highlighted_text

# Apply the function using Styler
df_styled = df.style.format({'Highlighted_Genes': lambda x: highlight_genes(x)})

# Display in Jupyter Notebook
df_styled

df_styled.to_excel('highlighted_genes.xlsx')

Share Improve this question asked Mar 31 at 17:43 Melissa Arroyo-Mendoza 555 bronze badges

By the way, for anyone want to try, here is a complete MRE as a gist using the OP's code. The df_styled.to_excel('highlighted_genes.xlsx') is not included as that step won't work to transfer the coloring of substrings to Excel at this time. – Wayne Commented Mar 31 at 20:02
Now that there is a path that works to use openpyxl to write the needed data to an .xslx file I deleted some of my speculative alternative ideas from the end of covering why way OP was trying won't work. Here is that without the speculation so those ending up here can know the 'why': as covered in the documentation, the HTML/CSS approach with <span> tags won't transfer to Excel when exporting the DataFrame. And with the way that will transfer using color with props, you cannot highlight with color a portion of a string, just the whole cell content. ... – Wayne Commented Mar 31 at 20:51
<continued> The ability to highlight only parts of text in a cell is a limitation of the Excel export functionality in Pandas, not something that can be directly accomplished through the Pandas API. This is the spot in the documentation: 'Export to Excel' under 'Table Visualization'. – Wayne Commented Mar 31 at 20:52

Add a comment |

1 Answer 1

Sorted by: Reset to default 1

You can use openpyxl to achieve this.

import pandas as pd
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl.cell.text import InlineFont 
from openpyxl.cell.rich_text import TextBlock, CellRichText

data = {
    'Gene_name': ['sdsR', 'arrS', 'gadF'],
    'Genes_in_same_transcription_unit': ['pphA, sdsR', 'arrS', 'mdtF, mdtE, gadF, gadE']
}
df = pd.DataFrame(data)

# Create a new Excel workbook and select the active worksheet
wb = Workbook()
ws = wb.active

# Define InlineFont colors
red = InlineFont(color='00FF0000')
black = InlineFont(color='00000000')

# Write the DataFrame to the worksheet
for r in dataframe_to_rows(df, index=False, header=True):
    ws.append(r)

unique_values = set(df['Gene_name'])

# Iterate based on column/header locations
for row in ws.iter_rows(min_row=2, min_col=2, max_col=2):
    for cell in row:
        # Individual strings stripped for whitespace
        parts = [s.strip() for s in cell.value.split(',')]
        rich_text_cell = CellRichText()

        for i, part in enumerate(parts):
            font = black if part in unique_values else red
            # Add separator unless last instance
            separator = ", " if i < len(parts) - 1 else ""
            # Append values to cell
            rich_text_cell.append(TextBlock(font, part + separator))

        cell.value = rich_text_cell

wb.save("highlighted.xlsx")

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

pandas - How to change the font color or highlight a string in between to asterisks in python? - Stack Overflow

1 Answer 1

与本文相关的文章

评论列表(0)