My df has 19 columns but for simplicity the df will have the following columns [['Gene_name','Genes_in_same_transcription_unit']]
The column 'Gene_name' list a bunch of genes in E. coli and the corresponding entry in the column 'Genes_in_same_transcription_unit' lists the genes that are under the same promoter. An entry in this column contains multiple genes in a list that are separated by a comma and a space (example: pphA, sdsR). I want to know if there is a gene that is listed in the 'Genes_in_same_transcription_unit' that isn't in the 'Gene_name' column and for it to be highlighted or the font color for that gene, and only that gene, be in red or any other color beside black. My ultimate goal is to be able to quickly look at the column 'Genes_in_same_transcription_unit' and see if a whole operon of genes have been affected or if only 1 or 2 genes are affected and where they are in the operon (first gene, middle, or last).
Below is an example of the modified dataframe
This is what I want to happen
I have tried a bunch of stuff and nothing seems to work. Below is the closest I have gotten to getting what I want. With the code below I can make a new column called 'Highlighted_Genes' where the genes that should be in red be placed between two asterisks and then highlight what is in between the two asterisks in red but when I save the file the font color styling doesn't get saved.
# Convert Gene_name column to a set for quick lookup
gene_set = set(df['Gene_name'].dropna()) # Drop NaNs from Gene_name to avoid issues
# Function to highlight genes
def highlight_genes(row):
genes_str = row['Genes_in_same_transcription_unit']
# Handle NaN or missing values
if pd.isna(genes_str):
return None # Keep NaNs as is
genes_list = genes_str.split(', ') # Split into list
# Add asterisks to genes NOT in the Gene_name column
highlighted_list = [f"*{gene}*" if gene not in gene_set else gene for gene in genes_list]
return ', '.join(highlighted_list) # Join back into string
# Apply function
df['Highlighted_Genes'] = df.apply(highlight_genes, axis=1)
# Function to highlight words inside asterisks
def highlight_genes(val):
if not isinstance(val, str): # Ensure it's a string
return val # Return as is (preserves NaN or other values)
def replace_func(match):
return f'<span style="color: red;">{match.group(1)}</span>'
# Replace text between * * with a red-colored span tag
highlighted_text = re.sub(r'\*(.*?)\*', replace_func, val)
return highlighted_text
# Apply the function using Styler
df_styled = df.style.format({'Highlighted_Genes': lambda x: highlight_genes(x)})
# Display in Jupyter Notebook
df_styled
df_styled.to_excel('highlighted_genes.xlsx')
My df has 19 columns but for simplicity the df will have the following columns [['Gene_name','Genes_in_same_transcription_unit']]
The column 'Gene_name' list a bunch of genes in E. coli and the corresponding entry in the column 'Genes_in_same_transcription_unit' lists the genes that are under the same promoter. An entry in this column contains multiple genes in a list that are separated by a comma and a space (example: pphA, sdsR). I want to know if there is a gene that is listed in the 'Genes_in_same_transcription_unit' that isn't in the 'Gene_name' column and for it to be highlighted or the font color for that gene, and only that gene, be in red or any other color beside black. My ultimate goal is to be able to quickly look at the column 'Genes_in_same_transcription_unit' and see if a whole operon of genes have been affected or if only 1 or 2 genes are affected and where they are in the operon (first gene, middle, or last).
Below is an example of the modified dataframe
This is what I want to happen
I have tried a bunch of stuff and nothing seems to work. Below is the closest I have gotten to getting what I want. With the code below I can make a new column called 'Highlighted_Genes' where the genes that should be in red be placed between two asterisks and then highlight what is in between the two asterisks in red but when I save the file the font color styling doesn't get saved.
# Convert Gene_name column to a set for quick lookup
gene_set = set(df['Gene_name'].dropna()) # Drop NaNs from Gene_name to avoid issues
# Function to highlight genes
def highlight_genes(row):
genes_str = row['Genes_in_same_transcription_unit']
# Handle NaN or missing values
if pd.isna(genes_str):
return None # Keep NaNs as is
genes_list = genes_str.split(', ') # Split into list
# Add asterisks to genes NOT in the Gene_name column
highlighted_list = [f"*{gene}*" if gene not in gene_set else gene for gene in genes_list]
return ', '.join(highlighted_list) # Join back into string
# Apply function
df['Highlighted_Genes'] = df.apply(highlight_genes, axis=1)
# Function to highlight words inside asterisks
def highlight_genes(val):
if not isinstance(val, str): # Ensure it's a string
return val # Return as is (preserves NaN or other values)
def replace_func(match):
return f'<span style="color: red;">{match.group(1)}</span>'
# Replace text between * * with a red-colored span tag
highlighted_text = re.sub(r'\*(.*?)\*', replace_func, val)
return highlighted_text
# Apply the function using Styler
df_styled = df.style.format({'Highlighted_Genes': lambda x: highlight_genes(x)})
# Display in Jupyter Notebook
df_styled
df_styled.to_excel('highlighted_genes.xlsx')
Share
Improve this question
asked Mar 31 at 17:43
Melissa Arroyo-MendozaMelissa Arroyo-Mendoza
555 bronze badges
3
|
1 Answer
Reset to default 1You can use openpyxl to achieve this.
import pandas as pd
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl.cell.text import InlineFont
from openpyxl.cell.rich_text import TextBlock, CellRichText
data = {
'Gene_name': ['sdsR', 'arrS', 'gadF'],
'Genes_in_same_transcription_unit': ['pphA, sdsR', 'arrS', 'mdtF, mdtE, gadF, gadE']
}
df = pd.DataFrame(data)
# Create a new Excel workbook and select the active worksheet
wb = Workbook()
ws = wb.active
# Define InlineFont colors
red = InlineFont(color='00FF0000')
black = InlineFont(color='00000000')
# Write the DataFrame to the worksheet
for r in dataframe_to_rows(df, index=False, header=True):
ws.append(r)
unique_values = set(df['Gene_name'])
# Iterate based on column/header locations
for row in ws.iter_rows(min_row=2, min_col=2, max_col=2):
for cell in row:
# Individual strings stripped for whitespace
parts = [s.strip() for s in cell.value.split(',')]
rich_text_cell = CellRichText()
for i, part in enumerate(parts):
font = black if part in unique_values else red
# Add separator unless last instance
separator = ", " if i < len(parts) - 1 else ""
# Append values to cell
rich_text_cell.append(TextBlock(font, part + separator))
cell.value = rich_text_cell
wb.save("highlighted.xlsx")
df_styled.to_excel('highlighted_genes.xlsx')
is not included as that step won't work to transfer the coloring of substrings to Excel at this time. – Wayne Commented Mar 31 at 20:02.xslx
file I deleted some of my speculative alternative ideas from the end of covering why way OP was trying won't work. Here is that without the speculation so those ending up here can know the 'why': as covered in the documentation, the HTML/CSS approach with<span>
tags won't transfer to Excel when exporting the DataFrame. And with the way that will transfer usingcolor
withprops
, you cannot highlight with color a portion of a string, just the whole cell content. ... – Wayne Commented Mar 31 at 20:51