I’m working on a Python script to extract content from a Word document (.docx) and insert it into a SQL Server database. The challenge is that I need to preserve text styles like bold and italic, as well as handle line breaks and footnotes from the Word document.
Currently, I'm using the python-docx library to process the document. Line breaks have been successfully transferred using <br>, but text styles (bold/italic) and footnotes are not being included in the output.
Here's what I’ve attempted so far:
1. For text styles:
I tried looping through paragraph.runs to detect run.bold and run.italic. However, the styled text doesn’t appear in my database output.
2. For footnotes:
I tried extracting footnotes using a custom function with doc.footnotes or checking for the style Footnote Text. While the function doesn’t raise errors, footnotes don’t appear in the final output.
Here’s the snippet of my code for processing styles and footnotes:
text_with_style = []
if paragraph.runs:
    for run in paragraph.runs:
        styled_text = run.text.strip()
        if run.bold:
            styled_text = f"<b>{styled_text}</b>"
        if run.italic:
            styled_text = f"<i>{styled_text}</i>"
        text_with_style.append(styled_text)
formatted_text = " ".join(text_with_style).replace("\n", "<br>")
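For comparison, here is a minimal sketch of one way the run formatting could be kept. It makes several assumptions. The helper name runs_to_html is hypothetical. The target column is assumed to store HTML, so the text is escaped. The runs are assumed to behave like python-docx runs, where bold and italic are True, False, or None (None meaning the value is inherited from a style, so directly checking run.bold can miss style-based formatting). The demo uses stand-in objects so that it runs without a document.

```python
from html import escape
from types import SimpleNamespace


def runs_to_html(runs):
    """Join run-like objects into one HTML string, keeping bold/italic.

    Hypothetical helper: `runs` is any iterable of objects with .text,
    .bold and .italic attributes, such as paragraph.runs in python-docx.
    """
    parts = []
    for run in runs:
        # Keep each run's own whitespace. Stripping runs and re-joining
        # them with " " can add or drop spaces in the middle of a word.
        text = escape(run.text).replace("\n", "<br>")
        if not text:
            continue
        # Only an explicit True is treated as styled here; None (inherited
        # from a style) is left unstyled in this sketch.
        if run.italic:
            text = f"<i>{text}</i>"
        if run.bold:
            text = f"<b>{text}</b>"
        parts.append(text)
    return "".join(parts)


# Stand-ins for python-docx runs (assumption: same attribute names).
demo_runs = [
    SimpleNamespace(text="Hello ", bold=None, italic=None),
    SimpleNamespace(text="bold", bold=True, italic=None),
    SimpleNamespace(text=" and ", bold=None, italic=None),
    SimpleNamespace(text="both", bold=True, italic=True),
]
print(runs_to_html(demo_runs))  # Hello <b>bold</b> and <b><i>both</i></b>
```

With a real document the call would be runs_to_html(paragraph.runs). If the tags still go missing after that, the next thing to check is whether the value actually passed to the INSERT statement is this HTML string and not paragraph.text.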
For footnotes:
def extract_footnotes(doc):
    footnotes_text = []
    if hasattr(doc, 'footnotes'):
        for footnote in doc.footnotes:
            footnotes_text.append(footnote.text.strip())
    return footnotes_text
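As far as I can tell, the python-docx Document object does not document a footnotes attribute. If it is absent, the hasattr check above would silently skip the loop, which would explain the empty result without any error. A possible workaround, sketched below using only the standard library, is to read the footnotes part straight from the .docx package. It assumes the usual part name word/footnotes.xml. The function name read_footnotes and the in-memory demo file are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Placeholder entries Word keeps in the footnotes part; not real footnotes.
PLACEHOLDER_TYPES = {"separator", "continuationSeparator", "continuationNotice"}


def read_footnotes(docx_file):
    """Return {footnote id: text} from word/footnotes.xml (hypothetical helper)."""
    with zipfile.ZipFile(docx_file) as package:
        if "word/footnotes.xml" not in package.namelist():
            return {}  # the document has no footnotes part
        root = ET.fromstring(package.read("word/footnotes.xml"))

    footnotes = {}
    for note in root.iter(f"{{{W}}}footnote"):
        if note.get(f"{{{W}}}type") in PLACEHOLDER_TYPES:
            continue
        # Concatenate every w:t text node inside the footnote.
        text = "".join(t.text or "" for t in note.iter(f"{{{W}}}t"))
        footnotes[int(note.get(f"{{{W}}}id"))] = text.strip()
    return footnotes


# Tiny in-memory package standing in for a real .docx (demo data only).
demo_xml = (
    f'<w:footnotes xmlns:w="{W}">'
    '<w:footnote w:type="separator" w:id="-1">'
    "<w:p><w:r><w:separator/></w:r></w:p></w:footnote>"
    '<w:footnote w:id="1">'
    "<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>note</w:t></w:r></w:p>"
    "</w:footnote>"
    "</w:footnotes>"
)
demo_docx = io.BytesIO()
with zipfile.ZipFile(demo_docx, "w") as package:
    package.writestr("word/footnotes.xml", demo_xml)

print(read_footnotes(demo_docx))  # {1: 'First note'}
```

With a real file the call would be read_footnotes("document.docx"), where the path is a placeholder. The ids correspond to the w:id values on the footnote references in the body text, so they could be used to place each footnote next to the paragraph that cites it.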
What am I missing? How can I reliably preserve bold/italic styles and extract footnotes so they’re included in the output that gets inserted into SQL Server? Any advice or working examples would be greatly appreciated.