
How to extract particular tags from soup using python? - Stack Overflow



From the webpages below I would like to extract data such as:

https://www.ams.usda.gov/services/enforcement/organic/settlements
https://www.ams.usda.gov/services/enforcement/organic/settlements-2023

"03/19/2025" as "date", 
"United Fruit and Produce Co. – St. Louis, Missouri" as "title" and
"United Fruit and Produce Co. (United) withdraws its appeal, waives further appeal rights in this matter and agrees to stop selling product as anic until its anic certification is reinstated by NOP and to pay its total civil penalty within 30 days. If/when reinstated, United agrees to provide on-time responses to all certifier requests for information and detailed documentation required to maintain anic certification; to inform its certifier of operational or product changes with an updated anic system plan; and to maintain anic certificates of all products it handles."  as "text"

I am looping over all the <p> tags. The issue is that the <strong> tags hold the dates together with the titles, while the body text sits in separate <p> tags that follow. How do I pair them up properly?
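The relevant part of each page looks roughly like this (a simplified sketch reconstructed from the pages; the real markup is not perfectly uniform):

<div class="field field--name-body field--type-text-with-summary field--label-hidden field__item">
  <p><strong>03/19/2025: United Fruit and Produce Co. – St. Louis, Missouri</strong></p>
  <p>United Fruit and Produce Co. (United) withdraws its appeal, ...</p>
  <p><strong>03/14/2025: Mapeks USA LLC, dba Mac Global LLC – Allentown, Pennsylvania</strong></p>
  <p>Mapeks USA LLC, dba Mac Global LLC (Mapeks USA) withdraws its appeal, ...</p>
</div>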

I tried the code below:

import requests
from bs4 import BeautifulSoup

link_url = 'https://www.ams.usda.gov/services/enforcement/organic/settlements'
with requests.get(link_url) as response:
    response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
main_div = soup.find('main', {'id': 'main-content'})
div_class = main_div.find('div', {'class': 'field field--name-body field--type-text-with-summary field--label-hidden field__item'})
paragraphs = div_class.find_all('p')
print(paragraphs)
# Initialize lists to store the extracted data
strong_tags = []
text_tags = []

# Loop through all the <p> tags in the document
for p_tag in paragraphs:
    strong_tag = p_tag.find('strong')
    
    if strong_tag:
        # Extract the content of the <strong> tag
        strong_text = strong_tag.get_text(strip=True)
        strong_tags.append(strong_text)
        
        # Extract the remaining text in the <p> tag after the <strong> tag
        remaining_text = p_tag.get_text(strip=True).replace(strong_text, "").strip()
        if remaining_text:
            text_tags.append(remaining_text)
    else:
        # If no <strong> tag is found, just extract the full text
        full_text = p_tag.get_text(strip=True)
        if full_text:
            text_tags.append(full_text)

# Output the results
print("Strong Paragraphs:")
for tag in strong_tags:
    print(tag)

print("\nText Paragraphs:")
for tag in text_tags:
    print(tag)

3 Answers


The HTML layout is irregular, so the only practical approach is to inspect it (with browser dev tools or a similar external tool) and then write logic that matches what you find empirically.

As a starting point I would recommend designing a class that encapsulates each "article".

Something like this:

import requests
from bs4 import BeautifulSoup as BS
import re

try:
    import lxml
    PARSER = "lxml"
except ModuleNotFoundError:
    PARSER = "html.parser"

URLS = [
    "https://www.ams.usda.gov/services/enforcement/organic/settlements",
    "https://www.ams.usda.gov/services/enforcement/organic/settlements-2023",
    "https://www.ams.usda.gov/services/enforcement/organic/settlements-2024"
]
PATTERN = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4})[:-]\s*(.*)$")

class Article:
    def __init__(self, date, title, text):
        self.date = date.strip()
        self.title = title.strip()
        self.text = text.strip()
    def append(self, s):
        if s:
            if not s[0].isspace():
                self.text += " "
            self.text += s
    def __str__(self):
        return f"Date={self.date}, Title={self.title}, Text={self.text}"
    __repr__ = __str__

articles = []

with requests.Session() as session:
    for url in URLS:
        with session.get(url) as response:
            response.raise_for_status()
            soup = BS(response.text, PARSER)
            # Direct-child <p> tags of the article body; the first two are intro text
            ps = soup.select("#block-mainpagecontent > article > div > div > p")
            date = title = text = ""
            for p in ps[2:]:
                # A <strong> tag marks a new entry header of the form "date: title"
                if s := p.select_one("strong"):
                    if m := PATTERN.match(s.text):
                        date, title = m.groups()
                    s.extract()  # remove the header so p.text is just the body
                text = p.text.strip()
                if articles and text and not date:
                    # No new header seen: this paragraph continues the previous article
                    articles[-1].append(text)
                elif date and title and text:
                    articles.append(Article(date, title, text))
                    date = title = text = ""

for article in articles:
    print(article)
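The Article.append method is what absorbs the irregularity: when a settlement's description spans more than one <p> tag, the extra paragraphs are folded into the previous article instead of starting a new record, which is exactly the case that breaks a naive one-<p>-per-record loop.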

I guess your problem is that there is nothing distinctive to select on: both the plain <p></p> tags and the <p><strong></strong></p> tags have no class or id, and they simply alternate. You can use find_next_sibling for such cases. Other explanations are given as comments inside the code.

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.ams.usda.gov/services/enforcement/organic/settlements-2015-16'

res = requests.get(url)

soup = bs(res.text, 'html.parser')
child_div = soup.find(class_="field field--name-body field--type-text-with-summary field--label-hidden field__item")

# All direct-child <p> tags; the title paragraphs are the ones containing <strong>
title_paragraphs = child_div.find_all('p', recursive=False)

# Instead of two lists I created a single list with multiple dicts
results = []

for title_p in title_paragraphs:

    # Make sure it's a title paragraph
    if title_p.strong:  
    
        title = title_p.strong.text
        
        # Get the next <p> sibling for content
        content_p = title_p.find_next_sibling('p')
        
        if content_p:
            content = content_p.text
            
            # title and content are made as dict and appended to the list results
            results.append({'title': title, 'content': content})
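To inspect what was collected you could, for example, print the records afterwards (not part of the original loop):

for item in results:
    print(item['title'])
    print(item['content'])
    print('-' * 40)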

The HTML structure to parse is unfortunately not uniform, so you may need to adjust a few things; e.g. the date format and the separator in the title differ in some cases.

Also try changing your selection and saving strategy: use CSS selectors and a list of dicts, which makes things a bit easier.

import requests, re
from bs4 import BeautifulSoup

link_url = 'https://www.ams.usda.gov/services/enforcement/organic/settlements'
with requests.get(link_url) as response:
    response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

def split_text(text):
    # Split "date: title" (or "date - title") into its parts
    parts = re.split(r'[:\-]', text)
    parts = [part.strip() for part in parts]
    return parts


data = []
for title in soup.select('article[about] p strong'):
    item = dict(
        zip(
            ['date','title'],split_text(title.text)
            )
        )
    item.update({'text':title.find_parent('p').find_next_sibling('p').text})
    data.append(item)

print(data)

Result

[{'date': '03/26/2025',
  'title': 'Spiritual Springs Farm – Heath, Ohio',
  'text': 'Spiritual Springs Farm (Spiritual) withdraws its appeal, waives further appeal rights in this matter and agrees to provide on-time responses to all certifier requests for information and detailed documentation required to maintain organic certification.\u202fSpiritual also agrees to pay certification-related fees on-time.\xa0'},
 {'date': '03/19/2025',
  'title': 'United Fruit and Produce Co. – St. Louis, Missouri',
  'text': 'United Fruit and Produce Co. (United) withdraws its appeal, waives further appeal rights in this matter and agrees to stop selling product as organic until its organic certification is reinstated by NOP and to pay its total civil penalty within 30 days. If/when reinstated, United agrees to provide on-time responses to all certifier requests for information and detailed documentation required to maintain organic certification; to inform its certifier of operational or product changes with an updated organic system plan; and to maintain organic certificates of all products it handles.\xa0'},
 {'date': '03/14/2025',
  'title': 'Mapeks USA LLC, dba Mac Global LLC – Allentown, Pennsylvania',
  'text': 'Mapeks USA LLC, dba Mac Global LLC (Mapeks USA) withdraws its appeal, waives further appeal rights in this matter and agrees to provide on-time responses to all certifier requests for information and documentation required to maintain organic certification, including a fraud prevention plan and organic certificates for all product they handle.\u202fMapeks USA agrees to maintain and submit records that fully disclose all activities and transactions; to not represent nonorganic product as organic; to only use labels approved by its certifier; to take measures to prevent commingling of organic and nonorganic products; and to divert product identified in the adverse action notice to the conventional market. Mapeks USA also agrees to submit evidence of the disposition of the aforementioned product to its certifier, to resolve any outstanding noncompliances, and agrees to an unannounced inspection within one year.\xa0'},...]

An alternative that would be a bit more robust is a more complex pattern that anchors on the leading date:

def split_text(text):
    pattern = r'(\d{1,2}/\d{1,2}/\d{4})\s*(.*?)(?=\d{1,2}/\d{1,2}/\d{4}|$)'
    matches = re.findall(pattern, text)
    result = {'date': matches[0][0], 'title': matches[0][1].strip()}
    return result

data = []

for title in soup.select('article[about] p strong'):
    item = split_text(title.text)
    item.update({'text':title.find_parent('p').find_next_sibling('p').text})
    data.append(item)
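The advantage of this second pattern is that it keys on the leading date, so a hyphen inside a title no longer truncates the title, which the simple re.split(r'[:\-]', ...) approach can do since zip() keeps only the first two parts.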