
nlp - Automating Web Scraping with Machine Learning and NER for Product Data Extraction - Stack Overflow


I’m currently doing a Data & AI internship. My job is to build a product database by retrieving information (product name, image, description, part number/SKU, technical specifications, datasheet, etc.) from the manufacturers' websites.

The challenge is that there are over 300 different manufacturers, each with its own website and structure, making traditional web scraping impractical and hard to maintain. To overcome this, I’m considering using AI and machine learning to make my scraping agent adaptable to changes in the HTML structure of each page.

I have downloaded and manually labeled 50 product pages. Here’s what my dataset looks like:

 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   text                52 non-null     object
 1   product_name        49 non-null     object
 2   html_product_name   51 non-null     object
 3   image_url           50 non-null     object
 4   html_image_url      50 non-null     object
 5   description         32 non-null     object
 6   html_description    51 non-null     object
 7   part_number         35 non-null     object
 8   html_part_number    36 non-null     object
 9   html_specification  44 non-null     object
 10  datasheet_url       40 non-null     object
 11  html_datasheet_url  41 non-null     object
 12  specification       2 non-null      object 

The text column contains the cleaned HTML of the product pages, while the other columns represent the target fields—the specific sections of the HTML that need to be identified and extracted.

This problem seems very similar to Named Entity Recognition (NER). How can I train a machine learning model to successfully extract these fields from raw HTML? What would be the best approach (e.g., fine-tuning a transformer model, sequence labeling, or another method)?
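
To make the NER framing concrete, here is a minimal sketch of how the labeled columns might be turned into token-level BIO tags for sequence labeling. It assumes each labeled value appears verbatim in the text column, which may not hold for every row. The names FIELD_LABELS, find_spans, and bio_tags are hypothetical, the sample text is made up, and whitespace tokenization stands in for whatever tokenizer a real model would use.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping from dataset columns to entity labels (an assumption).
FIELD_LABELS = {
    "product_name": "PRODUCT_NAME",
    "description": "DESCRIPTION",
    "part_number": "PART_NUMBER",
}

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def find_spans(text: str, fields: Dict[str, Optional[str]]) -> List[Span]:
    """Locate each labeled value in the text by exact string match.

    Only the first occurrence is used; values that are missing or not
    found verbatim are skipped.
    """
    spans = []
    for column, label in FIELD_LABELS.items():
        value = fields.get(column)
        if not value:
            continue
        start = text.find(value)
        if start == -1:
            continue
        spans.append((start, start + len(value), label))
    return sorted(spans)


def bio_tags(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Split on whitespace and tag each token as B-/I-<label> or O."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # A token is tagged only if it lies fully inside a span.
            if tok_start >= start and tok_end <= end:
                prefix = "B-" if tok_start == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


# Made-up example row, not taken from the real dataset.
text = "Acme Widget 3000 SKU: AW-3000 A compact widget."
fields = {
    "product_name": "Acme Widget 3000",
    "part_number": "AW-3000",
    "description": None,
}

spans = find_spans(text, fields)
tagged = bio_tags(text, spans)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

Pairs of tokens and tags in this shape are what a token-classification model would typically be trained on. Exact string matching is the weakest part of the sketch: values that were normalized during labeling would need fuzzy alignment instead.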

Thanks in advance!
