I’m currently doing a Data & AI internship. My job is to build a product database by retrieving information (product name, image, description, part number/SKU, technical specifications, datasheet, etc.) from the manufacturers' websites.
The challenge is that there are over 300 different manufacturers, each with its own website and structure, which makes it impractical to write and maintain a dedicated scraper per site. To overcome this, I’m considering using AI and machine learning to make my scraping agent adaptable to changes in the HTML structure of each page.
I have downloaded and manually labeled 50 product pages. Here’s what my dataset looks like:
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   text                52 non-null     object
 1   product_name        49 non-null     object
 2   html_product_name   51 non-null     object
 3   image_url           50 non-null     object
 4   html_image_url      50 non-null     object
 5   description         32 non-null     object
 6   html_description    51 non-null     object
 7   part_number         35 non-null     object
 8   html_part_number    36 non-null     object
 9   html_specification  44 non-null     object
 10  datasheet_url       40 non-null     object
 11  html_datasheet_url  41 non-null     object
 12  specification       2 non-null      object
The text column contains the cleaned HTML of each product page, while the other columns hold the target fields, i.e. the specific sections of the HTML that need to be identified and extracted.
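To turn these string labels into something a sequence tagger could consume, I was planning to locate each field value inside the page text and store character-level spans. Here is a rough sketch of that step (the field list, the find_spans helper, and the pickle path are just my own placeholders; exact-match lookup would miss values that don't occur verbatim and would need fuzzy matching instead):

import pandas as pd

# Fields whose plain values I would try to locate in the page text.
FIELDS = ["product_name", "description", "part_number", "datasheet_url"]

def find_spans(row: pd.Series) -> list[tuple[int, int, str]]:
    """Return (start, end, field) character spans for each labeled value
    that occurs verbatim in the cleaned page text."""
    spans = []
    text = row["text"]
    for field in FIELDS:
        value = row.get(field)
        if isinstance(value, str) and value:
            start = text.find(value)
            if start != -1:
                spans.append((start, start + len(value), field))
    return spans

df = pd.read_pickle("labeled_pages.pkl")  # placeholder path for my dataset
df["spans"] = df.apply(find_spans, axis=1)
print(df["spans"].head())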
This problem seems very similar to Named Entity Recognition (NER). How can I train a machine learning model to reliably extract these fields from raw HTML? What would be the best approach (e.g., fine-tuning a transformer model, sequence labeling, or another method)?
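For the fine-tuning route, this is roughly what I had in mind: project the character spans from the snippet above onto token-level BIO labels via the tokenizer's offset mapping, then train a standard token-classification head. This is only a sketch under my own assumptions (the checkpoint name is an arbitrary placeholder, the BIO scheme is my choice, and truncating at 512 tokens means long pages would need chunking):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder checkpoint; any encoder with a fast tokenizer would do.
BASE = "distilbert-base-uncased"
FIELDS = ["product_name", "description", "part_number", "datasheet_url"]
LABELS = ["O"] + [f"{p}-{f}" for f in FIELDS for p in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained(BASE)

def encode(text: str, spans: list[tuple[int, int, str]]) -> dict:
    """Tokenize a page and project character spans onto token-level BIO tags."""
    enc = tokenizer(text, truncation=True, max_length=512,
                    return_offsets_mapping=True)
    labels = []
    for tok_start, tok_end in enc["offset_mapping"]:
        tag = "O"
        if tok_start < tok_end:  # special tokens have (0, 0) offsets; in
            # practice they'd get -100 so the loss ignores them
            for s, e, field in spans:
                if tok_start >= s and tok_end <= e:
                    tag = f"{'B' if tok_start == s else 'I'}-{field}"
                    break
        labels.append(label2id[tag])
    enc["labels"] = labels
    enc.pop("offset_mapping")
    return enc

model = AutoModelForTokenClassification.from_pretrained(
    BASE, num_labels=len(LABELS))
# From here the encoded examples would feed a standard Trainer loop.

Does this framing make sense, or is there a better-suited setup for HTML input?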
Thanks in advance!