I’m currently doing a Data & AI internship. My job is to build a product database by retrieving information (product name, image, description, part number/SKU, technical specifications, datasheet, etc.) from the manufacturers' websites.
The challenge is that there are over 300 different manufacturers, each with its own website and structure, which makes it impractical to write and maintain a dedicated scraper per site. To overcome this, I’m considering using AI and machine learning to make my scraping agent adaptable to changes in the HTML structure of each page.
I have downloaded and manually labeled 50 product pages. Here’s what my dataset looks like:
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   text                52 non-null     object
 1   product_name        49 non-null     object
 2   html_product_name   51 non-null     object
 3   image_url           50 non-null     object
 4   html_image_url      50 non-null     object
 5   description         32 non-null     object
 6   html_description    51 non-null     object
 7   part_number         35 non-null     object
 8   html_part_number    36 non-null     object
 9   html_specification  44 non-null     object
 10  datasheet_url       40 non-null     object
 11  html_datasheet_url  41 non-null     object
 12  specification       2 non-null      object
The text column contains the cleaned HTML of each product page, while the other columns hold the target fields, i.e. the specific sections of the HTML that need to be identified and extracted.
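To turn these string labels into something a sequence tagger could consume, I was planning to locate each field value inside the page text and store character-level spans. Here is a rough sketch of that step (the field list, the find_spans helper, and the pickle path are just my own placeholders; exact-match lookup would miss values that don't occur verbatim and would need fuzzy matching instead):

import pandas as pd

# Fields whose plain values I would try to locate in the page text.
FIELDS = ["product_name", "description", "part_number", "datasheet_url"]

def find_spans(row: pd.Series) -> list[tuple[int, int, str]]:
    """Return (start, end, field) character spans for each labeled value
    that occurs verbatim in the cleaned page text."""
    spans = []
    text = row["text"]
    for field in FIELDS:
        value = row.get(field)
        if isinstance(value, str) and value:
            start = text.find(value)
            if start != -1:
                spans.append((start, start + len(value), field))
    return spans

df = pd.read_pickle("labeled_pages.pkl")  # placeholder path for my dataset
df["spans"] = df.apply(find_spans, axis=1)
print(df["spans"].head())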
This problem seems very similar to Named Entity Recognition (NER). How can I train a machine learning model to reliably extract these fields from raw HTML? What would be the best approach (e.g., fine-tuning a transformer model, sequence labeling, or another method)?
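For the fine-tuning route, this is roughly what I had in mind: project the character spans from the snippet above onto token-level BIO labels via the tokenizer's offset mapping, then train a standard token-classification head. This is only a sketch under my own assumptions (the checkpoint name is an arbitrary placeholder, the BIO scheme is my choice, and truncating at 512 tokens means long pages would need chunking):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder checkpoint; any encoder with a fast tokenizer would do.
BASE = "distilbert-base-uncased"
FIELDS = ["product_name", "description", "part_number", "datasheet_url"]
LABELS = ["O"] + [f"{p}-{f}" for f in FIELDS for p in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained(BASE)

def encode(text: str, spans: list[tuple[int, int, str]]) -> dict:
    """Tokenize a page and project character spans onto token-level BIO tags."""
    enc = tokenizer(text, truncation=True, max_length=512,
                    return_offsets_mapping=True)
    labels = []
    for tok_start, tok_end in enc["offset_mapping"]:
        tag = "O"
        if tok_start < tok_end:  # special tokens have (0, 0) offsets; in
            # practice they'd get -100 so the loss ignores them
            for s, e, field in spans:
                if tok_start >= s and tok_end <= e:
                    tag = f"{'B' if tok_start == s else 'I'}-{field}"
                    break
        labels.append(label2id[tag])
    enc["labels"] = labels
    enc.pop("offset_mapping")
    return enc

model = AutoModelForTokenClassification.from_pretrained(
    BASE, num_labels=len(LABELS))
# From here the encoded examples would feed a standard Trainer loop.

Does this framing make sense, or is there a better-suited setup for HTML input?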
Thanks in advance!