
python - Proxies with PyPartpicker (requests_html)...not being utilised? - Stack Overflow


I've just begun building a database updater that takes parts from PCPartPicker using PyPartPicker and uploads them to a Supabase database. I've set up async functionality, but I'm having trouble with rate limits. As a result, I'm trying to implement proxy rotation to stop my requests from being blocked.

Through some logging, I can see that the proxies are being cycled properly, but no matter what, I can't see any proxy usage in my proxy dashboard. Not only that, it logs about 8 parts and then gets stuck on a Browser Listening: info message, which is exactly what happens without using proxies. They're definitely not being used for the requests, even though I've set everything up exactly as the PyPartPicker GitHub and PyPI documentation instructs. Does anyone have any ideas as to how to actually see what IP is being used for the connection? My only thought is that maybe the proxies MUST be SOCKS5? But it looks like they already are. I'm using a Webshare.io free plan. Or am I doing something wrong? Any help would be appreciated :)
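One quick way to answer the "which IP is actually used" part is to hit an IP echo service with and without the proxy and compare the results. Below is a minimal sketch of that check (not from the original post), assuming a placeholder Webshare-style SOCKS5 URL and the public https://httpbin.org/ip endpoint. Note that requests only honours socks5:// proxy URLs when the optional PySocks extra is installed (pip install requests[socks]), and socks5h:// must be used if DNS resolution should also go through the proxy.

import requests  # plain requests is enough for this check

# Placeholder proxy URL - substitute real credentials. socks5h:// makes
# DNS resolution happen on the proxy side as well.
PROXY = "socks5h://user:[email protected]:1080"

def outgoing_ip(proxies=None):
    # Ask httpbin which IP the request arrived from.
    resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
    resp.raise_for_status()
    return resp.json()["origin"]

direct_ip = outgoing_ip()
proxied_ip = outgoing_ip({"http": PROXY, "https": PROXY})
print(f"direct: {direct_ip}, proxied: {proxied_ip}")
print("proxy in use" if direct_ip != proxied_ip else "proxy NOT in use")

If the two IPs match, the proxy settings are being silently ignored somewhere in the request path.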

import pypartpicker
import time
import random
from supabase import create_client, Client
import logging
from requests.exceptions import HTTPError, RequestException
import requests_html
import asyncio
from itertools import cycle

proxy_list = [
    "socks5://lxpclygj:[email protected]:6540",
    "socks5://lxpclygj:[email protected]:6712",
    "socks5://lxpclygj:[email protected]:6543",
    "socks5://lxpclygj:[email protected]:5157",
    "socks5://lxpclygj:[email protected]:6641",
    "socks5://lxpclygj:[email protected]:6360",
    "socks5://lxpclygj:[email protected]:6754",
    "socks5://lxpclygj:[email protected]:6853",
    "socks5://lxpclygj:[email protected]:5653",
    "socks5://lxpclygj:[email protected]:5792",
]

proxy_cycle = cycle(proxy_list)
session = requests_html.HTMLSession()

# Setup logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Initialize Supabase client
url = "***CENSORED***"  # Replace with your Supabase URL
key = "***CENSORED***"  # Replace with your Supabase API key
try:
    supabase: Client = create_client(url, key)
    logging.info("Connected to Supabase successfully.")
except Exception as e:
    logging.error(f"Failed to connect to Supabase: {e}")
    exit(1)

# Supported Product Types
CPU_PRODUCTS = ["Intel Core i3", "Intel Core i5", "Intel Core i7", "Intel Core i9", "Intel Xeon", "AMD Ryzen 3", "AMD Ryzen 5", "AMD Ryzen 7", "AMD Ryzen 9", "AMD Threadripper", "AMD Athlon 3000G", "AMD Athlon 200GE", "Intel Pentium G6400", "Intel Pentium 5", "Intel Pentium 6", "Intel Pentium 7", "Intel Celeron G4", "Intel Celeron G5", "Intel Celeron G6"]    

def response_retriever(url):
    retries = 5
    backoff_time = 2  # Initial backoff in seconds
    for _ in range(retries):
        proxy = next(proxy_cycle)  # Rotate proxy
        try:
            logging.info(f"Using proxy {proxy} for {url}")
            response = session.get(url, proxies={"http": proxy, "https": proxy})
            if response.status_code == 200:
                logging.info(f"Successfully retrieved response from {url} using proxy {proxy}")
                return response
            elif response.status_code == 429:
                logging.warning(f"Rate limited (429) for {url}. Retrying in {backoff_time}s...")
                time.sleep(backoff_time)
                backoff_time *= random.uniform(1.5, 2.0)  # Exponential backoff
            else:
                logging.warning(f"Unexpected status code {response.status_code} for {url}")
                return None
        except RequestException as e:
            logging.error(f"Failed to connect to {url} using proxy {proxy}: {e}")
            time.sleep(backoff_time)
            backoff_time *= random.uniform(1.5, 2.0)  # Exponential backoff
    logging.error(f"Failed to retrieve a valid response after {retries} attempts for {url}")
    return None

# Initialize PyPartPicker Client with the custom response retriever
pcpp = pypartpicker.Client(response_retriever=response_retriever)

async def fetch_top_parts():
    async with pypartpicker.AsyncClient() as pcpp:
        # Counters for debugging
        error_count = 0
        warning_count = 0
        skipped_parts = []

        # Iterate through each product type and fetch all results
        for product in CPU_PRODUCTS:
            page = 1
            while True:
                try:
                    logging.info(f"Fetching {product} parts on page {page}...")
                    result = await pcpp.get_part_search(product, page=page, region="au")
                    if result and result.parts:
                        for part_summary in result.parts:
                            if part_summary and part_summary.url:
                                while True:
                                    try:
                                        proxy = next(proxy_cycle)
                                        logging.info(f"Using proxy {proxy} for {part_summary.url}")
                                        part = await pcpp.get_part(part_summary.url)
                                        if part:
                                            # Validate and prepare data for insertion
                                            in_stock_vendors = [
                                                vendor for vendor in part.vendors if vendor.in_stock
                                            ] if part.vendors else []
                                            in_stock_vendors.sort(key=lambda v: v.price.total if v.price else float('inf'))

                                            cheapest_vendor = in_stock_vendors[0] if in_stock_vendors else None
                                            data = {
                                                "part_type": "processor",
                                                "name": part.name if part.name else None,
                                                "total_price": cheapest_vendor.price.total if cheapest_vendor and cheapest_vendor.price else None,
                                                "base_price": cheapest_vendor.price.base if cheapest_vendor and cheapest_vendor.price else None,
                                                "discounts": cheapest_vendor.price.discounts if cheapest_vendor and cheapest_vendor.price else None,
                                                "shipping_price": cheapest_vendor.price.shipping if cheapest_vendor and cheapest_vendor.price else None,
                                                "tax_price": cheapest_vendor.price.tax if cheapest_vendor and cheapest_vendor.price else None,
                                                "vendor_store": getattr(cheapest_vendor, "name", "N/A") if cheapest_vendor else None,
                                                "store_product_url": getattr(cheapest_vendor, "buy_url", "N/A") if cheapest_vendor else None,
                                                "vendor_logo_url": getattr(cheapest_vendor, "logo_url", "N/A") if cheapest_vendor else None,
                                                "in_stock": bool(in_stock_vendors) and cheapest_vendor is not None,
                                                "product_url": getattr(part, "url", "N/A"),
                                                "image_urls": part.image_urls if part.image_urls else None,
                                                "manufacturer": part.specs.get("Manufacturer", None) if part.specs else None,
                                                "part_number": part.specs.get("Part #", None) if part.specs else None,
                                                "series": part.specs.get("Series", None) if part.specs else None,
                                                "microarchitecture": part.specs.get("Microarchitecture", None) if part.specs else None,
                                                "core_family": part.specs.get("Core Family", None) if part.specs else None,
                                                "socket": part.specs.get("Socket", None) if part.specs else None,
                                                "core_count": part.specs.get("Core Count", None) if part.specs else None,
                                                "thread_count": part.specs.get("Thread Count", None) if part.specs else None,
                                                "performance_core_clock": part.specs.get("Performance Core Clock", None) if part.specs else None,
                                                "performance_core_boost_clock": part.specs.get("Performance Core Boost Clock", None) if part.specs else None,
                                                "l2_cache": part.specs.get("L2 Cache", None) if part.specs else None,
                                                "l3_cache": part.specs.get("L3 Cache", None) if part.specs else None,
                                                "tdp": part.specs.get("TDP", None) if part.specs else None,
                                                "integrated_graphics": part.specs.get("Integrated Graphics", None) if part.specs else None,
                                                "maximum_supported_memory": part.specs.get("Maximum Supported Memory", None) if part.specs else None,
                                                "ecc_support": part.specs.get("ECC Support", None) if part.specs else None,
                                                "includes_cooler": part.specs.get("Includes Cooler", None) if part.specs else None,
                                                "packaging": part.specs.get("Packaging", None) if part.specs else None,
                                                "lithography": part.specs.get("Lithography", None) if part.specs else None,
                                                "simultaneous_multithreading": part.specs.get("Simultaneous Multithreading", None) if part.specs else None,
                                                "rating_average": getattr(part.rating, "average", None) if part.rating else None,
                                                "rating_count": getattr(part.rating, "count", None) if part.rating else None,
                                            }

                                            supabase.table("cpus").insert([data]).execute()
                                            logging.info(f"Inserted {data['name']} into database.")
                                        else:
                                            warning_count += 1
                                            logging.warning("Part details could not be fetched.")
                                        break  # Exit the retry loop if successful
                                    except AttributeError as e:
                                        if "'NoneType' object has no attribute 'text'" in str(e):
                                            input("Verify link and press Enter to continue...")
                                        else:
                                            raise e
                                    except Exception as e:
                                        error_count += 1
                                        logging.error(f"Error fetching part details: {e}")
                                        break
                    else:
                        logging.info(f"No more results for {product} on page {page}.")
                        break  # Exit loop if no more results
                    page += 1
                    await asyncio.sleep(4)  # Prevent hitting rate limits
                except HTTPError as e:
                    error_count += 1
                    logging.error(f"HTTP error occurred: {e}")
                    await asyncio.sleep(10)  # Wait before retrying
                except Exception as e:
                    error_count += 1
                    logging.error(f"Unexpected error: {e}")
                    await asyncio.sleep(10)  # Short wait before retrying
                    continue

        # Final Debug Summary
        logging.info("\nDebug Summary:")
        logging.info(f"Total Errors: {error_count}")
        logging.info(f"Total Warnings: {warning_count}")
        logging.info(f"Total Skipped Parts: {len(skipped_parts)}")
        if skipped_parts:
            for name, part_number in skipped_parts:
                logging.info(f"Skipped Part: {name} | Part Number: {part_number}")

asyncio.run(fetch_top_parts())

For debugging, I also tried this variant of response_retriever, which first checks the outgoing IP through the proxy (via httpbin) before making the real request:
def response_retriever(url):
    retries = 5
    backoff_time = 2  # Initial backoff in seconds
    for _ in range(retries):
        proxy = next(proxy_cycle)  # Rotate proxy
        try:
            logging.info(f"Attempting to use proxy {proxy} for {url}")
            
            # Add proxy verification using an external service
            debug_url = ";  # Service to check the outgoing IP
            response = session.get(
                debug_url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
            proxy_ip = response.json().get("origin")
            logging.info(f"Outgoing IP as per proxy: {proxy_ip}")
            
            # Verify proxy works by making the actual request
            response = session.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
            if response.status_code == 200:
                logging.info(f"Successfully retrieved response from {url} using proxy {proxy_ip}")
                return response
            elif response.status_code == 429:
                logging.warning(f"Rate limited (429) for {url}. Retrying in {backoff_time}s...")
                time.sleep(backoff_time)
                backoff_time *= random.uniform(1.5, 2.0)  # Exponential backoff
            else:
                logging.warning(f"Unexpected status code {response.status_code} for {url}")
                return None
        except RequestException as e:
            logging.error(f"Proxy {proxy} failed for {url}: {e}")
            time.sleep(backoff_time)
            backoff_time *= random.uniform(1.5, 2.0)  # Exponential backoff
    logging.error(f"Failed to retrieve a valid response after {retries} attempts for {url}")
    return None

1 Answer

Try checking your client IP with any IP address checker, such as https://api.datascrape.tech/latest/ip or https://www.cloudflare.com/cdn-cgi/trace, both with and without the proxy - the IP should be different. This way you will find out whether the proxy is being used or not.
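A rough sketch of that comparison using the Cloudflare endpoint mentioned above (the proxy URL is a placeholder; https://www.cloudflare.com/cdn-cgi/trace returns plain newline-separated key=value pairs, one of which is ip=...):

import requests

PROXY = "socks5://user:[email protected]:1080"  # placeholder, not a real proxy

def trace_ip(proxies=None):
    # cdn-cgi/trace responds with lines like "ip=203.0.113.7"
    text = requests.get("https://www.cloudflare.com/cdn-cgi/trace",
                        proxies=proxies, timeout=10).text
    fields = dict(line.split("=", 1) for line in text.strip().splitlines())
    return fields["ip"]

print("without proxy:", trace_ip())
print("with proxy:   ", trace_ip({"http": PROXY, "https": PROXY}))

If both calls print the same IP, the proxy is not actually being applied to the request.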

If the proxy is used and you are still being rate limited by the target website, pay attention to your fingerprint, such as your SSL/TLS and TCP fingerprints. You can test them and find more info here: https://datascrape.tech/tools/browser-leaks-test/
