Receiving an ERR_HTTP2_PROTOCOL_ERROR sporadically when web scraping with python selenium chromedriver, but same request works a

I have a fairly basic scraper set up to loop through a list of real estate agent profiles on Homes. My scraper worked consistently for months, but then it started getting sporadic ERR_HTTP2_PROTOCOL_ERROR errors. I'm using Python/Selenium/Chromedriver. At first I thought maybe my IP was somehow being blocked/flagged, but when I just hit the same URL in a different browser, it loads perfectly fine (the error happens first when running with Chromedriver as part of my scraper). This error used to happen maybe once every 1000 URLs; now it happens almost every 5-10 URLs, to the point my scraper is unusable. As soon as my code gets this error once, almost every other URL in the loop fails with the same error.

What I've tried:

  • Swapping to a different network, wired vs. wifi, etc. No change
  • Updating Chrome to latest version--no change. Updating to latest Chromedriver version. Also no change.
  • Using Chrome Developer tools, copying out the failed request as a CURL request and running it through Terminal--this WORKED in terminal, which seems to indicate the request itself is good??

What would you all suggest trying next? Is the selenium-controlled Chrome browser not actually sending this request across the wire, given that the same URL works in a different browser?

Below is a simplified version of my larger scraping program, to isolate out the loop and so I can re-create this scenario:

import random
from selenium import webdriver
import time

urllist = ['https://www.homes/real-estate-agents/susanne-guthrie/x5lx8zp/',
'https://www.homes/real-estate-agents/sue-pearce/362zpwe/',
'https://www.homes/real-estate-agents/katie-mihelich/kzppjy8/',
'https://www.homes/real-estate-agents/matt-pittman/etryj3q/',
'https://www.homes/real-estate-agents/mateen-ansari/mg2qg9l/',
'https://www.homes/real-estate-agents/rachael-real/dk3q8gl/',
'https://www.homes/real-estate-agents/annamarie-moise/21qtbtb/',
'https://www.homes/real-estate-agents/madison-verdun/0qvxe13/',
'https://www.homes/real-estate-agents/david-stob/hsztzf1/',
'https://www.homes/real-estate-agents/samuel-chrusciel/ww3525k/',
'https://www.homes/real-estate-agents/cathie-smith/b32xp59/',
'https://www.homes/real-estate-agents/jean-reedy-baren/7vk95ky/',
'https://www.homes/real-estate-agents/randy-stob/y9j077s/',
'https://www.homes/real-estate-agents/jeanne-jordan/sh90wf7/',
'https://www.homes/real-estate-agents/anthony-janega/p42zbfv/']

driver = webdriver.Chrome()

for url in urllist:
    delay = random.randint(1, 3)
    time.sleep(delay)
    driver.get(url)
    # SCRAPE DATA

And here is the CURL request from the failed browser attempt, copied out of Chrome Developer tools. When running this in terminal, it works:

curl 'https://www.homes/real-estate-agents/sue-pearce/362zpwe/' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: en-US,en;q=0.9' \
  -b 'gp=%257b%2522k%2522%253a%257b%2522key%2522%253a%2522lqjs6ple4qbcy%2522%257d%252c%2522v%2522%253a4%252c%2522d%2522%253a%257b%2522lt%2522%253a33.392%252c%2522ln%2522%253a-84.57%257d%257d; hab=%257b%2522e%2522%253atrue%252c%2522r%2522%253a%255b%255d%257d; vr=vr-rU%2FpWRFUWU6wYzYaX5AuCg.1741881982; at=bAJyo0SDJ4EhYVyUtPvuVhUymGMGPlZLdn3tcKdBI0Xud82LQsbfeiq3TkganjvB; v=v-1; _gcl_au=1.1.1820879348.1741881984; sr=%7B%22h%22%3A849%2C%22w%22%3A1728%2C%22p%22%3A2%7D; ak_bmsc=12DA12F3DC9B2ABF4F47558632340484~000000000000000000000000000000~YAAQZP49F0HFSI2VAQAAI+wTkRsCvktt8G11jnkV+T4UIawsN55RsmH07lgCfYsaUcenprvfkFJYtQW8XJbekntDA3wSbEGFtp9/iVS5Usz1Jlb4BI7JmnEyCSVu8rhCkchU/JXrh4faaeEZo5cyb5Bb05sJNMLogmMjYOBjg4GK94+xREVCc8dc3ad6BwDROYZxK1iimOBrtLV+f2fSylGJSCJ9Yi16jyGvfPla4GNLGJ85FMItsAoZMeYcWV8kvOZ5izen2RNZ+ZlpHX4BJoCjDmQMw1oJz6DVfJC6JwK9Sn2L8cofye5Bvc9dTyocwUR6wYucqAZrXnkTBaDommnaP3BwNnbXvYRJoWeed3UrLSZFtcFVFCixOeSozKmEOJj5rXUNtisY; bm_s=YAAQbP49FxKX242VAQAAJiwykQNd7lmE3pGGW5rEW/e9aipMi2HPRTPv47MpvK1LD1mdWpD4yPN+3udsD/irFBZDQWjIHURN0JXwwkeWQZbBDXgO01UNoFbytX8qNZJOo0iaXKMkOHMUIYxQ6b4g+lw3/Co6NOQCgPX0SUqn67YbMmLoxASs0/Fa77d1+Q07ZqiXNOurWmPuDKWly9nOsvsAwst5Xz99SlRllSr4WQTN6EcfqN7n+wJhSlbcsFHl4OooRc12Tpw+pFomIOLEBT2WMaBhbIKEIdOesIKu9KjA6XTIXWJCGaOhVViH3p0rMUd3DoFVhU5UWsfB3JtqMYKwsXvEMjDphsSpsYdobIe71Xkoa+r4dRe4aFNf64oXIzocru00lPFKz2yTXxBcbekSqEkJKXQrAoSUXZFHXWN5NaFftyB4v1ajj1cJ1Te0ctlIVOEtCmE=; bm_so=ADB501C194325C53480AF6BD1A2519FE3E71A4085DF823AD806F6D4092C38F4E~YAAQbP49FxOX242VAQAAJiwykQKLX26FZYg/mwv+WVQCJRxmcyYl5XqJfcfH5WPg48BqbUCvyEmIdYXVijj9GnIA4xnp9yMGGgUAB7JUfRO7rZfctGbNvESzTZAgDc3BArgVCahd7woid4XxnxGRwsQFKKA4hYsmJZ74IxlNRaTf9Qyk13DB8/Xixt4btfCT3mcpudUNgGZmZayR+WaFVa+8lAIGkogZtuXLp9NgHRExXjqK3VULGfuMdSMAhMo5yQDI5t6ehwO17JmOckIaVgB7mM8Z5z8zN9i1EqM4CXaGhCkX9YLgEE20gZlQvmLhJ5edp/u+mN/SCKuOB1R5/LNqkVfrLVNHR9Yvchedwb46olXBLZ98cbT69LdqAT63qys6IC/tV4VWTlK0XpnOSFNKekz+E4kEMwgO9Mw5cmzoG/hxRfClnCdEdoAgzLX+7AelMv9vc8UsZxuL1Q==; 
bm_lso=ADB501C194325C53480AF6BD1A2519FE3E71A4085DF823AD806F6D4092C38F4E~YAAQbP49FxOX242VAQAAJiwykQKLX26FZYg/mwv+WVQCJRxmcyYl5XqJfcfH5WPg48BqbUCvyEmIdYXVijj9GnIA4xnp9yMGGgUAB7JUfRO7rZfctGbNvESzTZAgDc3BArgVCahd7woid4XxnxGRwsQFKKA4hYsmJZ74IxlNRaTf9Qyk13DB8/Xixt4btfCT3mcpudUNgGZmZayR+WaFVa+8lAIGkogZtuXLp9NgHRExXjqK3VULGfuMdSMAhMo5yQDI5t6ehwO17JmOckIaVgB7mM8Z5z8zN9i1EqM4CXaGhCkX9YLgEE20gZlQvmLhJ5edp/u+mN/SCKuOB1R5/LNqkVfrLVNHR9Yvchedwb46olXBLZ98cbT69LdqAT63qys6IC/tV4VWTlK0XpnOSFNKekz+E4kEMwgO9Mw5cmzoG/hxRfClnCdEdoAgzLX+7AelMv9vc8UsZxuL1Q==^1741897741056; AKA_A2=A; akaalb_www_homes_prd=1741902608~op=homes_Prd_Edge_US:www_homes_prd_usw2|~rv=79~m=www_homes_prd_usw2:0|~os=48c4c61a41b922746ef5062cf402e343~id=cba36d26f4d2ae56ef348d9619bda3d8' \
  -H 'priority: u=0, i' \
  -H 'sec-ch-ua: "Chromium";v="134", "Not:A-Brand";v="24", "Google Chrome";v="134"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: none' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-gpc: 1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'

Oddly enough, I hit "refresh" enough times after my program bombed out that Chrome eventually loaded this URL successfully. I then copied out the CURL request once it was successful to see if it had changed; here it is below. The one difference I see is that the browser now sends a 'cache-control: max-age=0' header. I'm not well-versed enough to know why this would have changed (or if it's even relevant). What would you suggest trying so my Python code can run this consistently? Any help pointing me in the right direction would be tremendously helpful, as this is driving me crazy! Thanks in advance!

curl 'https://www.homes/real-estate-agents/sue-pearce/362zpwe/' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cache-control: max-age=0' \
  -b 'gp=%257b%2522k%2522%253a%257b%2522key%2522%253a%2522lqjs6ple4qbcy%2522%257d%252c%2522v%2522%253a4%252c%2522d%2522%253a%257b%2522lt%2522%253a33.392%252c%2522ln%2522%253a-84.57%257d%257d; hab=%257b%2522e%2522%253atrue%252c%2522r%2522%253a%255b%255d%257d; vr=vr-rU%2FpWRFUWU6wYzYaX5AuCg.1741881982; at=bAJyo0SDJ4EhYVyUtPvuVhUymGMGPlZLdn3tcKdBI0Xud82LQsbfeiq3TkganjvB; v=v-1; _gcl_au=1.1.1820879348.1741881984; ak_bmsc=12DA12F3DC9B2ABF4F47558632340484~000000000000000000000000000000~YAAQZP49F0HFSI2VAQAAI+wTkRsCvktt8G11jnkV+T4UIawsN55RsmH07lgCfYsaUcenprvfkFJYtQW8XJbekntDA3wSbEGFtp9/iVS5Usz1Jlb4BI7JmnEyCSVu8rhCkchU/JXrh4faaeEZo5cyb5Bb05sJNMLogmMjYOBjg4GK94+xREVCc8dc3ad6BwDROYZxK1iimOBrtLV+f2fSylGJSCJ9Yi16jyGvfPla4GNLGJ85FMItsAoZMeYcWV8kvOZ5izen2RNZ+ZlpHX4BJoCjDmQMw1oJz6DVfJC6JwK9Sn2L8cofye5Bvc9dTyocwUR6wYucqAZrXnkTBaDommnaP3BwNnbXvYRJoWeed3UrLSZFtcFVFCixOeSozKmEOJj5rXUNtisY; AKA_A2=A; akaalb_www_homes_prd=1741902608~op=homes_Prd_Edge_US:www_homes_prd_usw2|~rv=79~m=www_homes_prd_usw2:0|~os=48c4c61a41b922746ef5062cf402e343~id=cba36d26f4d2ae56ef348d9619bda3d8; vt=vt-LvPAD0gIJ0%2Br2gUpwbd1Hg; bm_ss=ab8e18ef4e; bm_so=2CF5C8E88F5A3FDEA0BB1148DC4B3E92B6CC33F58F3FE3D3146676D320D41C49~YAAQyDhjaGttCZCVAQAA+Ah1kQIgEX4lhXLMZwfe+NKUNGvgdpESA2rhSFmMZp6fyvgRZmvV7ToF+9hu5+pdmIryMhW1fvwXRpnL+M5c6LbJgjqJLunjBOrQ0/5txCQ6lKjqy71KqIPPKF8ngY4i69GLBew1Al0IeUKXxv+L0XdmlpcfvnILoy4pLjw700cwUpZfLRMYok40ObcCy7kaB0+bU9CK69GRSmCHo7bhWV/Lutv2cCulXn081knh1lLoVfZrR21C8074pIC4zMSrHhVkrEQV5NapK8+Bd/lz7u5EM5pw9pk+X+ozvrgUrsWvmo6LOl5yVxr74SRa1IwIQmFRrjLwFxDFmHsSEzgyiXPfbI55cH4EeTwXo9XDG34GwnuNcAxN/fTtglHUZyUC08VVtbGltUb6fVUO+yGVpSRvMPuHK85mR1TPZL9WR2Ujh45Ia4QGOOsL/F/fPQ==; sr=%7B%22h%22%3A350%2C%22w%22%3A1728%2C%22p%22%3A2%7D; 
bm_lso=2CF5C8E88F5A3FDEA0BB1148DC4B3E92B6CC33F58F3FE3D3146676D320D41C49~YAAQyDhjaGttCZCVAQAA+Ah1kQIgEX4lhXLMZwfe+NKUNGvgdpESA2rhSFmMZp6fyvgRZmvV7ToF+9hu5+pdmIryMhW1fvwXRpnL+M5c6LbJgjqJLunjBOrQ0/5txCQ6lKjqy71KqIPPKF8ngY4i69GLBew1Al0IeUKXxv+L0XdmlpcfvnILoy4pLjw700cwUpZfLRMYok40ObcCy7kaB0+bU9CK69GRSmCHo7bhWV/Lutv2cCulXn081knh1lLoVfZrR21C8074pIC4zMSrHhVkrEQV5NapK8+Bd/lz7u5EM5pw9pk+X+ozvrgUrsWvmo6LOl5yVxr74SRa1IwIQmFRrjLwFxDFmHsSEzgyiXPfbI55cH4EeTwXo9XDG34GwnuNcAxN/fTtglHUZyUC08VVtbGltUb6fVUO+yGVpSRvMPuHK85mR1TPZL9WR2Ujh45Ia4QGOOsL/F/fPQ==^1741902123250; bm_s=YAAQyDhjaNt8CZCVAQAAnj91kQM5FGaQxhEb31qC1Khm9ee44yiodp6KEpP89IIZt7Z3ojnCfyzXc813U5XxrxoQx0xdEqRNI+0MRmW7+/9D3iNzrA8Wb/CycNmbveJ1D86DkidL5Se9ZioFHNjAmfAuVvQDbB7k0xyHVbZO17i26rzpHZzVY09CdSj5YOlnw/A9rcLO2KNsFsvJHWIQfxd/U1mOJ1vOwvdiupeVV8ughbEQneQ+JP+vJ5f9au186PIkbgGQl3EaC3pUEMlMoOeA3kduXISq0uwGMuhY7M3AJWCnStfnT3EzF2pnYi1CPAaxvUdPaEv81nZ29HKpRStmUH4yGbke3i7CIiht3vg4efff9CnYCynRvmf+zpVlt0ns+OQmpJNPjGUvB1P10y2iPyvIuj4WWStkxsyRE7oSmHj2KQugyW4HCnWGRFyTvIS5DpqfH9Q=' \
  -H 'priority: u=0, i' \
  -H 'sec-ch-ua: "Chromium";v="134", "Not:A-Brand";v="24", "Google Chrome";v="134"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: none' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-gpc: 1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'

asked Mar 13 at 21:55 by Ben

2 Answers

I tested your code and got the same issue. It looks like there is a problem handling HTTP/2 when using Chrome with Selenium. I also noticed that homes asks to reload some JS/CSS resources with preload hints.
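For readers who have to stay on Chrome, one experiment consistent with this diagnosis is to start Chrome with HTTP/2 turned off, so it falls back to HTTP/1.1. The sketch below is an assumption and has not been tested against homes. `--disable-http2` is a Chromium command-line switch, while `chrome_arguments` and `build_chrome` are hypothetical helper names used only for illustration.

```python
def chrome_arguments(disable_http2=True, extra=()):
    """Return the command-line switches to pass to Chrome."""
    args = list(extra)
    if disable_http2:
        # Assumption: forcing HTTP/1.1 sidesteps the HTTP/2 error seen above.
        args.append("--disable-http2")
    return args


def build_chrome(disable_http2=True):
    """Create a Selenium-controlled Chrome using the switches above."""
    # Imported lazily so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(disable_http2):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `build_chrome()` in place of `webdriver.Chrome()` in the original loop would be enough to try it. If the error persists with HTTP/2 off, the protocol is probably not the real cause.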

I assume using Chrome is not a priority, so here are two possible solutions that worked for me:

1. Use Firefox with Selenium

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

urllist = ['https://www.homes/real-estate-agents/susanne-guthrie/x5lx8zp/',
           'https://www.homes/real-estate-agents/sue-pearce/362zpwe/',
           'https://www.homes/real-estate-agents/katie-mihelich/kzppjy8/',
           'https://www.homes/real-estate-agents/matt-pittman/etryj3q/',
           'https://www.homes/real-estate-agents/mateen-ansari/mg2qg9l/',
           'https://www.homes/real-estate-agents/rachael-real/dk3q8gl/',
           'https://www.homes/real-estate-agents/annamarie-moise/21qtbtb/',
           'https://www.homes/real-estate-agents/madison-verdun/0qvxe13/',
           'https://www.homes/real-estate-agents/david-stob/hsztzf1/',
           'https://www.homes/real-estate-agents/samuel-chrusciel/ww3525k/',
           'https://www.homes/real-estate-agents/cathie-smith/b32xp59/',
           'https://www.homes/real-estate-agents/jean-reedy-baren/7vk95ky/',
           'https://www.homes/real-estate-agents/randy-stob/y9j077s/',
           'https://www.homes/real-estate-agents/jeanne-jordan/sh90wf7/',
           'https://www.homes/real-estate-agents/anthony-janega/p42zbfv/']

options = Options()
options.set_preference("network.http.http2.enabled", True)  # Ensure HTTP/2 is enabled
driver = webdriver.Firefox(options=options)

for url in urllist:
    time.sleep(3)
    driver.delete_all_cookies() # optional, you can clear cookies
    driver.get(url)
    print(f"Loaded: {url}")
    # SCRAPE DATA

driver.quit() # close the browser

I have tested this code multiple times with Firefox, and it looks like it works. I did not get the HTTP/2 error with this approach. You can give it a try.

2. Use Playwright

In order to install playwright:

pip install playwright
playwright install

Note: I got an error saying I am missing the following libraries: libgtk-4.so.1 and libmanette-0.2.so.0

To fix this issue on Linux (Debian-based):

sudo apt update
sudo apt install -y libgtk-4-bin libgtk-4-common libgtk-4-dev libgtk-4-1 libmanette-0.2-0 libglib2.0-dev

Then the code will look like the following:

from playwright.sync_api import sync_playwright
import random
import time

# List of URLs to scrape
urllist = [
    'https://www.homes/real-estate-agents/susanne-guthrie/x5lx8zp/',
    'https://www.homes/real-estate-agents/sue-pearce/362zpwe/',
    'https://www.homes/real-estate-agents/katie-mihelich/kzppjy8/',
    'https://www.homes/real-estate-agents/matt-pittman/etryj3q/',
    'https://www.homes/real-estate-agents/mateen-ansari/mg2qg9l/',
    'https://www.homes/real-estate-agents/rachael-real/dk3q8gl/',
    'https://www.homes/real-estate-agents/annamarie-moise/21qtbtb/',
    'https://www.homes/real-estate-agents/madison-verdun/0qvxe13/',
    'https://www.homes/real-estate-agents/david-stob/hsztzf1/',
    'https://www.homes/real-estate-agents/samuel-chrusciel/ww3525k/',
    'https://www.homes/real-estate-agents/cathie-smith/b32xp59/',
    'https://www.homes/real-estate-agents/jean-reedy-baren/7vk95ky/',
    'https://www.homes/real-estate-agents/randy-stob/y9j077s/',
    'https://www.homes/real-estate-agents/jeanne-jordan/sh90wf7/',
    'https://www.homes/real-estate-agents/anthony-janega/p42zbfv/'
]

with sync_playwright() as p:
    # Change to True for headless mode
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()

    for url in urllist:
        time.sleep(random.randint(3, 7))
        page.goto(url, wait_until="domcontentloaded")
        print(f"Visited: {url}")

    browser.close()

I hope these solutions will work for you too.
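Since the question notes that once the error appears almost every later URL fails too, either approach could be wrapped in a loop that throws away the browser session after a failed load and starts a new one. This is a minimal sketch under that assumption, not something tested against homes. `scrape_with_restart`, `make_driver` and `scrape` are hypothetical names, and the driver only needs a `quit()` method.

```python
import time


def scrape_with_restart(urls, make_driver, scrape, max_attempts=3, pause=2.0):
    """Visit each URL, replacing the browser session after any failed load.

    make_driver: zero-argument callable returning a new driver object
    scrape:      callable(driver, url) that loads the page and returns data
    Returns (results, failed): a dict of url -> data and a list of failed URLs.
    """
    results, failed = {}, []
    driver = make_driver()
    try:
        for url in urls:
            for attempt in range(1, max_attempts + 1):
                try:
                    results[url] = scrape(driver, url)
                    break
                except Exception as exc:  # with Selenium this is a WebDriverException
                    print(f"attempt {attempt} failed for {url}: {exc}")
                    driver.quit()                # discard the broken session
                    time.sleep(pause * attempt)  # wait a little longer each time
                    driver = make_driver()       # fresh browser, fresh cookies
            else:
                # Every attempt raised, so record the URL and move on.
                failed.append(url)
    finally:
        driver.quit()
    return results, failed
```

With Selenium, `make_driver` could simply be `webdriver.Firefox`, and `scrape` a small function that calls `driver.get(url)` and then extracts the data. Restarting the browser costs a few seconds per failure, but it stops one bad session from poisoning the rest of the run.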

The issue is that the site sets a cookie which you need to either keep updating or reuse.
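One way to narrow down which cookie that might be is to compare the cookie strings (the `-b` argument) of the failed and the successful cURL commands from the question. The sketch below is only a diagnostic aid; `parse_cookie_header` and `diff_cookies` are hypothetical helper names, and the sample values are shortened placeholders rather than the real cookie contents.

```python
def parse_cookie_header(header):
    """Split a 'name=value; name=value' cookie string into a dict."""
    cookies = {}
    for part in header.split(";"):
        part = part.strip()
        if "=" in part:
            # Split on the first '=' only, since values may contain '='.
            name, value = part.split("=", 1)
            cookies[name] = value
    return cookies


def diff_cookies(failed_header, ok_header):
    """Report cookie names added, removed, or changed between two requests."""
    before = parse_cookie_header(failed_header)
    after = parse_cookie_header(ok_header)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(n for n in before.keys() & after.keys()
                          if before[n] != after[n]),
    }


# Shortened placeholder values, using cookie names that appear in the question.
failed = "v=v-1; sr=h849; bm_s=AAA; bm_so=BBB"
ok = "v=v-1; bm_ss=ab8e; bm_so=CCC; sr=h350; bm_s=DDD; vt=vt-1"
print(diff_cookies(failed, ok))
```

Run against the two full cookie strings in the question, this kind of comparison appears to show `vt` and `bm_ss` being added and `bm_s`, `bm_so`, `bm_lso` and `sr` changing, which would be the first candidates to look at. In Selenium, `driver.get_cookies()` returns the cookies the current session holds, so the same comparison can be made from inside the scraper.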
