I have a fairly basic scraper set up to loop through a list of real estate agent profiles on Homes. It worked consistently for months, but it has started hitting sporadic ERR_HTTP2_PROTOCOL_ERROR errors. I'm using Python/Selenium/Chromedriver. At first I thought my IP was being blocked or flagged, but when I open the same URL in a different browser, it loads perfectly fine (the error first appears when running with Chromedriver as part of my scraper). This error used to happen about once every 1,000 URLs; now it happens every 5-10 or so, to the point that my scraper is unusable. Once my code gets this error, almost every subsequent URL in the loop fails with the same error.
What I've tried:
- Swapping to a different network (wired vs. Wi-Fi, etc.). No change.
- Updating Chrome and Chromedriver to the latest versions. No change.
- Using Chrome Developer Tools to copy the failed request as a cURL command and running it in Terminal. This WORKED, which seems to indicate the request itself is good.
What would you all suggest trying next? Since the same URL works in a different browser, is the Selenium-controlled Chrome browser not actually sending this request across the wire?
Below is a simplified version of my larger scraping program, to isolate out the loop and so I can re-create this scenario:
import random
from selenium import webdriver
import time
urllist = ['https://www.homes/real-estate-agents/susanne-guthrie/x5lx8zp/',
'https://www.homes/real-estate-agents/sue-pearce/362zpwe/',
'https://www.homes/real-estate-agents/katie-mihelich/kzppjy8/',
'https://www.homes/real-estate-agents/matt-pittman/etryj3q/',
'https://www.homes/real-estate-agents/mateen-ansari/mg2qg9l/',
'https://www.homes/real-estate-agents/rachael-real/dk3q8gl/',
'https://www.homes/real-estate-agents/annamarie-moise/21qtbtb/',
'https://www.homes/real-estate-agents/madison-verdun/0qvxe13/',
'https://www.homes/real-estate-agents/david-stob/hsztzf1/',
'https://www.homes/real-estate-agents/samuel-chrusciel/ww3525k/',
'https://www.homes/real-estate-agents/cathie-smith/b32xp59/',
'https://www.homes/real-estate-agents/jean-reedy-baren/7vk95ky/',
'https://www.homes/real-estate-agents/randy-stob/y9j077s/',
'https://www.homes/real-estate-agents/jeanne-jordan/sh90wf7/',
'https://www.homes/real-estate-agents/anthony-janega/p42zbfv/']
driver = webdriver.Chrome()
for url in urllist:
    delay = random.randint(1, 3)
    time.sleep(delay)
    driver.get(url)
    # SCRAPE DATA
And here is the CURL request from the failed browser attempt, copied out of Chrome Developer tools. When running this in terminal, it works:
curl 'https://www.homes/real-estate-agents/sue-pearce/362zpwe/' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
-H 'accept-language: en-US,en;q=0.9' \
-b 'gp=%257b%2522k%2522%253a%257b%2522key%2522%253a%2522lqjs6ple4qbcy%2522%257d%252c%2522v%2522%253a4%252c%2522d%2522%253a%257b%2522lt%2522%253a33.392%252c%2522ln%2522%253a-84.57%257d%257d; hab=%257b%2522e%2522%253atrue%252c%2522r%2522%253a%255b%255d%257d; vr=vr-rU%2FpWRFUWU6wYzYaX5AuCg.1741881982; at=bAJyo0SDJ4EhYVyUtPvuVhUymGMGPlZLdn3tcKdBI0Xud82LQsbfeiq3TkganjvB; v=v-1; _gcl_au=1.1.1820879348.1741881984; sr=%7B%22h%22%3A849%2C%22w%22%3A1728%2C%22p%22%3A2%7D; ak_bmsc=12DA12F3DC9B2ABF4F47558632340484~000000000000000000000000000000~YAAQZP49F0HFSI2VAQAAI+wTkRsCvktt8G11jnkV+T4UIawsN55RsmH07lgCfYsaUcenprvfkFJYtQW8XJbekntDA3wSbEGFtp9/iVS5Usz1Jlb4BI7JmnEyCSVu8rhCkchU/JXrh4faaeEZo5cyb5Bb05sJNMLogmMjYOBjg4GK94+xREVCc8dc3ad6BwDROYZxK1iimOBrtLV+f2fSylGJSCJ9Yi16jyGvfPla4GNLGJ85FMItsAoZMeYcWV8kvOZ5izen2RNZ+ZlpHX4BJoCjDmQMw1oJz6DVfJC6JwK9Sn2L8cofye5Bvc9dTyocwUR6wYucqAZrXnkTBaDommnaP3BwNnbXvYRJoWeed3UrLSZFtcFVFCixOeSozKmEOJj5rXUNtisY; bm_s=YAAQbP49FxKX242VAQAAJiwykQNd7lmE3pGGW5rEW/e9aipMi2HPRTPv47MpvK1LD1mdWpD4yPN+3udsD/irFBZDQWjIHURN0JXwwkeWQZbBDXgO01UNoFbytX8qNZJOo0iaXKMkOHMUIYxQ6b4g+lw3/Co6NOQCgPX0SUqn67YbMmLoxASs0/Fa77d1+Q07ZqiXNOurWmPuDKWly9nOsvsAwst5Xz99SlRllSr4WQTN6EcfqN7n+wJhSlbcsFHl4OooRc12Tpw+pFomIOLEBT2WMaBhbIKEIdOesIKu9KjA6XTIXWJCGaOhVViH3p0rMUd3DoFVhU5UWsfB3JtqMYKwsXvEMjDphsSpsYdobIe71Xkoa+r4dRe4aFNf64oXIzocru00lPFKz2yTXxBcbekSqEkJKXQrAoSUXZFHXWN5NaFftyB4v1ajj1cJ1Te0ctlIVOEtCmE=; bm_so=ADB501C194325C53480AF6BD1A2519FE3E71A4085DF823AD806F6D4092C38F4E~YAAQbP49FxOX242VAQAAJiwykQKLX26FZYg/mwv+WVQCJRxmcyYl5XqJfcfH5WPg48BqbUCvyEmIdYXVijj9GnIA4xnp9yMGGgUAB7JUfRO7rZfctGbNvESzTZAgDc3BArgVCahd7woid4XxnxGRwsQFKKA4hYsmJZ74IxlNRaTf9Qyk13DB8/Xixt4btfCT3mcpudUNgGZmZayR+WaFVa+8lAIGkogZtuXLp9NgHRExXjqK3VULGfuMdSMAhMo5yQDI5t6ehwO17JmOckIaVgB7mM8Z5z8zN9i1EqM4CXaGhCkX9YLgEE20gZlQvmLhJ5edp/u+mN/SCKuOB1R5/LNqkVfrLVNHR9Yvchedwb46olXBLZ98cbT69LdqAT63qys6IC/tV4VWTlK0XpnOSFNKekz+E4kEMwgO9Mw5cmzoG/hxRfClnCdEdoAgzLX+7AelMv9vc8UsZxuL1Q==; 
bm_lso=ADB501C194325C53480AF6BD1A2519FE3E71A4085DF823AD806F6D4092C38F4E~YAAQbP49FxOX242VAQAAJiwykQKLX26FZYg/mwv+WVQCJRxmcyYl5XqJfcfH5WPg48BqbUCvyEmIdYXVijj9GnIA4xnp9yMGGgUAB7JUfRO7rZfctGbNvESzTZAgDc3BArgVCahd7woid4XxnxGRwsQFKKA4hYsmJZ74IxlNRaTf9Qyk13DB8/Xixt4btfCT3mcpudUNgGZmZayR+WaFVa+8lAIGkogZtuXLp9NgHRExXjqK3VULGfuMdSMAhMo5yQDI5t6ehwO17JmOckIaVgB7mM8Z5z8zN9i1EqM4CXaGhCkX9YLgEE20gZlQvmLhJ5edp/u+mN/SCKuOB1R5/LNqkVfrLVNHR9Yvchedwb46olXBLZ98cbT69LdqAT63qys6IC/tV4VWTlK0XpnOSFNKekz+E4kEMwgO9Mw5cmzoG/hxRfClnCdEdoAgzLX+7AelMv9vc8UsZxuL1Q==^1741897741056; AKA_A2=A; akaalb_www_homes_prd=1741902608~op=homes_Prd_Edge_US:www_homes_prd_usw2|~rv=79~m=www_homes_prd_usw2:0|~os=48c4c61a41b922746ef5062cf402e343~id=cba36d26f4d2ae56ef348d9619bda3d8' \
-H 'priority: u=0, i' \
-H 'sec-ch-ua: "Chromium";v="134", "Not:A-Brand";v="24", "Google Chrome";v="134"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'sec-fetch-dest: document' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-site: none' \
-H 'sec-fetch-user: ?1' \
-H 'sec-gpc: 1' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
Oddly enough, after my program bombed out I hit "refresh" enough times that Chrome eventually loaded this URL successfully. I then copied out the cURL request for the successful load to see if anything had changed; here it is below. The one difference I see is that the browser now sends a 'cache-control: max-age=0' header. I'm not well-versed enough to know why this would have changed (or if it's even relevant). What would you suggest trying so my Python code can run this consistently? Any help pointing me in the right direction would be tremendously helpful, as this is driving me crazy! Thanks in advance!
curl 'https://www.homes/real-estate-agents/sue-pearce/362zpwe/' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
-H 'accept-language: en-US,en;q=0.9' \
-H 'cache-control: max-age=0' \
-b 'gp=%257b%2522k%2522%253a%257b%2522key%2522%253a%2522lqjs6ple4qbcy%2522%257d%252c%2522v%2522%253a4%252c%2522d%2522%253a%257b%2522lt%2522%253a33.392%252c%2522ln%2522%253a-84.57%257d%257d; hab=%257b%2522e%2522%253atrue%252c%2522r%2522%253a%255b%255d%257d; vr=vr-rU%2FpWRFUWU6wYzYaX5AuCg.1741881982; at=bAJyo0SDJ4EhYVyUtPvuVhUymGMGPlZLdn3tcKdBI0Xud82LQsbfeiq3TkganjvB; v=v-1; _gcl_au=1.1.1820879348.1741881984; ak_bmsc=12DA12F3DC9B2ABF4F47558632340484~000000000000000000000000000000~YAAQZP49F0HFSI2VAQAAI+wTkRsCvktt8G11jnkV+T4UIawsN55RsmH07lgCfYsaUcenprvfkFJYtQW8XJbekntDA3wSbEGFtp9/iVS5Usz1Jlb4BI7JmnEyCSVu8rhCkchU/JXrh4faaeEZo5cyb5Bb05sJNMLogmMjYOBjg4GK94+xREVCc8dc3ad6BwDROYZxK1iimOBrtLV+f2fSylGJSCJ9Yi16jyGvfPla4GNLGJ85FMItsAoZMeYcWV8kvOZ5izen2RNZ+ZlpHX4BJoCjDmQMw1oJz6DVfJC6JwK9Sn2L8cofye5Bvc9dTyocwUR6wYucqAZrXnkTBaDommnaP3BwNnbXvYRJoWeed3UrLSZFtcFVFCixOeSozKmEOJj5rXUNtisY; AKA_A2=A; akaalb_www_homes_prd=1741902608~op=homes_Prd_Edge_US:www_homes_prd_usw2|~rv=79~m=www_homes_prd_usw2:0|~os=48c4c61a41b922746ef5062cf402e343~id=cba36d26f4d2ae56ef348d9619bda3d8; vt=vt-LvPAD0gIJ0%2Br2gUpwbd1Hg; bm_ss=ab8e18ef4e; bm_so=2CF5C8E88F5A3FDEA0BB1148DC4B3E92B6CC33F58F3FE3D3146676D320D41C49~YAAQyDhjaGttCZCVAQAA+Ah1kQIgEX4lhXLMZwfe+NKUNGvgdpESA2rhSFmMZp6fyvgRZmvV7ToF+9hu5+pdmIryMhW1fvwXRpnL+M5c6LbJgjqJLunjBOrQ0/5txCQ6lKjqy71KqIPPKF8ngY4i69GLBew1Al0IeUKXxv+L0XdmlpcfvnILoy4pLjw700cwUpZfLRMYok40ObcCy7kaB0+bU9CK69GRSmCHo7bhWV/Lutv2cCulXn081knh1lLoVfZrR21C8074pIC4zMSrHhVkrEQV5NapK8+Bd/lz7u5EM5pw9pk+X+ozvrgUrsWvmo6LOl5yVxr74SRa1IwIQmFRrjLwFxDFmHsSEzgyiXPfbI55cH4EeTwXo9XDG34GwnuNcAxN/fTtglHUZyUC08VVtbGltUb6fVUO+yGVpSRvMPuHK85mR1TPZL9WR2Ujh45Ia4QGOOsL/F/fPQ==; sr=%7B%22h%22%3A350%2C%22w%22%3A1728%2C%22p%22%3A2%7D; 
bm_lso=2CF5C8E88F5A3FDEA0BB1148DC4B3E92B6CC33F58F3FE3D3146676D320D41C49~YAAQyDhjaGttCZCVAQAA+Ah1kQIgEX4lhXLMZwfe+NKUNGvgdpESA2rhSFmMZp6fyvgRZmvV7ToF+9hu5+pdmIryMhW1fvwXRpnL+M5c6LbJgjqJLunjBOrQ0/5txCQ6lKjqy71KqIPPKF8ngY4i69GLBew1Al0IeUKXxv+L0XdmlpcfvnILoy4pLjw700cwUpZfLRMYok40ObcCy7kaB0+bU9CK69GRSmCHo7bhWV/Lutv2cCulXn081knh1lLoVfZrR21C8074pIC4zMSrHhVkrEQV5NapK8+Bd/lz7u5EM5pw9pk+X+ozvrgUrsWvmo6LOl5yVxr74SRa1IwIQmFRrjLwFxDFmHsSEzgyiXPfbI55cH4EeTwXo9XDG34GwnuNcAxN/fTtglHUZyUC08VVtbGltUb6fVUO+yGVpSRvMPuHK85mR1TPZL9WR2Ujh45Ia4QGOOsL/F/fPQ==^1741902123250; bm_s=YAAQyDhjaNt8CZCVAQAAnj91kQM5FGaQxhEb31qC1Khm9ee44yiodp6KEpP89IIZt7Z3ojnCfyzXc813U5XxrxoQx0xdEqRNI+0MRmW7+/9D3iNzrA8Wb/CycNmbveJ1D86DkidL5Se9ZioFHNjAmfAuVvQDbB7k0xyHVbZO17i26rzpHZzVY09CdSj5YOlnw/A9rcLO2KNsFsvJHWIQfxd/U1mOJ1vOwvdiupeVV8ughbEQneQ+JP+vJ5f9au186PIkbgGQl3EaC3pUEMlMoOeA3kduXISq0uwGMuhY7M3AJWCnStfnT3EzF2pnYi1CPAaxvUdPaEv81nZ29HKpRStmUH4yGbke3i7CIiht3vg4efff9CnYCynRvmf+zpVlt0ns+OQmpJNPjGUvB1P10y2iPyvIuj4WWStkxsyRE7oSmHj2KQugyW4HCnWGRFyTvIS5DpqfH9Q=' \
-H 'priority: u=0, i' \
-H 'sec-ch-ua: "Chromium";v="134", "Not:A-Brand";v="24", "Google Chrome";v="134"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'sec-fetch-dest: document' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-site: none' \
-H 'sec-fetch-user: ?1' \
-H 'sec-gpc: 1' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
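The cookie strings passed to -b are long enough that differences are easy to miss by eye. A minimal Python sketch for comparing two such strings by cookie name follows; the helper names parse_cookies and diff_cookies are hypothetical, and the sample strings are shortened placeholders rather than the real values from the requests above:

```python
def parse_cookies(cookie_header: str) -> dict[str, str]:
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for part in cookie_header.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first '=' only, since values may contain '=' themselves.
        name, _, value = part.partition("=")
        cookies[name] = value
    return cookies


def diff_cookies(failed: str, succeeded: str) -> dict[str, list[str]]:
    """Report cookie names that were added, dropped, or changed in value."""
    old, new = parse_cookies(failed), parse_cookies(succeeded)
    return {
        "only_in_failed": sorted(old.keys() - new.keys()),
        "only_in_succeeded": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Placeholder values (assumption): paste the real -b strings here instead.
failed = "v=v-1; AKA_A2=A; bm_so=AAA; bm_s=BBB"
succeeded = "v=v-1; AKA_A2=A; bm_so=CCC; bm_s=DDD; vt=vt-123; bm_ss=ab8e"

result = diff_cookies(failed, succeeded)
print(result)
```

In the two requests above, the successful one carries vt and bm_ss cookies that the failed one lacks, and the bm_s, bm_so and bm_lso values differ, so cache-control is not the only change between them.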
asked Mar 13 at 21:55 by Ben
2 Answers
I tested your code and got the same issue. It looks like there is a problem with how HTTP/2 is handled when using Chrome with Selenium. I also noticed that Homes asks the browser to reload some JS/CSS resources via preload hints.
I assume using Chrome is not a hard requirement, so here are two possible solutions that worked for me:
1. Use Firefox with Selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
urllist = ['https://www.homes/real-estate-agents/susanne-guthrie/x5lx8zp/',
'https://www.homes/real-estate-agents/sue-pearce/362zpwe/',
'https://www.homes/real-estate-agents/katie-mihelich/kzppjy8/',
'https://www.homes/real-estate-agents/matt-pittman/etryj3q/',
'https://www.homes/real-estate-agents/mateen-ansari/mg2qg9l/',
'https://www.homes/real-estate-agents/rachael-real/dk3q8gl/',
'https://www.homes/real-estate-agents/annamarie-moise/21qtbtb/',
'https://www.homes/real-estate-agents/madison-verdun/0qvxe13/',
'https://www.homes/real-estate-agents/david-stob/hsztzf1/',
'https://www.homes/real-estate-agents/samuel-chrusciel/ww3525k/',
'https://www.homes/real-estate-agents/cathie-smith/b32xp59/',
'https://www.homes/real-estate-agents/jean-reedy-baren/7vk95ky/',
'https://www.homes/real-estate-agents/randy-stob/y9j077s/',
'https://www.homes/real-estate-agents/jeanne-jordan/sh90wf7/',
'https://www.homes/real-estate-agents/anthony-janega/p42zbfv/']
options = Options()
options.set_preference("network.http.http2.enabled", True) # Ensure HTTP/2 is enabled
driver = webdriver.Firefox(options=options)
for url in urllist:
    time.sleep(3)
    driver.delete_all_cookies()  # optional: clear cookies before each request
    driver.get(url)
    print(f"Loaded: {url}")
    # SCRAPE DATA
driver.quit()  # close the browser
I have tested this code multiple times with Firefox, and it appears to work: I did not get the HTTP/2 error with this approach. You can give it a try.
2. Use Playwright
To install Playwright:
pip install playwright
playwright install
Note: I got an error saying that the following libraries were missing: libgtk-4.so.1 and libmanette-0.2.so.0
To fix this on Debian-based Linux:
sudo apt update
sudo apt install -y libgtk-4-bin libgtk-4-common libgtk-4-dev libgtk-4-1 libmanette-0.2-0 libglib2.0-dev
Then the code will look like the following:
from playwright.sync_api import sync_playwright
import random
import time
# List of URLs to scrape
urllist = [
'https://www.homes/real-estate-agents/susanne-guthrie/x5lx8zp/',
'https://www.homes/real-estate-agents/sue-pearce/362zpwe/',
'https://www.homes/real-estate-agents/katie-mihelich/kzppjy8/',
'https://www.homes/real-estate-agents/matt-pittman/etryj3q/',
'https://www.homes/real-estate-agents/mateen-ansari/mg2qg9l/',
'https://www.homes/real-estate-agents/rachael-real/dk3q8gl/',
'https://www.homes/real-estate-agents/annamarie-moise/21qtbtb/',
'https://www.homes/real-estate-agents/madison-verdun/0qvxe13/',
'https://www.homes/real-estate-agents/david-stob/hsztzf1/',
'https://www.homes/real-estate-agents/samuel-chrusciel/ww3525k/',
'https://www.homes/real-estate-agents/cathie-smith/b32xp59/',
'https://www.homes/real-estate-agents/jean-reedy-baren/7vk95ky/',
'https://www.homes/real-estate-agents/randy-stob/y9j077s/',
'https://www.homes/real-estate-agents/jeanne-jordan/sh90wf7/',
'https://www.homes/real-estate-agents/anthony-janega/p42zbfv/'
]
with sync_playwright() as p:
    # Change to True for headless mode
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    for url in urllist:
        time.sleep(random.randint(3, 7))
        page.goto(url, wait_until="domcontentloaded")
        print(f"Visited: {url}")
    browser.close()
I hope these solutions will work for you too.
The issue is that the site sets a cookie (via Set-Cookie) that you need to keep updating or keep sending with your requests.
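One way to act on this, as a rough sketch rather than a confirmed fix: when a page load fails, throw away the browser session (and its stale cookies with it) and retry with a fresh one. The helper below is written against plain callables so it does not depend on any particular driver. The names scrape_with_restart, make_session, load and close are hypothetical, FakeSession only simulates the "every load after the first error fails" behaviour described in the question, and the idea that a fresh session clears the error is an assumption that has not been verified against Homes.

```python
import time
from typing import Callable, Iterable, List, Tuple


def scrape_with_restart(
    urls: Iterable[str],
    make_session: Callable[[], object],
    load: Callable[[object, str], None],
    close: Callable[[object], None],
    max_attempts: int = 3,
    delay: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Load each URL; on any error, replace the session and retry."""
    loaded: List[str] = []
    failed: List[str] = []
    session = make_session()
    try:
        for url in urls:
            for _ in range(max_attempts):
                try:
                    load(session, url)
                    loaded.append(url)
                    break
                except Exception:
                    # Drop the session so the next attempt starts with fresh cookies.
                    close(session)
                    session = make_session()
                    time.sleep(delay)
            else:
                # Every attempt raised; record the URL and move on.
                failed.append(url)
    finally:
        close(session)
    return loaded, failed


class FakeSession:
    """Stand-in for a browser: every load after its first one fails."""

    def __init__(self) -> None:
        self.loads = 0


created: List[FakeSession] = []


def make_fake_session() -> FakeSession:
    session = FakeSession()
    created.append(session)
    return session


def fake_load(session: FakeSession, url: str) -> None:
    session.loads += 1
    if session.loads > 1:
        raise RuntimeError("net::ERR_HTTP2_PROTOCOL_ERROR")


loaded, failed = scrape_with_restart(
    ["a", "b", "c"], make_fake_session, fake_load, close=lambda s: None
)
print(loaded, failed, len(created))
```

With Selenium, the three callables could be webdriver.Chrome for make_session, lambda d, u: d.get(u) for load, and lambda d: d.quit() for close. Restarting the browser on every failure is slow, so a nonzero delay and a low max_attempts are probably sensible.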