web scraping - Loading A Public Webpage in PowerShell That Requires JS and Blocks Developer Tools - Stack Overflow

Can someone help me find a way to load a public web page that requires JavaScript and blocks access from developer tools? I had an automated process that worked as follows.

$TdyDate = $(get-date -f yyyyMMdd)
$wsjurl = "https://www.wsj/print-edition/$TdyDate/frontpage"
$wsjweb = Invoke-WebRequest -Uri $wsjurl -UseBasicParsing

This recently started generating "Please enable JS and disable any ad blocker" errors.

Based on this Stack Overflow post, I tried the following, which gets me past these errors but only pulls down an "Access Blocked" landing page instead of the full web page that renders in my browser.

Set-Alias msedge 'C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe'
msedge --headless --dump-dom --disable-gpu $wsjurl

If anyone could help me figure out a way around this, it would be greatly appreciated. The web page I'm targeting is publicly accessible.

asked Mar 16 at 16:01 by jbug187
  • Try making the request with Postman and see whether you get the same error. Postman is very robust and adds HTTP headers to the request automatically. If Postman works, check the Postman Console for the raw request, then add any HTTP headers that Postman added to your PowerShell request. Issues like this are often caused by the User-Agent header differing between Postman and your PowerShell request. – jdweng, commented Mar 16 at 17:17
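
The header-copying idea in the comment above might look roughly like this in PowerShell. This is only a sketch, not a confirmed fix: the helper name New-BrowserLikeRequest, the default User-Agent string, and the Accept-Language value are all illustrative assumptions, and the real values should be copied from whatever Postman or your browser actually sends.

```powershell
# Hypothetical helper (not from the original post): builds a parameter set
# that can be splatted onto Invoke-WebRequest.
function New-BrowserLikeRequest {
    param(
        [Parameter(Mandatory)] [string] $Uri,
        # Assumed placeholder; replace with the User-Agent seen in the Postman Console.
        [string] $UserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    )
    @{
        Uri             = $Uri
        UserAgent       = $UserAgent          # maps to Invoke-WebRequest -UserAgent
        UseBasicParsing = $true               # same switch the question already uses
        Headers         = @{
            # Assumed example header; add whatever else Postman sent.
            'Accept-Language' = 'en-US,en;q=0.9'
        }
    }
}

# Usage (performs a network request, so it is left commented out here):
# $params = New-BrowserLikeRequest -Uri $wsjurl
# $wsjweb = Invoke-WebRequest @params
```

Building the parameters as a hashtable keeps the header list in one place, so individual headers can be added or removed while comparing against the raw request Postman shows.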
1 Answer

The following code snippet could help:

$wsjDate = Get-Date
# No front page exists for Sundays, so fall back to Saturday's date.
if ($wsjDate.DayOfWeek -eq [System.DayOfWeek]::Sunday) {
    $wsjDate = $wsjDate.AddDays(-1)  # Sunday -> Saturday
}
$TdyDate = "{0:yyyyMMdd}" -f $wsjDate

$wsjurl = "https://www.wsj/print-edition/$TdyDate/frontpage"
$wsjweb = Invoke-WebRequest -Uri $wsjurl -Method Options -UseBasicParsing

Explanation:

  • The seemingly complicated calculation of $TdyDate accounts for the fact that no front page exists for Sundays, so a Sunday falls back to Saturday's date.
  • -Method Options sends an HTTP OPTIONS request instead of the default GET, which circumvents the "Please enable JS and disable any ad blocker" error, so that
  • $wsjweb.Content contains the full web page code: <!DOCTYPE html><html lang="en-US"> … … … </script></body></html>
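
The Sunday rule from the first bullet can also be wrapped in a small function so it can be checked against known dates. This is a minimal sketch: the function name Get-PrintEditionDate is made up for illustration, and formatting with the invariant culture is an added assumption so the result does not depend on the machine's regional calendar.

```powershell
# Hypothetical helper (name not from the original answer).
function Get-PrintEditionDate {
    param([datetime] $Date = (Get-Date))

    # Sunday has no page of its own, so use the preceding Saturday.
    if ($Date.DayOfWeek -eq [System.DayOfWeek]::Sunday) {
        $Date = $Date.AddDays(-1)
    }

    # Invariant culture keeps the yyyyMMdd digits Gregorian on any locale.
    $Date.ToString('yyyyMMdd', [cultureinfo]::InvariantCulture)
}

Get-PrintEditionDate -Date ([datetime]::new(2025, 3, 16))  # a Sunday   -> 20250315
Get-PrintEditionDate -Date ([datetime]::new(2025, 3, 17))  # a Monday   -> 20250317
```

Calling it without arguments uses the current date, so $TdyDate = Get-PrintEditionDate would slot into the snippet above.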

Moreover, $wsjweb.Headers could shed light on the problem (see the X-XSS-Protection and X-Content-Type-Options entries):

$wsjweb.Headers # truncated

Key                       Value
---                       -----
…
X-XSS-Protection          1; mode=block
X-Content-Type-Options    nosniff
…
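
To tell whether $wsjweb.Content is the real page or one of the block pages quoted in the question, a simple text check may be enough. This is a rough sketch only: the function name Test-BlockedContent is hypothetical, and the two marker strings are taken from the messages quoted in the question, so they are an assumption about what the block pages actually contain.

```powershell
# Hypothetical helper (not from the original answer).
function Test-BlockedContent {
    param([string] $Content)

    # Marker phrases quoted in the question; the real block pages may differ.
    $markers = 'Please enable JS', 'Access Blocked'

    foreach ($marker in $markers) {
        # -like is case-insensitive by default in PowerShell.
        if ($Content -like "*$marker*") { return $true }
    }
    return $false
}

# Usage against the response from the snippet above (left commented out,
# since it depends on a live request):
# if (Test-BlockedContent -Content $wsjweb.Content) { Write-Warning 'Got a block page' }
# $wsjweb.Headers['X-XSS-Protection']
```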