https://psycnet.apa.org/record/2016-47119-002
When I access the above URL in Chrome (in a private window), I can see the first request, which corresponds to the following curl command. But when I run this curl command on the command line, it is detected by Distil Networks, which returns <div id="distilIdentificationBlock"> </div>.
This looks strange to me, as this is the first request. Unless there is a difference between the requests sent by curl and by Chrome, there is no way for Distil Networks to tell whether a request was sent by a bot or by a real browser. Does anybody know what the difference is between the curl request and the Chrome request?
curl 'https://psycnet.apa.org/record/2016-47119-002' \
-H 'Connection: keep-alive' \
-H 'Pragma: no-cache' \
-H 'Cache-Control: no-cache' \
-H 'Upgrade-Insecure-Requests: 1' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
-H 'Sec-Fetch-Site: none' \
-H 'Sec-Fetch-Mode: navigate' \
-H 'Sec-Fetch-User: ?1' \
-H 'Sec-Fetch-Dest: document' \
-H 'Accept-Language: en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7' \
--compressed
P.S. The problem is the same if I use Firefox to extract the corresponding curl command, so the difference between the Firefox request and the curl request is also relevant.
Comments:
- This question should be renamed to "How does Distil detect and block curl requests compared to Chrome requests?" – Dai
- Please make the change as you feel appropriate. – user1424739
1 Answer
There is no difference.
To find out how Distil (their web-scraping protection system) works, you need to look at the initial response HTML:
When I make the request in Chrome and look at the initial response in the Dev Tools (make sure "Preserve log" is checked), I see that the response is actually a short web page containing an embedded <script> that performs some simple "is this user-agent a web browser?" checks, such as looking for JavaScript DOM objects that won't exist outside a web browser (assuming the HTTP user-agent is capable of running scripts at all, which cURL and wget aren't, by the way).
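A minimal sketch of the kind of probe such a script might run (the checks and names here are my own illustration, not Distil's actual code): it tests for globals that exist only in a real browser, so a script-incapable client like cURL never reaches this point at all, and a bare JavaScript runtime without a DOM fails the checks.

```javascript
// Hypothetical browser-detection probe (illustrative only, not Distil's code).
function looksLikeBrowser() {
  const checks = [
    typeof window !== "undefined", // browser global object
    typeof document !== "undefined" &&
      typeof document.createElement === "function", // working DOM
    typeof navigator !== "undefined" &&
      typeof navigator.userAgent === "string", // UA string visible to script
  ];
  return checks.every(Boolean);
}

// In a real browser this reports "browser-like"; in a plain runtime
// such as Node.js (no `window`/`document`) it reports "not a browser".
console.log(looksLikeBrowser() ? "browser-like" : "not a browser");
```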
If the script considers your user-agent to be a web browser, it performs another request for the real content using a dynamically generated password (I didn't look at the details of how that works), which is why you can't re-request the real content using cURL or wget: the password is unique for each request.
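The single-use-password idea can be illustrated with a self-contained sketch (all names and the token format are invented here; the actual Distil protocol wasn't inspected): the server honors each token exactly once, so replaying a captured request from cURL gets the block page instead of the content.

```javascript
// Hypothetical single-use-token scheme (invented for illustration).
const issuedTokens = new Set();

// The challenge script would obtain something like this after passing the checks.
function issueToken() {
  const token = Math.random().toString(36).slice(2); // stand-in for a real nonce
  issuedTokens.add(token);
  return token;
}

// The server serves the real content only for a token it issued, exactly once.
function serveContent(token) {
  if (!issuedTokens.delete(token)) {
    return '<div id="distilIdentificationBlock"> </div>'; // blocked
  }
  return "real page content";
}

const token = issueToken();
console.log(serveContent(token)); // first use: real page content
console.log(serveContent(token)); // replay: the block page
```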
Here's a screenshot of the <script> element in the initial page response; notice the lack of any real content in the page's HTML.
If you disable JavaScript in your browser you won't be able to access that web-page at all.
This kind of anti-scraping system protects web pages against requests from user-agents that lack the means to evaluate JavaScript, so it will block curl, wget, HttpClient, and in-browser fetch/XMLHttpRequest (at least without further work).
You'd think this system would render a site un-indexable by search-engine spiders, but that's an old (and entrenched) belief: until the late 2000s the main search-engine spiders (Google, Bing/Windows Live Search, Yahoo, etc.) only indexed raw HTML and didn't run JavaScript. Since then, however, the spiders have started running JavaScript, and have even begun indexing sites using actual web-browser engines rather than custom-built HTML-parser engines (Google really started it, so that it could index script-heavy websites, especially single-page applications built with Angular, Vue, etc.). When I worked at Microsoft, I got to use the Bing crawler system for some product research projects, and it used a special build of Internet Explorer to "run" the web pages it visited.