I have a website containing a large DB of products and prices.
I am being constantly cURLed for prices.
I thought of preventing it with a <noscript> tag, but all I can do with that is hide the content; bots would still be able to scrape it.
Is there a way of running a JS test to see if JavaScript is disabled (to detect bots) and redirect those requests, maybe into a blacklist?
Will doing so block Google from crawling my website?
asked Jun 8, 2014 at 7:21 by Nir Tzezana
- You can deny requests that arrive without a User-Agent header (though cURL can easily set one), or whitelist the Google, Facebook, Twitter bots, etc. – Adam Azad Commented Jun 8, 2014 at 7:26
- As long as the data is public, there really is no easy automated solution. The bots can always be rewritten to bypass your checks. – John V. Commented Jun 8, 2014 at 7:26
- Why not use .htaccess to block bots by IP or location? – Vincent Decaux Commented Jun 8, 2014 at 7:29
- You may want to use some authentication, or track users with cookies. – source.rar Commented Jun 8, 2014 at 7:35
- @VincentDecaux they just change their IP, it won't last long – Nir Tzezana Commented Jun 8, 2014 at 7:38
3 Answers
Since cURL just issues a plain HTTP request, your server can't tell it apart from a browser unless you restrict access to certain URLs, or check the Referer header and filter out any request not referred locally. An example of how to build such a check can be found here:
Checking the referrer
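A minimal sketch of such a referrer filter in PHP, assuming the site is served from a hypothetical domain example.com; note the Referer header is just as easy to spoof as the User-Agent (as the last answer below demonstrates), so treat this as a speed bump rather than real protection:

<?php
// Minimal referrer filter (sketch). example.com is a placeholder --
// substitute your own domain. The Referer header can be spoofed.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$host    = parse_url($referer, PHP_URL_HOST);

if ($host !== 'example.com' && $host !== 'www.example.com') {
    http_response_code(403); // FORBIDDEN
    exit;
}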
You can block unspoofed cURL requests in PHP by checking the User-Agent header. As far as I know, none of the search engine crawlers have "curl" in their user agent string, so this shouldn't block them.
// isset() guard avoids a PHP notice when no User-Agent header is sent at all
if (isset($_SERVER['HTTP_USER_AGENT']) && stripos($_SERVER['HTTP_USER_AGENT'], 'curl') !== false) {
    http_response_code(403); // FORBIDDEN
    exit;
}
Note that changing the User-Agent string of a cURL request is trivial, so someone could easily bypass this.
You would need to create a block list and block those IPs from accessing the content; all headers, including referrer and user agent, can be set in cURL very easily, as the following simple code shows:
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $agent);       // spoof the user agent
curl_setopt($ch, CURLOPT_URL, 'http://www.yoursite.com?data=anydata');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);       // return the response as a string
curl_setopt($ch, CURLOPT_REFERER, 'http://www.yoursite.com'); // spoof the referrer
$html = curl_exec($ch);
The above will make the cURL request look like a normal connection from a browser (the user agent string here identifies as Internet Explorer 6).
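A minimal sketch of the block-list side in PHP, assuming a hypothetical blacklist.txt file with one banned IP per line; as noted in the comments, scrapers can rotate IPs, so the list needs ongoing maintenance:

<?php
// Minimal IP block list (sketch). blacklist.txt is a hypothetical file
// holding one banned IP address per line.
$blocked = file('blacklist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

if ($blocked !== false && in_array($_SERVER['REMOTE_ADDR'], $blocked, true)) {
    http_response_code(403); // FORBIDDEN
    exit;
}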