For completely non-nefarious purposes (machine learning, specifically), I'd like to download a huge dataset of CAPTCHA images. However, CAPTCHA is always implemented using some obfuscated JavaScript that makes getting at the actual images without a browser a non-trivial task, at least to me, a JavaScript novice.
So, can anyone give me some helpful pointers on how to download the image of the obscured word using a script completely outside of a browser? And please don't point me to a dataset of already collected obscured words - I need to collect the images from a specific website for this particular experiment.
Thanks!
Edit: Another way to ask this question is very simple. When you click "view source" on a website with complicated JavaScript, you see the script references, but that's all you see. However, if you click "save webpage as..." (in Firefox) and then view the source of the saved webpage, the JavaScript has been resolved, and the new HTML and the images (at least in the case of ASIRRA and reCAPTCHA) are in the source. How can I mimic this "save webpage as..." behavior using a script? This is an important web coding question in general, so please stop questioning my motives! This is knowledge I can use from now on in all web development involving scripting, and I'm sure other Stack Overflow visitors can as well!
Asked Oct 9, 2009 at 13:54 by JoeCool; edited Oct 9, 2009 at 14:58.
- How about asking the owner of the website? If it is non-nefarious... – Greg Commented Oct 9, 2009 at 13:57
- The site is actually Microsoft's research project called ASIRRA, which uses cats and dogs rather than obscured words - but it's implemented in basically the same way. They have a public dataset, but it's far too small. – JoeCool Commented Oct 9, 2009 at 14:00
- @Greg: agreed, the polite thing to do is contact the site owner before you bulk download content and suck up a lot of bandwidth. – D'Arcy Rittich Commented Oct 9, 2009 at 14:02
- Wouldn't it be better to pull the pictures directly from the source at PetFinder? That way you can bring in the classification info (cat/dog) at the same time. – Andrew Commented Oct 9, 2009 at 14:08
- In my defense, I just read an academic paper where the researchers did exactly what I am trying to do. Also, I would ping the site at a very reasonable rate, nothing extreme (especially for Microsoft). The main thing I want to gain from asking this question is the web experience of accessing those images using a script, which nobody has really helped me with. But I think I'm getting close to figuring it out myself, so I'll post what I figure out. – JoeCool Commented Oct 9, 2009 at 14:24
3 Answers
While waiting for an answer here I kept digging and eventually figured out a sort of hacked way of getting done what I wanted.
First off, the reason this is a somewhat complicated problem (at least to a JavaScript novice like me) is that the images from ASIRRA are loaded onto the webpage via JavaScript, which is a client-side technology. This is a problem when you download the webpage using something like wget or curl, because those tools don't actually run the JavaScript; they just download the source HTML. Therefore, you don't get the images.
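To make the point concrete, here is a minimal sketch using Python's standard `html.parser`. The page markup, the `challenge.js` name, and the `list_resources` helper are all made up for illustration and are not ASIRRA's actual source. It lists what a plain download would contain: script references, but no image tags, because nothing has run the script yet.

```python
from html.parser import HTMLParser

# Hypothetical static source, as wget or curl would receive it.
# This is NOT ASIRRA's real markup; it only illustrates the shape of the problem.
STATIC_HTML = """
<html>
  <head><script src="challenge.js"></script></head>
  <body><div id="challenge"></div></body>
</html>
"""


class ResourceLister(HTMLParser):
    """Collects the src attributes of <script> and <img> tags."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if src is None:
            return
        if tag == "script":
            self.scripts.append(src)
        elif tag == "img":
            self.images.append(src)


def list_resources(html):
    """Return (script_srcs, image_srcs) found in the given HTML text."""
    parser = ResourceLister()
    parser.feed(html)
    return parser.scripts, parser.images


if __name__ == "__main__":
    scripts, images = list_resources(STATIC_HTML)
    print("scripts:", scripts)
    print("images:", images)
```

Running the same helper on a page saved after the JavaScript has executed would be expected to show the image tags as well, which is the gap a browser fills.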
However, I realized that Firefox's "Save Page As..." did exactly what I needed. It ran the JavaScript, which loaded the images, and then saved everything into the well-known directory structure on my hard drive. That's exactly what I wanted to automate. So... I found a Firefox add-on called "iMacros" and wrote this macro:
VERSION BUILD=6240709 RECORDER=FX
TAB T=1
URL GOTO=http://www.asirra./examples/ExampleService.html
SAVEAS TYPE=CPL FOLDER=C:\Cat-Dog\Downloads FILE=*
Set to loop 10,000 times, it worked perfectly. In fact, since it was always saving to the same folder, duplicate images were overwritten (which is what I wanted).
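The overwrite-as-deduplication effect is not specific to iMacros. A minimal sketch in Python, assuming each downloaded image arrives as a file name plus its bytes; `save_image`, `count_files`, and the sample file names are hypothetical and only illustrate the idea:

```python
import tempfile
from pathlib import Path


def save_image(folder, name, data):
    """Write data to folder/name, replacing any existing file of that name."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a name cannot escape the folder.
    target = folder / Path(name).name
    target.write_bytes(data)
    return target


def count_files(folder):
    """Count the regular files directly inside folder."""
    return sum(1 for p in Path(folder).iterdir() if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up downloads: the same image name shows up twice.
        downloads = [("cat1.jpg", b"a"), ("dog1.jpg", b"b"), ("cat1.jpg", b"a")]
        for name, data in downloads:
            save_image(tmp, name, data)
        print(count_files(tmp))
```

This only removes duplicates when the site serves the same image under the same file name each time; if names varied, hashing the file contents would be the more robust key.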
Why not just get CAPTCHA yourself and generate images? reCAPTCHA's free too. http://www.captcha/
Update: I see you want it from a specific site, but if you get your own you can tweak it to give the same kind of images as the site you're targeting.
Get in contact with the people who run the site and ask for the dataset. If you try to download many images in any suspicious way, you'll end up on their kill list rather quickly which means that you won't get anything from them anymore.
CAPTCHAs are meant to protect people against abuse and what you do will look like abuse from their point of view.