I have to extract the data from the table on the following website:
http://www.mcxindia.com/SitePages/indexhistory.aspx
When I click on GO, a table is appended to the page dynamically. I want to export that data from the page to a CSV file (which I know how to handle), but the page source does not contain any of the data points.
I have tried looking into the JavaScript code: when I inspect the elements after the table is generated, I can see the data points, but they are not in the source. I am using mechanize in Python.
I think this is because the page is loaded dynamically. What should I do/use?
5 Answers
mechanize doesn't/can't evaluate JavaScript. The easiest way that I've seen to evaluate JavaScript is by using Selenium, which will open a browser on your computer and communicate with Python.
I answered a similar question here
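A minimal sketch of that approach (the element ID 'mBtnGo' is an assumption borrowed from the form-control name in the mechanize answer below; the real ID on the page may differ):
import time
from selenium import webdriver

driver = webdriver.Firefox()  # opens a real browser window
driver.get('http://www.mcxindia.com/SitePages/indexhistory.aspx')
driver.find_element_by_id('mBtnGo').click()  # hypothetical ID for the GO button
time.sleep(5)  # crude wait for the AJAX response; WebDriverWait is more robust
html = driver.page_source  # page HTML after JavaScript has appended the table
driver.quit()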
I agree with Matthew Wesly's comment. We can get the dynamic page using Selenium, iMacros, or similar add-ons. They capture the dynamic page's response based on our recording, and they also have JS scripting capability.
That said, for easy extraction I would go with normal content-fetch logic using the urllib2 and urllib packages.
First, get the page's 'viewstate' parameter, i.e. collect all the hidden element information from the home page, and pass that form information along just as the JS script does.
Also pass the Content-Type header value exactly. Here the response comes back as "text/plain; charset=utf-8".
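A minimal sketch of the hidden-field collection step, assuming BeautifulSoup 3 (as imported elsewhere on this page); the resulting dictionary can then be extended with the search fields and POSTed back:
import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.mcxindia.com/SitePages/indexhistory.aspx'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

# Collect every hidden input, including ASP.NET's __VIEWSTATE and friends
form_data = {}
for hidden in soup.findAll('input', {'type': 'hidden'}):
    form_data[hidden.get('name', '')] = hidden.get('value', '')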
To avoid using JavaScript-aware transports you need to:
- Install a web debugger in your browser.
- Go to that page, press F12 to open the debugger, and reload the page.
- Analyze the contents of the 'Network' tab. AJAX pages usually download data as HTML fragments or as JSON. Just look into the 'Response' tab of each request made after pressing 'GO' and you will find familiar data.
- Now you can create a simple urllib/urllib2 downloader for that URL, as shown in the sketch after this list.
- Parse that data and convert it to CSV.
http://www.mcxindia.com/SitePages/indexhistory.aspx sends a POST request with the search parameters on each 'GO' and receives the HTML fragment you need to parse and convert into CSV. So if you simulate that POST, you don't need a new browser window.
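A minimal sketch of such a downloader, assuming the POST carries the form fields mTbFromDate/mTbToDate/mBtnGo (names taken from the mechanize answer below) plus the page's hidden fields:
import urllib
import urllib2

url = 'http://www.mcxindia.com/SitePages/indexhistory.aspx'

form_data = {}  # start from the hidden fields scraped as in the previous answer
form_data['mTbFromDate'] = '08/01/2013'
form_data['mTbToDate'] = '08/08/2013'
form_data['mBtnGo'] = 'GO'  # assumption: the button's name/value pair

request = urllib2.Request(url, urllib.urlencode(form_data))
request.add_header('Content-Type', 'application/x-www-form-urlencoded')
fragment = urllib2.urlopen(request).read()  # the HTML fragment with the table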
This worked!!!
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # the site's robots.txt would otherwise block us
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]  # pose as a regular browser
url = 'http://www.mcxindia.com/SitePages/indexhistory.aspx'
br.open(url)
br.select_form(nr=0)  # the page's single ASP.NET form
br.set_all_readonly(False)  # allow setting hidden/readonly controls
br.form['mTbFromDate'] = '08/01/2013'
br.form['mTbToDate'] = '08/08/2013'
response = br.submit(name='mBtnGo').read()  # 'click' GO; the reply holds the table
print response
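To finish the job, the returned fragment can be fed to BeautifulSoup and written out; a minimal sketch, assuming the first table in the response is the data table:
import csv
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(response)
table = soup.find('table')  # assumption: the first table holds the data
with open('indexhistory.csv', 'wb') as f:
    writer = csv.writer(f)
    for row in table.findAll('tr'):
        # join the text nodes of each cell; encode for Python 2's csv module
        cells = [''.join(td.findAll(text=True)).strip().encode('utf-8')
                 for td in row.findAll(['th', 'td'])]
        if cells:
            writer.writerow(cells)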
The best thing I personally do while dealing with dynamic web pages is to use PyQt WebKit to mimic a browser: pass the URL to it, and finally get the HTML after all the JavaScript has been rendered.
Example code:
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
import bs4 as bs

class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # 'exec' is a keyword in Python 2, so PyQt4 uses exec_()

    def on_page_load(self):
        self.app.quit()  # leave the event loop once the page (and its JS) has loaded

url = ''  # your URL here
client_response = Client(url)
source = client_response.mainFrame().toHtml()  # HTML after JavaScript has run
soup = bs.BeautifulSoup(source, "lxml")
# ... BeautifulSoup stuff ...