r - Finding the web address of a Javascript link - Stack Overflow

I am using RCurl in R to try to download data from a website, but I'm having trouble finding out which URL to use. Here is the site:

http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX

In the upper right, above the displayed sheet, there's a link to download the data as a .csv file. Is there a way to find a regular HTTP address for that .csv file? RCurl can't handle the JavaScript commands.

asked Aug 6, 2010 at 18:08 by Andrew Page; edited Apr 6, 2013 at 20:14 by CraigTeegarden
4 Answers

I will give you a quick and dirty way to get the data. First, you can use Fiddler2 (http://www.fiddler2.com/fiddler2/) to inspect the POST that your browser sends. This results in the following POST:

POST http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX HTTP/1.1
Host: www.invescopowershares.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Referer: http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX
Content-Type: application/x-www-form-urlencoded
Content-Length: 70669

__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTE1OTcxNjYzNw9kFgJmD2QWBAIDD2QWBAIDD2QWCAIBDw9kFgQeC2........

So we can see that three parameters are being posted, namely __EVENTTARGET, __EVENTVALIDATION and __VIEWSTATE (the captured body also carries an empty __EVENTARGUMENT).
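To see which fields a captured body carries, it can be split and URL-decoded in base R. This is only a sketch: parse_form_body is a hypothetical helper, and the body below is a shortened stand-in for the real request, with a placeholder where the long __VIEWSTATE value would be.

```r
# Split an application/x-www-form-urlencoded body into a named character vector.
# parse_form_body is a hypothetical helper name, not part of RCurl.
parse_form_body <- function(body) {
  pairs <- strsplit(body, "&", fixed = TRUE)[[1]]
  # Split each pair at the first "=" only, since values may contain "=" themselves.
  eq <- regexpr("=", pairs, fixed = TRUE)
  keys <- substr(pairs, 1, eq - 1)
  vals <- substr(pairs, eq + 1, nchar(pairs))
  # "+" encodes a space in form bodies; URLdecode only handles %XX escapes.
  vals <- gsub("+", " ", vals, fixed = TRUE)
  vals <- vapply(vals, utils::URLdecode, character(1), USE.NAMES = FALSE)
  names(vals) <- vapply(keys, utils::URLdecode, character(1), USE.NAMES = FALSE)
  vals
}

# Shortened, illustrative body; the __VIEWSTATE value is a placeholder.
body <- paste0(
  "__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1",
  "&__EVENTARGUMENT=",
  "&__VIEWSTATE=%2FwEPDwUK"
)
fields <- parse_form_body(body)
print(names(fields))
print(fields[["__EVENTTARGET"]])
```

The decoded __EVENTTARGET is the same control name that appears in the page's JavaScript link, with %24 turned back into $.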

The required form for the postForm call would be:

postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)

Now comes the quick and dirty bit. I would just open a browser and get the relevant parameters that it receives, as follows:

library(rcom)   # Windows-only COM bridge, used here to drive Internet Explorer
library(RCurl)  # provides postForm()
ftarget <- "http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
ie <- comCreateObject("InternetExplorer.Application")
ie[["visible"]] <- TRUE  # show the browser window while debugging
ie$Navigate2(ftarget)
# Wait until the page has finished loading (ReadyState 4 means complete)
while (comGetProperty(ie, "busy") || comGetProperty(ie, "ReadyState") < 4) {
  Sys.sleep(1)
  print(comGetProperty(ie, "ReadyState"))
}
myDoc <- comGetProperty(ie, "Document")
myPW <- comGetProperty(myDoc, "parentWindow")
# Copy the hidden form values into JavaScript globals so they can be read from R
comInvoke(myPW, "execScript", "var dumVar1=theForm.__EVENTVALIDATION.value;var dumVar2=theForm.__VIEWSTATE.value;", "JavaScript")
event.val <- myPW[["dumVar1"]]
view.state <- myPW[["dumVar2"]]
event.target <- "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
ie$Quit()
# Replay the POST that the browser would send when the download link is clicked
web.data <- postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm", "__EVENTTARGET" = event.target, "__EVENTVALIDATION" = event.val, "__VIEWSTATE" = view.state)
write(web.data[1], "temp.csv")
fin.data <- read.csv("temp.csv")


> fin.data[1,]
  ticker SecurityNum                      Name CouponRate maturitydate
1    PGX   949746879 WELLS FARGO & COMPANY PFD       0.08             
     rating  Shares PercentageOfFund PositionDate
1 BBB+/Baa3 2538656       0.04442112   06/11/2012

__EVENTVALIDATION and __VIEWSTATE may always be the same, or they may be session-specific, like session cookies. You could probably get them using RCurl, but as I say this is the quick and dirty solution, and we just take the ones that Internet Explorer is given. Things to note:

1). This requires Windows with IE installed to use the rcom bits.

2). If you are running IE9 you may need to add invescopowershares.com to the Compatibility View Settings (as Microsoft seems to have blocked event.val<-myPW[["dumVar1"]] type calls).

EDIT (UPDATE)

Having looked through the website in more detail, __EVENTVALIDATION and __VIEWSTATE are written into the initial page as hidden input values. We can just parse these in a quick and dirty fashion, as follows, without resorting to calling a browser.

library(RCurl)  # provides getURL() and postForm()
ftarget <- "http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
dum <- getURL(ftarget)  # raw HTML of the holdings page
event.target <- "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
# Take the text between 'value="' and the closing of each hidden input
event.val <- unlist(strsplit(dum, "__EVENTVALIDATION\" value=\""))[2]
event.val <- unlist(strsplit(event.val, "\" />\r\n\r\n<script"))[1]
view.state <- unlist(strsplit(dum, "id=\"__VIEWSTATE\" value=\""))[2]
view.state <- unlist(strsplit(view.state, "\" />\r\n\r\n\r\n<script"))[1]
web.data <- postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm", "__EVENTTARGET" = event.target, "__EVENTVALIDATION" = event.val, "__VIEWSTATE" = view.state)
write(web.data[1], "temp.csv")
fin.data <- read.csv("temp.csv")

The above should work cross platform.
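The strsplit calls depend on the exact line breaks that follow each hidden input, so they may break if the page's whitespace changes. A slightly less brittle sketch is to match the value attribute directly. hidden_value is a hypothetical helper, the sample HTML and its values are made up for illustration, and the pattern assumes that name comes before value inside the tag.

```r
# Pull the value attribute of a named hidden <input> out of an HTML string.
# Assumes the attribute order name ... value within the tag (an assumption
# about the markup), and that `field` contains no regex metacharacters.
hidden_value <- function(html, field) {
  pattern <- paste0('name="', field, '"[^>]*value="([^"]*)"')
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up sample markup; real values are much longer.
html <- paste0(
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK" />\r\n',
  '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgK" />'
)
view.state <- hidden_value(html, "__VIEWSTATE")
event.val <- hidden_value(html, "__EVENTVALIDATION")
print(view.state)
print(event.val)
```

In the real workflow, html would be the string returned by getURL, and a missing field shows up as NA rather than as a silently wrong substring.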

Clicking on the Download link executes this piece of JavaScript:

__doPostBack('ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1','')

That __doPostBack function appears to simply fill in a couple of hidden form fields on that page then submit a POST request.

A quick search shows that RCurl is capable of submitting a POST request. So what you would need to do is look in the source of that page, find the form with name "aspnetForm", take all the fields from that form, and create your own POST request that submits the fields to the action URL (http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX).

Can't guarantee this will work, though. There appears to be a hidden form field named __VIEWSTATE that appears to encode some information, and I don't know how this factors in.
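As a rough sketch of "take all the fields from that form": the submitted data consists of the form's input controls, not the attributes of the form tag itself (name, method, action, id). The helper below, collect_hidden_fields, is hypothetical, and the sample markup is invented; it gathers every hidden input into a named list and then sets __EVENTTARGET to the control named in the download link.

```r
# Collect name/value pairs from all <input type="hidden"> tags in an HTML string.
# collect_hidden_fields is a hypothetical helper; regex parsing is a shortcut
# that assumes simple, double-quoted attributes.
collect_hidden_fields <- function(html) {
  tags <- regmatches(html, gregexpr('<input[^>]*type="hidden"[^>]*>', html))[[1]]
  attr_of <- function(tag, attr_name) {
    pattern <- paste0('[[:space:]]', attr_name, '="([^"]*)"')
    m <- regmatches(tag, regexec(pattern, tag))[[1]]
    if (length(m) < 2) NA_character_ else m[2]
  }
  keys <- vapply(tags, attr_of, character(1), attr_name = "name", USE.NAMES = FALSE)
  vals <- vapply(tags, attr_of, character(1), attr_name = "value", USE.NAMES = FALSE)
  vals[is.na(vals)] <- ""   # an input without a value submits an empty string
  keep <- !is.na(keys)      # inputs without a name are not submitted
  as.list(stats::setNames(vals[keep], keys[keep]))
}

# Invented sample markup standing in for the real page source.
html <- paste0(
  '<form name="aspnetForm" method="post" action="holdings.aspx?ticker=PGX" id="aspnetForm">',
  '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />',
  '<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />',
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK" />',
  '<input type="text" name="search" value="ignored" />',
  '</form>'
)
fields <- collect_hidden_fields(html)
fields[["__EVENTTARGET"]] <- "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
print(names(fields))
# The list could then be passed to RCurl, for example:
#   postForm(action.url, .params = fields, style = "POST")
# where action.url is the form's action URL.
```

Whether the server accepts the request still depends on values such as __VIEWSTATE being current for the session, which this sketch does not check.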

This is definitely the way to get the .csv file in RCurl, but I can't figure out which form fields to use in getForm to make it work. Should I use the fields from the __doPostBack command that's attached to the "Download" link on the page, or the fields from aspnetForm in the page source? Just for reference, the aspnetForm element we're interested in is:

<form name="aspnetForm" method="post" action="holdings.aspx?ticker=PGX" id="aspnetForm" style="margin:0px">

... and the postForm request I just tried that didn't work was

postForm("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX", "form name" = "aspnetForm", "method" = "post", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm", "style" = "margin:0px")

Thanks for all of the help!

There is now a function in the qmao package that will do this for you. (It is based on code from a now deleted answer to this question.)

You can use the dlPowerShares function like this:

require("qmao")
Symbol <- "PGX"
dat <- qmao:::dlPowerShares(event.target = "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1", 
                            action = paste0("holdings.aspx?ticker=", Symbol))
> head(dat)
  ticker SecurityNum                           Name CouponRate maturitydate    rating  Shares PercentageOfFund PositionDate
1    PGX   173080201         CITIGROUP CAPITAL XIII    0.07875   10/30/2040    BB/Ba2 2998647       0.04274939   08/31/2012
2    PGX   949746879      WELLS FARGO & COMPANY PFD    0.08000              BBB+/Baa3 2549992       0.03935854   08/31/2012
3    PGX   06739H362      BARCLAYS BK PLC              0.08125                A-/Baa3 2757635       0.03644835   08/31/2012
4    PGX   46625H621      JPMORGAN CHASE               0.08625              BBB+/Baa1 2416021       0.03310707   08/31/2012
5    PGX   060505765   BANK OF AMERICA CORP PFD 8.2    0.08200                 BB+/B1 2345508       0.03128002   08/31/2012
6    PGX   060505559 BANC OF AMERICA CORP PFD 8.625    0.08625                 BB+/B1 2259484       0.03001599   08/31/2012

In the code above, event.target is the first string inside of the javascript:__doPostBack() function that you will get when you right click the “Download” link and “Copy Link Address”.

action is the product-specific part of the action url.
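If you would rather not copy the string by hand, the two arguments can be pulled out of the copied link address with a regular expression. This is a small sketch; parse_postback is a hypothetical helper and not part of qmao or RCurl.

```r
# Extract the two arguments of a javascript:__doPostBack('target','argument') link.
# parse_postback is a hypothetical helper; it assumes single-quoted arguments
# that contain no embedded quotes.
parse_postback <- function(href) {
  pattern <- "__doPostBack\\('([^']*)','([^']*)'\\)"
  m <- regmatches(href, regexec(pattern, href))[[1]]
  if (length(m) < 3) return(NULL)
  list("__EVENTTARGET" = m[2], "__EVENTARGUMENT" = m[3])
}

href <- "javascript:__doPostBack('ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1','')"
pb <- parse_postback(href)
print(pb[["__EVENTTARGET"]])
```

The first element is the value to pass as event.target; the second is the (here empty) event argument.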

Internally, the code follows Jeff's suggestion in his answer and searches the source of the page for the values of the fields for the "aspnetForm". It then uses those values in a call to postForm (from the RCurl package.)

In the qmao package, dlPowerShares is used by getHoldings.powershares. Also, getHoldings will call getHoldings.powershares if one of the Symbols passed to it is the symbol of a PowerShares ETF.


P.S. If qmao:::dlPowerShares is called with its defaults, it will download the PowerShares product list from http://www.invescopowershares.com/products/
