web scraping - WebScraping in nodes and elements in R

I am trying to scrape the name and location of the aerodromes listed at https://www.casa.gov.au/search-centre/aerodromes but I get an empty data frame. I've tried both a CSS selector and XPath, and neither picks up anything. Any help is appreciated!

library(rvest)

Aeroframe<-data.frame()
url <- "https://www.casa.gov.au/search-centre/aerodromes"
webpage <- read_html(url)
title<-webpage%>%html_nodes(".field_item")%>%html_text()
location<-webpage%>%html_nodes(".field_label")%>%html_text()
AeroFrame<-data.frame(title,location)

asked Mar 18 at 7:41 by evani; edited Mar 18 at 8:07 by margusl

Welcome to SO! You might be dealing with a typo here: your selectors seem to be missing a second underscore, e.g. .field_item should read .field__item – margusl Commented Mar 18 at 7:51
Using .field__item and .field__label solves your issue, as per margusl's great comment; it does not, however, give two equal-sized vectors. So I would do Aeroframe <- webpage %>% html_nodes(".card-fields") %>% html_text() %>% sub("Aerodrome operator:", "", .) %>% gsub("Location:", ", ",.); res <- do.call(rbind.data.frame, strsplit(Aeroframe, ", ")); res <- data.frame(Aerodrome_operator = trimws(res[,1]),Location = trimws(res[,2])) and then crawl over https://www.casa.gov.au/search-centre/aerodromes?page=x, where x is the page index (x = 1 is page 2) – Tim G Commented Mar 18 at 10:26

1 Answer

It would help to know exactly what you received from read_html(), but you may be facing a couple of issues here.

By inspecting the elements (and the page source) we can see that the actual class names are spelled a bit differently, with a double underscore:

<div class="field field--label-inline">
  <div class="field__label">Aerodrome operator:</div>
  <div class="field__item"> Abra Mining Pty Limited </div>
</div>
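
To see the effect of the missing underscore without touching the live site, here is a minimal sketch that parses an inline copy of the markup above with rvest's minimal_html(). The `page` object is just an illustrative name, and the snippet assumes R >= 4.1 for the native pipe:

```r
library(rvest)

# Inline copy of the markup shown above, so no network access is needed
page <- minimal_html('
  <div class="field field--label-inline">
    <div class="field__label">Aerodrome operator:</div>
    <div class="field__item"> Abra Mining Pty Limited </div>
  </div>')

# Single underscore: no element carries this class, so nothing is matched
length(html_elements(page, ".field_item"))
#> [1] 0

# Double underscore: matches the element from the snippet
page |> html_elements(".field__item") |> html_text(trim = TRUE)
#> [1] "Abra Mining Pty Limited"
```

An empty match is silent in rvest, which is why a selector typo like this ends up as an empty data frame rather than an error.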

Though there's a good chance that you never actually received any relevant content from read_html(). At least with my setup and from my location I first need to fiddle with request headers a bit to get anything back, something like:

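library(httr2)

# same URL as in the question
url <- "https://www.casa.gov.au/search-centre/aerodromes"

request(url) |>
  req_user_agent("Mozilla/5.0") |>           # present a browser-like user agent
  req_headers(Connection = "Keep-Alive") |>
  req_perform() |>
  resp_body_html()                           # parse the response body as HTML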

And then I'm treated to a small JavaScript challenge that is there to block some automated tools (like rvest).

If you have Chrome or any other Chromium-based browser, like Edge, and {chromote} installed, you can try replacing read_html() with read_html_live(). And perhaps adjust your strategy a bit:

library(rvest)

url_ <- "https://www.casa.gov.au/search-centre/aerodromes"
webpage <- read_html_live(url_)

# collect containers
cards <- webpage |> html_elements(".card-fields")

# extract 1st & 2nd set of labels & fields from every container:
tibble::tibble(
  f1_label = cards |> html_element(xpath = "./div[1]/div[@class='field__label']") |> html_text(trim = TRUE),
  f1_item  = cards |> html_element(xpath = "./div[1]/div[@class='field__item']" ) |> html_text(trim = TRUE),
  f2_label = cards |> html_element(xpath = "./div[2]/div[@class='field__label']") |> html_text(trim = TRUE),
  f2_item  = cards |> html_element(xpath = "./div[2]/div[@class='field__item']" ) |> html_text(trim = TRUE)
)
#> # A tibble: 15 × 4
#>    f1_label            f1_item                                  f2_label f2_item
#>    <chr>               <chr>                                    <chr>    <chr>  
#>  1 Aerodrome operator: Abra Mining Pty Limited                  Locatio… WA     
#>  2 Aerodrome operator: Adelaide Airport Limited                 Locatio… SA     
#>  3 Aerodrome operator: City of Albany                           Locatio… WA     
#>  4 Aerodrome operator: Albury City Council                      Locatio… NSW    
#>  5 Aerodrome operator: Alice Springs Airport Pty Ltd            Locatio… NT     
#>  6 Aerodrome operator: Barcaldine Regional Council              Locatio… Qld    
#>  7 Aerodrome operator: Ararat Rural City Council                Locatio… Vic    
#>  8 Aerodrome operator: Archerfield Airport Corporation Pty Ltd  Locatio… Qld    
#>  9 Aerodrome operator: Argyle Diamonds Limited                  Locatio… WA     
#> 10 Aerodrome operator: Armidale Regional Council                Locatio… NSW    
#> 11 Aerodrome operator: Aurukun Shire Council                    Locatio… Qld    
#> 12 Aerodrome operator: Avalon Airport Australia Pty Ltd         Locatio… Vic    
#> 13 Aerodrome operator: Voyages Indigenous Tourism Australia Pt… Locatio… NT     
#> 14 Aerodrome operator: East Gippsland Shire Council             Locatio… Vic    
#> 15 Aerodrome operator: Wirrimanu Aboriginal Corporation         Locatio… WA

Created on 2025-03-18 with reprex v2.1.1
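
If you want more than the first page, the container approach above can be wrapped in a small helper and applied per page, as Tim G's comment suggests. This is only a rough sketch: `parse_cards()`, `demo`, `base_url` and `pages` are hypothetical names, the inline HTML is a hand-written stand-in whose Location block is inferred from the XPath and output above, and the assumption that `?page=0` is the first page follows from the comment's "1 = page 2". The live part is commented out because it needs Chrome, {chromote} and network access:

```r
library(rvest)

# Hypothetical helper: turn one parsed page into a two-column data frame
parse_cards <- function(page) {
  cards <- html_elements(page, ".card-fields")
  data.frame(
    operator = cards |>
      html_element(xpath = "./div[1]/div[@class='field__item']") |>
      html_text(trim = TRUE),
    location = cards |>
      html_element(xpath = "./div[2]/div[@class='field__item']") |>
      html_text(trim = TRUE)
  )
}

# Offline check on a hand-written stand-in for a single card
demo <- minimal_html('
  <div class="card-fields">
    <div class="field field--label-inline">
      <div class="field__label">Aerodrome operator:</div>
      <div class="field__item"> Abra Mining Pty Limited </div>
    </div>
    <div class="field field--label-inline">
      <div class="field__label">Location:</div>
      <div class="field__item"> WA </div>
    </div>
  </div>')
parse_cards(demo)

# Live crawl (not run): page indices are an assumption, adjust as needed
# base_url <- "https://www.casa.gov.au/search-centre/aerodromes"
# pages <- 0:2
# res <- lapply(pages, function(p) {
#   Sys.sleep(1)  # be polite between requests
#   parse_cards(read_html_live(paste0(base_url, "?page=", p)))
# })
# do.call(rbind, res)
```

Because html_element() returns exactly one result per card (NA when a field is missing), the two columns always have the same length, which avoids the unequal-vector problem mentioned in the comments.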
