I am trying to scrape the name and location of each aerodrome listed at https://www.casa.gov.au/search-centre/aerodromes, but I get an empty data frame. I've tried both a CSS selector and XPath, and neither picks up anything. Any help is appreciated!
library(rvest)
Aeroframe<-data.frame()
url <- "https://www.casa.gov.au/search-centre/aerodromes"
webpage <- read_html(url)
title<-webpage%>%html_nodes(".field_item")%>%html_text()
location<-webpage%>%html_nodes(".field_label")%>%html_text()
AeroFrame<-data.frame(title,location)
asked Mar 18 at 7:41 by evani; edited Mar 18 at 8:07 by margusl
1 Answer
It would help to know what exactly you received from read_html(), but you may be facing a couple of issues here.
By inspecting the elements (and the page source) we can see that the actual class names are spelled a bit differently:
<div class="field field--label-inline">
<div class="field__label">Aerodrome operator:</div>
<div class="field__item"> Abra Mining Pty Limited </div>
</div>
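As a quick offline check of the spelling issue, here is a minimal sketch that parses the fragment above with rvest's minimal_html() and compares the two selectors. It assumes nothing about the live site beyond that fragment.

```r
library(rvest)

# Fragment copied from the page source shown above
page <- minimal_html('
  <div class="field field--label-inline">
    <div class="field__label">Aerodrome operator:</div>
    <div class="field__item"> Abra Mining Pty Limited </div>
  </div>')

# Single underscore, as in the question: matches nothing
wrong <- page |> html_elements(".field_item") |> html_text(trim = TRUE)

# Double underscore, as in the actual markup: matches the operator name
right <- page |> html_elements(".field__item") |> html_text(trim = TRUE)

length(wrong)
right
```

A selector that matches nothing returns an empty vector without raising an error, which is why the original code silently produces an empty result.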
Though there's a good chance that you never actually received any relevant content from read_html(). At least with my setup and from my location, I first need to fiddle with the request headers a bit to get anything back, something like:
library(httr2)
request(url) |>
req_user_agent("Mozilla/5.0") |>
req_headers(Connection = "Keep-Alive") |>
req_perform() |>
resp_body_html()
And then I'm greeted with a small JavaScript challenge that is there to block some automated tools (like rvest).
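To tell a challenge or block page apart from real content, it can help to summarise whatever was parsed. The sketch below is only an illustration: describe_page() is a hypothetical helper, and the "challenge" markup is a made-up stand-in, not what the site actually returns.

```r
library(rvest)

# Hypothetical helper: summarise a parsed page so that a block/challenge page
# can be told apart from a page that contains the expected fields
describe_page <- function(page, selector = ".field__item") {
  list(
    title     = page |> html_element("title") |> html_text(trim = TRUE),
    n_matches = length(html_elements(page, selector)),
    n_scripts = length(html_elements(page, "script"))
  )
}

# Made-up stand-in for a challenge page: a script and no aerodrome fields
challenge <- minimal_html(
  "<script>/* challenge */</script><p>Checking your browser</p>",
  title = "Just a moment"
)

info <- describe_page(challenge)
str(info)
```

If n_matches is 0 for the page you actually received, changing selectors will not help; the content has to be rendered in a browser first.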
If you have Chrome or any other Chromium-based browser, like Edge, and {chromote} installed, you can try replacing read_html() with read_html_live(). And perhaps adjust your strategy a bit:
library(rvest)
url_ <- "https://www.casa.gov.au/search-centre/aerodromes"
webpage <- read_html_live(url_)
# collect containers
cards <- webpage |> html_elements(".card-fields")
# extract 1st & 2nd set of labels & fields from every container:
tibble::tibble(
  f1_label = cards |> html_element(xpath = "./div[1]/div[@class='field__label']") |> html_text(trim = TRUE),
  f1_item  = cards |> html_element(xpath = "./div[1]/div[@class='field__item']" ) |> html_text(trim = TRUE),
  f2_label = cards |> html_element(xpath = "./div[2]/div[@class='field__label']") |> html_text(trim = TRUE),
  f2_item  = cards |> html_element(xpath = "./div[2]/div[@class='field__item']" ) |> html_text(trim = TRUE)
)
#> # A tibble: 15 × 4
#> f1_label f1_item f2_label f2_item
#> <chr> <chr> <chr> <chr>
#> 1 Aerodrome operator: Abra Mining Pty Limited Locatio… WA
#> 2 Aerodrome operator: Adelaide Airport Limited Locatio… SA
#> 3 Aerodrome operator: City of Albany Locatio… WA
#> 4 Aerodrome operator: Albury City Council Locatio… NSW
#> 5 Aerodrome operator: Alice Springs Airport Pty Ltd Locatio… NT
#> 6 Aerodrome operator: Barcaldine Regional Council Locatio… Qld
#> 7 Aerodrome operator: Ararat Rural City Council Locatio… Vic
#> 8 Aerodrome operator: Archerfield Airport Corporation Pty Ltd Locatio… Qld
#> 9 Aerodrome operator: Argyle Diamonds Limited Locatio… WA
#> 10 Aerodrome operator: Armidale Regional Council Locatio… NSW
#> 11 Aerodrome operator: Aurukun Shire Council Locatio… Qld
#> 12 Aerodrome operator: Avalon Airport Australia Pty Ltd Locatio… Vic
#> 13 Aerodrome operator: Voyages Indigenous Tourism Australia Pt… Locatio… NT
#> 14 Aerodrome operator: East Gippsland Shire Council Locatio… Vic
#> 15 Aerodrome operator: Wirrimanu Aboriginal Corporation Locatio… WA
Created on 2025-03-18 with reprex v2.1.1
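The reason for extracting per container can be seen offline. In this minimal sketch the markup is invented (the operator names and the missing location in the second card are assumptions for illustration). html_element() returns exactly one result per card, with NA where a field is absent, so the columns always line up.

```r
library(rvest)

# Invented markup: two cards, the second one has no location field
page <- minimal_html('
  <div class="card-fields">
    <div class="field"><div class="field__label">Aerodrome operator:</div><div class="field__item">Operator A</div></div>
    <div class="field"><div class="field__label">Location:</div><div class="field__item">WA</div></div>
  </div>
  <div class="card-fields">
    <div class="field"><div class="field__label">Aerodrome operator:</div><div class="field__item">Operator B</div></div>
  </div>')

cards <- html_elements(page, ".card-fields")

# One value per card; a missing field becomes NA instead of shortening the vector
operator <- cards |> html_element(xpath = "./div[1]/div[@class='field__item']") |> html_text(trim = TRUE)
location <- cards |> html_element(xpath = "./div[2]/div[@class='field__item']") |> html_text(trim = TRUE)

data.frame(operator, location)
```

By contrast, calling html_elements(".field__item") on the whole page would return three values here, with nothing to say which card each one belongs to.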
.field_item should read .field__item – margusl Commented Mar 18 at 7:51
.field__item & .field__label solves your issue as per margusl's great comment; it does not, however, give two equal-sized vectors. So I would do
Aeroframe <- webpage %>% html_nodes(".card-fields") %>% html_text() %>% sub("Aerodrome operator:", "", .) %>% gsub("Location:", ", ", .)
res <- do.call(rbind.data.frame, strsplit(Aeroframe, ", "))
res <- data.frame(Aerodrome_operator = trimws(res[,1]), Location = trimws(res[,2]))
and crawl over https://www.casa.gov.au/search-centre/aerodromes?page=x (x = page, where 1 = page 2) – Tim G Commented Mar 18 at 10:26
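The page URLs for such a crawl could be built as in the sketch below. The page count n_pages is a hypothetical placeholder, and treating page=0 as the first page is an assumption inferred from "1 = page 2"; check both against the site's pager.

```r
base_url <- "https://www.casa.gov.au/search-centre/aerodromes"

# Hypothetical number of result pages; read the real value from the site's pager
n_pages <- 3

# Zero-based page index, assuming page=0 is the first page (since 1 = page 2)
page_urls <- paste0(base_url, "?page=", seq_len(n_pages) - 1)

page_urls
```

Each URL can then be passed to read_html_live() in turn and the per-page results combined with rbind() or dplyr::bind_rows().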