I am trying to automate data extraction from a website with Austrian employment figures using R: https://www.dnet.at/amis/Datenbank/DB_Be.aspx
For example, I would like to specify:
- In the left selection box: Erwerbstätige -> Unselbständig Beschäftigte
- In the second box I do not check any of the options.
- In the third column (output format) I choose Ausgabeformat = Zeitreihe (i.e. time series) and specify a start date.
By clicking on Ausführen (i.e. Execute) the data is generated and I can export it as JSON or xlsx.
The data is also shown in a panel with id = main_UpdatePanel1
I am not sure how to generate this data using R, as I have to physically select all the options.
I found out the options on the left column are specified in this part of the html:
<div id="tree">
<ul id="treeData" style="display: none;" class="ui-fancytree-source fancytree-helper-hidden">
<li id="id_1" class="folder">Erwerbstätige
<ul>
<li id="id1.1" data-content="hvs_Bestand_UB">Unselbständig Beschäftigte</li>
<li id="id1.2" data-content="hvs_Bestand_FD">Freie Dienstverträge</li>
<li title="Geringfügig Beschäftigte (siehe Hinweis in Information)" id="id1.3" data-content="hvs_Bestand_GB">Geringfügig Beschäftigte</li>
<li title="Geringfügig Freie Dienstverträge (siehe Hinweis in Information)" id="id1.4" data-content="hvs_Bestand_GD">Geringfügig Freie Dienstverträge</li>
<li id="id1.5" data-content="sbe_Bestand_SB">Selbständig Beschäftigte</li>
</ul>
</li>
<li id="id_2" class="folder">Arbeitskräftepotential
<ul><li id="id2.1" data-content="akpalq_Bestand_PO">Bestand</li></ul>
</li>
<li id="id_3" class="folder">(Register-)Arbeitslosenquoten
<ul><li id="id3.1" data-content="akpalq_Bestand_QU">Bestand</li></ul>
</li>
<li id="id_4" class="folder">Quoten
<ul>
<li id="id4.1" data-content="quoten_Bestand_beQU">Beschäftigungsquote</li>
<li id="id4.2" data-content="quoten_Bestand_erQU">Erwerbsquote</li>
</ul>
</li>
</ul>
Using this, I can see that the first option, Unselbständig Beschäftigte, corresponds to id = id1.1
Similarly, the output format is controlled here:
<select id="lstAusgabe" class="form-control form-control-sm">
<option value="TA">Tabelle</option>
<option value="ZR">Zeitreihe</option>
</select>
So I would need value = "ZR".
But I am absolutely clueless about what to do with this information.
Asked Nov 15, 2024 at 19:10 by Cettt
Comments:
- In general, the preferred way to do this would be via an API, if the organization provides one. To access it you would use the httr package. If no API exists, your second option is to scrape the website, which is closer to what you are doing manually. For that you would use the RSelenium package. Both packages have extensive documentation, so hopefully this helps you get started. – Adam, Nov 15, 2024 at 20:18
- @Adam Thank you. I will look into that. – Cettt, Nov 15, 2024 at 20:23
1 Answer
You could probably automate it with rvest::read_html_live() and the resulting LiveHTML object, which lets you interact with a live page through chromote.
But let's try this with {selenider} instead, "for richer interaction" (quote from ?rvest::LiveHTML).
As a first step you should probably just poke at that page a bit in your browser's dev tools: how it's all glued together, what triggers additional requests and what gets requested, which JS libraries are used for controls & widgets, whether any of the existing JavaScript can be used instead of generating clicks and keypresses, whether any constraints are set in the frontend (max year span seems to be 5, so perhaps try to respect that), etc. Set breakpoints, check the documentation of the libraries used, dig into call stacks in the network tab, and search the JS code for elements that stick out.
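If a longer period is needed, one way to respect the year-span limit is to split the range into several smaller requests. The sketch below is only an illustration: split_year_range() is a hypothetical helper, and max_span = 5 assumes the limit means at most five calendar years per request, which is an observation from the frontend and not a documented rule.

```r
# Split an inclusive year range into consecutive chunks of at most `max_span` years.
# `max_span = 5` mirrors the limit observed in the frontend (an assumption).
split_year_range <- function(from, to, max_span = 5) {
  stopifnot(from <= to, max_span >= 1)
  starts <- seq(from, to, by = max_span)
  ends <- pmin(starts + max_span - 1, to)
  data.frame(from = starts, to = ends)
}

# One row per request: 2012-2016, 2017-2021, 2022-2024
chunks <- split_year_range(2012, 2024)
chunks
```

Each row could then be fed to a separate request, with the results bound together afterwards.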
Apparently the frontend JavaScript (mostly here & here) is pretty well structured, not minified and super-verbose with lots of comments. As we can evaluate JavaScript with selenider / chromote, many objects & functions are already exposed for us to use, so there's really no need to invent everything from scratch. The left pane, for example, is a Fancytree widget, and we can use its API to select items, which in turn will trigger the required events.
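To connect this with the markup quoted in the question: the id attribute of each leaf node is what Fancytree treats as the node key here. As a rough sketch, the mapping from data-content codes to keys can be built with rvest from a static copy of that markup. The shortened snippet and the key_for() helper are illustrative assumptions, not part of the site or of any package.

```r
library(rvest)

# Shortened static copy of the tree markup quoted in the question
tree_html <- minimal_html('
<ul id="treeData">
  <li id="id_1" class="folder">Erwerbstätige
    <ul>
      <li id="id1.1" data-content="hvs_Bestand_UB">Unselbständig Beschäftigte</li>
      <li id="id1.2" data-content="hvs_Bestand_FD">Freie Dienstverträge</li>
    </ul>
  </li>
  <li id="id_4" class="folder">Quoten
    <ul>
      <li id="id4.1" data-content="quoten_Bestand_beQU">Beschäftigungsquote</li>
    </ul>
  </li>
</ul>')

# Only leaf nodes carry a data-content attribute
leaves <- html_elements(tree_html, "li[data-content]")
tree_keys <- data.frame(
  key     = html_attr(leaves, "id"),
  content = html_attr(leaves, "data-content"),
  label   = html_text2(leaves)
)

# Hypothetical helper: look up the node key for a data-content code
key_for <- function(content_code) {
  tree_keys$key[tree_keys$content == content_code]
}

key_for("hvs_Bestand_UB")
```

The returned key is the kind of value that is passed to activateKey() in the session code of this answer.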
library(selenider)
library(rvest)
library(dplyr)
library(tidyr)
selenider_session(
  "chromote",
  timeout = 10
)
#> A selenider session object
#> • Open for 2ms
#> • Session: "chromote"
#> • Browser: "Chrome"
#> • Port: NA
#> • Timeout: 10s
open_url("https://www.dnet.at/amis/Datenbank/DB_Be.aspx")
# we can view current (chromote) session in a browser
get_session()$driver$view()
#> [1] 0
# get fancytree keys (id values):
ss("#tree li[data-content]")
#> { selenider_elements (9) }
#> [1] <li id="id1.1" data-content="hvs_Bestand_UB">Unselbständig Beschäftigte</li>
#> [2] <li id="id1.2" data-content="hvs_Bestand_FD">Freie Dienstverträge</li>
#> [3] <li title="Geringfügig Beschäftigte (siehe Hinweis in Information)" id="id1.3 ...
#> [4] <li title="Geringfügig Freie Dienstverträge (siehe Hinweis in Information)" i ...
#> [5] <li id="id1.5" data-content="sbe_Bestand_SB">Selbständig Beschäftigte</li>
#> [6] <li id="id2.1" data-content="akpalq_Bestand_PO">Bestand</li>
#> [7] <li id="id3.1" data-content="akpalq_Bestand_QU">Bestand</li>
#> [8] <li id="id4.1" data-content="quoten_Bestand_beQU">Beschäftigungsquote</li>
#> [9] <li id="id4.2" data-content="quoten_Bestand_erQU">Erwerbsquote</li>
# activate `id1.3`, "Geringfügig Beschäftigte (siehe Hinweis in Information)";
# for "Unselbständig Beschäftigte" from the question the key would be `id1.1`
execute_js_expr("$.ui.fancytree.getTree('#tree').activateKey(arguments[0]);", "id1.3")
# switch Ausgabeformat to Zeitreihe
execute_js_expr("lstAusgabe.value = arguments[0]; lstAusgabeOnChange();", "ZR")
# set years
execute_js_expr(
  "lstJahrBis.value = arguments[1];
   lstJahrBisOnChange();
   lstJahrVon.value = arguments[0];
   lstJahrVonOnChange();",
  2023, 2024)
# we need some kind of a marker to know when the request is completed;
# for this let's remove some content and later wait for it to reappear
execute_js_expr("document.querySelector('#divContentTemplate').innerHTML = ''")
s("[name = 'ctl00$main$btnAspBtn']") |>
  elem_click()
# a successful request recreates #divContentTemplate content,
# wait for it for max 30s
s("#divContentTemplate > div") |>
  elem_expect(is_present, timeout = 30)
# request title:
s("#headerAktAuswahl") |>
  elem_text()
#> [1] "Erwerbstätige: Geringfügig Beschäftigte - Zeitreihe: Monate 2024-2024"
# parse table:
s("table#main_gAktuell") |>
  # switch to rvest
  read_html() |>
  html_table() |>
  # read_html() creates a new html doc, so html_table() returns a list
  # with a single tibble
  first() |>
  pivot_longer(everything())
#> # A tibble: 12 × 2
#> name value
#> <chr> <dbl>
#> 1 2024_01 339.
#> 2 2024_02 340.
#> 3 2024_03 341.
#> 4 2024_04 336.
#> 5 2024_05 339.
#> 6 2024_06 341.
#> 7 2024_07 327.
#> 8 2024_08 319.
#> 9 2024_09 321.
#> 10 2024_10 327.
#> 11 2024_11 0
#> 12 2024_12 0
Created on 2024-11-17 with reprex v2.1.1
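As a possible follow-up, the long table can be turned into a proper time series with a Date column. This is a minimal sketch on stand-in data: the values are the rounded figures from the printed output above, and treating 0 as "not yet published" is an assumption based on the run date, not something the site documents.

```r
library(dplyr)
library(tidyr)

# Stand-in for the result of pivot_longer(everything()) above
# (rounded values copied from the printed output, for illustration only)
long <- tibble(
  name  = c("2024_09", "2024_10", "2024_11", "2024_12"),
  value = c(321, 327, 0, 0)
)

tidy_series <- long |>
  # "2024_09" -> year = 2024, month = 9 (both converted to integer)
  separate(name, into = c("year", "month"), sep = "_", convert = TRUE) |>
  mutate(date = as.Date(sprintf("%d-%02d-01", year, month))) |>
  # assumption: zeros mark months without published data yet
  filter(value != 0) |>
  select(date, value)

tidy_series
```

If a zero can be a genuine value for some series, the filter() step should be dropped or replaced with a check against the current date.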