I'm learning how to use R to scrape tables of baseball stats from different places on the web. For example, I adapted this post to scrape a player's minor league fielding data from the player register page. Here's that code for Junior Caminero:
library(tidyverse)
library(rvest) # read_html(), html_nodes(), html_table(); not attached by library(tidyverse)

player_html <- "https://www.baseball-reference/register/player.fcgi?id=camine000jun#standard_fielding"

fielding_data <-
  player_html |>
  read_html() |>
  as.character() |>                   # Extract the HTML content as a string
  str_replace_all("<!--|-->", "") |>  # Strip the HTML comment markers; this is necessary
  read_html() |>                      # Parse the cleaned HTML
  html_nodes(xpath = '//*[@id="div_standard_fielding"]') |>
  html_table() |>                     # Extract and parse the tables,
  pluck(1)                            # then pluck the fielding data table
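To illustrate why the comment-stripping step matters, here is a minimal offline sketch. The HTML string and its values are made up for illustration (they are not taken from the real page); the only assumption is that the table sits inside an HTML comment, which is what the `str_replace_all()` call above works around.

```r
library(rvest)
library(stringr)

# Toy page (hypothetical markup): the table is wrapped in an HTML comment
raw_html <- paste0(
  '<html><body><div id="div_standard_fielding"><!--',
  '<table><tr><th>Year</th><th>G</th></tr>',
  '<tr><td>2021</td><td>10</td></tr>',
  '<tr><td>2022</td><td>33</td></tr></table>',
  '--></div></body></html>'
)

# Parsed as-is, the commented-out table is invisible to the parser
tables_before <- raw_html |>
  read_html() |>
  html_elements("#div_standard_fielding table") |>
  length()

# After deleting the comment markers, the table becomes ordinary markup
toy_table <- raw_html |>
  str_replace_all("<!--|-->", "") |>
  read_html() |>
  html_element("#div_standard_fielding table") |>
  html_table()
```

Here `tables_before` counts the tables found without cleaning, and `toy_table` holds the table recovered once the markers are removed.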
I would like to do a similar thing, but with a stats table on the MLB site. For example, here is a link to Caminero's MLB page:
https://www.mlb/player/junior-caminero-691406
Specifically, I would like to pull the table of his minor league fielding stats. Note that the minor league fielding data only appears after you click both the "Minors" and "Fielding" buttons.
I tried to work out how to access the desired table on the MLB page (using "Inspect element" in Chrome), but having zero HTML experience, I'm not sure where to begin.
One possible complication: upon hitting those two buttons, the link changes to:
https://www.mlb/player/junior-caminero-691406?stats=career-r-fielding-minors&year=2024
OTOH, if you click on this link (or simply paste it into the browser address bar), you end up on Caminero's minor league *batting* page. Maybe this doesn't matter, but I don't have any expertise in HTML, so it's all a mystery to me.
asked Feb 15 at 19:36 by Buckaroo Banzai (edited Feb 16 at 5:09)
2 Answers
Sniff the network traffic and fetch the data from their open API:
library(tidyverse)
library(httr2)

"https://statsapi.mlb/api/v1/people/691406/stats?stats=yearByYear,career,yearByYearAdvanced,careerAdvanced&gameType=R&leagueListId=milb_all&group=fielding&hydrate=team(league)&language=en" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("stats") %>%
  as_tibble() %>%
  unnest(splits)
# A tibble: 50 × 13
   type$displayName group$displayName exemptions season stat$gamesPlayed
   <chr>            <chr>             <list>     <chr>             <int>
 1 yearByYear       fielding          <list [0]> 2021                 10
 2 yearByYear       fielding          <list [0]> 2021                  4
 3 yearByYear       fielding          <list [0]> 2021                 18
 4 yearByYear       fielding          <list [0]> 2021                  5
 5 yearByYear       fielding          <list [0]> 2021                  6
 6 yearByYear       fielding          <list [0]> 2022                 33
 7 yearByYear       fielding          <list [0]> 2022                 15
 8 yearByYear       fielding          <list [0]> 2022                 16
 9 yearByYear       fielding          <list [0]> 2022                 12
10 yearByYear       fielding          <list [0]> 2022                  8
# ℹ 40 more rows
# ℹ 22 more variables: stat$gamesStarted <int>, $assists <int>,
#   $putOuts <int>, $errors <int>, $chances <int>, $fielding <chr>,
#   $position <df[,4]>, $rangeFactorPerGame <chr>, $rangeFactorPer9Inn <chr>,
#   $innings <chr>, $games <int>, $doublePlays <int>, $triplePlays <int>,
#   $throwingErrors <int>, team <df[,21]>, player <df[,3]>, league <df[,3]>,
#   sport <df[,3]>, gameType <chr>, position <df[,4]>, numTeams <int>, …
# ℹ Use `print(n = ...)` to see more rows
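The long URL above is easier to read when split into its query parameters. The following is only a sketch: `mlb_stats_request()` is a hypothetical helper name, the base URL and parameter values are copied verbatim from the request above, and nothing here is taken from official API documentation. It builds the request without sending it.

```r
library(httr2)

# Hypothetical helper: rebuilds the request above from named parts.
# player_id is passed as a string so it is pasted into the path unchanged.
mlb_stats_request <- function(player_id, group = "fielding") {
  request(paste0("https://statsapi.mlb/api/v1/people/", player_id, "/stats")) |>
    req_url_query(
      stats = "yearByYear,career,yearByYearAdvanced,careerAdvanced",
      gameType = "R",
      leagueListId = "milb_all",
      group = group,
      hydrate = "team(league)",
      language = "en"
    )
}

# Build (but do not send) the request for player 691406
req <- mlb_stats_request("691406")
req$url
```

The resulting `req` can be piped into the same `req_perform()` and `resp_body_json()` chain shown above. Note that httr2 may percent-encode characters such as commas in the query string.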
You can use selenider to scrape the careerTable, because the table is loaded dynamically with JavaScript.
if (!require("pacman")) install.packages("pacman")
pacman::p_load("selenider", "rvest")

session <- selenider_session("selenium", browser = "chrome")
open_url("https://www.mlb/player/junior-caminero-691406?stats=career-r-fielding-minors&year=2024")

# Accept the cookie banner, then click the "Fielding" button
session |> find_element("button[id='onetrust-accept-btn-handler']") |> elem_click()
session |> find_element("button[data-type='fielding']") |> elem_click()

table_data <- session |>
  get_page_source() |>
  html_element("#careerTable table") |> # I know this is the table by using F12 and inspecting the page
  html_table()
giving
| Season | Team | LG | Level | G | AB | R | H | TB | 2B | 3B | HR | RBI | BB | IBB | SO | SB | CS | AVG | OBP | SLG | OPS | GO/AO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021 | D-INR | DSL | ROK | 43 | 146 | 26 | 43 | 78 | 8 | 0 | 9 | 33 | 20 | 2 | 28 | 2 | 0 | 0.295 | 0.380 | 0.534 | 0.914 | 1.00 |
| 2022 | 2 teams | - | Minors | 62 | 239 | 37 | 75 | 119 | 7 | 2 | 11 | 51 | 23 | 0 | 43 | 12 | 1 | 0.314 | 0.384 | 0.498 | 0.882 | 1.14 |
| 2022 | F-RAY | FCL | ROK | 36 | 132 | 18 | 43 | 65 | 5 | 1 | 5 | 31 | 15 | 0 | 21 | 7 | 1 | 0.326 | 0.403 | 0.492 | 0.895 | 1.03 |
| 2022 | CHS | CAR | A | 26 | 107 | 19 | 32 | 54 | 2 | 1 | 6 | 20 | 8 | 0 | 22 | 5 | 0 | 0.299 | 0.359 | 0.505 | 0.864 | 1.30 |
| 2023 | 2 teams | - | Minors | 117 | 460 | 85 | 149 | 272 | 18 | 6 | 31 | 94 | 42 | 1 | 100 | 5 | 5 | 0.324 | 0.384 | 0.591 | 0.975 | 1.38 |
| 2023 | BG | SAL | A+ | 36 | 146 | 30 | 52 | 100 | 9 | 3 | 11 | 32 | 10 | 0 | 40 | 2 | 1 | 0.356 | 0.409 | 0.685 | 1.094 | 1.45 |
| 2023 | MTG | SOU | AA | 81 | 314 | 55 | 97 | 172 | 9 | 3 | 20 | 62 | 32 | 1 | 60 | 3 | 4 | 0.309 | 0.373 | 0.548 | 0.921 | 1.35 |
| 2024 | 2 teams | - | Minors | 59 | 234 | 37 | 64 | 122 | 10 | 0 | 16 | 39 | 21 | 0 | 52 | 1 | 1 | 0.274 | 0.337 | 0.521 | 0.858 | 1.59 |
| 2024 | F-RAY | FCL | ROK | 6 | 17 | 4 | 4 | 14 | 1 | 0 | 3 | 5 | 5 | 0 | 2 | 0 | 0 | 0.235 | 0.409 | 0.824 | 1.233 | 1.75 |
| 2024 | DUR | INT | AAA | 53 | 217 | 33 | 60 | 108 | 9 | 0 | 13 | 34 | 16 | 0 | 50 | 1 | 1 | 0.276 | 0.331 | 0.498 | 0.829 | 1.57 |
| Minors Career | - | - | Minors | 281 | 1079 | 185 | 331 | 591 | 43 | 8 | 67 | 217 | 106 | 3 | 223 | 20 | 7 | 0.307 | 0.374 | 0.548 | 0.922 | 1.30 |
|  | - | - | ROK | 85 | 295 | 48 | 90 | 157 | 14 | 1 | 17 | 69 | 40 | 2 | 51 | 9 | 1 | 0.305 | 0.392 | 0.532 | 0.924 | 1.05 |
|  | - | - | AAA | 53 | 217 | 33 | 60 | 108 | 9 | 0 | 13 | 34 | 16 | 0 | 50 | 1 | 1 | 0.276 | 0.331 | 0.498 | 0.829 | 1.57 |
|  | - | - | AA | 81 | 314 | 55 | 97 | 172 | 9 | 3 | 20 | 62 | 32 | 1 | 60 | 3 | 4 | 0.309 | 0.373 | 0.548 | 0.921 | 1.35 |
|  | - | - | A+ | 36 | 146 | 30 | 52 | 100 | 9 | 3 | 11 | 32 | 10 | 0 | 40 | 2 | 1 | 0.356 | 0.409 | 0.685 | 1.094 | 1.45 |
|  | - | - | A | 26 | 107 | 19 | 32 | 54 | 2 | 1 | 6 | 20 | 8 | 0 | 22 | 5 | 0 | 0.299 | 0.359 | 0.505 | 0.864 | 1.30 |
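The scraped table mixes per-team rows with aggregate rows ("2 teams", "Minors Career"). If only the per-team season rows are wanted, one possible cleanup is sketched below. It runs on a hand-typed excerpt of the first few columns of the table above rather than on `table_data` itself, and it assumes that aggregate rows can be recognised by `Level == "Minors"`, which holds for the rows shown but is not guaranteed in general.

```r
library(dplyr)

# Hand-typed excerpt of the table above (first five columns only)
career <- tibble::tribble(
  ~Season,         ~Team,     ~LG,   ~Level,   ~G,
  "2021",          "D-INR",   "DSL", "ROK",    43,
  "2022",          "2 teams", "-",   "Minors", 62,
  "2022",          "F-RAY",   "FCL", "ROK",    36,
  "2022",          "CHS",     "CAR", "A",      26,
  "Minors Career", "-",       "-",   "Minors", 281
)

# Drop aggregate rows first, then convert Season to integer
per_team <- career |>
  filter(Level != "Minors") |>
  mutate(Season = as.integer(Season))
```

Filtering before the conversion avoids the `NA` (and warning) that `as.integer("Minors Career")` would otherwise produce.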