web scraping - Use R to scrape MLB.com player fielding data - Stack Overflow

I'm learning how to use R to scrape tables of baseball stats from different places on the web. For example, I adapted this post to scrape a player's minor league fielding data from the player register page. Here's that code for Junior Caminero:

library(tidyverse)
library(rvest) # read_html(), html_nodes(), html_table(); not attached by library(tidyverse)

player_html <- "https://www.baseball-reference.com/register/player.fcgi?id=camine000jun#standard_fielding"

fielding_data <- 
  player_html |> 
  read_html() |> 
  as.character() |> # Extracts the HTML content as a string
  str_replace_all("<!--|-->","") |> # Remove all HTML comments; this is necessary
  read_html() |>  # Parse the cleaned html
  html_nodes(xpath = '//*[@id="div_standard_fielding"]') |> 
  html_table() |> # Extract, parse, 
  pluck(1)     # and pluck the fielding data table from the cleaned HTML
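
To see why stripping the comment markers matters, here is a minimal offline sketch. The page string and the `div_demo` id are made up for illustration; the assumption is only that the real page wraps its table in an HTML comment in the same way.

```r
library(rvest)
library(stringr)

# Hypothetical page: the table sits inside an HTML comment
page <- '<html><body><div id="div_demo"><!--
<table><tr><th>Pos</th><th>G</th></tr><tr><td>3B</td><td>10</td></tr></table>
--></div></body></html>'

# Parsed as-is, the commented-out table is not part of the document tree
before <- read_html(page) |> html_elements("table")
length(before)

# After removing the comment markers, the table parses normally
after <- page |>
  str_replace_all("<!--|-->", "") |>
  read_html() |>
  html_element(xpath = '//*[@id="div_demo"]//table') |>
  html_table()
after
```

The first parse finds no table at all, which is why the pipeline above converts the page to a string, removes the markers, and parses it a second time.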

I would like to do a similar thing, but with a stats table in MLB. For example, here is a link to Caminero's MLB page:

https://www.mlb.com/player/junior-caminero-691406

Specifically, I would like to pull the table of his minor league fielding stats. Note that to get the minor league fielding data to appear, you need to click the "Minors" button and then the "Fielding" button.

I tried to figure out the syntax for accessing the desired table on the MLB page (using "Inspect element" in Chrome), but having zero HTML coding experience, I'm not really sure where to begin.

One possible complication: upon hitting those two buttons, the link changes to:

https://www.mlb.com/player/junior-caminero-691406?stats=career-r-fielding-minors&year=2024

OTOH, if you click on this link (or simply paste that text into your browser's address bar), you'll end up on Caminero's minor league *batting* page. Maybe this doesn't matter. But I don't really have any expertise in HTML, so it's all a mystery to me.

Asked Feb 15 at 19:36 by Buckaroo Banzai; edited Feb 16 at 5:09.

2 Answers

Inspect the page's network traffic (the Network tab in your browser's developer tools) and fetch the data directly from the open API the page calls:

library(tidyverse)
library(httr2)

# 691406 is the player ID from the MLB.com player URL; group=fielding selects fielding stats
"https://statsapi.mlb.com/api/v1/people/691406/stats?stats=yearByYear,career,yearByYearAdvanced,careerAdvanced&gameType=R&leagueListId=milb_all&group=fielding&hydrate=team(league)&language=en" %>% 
  request() %>% 
  req_perform() %>% 
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("stats") %>%  
  as_tibble() %>% 
  unnest(splits)

# A tibble: 50 × 13
   type$displayName group$displayName exemptions season stat$gamesPlayed
   <chr>            <chr>             <list>     <chr>             <int>
 1 yearByYear       fielding          <list [0]> 2021                 10
 2 yearByYear       fielding          <list [0]> 2021                  4
 3 yearByYear       fielding          <list [0]> 2021                 18
 4 yearByYear       fielding          <list [0]> 2021                  5
 5 yearByYear       fielding          <list [0]> 2021                  6
 6 yearByYear       fielding          <list [0]> 2022                 33
 7 yearByYear       fielding          <list [0]> 2022                 15
 8 yearByYear       fielding          <list [0]> 2022                 16
 9 yearByYear       fielding          <list [0]> 2022                 12
10 yearByYear       fielding          <list [0]> 2022                  8
# ℹ 40 more rows
# ℹ 22 more variables: stat$gamesStarted <int>, $assists <int>,
#   $putOuts <int>, $errors <int>, $chances <int>, $fielding <chr>,
#   $position <df[,4]>, $rangeFactorPerGame <chr>, $rangeFactorPer9Inn <chr>,
#   $innings <chr>, $games <int>, $doublePlays <int>, $triplePlays <int>,
#   $throwingErrors <int>, team <df[,21]>, player <df[,3]>, league <df[,3]>,
#   sport <df[,3]>, gameType <chr>, position <df[,4]>, numTeams <int>, …
# ℹ Use `print(n = ...)` to see more rows
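
The result keeps `type` and `stat` as nested data-frame columns, which is why the printout shows names like `stat$gamesPlayed`. A minimal sketch of flattening it into one row per season and team follows. `tidy_fielding()` is a hypothetical helper, and `mock` only imitates the shape of the real result; its numbers are placeholders, not actual stats.

```r
library(dplyr)
library(tidyr)

# Mock with the same nesting as the unnested API result (placeholder values)
mock <- tibble(
  type   = tibble(displayName = c("yearByYear", "yearByYear", "career")),
  season = c("2021", "2022", NA_character_),
  stat   = tibble(
    gamesPlayed = c(10L, 33L, 43L),
    assists     = c(20L, 60L, 80L),
    putOuts     = c(8L, 25L, 33L),
    errors      = c(2L, 5L, 7L)
  )
)

# Hypothetical helper: keep the year-by-year rows and spread the nested
# `stat` data frame into ordinary columns
tidy_fielding <- function(x) {
  x |>
    filter(type$displayName == "yearByYear") |>
    select(season, stat) |>
    unpack(stat)
}

fielding_by_year <- tidy_fielding(mock)
fielding_by_year
```

With the real data you would likely also keep identifying fields from the nested `team` and `position` columns; which fields those contain is not shown in the printout, so check them with `glimpse()` first.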

You can use selenider to scrape the careerTable, because the table is loaded dynamically with JavaScript.

if (!require("pacman")) install.packages("pacman")
pacman::p_load("selenider","rvest")

session <- selenider_session("selenium", browser = "chrome")
open_url("https://www.mlb.com/player/junior-caminero-691406?stats=career-r-fielding-minors&year=2024")

# Dismiss the cookie-consent banner, then switch the stats table to "Fielding"
session |> find_element("button[id='onetrust-accept-btn-handler']") |> elem_click()
session |> find_element("button[data-type='fielding']") |> elem_click()

table_data <- session %>%
  get_page_source() %>% 
  html_element("#careerTable table") %>% # I know this is the table by using F12 and inspecting the page
  html_table()

giving

Season         Team     LG   Level     G    AB    R    H   TB  2B  3B  HR  RBI   BB  IBB   SO  SB  CS    AVG    OBP    SLG    OPS  GO/AO
2021           D-INR    DSL  ROK      43   146   26   43   78   8   0   9   33   20    2   28   2   0  0.295  0.380  0.534  0.914   1.00
2022           2 teams  -    Minors   62   239   37   75  119   7   2  11   51   23    0   43  12   1  0.314  0.384  0.498  0.882   1.14
2022           F-RAY    FCL  ROK      36   132   18   43   65   5   1   5   31   15    0   21   7   1  0.326  0.403  0.492  0.895   1.03
2022           CHS      CAR  A        26   107   19   32   54   2   1   6   20    8    0   22   5   0  0.299  0.359  0.505  0.864   1.30
2023           2 teams  -    Minors  117   460   85  149  272  18   6  31   94   42    1  100   5   5  0.324  0.384  0.591  0.975   1.38
2023           BG       SAL  A+       36   146   30   52  100   9   3  11   32   10    0   40   2   1  0.356  0.409  0.685  1.094   1.45
2023           MTG      SOU  AA       81   314   55   97  172   9   3  20   62   32    1   60   3   4  0.309  0.373  0.548  0.921   1.35
2024           2 teams  -    Minors   59   234   37   64  122  10   0  16   39   21    0   52   1   1  0.274  0.337  0.521  0.858   1.59
2024           F-RAY    FCL  ROK       6    17    4    4   14   1   0   3    5    5    0    2   0   0  0.235  0.409  0.824  1.233   1.75
2024           DUR      INT  AAA      53   217   33   60  108   9   0  13   34   16    0   50   1   1  0.276  0.331  0.498  0.829   1.57
Minors Career  -        -    Minors  281  1079  185  331  591  43   8  67  217  106    3  223  20   7  0.307  0.374  0.548  0.922   1.30
               -        -    ROK      85   295   48   90  157  14   1  17   69   40    2   51   9   1  0.305  0.392  0.532  0.924   1.05
               -        -    AAA      53   217   33   60  108   9   0  13   34   16    0   50   1   1  0.276  0.331  0.498  0.829   1.57
               -        -    AA       81   314   55   97  172   9   3  20   62   32    1   60   3   4  0.309  0.373  0.548  0.921   1.35
               -        -    A+       36   146   30   52  100   9   3  11   32   10    0   40   2   1  0.356  0.409  0.685  1.094   1.45
               -        -    A        26   107   19   32   54   2   1   6   20    8    0   22   5   0  0.299  0.359  0.505  0.864   1.30
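
The scraped table mixes per-team rows with "2 teams" season totals and career summary rows, so summing a column directly would double count. A minimal sketch of keeping only the per-team season rows, using a few rows and columns copied from the output above (the name `per_team` is just for illustration):

```r
library(dplyr)
library(stringr)

# A few rows copied from the output above; most columns dropped for brevity
table_data <- tibble::tribble(
  ~Season,         ~Team,     ~LG,   ~Level,   ~G,  ~AB,
  "2021",          "D-INR",   "DSL", "ROK",    43,  146,
  "2022",          "2 teams", "-",   "Minors", 62,  239,
  "2022",          "F-RAY",   "FCL", "ROK",    36,  132,
  "2022",          "CHS",     "CAR", "A",      26,  107,
  "Minors Career", "-",       "-",   "Minors", 281, 1079
)

# Keep rows whose Season is a four-digit year and that are not multi-team totals
per_team <- table_data |>
  filter(str_detect(Season, "^\\d{4}$"), Team != "2 teams") |>
  mutate(Season = as.integer(Season))

per_team
```

The level-summary rows at the bottom of the full table have no season value, so the same four-digit-year filter should drop them as well, assuming they are read in as empty strings or `NA`.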