web scraping - how to turn html text into multiple different columns in r - Stack Overflow

this is the code i wrote to generate the data:

info <- html_nodes(manga, ".mt4") %>% html_text2() %>% strsplit("\n")

it returns a list of 50 elements that look like this:

[1] "Manga (? vols)" "Aug 1989 -" "741,705 members"

without the strsplit function it returns this kind of string:

[1] "Manga (? vols)\nAug 1989 -\n741,705 members"

my end goal is to have 3 separate columns, like below but with 50 rows:

media           serialization  members
Manga (? vols)  Aug 1989 -     741,705

a complete table would look like this, i already have the title and rating section figured out:

title    rating  media           serialization  members
Berserk  9.47    Manga (? vols)  Aug 1989 -     741,705

i'm not sure if i should approach the question by keeping my data as rows of strings and trying to split it into columns that way, or figure out what i can do with rows of lists. i have tried to see if the css selector can be more specific, but it seems like MyAnimeList just lumps all that information together, so i'm not sure what to do

asked Mar 23 at 6:54 by dootdoot, edited Mar 23 at 10:09 by margusl
  • what is the definition of manga? – s_baldur Commented Mar 23 at 9:50

3 Answers

I assume it's https://myanimelist.net/topmanga.php you are working on.

This is how text is organized in div.mt4:

mt4 <- rvest::html_element(rvest::read_html("https://myanimelist.net/topmanga.php"), ".mt4")
as.character(mt4) %>% cat()
#> <div class="information di-ib mt4">
#>         Manga (? vols)<br>
#>         Aug 1989 - <br>
#>         741,735 members
#>       </div>
xml2::html_structure(mt4)
#> <div.information.di-ib.mt4>
#>   {text}
#>   <br>
#>   {text}
#>   <br>
#>   {text}

So you can access each individual text section separated by <br> with xpath, if you choose to:

html_element(mt4, xpath = "./text()[1]") %>%  html_text2()
#> [1] "Manga (? vols)"
html_element(mt4, xpath = "./text()[2]") %>% html_text2()
#> [1] "Aug 1989 -"
html_element(mt4, xpath = "./text()[3]") %>% html_text2()
#> [1] "741,735 members"

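The same XPath idea should extend to all rows at once, since html_element() returns one match per input node. Below is a minimal sketch, assuming rvest is installed. The inline HTML is a hypothetical stand-in for the live page (two rows copied from the output in this thread) so that it runs offline, and nth_text() is an illustrative helper name, not part of rvest:

```r
library(rvest)

# Hypothetical stand-in for the live page: two .mt4 blocks copied from this thread
page <- minimal_html('
  <div class="information di-ib mt4">
    Manga (? vols)<br>
    Aug 1989 - <br>
    741,735 members
  </div>
  <div class="information di-ib mt4">
    Manga (24 vols)<br>
    Jan 2004 - Apr 2011<br>
    287,986 members
  </div>
')

nodes <- html_elements(page, ".mt4")

# Illustrative helper: trimmed text of the i-th direct text node of each div
nth_text <- function(nodes, i) {
  html_text2(html_element(nodes, xpath = sprintf("./text()[%d]", i)))
}

info <- data.frame(
  media         = nth_text(nodes, 1),
  serialization = nth_text(nodes, 2),
  members       = nth_text(nodes, 3)
)

print(info)
```

A div with fewer than three text nodes should come back as NA in the affected column rather than shifting the other values, which is the main reason to prefer this over splitting when the markup is inconsistent.
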
Though I personally would not bother unless there are known inconsistencies, just collect and separate at newlines:

library(rvest)
library(tidyr)

read_html("https://myanimelist.net/topmanga.php") %>%
  html_elements("tr.ranking-list") %>% 
  {
    tibble(
      title  = html_element(., "h3.manga_h3") %>% html_text(),
      rating = html_element(., ".score span") %>% html_text(),
      info   = html_element(., ".mt4") %>% html_text2()
    )
  } %>%
  separate_wider_delim(info, "\n", names = c("media", "serialization", "members"))
#> # A tibble: 50 × 5
#>    title                                      rating media serialization members
#>    <chr>                                      <chr>  <chr> <chr>         <chr>  
#>  1 Berserk                                    9.47   Mang… Aug 1989 -    741,73…
#>  2 JoJo no Kimyou na Bouken Part 7: Steel Ba… 9.32   Mang… Jan 2004 - A… 287,98…
#>  3 Vagabond                                   9.26   Mang… Sep 1998 - M… 417,71…
#>  4 One Piece                                  9.22   Mang… Jul 1997 -    654,63…
#>  5 Monster                                    9.16   Mang… Dec 1994 - D… 264,90…
#>  6 Slam Dunk                                  9.08   Mang… Sep 1990 - J… 183,84…
#>  7 Vinland Saga                               9.08   Mang… Apr 2005 -    327,82…
#>  8 Fullmetal Alchemist                        9.04   Mang… Jul 2001 - S… 305,86…
#>  9 Tian Guan Cifu                             9.03   Nove… Feb 2021 - M… 14,678…
#> 10 Grand Blue                                 9.03   Mang… Apr 2014 -    185,87…
#> # ℹ 40 more rows

Created on 2025-03-23 with reprex v2.1.1

Like others I also assume you are referring to /topmanga.php. Here is one way to convert your list of string vectors into a dataframe in base R.

The approach below uses lapply() to go over the list, transposing each character vector with t() into a one-row matrix and passing that to as.data.frame(). This returns a list of one-row dataframes. Then, the list of dataframes is bound together using rbind within do.call().

library(rvest)

url <- "https://myanimelist.net/topmanga.php"

manga <- read_html(url)

df <-
    html_nodes(manga, ".mt4") %>%
    html_text2() %>%
    strsplit("\n") %>%
    lapply(\(listElement) as.data.frame(t(listElement))) %>%
    do.call(rbind, .)

head(df)

Printing out:

               V1                  V2              V3
1  Manga (? vols)          Aug 1989 - 741,744 members
2 Manga (24 vols) Jan 2004 - Apr 2011 287,986 members
3 Manga (37 vols) Sep 1998 - May 2015 417,716 members
4  Manga (? vols)          Jul 1997 - 654,636 members
5 Manga (18 vols) Dec 1994 - Dec 2001 264,904 members
6 Manga (31 vols) Sep 1990 - Jun 1996 183,847 members

Note that if you wish to use options other than base R, then dplyr::bind_rows() can be used instead of do.call(rbind, .).
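
If you also want the column names from the question and a numeric members column, a variation in base R could look like the sketch below. It assumes every string splits into exactly three pieces, and the two sample strings are copied from this thread rather than scraped, so it runs without network access:

```r
# Sample strings copied from the question and the output above;
# in practice this vector would come from html_text2().
info <- c(
  "Manga (? vols)\nAug 1989 -\n741,705 members",
  "Manga (24 vols)\nJan 2004 - Apr 2011\n287,986 members"
)

# Split at newlines and stack the pieces into a character matrix.
# Assumption: each element yields exactly three parts.
parts <- do.call(rbind, strsplit(info, "\n", fixed = TRUE))

df <- data.frame(
  media         = parts[, 1],
  serialization = parts[, 2],
  # Drop every non-digit (the comma and " members"), then convert
  members       = as.integer(gsub("[^0-9]", "", parts[, 3])),
  stringsAsFactors = FALSE
)

print(df)
```

Binding into a matrix first and building the dataframe once avoids creating 50 intermediate dataframes, though for a table this small the difference is cosmetic.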

Having shared all that for how to deal with your strsplit() output, I find the answer by margusl to probably be more principled for your broader objective.

Thanks to @margusl for providing the link. In this case it can also be viable to read the HTML table directly using readHTMLTable() and then apply unnest_wider() to the Title column, which holds several pieces of information separated by \n.

library(XML)
library(RCurl)
library(rlist)
library(tidyverse)
url <- getURL("https://myanimelist.net/topmanga.php", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(url) # gives all tables on the page

df <- tables[[1]] %>%
  mutate(Title = str_split(Title, "\\n+")) %>% # split Title column  by `\n`
  unnest_wider(Title, names_sep = "_") %>%
  mutate(across(where(is.character), ~ trimws(.x))) # trim

kableExtra::kable(head(df))

giving

Rank Title_1 Title_2 Title_3 Title_4 Title_5 Title_6 Score Your Score Status
1 Berserk Manga (? vols) Aug 1989 - 741,749 members NA 9.47 N/A Add to My List
2 JoJo no Kimyou na Bouken Part 7: Steel Ball Run Manga (24 vols) Jan 2004 - Apr 2011 287,986 members NA 9.32 N/A Add to My List
3 Vagabond Manga (37 vols) Sep 1998 - May 2015 417,727 members NA 9.26 N/A Add to My List
4 One Piece Manga (? vols) Jul 1997 - 654,641 members NA 9.22 N/A Add to My List
5 Monster Manga (18 vols) Dec 1994 - Dec 2001 264,907 members NA 9.16 N/A Add to My List
6 Slam Dunk Manga (31 vols) Sep 1990 - Jun 1996 183,847 members NA 9.08 N/A Add to My List