Web scraping - how to turn HTML text into multiple different columns in R

this is the code i wrote to generate the data:

info <- html_nodes(manga, ".mt4") %>% html_text2() %>% strsplit("\n")

it returns a list of 50 character vectors that look like this:

[1] "Manga (? vols)" "Aug 1989 -" "741,705 members"

without the strsplit function it returns this kind of string:

[1] "Manga (? vols)\nAug 1989 -\n741,705 members"

my end goal is to have 3 separate columns, like below but with 50 rows:

media	serialization	members
Manga (? vols)	Aug 1989 -	741,705

a complete table would look like this, i already have the title and rating section figured out:

title	rating	media	serialization	members
Berserk	9.47	Manga (? vols)	Aug 1989 -	741,705

i'm not sure whether i should keep my data as strings and split those into columns, or work with the list of vectors that strsplit returns. i have tried making the css selector more specific, but MyAnimeList seems to lump all that information into one element, so i'm not sure what to do

asked Mar 23 at 6:54 by doot, edited Mar 23 at 10:09 by margusl

what is the definition of manga? – s_baldur Commented Mar 23 at 9:50


3 Answers


I assume it's https://myanimelist.net/topmanga.php you are working on.

This is how text is organized in div.mt4:

library(rvest)  # also provides the %>% pipe used below
mt4 <- rvest::html_element(rvest::read_html("https://myanimelist.net/topmanga.php"), ".mt4")
as.character(mt4) %>% cat()
#> <div class="information di-ib mt4">
#>         Manga (? vols)<br>
#>         Aug 1989 - <br>
#>         741,735 members
#>       </div>
xml2::html_structure(mt4)
#> <div.information.di-ib.mt4>
#>   {text}
#>   <br>
#>   {text}
#>   <br>
#>   {text}

So you can access each individual text section separated by <br> with xpath, if you choose to:

html_element(mt4, xpath = "./text()[1]") %>%  html_text2()
#> [1] "Manga (? vols)"
html_element(mt4, xpath = "./text()[2]") %>% html_text2()
#> [1] "Aug 1989 -"
html_element(mt4, xpath = "./text()[3]") %>% html_text2()
#> [1] "741,735 members"
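The same XPath idea can be applied to all rows at once. The following is only a sketch: it parses a small inline HTML snippet modeled on the structure shown above instead of the live page, and `nth_text()` is a hypothetical helper name, not part of rvest. Because html_element() returns one result per input node, a div with a missing text section should yield NA rather than shifting the columns.

```r
library(rvest)

# Inline HTML standing in for two rows of the page (same layout as div.mt4 above)
page <- read_html('
<div class="mt4">Manga (? vols)<br>Aug 1989 - <br>741,735 members</div>
<div class="mt4">Manga (24 vols)<br>Jan 2004 - Apr 2011<br>287,986 members</div>
')

nodes <- html_elements(page, ".mt4")

# Hypothetical helper: n-th text node of every div, with surrounding whitespace trimmed
nth_text <- function(nodes, n) {
  html_text(html_element(nodes, xpath = sprintf("./text()[%d]", n)), trim = TRUE)
}

info <- data.frame(
  media         = nth_text(nodes, 1),
  serialization = nth_text(nodes, 2),
  members       = nth_text(nodes, 3)
)
info
```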

Though I personally would not bother unless there are known inconsistencies, just collect and separate at newlines:

library(rvest)
library(tidyr)

read_html("https://myanimelist.net/topmanga.php") %>%
  html_elements("tr.ranking-list") %>% 
  {
    tibble(
      title  = html_element(., "h3.manga_h3") %>% html_text(),
      rating = html_element(., ".score span") %>% html_text(),
      info   = html_element(., ".mt4") %>% html_text2()
    )
  } %>%
  separate_wider_delim(info, "\n", names = c("media", "serialization", "members"))
#> # A tibble: 50 × 5
#>    title                                      rating media serialization members
#>    <chr>                                      <chr>  <chr> <chr>         <chr>  
#>  1 Berserk                                    9.47   Mang… Aug 1989 -    741,73…
#>  2 JoJo no Kimyou na Bouken Part 7: Steel Ba… 9.32   Mang… Jan 2004 - A… 287,98…
#>  3 Vagabond                                   9.26   Mang… Sep 1998 - M… 417,71…
#>  4 One Piece                                  9.22   Mang… Jul 1997 -    654,63…
#>  5 Monster                                    9.16   Mang… Dec 1994 - D… 264,90…
#>  6 Slam Dunk                                  9.08   Mang… Sep 1990 - J… 183,84…
#>  7 Vinland Saga                               9.08   Mang… Apr 2005 -    327,82…
#>  8 Fullmetal Alchemist                        9.04   Mang… Jul 2001 - S… 305,86…
#>  9 Tian Guan Cifu                             9.03   Nove… Feb 2021 - M… 14,678…
#> 10 Grand Blue                                 9.03   Mang… Apr 2014 -    185,87…
#> # ℹ 40 more rows

Created on 2025-03-23 with reprex v2.1.1
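The desired table in the question shows members without the trailing " members" text. One possible follow-up step, sketched here in base R on a few sample strings in the page's format (the column name `members_n` is made up for illustration), is to drop every non-digit character and convert to integer:

```r
# Sample strings in the format the page uses
df <- data.frame(
  title   = c("Berserk", "JoJo no Kimyou na Bouken Part 7: Steel Ball Run"),
  members = c("741,735 members", "287,986 members")
)

# Remove everything that is not a digit (thousands separators and the
# " members" suffix), then convert to integer
df$members_n <- as.integer(gsub("[^0-9]", "", df$members))

df$members_n
#> [1] 741735 287986
```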

Like others I also assume you are referring to /topmanga.php. Here is one way to convert your list of string vectors into a dataframe in base R.

The approach below uses lapply to transpose each character vector with t(), which turns it into a one-row matrix, and passes the result to as.data.frame(). This returns a list of one-row dataframes. Then, the list of dataframes is bound together using rbind within do.call().

library(rvest)

url <- "https://myanimelist.net/topmanga.php"

manga <- read_html(url)

df <-
    html_nodes(manga, ".mt4") %>%
    html_text2() %>%
    strsplit("\n") %>%
    lapply(\(listElement) as.data.frame(t(listElement))) %>%
    do.call(rbind, .)

head(df)

Printing out:

               V1                  V2              V3
1  Manga (? vols)          Aug 1989 - 741,744 members
2 Manga (24 vols) Jan 2004 - Apr 2011 287,986 members
3 Manga (37 vols) Sep 1998 - May 2015 417,716 members
4  Manga (? vols)          Jul 1997 - 654,636 members
5 Manga (18 vols) Dec 1994 - Dec 2001 264,904 members
6 Manga (31 vols) Sep 1990 - Jun 1996 183,847 members

Note that if you wish to use options other than base R, then dplyr::bind_rows() can be used instead of do.call(rbind, .).
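A slightly shorter base R variant, shown as a sketch on two sample strings rather than the live page, is to rbind the character vectors directly into a matrix and name the columns afterwards. It assumes every entry splits into exactly three pieces, which the check below makes explicit.

```r
# Sample strings in the same format that html_text2() returns
info <- c(
  "Manga (? vols)\nAug 1989 -\n741,744 members",
  "Manga (24 vols)\nJan 2004 - Apr 2011\n287,986 members"
)

parts <- strsplit(info, "\n", fixed = TRUE)

# Stop early if any entry does not have exactly three pieces
stopifnot(all(lengths(parts) == 3))

# rbind the vectors into a character matrix, then convert and name the columns
df <- as.data.frame(do.call(rbind, parts))
names(df) <- c("media", "serialization", "members")
df
```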

Having shared all that for how to deal with your strsplit() output, I find the answer by margusl to probably be more principled for your broader objective.

Thanks to @Margusl for providing the link. In this case it can also be viable to read the HTML table directly using readHTMLTable(url) and then split the Title column, which packs several pieces of information separated by \n, into separate columns with unnest_wider.

library(XML)
library(RCurl)
library(tidyverse)
# download the page source; note that ssl.verifypeer = FALSE disables certificate checks
url <- getURL("https://myanimelist.net/topmanga.php",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(url) # gives all tables on the page

df <- tables[[1]] %>%
  mutate(Title = str_split(Title, "\\n+")) %>% # split Title column  by `\n`
  unnest_wider(Title, names_sep = "_") %>%
  mutate(across(where(is.character), ~ trimws(.x))) # trim

kableExtra::kable(head(df))

giving

Rank	Title_1	Title_3	Title_4	Title_5	Title_6	Score	Your Score	Status
1	Berserk	Manga (? vols)	Aug 1989 -	741,749 members	NA	9.47	N/A	Add to My List
2	JoJo no Kimyou na Bouken Part 7: Steel Ball Run	Manga (24 vols)	Jan 2004 - Apr 2011	287,986 members	NA	9.32	N/A	Add to My List
3	Vagabond	Manga (37 vols)	Sep 1998 - May 2015	417,727 members	NA	9.26	N/A	Add to My List
4	One Piece	Manga (? vols)	Jul 1997 -	654,641 members	NA	9.22	N/A	Add to My List
5	Monster	Manga (18 vols)	Dec 1994 - Dec 2001	264,907 members	NA	9.16	N/A	Add to My List
6	Slam Dunk	Manga (31 vols)	Sep 1990 - Jun 1996	183,847 members	NA	9.08	N/A	Add to My List
