this is the code i wrote to generate the data:
info <- html_nodes(manga, ".mt4") %>% html_text2() %>% strsplit("\n")
it returns a list of 50 character vectors that look like this:
[1] "Manga (? vols)" "Aug 1989 -" "741,705 members"
without the strsplit function it returns this kind of string:
[1] "Manga (? vols)\nAug 1989 -\n741,705 members"
my end goal is to have 3 separate columns, like below but with 50 rows:
media | serialization | members |
---|---|---|
Manga (? vols) | Aug 1989 - | 741,705 |
a complete table would look like this, i already have the title and rating section figured out:
title | rating | media | serialization | members |
---|---|---|---|---|
Berserk | 9.47 | Manga (? vols) | Aug 1989 - | 741,705 |
I'm not sure whether I should keep my data as rows of strings and split those into columns, or figure out what I can do with a list of vectors. I have tried to see if the CSS selector can be more specific, but it seems like MyAnimeList just lumps all that information together, so I'm not sure what to do.
asked Mar 23 at 6:54 by dootdoot, edited Mar 23 at 10:09 by margusl
3 Answers
I assume it's https://myanimelist/topmanga.php you are working on.
This is how the text is organized in div.mt4:
mt4 <- rvest::html_element(rvest::read_html("https://myanimelist/topmanga.php"), ".mt4")
as.character(mt4) %>% cat()
#> <div class="information di-ib mt4">
#> Manga (? vols)<br>
#> Aug 1989 - <br>
#> 741,735 members
#> </div>
xml2::html_structure(mt4)
#> <div.information.di-ib.mt4>
#> {text}
#> <br>
#> {text}
#> <br>
#> {text}
So you can access each individual text section separated by <br> with XPath, if you choose to:
html_element(mt4, xpath = "./text()[1]") %>% html_text2()
#> [1] "Manga (? vols)"
html_element(mt4, xpath = "./text()[2]") %>% html_text2()
#> [1] "Aug 1989 -"
html_element(mt4, xpath = "./text()[3]") %>% html_text2()
#> [1] "741,735 members"
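To try the same XPath idea without requesting the live page, here is a minimal sketch on a hand-written HTML snippet. The snippet, the object names (`page`, `cells`) and the helper `nth_text()` are made up for illustration and only mimic the structure shown above. It relies on `html_element()` returning one result per node when it is given a node set, so each `text()[n]` lines up as a column.

```r
library(rvest)

# Hypothetical stand-in for the real page: two cells shaped like div.mt4
page <- minimal_html('
  <div class="mt4">Manga (? vols)<br>Aug 1989 - <br>741,705 members</div>
  <div class="mt4">Manga (24 vols)<br>Jan 2004 - Apr 2011<br>287,986 members</div>
')

cells <- html_elements(page, ".mt4")

# Pull the n-th text node of every cell; trimws() drops stray whitespace
nth_text <- function(nodes, n) {
  trimws(html_text(html_element(nodes, xpath = sprintf("./text()[%d]", n))))
}

info <- data.frame(
  media         = nth_text(cells, 1),
  serialization = nth_text(cells, 2),
  members       = nth_text(cells, 3)
)
info
```

This route is mostly worth the extra code if some cells might be missing a piece, since a text node that does not exist should come back as `NA` for that row rather than shifting the remaining values.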
Though I personally would not bother unless there are known inconsistencies, just collect and separate at newlines:
library(rvest)
library(tidyr)
read_html("https://myanimelist/topmanga.php") %>%
  html_elements("tr.ranking-list") %>%
  {
    tibble(
      title = html_element(., "h3.manga_h3") %>% html_text(),
      rating = html_element(., ".score span") %>% html_text(),
      info = html_element(., ".mt4") %>% html_text2()
    )
  } %>%
  separate_wider_delim(info, "\n", names = c("media", "serialization", "members"))
#> # A tibble: 50 × 5
#> title rating media serialization members
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Berserk 9.47 Mang… Aug 1989 - 741,73…
#> 2 JoJo no Kimyou na Bouken Part 7: Steel Ba… 9.32 Mang… Jan 2004 - A… 287,98…
#> 3 Vagabond 9.26 Mang… Sep 1998 - M… 417,71…
#> 4 One Piece 9.22 Mang… Jul 1997 - 654,63…
#> 5 Monster 9.16 Mang… Dec 1994 - D… 264,90…
#> 6 Slam Dunk 9.08 Mang… Sep 1990 - J… 183,84…
#> 7 Vinland Saga 9.08 Mang… Apr 2005 - 327,82…
#> 8 Fullmetal Alchemist 9.04 Mang… Jul 2001 - S… 305,86…
#> 9 Tian Guan Cifu 9.03 Nove… Feb 2021 - M… 14,678…
#> 10 Grand Blue 9.03 Mang… Apr 2014 - 185,87…
#> # ℹ 40 more rows
Created on 2025-03-23 with reprex v2.1.1
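The target table in the question shows `members` as a bare number, so one extra step may be wanted after the split. The following is a small offline sketch of that step, not part of the reprex above: the `raw` data frame is a hypothetical stand-in for the `html_text2()` output, built from the strings quoted in the question, and the digit-stripping `gsub()` is just one possible way to clean the column.

```r
library(tidyr)

# Hypothetical sample of html_text2() output, copied from the question
raw <- data.frame(
  info = c(
    "Manga (? vols)\nAug 1989 -\n741,705 members",
    "Manga (24 vols)\nJan 2004 - Apr 2011\n287,986 members"
  )
)

# Split the single string column into three columns at each newline
result <- separate_wider_delim(
  raw, info,
  delim = "\n",
  names = c("media", "serialization", "members")
)

# Keep digits only, then convert: "741,705 members" -> 741705
result$members <- as.numeric(gsub("[^0-9]", "", result$members))
result
```

If some rows turn out to have fewer or more than three pieces, the `too_few` and `too_many` arguments of `separate_wider_delim()` control what happens instead of the default error.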
Like others, I also assume you are referring to /topmanga.php. Here is one way to convert your list of string vectors into a data frame in base R.
The approach below uses lapply to apply t() (transpose) to each character vector and pass the result to as.data.frame(), which returns a list of one-row data frames. That list is then bound together using rbind within do.call().
library(rvest)
url <- "https://myanimelist/topmanga.php"
manga <- read_html(url)
df <-
  html_nodes(manga, ".mt4") %>%
  html_text2() %>%
  strsplit("\n") %>%
  lapply(\(listElement) as.data.frame(t(listElement))) %>%
  do.call(rbind, .)
head(df)
Printing out:
V1 V2 V3
1 Manga (? vols) Aug 1989 - 741,744 members
2 Manga (24 vols) Jan 2004 - Apr 2011 287,986 members
3 Manga (37 vols) Sep 1998 - May 2015 417,716 members
4 Manga (? vols) Jul 1997 - 654,636 members
5 Manga (18 vols) Dec 1994 - Dec 2001 264,904 members
6 Manga (31 vols) Sep 1990 - Jun 1996 183,847 members
Note that if you wish to use options other than base R, then dplyr::bind_rows() can be used instead of do.call(rbind, .).
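As a possible shortcut in base R, `rbind` can also be applied to the character vectors directly, which gives a character matrix that is converted to a data frame in one go. This is a sketch on hypothetical sample data (the two strings from the question), and it assumes every element splits into exactly three pieces; with unequal lengths `rbind` would recycle values with a warning.

```r
# Hypothetical sample of the strsplit() output from the question
info <- strsplit(
  c("Manga (? vols)\nAug 1989 -\n741,705 members",
    "Manga (24 vols)\nJan 2004 - Apr 2011\n287,986 members"),
  "\n"
)

# rbind() on character vectors of equal length yields a character matrix
mat <- do.call(rbind, info)

# Convert to a data frame and replace V1..V3 with the desired column names
df <- setNames(
  as.data.frame(mat),
  c("media", "serialization", "members")
)
df
```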
Having shared all that for how to deal with your strsplit() output, I find the answer by margusl to probably be more principled for your broader objective.
Thanks to @Margusl for providing the link. In this case it can also be viable to read the HTML table directly using readHTMLTable(url) and unnest_wider the Title column, which holds several pieces of information separated by \n.
library(XML)
library(RCurl)
library(rlist)
library(tidyverse)
url <- getURL("https://myanimelist/topmanga.php",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(url) # gives all tables on the page
df <- tables[[1]] %>%
  mutate(Title = str_split(Title, "\\n+")) %>% # split Title column by `\n`
  unnest_wider(Title, names_sep = "_") %>%
  mutate(across(where(is.character), ~ trimws(.x))) # trim
kableExtra::kable(head(df))
giving
Rank | Title_1 | Title_2 | Title_3 | Title_4 | Title_5 | Title_6 | Score | Your Score | Status |
---|---|---|---|---|---|---|---|---|---|
1 | Berserk | Manga (? vols) | Aug 1989 - | 741,749 members | NA | 9.47 | N/A | Add to My List | |
2 | JoJo no Kimyou na Bouken Part 7: Steel Ball Run | Manga (24 vols) | Jan 2004 - Apr 2011 | 287,986 members | NA | 9.32 | N/A | Add to My List | |
3 | Vagabond | Manga (37 vols) | Sep 1998 - May 2015 | 417,727 members | NA | 9.26 | N/A | Add to My List | |
4 | One Piece | Manga (? vols) | Jul 1997 - | 654,641 members | NA | 9.22 | N/A | Add to My List | |
5 | Monster | Manga (18 vols) | Dec 1994 - Dec 2001 | 264,907 members | NA | 9.16 | N/A | Add to My List | |
6 | Slam Dunk | Manga (31 vols) | Sep 1990 - Jun 1996 | 183,847 members | NA | 9.08 | N/A | Add to My List |
manga? – s_baldur Commented Mar 23 at 9:50