I'm having trouble using readr::parse_number() on a character column that sometimes contains non-alphanumeric prefixes before the number. In some cases, parse_number() extracts the number correctly, while in other cases it returns NA.
Consider the following reproducible example:
library(tidyverse)
df <- structure(list(col1 = c("6980265",
                              "x (6969100)",
                              "1,234.56",
                              "euro1,000",
                              "x. (6969100) ",
                              "x (6969943)",
                              "x y. (6977416)",
                              "x-y (6923012) ")),
                class = c("tbl_df", "tbl", "data.frame"),
                row.names = c(NA, -8L))
# Directly using parse_number() on col1
df2 <- df |>
  mutate(col2 = parse_number(col1, trim_ws = TRUE))
df2
The output is:
# A tibble: 8 × 2
  col1                col2
  <chr>              <dbl>
1 "6980265"        6980265
2 "x (6969100)"    6969100
3 "1,234.56"         1235.
4 "euro1,000"         1000
5 "x. (6969100) "       NA
6 "x (6969943)"    6969943
7 "x y. (6977416)"      NA
8 "x-y (6923012) "      NA
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `col2 = parse_number(col1, trim_ws =
TRUE)`.
Caused by warning:
! 3 parsing failures.
row col expected actual
  5  -- a number x. (6969100)
  7  -- a number x y. (6977416)
  8  -- a number x-y (6923012)
After removing all non-alphanumeric characters, the code works:
df |>
  mutate(col1 = str_replace_all(col1, "[^[:alnum:] ]", "")) |>
  mutate(col2 = parse_number(col1, trim_ws = TRUE))
# A tibble: 8 × 2
  col1             col2
  <chr>           <dbl>
1 "6980265"     6980265
2 "x 6969100"   6969100
3 "123456"       123456
4 "euro1000"       1000
5 "x 6969100 "  6969100
6 "x 6969943"   6969943
7 "x y 6977416" 6977416
8 "xy 6923012 " 6923012
Question:
Why does parse_number() sometimes ignore the numeric part when there is a non-alphanumeric character in front of it and sometimes not?
1 Answer
The characters "." and "-" can be part of a well-formed number, along with the ten digits, "+", "e", and maybe others. Presumably parse_number() scans for any of those characters, and when it finds one, it tries to parse that character and the ones following it as a number. In your NA cases, the "." or "-" was not part of the number; it stood by itself, so the parse failed before the digits were reached.
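If that reading is right, one way around it is to isolate the digits yourself before parsing. The following is only a sketch under that assumption: the regular expression and the names direct, first_number and cleaned are illustrative choices, not something readr prescribes. It drops any leading minus sign, so it only suits non-negative values like the ones in the question.

```r
library(readr)
library(stringr)

x <- c("x (6969100)", "x. (6969100) ", "x y. (6977416)",
       "x-y (6923012) ", "1,234.56", "euro1,000")

# Parsing directly: a stray "." or "-" ahead of the digits can make
# parse_number() give up, as in the question's rows 5, 7 and 8.
direct <- suppressWarnings(parse_number(x))

# Assumed workaround: keep only the first run that starts with a digit
# and continues with digits, grouping commas or decimal points.
# This discards any sign, so negative values would come out positive.
first_number <- str_extract(x, "[0-9][0-9,.]*")
cleaned <- parse_number(first_number)

data.frame(x, direct, cleaned)
```

Unlike stripping every non-alphanumeric character, this keeps the decimal point in "1,234.56", so that value is not turned into 123456.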
parse_number("-$3")
– Gusbourne Commented Mar 15 at 22:52