readr - Inconsistent behavior of parse_number() with non-alphanumeric strings in R

I'm having trouble using readr::parse_number() on a character column that sometimes contains non-alphanumeric prefixes before the number. In some cases, parse_number() extracts the number correctly, while in other cases it returns NA.

For example, consider the following reproducible example:

library(tidyverse)

df <- structure(list(col1 = c("6980265", 
                               "x (6969100)", 
                               "1,234.56", 
                               "euro1,000",
                               "x. (6969100) ", 
                               "x (6969943)", 
                               "x y.  (6977416)", 
                               "x-y  (6923012) ")), 
               class = c("tbl_df", "tbl", "data.frame"), 
               row.names = c(NA, -8L))

# Directly using parse_number() on col1
df2 <- df |> 
  mutate(col2 = parse_number(col1, trim_ws = TRUE))

df2

The output is:

# A tibble: 8 × 2
  col1                  col2
  <chr>                <dbl>
1 "6980265"         6980265 
2 "x (6969100)"     6969100 
3 "1,234.56"           1235.
4 "euro1,000"          1000 
5 "x. (6969100) "        NA 
6 "x (6969943)"     6969943 
7 "x y.  (6977416)"      NA 
8 "x-y  (6923012) "      NA 
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `col2 = parse_number(col1, trim_ws =
  TRUE)`.
Caused by warning:
! 3 parsing failures.
row col expected          actual
  5  -- a number x. (6969100)   
  7  -- a number x y.  (6977416)
  8  -- a number x-y  (6923012)

After removing all non-alphanumeric characters (other than spaces), the code works:

df |> 
  mutate(col1 = str_replace_all(col1, "[^[:alnum:] ]", "")) |> 
  mutate(col2 = parse_number(col1, trim_ws = TRUE))

# A tibble: 8 × 2
  col1              col2
  <chr>            <dbl>
1 "6980265"      6980265
2 "x 6969100"    6969100
3 "123456"        123456
4 "euro1000"        1000
5 "x 6969100 "   6969100
6 "x 6969943"    6969943
7 "x y  6977416" 6977416
8 "xy  6923012 " 6923012

Question:

Why does parse_number() sometimes ignore the numeric part when there is a non-alphanumeric character in front of it and sometimes not?

Asked Mar 15 at 21:26 by TarJae

3 github/tidyverse/readr/issues/1428 – Ben Bolker Commented Mar 15 at 22:08
3 The hack you used multiplied one of your numbers by 100. – IRTFM Commented Mar 15 at 22:09
@IRTFM You are right. I was not aware of that! – TarJae Commented Mar 15 at 22:11
@TarJae, if you replace non-alphanumeric with space instead of empty-string it should work, right? – Ben Bolker Commented Mar 15 at 22:20
Another related issue for illustration: Currency parser; parse_number("-$3") – Gusbourne Commented Mar 15 at 22:52

1 Answer

The characters "." and "-" can be part of a well-formed number, along with the ten digits, "+", "e", and maybe others. Presumably parse_number() scans for the first of those characters and, when it finds one, tries to parse it and the characters that follow as a number. In your NA cases the "." or "-" was not part of a number: it stood on its own ahead of the digits, so the parse failed at that point and never reached the digits.
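
If that reading is right, one way around it is to isolate the digit run yourself before handing it to parse_number(). The following is only a sketch under that assumption: extract_first_number() is a hypothetical helper (not part of readr), and its pattern only covers unsigned numbers that use "," as the grouping mark and "." as the decimal mark, so negative values such as "-$3" would need a different pattern.

```r
library(readr)
library(stringr)

x <- c("6980265", "x (6969100)", "1,234.56", "euro1,000",
       "x. (6969100) ", "x (6969943)", "x y.  (6977416)", "x-y  (6923012) ")

# Hypothetical helper (an assumption for illustration, not a readr function):
# keep only the first run that starts with a digit, then let parse_number()
# deal with the grouping and decimal marks inside that run.
extract_first_number <- function(s) {
  # A stray "." or "-" before the digits is never matched, because the
  # pattern must begin with a digit. Strings without digits give NA.
  candidate <- str_extract(s, "[0-9][0-9.,]*")
  parse_number(candidate)
}

extract_first_number(x)
```

Unlike stripping every non-alphanumeric character, this keeps the "," and "." inside "1,234.56", so that value is not turned into 123456.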
