
r - Convert any UTF-8 to character form when both appear - Stack Overflow


I have a list of names where certain letters are in what I understand to be UTF-8 format. I wish to convert any of these back to standard character format, if that's possible. I've tried some suggestions I have seen in other posts, such as iconv(#insertname, to="UTF-8"), with no success. I'm under the impression that this is because I have a mix of plain characters and UTF-8. The only other thing I thought of is changing these via gsub(), although that wouldn't be a blanket fix if new ones pop up. It does seem that iconv() is very helpful; maybe I am just using it wrong.

Here are a couple of examples:

  \\u041C Brown (this should be M Brown)
  T Blan\u0441a (this should be T Blanca)

asked Mar 20 at 10:15 by Joe

1 Answer


You can try

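# The doubled backslash makes each string hold a literal \uXXXX escape, as in the data
x <- c("\\u041C Brown", "T Blan\\u0441a")
# Turn every \uXXXX escape into the character it stands for
stringi::stri_unescape_unicode(x)
[1] "М Brown"  "T Blanсa"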
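
One likely reason iconv() had no effect, sketched under the assumption that the names are stored exactly as shown in the question: the strings are plain ASCII text containing a literal backslash, a "u" and four hex digits, so there are no encoded bytes for iconv() to convert. The check below uses only base R; the variable names is_ascii and unchanged are made up for illustration.

```r
x <- c("\\u041C Brown", "T Blan\\u0441a")

# Each escape is six separate ASCII characters: \, u, and four hex digits
nchar(x[1])  # 12

# Every code point is below 128, so the input is pure ASCII
is_ascii <- vapply(x, function(s) all(utf8ToInt(s) < 128L),
                   logical(1), USE.NAMES = FALSE)

# Re-encoding ASCII text leaves it unchanged, so the escapes stay as they are
unchanged <- identical(iconv(x, to = "UTF-8"), x)
```

If both is_ascii and unchanged come back TRUE for your data, the task is unescaping rather than re-encoding.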

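Or, using base R only

# Wrap each string in double quotes and parse it as an R string literal,
# which resolves the \uXXXX escapes (this breaks if a name contains a double quote)
as.character(str2expression(sprintf('"%s"', x)))
[1] "М Brown"  "T Blanсa"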
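
If parsing the names as R code feels too fragile, the same result can probably be reached in base R with a regular expression. This is a minimal sketch, not part of the original answer: unescape_u is a hypothetical helper name, and it assumes every escape has exactly four hex digits, as in the examples above.

```r
# Hypothetical helper: replace each literal \uXXXX with the character it encodes
unescape_u <- function(x) {
  # Locate every backslash-u followed by four hex digits
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    # Drop the leading "\u", read the rest as a hex number, convert to a character
    vapply(esc, function(e) intToUtf8(strtoi(substring(e, 3), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

x <- c("\\u041C Brown", "T Blan\\u0441a", "J Smith")
unescape_u(x)
```

Names without escapes, such as "J Smith" here, pass through untouched, which suits a column that mixes plain and escaped entries.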

Manual approach

Another idea is to download a Cyrillic Unicode mapping table and use it to replace the escape sequences with the characters they stand for

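library(XML)    # readHTMLTable()
library(RCurl)  # getURL()
library(rlist)  # list.clean()

# Fetch the chart page starting at U+0400 (decimal 1024), the Cyrillic block.
# Note: ssl.verifypeer = FALSE turns off certificate verification.
theurl <- getURL("https://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)

# Keep the code point and character columns
utf <- tables[[2]][, c(1, 2)]
# Turn "U+041C" into the literal text \u041C so it matches the escapes in the data
utf$`Unicodecode point` <- gsub("U\\+", "\\\\u", utf$`Unicodecode point`)

# Named vector: names are the escapes, values are the characters
replacement_map <- setNames(
  utf$character,
  utf$`Unicodecode point`
)

# Replace every escape found in the names with its character
stringi::stri_replace_all_fixed(
  c("\\u041C Brown", "T Blan\\u0441a"),
  pattern = names(replacement_map),
  replacement = replacement_map,
  vectorize_all = FALSE
)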

[1] "М Brown"  "T Blanсa"
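
Note that all three approaches return the Cyrillic letters М (U+041C) and с (U+0441), which only look like the Latin "M" and "c" the question asks for. If plain Latin letters are the goal, a look-alike map is one option. The sketch below is an assumption about what the asker wants, and its two-letter map is hypothetical and would need extending for other letters. A standard Cyrillic-to-Latin transliteration is not a substitute here, since it would turn с into "s" rather than "c".

```r
# Unescaped names, as returned by the approaches above (Cyrillic Em and Es)
y <- c("\u041C Brown", "T Blan\u0441a")

# Hypothetical look-alike map: Cyrillic letter -> visually similar Latin letter
latin <- chartr("\u041C\u0441", "Mc", y)
latin

# Check that nothing outside ASCII is left
all(utf8ToInt(paste(latin, collapse = "")) < 128L)
```

chartr() needs the old and new strings to have the same number of characters, so each Cyrillic letter added to the first argument needs its Latin counterpart at the same position in the second.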