I have a list of names where certain letters are in what I understand to be UTF-8 format. I wish to convert these back to standard characters, if that's possible. I've tried some suggestions I have seen in other posts, such as iconv(#insertname, to="UTF-8"), with no success. My impression is that this fails because the strings mix ordinary characters with UTF-8 escapes. The only other thing I thought of is changing these via gsub(), although that wouldn't be a blanket fix if new ones pop up. iconv() does seem very helpful, so maybe I am just using it wrong.
Here are a couple of examples:
\\u041C Brown (this should be M Brown)
T Blan\u0441a (this should be T Blanca)
asked Mar 20 at 10:15
JoeJoe
1 Answer
You can try
x <- c("\\u041C Brown", "T Blan\\u0441a")
stringi::stri_unescape_unicode(x)
[1] "М Brown" "T Blanсa"
Or
as.character(str2expression(sprintf('"%s"', x)))
[1] "М Brown" "T Blanсa"
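If neither stringi nor parsing the string as R code is an option, the same decoding can be sketched in base R. This is only an illustration: unescape_u is a hypothetical helper name, and it assumes every escape has exactly four hex digits (so no surrogate pairs and no \U escapes).

```r
# Hypothetical helper: decode literal \uXXXX escapes using base R only
unescape_u <- function(x) {
  # locate every backslash-u followed by four hex digits
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    # drop the leading "\u", read the hex digits, convert to a character
    vapply(
      esc,
      function(e) intToUtf8(strtoi(substring(e, 3), 16L)),
      character(1),
      USE.NAMES = FALSE
    )
  })
  x
}

x <- c("\\u041C Brown", "T Blan\\u0441a")
decoded <- unescape_u(x)
decoded
```

Because the input is never parsed as R code, names containing quotes should pass through untouched, which is not the case for the str2expression() approach.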
Manual approach
Another idea would be to download a Cyrillic Unicode/UTF-8 mapping table and use it to replace the escape sequences with the corresponding characters
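library(XML)    # readHTMLTable()
library(RCurl)  # getURL()
library(rlist)  # list.clean()

# Download the chart page starting at code point 1024 (U+0400, Cyrillic block)
# note: ssl.verifypeer = FALSE disables TLS certificate verification
theurl <- getURL(
  "https://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024",
  .opts = list(ssl.verifypeer = FALSE)
)
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)

# Keep the code point and character columns
utf <- tables[[2]][, c(1, 2)]
# Turn "U+041C" into the literal text "\u041C"
utf$`Unicodecode point` <- gsub("U\\+", "\\\\u", utf$`Unicodecode point`)

# Named vector: names are the escapes, values are the characters
replacement_map <- setNames(utf$character, utf$`Unicodecode point`)

stringi::stri_replace_all_fixed(
  c("\\u041C Brown", "T Blan\\u0441a"),
  pattern = names(replacement_map),
  replacement = replacement_map,
  vectorize_all = FALSE
)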
[1] "М Brown" "T Blanсa"
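Note that the decoded letters are still Cyrillic lookalikes (U+041C and U+0441), not the Latin "M" and "c". If plain Latin letters are what is wanted, one possible follow-up is a small lookalike table passed to chartr(). The mapping below is an assumption that covers only the two letters from the question and would need to be extended by hand for any others.

```r
# Decoded names, written with \u escapes so the Cyrillic letters are explicit
y <- c("\u041C Brown", "T Blan\u0441a")

# Assumed lookalike table: Cyrillic letter -> Latin letter, matched by position
cyrillic <- "\u041C\u0441"
latin    <- "Mc"

z <- chartr(cyrillic, latin, y)
z
```

A generic Cyrillic-to-Latin transliteration is not a substitute here, since it follows pronunciation rather than appearance (Cyrillic "с" is conventionally romanized as "s").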