Why does R treat non-ASCII characters differently depending on the SSH client's OS? - Stack Overflow

When run over SSH, R appears to treat non-ASCII characters differently depending on the OS of the SSH client.

For example, if I use a computer running macOS (14.6.1) to start an R session on an Ubuntu machine (22.04.5), and run:

units::set_units(12.7, "\U00B5m")

I get:

12.7 [µm]

But the same expression, run on the same server from a Windows client (10.0.19045.4170), produces:

Error: In '<U+00B5>m', '<U+00B5>m' is not recognized by udunits.

I thought that this could have to do with how the command line on each OS sends the character representations over SSH. However, if I save the following script on the server (written using vim over SSH from the macOS machine):

#!/bin/Rscript

print(nchar("µm"))

And execute it over SSH from the macOS client (e.g., ssh <user>@<host> "./print_micron.R"), I get:

[1] 2

i.e., "µ" is a single two-byte character. But if I execute it from the Windows client, I get:

[1] 3

i.e., "µ" becomes two separate characters, one for each byte.

This challenges my intuition about how executing commands over SSH works, as I would expect the behavior of R to be determined entirely by the server. Why would the client OS affect how non-ASCII characters are represented by R?

Asked Jan 17 at 15:42 by Eonema; edited Jan 17 at 15:48.

1 Answer

Your Mac probably has LANG=en_US.UTF-8 (or something similar) in the environment, which sets the default locale to use the UTF-8 encoding. It probably also has SSH configured to forward that environment variable to the server (SendEnv LANG in the SSH config). That causes R to use UTF-8 for its internal encoding and for reading source files.

Your Windows SSH client, on the other hand, is likely not sending any such variable, and nothing on the server is defaulting it, so you get the C locale, which is ASCII-only. That causes units to not know what character B5 means (there are no characters above 7F in ASCII!), and causes the string literal in your test script to be interpreted as three characters (one per byte) instead of two.
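
The same effect can be reproduced without R. This is a minimal sketch, assuming a `wc` that honours the locale (as GNU coreutils does) and a server where the `C.UTF-8` locale is installed, which is the case on stock Ubuntu 22.04. The helper names `micron_bytes` and `count_chars` are made up for illustration.

```shell
#!/bin/sh
# "µm" encoded as UTF-8 is three bytes: 0xC2 0xB5 for "µ", then 0x6D for "m".
micron_bytes() {
  printf '\302\265m'
}

# Count characters under the locale given as $1 (hypothetical helper).
# wc -m counts characters as defined by the active locale's encoding.
count_chars() {
  micron_bytes | LC_ALL="$1" wc -m | tr -d ' '
}

# The byte count does not depend on the locale: always 3.
micron_bytes | wc -c | tr -d ' '

# ASCII-only C locale: each byte is counted as one character, giving 3.
count_chars C

# UTF-8 locale (assumed to be installed): 0xC2 0xB5 decodes to the single
# character "µ", giving 2.
count_chars C.UTF-8
```

The bytes in the script file never change; only the locale that the process inherits decides how they are grouped into characters, which matches the `[1] 2` versus `[1] 3` results in the question.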

You should be able to see the difference by running sessionInfo() and l10n_info() from the two different clients: they will show different values for locale, codeset, and UTF-8.
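
The comparison can also be made from the shell, before R starts. The following is only a sketch: `report_locale` is a hypothetical helper, and it assumes the standard `locale` utility is available on the server. It prints the locale variables the session received, the character encoding derived from them, and R's own view if `Rscript` is on the `PATH`.

```shell
#!/bin/sh
# Run this on the server from each client's SSH session and compare the output.
report_locale() {
  # Variables the SSH session inherited (or "<unset>" if the client sent none).
  printf 'LANG=%s\n' "${LANG-<unset>}"
  printf 'LC_ALL=%s\n' "${LC_ALL-<unset>}"
  printf 'LC_CTYPE=%s\n' "${LC_CTYPE-<unset>}"
  # Character encoding that the C library derives from those variables.
  printf 'charmap=%s\n' "$(locale charmap)"
  # If R is installed, ask it the same question.
  if command -v Rscript >/dev/null 2>&1; then
    Rscript -e 'print(l10n_info())'
  fi
}
```

If the explanation above is right, the macOS session should report a UTF-8 charmap, while the Windows session should show `LANG` unset and an ASCII charmap.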

If your Windows SSH client can handle UTF-8, you should be able to do either of the following:

  1. Add LANG to the list of environment variables it sends to the server, if it has a config for that.
  2. Add export LANG=en_US.UTF-8 (or whatever value is appropriate for you) to your ~/.profile or ~/.bashrc on the server (or whatever you've got that gets automatically sourced when you log in).
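
Option 2 can be sketched as a small helper that adds the export only when the file does not already set `LANG`. The function name `ensure_lang_export` is hypothetical, and `en_US.UTF-8` is just an example value; the locale you choose must actually exist on the server (check with `locale -a`).

```shell
#!/bin/sh
# Append "export LANG=<value>" to a shell startup file unless the file
# already contains an export of LANG.
#   $1: startup file, e.g. "$HOME/.profile"
#   $2: locale name, e.g. en_US.UTF-8 (assumed to be installed on the server)
ensure_lang_export() {
  file=$1
  value=$2
  if [ -f "$file" ] && grep -q '^export LANG=' "$file"; then
    return 0   # leave an existing setting alone
  fi
  printf 'export LANG=%s\n' "$value" >> "$file"
}

# Example call (commented out so that nothing is modified by accident):
# ensure_lang_export "$HOME/.profile" en_US.UTF-8
```

A command passed directly to `ssh` runs in a non-interactive shell, which may not read every startup file, so it is worth confirming the result with `ssh <user>@<host> locale` afterwards. For option 1 with an OpenSSH client, the setting is `SendEnv LANG` in the client config or `-o SendEnv=LANG` on the command line; the variable must be set on the client, and the server must be configured to accept it (`AcceptEnv`).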