I am trying to make JavaScript print all Unicode characters. According to my research, there are 1,114,112 Unicode characters.
A script like the following could work:
for(i = 0; i < 1114112; i++)
console.log(String.fromCharCode(i));
But I found out that only 10% of the 1,114,112 Unicode characters are used.
How can I print only the used Unicode characters?
asked Mar 29, 2014 at 22:16 by Progo; edited Mar 29, 2014 at 22:40 by P̲̳x͓L̳
- What do you mean by used characters? – Anthony Raymond Commented Mar 29, 2014 at 22:19
- Note that JavaScript strings are UTF-16, so you'll have to manage surrogate pairs. And, the display of each character also depends on the font being used and whether it has a glyph defined for the code point. If you're trying to determine which code points the font supports, that information isn't generally made available in most JavaScript environments. – Jonathan Lonowski Commented Mar 29, 2014 at 22:20
- @AnthonyRaymond As I said in my question, "only 10% of the 1,114,112 Unicode characters are used." – Progo Commented Mar 29, 2014 at 22:48
- @Progo: And what does "used" mean? – Thanatos Commented Mar 30, 2014 at 1:06
- @Thanatos Since only 10% of the 1,114,112 possible Unicode characters are used, the rest is unused and reserved for future use. – Progo Commented Mar 30, 2014 at 1:23
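To illustrate the surrogate-pair comment above, here is a minimal sketch of why String.fromCharCode is not enough for code points above U+FFFF. It assumes an engine with String.fromCodePoint (ES2015, so newer than this question); toSurrogatePair is a hypothetical helper name used only for illustration.

```javascript
const codePoint = 0x1F600; // a code point above U+FFFF, outside the BMP

// String.fromCharCode keeps only the low 16 bits, so this yields U+F600.
const truncated = String.fromCharCode(codePoint);

// String.fromCodePoint (ES2015) emits the UTF-16 surrogate pair for us.
const viaBuiltIn = String.fromCodePoint(codePoint);

// The same pair computed by hand (hypothetical helper, for illustration).
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return String.fromCharCode(high, low);
}
const byHand = toSurrogatePair(codePoint);

console.log(truncated.charCodeAt(0).toString(16)); // f600
console.log(viaBuiltIn.length);                     // 2
console.log(viaBuiltIn === byHand);                 // true
```

So a loop that goes up to 1,114,111 would need String.fromCodePoint (or the manual pair) for everything past 0xFFFF.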
3 Answers
As Jukka said, JavaScript has no built-in way of knowing whether a given Unicode code point has been assigned a symbol yet or not.
There is still a way to do what you want, though.
I’ve written several scripts that parse the Unicode database and create separate data files for each category, property, script, block, etc. in Unicode. I’ve also created an HTTP API that allows you to programmatically get all code points (i.e. an array of numbers) in a given Unicode category, or all symbols (i.e. an array of strings, one for each character) with a given Unicode property, or a regular expression that matches any symbol in a certain Unicode script.
For example, to get an array of strings that contains one item for each Unicode code point that has been assigned a symbol in Unicode v6.3.0, you could use the following URL:
http://mathias.html5/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B
Note that you can prepend and append anything you like to the output by tweaking the URL parameters, to make it easier to reuse the data in your own scripts. An example HTML page that console.log()s all these symbols, as you requested, could be written as follows:
<!DOCTYPE html>
<meta charset="utf-8">
<title>All assigned Unicode v6.3.0 symbols</title>
<script src="http://mathias.html5/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B"></script>
<script>
window.symbols.forEach(function(symbol) {
// Do what you want to do with `symbol` here, e.g.
console.log(symbol);
});
</script>
Demo. Note that since this is a lot of data, you can expect your DevTools console to be slow when opening this page.
Update: Nowadays, you should use Unicode data packages such as unicode-11.0.0 instead. In Node.js, you can then do the following:
const symbols = require('unicode-11.0.0/Binary_Property/Assigned/symbols.js');
console.log(symbols);
// Or, to get the code points:
require('unicode-11.0.0/Binary_Property/Assigned/code-points.js');
// Or, to get a regular expression that only matches these characters:
require('unicode-11.0.0/Binary_Property/Assigned/regex.js');
There is no direct way in JavaScript to find out whether a code point is assigned to a character or not, which appears to be the question here. You need information extracted from suitable sources, and this information needs to be updated whenever new characters are assigned in new versions of Unicode.
There are 1,114,112 code points in Unicode. The Unicode standard assigns to each code point the property gc, General Category. If the value of this property is anything but Cs, Co, or Cn, then the code point is assigned to a character. (Code points with gc equal to Co are Private Use code points, to which no character is assigned, but they may be used for characters by private agreements.)
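The General Category rule above can be sketched with regular expressions, assuming an engine that supports ES2018 Unicode property escapes (which postdate these answers). The result depends on the Unicode version bundled with the engine, so this is an approximation rather than an authoritative list; isAssignedCharacter and assignedCharacters are hypothetical names.

```javascript
// Matches exactly one code point whose General Category is not
// Cn (unassigned), Cs (surrogate), or Co (private use).
const ASSIGNED_CHARACTER = /^[^\p{Cn}\p{Cs}\p{Co}]$/u;

// Hypothetical helper: true if the engine's Unicode data says the
// code point is assigned to a character.
function isAssignedCharacter(codePoint) {
  return ASSIGNED_CHARACTER.test(String.fromCodePoint(codePoint));
}

// Hypothetical helper: lazily yields every assigned character in a range.
function* assignedCharacters(from = 0, to = 0x10FFFF) {
  for (let cp = from; cp <= to; cp++) {
    if (isAssignedCharacter(cp)) yield String.fromCodePoint(cp);
  }
}

const letters = [...assignedCharacters(0x41, 0x45)].join('');
console.log(letters); // ABCDE
```

Iterating the generator with its default arguments walks the whole code space, so expect it to take a while and to produce a lot of output.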
What you would need to do is to get a copy of some relevant files in the Unicode character database (just a collection of files in specific formats, really) and write code that reads them and generates information about assigned code points. For the purposes of printing all Unicode characters, it might be best to generate the information as an array of ranges of assigned code points. And this would need to be repeated when the standard is updated with new characters.
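As a rough sketch of that approach, the following reads text in the UnicodeData.txt format (semicolon-separated fields: code point in hex, name, General Category, and so on, with large blocks given as a "First"/"Last" pair of lines) and produces merged ranges. parseAssignedRanges is a hypothetical helper name, and the sample lines are a tiny hand-picked excerpt, not the real file.

```javascript
// Hypothetical helper: returns merged [start, end] ranges of code points
// whose General Category is not Cs, Co, or Cn.
function parseAssignedRanges(unicodeDataText) {
  const excluded = new Set(['Cs', 'Co', 'Cn']);
  const ranges = [];
  let pendingStart = null; // set by a "<..., First>" line

  for (const line of unicodeDataText.split(/\r?\n/)) {
    if (line.trim() === '') continue;
    const fields = line.split(';');
    const codePoint = parseInt(fields[0], 16);
    const name = fields[1];
    const category = fields[2];

    if (name.endsWith(', First>')) {
      pendingStart = codePoint;
      continue;
    }
    let start = codePoint;
    if (name.endsWith(', Last>')) {
      start = pendingStart ?? codePoint;
      pendingStart = null;
    }
    if (excluded.has(category)) continue;

    const last = ranges[ranges.length - 1];
    if (last && last[1] + 1 === start) {
      last[1] = codePoint; // contiguous: extend the previous range
    } else {
      ranges.push([start, codePoint]);
    }
  }
  return ranges;
}

// Tiny excerpt in UnicodeData.txt format (assumed sample, not the full file).
const sample = [
  '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;',
  '0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;',
  '0044;LATIN CAPITAL LETTER D;Lu;0;L;;;;;N;;;;0064;',
  '4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;',
  '9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;',
  'E000;<Private Use, First>;Co;0;L;;;;;N;;;;;',
  'F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;',
].join('\n');

const ranges = parseAssignedRanges(sample);
console.log(JSON.stringify(ranges)); // [[65,66],[68,68],[19968,40959]]
```

Code points that do not appear in the file at all are unassigned, which is why the sketch only needs to filter out the Cs and Co entries that are listed.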
Even the rest isn’t trivial. You would need to decide what it means to print a character. Some characters are control characters that may have an effect, such as causing a newline, but lack a visible glyph. Some (spaces) have empty glyphs. Some (combining marks) are meant to be rendered as marks attached to the preceding character, though they have conventional renderings as “standalone” characters, too. Some are meant to take essentially different shapes depending on the nearest context; they may have isolated forms, too, but just writing one character after another by no means guarantees that an isolated form is used.
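A minimal sketch of telling some of these cases apart, again assuming ES2018 Unicode property escapes are available. describeCodePoint is a hypothetical name, and the three buckets are only a coarse illustration; contextual shaping is not something a per-character check like this can capture.

```javascript
// Hypothetical helper: coarse classification by General Category.
function describeCodePoint(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  if (/^\p{Cc}$/u.test(ch)) return 'control';       // e.g. newline
  if (/^\p{Zs}$/u.test(ch)) return 'space';         // empty glyph
  if (/^\p{M}$/u.test(ch)) return 'combining mark'; // attaches to previous char
  return 'other';
}

console.log(describeCodePoint(0x0A));   // control
console.log(describeCodePoint(0x20));   // space
console.log(describeCodePoint(0x0301)); // combining mark
console.log(describeCodePoint(0x41));   // other
```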
Then there’s the problem of fonts. No single font can contain all Unicode characters, so you would need to find a collection of fonts that cover all of Unicode when used together, preferably so that they stylistically match somehow.
So if you are just looking for a compilation of all printable Unicode characters, consider using the Unicode code charts.
The trouble here is that JavaScript is not, contrary to popular opinion, a Unicode environment.
Internally, it uses UCS-2, an incompatible 16-bit encoding method that predates UTF-16.
In addition, many Unicode characters are not directly printable by themselves. Some of them are modifiers for the preceding character. For example, the Spanish letter ñ can be written in Unicode either as a single code point (U+00F1, the character itself) or as two code points: n followed by a combining tilde (U+0303).
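The two spellings of ñ can be compared directly. This small sketch assumes String.prototype.normalize (ES2015) is available; the variable names are just for illustration.

```javascript
const composed = '\u00F1';    // ñ as a single code point
const decomposed = 'n\u0303'; // n followed by a combining tilde

console.log(composed === decomposed);                   // false
console.log(composed.length, decomposed.length);        // 1 2
console.log(decomposed.normalize('NFC') === composed);  // true
console.log(composed.normalize('NFD') === decomposed);  // true
```

Both strings render as the same letter, so printing every code point one by one will show the combining tilde on its own rather than attached to anything.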
Here are a couple of resources that should really help you in understanding this:
- http://mathiasbynens.be/notes/javascript-encoding
- http://mathiasbynens.be/notes/javascript-unicode