I'm trying to create Korean sentences programmatically, but to do so properly I need a way to determine which of the Hangul Jamo Unicode characters make up each Hangul Syllable Unicode character. More concretely, I'd like to take a collection of Hangul Jamo characters and figure out how to convert them into a Hangul Syllable character. Simply concatenating the strings won't work, and I've looked at the code point values to see if there's an obvious relationship between the code points of the Hangul Jamo and the combined Hangul Syllable, but I don't see one. For example, naively adding the code points does not result in the correct answer:
console.log(('ㄱ'.codePointAt(0) + 'ㅏ'.codePointAt(0)) === '가'.codePointAt(0)); // It does not log true
which is also self evident when looking at the Unicode charts for Hangul Jamo and Hangul Syllables. I haven't found an answer in my searching so far, but there must be a way to programmatically convert the parts to the whole syllable, right?
- 1 Take a look at hangul-js. – user5125954 Commented Nov 18, 2018 at 2:10
2 Answers
I found the answer on Wikipedia. Follow that link for the table of characters, but here is the formula used:
To find Hangul Syllables in Unicode, you can apply a simple formula. The formula and tables are as follows: [{(initial) × 588} + {(medial) × 28} + (final)] + 44032
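As a minimal sketch of that formula applied to the question's inputs, the helper below looks up each jamo's position in the standard initial, medial, and final tables and plugs the three indices into the formula. The names INITIALS, MEDIALS, FINALS, and composeSyllable are made up for this illustration, and the tables assume the jamo are given as Hangul Compatibility Jamo (the characters a Korean keyboard produces, such as 'ㄱ' and 'ㅏ').

```javascript
// Jamo in Unicode syllable order. The position in each list is the index
// used by the formula. These are Hangul Compatibility Jamo (U+3131..U+3163).
const INITIALS = [...'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'];          // 19 entries
const MEDIALS = [...'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'];       // 21 entries
const FINALS = ['', ...'ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ']; // 28 entries, '' = no final

// Hypothetical helper: combine an initial, a medial, and an optional final
// jamo into one precomposed Hangul Syllable character.
function composeSyllable(initial, medial, final = '') {
  const i = INITIALS.indexOf(initial);
  const m = MEDIALS.indexOf(medial);
  const f = FINALS.indexOf(final);
  if (i < 0 || m < 0 || f < 0) {
    throw new RangeError('Not a valid initial/medial/final jamo combination');
  }
  // [(initial × 588) + (medial × 28) + final] + 44032
  return String.fromCodePoint(i * 588 + m * 28 + f + 44032);
}

console.log(composeSyllable('ㄱ', 'ㅏ'));       // 가
console.log(composeSyllable('ㅎ', 'ㅏ', 'ㄴ')); // 한
console.log(composeSyllable('ㄱ', 'ㅡ', 'ㄹ')); // 글
```

The multipliers come from the table sizes: each initial spans 21 × 28 = 588 syllables, and each medial spans 28.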
Here's an example of a really crappy random Korean sentence generator I threw together on JSFiddle. I use the value of the ending character (the final randFin value below) to determine if the last syllable of the word ends in a vowel or a consonant. That determines the form of the particle which comes after it in the resulting (almost certainly unintelligible) sentence. It makes use of the Unicode formula in the getRandomKWord method:
// Returns a random integer in [n, o): n is inclusive, o is exclusive.
var getRandomInt = function(n, o) {
  var min = Math.ceil(n);
  var max = Math.floor(o);
  return Math.floor(Math.random() * (max - min)) + min;
};
// Builds a word of one or two random syllables. The returned `final` is the
// final-consonant index of the last syllable (0 means it ends in a vowel).
var getRandomKWord = function() {
  var word = '';
  var num = getRandomInt(1, 3);
  for (var i = 0; i < num; i++) {
    var randInit = getRandomInt(0, 19) * 588; // 19 initial consonants
    var randMed = getRandomInt(0, 21) * 28;   // 21 medial vowels
    var randFin = getRandomInt(0, 28);        // 27 finals, plus 0 for none
    var hangulFormula = randInit + randMed + randFin + 44032;
    word = word + String.fromCodePoint(hangulFormula);
  }
  return { word: word, final: randFin };
};
var title = document.getElementById('title');
var subject = getRandomKWord();
var object = getRandomKWord(); // don't use 'object' as a variable name
var verb = getRandomKWord();
// 은/을 follow a final consonant (non-zero index); 는/를 follow a vowel.
var subParticle = subject.final ? '은' : '는';
var objParticle = object.final ? '을' : '를';
var text = subject.word +
  subParticle +
  object.word +
  objParticle +
  verb.word +
  '습니다.';
title.innerText = text;
<h1 id='title'></h1>
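The same vowel-or-consonant check can be run on an existing word by reversing the formula: subtract 44032 from the last syllable's code point and take the remainder modulo 28, which recovers the final index. The sketch below assumes the word ends in a precomposed Hangul Syllable (U+AC00 to U+D7A3); endsInConsonant, withTopicParticle, and withObjectParticle are hypothetical names chosen for this example.

```javascript
const HANGUL_BASE = 44032; // U+AC00, the first Hangul Syllable (가)
const HANGUL_LAST = 55203; // U+D7A3, the last Hangul Syllable (44032 + 19 * 588 - 1)

// True when the last syllable of `word` has a final consonant.
function endsInConsonant(word) {
  const cp = word.codePointAt(word.length - 1);
  if (cp === undefined || cp < HANGUL_BASE || cp > HANGUL_LAST) {
    throw new RangeError('Word must end in a precomposed Hangul Syllable');
  }
  // The final index is the remainder after removing the initial and medial parts.
  return (cp - HANGUL_BASE) % 28 !== 0;
}

// 은 follows a consonant, 는 follows a vowel.
function withTopicParticle(word) {
  return word + (endsInConsonant(word) ? '은' : '는');
}

// 을 follows a consonant, 를 follows a vowel.
function withObjectParticle(word) {
  return word + (endsInConsonant(word) ? '을' : '를');
}

console.log(withTopicParticle('한국')); // 한국은
console.log(withTopicParticle('나'));   // 나는
console.log(withObjectParticle('책'));  // 책을
```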
All sequences of jamo that form valid Korean syllables exist as precomposed characters in Unicode. In addition, all such precomposed characters have canonical decompositions to the sequence of jamo, which means that any text in Normalization Form C will have those precomposed characters instead of jamo sequences.
So, simply normalizing the string that consists of jamo will result in as many precomposed syllables as possible. This can be done in JavaScript with s.normalize("NFC").
If you don't care about having jamo sequences or precomposed syllables, but only care about the results comparing equal, then you can normalize the strings to either normalization form (C or D), as long as they both have the same form.
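A short sketch of both uses follows. One assumption worth stating loudly: canonical composition applies to the conjoining jamo in the Hangul Jamo block (U+1100 to U+11FF). The Hangul Compatibility Jamo that keyboards usually produce (U+3131 'ㄱ', U+314F 'ㅏ', and so on) have no canonical decomposition, so NFC leaves them as they are. The helper name sameHangul is made up for this example.

```javascript
// Conjoining jamo from the Hangul Jamo block.
const jamoGa = '\u1100\u1161';        // initial ᄀ + medial ᅡ
const jamoGak = '\u1100\u1161\u11A8'; // initial ᄀ + medial ᅡ + final ᆨ

// NFC composes each jamo sequence into a single precomposed syllable.
console.log(jamoGa.normalize('NFC') === '\uAC00');  // true (가)
console.log(jamoGak.normalize('NFC') === '\uAC01'); // true (각)

// NFD goes the other way: one syllable becomes its jamo sequence.
console.log('각'.normalize('NFD') === jamoGak);     // true

// Hypothetical helper: compare two strings regardless of which form each
// one arrived in, by bringing both to the same normalization form first.
function sameHangul(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}
console.log(jamoGak === '각');           // false (different code points)
console.log(sameHangul(jamoGak, '각'));  // true

// Compatibility jamo are not composed by NFC.
const compat = '\u3131\u314F'; // 'ㄱㅏ'
console.log(compat.normalize('NFC') === compat); // true (unchanged)
```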
Also relevant, the Unicode FAQ on Korean has a list of situations where Normalization Form C will contain jamo instead of syllables:
If the text is in NFD, then it will only contain Jamo. If it is in NFC (or unnormalized), most text will be Hangul Syllables. However, Jamo could occur in certain circumstances:
(a) isolated Jamo
(b) pre-1933 orthography Korean text
(c) modern incomplete syllables (e.g. syllables without a leading consonant as used in dictionaries and grammar books)
(d) syllables used for a more faithful phonetic representation of some dialects
In the latter case, there are two possibilities. If the L or V are ancient Jamo, then the entire syllable would be in Jamo. If both are modern Jamo but the T is ancient, then the syllable would be represented by a sequence of two characters: a single code point for LV, followed by the code point for the T: <LV, T>
This is similar to the case of Latin. The NFC form of A + grave + umlaut is <A-grave, umlaut>: part is precomposed and the remainder is not.
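A few of these cases can be checked directly with normalize. This is only an illustrative sketch: codePoints is a hypothetical helper for printing, and U+11C3 is used here as one example of an old (non-modern) trailing consonant, since the modern finals end at U+11C2.

```javascript
// Hypothetical helper: list a string's code points as "U+XXXX U+XXXX ...".
function codePoints(s) {
  return [...s]
    .map(c => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');
}

// (a) An isolated jamo has nothing to combine with, so it stays a jamo.
const isolated = '\u1100'.normalize('NFC');

// (c) A medial plus final with no leading consonant does not compose.
const noLeading = '\u1161\u11A8'.normalize('NFC');

// Modern L and V with an old T: NFC yields <LV, T>.
const lvPlusOldT = '\u1100\u1161\u11C3'.normalize('NFC');

// The Latin analogue: A + grave + umlaut becomes <A-grave, umlaut>.
const latin = 'A\u0300\u0308'.normalize('NFC');

console.log(codePoints(isolated));   // U+1100
console.log(codePoints(noLeading));  // U+1161 U+11A8
console.log(codePoints(lvPlusOldT)); // U+AC00 U+11C3
console.log(codePoints(latin));      // U+00C0 U+0308
```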