
How can I load a very large dictionary with JavaScript without freezing the DOM? - Stack Overflow


I wrote a spell-checking script that utilizes a long list of words. I formatted this large list of words as a JavaScript object, and I load the list as a script. So when it's loaded, that very big object is parsed.

dictionary.js

var dictionary = {
   "apple" : "apple",
   "banana" : "banana",
   "orange" : "orange"
}

Formatted this way for instant word validity checking:

if (dictionary["apple"]){
    //word is valid
}

The problem is that this giant object being parsed causes a significant DOM freeze.

How can I load my data structure in a way so that I can parse it piece by piece? How can I store/load my data structure so that the DOM can handle it without freezing?


asked Nov 9, 2014 at 22:11 by user4231564; edited Nov 10, 2014 at 5:44
  • 1 Perhaps consider separating the list by starting letter or so. – techfoobar Commented Nov 9, 2014 at 22:14
  • Welcome to S.O. Please read "How to ask". Did you think of another way to store your data, instead of in memory? – Leandro Bardelli Commented Nov 9, 2014 at 22:14
  • 1 @Leandro I know precisely how to ask questions on Stack Overflow. Did you find a problem with the way this question was asked? – user4231564 Commented Nov 10, 2014 at 1:45
  • 2 @Leandro I understand.. Well, I am not really sure where to start, I guess that's the case with questions sometimes.. – user4231564 Commented Nov 10, 2014 at 1:50
  • 2 Did you consider Web Workers? developer.mozilla/en-US/docs/Web/Guide/Performance/… – demux Commented Nov 10, 2014 at 2:13
 |  Show 7 more comments
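techfoobar's suggestion above (splitting the list by starting letter) is one practical way to keep any single parse short. A minimal sketch, assuming the list is pre-split server-side into hypothetical per-letter files (dictionary-a.js, dictionary-b.js, ...) loaded as separate scripts:

```javascript
// Shared dictionary object, filled in piece by piece as each chunk loads.
var dictionary = {};

// Each per-letter file calls this with its own words, so no single
// <script> parse blocks the page for long.
function addChunk(chunk) {
    for (var word in chunk) {
        if (Object.prototype.hasOwnProperty.call(chunk, word)) {
            dictionary[word] = chunk[word];
        }
    }
}

// dictionary-a.js would then contain something like:
addChunk({ "apple": "apple", "aardvark": "aardvark" });
// dictionary-b.js:
addChunk({ "banana": "banana" });
```

Because each chunk is small, each parse completes quickly, and lookups like `dictionary["apple"]` work exactly as with the original flat object.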

2 Answers


Write your JS file in the form

var words = JSON.parse('{"aardvark": "aardvark", ...}');

JSON.parse will be several orders of magnitude faster than the JS parser.

The actual lookup will be about 0.01ms by my measurement.
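Concretely, with the three-word sample from the question (for the real list, a build step such as running JSON.stringify over the existing object would generate the string):

```javascript
// The dictionary as a JSON string literal: the JS engine only sees one
// string token, and JSON.parse builds the object much faster than the
// full JS parser would for an equivalent object literal.
var dictionary = JSON.parse(
    '{"apple":"apple","banana":"banana","orange":"orange"}'
);

// Lookups are unchanged:
var isValid = Boolean(dictionary["apple"]); // true
```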

Background

There are several aspects to consider when thinking about performance in this situation, including download bandwidth, parsing, preprocessing or building if needed, memory, and retrieval. In this case, all other performance issues are overwhelmed by the JS parsing time, which could be up to 10 seconds for the 120K entry hash. If downloading from a server, the 2.5MB size could be an issue, but this will zip down to 20% or so of the original size. In terms of retrieval performance, JS hashes are already optimized for fast retrieval; the actual retrieval time might be less than 0.01ms, especially for subsequent accesses to the same key. In terms of memory, there seem to be few ways to optimize this, but most browsers could hold an object this size without breaking a sweat.

The Patricia trie approach is interesting, and addresses mainly the memory usage and download bandwidth issues, but they do not seem in this case to be the main problem areas.

If there are a lot of words in your list, one thing you can do to shrink the size of your dictionary is use a patricia trie. A patricia trie is a special kind of tree optimized for searching for strings. Basically, each node in the trie is a single letter. So for instance:

var dictionary = {
    'a': {
        '': 1, // a
        'a': {
            '': 1, // aa
            'l': {
                '': 1, // aal
                'ii': 1 // aalii
            },
            'm': 1, // aam
            'ni': 1, // aani
            'r': {
                'd': {
                    'vark': 1, // aardvark
                    'wolf': 1 // aardwolf
                },
                'on': {
                    '': 1, // aaron
                    'ic': 1 // aaronic
                }
            }
        },
    },
}

In the above, I'm using the empty string '' (a perfectly valid property key!) to denote the end of a word.
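To make the '' convention concrete, here is a hand-made node for the prefix "a" (illustrative only, not produced by any code in this answer): checking whether a node terminates a word is just another property lookup.

```javascript
// Node reached after consuming the prefix "a":
// the '' key says "a" itself is a word; 'pple' leads on to "apple".
var node = { '': 1, 'pple': 1 };

var aIsWord = node[''] === 1;      // true - '' marks end-of-word
var canExtend = 'pple' in node;    // true - a child edge exists
```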

Here's how you could implement the searching:

function isWord_internal(str, dic) {
    var strlen, i, substr, substrlen;
    strlen = str.length;
    for(i = strlen; i > 0; i--) {
        substr = str.slice(0, i);
        substrlen = substr.length;
        if(dic[substr]) {
            if(dic[substr] === 1) {
                if(substrlen === strlen) {
                    return true;
                }
                return false; // end of the line, folks
            }
            if(dic[substr][''] && substrlen === strlen) {
                return true;
            }
            return isWord_internal(str.slice(i), dic[substr]);
        } // else keep going
    }
    // not found
    return false;
}

function isWord(str) {
    return isWord_internal(str, dictionary); // assumes that the dictionary variable exists already
}

Object property lookups in JS are close to constant time, and each level of the trie tries successively shorter prefixes of the remaining string, so a lookup performs at most O(n²) hash probes for a word of length n. In practice, with short English words, that is effectively constant.

This should also keep the size of your list down, since English words are not random and often have substrings in common (you'll probably want to minify it, though).

To convert your current dictionary, you can use this function to generate it in memory:

function patriciafy_internal(str, pat) {
    var i, substr, patkeys, patkeyslen, j, patkey, node;
    for(i = str.length; i > 0; i--) {
        substr = str.slice(0, i);
        patkeys = Object.keys(pat);
        patkeyslen = patkeys.length;
        for(j = 0; j < patkeyslen; j++) {
            patkey = patkeys[j];
            if(patkey.indexOf(substr) === 0) { // substr is a prefix of this key
                if(patkey === substr && pat[patkey] !== 1) {
                    // exact match on an inner node
                    if(i === str.length) {
                        pat[patkey][''] = 1; // str ends exactly here
                    } else {
                        // keep going down the rabbit hole
                        patriciafy_internal(str.slice(i), pat[patkey]);
                    }
                } else {
                    // split this node on the shared prefix
                    node = {};
                    node[patkey.slice(i)] = pat[patkey]; // old remainder keeps its subtree
                    node[str.slice(i)] = 1;              // new remainder ('' if str ends here)
                    delete pat[patkey];
                    pat[substr] = node;
                }
                return;
            }
        }
    }
    // no shared prefix found - store the whole word at this level
    pat[str] = 1;
}

function patriciafy(dic) {
    var pat, keys, keyslen, str, i;
    pat = {};
    keys = Object.keys(dic);
    keyslen = keys.length;
    for(i = 0; i < keyslen; i++) {
        // insert this str into the trie
        str = keys[i];
        patriciafy_internal(str, pat);
    }
    return pat;
}

Then just use JSON.stringify to print it out for use.
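For instance, serializing a small hand-written trie fragment (patriciafy above would produce an equivalent structure for the two words shown):

```javascript
// "aardvark" and "aardwolf" share the prefix "aard", so the trie
// stores it once, with one leaf per remainder.
var pat = { 'aard': { 'vark': 1, 'wolf': 1 } };

// Ship this string to the client and JSON.parse it there.
var serialized = JSON.stringify(pat);
// serialized === '{"aard":{"vark":1,"wolf":1}}'
```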

A test on my machine with a dictionary of 235,887 words (taken from here) showed that the Patricia tree was a little more than 1/3 smaller than a flat dictionary of the words, where the format of the flat dictionary was {"name":1,"name2":1,...}.

This is still huge, but it's significantly less huge. You could probably also save a sizable amount of space by storing it in a non-JS format, e.g.

var dictionary = {
    'a': {
        '': 1,
        'a': {
            '': 1,
            'l': {
                '': 1,
                'ii': 1
            }
        }
    }
};

which minifies to

// 57 characters
var dictionary={'a':{'':1,'a':{'':1,'l':{'':1,'ii':1}}}};

could become

// 20 characters - a bare @ (empty key) represents ''
{@a{@@a{@@l{@@ii}}}}

which you would then parse.

=====

patricia-tree-to-@-syntax:

function patToAt_internal(pat) {
    var keys, i, s = "", key;
    s += "{";
    keys = Object.keys(pat);
    for(i = 0; i < keys.length; i++) {
        key = keys[i];
        s += "@";
        s += key; // empty string will do nothing
        if(pat[key] !== 1) { // concat the sub-trie (it adds its own braces)
            s += patToAt_internal(pat[key]);
        }
    }
    s += "}";
    return s;
}

function patToAt(pat) {
    return patToAt_internal(pat);
}

@-syntax-to-patricia-tree:

function atToPat_internal(at, pos) {

    var s = "", result, ff = true, ch;
    s += "{";
    pos++; // eat { character
    while((ch = at.charAt(pos)) !== "}") {
        // eat @ character
        pos++;
        // comma between entries
        if(!ff) s += ",";
        else ff = false;
        // start key quote
        s += '"';
        // get the key
        while((ch = at.charAt(pos)) !== "@" && ch !== "{" && ch !== "}") {
            s += ch;
            pos++;
        }
        // close key quote and colon
        s += '":';
        // sub-trie
        if(ch === "{") {
            result = atToPat_internal(at, pos);
            s += result.s;
            pos = result.pos + 1;
        } else {
            s += "1";
        }
    }
    s += "}";
    return {pos: pos, s: s};

}

// this part is difficult, so we'll just delegate the heavy lifting to JSON.parse
function atToPat(at) {
    return JSON.parse(atToPat_internal(at, 0).s);
}

Using the @-syntax, the total size shrinks down to about a third of the original flat dictionary.

To sum up...

Using this approach, you would build up the @-tree initially (just once) by doing

var dictionary = { ... } // your words here
// you can usually select the entire log by triple-clicking, or just output it into a textarea instead
console.log(patToAt(patriciafy(dictionary)));

and then once you've done that, on your page, you could do

var dictionary = atToPat("..."); // @-tree goes in the quotes

// and look up things like this:
isWord('apple');