I have written a program that marks all instances of a desired word class in a text. This is how I do it:

Make an array of words from the entire text.
Iterate over this array. For each word, look at its first letter.
- Jump to the corresponding array in an object of all words of the selected word class (e.g. 'S') and iterate over it. Break if the word is found and push it into an array of matches.
After all words are checked, iterate over the array of matches and highlight each one in the text.

On my machine, a text of 240,000 words is processed in 100 seconds for nouns and in about 4.5 seconds for prepositions.

I am looking for a way to improve performance, and these are the ideas I could come up with:

Rearrange the items in each block of my word list. Sort them so that, if the word starts with a vowel, all items that have a consonant as their second character come first, and vice versa (on the assumption that words with double vowels or double consonants are rare).
Structure the text into chapters and process only the currently shown chapter.

Are those solid ideas and are there any more ideas or proven techniques to improve this kind of processing?

asked Mar 20, 2015 at 15:17, edited Mar 20, 2015 at 15:23, by Wottensprels

Maybe a web worker that returns matches in chunks so that your UI can start highlighting immediately. – Malk Commented Mar 20, 2015 at 15:24
What is the definition of a "word class"? Can this be found anywhere in a word? Because your logic seems to assume that the word must start with the same letter as the first letter of the "word class". Is case-sensitivity important? Can a "word class" have punctuation or spaces? – Nate Commented Aug 18, 2016 at 22:41
Pretty basic question: are you using jQuery? If you are, you might wanna think about dropping it, since its performance is actually pretty bad compared with vanilla JS. – Brian H. Commented Aug 19, 2016 at 11:34
@Nate mea culpa. I'm not a native English speaker and thought "word class" to be the right term. What I mean is part of speech, e.g. nouns, adjectives etc. – Wottensprels Commented Aug 20, 2016 at 6:46
@Brian Thanks. I'm not using jQuery though. – Wottensprels Commented Aug 20, 2016 at 6:46

6 Answers

Use the power of JavaScript.

It manipulates dictionaries with string keys as a fundamental operation. For each word class, build an object with each possible word being a key and some simple value like true or 1. Then checking each word is simply typeof(wordClass[word]) !== "undefined". I expect this to be much much faster.
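A minimal sketch of this kind of lookup, under a few assumptions: the noun list and the tokenizer are hypothetical, and a Set is swapped in for the plain object because it offers the same hash lookup without inherited keys such as "constructor".

```javascript
// Hypothetical word list; in practice this would be the full list for one part of speech.
const nouns = new Set(["house", "tree", "river"]);

// Returns the positions (indexes into `words`) of every word found in `wordClass`.
function findMatches(words, wordClass) {
  const matches = [];
  for (let i = 0; i < words.length; i++) {
    // Set.prototype.has is a hash lookup, so no inner loop over the word list is needed.
    if (wordClass.has(words[i].toLowerCase())) {
      matches.push(i);
    }
  }
  return matches;
}

// Assumed tokenizer: split on anything that is not an ASCII letter.
const words = "The house by the river has a tree".split(/[^A-Za-z]+/).filter(Boolean);
const matchIndexes = findMatches(words, nouns);
```

Each word now costs one lookup instead of a scan through every listed word that shares its first letter.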

Regular expressions are another highly optimized area of JavaScript. You can probably do the whole thing as one massive regular expression for each word class. If your highlighting is in HTML, then you can also just use a replace on the RE to get the result. Whether this works likely depends on just how big your word sets are.
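A rough sketch of the single-regex idea, assuming a hypothetical preposition list and plain text without HTML tags. The helper names are made up for illustration. Note that \b only recognises ASCII word characters, so texts with umlauts or other non-ASCII letters would need a different boundary test.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds one case-insensitive alternation such as /\b(?:in|on|into)\b/gi.
function buildWordClassRegExp(wordList) {
  return new RegExp("\\b(?:" + wordList.map(escapeRegExp).join("|") + ")\\b", "gi");
}

// Wraps every match in <mark>; "$&" stands for the matched text.
function highlight(text, wordList) {
  return text.replace(buildWordClassRegExp(wordList), "<mark>$&</mark>");
}

const prepositions = ["in", "on", "into", "under"]; // hypothetical list
const highlighted = highlight("The cat went into the box in the hall", prepositions);
```

The regex only needs to be built once per word class and can then be reused for every text.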

The solution I propose is to implement a trie data structure. It takes more effort to implement, but has several advantages over a hash table (dictionary).

Looking up data in a trie will take, at most, O(k) time, where k is the length of the search string. With a hash table, storing each word as a key might work, but what are you storing as the value at that key in the table? Hash tables don't seem very efficient to me for this problem.

Additionally, a trie can provide alphabetical ordering of your key entries natively via pre-order traversal. A hash table cannot. To sort its keys, you'd have to implement a sort function on your own, which just adds more time and space.

If you read further into tries, you'll come across suffix trees and radix trees, which address the exact problem you're trying to solve. So, in a sense, you're reinventing the wheel, but I'm not claiming that's a bad thing. Learning this stuff makes you a better programmer.

We can implement a simple trie as a set of connected nodes, each of which stores three pieces of information: 1) a symbol (character), 2) a pointer to the first child of this node, and 3) a pointer to this node's next sibling, i.e. the parent node's next child.

class TrieNode {
  constructor(symbol) {
    this.symbol = symbol;
    this.child = null;
    this.next = null;
  }
}

You can then build a web of words, linked together by each letter in the word. Words that share the same prefix are natively linked together via the child and next pointers, so lookup is pretty fast. I encourage you to look further into tries. They're neat data structures and I think they suit your problem best.
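As an illustration, insertion and lookup on such nodes could look roughly like this. The isWord flag and the helper functions are assumptions added for the sketch. Without some end-of-word marker, a prefix like "tr" could not be told apart from a stored word.

```javascript
// Same node layout as above, plus an assumed `isWord` flag marking the end of a stored word.
class TrieNode {
  constructor(symbol) {
    this.symbol = symbol;
    this.child = null;   // first child
    this.next = null;    // next sibling
    this.isWord = false; // assumption: not part of the original node
  }
}

// Walks the sibling list below `node` and returns the child carrying `symbol`, or null.
function findChild(node, symbol) {
  let current = node.child;
  while (current !== null && current.symbol !== symbol) {
    current = current.next;
  }
  return current;
}

function insert(root, word) {
  let node = root;
  for (const symbol of word) {
    let child = findChild(node, symbol);
    if (child === null) {
      child = new TrieNode(symbol);
      child.next = node.child; // prepend to the sibling list
      node.child = child;
    }
    node = child;
  }
  node.isWord = true;
}

function contains(root, word) {
  let node = root;
  for (const symbol of word) {
    node = findChild(node, symbol);
    if (node === null) return false;
  }
  return node.isWord;
}

const root = new TrieNode(null);
["tree", "trie", "house"].forEach((w) => insert(root, w));
```

Here contains(root, "trie") is true, while contains(root, "tr") is false because no stored word ends at that node.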

I think the steps with the highest computational cost would be:

Searching for a particular word in the word class container.
Highlighting the matches in the source document.

Thus, I would propose a more efficient data structure to store your word class container and the matches list, so that search and lookup run faster.

If I understand your problem correctly, you only want to highlight those words which are in the word class list. So I would propose a Bloom filter, which does this job very well.

http://en.wikipedia.org/wiki/Bloom_filter

A Bloom filter is a set container in which you can store any elements (words) and then check whether a new word is already in the set. It is blazingly fast and suits big data processing well.

The use cases would be:

You store the word classes in a Bloom Filter, let's name it bfWordClass.
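Iterate through the list of the extracted words and check if each word is a member of bfWordClass. This operation is extremely fast. A Bloom filter never reports a stored word as missing, but it can occasionally report a word as present that was never stored, so confirm positive results against the exact list if you need 100% accuracy.
If the word does belong to bfWordClass, then you look it up in the text and highlight it. You may consider another data structure to store the unique words extracted from the document and all indexes found in the document for faster reference.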
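The steps above can be sketched as follows. This is a minimal hand-rolled filter for illustration only: the bit count, the number of hashes, the FNV-1a style hash and the exactWords confirmation Set are all assumptions, not part of any library. The confirmation step is there because a Bloom filter can return false positives (but never false negatives).

```javascript
class BloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(bitCount / 8));
  }

  // FNV-1a style 32-bit string hash, varied by `seed`.
  static hash(word, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < word.length; i++) {
      h ^= word.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h;
  }

  // Yields the bit positions for `word` (double hashing: h1 + i * h2).
  *positions(word) {
    const h1 = BloomFilter.hash(word, 0);
    const h2 = BloomFilter.hash(word, 0x9e3779b9) | 1;
    for (let i = 0; i < this.hashCount; i++) {
      yield ((h1 + Math.imul(i, h2)) >>> 0) % this.bitCount;
    }
  }

  add(word) {
    for (const p of this.positions(word)) {
      this.bits[p >> 3] |= 1 << (p & 7);
    }
  }

  // false means "definitely not stored"; true means "probably stored".
  mightContain(word) {
    for (const p of this.positions(word)) {
      if ((this.bits[p >> 3] & (1 << (p & 7))) === 0) return false;
    }
    return true;
  }
}

const wordList = ["house", "tree", "river"]; // hypothetical word class
const bfWordClass = new BloomFilter(1024, 3);
const exactWords = new Set(wordList);        // used only to rule out false positives
wordList.forEach((w) => bfWordClass.add(w));

function isInWordClass(word) {
  return bfWordClass.mightContain(word) && exactWords.has(word);
}
```

Since the exact Set is needed anyway to confirm hits, the filter mainly pays off when the word list is too large to keep in memory or expensive to query.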

Use first 3 characters as key, not first char.
Offload your work to many background threads
Process visible text first
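The first suggestion could look roughly like this, assuming the word list is kept in buckets as in the question. The function names and the sample list are hypothetical.

```javascript
// Groups a word list into buckets keyed by the first `prefixLength` characters.
function buildPrefixIndex(wordList, prefixLength = 3) {
  const index = new Map();
  for (const word of wordList) {
    const key = word.slice(0, prefixLength);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(word);
  }
  return index;
}

// Only the (much smaller) bucket for the word's prefix is scanned.
function isInIndex(index, word, prefixLength = 3) {
  const bucket = index.get(word.slice(0, prefixLength));
  return bucket !== undefined && bucket.includes(word);
}

const prefixIndex = buildPrefixIndex(["street", "stream", "stone", "sun"]);
```

With three-character keys, a lookup for "stream" scans only the words starting with "str" rather than every word starting with "s".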

240,000 words is indeed a lot of data to process on the client side; it will create performance issues in JavaScript as well as in DOM manipulation when you highlight the text. You should try to process a smaller set if possible, such as pages, paragraphs or sections.

Whether or not you can reduce your active set of words, here are a few optimizations you can try for your text processing:

Storing the text in DOM

You can try two approaches here:

Single DOM element containing all the text i.e. 240k words
Multiple DOM elements each containing N words, say 240 elements with 1000 words each.

You will have to use tools like jsPerf to see how fast innerHTML changes are in both approaches, i.e. replacing one big innerHTML or text in a single DOM element vs. replacing the innerHTML of multiple matching DOM elements.

Matching-Highlighting words when they have been completely typed

For example, you want to highlight "Javascript" and "text" in your text after they have been completely typed. In this case, as @DrC mentioned, preprocessing your text to store key vs. data would be a good thing. Generate a key for each word (you may want to normalize the key if you want to match case-insensitively, ignore special characters, etc., i.e. 'nosql' will be the key for 'NoSQL', 'NOSQL' or 'No-SQL'). Your lookup object will then look like:

{
  'nosql': {'matches': [{'NoSQL': 3}, {'NOSQL': 6}]}  // indexes of strings where this occurs
}

Whenever a word is searched, you generate the key to look up the indexes of all matches, and highlight the text.
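A sketch of how such a lookup object might be built, assuming the text has already been split into an array of words. normalizeKey and buildIndex are hypothetical helpers, and the normalization rule (lower-case, ASCII letters and digits only) is just one possible choice.

```javascript
// Assumed normalization: lower-case and drop everything that is not a letter or digit.
function normalizeKey(word) {
  return word.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Maps each normalized key to the positions in `words` where a variant occurs.
function buildIndex(words) {
  const index = new Map();
  words.forEach((word, position) => {
    const key = normalizeKey(word);
    if (key === "") return; // skip tokens that consist only of punctuation
    if (!index.has(key)) index.set(key, { matches: [] });
    index.get(key).matches.push({ [word]: position });
  });
  return index;
}

const words = ["I", "like", "NoSQL", "and", "No-SQL", "stores"];
const index = buildIndex(words);
const hits = index.get(normalizeKey("nosql")).matches;
```

The index is built once per text. Every later search is then a single Map lookup followed by highlighting the stored positions.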

Matching-Highlighting words as they are being typed

For this approach you would need a trie-based structure built as a JavaScript object. Another approach would be to generate and compile a regex from the currently typed string, e.g. if the user has typed 'jav' the regex would be /jav/gi, and do a regex match on the whole text. Both approaches would need a performance comparison.
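The regex variant might be sketched like this. findTyped is a hypothetical helper, and the typed string is escaped first so that input such as "c++" does not break the pattern.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the start index of every case-insensitive occurrence of `typed` in `text`.
function findTyped(text, typed) {
  if (typed === "") return []; // an empty pattern would match at every position
  const re = new RegExp(escapeRegExp(typed), "gi");
  return Array.from(text.matchAll(re), (m) => m.index);
}

const positions = findTyped("Java and JavaScript; java too", "jav");
```

The regex has to be rebuilt on every keystroke, which is why it would need to be measured against the trie approach.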

I would do something like this.

HTML

<section id="text">All keywords in this sentence will be marked</section>

JavaScript

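// Assumes the keywords contain no regular-expression metacharacters.
const element = document.getElementById('text');
const array_of_keywords = ['keyword', 'will'];
// Build the pattern with the RegExp constructor instead of eval.
const pattern = new RegExp('(' + array_of_keywords.join('|') + ')', 'g');
element.innerHTML = element.innerHTML.replace(pattern, '<mark>$1</mark>');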

javascript - Increasing performance on text processing - Stack Overflow