最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

javascript - Searching for multiple partial phrases so that one original phrase can not match multiple search phrases - Stack Ov

programmeradmin6浏览0评论

Given a predefined set of phrases, I'd like to perform a search based on user's query. For example, consider the following set of phrases:

index      phrase
-----------------------------------------
0          Stack Overflow
1          Math Overflow
2          Super User
3          Webmasters
4          Electrical Engineering
5          Programming Jokes
6          Programming Puzzles
7          Geographic Information Systems 

The expected behaviour is:

query         result
------------------------------------------------------------------------
s             Stack Overflow, Super User, Geographic Information Systems
web           Webmasters
over          Stack Overflow, Math Overflow
super u       Super User
user s        Super User
e e           Electrical Engineering
p             Programming Jokes, Programming Puzzles
p p           Programming Puzzles

To implement this behaviour I used a trie. Every node in the trie has an array of indices (empty initially).

To insert a phrase to the trie, I first break it to words. For example, Programming Puzzles has index = 6. Therefore, I add 6 to all the following nodes:

p
pr
pro
prog
progr
progra
program
programm
programmi
programmin
programming
pu
puz
puzz
puzzl
puzzle
puzzles

The problem is, when I search for the query prog p, I first get a list of indices for prog which is [5, 6]. Then, I get a list of indices for p which is [5, 6] as well. Finally, I calculate the intersection between the two, and return the result [5, 6], which is obviously wrong (should be [6]).

How would you fix this?

Given a predefined set of phrases, I'd like to perform a search based on user's query. For example, consider the following set of phrases:

index      phrase
-----------------------------------------
0          Stack Overflow
1          Math Overflow
2          Super User
3          Webmasters
4          Electrical Engineering
5          Programming Jokes
6          Programming Puzzles
7          Geographic Information Systems 

The expected behaviour is:

query         result
------------------------------------------------------------------------
s             Stack Overflow, Super User, Geographic Information Systems
web           Webmasters
over          Stack Overflow, Math Overflow
super u       Super User
user s        Super User
e e           Electrical Engineering
p             Programming Jokes, Programming Puzzles
p p           Programming Puzzles

To implement this behaviour I used a trie. Every node in the trie has an array of indices (empty initially).

To insert a phrase to the trie, I first break it to words. For example, Programming Puzzles has index = 6. Therefore, I add 6 to all the following nodes:

p
pr
pro
prog
progr
progra
program
programm
programmi
programmin
programming
pu
puz
puzz
puzzl
puzzle
puzzles

The problem is, when I search for the query prog p, I first get a list of indices for prog which is [5, 6]. Then, I get a list of indices for p which is [5, 6] as well. Finally, I calculate the intersection between the two, and return the result [5, 6], which is obviously wrong (should be [6]).

How would you fix this?

Share Improve this question edited May 9, 2015 at 9:02 Misha Moroshko asked May 8, 2015 at 12:22 Misha MoroshkoMisha Moroshko 171k229 gold badges520 silver badges760 bronze badges 8
  • so it's your library and you would like to make changes to it so it handles this case correctly, right? – Miroshko Commented May 8, 2015 at 12:34
  • So what should be the result for p prog? I'm assuming zero matches, right? If so, I'd say the way forward would be to attach a word-index on each node-label. – aioobe Commented May 8, 2015 at 12:38
  • @aioobe For p prog we should get Programming Puzzles. In general, the result for word1 word2 ... wordN and for any permutation like wordN word2 word1 ..., should be the same. – Misha Moroshko Commented May 8, 2015 at 12:41
  • @MishaMoroshko, then it seems like a really hard problem to solve. If all permutations should give the same result (search term position is irrelevant) but words may not be "reused" as they are in your example, then it seems to me that you could reduce the Set Cover Problem to your problem, which means that your problem is NP-hard an can only be solved using brute force. (Maybe I'm wrong or misunderstood something, but that's my immediate feeling.) – aioobe Commented May 8, 2015 at 12:49
  • @aioobe Main difference from set cover is you have a candidate for the covering set, and you don't need to "invent" the covering set, or give a minimal one. Verifying if a given set is indeed a set cover can be done efficiently (definition of NP), and I am not sure it is verification of set cover in the first place, but that's debateable – amit Commented May 11, 2015 at 9:50
 |  Show 3 more ments

4 Answers 4

Reset to default 4 +150

Key Observation

We can use the fact that two words in a query can match the same word in a phrase only if one query word is a prefix of the other query word (or if they are same). So if we process the query words in descending lexicographic order (prefixes e after their "superwords"), then we can safely remove words from the phrases at the first match. Doing so we left no possibility to match the same phrase word twice. As I said, it is safe because prefixes match superset of phrase words what their "superwords" can match, and pair of query words, where one is not a prefix of the other, always match disjoint set of phrase words.

We don't have to remove words from phrases or the trie "physically", we can do it "virtually".

Implementation of the Algorithm

var PhraseSearch = function () {   
    var Trie = function () {
        this.phraseWordCount = {};
        this.children = {};
    };

    Trie.prototype.addPhraseWord = function (phrase, word) {
        if (word !== '') {
            var first = word.charAt(0);

            if (!this.children.hasOwnProperty(first)) {
                this.children[first] = new Trie();
            }
            var rest = word.substring(1);
            this.children[first].addPhraseWord(phrase, rest);
        }
        if (!this.phraseWordCount.hasOwnProperty(phrase)) {
            this.phraseWordCount[phrase] = 0;
        }
        this.phraseWordCount[phrase]++;
    };

    Trie.prototype.getPhraseWordCount = function (prefix) {
        if (prefix !== '') {
            var first = prefix.charAt(0);

            if (this.children.hasOwnProperty(first)) {
                var rest = prefix.substring(1);
                return this.children[first].getPhraseWordCount(rest);
            } else {
                return {};
            }
        } else {
            return this.phraseWordCount;
        }
    }

    this.trie = new Trie();
}

PhraseSearch.prototype.addPhrase = function (phrase) {
    var words = phrase.trim().toLowerCase().split(/\s+/);
    words.forEach(function (word) {
        this.trie.addPhraseWord(phrase, word);
    }, this);
}

PhraseSearch.prototype.search = function (query) {
    var answer = {};
    var phraseWordCount = this.trie.getPhraseWordCount('');
    for (var phrase in phraseWordCount) {
        if (phraseWordCount.hasOwnProperty(phrase)) {
            answer[phrase] = true;
        }
    }

    var prefixes = query.trim().toLowerCase().split(/\s+/);

    prefixes.sort();
    prefixes.reverse();

    var prevPrefix = '';
    var superprefixCount = 0;

    prefixes.forEach(function (prefix) {
        if (prevPrefix.indexOf(prefix) !== 0) {
            superprefixCount = 0;
        }
        phraseWordCount = this.trie.getPhraseWordCount(prefix);

        function phraseMatchedWordCount(phrase) {
            return phraseWordCount.hasOwnProperty(phrase) ? phraseWordCount[phrase] - superprefixCount : 0;
        }

        for (var phrase in answer) {
            if (answer.hasOwnProperty(phrase) && phraseMatchedWordCount(phrase) < 1) {
                delete answer[phrase];
            }
        }

        prevPrefix = prefix;
        superprefixCount++;
    }, this);

    return Object.keys(answer);
}

function test() {
    var phraseSearch = new PhraseSearch();

    var phrases = [
        'Stack Overflow',
        'Math Overflow',
        'Super User',
        'Webmasters',
        'Electrical Engineering',
        'Programming Jokes',
        'Programming Puzzles',
        'Geographic Information Systems'
    ];

    phrases.forEach(phraseSearch.addPhrase, phraseSearch);

    var queries = {
        's':       'Stack Overflow, Super User, Geographic Information Systems',
        'web':     'Webmasters',
        'over':    'Stack Overflow, Math Overflow',
        'super u': 'Super User',
        'user s':  'Super User',
        'e e':     'Electrical Engineering',
        'p':       'Programming Jokes, Programming Puzzles',
        'p p':     'Programming Puzzles'
    };

    for(var query in queries) {
        if (queries.hasOwnProperty(query)) {
            var expected = queries[query];
            var actual = phraseSearch.search(query).join(', ');

            console.log('query: ' + query);
            console.log('expected: ' + expected);
            console.log('actual: ' + actual);
        }
    }
}

One can test this code here: http://ideone./RJgj6p

Possible Optimizations

  • Storing the phrase word count in each trie node is not very memory efficient. But by implementing pressed trie it is possible to reduce the worst case memory plexity to O(n m), there n is the number of different words in all the phrases, and m is the total number of phrases.

  • For simplicity I initialize answer by adding all the phrases. But a more time efficient approach is to initialize answer by adding the phrases matched by the query word matching least number of phrases. Then intersect with the phrases of the query word matching second least number of phrases. And so on...

Relevant Differences from the Implementation Referenced in the Question

  1. In trie node I store not only the phrase references (ids) matched by the subtrie, but also the number of matched words in these phrases. So, the result of the match is not only the matched phrase references, but also the number of matched words in these phrases.
  2. I process query words in descending lexicographic order.
  3. I subtract the number of superprefixes (query words of which the current query word is a prefix) from current match results (by using variable superprefixCount), and a phrase is considered matched by the current query word only when the resulting number of matched words in it is greater than zero. As in the original implementation, the final result is the intersection of the matched phrases.

As one can see, changes are minimal and asymptotic plexities (both time and memory) are not changed.

You can solve it as a Graph Matching Problem in a Bipartite Graph.

For each document, query pair define the graph:

G=(V,E) Where
V = {t1 | for each term t1 in the query} U { t2 | for each term t2 in the document}
E = { (t1,t2) | t1 is a match for t2 }

Intuitively: you have a vertex for each term in the query, a vertex for each term in the document, and an edge between a document term and a query term, only if the query term matches the document term. You have already solved this part with your trie.

You got yourself a bipartite graph, there are only edges between the "query vertices" and the "document vertices" (and not between two vertices of the same type).

Now, invoke a matching problem for bipartite graph, and get an optimal matching {(t1_1,t2_1), ... , (t1_k,t2_k)}.

Your algorithm should return a document d for a query q with m terms in the query, if (and only if) all m terms are satisfied, which means - you have maximal matching where k=m.

In your example, the graph for query="prog p", and document="Programming Jokes", you will get the bipartite graph with the matching: (or with Programming,p matched - doesn't matter which)

And, for the same query, and document="Programming Puzzles", you will get the bipartite graph with the matching:

As you can see, for the first example - there is no matching that covers all the terms, and you will "reject" the document. For the 2nd example - you were able to match all terms, and you will return it.

For performance issues, you can do the suggested algorithm only on a subset of the phrases, that were already filtered out by your initial approach (intersection of documents that have matching for all terms).

If the set of phrases is defined and does not contain long phrases, maybe you can create not 1 trie, but n tries, where n is the maximum number of words in one phrase.

In i-th trie store i-th word of the phrase. Let's call it the trie with label 'i'.

To process query with m words let's consider the following algorithm:

  1. For each phrase we will store the lowest label of a trie, where the word from this phrase was found. Let's denote it as d[j], where j is the phrase index. At first for each phrase j, d[j] = -1.
  2. Search the first word in each of n tries.
  3. For each phrase j find the label of a trie that is greater than d[j] and where the word from this phrase was found. If there are several such labels, pick the smallest one. Let's denote such label as c[j].
  4. If there is no such index, this phrase can not be matched. You can mark this case with d[j] = n + 1.
  5. If there is such c[j] that c[j] > d[j], than assign d[j] = c[j].
  6. Repeat for every word left.
  7. Every phrase with -1 < d[j] < n is matched.

This is not very optimal. To improve performance you should store only usable values of d array. After first word, store only phrases, matched with this word. Also, instead of assignment d[j] = n + 1, delete index j. Process only already stored phrase indexes.

After some thought I came up with a similar idea to dened's - in addition to the index of a matched phrase, each prefix will refer to how many words it is a prefix of in that phrase - then that number can be reduced in the query process by the number of its superfixes among other query words, and the returned results include only those with at least the same number of matched words as the query.

We can implement an additional small tweak to avoid large cross-checks by adding (for the English language) a maximum of approximately 26 choose 2 + 26 choose 3 and even an additional 26 choose 4 special elements to the trie that refer to ordered first-letter intersections. When a phrase is inserted, the special elements in the trie referring to the 2 and 3 first-letter binations will receive its index. Then match results from larger query words can be cross-checked against these. For example, if our query is "Geo i", the match results for "Geo" would be cross-checked against the special trie element, "g-i", which hopefully would have significantly less match results than "i".

Also, depending on the specific circumstances, large cross-checks could at times be more efficiently handled in parallel (for example, via a bitset &).

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论