
elasticsearch - Fuzzy matching domain while ignoring TLD - Stack Overflow


I have an index with a domain field that stores, for example:

 domain: "google.com" 

What I would like to do is tell ES: "Ignore the TLD, and run a fuzzy match on the remaining part". So if someone searches for "gogle.net", it will ignore the ".com" and the ".net", and therefore still match the document with "google.com".

I can remove the TLD from the input string if required, but the domain is stored together with its TLD. How do I define an analyzer for that?

asked Mar 30 at 10:17 by Mister_L; edited Mar 30 at 11:20 by Mark Rotteveel

1 Answer


TL;DR

If you never want to match on the top-level domain, you might want to remove it altogether, or store it in a separate field (a minimal mapping sketch for that follows below).

That said, there are solutions that work even without removing it.
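For the "separate field" option, a minimal mapping sketch could look like the following; the field names domain_name and tld are purely illustrative, not taken from the question:

PUT domains-split-fields
{
  "mappings": {
    "properties": {
      "domain_name": { "type": "text" },
      "tld": { "type": "keyword" }
    }
  }
}

At index time you would split "google.com" yourself (or in an ingest pipeline) into domain_name: "google" and tld: "com", and run the fuzzy match against domain_name only.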

Solutions

Both solutions below rely on defining a custom analyzer in your index.

Using ngram

Using an ngram tokenizer, you can match on the tokens that the search term and the stored domain have in common. Below is a request that returns the tokens created by the n-gram tokenizer.

POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "google"
}

In this situation we would get:

  • google.com => goo, oog, ogl, gle, com
  • gogle.net => gog, ogl, gle, net

Meaning the two strings have ogl and gle in common, and each of those matching tokens adds to the _score.

Here is a demo

PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram33": {
          "type": "custom",
          "tokenizer": "ngram-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "ngram-tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "ngram33"
      }
    }
  }
}

PUT 79544437/_doc/1
{
  "domain": "google"
}

GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle"
      }
    }
  }
}
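
To inspect the tokens this demo index produces for the search term, you can (for example) run _analyze against its custom analyzer:

GET 79544437/_analyze
{
  "analyzer": "ngram33",
  "text": "gogle.net"
}

This should return the gog, ogl, gle and net trigrams listed above; the shared ogl and gle tokens are what make the match query score the document.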

Custom Tokenizer

Using a custom tokenizer, you can create one token per sub-domain, domain and top-level domain.

The following analyzer showcases it:

POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "\\."
  },
  "text": "google"
}

You get two tokens: google and com.
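
The _analyze response looks roughly like this (abridged; offset and position metadata omitted):

{
  "tokens": [
    { "token": "google" },
    { "token": "com" }
  ]
}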

The index and search you are after would look like the following (note that it reuses the index name 79544437, so delete the first demo index or pick another name before running it):

PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "split": {
          "type": "custom",
          "tokenizer": "split-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "split-tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "\\."
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "split"
      }
    }
  }
}

PUT 79544437/_doc/1
{
  "domain": "google"
}

GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle",
        "fuzziness": "auto"
      }
    }
  }
}
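
If you want the TLD to be ignored outright rather than merely out-scored, one possible variation is to append a stop token filter that drops TLD tokens after the split; the stopword list below ("com", "net", "org") is illustrative and would need to cover the TLDs you actually expect:

PUT 79544437-no-tld
{
  "settings": {
    "analysis": {
      "filter": {
        "tld-stop": {
          "type": "stop",
          "stopwords": ["com", "net", "org"]
        }
      },
      "analyzer": {
        "split-no-tld": {
          "type": "custom",
          "tokenizer": "split-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding",
            "tld-stop"
          ]
        }
      },
      "tokenizer": {
        "split-tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "\\."
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "split-no-tld"
      }
    }
  }
}

With this analyzer, both "google.com" at index time and "gogle.net" at search time reduce to a single domain token, and the fuzzy match query from the previous demo works unchanged.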