I have an index with a domain field that stores, for example:
domain: "google"
What I would like to do is tell ES: "Ignore the TLD, and run a fuzzy match on the remaining part". So if someone searches for "gogle", it will ignore the "", will ignore the "", and therefore will still match the document with "google".
I can remove the TLD from the input string if required, but the domain is stored together with its TLD. How do I define an analyzer for that?
Tldr;
If you never want to match on the Top Level Domain, you might want to remove it, or store it in another field.
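If stripping it at index time appeals to you, here is a minimal sketch (my illustration, not part of the original answer: the index name, analyzer names, and regex are made up, and the naive pattern would mishandle multi-label suffixes such as .co.uk). It uses a pattern_replace character filter to remove the final dot-separated label before tokenization:
# strip_tld, domain_no_tld and the index name are illustrative
PUT my-domains
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_tld": {
          "type": "pattern_replace",
          "pattern": "\\.[A-Za-z0-9-]+$",
          "replacement": ""
        }
      },
      "analyzer": {
        "domain_no_tld": {
          "type": "custom",
          "char_filter": ["strip_tld"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "domain_no_tld"
      }
    }
  }
}
Since the same analyzer runs at search time, "gogle.net" would be reduced to gogle before matching, so a match query with "fuzziness": "auto" can reach the stored google.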
That said, there are solutions even without removing it.
Solutions
Both solutions below rely on creating a custom analyzer in your index.
Using ngram
Using the ngram tokenizer, the two strings end up sharing some tokens, and the query can match on those. Below is a request that returns the tokens created by the n-gram tokenizer.
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "google.com"
}
In this situation we would get:
google.com => goo, oog, ogl, gle, com
gogle.net => gog, ogl, gle, net
Meaning the two have ogl and gle in common, and each of those matches is going to add to the _score.
Here is a demo
PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram33": {
          "type": "custom",
          "tokenizer": "ngram-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "ngram-tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "ngram33"
      }
    }
  }
}
PUT 79544437/_doc/1
{
  "domain": "google.com"
}
GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle.net"
      }
    }
  }
}
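To sanity-check the setup (this verification request is my addition, not part of the original demo), you can run the query string through the index's analyzer:
GET 79544437/_analyze
{
  "analyzer": "ngram33",
  "text": "gogle.net"
}
This returns gog, ogl, gle and net; the ogl and gle tokens overlap with the ones indexed for google.com, which is why the match query above finds the document.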
Custom Tokenizer
Using a custom tokenizer, you could create a token per sub-domain, domain and top level domain.
The following _analyze request showcases it:
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "\\."
  },
  "text": "google.com"
}
You are getting two tokens: google and com.
The index + search you would like is below:
PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "split": {
          "type": "custom",
          "tokenizer": "split-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "split-tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "\\."
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "split"
      }
    }
  }
}
PUT 79544437/_doc/1
{
  "domain": "google.com"
}
GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle.net",
        "fuzziness": "auto"
      }
    }
  }
}
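For completeness (again my addition, not part of the original answer), you can check the query-side tokens:
POST 79544437/_analyze
{
  "analyzer": "split",
  "text": "gogle.net"
}
This yields gogle and net. With "fuzziness": "auto", a 5-character term such as gogle allows one edit, so it matches the indexed google token (one insertion away); the net clause simply matches nothing, which does not prevent the hit since a match query defaults to OR.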