I have an index with a domain field that stores, for example:
domain: "google"
What I would like to do is tell ES: "Ignore the TLD, and run a fuzzy match on the remaining part". So if someone searches for "gogle", it will ignore the "", will ignore the "", and therefore will still match the document with "google".
I can remove the TLD from the input string if required, but the domain is stored together with its TLD. How do I define an analyzer for that?
Tldr;
If you never want to match on the Top Level Domain, you might want to remove it, or store it in another field.
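If stripping it at index time appeals to you, here is a minimal sketch (my illustration, not part of the original answer: the index name, analyzer names, and regex are made up, and the naive pattern would mishandle multi-label suffixes such as .co.uk). It uses a pattern_replace character filter to remove the final dot-separated label before tokenization:
# strip_tld, domain_no_tld and the index name are illustrative
PUT my-domains
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_tld": {
          "type": "pattern_replace",
          "pattern": "\\.[A-Za-z0-9-]+$",
          "replacement": ""
        }
      },
      "analyzer": {
        "domain_no_tld": {
          "type": "custom",
          "char_filter": ["strip_tld"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "domain_no_tld"
      }
    }
  }
}
Since the same analyzer runs at search time, "gogle.net" would be reduced to gogle before matching, so a match query with "fuzziness": "auto" can reach the stored google.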
That said, there are solutions even without removing it.
Solutions
Both solutions below rely on creating a custom analyzer in your index.
Using ngram
Using the ngram tokenizer, the two strings end up sharing some tokens, and the query can match on those. Below is a request that returns the tokens created by the n-gram tokenizer.
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "google.com"
}
In this situation we would get:
google.com => goo, oog, ogl, gle, com
gogle.net => gog, ogl, gle, net
Meaning the two have ogl and gle in common, and each of those matches is going to add to the _score.
Here is a demo
PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram33": {
          "type": "custom",
          "tokenizer": "ngram-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "ngram-tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "ngram33"
      }
    }
  }
}
PUT 79544437/_doc/1
{
  "domain": "google.com"
}
GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle.net"
      }
    }
  }
}
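To sanity-check the setup (this verification request is my addition, not part of the original demo), you can run the query string through the index's analyzer:
GET 79544437/_analyze
{
  "analyzer": "ngram33",
  "text": "gogle.net"
}
This returns gog, ogl, gle and net; the ogl and gle tokens overlap with the ones indexed for google.com, which is why the match query above finds the document.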
Custom Tokenizer
Using a custom tokenizer, you could create a token per sub-domain, domain and top level domain.
The following _analyze request showcases it:
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "\\."
  },
  "text": "google.com"
}
You are getting two tokens: google and com.
The index + search you would like is below:
PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "split": {
          "type": "custom",
          "tokenizer": "split-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "split-tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "\\."
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "split"
      }
    }
  }
}
PUT 79544437/_doc/1
{
  "domain": "google.com"
}
GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle.net",
        "fuzziness": "auto"
      }
    }
  }
}
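For completeness (again my addition, not part of the original answer), you can check the query-side tokens:
POST 79544437/_analyze
{
  "analyzer": "split",
  "text": "gogle.net"
}
This yields gogle and net. With "fuzziness": "auto", a 5-character term such as gogle allows one edit, so it matches the indexed google token (one insertion away); the net clause simply matches nothing, which does not prevent the hit since a match query defaults to OR.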