elasticsearch - Restrict cardinality on a multivalue field - Stack Overflow

So I have a multivalue field in my data set and I want to run a cardinality aggregation over it:

    "fooCount": {
      "cardinality": {
        "precision_threshold": 1000,
        "field": "foo"
      }
    }

The problem is that in the query, I am restricting the values I want to see:

        {
          "terms": {
            "foo": [
              1042,
              1594,
              1741,
              4178
            ]
          }
        }

So, naively, I would expect the cardinality to be 4. It isn't; it's higher than that. The reason is that the aggregation counts the four values I restricted the query to, but also every other unique value that the matching documents contain.
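
To make the effect concrete, here is a minimal plain-Python model of it (no Elasticsearch involved). The sample documents and the extra values 5 to 9 are hypothetical; only the four target values come from the question.

```python
# Plain-Python model of a terms query followed by a cardinality aggregation.
# The documents below are made up for illustration.
docs = [
    {"foo": [1042, 7, 8]},
    {"foo": [1594, 1741, 9]},
    {"foo": [4178, 7]},
    {"foo": [5, 6]},  # holds no target value, so the query filters it out
]
targets = {1042, 1594, 1741, 4178}

# The terms query keeps every document that holds at least one target value.
hits = [d for d in docs if targets.intersection(d["foo"])]

# Cardinality on "foo" counts every distinct value in the hits,
# not only the values the query asked for.
distinct_values = {v for d in hits for v in d["foo"]}
cardinality = len(distinct_values)
print(cardinality)  # 7 in this sample, not 4
```

The query filters documents, not values, so the neighbours 7, 8 and 9 ride along into the count.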

So I think I can solve this with a script, but I'm not sure of the exact syntax to get it right. Something like:

    "fooCount": {
      "cardinality": {
        "script": {
          "source": "???",
          "params": {
            "targets": [
              1042, 1594, 1741, 4178
            ]
          }
        },
        "precision_threshold": 1000
      }
    }

What I need the source to do is compare the array doc['foo'] with each item in params.targets and return the target value if there is a match. I'm not really all that familiar with Painless, its syntax and its built-in functions. I thought something like this might work:

"source": "for (int t : params.targets) { if (doc['foo'].contains(t)) { return t} }",

So, loop over the targets, if any of them are in the foo array, return the value we found. Unfortunately, this is returning 0 for me. Ideally, what I think I need is to actually return the intersection of params.targets and doc['foo']. Am I right? How do I do that?

Edit:

Some more fiddling around, I came up with this:

Collection targets = params.targets.clone(); targets.removeIf(t -> !doc['foo'].contains(t)); return targets;

That seems like it should be closer. If I understand this at all, targets ought to contain the intersection, but I'm still getting 0.

Edit again:

As it turns out, my previous attempt will work. What I was missing is that my target values needed to be strings because the field I was trying to count is actually a string and not a number.
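
The mismatch can be illustrated outside Elasticsearch. This is only an analogy: it assumes that the Painless `contains` check, like Python's `in`, treats an integer and a string holding the same digits as unequal. The document values are hypothetical.

```python
# Hypothetical document whose "foo" values are stored as strings.
doc_foo = ["1042", "1594", "9"]

int_targets = [1042, 1594, 1741, 4178]
str_targets = [str(t) for t in int_targets]

# Integer targets never equal string field values, so nothing matches.
matches_with_ints = [t for t in int_targets if t in doc_foo]

# String targets compare equal to the stored values.
matches_with_strs = [t for t in str_targets if t in doc_foo]

print(matches_with_ints)
print(matches_with_strs)
```

With integer targets the per-document result is always empty, which is consistent with the aggregation reporting 0.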

Still, I'm not sure if this is the best, or most efficient way to do it.

Another edit:

I replaced my previous script (which was just a get-some-kind-of-result-out-of-this-thing script) with one that is much more efficient:

List result = new ArrayList(); for(t in params.targets) { if (doc['foo'].contains(t)) { result.add(t); } } return result
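
As a rough model of what this script does inside the cardinality aggregation, the following Python sketch emits the matched targets per document and then counts the distinct emitted values. The helper name `matched_targets` and the sample documents are hypothetical, and the real aggregation computes an approximate count rather than building an exact set.

```python
def matched_targets(doc_values, targets):
    """Mirror of the Painless loop: keep only the targets present in this document."""
    return [t for t in targets if t in doc_values]


# Hypothetical matching documents; values are strings, as in the question's final edit.
hits = [
    {"foo": ["1042", "7", "8"]},
    {"foo": ["1594", "7"]},
    {"foo": ["1042", "1594", "9"]},
]
targets = ["1042", "1594", "1741", "4178"]

# Each document contributes only its intersection with the targets...
emitted = [matched_targets(d["foo"], targets) for d in hits]

# ...and the aggregation counts the distinct values across all documents.
foo_count = len({v for values in emitted for v in values})
print(foo_count)
```

Here two of the four targets occur in the hits, so the count is 2 even though the documents hold five distinct values in total.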

Asked Feb 14 at 14:55 by Matt Burland; edited Feb 17 at 14:54.
  • I might be missing the obvious, but I am unsure what you want to accomplish. Do you want to count the number of documents that have all 4 values for this term? Or do you want to count how many times each of the 4 values appears in the data set? – Paulo, commented Feb 14 at 18:09
  • @Paulo: I need to count how many of the items in my list of targets actually appear in the result set. So if all 4 appear in the result set, it would be 4. If only 3 of the values appear across all the matching documents, then it would be 3, etc. – Matt Burland, commented Feb 14 at 19:19

1 Answer

Tldr;

Given the specific nature of the aggregation, you are right to look into scripting it.

Solution

Here is what I think you want to achieve, along with the setup steps.

# Set up
POST 79439746/_bulk
{"index":{}}
{"foo":["a", "b", "c"]}
{"index":{}}
{"foo":["b", "c", "e"]}
{"index":{}}
{"foo":["e", "f", "g"]}
{"index":{}}
{"foo":["e", "b", "a"]}

# Search + custom aggregation
POST 79439746/_search?size=0
{
  "query": {
    "terms": {
      "foo.keyword": [
        "f",
        "e",
        "h"
      ]
    }
  },
  "aggs": {
    "occurrences_of_all_foo": {
      "scripted_metric": {
        "params": {
          "foo": [
            "f",
            "e",
            "h"
          ]
        },
        "init_script": "state.appearances = [];",
        "map_script": """
            // Count how many of the target values this document contains.
            def counter = 0;
            for (def t : params.foo) { 
                if (doc["foo.keyword"].contains(t)) { 
                    counter += 1;
                }
            }
            state.appearances.add(counter);""",
        "combine_script": "def max = Collections.max(state.appearances); return max",
        "reduce_script": "def max = Collections.max(states); return max"
      }
    }
  }
}

Given this specific data set, the aggregation is going to return 2, because only f and e exist in the data set; h does not appear in any document.
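
A per-document counter reports how many targets a single document contains. That equals the number of distinct targets in the result set only when one document happens to hold all of them, as the document with e, f and g does here. The sketch below is a plain-Python model of a variant, not the answer's script: it keeps a set of matched targets per shard and unions the sets in the reduce phase. The function names and the two-shard split of the matching documents are assumptions made for illustration.

```python
def init_state():
    # Per-shard state: the set of targets seen so far.
    return {"seen": set()}


def map_doc(state, doc_values, targets):
    # Runs once per matching document on a shard.
    state["seen"].update(t for t in targets if t in doc_values)


def combine(state):
    # Runs once per shard; hands the shard's set to the coordinating node.
    return state["seen"]


def reduce_states(shard_results):
    # Union the per-shard sets and count the distinct targets.
    merged = set()
    for seen in shard_results:
        merged |= seen
    return len(merged)


def count_distinct_targets(shards, targets):
    results = []
    for shard_docs in shards:
        state = init_state()
        for doc_values in shard_docs:
            map_doc(state, doc_values, targets)
        results.append(combine(state))
    return reduce_states(results)


# The three documents matched by the query, split across two hypothetical shards.
shards = [
    [["b", "c", "e"], ["e", "b", "a"]],
    [["e", "f", "g"]],
]
print(count_distinct_targets(shards, ["f", "e", "h"]))
```

On this data the model also yields 2, and it would still yield 2 if e and f never occurred in the same document.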
