So I have a multivalue field in my data set and I want to run a cardinality aggregation over it:
"fooCount": {
"cardinality": {
"precision_threshold": 1000,
"field": "foo"
}
}
The problem is that in the query, I am restricting the values I want to see:
{
"terms": {
"foo": [
1042,
1594,
1741,
4178
]
}
}
So, naively, I would expect the cardinality to be 4. It isn't; it's higher. The reason is that the aggregation counts the four values I filtered on, plus every other unique value that the matching records also have.
I think I can solve this with a script, but I'm not sure of the exact syntax to get it right. Something like:
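To make the over-count concrete, here is a small Python model of what the aggregation sees. The documents and their extra values are made up for illustration, and a plain set stands in for the cardinality aggregation (the real one is approximate, so treat this as a sketch of the logic only).

```python
# Hypothetical documents: each one has a multivalue "foo" field.
docs = [
    {"foo": [1042, 7, 8]},
    {"foo": [1594, 1741, 9]},
    {"foo": [4178, 7]},
    {"foo": [5, 6]},  # contains no target, so the terms query drops it
]
targets = {1042, 1594, 1741, 4178}

# The terms query keeps whole documents that contain at least one target.
matching = [d for d in docs if targets & set(d["foo"])]

# A cardinality aggregation on "foo" then sees every value of those documents,
# not just the values that were used in the filter.
all_values = {v for d in matching for v in d["foo"]}
naive_cardinality = len(all_values)

print(naive_cardinality)  # 7, not 4
```

The filter selects documents, not values, which is why the extra values 7, 8 and 9 are counted too.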
"fooCount": {
"cardinality": {
"script": {
"source": "???",
"params": {
"targets": [
1042, 1594, 1741, 4178
]
}
},
"precision_threshold": 1000
}
}
What I need the source to do is compare the array doc['foo'] with each item in params.targets and return the target value if there is a match. I'm not really all that familiar with Painless, its syntax, or its built-in functions. I thought something like this might work:
"source": "for (int t : params.targets) { if (doc['foo'].contains(t)) { return t} }",
So, loop over the targets and, if any of them are in the foo array, return the value we found. Unfortunately, this is returning 0 for me. Ideally, what I think I need is to actually return the intersection of params.targets and doc['foo']. Am I right? How do I do that?
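The difference between the early-return loop and a true intersection can be sketched in Python. The function names and the sample document are hypothetical, and the sketch assumes the aggregation counts every value the script emits for a document.

```python
def first_match(doc_values, targets):
    """Mirrors the early-return loop: at most one target per document."""
    for t in targets:
        if t in doc_values:
            return t
    return None


def intersection(doc_values, targets):
    """Every target present in this document, in target order."""
    return [t for t in targets if t in doc_values]


targets = [1042, 1594, 1741, 4178]
doc_values = [1594, 1741, 9]  # hypothetical document with two of the targets

only_first = first_match(doc_values, targets)   # 1594 (1741 is never seen)
per_doc = intersection(doc_values, targets)     # [1594, 1741]
```

With the early return, a target that only ever appears alongside an earlier target would never be emitted, so the intersection is the safer thing to return.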
Edit:
After some more fiddling around, I came up with this:
Collection targets = params.targets.clone(); targets.removeIf(t -> !doc['foo'].contains(t)); return targets;
Which seems like it should be closer. targets ought to contain the intersection, if I understand this at all, but I'm still getting 0.
Edit again:
As it turns out, my previous attempt will work. What I was missing is that my target values needed to be strings because the field I was trying to count is actually a string and not a number.
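The type mismatch is easy to reproduce outside Elasticsearch. This is a Python analogy (not Painless) with made-up values: a number never compares equal to its string form, so a membership test against string field values finds nothing.

```python
doc_values = ["1042", "1594", "9"]  # string field: values are stored as strings
int_targets = [1042, 1594, 1741, 4178]
str_targets = [str(t) for t in int_targets]

# Integer targets never equal the string values, so nothing matches.
matches_with_ints = [t for t in int_targets if t in doc_values]

# String targets compare equal to the stored values.
matches_with_strs = [t for t in str_targets if t in doc_values]

print(matches_with_ints)  # []
print(matches_with_strs)  # ['1042', '1594']
```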
Still, I'm not sure if this is the best, or most efficient way to do it.
Another edit:
I replaced my previous script (which was just a "get some kind of result out of this thing" script) with one that is much more efficient:
List result = new ArrayList(); for(t in params.targets) { if (doc['foo'].contains(t)) { result.add(t); } } return result
asked Feb 14 at 14:55, edited Feb 17 at 14:54 by Matt Burland
- I might be missing the obvious, but I am unsure what you want to accomplish. Do you want to count the number of documents that have all 4 values for this term? Or do you want to count how many times each of the 4 values appears in the dataset? – Paulo Commented Feb 14 at 18:09
- @Paulo: I need to count how many of the items in my list of targets actually appear in the result set. So if all 4 appear in the result set, it would be 4. If only 3 of the values appear across all the matching documents, then it would be 3, etc. – Matt Burland Commented Feb 14 at 19:19
1 Answer
Tldr;
Given the specific nature of the aggregation, you are right to look into scripting it.
Solution
Here is what I think you want to achieve, along with the setup steps.
# Set up
POST 79439746/_bulk
{"index":{}}
{"foo":["a", "b", "c"]}
{"index":{}}
{"foo":["b", "c", "e"]}
{"index":{}}
{"foo":["e", "f", "g"]}
{"index":{}}
{"foo":["e", "b", "a"]}
# Search + custom aggregation
POST 79439746/_search?size=0
{
"query": {
"terms": {
"foo.keyword": [
"f",
"e",
"h"
]
}
},
"aggs": {
"occurrences_of_all_foo": {
"scripted_metric": {
"params": {
"foo": [
"f",
"e",
"h"
]
},
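"init_script": "state.appearances = [];",
"map_script": """
// Count how many of the target values this document contains.
def counter = 0;
for (def t : params.foo) {
if (doc["foo.keyword"].contains(t)) {
counter += 1;
}
}
state.appearances.add(counter);""",
"combine_script": "if (state.appearances.isEmpty()) { return 0; } return Collections.max(state.appearances);",
"reduce_script": "return Collections.max(states);"
}
}
}
}
For the data set above, the aggregation returns 2: of the three targets, only f and e exist in the data (h does not), and the document ["e", "f", "g"] contains both. Note that the script reports the largest number of targets found in any single matching document, so the result only equals the number of distinct targets in the result set when one document contains all of them.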
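As a cross-check, the map/combine/reduce flow can be modelled in plain Python on the sample data. This is a sketch only: the two-shard split is hypothetical, and the set-union variant is an alternative that is not part of the answer above. It is included because the question asks for the number of distinct targets across the whole result set, which a per-document count does not always give.

```python
targets = ["f", "e", "h"]
docs = [["a", "b", "c"], ["b", "c", "e"], ["e", "f", "g"], ["e", "b", "a"]]

# The terms query keeps documents containing at least one target.
matching = [d for d in docs if any(t in d for t in targets)]

# Per-document counts, as a counting map_script would record them.
per_doc_counts = [sum(1 for t in targets if t in d) for d in matching]


def map_and_combine(shard_docs):
    """Per shard: collect the set of targets seen in any document."""
    found = set()
    for values in shard_docs:
        found.update(t for t in targets if t in values)
    return found


def reduce_states(states):
    """Coordinator: union the per-shard sets and count distinct targets."""
    merged = set()
    for s in states:
        merged |= s
    return len(merged)


# Hypothetical split of the matching documents across two shards.
shards = [matching[:1], matching[1:]]
distinct_targets = reduce_states([map_and_combine(s) for s in shards])

print(per_doc_counts)    # [1, 2, 1]
print(distinct_targets)  # 2
```

On this data the largest per-document count and the union agree (both are 2), while the smallest per-document count is 1. They diverge as soon as the targets are spread across documents: for documents ["e"] and ["f"], every per-document count is 1 but the union has 2 elements.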