According to the sklearn docs, if you apply predict_proba to DecisionTreeClassifier:
The predicted class probability is the fraction of samples of the same class in a leaf.
Let's say that the rows where class = 1 in my training dataset look like this:
feature_1 | feature_2 | class
----------|-----------|------
A | C | 1
A | C | 1
A | D | 1
B | C | 1
B | D | 1
I'm interpreting the docs to mean that if I trained a model on this data, predict_proba would tell me that a row where feature_1 = A and feature_2 = C would have a 40% chance of falling under class 1. This is because there are five rows total where class = 1, two of which also have feature_1 = A and feature_2 = C. Two is 40% of five.
Obviously this is a very simple example, but I'm just trying to understand the general methodology predict_proba uses.
Is my interpretation correct? I would have thought that, in this case, the probability of class 1 would be at least partially affected by any rows in the training dataset where feature_1 = A, feature_2 = C, and class != 1.
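To make this concrete, here is roughly the experiment I have in mind (the label encoding and the class-0 rows are my own additions, just so there is a second class to fit on, since sklearn trees need numeric input):

```python
from sklearn.tree import DecisionTreeClassifier

# feature_1: A -> 0, B -> 1; feature_2: C -> 0, D -> 1 (my own encoding)
X = [
    [0, 0],  # A, C, class 1
    [0, 0],  # A, C, class 1
    [0, 1],  # A, D, class 1
    [1, 0],  # B, C, class 1
    [1, 1],  # B, D, class 1
    [0, 0],  # A, C, class 0 (hypothetical row, not in the table above)
    [1, 1],  # B, D, class 0 (hypothetical row, not in the table above)
]
y = [1, 1, 1, 1, 1, 0, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
# Is the second entry here (the class-1 probability) 0.4, per my reading of the docs?
print(clf.predict_proba([[0, 0]]))
```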
1 Answer
First of all, scikit-learn's decision trees operate on numerical features, not categorical ones. The learning algorithm then tries to find decision thresholds that optimize the given criterion (Gini impurity by default). This happens in a greedy fashion until a stopping criterion (like max_depth) is hit, at which point we have a leaf node.

At that leaf node (after all the feature_A < threshold_1 && feature_E > threshold_2 conditions on the path have been applied), we may be left with samples coming from different classes. This is where the cited rule comes in: the probability for each class, in that specific leaf node, is set to the observed class proportion. So the denominator is the number of training samples in that leaf, not the number of class-1 samples overall, which means rows with class != 1 that end up in the same leaf do affect the predicted probability.
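As a quick illustration (the toy data and encoding are made up for this example), you can inspect the leaf a query row lands in and check that predict_proba is just the class proportion in that leaf:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: two label-encoded features, two classes. max_depth=1 forces an
# impure leaf so the proportions are visible.
X = [[0, 0], [0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]]
y = [1, 1, 1, 1, 1, 0, 0]

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

leaf = clf.apply([[0, 0]])[0]        # index of the leaf the query row falls into
values = clf.tree_.value[leaf][0]    # per-class counts or fractions (depends on sklearn version)
print(values / values.sum())         # class proportions in that leaf ...
print(clf.predict_proba([[0, 0]]))   # ... which is exactly what predict_proba returns
```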