In Python, I’m trying to solve a linear optimization problem with the PuLP package. I have N continuous indicators and a dichotomous target. I need to select a subset of at most M indicators and a threshold for each indicator, so that if at least P indicators are above their thresholds the record is classified as 1. The optimization problem must determine these variables:
- Which are the indicators to be used
- The threshold for each used variable
N, M and P are given.
The goal is to misclassify no more than 100 0s of the target variable as 1, while maximizing the classification of 1s as 1.
Below is my example code. I don’t understand why it sets very high thresholds, so that no records are classified as 1.
import pandas as pd
import numpy as np
from pulp import LpMaximize, LpProblem, LpVariable, lpSum, value
# Sample data
np.random.seed(42)
data = pd.DataFrame(np.random.rand(1000, 10), columns=[f'X{i}' for i in range(10)]) # 10 continuous indicators
data['target'] = np.random.choice([0, 1], size=1000, p=[0.7, 0.3]) # Binary target (30% ones)
# Parameters
M = 5 # Maximum number of selected indicators
S = 3 # Minimum number of indicators above threshold to classify as 1 (P in the description above)
max_false_positives = 100 # Maximum allowed misclassified 0s
# Define model
model = LpProblem("Indicator_Selection", LpMaximize)
# Decision variables
X = {i: LpVariable(f'X_{i}', cat='Binary') for i in range(data.shape[1] - 1)} # Indicator selection
T = {i: LpVariable(f'T_{i}', lowBound=-100, upBound=100, cat='Continuous') for i in range(data.shape[1] - 1)} # Thresholds
Y = {j: LpVariable(f'Y_{j}', cat='Binary') for j in range(len(data))} # Predicted labels
# Auxiliary binary variables to represent whether each indicator exceeds its threshold
Z = {(j, i): LpVariable(f'Z_{j}_{i}', cat='Binary') for j in range(len(data)) for i in X}
# Constraint: Select at most M indicators
model += lpSum(X[i] for i in X) <= M
# Constraint: Define Z[j, i] (1 if data.iloc[j, i] >= T[i], 0 otherwise)
M_large = 1 # Upper bound for continuous values
for j in range(len(data)):
    for i in X:
        model += Z[j, i] >= (data.iloc[j, i] - T[i]) / M_large
        model += Z[j, i] <= X[i]  # Z[j, i] can only be 1 if the feature is selected
# Constraint: Ensure classification rule
for j in range(len(data)):
    model += Y[j] * S <= lpSum(Z[j, i] for i in X)
# Constraint: Limit false positives
model += lpSum(Y[j] for j in range(len(data)) if data.iloc[j, -1] == 0) <= max_false_positives
# Objective: Maximize correct classifications
model += lpSum(Y[j] for j in range(len(data)) if data.iloc[j, -1] == 1)
# Solve the model
model.solve()
# Output results
selected_features = [i for i in X if value(X[i]) == 1]
thresholds = {i: value(T[i]) for i in selected_features}
asked 2 days ago by Marco Ballerini
1 Answer
Caveat: I haven't had the time to fully look at all the details of how you are doing classification here, but in your model there is no "pressure" to reduce T. T[i] could be arbitrarily large and you'd get the same result.

You might try adding a small penalty to your objective to help minimize T. You almost certainly want to weight it so that the total penalty can never be larger than the value of 1 unit of Y, so a reasonable weight might be on the order of 1 / (number of thresholds × their upper bound); 1/500 is used as shown.
Try this:
# Objective: Maximize correct classifications
model += lpSum(Y[j] for j in range(len(data)) if data.iloc[j, -1] == 1) - lpSum(T[i] for i in range(data.shape[1] - 1))/500
I'd also recommend ginning up a small example (maybe 5 x 10) that uses fixed values you construct manually, to see whether it is working. Blobs of random numbers are tough to troubleshoot.
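For instance, a tiny hand-built instance might look like this (a sketch only: the values, the threshold of 0.5, and S = 2 are assumptions chosen so that the correct rule is known in advance, letting you check what the model recovers):

```python
import pandas as pd

# Hypothetical toy instance: 6 rows x 3 indicators, built by hand so that
# rows with target 1 have at least S = 2 indicators above 0.5,
# and rows with target 0 have at most 1.
data = pd.DataFrame({
    'X0': [0.9, 0.8, 0.7, 0.1, 0.2, 0.6],
    'X1': [0.8, 0.6, 0.2, 0.3, 0.1, 0.2],
    'X2': [0.1, 0.7, 0.9, 0.2, 0.9, 0.3],
})
data['target'] = [1, 1, 1, 0, 0, 0]

S = 2
thresholds = {'X0': 0.5, 'X1': 0.5, 'X2': 0.5}  # the rule the model should recover

# Evaluate the rule directly: count, per row, how many indicators
# exceed their threshold, and classify as 1 when that count reaches S.
above = sum((data[c] > t).astype(int) for c, t in thresholds.items())
predicted = (above >= S).astype(int)
print(predicted.tolist())  # matches data['target'] exactly on this instance
```

If you then feed this DataFrame into your MIP, a correct formulation should find thresholds that achieve the same perfect separation; if it doesn't, the bug is in the constraints rather than in the data.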