In Python, I’m trying to solve a linear optimization problem with the PuLP package. I have N continuous indicators and a dichotomous target. I need to select a subset of at most M indicators and a threshold for each indicator, so that if at least P indicators are above their thresholds the record is classified as 1. The optimization problem must determine these variables:
- Which are the indicators to be used
- The threshold for each used variable
N, M and P are given.
The goal is to misclassify no more than 100 0s of the target variable as 1, while maximizing the classification of 1s as 1.
Below is my example code. I don’t understand why it sets very high thresholds, so that no records are classified as 1.
import pandas as pd
import numpy as np
from pulp import LpMaximize, LpProblem, LpVariable, lpSum, value
# Sample data
np.random.seed(42)
data = pd.DataFrame(np.random.rand(1000, 10), columns=[f'X{i}' for i in range(10)]) # 10 continuous indicators
data['target'] = np.random.choice([0, 1], size=1000, p=[0.7, 0.3]) # Binary target (30% ones)
# Parameters
M = 5 # Maximum number of selected indicators
S = 3 # Minimum number of indicators above threshold to classify as 1 (P in the description above)
max_false_positives = 100 # Maximum allowed misclassified 0s
# Define model
model = LpProblem("Indicator_Selection", LpMaximize)
# Decision variables
X = {i: LpVariable(f'X_{i}', cat='Binary') for i in range(data.shape[1] - 1)} # Indicator selection
T = {i: LpVariable(f'T_{i}', lowBound=-100, upBound=100, cat='Continuous') for i in range(data.shape[1] - 1)} # Thresholds
Y = {j: LpVariable(f'Y_{j}', cat='Binary') for j in range(len(data))} # Predicted labels
# Auxiliary binary variables to represent whether each indicator exceeds its threshold
Z = {(j, i): LpVariable(f'Z_{j}_{i}', cat='Binary') for j in range(len(data)) for i in X}
# Constraint: Select at most M indicators
model += lpSum(X[i] for i in X) <= M
# Constraint: Define Z[j, i] (1 if data.iloc[j, i] >= T[i], 0 otherwise)
M_large = 1 # Upper bound for continuous values
for j in range(len(data)):
    for i in X:
        model += Z[j, i] >= (data.iloc[j, i] - T[i]) / M_large
        model += Z[j, i] <= X[i]  # Z[j, i] can only be 1 if the feature is selected
# Constraint: Ensure classification rule
for j in range(len(data)):
    model += Y[j] * S <= lpSum(Z[j, i] for i in X)
# Constraint: Limit false positives
model += lpSum(Y[j] for j in range(len(data)) if data.iloc[j, -1] == 0) <= max_false_positives
# Objective: Maximize correct classifications
model += lpSum(Y[j] for j in range(len(data)) if data.iloc[j, -1] == 1)
# Solve the model
model.solve()
# Output results
selected_features = [i for i in X if value(X[i]) == 1]
thresholds = {i: value(T[i]) for i in selected_features}
asked 2 days ago by Marco Ballerini
1 Answer
Caveat: I haven't had the time to fully look at all the details of how you are doing classification here, but in your model there is no "pressure" to reduce T. T[i] could be arbitrarily large and you'd get the same result.

You might try adding a small penalty to your objective to help minimize T. You almost certainly want to weight it so that the total penalty can never be larger than the value of 1 unit of Y, so a reasonable weight might be on the order of 1 / (number of thresholds × their upper bound); 1/500 is used as shown.
Try this:
# Objective: Maximize correct classifications
model += lpSum(Y[j] for j in range(len(data)) if data.iloc[j, -1] == 1) - lpSum(T[i] for i in range(data.shape[1] - 1))/500
I'd also recommend ginning up a small example (maybe 5 x 10) that uses fixed values you construct manually, to see whether it is working. Blobs of random numbers are tough to troubleshoot.
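For instance, a tiny hand-built instance might look like this (a sketch only: the values, the threshold of 0.5, and S = 2 are assumptions chosen so that the correct rule is known in advance, letting you check what the model recovers):

```python
import pandas as pd

# Hypothetical toy instance: 6 rows x 3 indicators, built by hand so that
# rows with target 1 have at least S = 2 indicators above 0.5,
# and rows with target 0 have at most 1.
data = pd.DataFrame({
    'X0': [0.9, 0.8, 0.7, 0.1, 0.2, 0.6],
    'X1': [0.8, 0.6, 0.2, 0.3, 0.1, 0.2],
    'X2': [0.1, 0.7, 0.9, 0.2, 0.9, 0.3],
})
data['target'] = [1, 1, 1, 0, 0, 0]

S = 2
thresholds = {'X0': 0.5, 'X1': 0.5, 'X2': 0.5}  # the rule the model should recover

# Evaluate the rule directly: count, per row, how many indicators
# exceed their threshold, and classify as 1 when that count reaches S.
above = sum((data[c] > t).astype(int) for c, t in thresholds.items())
predicted = (above >= S).astype(int)
print(predicted.tolist())  # matches data['target'] exactly on this instance
```

If you then feed this DataFrame into your MIP, a correct formulation should find thresholds that achieve the same perfect separation; if it doesn't, the bug is in the constraints rather than in the data.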