My Question
I know this is not efficient/optimal for this data; this is just for study and me trying (so hard...) to understand it.
I have a DataFrame, iris_df, with 4 numerical features and a class label with 3 unique string values.
I would like to be able to create a data pipeline that would accept a tf.data.Dataset consisting of both the raw features and the string labels -> process both the features and labels -> feed them into a functional model.
I know it can easily be done with sklearn outside of the pipeline, but I'd love to know if it's possible to do it all within a pipeline using TensorFlow tools.
What I've tried - my best result
Create the Dataset
feats_df = iris_df.drop(labels='Species', axis=1)
lbl_df = iris_df['Species']
# I'm skipping the splitting into training, testing etc.
train_ds = tf.data.Dataset.from_tensor_slices((feats_df, lbl_df))
Create preprocessing layers
# Normalizing features
normalizer = tf.keras.layers.Normalization()
features_ds = train_ds.map(lambda x, y: x) # Get only features
normalizer.adapt(features_ds)
# One-Hot Encode Labels
oh_encoder = tf.keras.layers.StringLookup(output_mode="one_hot")
labels_ds = train_ds.map(lambda x, y: y) # Get only labels
oh_encoder.adapt(labels_ds)
Define the Model
raw_features = tf.keras.Input(shape=(4,), name="Feature Input")
raw_labels = tf.keras.Input(shape=(1,), name="Label Input")
normalized_features = normalizer(raw_features)
encoded_labels = oh_encoder(raw_labels)
preprocessed_inputs = tf.keras.layers.concatenate([normalized_features, encoded_labels], axis=1)
x = tf.keras.layers.Dense(units=16, activation="relu", name="Hidden1")(preprocessed_inputs)
x = tf.keras.layers.Dense(units=8, activation="relu", name="Hidden2")(x)
output = tf.keras.layers.Dense(units=4, activation="softmax", name="Output")(x)
model1 = tf.keras.Model(inputs=[raw_features, raw_labels], outputs=output, name="Model1")
model1.compile(
    optimizer='adam',
    loss={"Output": tf.keras.losses.CategoricalCrossentropy()},
    metrics={"Output": [tf.keras.metrics.Accuracy()]}
)
This graph shows what I'm trying to achieve: I pass a Dataset that contains unprocessed features and labels to their respective preprocessing layers, then they get concatenated and passed to the neural network.
The Problem
Unfortunately this approach doesn't work. I found some posts on Stack Overflow about this, but none of them seemed to have been fully answered. After hours of trying (so, many, hours T.T) I realized how to pass the data into model.fit() without errors there:
# I know it's missing validation data etc.
model1.fit(
    x=train_ds.map(lambda x, y: ({"Feature Input": x, "Label Input": y}, y)),
    epochs=2)
This, however, results in a problem where the loss function receives the unprocessed labels, in the form of strings, from y - the 2nd element of the tuple.
Custom Loss Function
I have also tried implementing a custom loss function that would encode the string labels before passing them to the original tf loss function:
def custom_loss(y_true, y_pred):
    # One-hot encode the raw string labels
    y_true_encoded = oh_encoder(y_true)
    # Compute the categorical crossentropy loss
    loss = tf.keras.losses.categorical_crossentropy(y_true_encoded, y_pred)
    return loss
Unfortunately this seems to result in the same problem (this is a bit too advanced for me to interpret) - namely, the labels aren't being encoded:
Cast string to float is not supported [[{{node Model1_1/Cast_1}}]] [Op:__inference_multi_step_on_iterator_6270]
Question Restated
So my question is: is there a way to somehow route the one-hot encoded labels from the StringLookup layer to the loss function? Or maybe there is a completely different approach?
If that's not possible, what is the best way to encode the labels using only TensorFlow tools (without resorting to sklearn) so that it scales to giant datasets?
PS. I do apologize for any heresies in this code - I'm not a programmer or a Computer Science student; I've been self-studying from the very start.