My Question
I know this is not efficient/optimal for this data; this is just for study and me trying (so hard...) to understand it.
I have a DataFrame, iris_df, with 4 numerical features and a class label with 3 unique string values.
I would like to be able to create a data pipeline that would accept a tf.data.Dataset consisting of both the raw features and the string labels -> process both the features and labels -> feed them into a functional model.
I know it can easily be done with sklearn outside of the pipeline, but I'd love to know if it's possible to do it all within a pipeline using TensorFlow tools.
What I've tried - my best result
Create the Dataset
feats_df = iris_df.drop(labels='Species', axis=1)
lbl_df = iris_df['Species']
# I'm skipping the splitting into training, testing etc.
train_ds = tf.data.Dataset.from_tensor_slices((feats_df, lbl_df))
Create preprocessing layers
# Normalizing features
normalizer = tf.keras.layers.Normalization()
features_ds = train_ds.map(lambda x, y: x) # Get only features
normalizer.adapt(features_ds)
# One-Hot Encode Labels
oh_encoder = tf.keras.layers.StringLookup(output_mode="one_hot")
labels_ds = train_ds.map(lambda x, y: y) # Get only labels
oh_encoder.adapt(labels_ds)
Define the Model
raw_features = tf.keras.Input(shape=(4,), name="Feature Input")
raw_labels = tf.keras.Input(shape=(1,), name="Label Input")
normalized_features = normalizer(raw_features)
encoded_labels = oh_encoder(raw_labels)
preprocessed_inputs = tf.keras.layers.concatenate([normalized_features, encoded_labels], axis=1)
x = tf.keras.layers.Dense(units=16, activation="relu", name="Hidden1")(preprocessed_inputs)
x = tf.keras.layers.Dense(units=8, activation="relu", name="Hidden2")(x)
output = tf.keras.layers.Dense(units=4, activation="softmax", name="Output")(x)
model1 = tf.keras.Model(inputs=[raw_features, raw_labels], outputs=output, name="Model1")
model1.compile(
    optimizer='adam',
    loss={"Output": tf.keras.losses.CategoricalCrossentropy()},
    metrics={"Output": [tf.keras.metrics.Accuracy()]}
)
This graph shows what I'm trying to achieve: I pass a Dataset that contains unprocessed features and labels to their respective preprocessing layers, then they get concatenated and passed to the neural network.
The Problem
Unfortunately this approach doesn't work. I found some posts on Stack Overflow about this, but none of them seemed to have been fully answered. After hours of trying (so, many, hours T.T) I realized how to pass the data into model.fit() without errors there:
# I know it's missing validation data etc.
model1.fit(
    x=train_ds.map(lambda x, y: ({"Feature Input": x, "Label Input": y}, y)),
    epochs=2)
This, however, results in a problem where the loss function receives the unprocessed labels, in the form of strings, from y - the 2nd element of the tuple.
Custom Loss Function
I have also tried implementing a custom loss function that would encode the string labels before passing them to the original tf loss function:
def custom_loss(y_true, y_pred):
    # One-hot encode the raw string labels
    y_true_encoded = oh_encoder(y_true)
    # Compute the categorical crossentropy loss
    loss = tf.keras.losses.categorical_crossentropy(y_true_encoded, y_pred)
    return loss
Unfortunately this seems to result in the same problem (this is a bit too advanced for me to interpret) - namely, the labels aren't being encoded:
Cast string to float is not supported [[{{node Model1_1/Cast_1}}]] [Op:__inference_multi_step_on_iterator_6270]
Question Restated
So my question is: is there a way to somehow route the one-hot encoded labels from the StringLookup layer to the loss function? Or maybe there is a completely different approach?
If that's not possible, what is the best way to encode the labels using only TensorFlow tools (without resorting to sklearn) so that it scales to giant datasets?
PS. I do apologize for any heresies in this code - I'm not a programmer or a Computer Science student; I've been self-studying from the very start.