I tried exporting a YOLOv11 model to TensorFlow, and the export log reported:
'yolo11n.pt' with input shape (1, 3, 640, 640) BCHW and output shape(s) (1, 84, 8400) (5.4 MB)
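For reference, the export was done with the standard Ultralytics API, roughly like this:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
# Export to a TensorFlow SavedModel directory
model.export(format="saved_model")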
Now I have this model summary in Keras 3:
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_4 (InputLayer) │ (None, 640, 640, 3) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ tfsm_layer_8 (TFSMLayer) │ (1, 84, 8400) │ 0 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 0 (0.00 B)
Trainable params: 0 (0.00 B)
Non-trainable params: 0 (0.00 B)
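For context, I wrapped the exported SavedModel in Keras 3 roughly like this (the directory name is a guess; use whatever path the export actually produced):

import keras

# Path produced by the export; may differ on your machine
yolo = keras.layers.TFSMLayer("yolo11n_saved_model", call_endpoint="serving_default")
inputs = keras.Input(shape=(640, 640, 3))
model = keras.Model(inputs, yolo(inputs))
model.summary()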
The input shape is clear to me, but the output shape is not. According to this explanation I found:
The output shapes for YOLOv8n and YOLOv8n-seg models represent different components. For YOLOv8n, the shape (1, 84, 8400) includes 80 classes and 4 bounding box parameters. For YOLOv8n-seg, the first output (1, 116, 8400) includes 80 classes, 4 parameters, and 32 mask coefficients, while the second output (1, 32, 160, 160) represents the prototype masks.
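If I understand that correctly, the 84 rows of my (84, 8400) output break down like this (my reading, not verified against the exporter):

# Each of the 8400 columns is one candidate detection:
# rows 0-3  -> cx, cy, w, h (in pixels at the 640x640 input scale)
# rows 4-83 -> scores for the 80 COCO classes
# Note: YOLOv8/YOLO11 heads have no separate objectness score.
boxes_cxcywh = output[:4, :]               # (4, 8400)
class_scores = output[4:, :]               # (80, 8400)
confidences = class_scores.max(axis=0)     # best class score per candidate
class_ids = class_scores.argmax(axis=0)    # best class index per candidate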
I tried running inference and doing the post-processing manually, starting from code ChatGPT gave me:
import cv2
import matplotlib.pyplot as plt
import numpy as np

# output: (84, 8400) after squeezing the batch dim | image: (640, 640, 3), RGB
# Extract bounding box coordinates (rows 0-3: cx, cy, w, h).
# .copy() so the in-place math below doesn't mutate `output`.
boxes = output[:4, :].T.copy()  # Shape: (8400, 4)
# Rows 4-83 are the 80 per-class scores (there is no separate objectness),
# so the confidence is the best class score for each candidate.
class_scores = output[4:, :]               # Shape: (80, 8400)
confidences = class_scores.max(axis=0)     # Shape: (8400,)
class_ids = class_scores.argmax(axis=0)    # Shape: (8400,)
# Convert (center x, center y, width, height) → (x1, y1, x2, y2)
boxes[:, 0] -= boxes[:, 2] / 2  # x1 = cx - w/2
boxes[:, 1] -= boxes[:, 3] / 2  # y1 = cy - h/2
boxes[:, 2] += boxes[:, 0]      # x2 = x1 + w
boxes[:, 3] += boxes[:, 1]      # y2 = y1 + h
# Filter by confidence threshold (kept low for debugging)
threshold = 0.1
indices = np.where(confidences > threshold)[0]
filtered_boxes = boxes[indices]
filtered_confidences = confidences[indices]
# Draw raw bounding boxes (no NMS yet)
for i in range(len(filtered_boxes)):
    x1, y1, x2, y2 = map(int, filtered_boxes[i])
    # Clamp coordinates to the image bounds
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(image.shape[1], x2), min(image.shape[0], y2)
    # Draw the bounding box (red, assuming the image is RGB)
    cv2.rectangle(image, (x1, y1), (x2, y2), (255, 0, 0), 2)
    # Display the confidence score (for debugging)
    cv2.putText(image, f"{filtered_confidences[i]:.2f}", (x1, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)
# Show the image
plt.figure(figsize=(10, 6))
plt.imshow(image)
plt.axis("off")
plt.show()
And here is the output:
I'm not sure whether my post-processing implementation is correct, since I don't know how to interpret the output tensor shape.
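One thing I suspect is missing is non-maximum suppression; without it, every candidate above the threshold draws its own overlapping box. Would adding something like this (untested sketch using OpenCV's NMS) be the right direction?

# cv2.dnn.NMSBoxes expects boxes as (x, y, w, h), i.e. top-left corner + size
xywh = filtered_boxes.copy()
xywh[:, 2] -= xywh[:, 0]  # w = x2 - x1
xywh[:, 3] -= xywh[:, 1]  # h = y2 - y1
keep = cv2.dnn.NMSBoxes(xywh.tolist(), filtered_confidences.tolist(), 0.25, 0.45)
keep = np.asarray(keep).reshape(-1)  # older OpenCV versions nest the indices
final_boxes = filtered_boxes[keep]
final_confidences = filtered_confidences[keep]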