I tried exporting a YOLOv11 model to TensorFlow, and the export log reported:
'yolo11n.pt' with input shape (1, 3, 640, 640) BCHW and output shape(s) (1, 84, 8400) (5.4 MB)
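For reference, the export was done with the standard Ultralytics API, roughly like this:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
# Export to a TensorFlow SavedModel directory
model.export(format="saved_model")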
Now I have this model summary in Keras 3:
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_4 (InputLayer) │ (None, 640, 640, 3) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ tfsm_layer_8 (TFSMLayer) │ (1, 84, 8400) │ 0 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 0 (0.00 B)
Trainable params: 0 (0.00 B)
Non-trainable params: 0 (0.00 B)
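For context, I wrapped the exported SavedModel in Keras 3 roughly like this (the directory name is a guess; use whatever path the export actually produced):

import keras

# Path produced by the export; may differ on your machine
yolo = keras.layers.TFSMLayer("yolo11n_saved_model", call_endpoint="serving_default")
inputs = keras.Input(shape=(640, 640, 3))
model = keras.Model(inputs, yolo(inputs))
model.summary()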
The input shape is clear to me, but the output shape is not. According to this explanation I found:
The output shapes for YOLOv8n and YOLOv8n-seg models represent different components. For YOLOv8n, the shape (1, 84, 8400) includes 80 classes and 4 bounding box parameters. For YOLOv8n-seg, the first output (1, 116, 8400) includes 80 classes, 4 parameters, and 32 mask coefficients, while the second output (1, 32, 160, 160) represents the prototype masks.
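If I understand that correctly, the 84 rows of my (84, 8400) output break down like this (my reading, not verified against the exporter):

# Each of the 8400 columns is one candidate detection:
# rows 0-3  -> cx, cy, w, h (in pixels at the 640x640 input scale)
# rows 4-83 -> scores for the 80 COCO classes
# Note: YOLOv8/YOLO11 heads have no separate objectness score.
boxes_cxcywh = output[:4, :]               # (4, 8400)
class_scores = output[4:, :]               # (80, 8400)
confidences = class_scores.max(axis=0)     # best class score per candidate
class_ids = class_scores.argmax(axis=0)    # best class index per candidate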
I tried running inference and doing the post-processing manually, starting from code ChatGPT gave me:
import cv2
import matplotlib.pyplot as plt
import numpy as np

# output: (84, 8400) after squeezing the batch dim | image: (640, 640, 3), RGB
# Extract bounding box coordinates (rows 0-3: cx, cy, w, h).
# .copy() so the in-place math below doesn't mutate `output`.
boxes = output[:4, :].T.copy()  # Shape: (8400, 4)
# Rows 4-83 are the 80 per-class scores (there is no separate objectness),
# so the confidence is the best class score for each candidate.
class_scores = output[4:, :]               # Shape: (80, 8400)
confidences = class_scores.max(axis=0)     # Shape: (8400,)
class_ids = class_scores.argmax(axis=0)    # Shape: (8400,)
# Convert (center x, center y, width, height) → (x1, y1, x2, y2)
boxes[:, 0] -= boxes[:, 2] / 2  # x1 = cx - w/2
boxes[:, 1] -= boxes[:, 3] / 2  # y1 = cy - h/2
boxes[:, 2] += boxes[:, 0]      # x2 = x1 + w
boxes[:, 3] += boxes[:, 1]      # y2 = y1 + h
# Filter by confidence threshold (kept low for debugging)
threshold = 0.1
indices = np.where(confidences > threshold)[0]
filtered_boxes = boxes[indices]
filtered_confidences = confidences[indices]
# Draw raw bounding boxes (no NMS yet)
for i in range(len(filtered_boxes)):
    x1, y1, x2, y2 = map(int, filtered_boxes[i])
    # Clamp coordinates to the image bounds
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(image.shape[1], x2), min(image.shape[0], y2)
    # Draw the bounding box (red, assuming the image is RGB)
    cv2.rectangle(image, (x1, y1), (x2, y2), (255, 0, 0), 2)
    # Display the confidence score (for debugging)
    cv2.putText(image, f"{filtered_confidences[i]:.2f}", (x1, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)
# Show the image
plt.figure(figsize=(10, 6))
plt.imshow(image)
plt.axis("off")
plt.show()
And here is the output:
I'm not sure whether my post-processing implementation is correct, since I don't know how to interpret the output tensor shape.
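One thing I suspect is missing is non-maximum suppression; without it, every candidate above the threshold draws its own overlapping box. Would adding something like this (untested sketch using OpenCV's NMS) be the right direction?

# cv2.dnn.NMSBoxes expects boxes as (x, y, w, h), i.e. top-left corner + size
xywh = filtered_boxes.copy()
xywh[:, 2] -= xywh[:, 0]  # w = x2 - x1
xywh[:, 3] -= xywh[:, 1]  # h = y2 - y1
keep = cv2.dnn.NMSBoxes(xywh.tolist(), filtered_confidences.tolist(), 0.25, 0.45)
keep = np.asarray(keep).reshape(-1)  # older OpenCV versions nest the indices
final_boxes = filtered_boxes[keep]
final_confidences = filtered_confidences[keep]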