I’m facing an issue with a webhook system built on AWS SQS FIFO queues, EventBridge Pipes, Step Functions, and Lambda. Here’s the setup:
• SQS FIFO queue:
  • Receives webhook messages containing shop info and the webhook type.
  • Uses a composite message group ID (mall_id, shop_no, webhook_type) for FIFO ordering.
  • The deduplication ID is generated from datetime.now().timestamp().
• EventBridge Pipe:
  • Connects the SQS FIFO queue to Step Functions (which in turn triggers one or more Lambdas).
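To make the producer side concrete, here is a minimal sketch of how the send parameters are put together. The helper name `build_send_params`, the `:` separator in the group ID, and the payload shape are illustrative assumptions, not the exact production code.

```python
import json
from datetime import datetime


def build_send_params(queue_url, mall_id, shop_no, webhook_type, payload):
    """Build keyword arguments for an SQS SendMessage call (hypothetical helper)."""
    # Composite group ID: ordering applies within one (mall_id, shop_no, webhook_type) triple.
    # The ":" separator is an assumption made for this sketch.
    group_id = f"{mall_id}:{shop_no}:{webhook_type}"
    # Timestamp-based deduplication ID, as described in the setup above.
    dedup_id = str(datetime.now().timestamp())
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(payload),
        "MessageGroupId": group_id,
        "MessageDeduplicationId": dedup_id,
    }


if __name__ == "__main__":
    params = build_send_params(
        "https://example.invalid/webhooks.fifo",  # placeholder URL
        "mall1", 3, "order_created", {"order_id": 42},
    )
    print(params["MessageGroupId"])
```

The resulting dictionary can be passed to `boto3.client("sqs").send_message(**params)`.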
The Problem: On March 14th, one vendor sent approximately 800,000 messages in a short burst. According to CloudWatch metrics:
• The number of visible messages in the SQS FIFO queue skyrocketed starting around 4:30 PM.
• However, the invocation count for the EventBridge Pipe did not increase proportionally.
• Neither the EventBridge Pipe, Step Functions, nor Lambda showed any signs of throttling or increased execution durations.
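For anyone who wants to compare the same two series, this is a rough sketch of the CloudWatch queries involved. The helper name `build_metric_queries` is hypothetical. The SQS namespace and metric name are standard, but the Pipes namespace, metric name, and dimension used here are my best understanding and should be checked against what the CloudWatch console actually shows.

```python
def build_metric_queries(queue_name, pipe_name, period_seconds=60):
    """Build a MetricDataQueries list for CloudWatch GetMetricData (hypothetical helper)."""

    def query(query_id, namespace, metric, dim_name, dim_value, stat):
        return {
            "Id": query_id,  # must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": metric,
                    "Dimensions": [{"Name": dim_name, "Value": dim_value}],
                },
                "Period": period_seconds,
                "Stat": stat,
            },
            "ReturnData": True,
        }

    return [
        # Queue backlog: peak number of visible messages per period.
        query("visible", "AWS/SQS", "ApproximateNumberOfMessagesVisible",
              "QueueName", queue_name, "Maximum"),
        # Pipe activity: namespace, metric, and dimension names are assumptions to verify.
        query("invocations", "AWS/EventBridge/Pipes", "Invocations",
              "PipeName", pipe_name, "Sum"),
    ]
```

The list can be passed as `MetricDataQueries` to `boto3.client("cloudwatch").get_metric_data(...)` together with a `StartTime` and `EndTime` covering the burst window.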
I expected that since each vendor uses a separate message group ID, the heavy load from one vendor should only affect that specific group, and other vendors’ messages would still be processed in parallel. In theory, FIFO queues allow parallel processing across different message groups.
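To spell out that expectation, here is a toy model of the behaviour I had in mind. It is not how SQS is implemented. It only encodes my assumption that a group with in-flight messages is skipped while every other group stays receivable, and for simplicity it hands out at most one message per group per call. The function name `receivable` is made up for this sketch.

```python
def receivable(queue, in_flight_groups, max_messages):
    """Toy model: pick up to max_messages messages whose group is not in flight.

    queue is an ordered list of (group_id, body) tuples, oldest first.
    """
    blocked = set(in_flight_groups)
    picked = []
    for group_id, body in queue:
        if len(picked) == max_messages:
            break
        if group_id in blocked:
            continue  # this group is busy, so later messages in it must wait
        picked.append((group_id, body))
        blocked.add(group_id)  # simplification: one message per group per call
    return picked


if __name__ == "__main__":
    queue = [("A", 1), ("A", 2), ("B", 1), ("A", 3), ("C", 1)]
    # Group A is busy, yet B and C are still expected to be delivered.
    print(receivable(queue, {"A"}, 10))
```

Under this model a backlog in group A never delays groups B and C, which is the opposite of what the metrics suggest happened.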
My Questions:
1. Why is the EventBridge Pipe not consuming messages at a higher rate in response to the increased load from a single vendor?
2. Is it possible that a massive burst of messages in one FIFO message group can indirectly affect the consumption of messages in other groups?
3. Are there any known limitations or hidden concurrency controls in EventBridge Pipes that might explain why the invocation rate doesn’t scale with the incoming message volume?
4. What configuration changes or alternative approaches can be taken to mitigate this bottleneck?
Any insights, documentation references, or configuration tips would be greatly appreciated.