The title had to be shortened a bit.
The full error message is more like this:
Kafka Consumer group session timed out (in join-state steady) after X ms without a successful response from the group coordinator: revoking assignment and rejoining group
What causes this?
Context
I have a simple Python application that is intended to verify a data migration process.
- Data was migrated (copied) from one Kafka cluster to another
- The process spawns two consumers, one for each cluster, and reads events sequentially
- It verifies that the consumed data from each consumer is the same
Here are some more detailed log lines.
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #0)
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #1)
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #2)
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #3)
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #4)
%4|1739052194.834|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out 2216 in-flight, 0 retry-queued, 15394 out-queue, 1 partially-sent requests
%3|1739052194.835|FAIL|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: 192.168.0.2:9092: 17610 request(s) timed out: disconnect (average rtt 159.627ms) (after 100166ms in state UP)
%4|1739052205.056|SESSTMOUT|consumer2.topicname#consumer-2| [thrd:main]: Consumer group session timed out (in join-state steady) after 45000 ms without a successful response from the group coordinator (broker 1, last error was Local: Timed out in queue): revoking assignment and rejoining group
%4|1739052206.391|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1739052206.392|FAIL|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: 192.168.0.2:9092: 1 request(s) timed out: disconnect (average rtt 156.853ms) (after 9708ms in state UP)
%4|1739052212.861|REQTMOUT|kafka_topic_data_verify_consumer1_rightmove.property_data#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/3: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1739052212.862|FAIL|kafka_topic_data_verify_consumer1_rightmove.property_data#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator: 192.168.0.3:9092: 1 request(s) timed out: disconnect (average rtt 146.758ms) (after 118194ms in state UP)
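For anyone reading logs like these, it can help to pull the fields apart. The following is only a sketch: the names `LOG_LINE`, `QUEUE_COUNTS`, `LogRecord`, `parse_line` and `out_queue_count` are hypothetical helpers, not part of any library, and the pattern assumes the line layout seen above (a syslog-style severity digit, a timestamp, a facility tag, the client name, the thread and then the message).

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: %<severity>|<timestamp>|<facility>|<client>| [thrd:<thread>]: <message>
LOG_LINE = re.compile(
    r"^%(?P<level>\d)\|(?P<ts>\d+\.\d+)\|(?P<fac>[A-Z0-9_]+)\|(?P<client>[^|]*)\|\s*"
    r"\[thrd:(?P<thread>[^\]]+)\]:\s*(?P<message>.*)$"
)

# Matches the summary form of a REQTMOUT message, which carries queue counts.
QUEUE_COUNTS = re.compile(
    r"Timed out (\d+) in-flight, (\d+) retry-queued, (\d+) out-queue"
)


class LogRecord(NamedTuple):
    level: int
    timestamp: float
    facility: str
    client: str
    thread: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into its fields, or return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogRecord(
        int(m["level"]), float(m["ts"]), m["fac"],
        m["client"], m["thread"], m["message"],
    )


def out_queue_count(record: LogRecord) -> Optional[int]:
    """Return the number of requests still queued, if the message reports it."""
    m = QUEUE_COUNTS.search(record.message)
    return int(m.group(3)) if m else None
```

Applied to the first REQTMOUT summary line above, this reports 15394 requests still waiting in the out-queue for consumer-2, in addition to the 2216 that were in flight.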
The application is written in Python, although this is unlikely to be significant.
What is strange is that the code is very similar to an earlier program which was used to migrate the topic data. That program had a single consumer and a single producer. After each event was read, the producer was flushed and then the consumer's commit function was called.
Asked by user2138149
1 Answer
tl;dr
A consumer keeps a queue of offset commit requests waiting to be sent, much as a producer keeps a buffer of messages waiting to be dispatched.
If you commit asynchronously, this queue can fill faster than it drains. The queued requests then time out (note the 15394 out-queue requests in the log above) and the consumer eventually loses its group session.
The reason for the failure is that the consumer commit functions are being called with their default arguments.
Specifically, the asynchronous argument has a default value of True.
In the previous program, the producer flush function call is synchronous. This introduces some delay, which lets the asynchronously committing consumer catch up and reduces the number of pending commits in its queue.
In the new code, nothing introduces a comparable delay, so the consumers never get time to drain their commit queues.
The solution was to force both consumers to commit synchronously, which is what they should have been doing already, because this is the safer option.
Example:
consumer.commit(asynchronous=False)
Note that I am using the Confluent-Kafka Python library.
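To show where the synchronous commit sits in a read-and-compare loop, here is a minimal sketch. It is not the original verification code: `next_payload` and `count_mismatches` are hypothetical names, and the two consumers are passed in as parameters, assumed to behave like confluent_kafka `Consumer` objects (`poll`, `commit`) returning messages with `error()` and `value()`.

```python
from typing import Any, Optional


def next_payload(consumer: Any, timeout: float = 1.0) -> Optional[bytes]:
    """Poll one message, commit it synchronously, and return its payload.

    Returns None if nothing arrived within `timeout` seconds.
    """
    msg = consumer.poll(timeout)
    if msg is None:
        return None
    if msg.error():
        raise RuntimeError(f"consumer error: {msg.error()}")
    # asynchronous=False blocks until the commit has completed, so commit
    # requests cannot pile up in the client's outgoing queue.
    consumer.commit(message=msg, asynchronous=False)
    return msg.value()


def count_mismatches(source: Any, target: Any, limit: int) -> int:
    """Read up to `limit` message pairs and count the pairs that differ.

    Simplifying assumption: when both polls return None, the data is
    treated as exhausted. A real verifier would need a firmer end condition.
    """
    mismatches = 0
    for _ in range(limit):
        a = next_payload(source)
        b = next_payload(target)
        if a is None and b is None:
            break
        if a != b:
            mismatches += 1
    return mismatches
```

Committing synchronously after every message costs one round trip to the group coordinator per message, so the loop runs more slowly, but the number of outstanding commit requests stays bounded.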