The title had to be shortened a bit.

The full error message is more like this:

Kafka Consumer group session timed out (in join-state steady) after X ms without a successful response from the group coordinator: revoking assignment and rejoining group

What causes this?

Context

I have a simple Python application that verifies a data migration process.

Data was migrated (copied) from one Kafka cluster to another
The process spawns two consumers, one for each cluster, and reads events sequentially
It verifies that the consumed data from each consumer is the same

Here are some more detailed log lines.

%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #0)
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #1)
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #2)
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #3)
%5|1739052194.708|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out OffsetCommitRequest in flight (after 158ms, timeout #4)
%4|1739052194.834|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out 2216 in-flight, 0 retry-queued, 15394 out-queue, 1 partially-sent requests
%3|1739052194.835|FAIL|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: 192.168.0.2:9092: 17610 request(s) timed out: disconnect (average rtt 159.627ms) (after 100166ms in state UP)
%4|1739052205.056|SESSTMOUT|consumer2.topicname#consumer-2| [thrd:main]: Consumer group session timed out (in join-state steady) after 45000 ms without a successful response from the group coordinator (broker 1, last error was Local: Timed out in queue): revoking assignment and rejoining group
%4|1739052206.391|REQTMOUT|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator/1: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1739052206.392|FAIL|consumer2.topicname#consumer-2| [thrd:GroupCoordinator]: GroupCoordinator: 192.168.0.2:9092: 1 request(s) timed out: disconnect (average rtt 156.853ms) (after 9708ms in state UP)
%4|1739052212.861|REQTMOUT|kafka_topic_data_verify_consumer1_rightmove.property_data#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator/3: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1739052212.862|FAIL|kafka_topic_data_verify_consumer1_rightmove.property_data#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator: 192.168.0.3:9092: 1 request(s) timed out: disconnect (average rtt 146.758ms) (after 118194ms in state UP)

The application is written in Python, although this is unlikely to be significant.

What is strange is that this code is very similar to an earlier program which was used to migrate the topic data, and that program did not fail. The earlier program had a single consumer and a single producer. After each event was read, the producer was flushed and then the consumer commit function was called.

Asked by user2138149

Answer

tl;dr

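A consumer keeps a queue of pending offset-commit requests, much as a producer keeps a buffer of messages waiting to be sent.

If you commit asynchronously after every message, commit requests can be queued faster than the group coordinator answers them. The queued requests eventually time out ("Local: Timed out in queue" in the log above), and the consumer group session times out with them.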

The reason for the failure is that the consumer commit functions are being called with their default arguments.

Specifically, the asynchronous argument has a default value of True.

In the previous program, the producer flush call is synchronous. It blocks on every event, and that delay gives the asynchronously committing consumer time to catch up and reduce the number of pending commit requests in its queue.

The new code has no such blocking call, so nothing slows the loop down enough for the consumers to drain their commit queues.
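As a rough sanity check, the figures in the question's log lines can be put side by side. The sketch below only does arithmetic on numbers copied from those lines. It assumes, without confirmation from the log, that essentially all of the timed-out requests were offset commits, and that a blocking commit costs at least one round trip to the coordinator.

```python
# Figures copied from the REQTMOUT and FAIL log lines in the question.
in_flight = 2216        # "Timed out 2216 in-flight"
out_queue = 15394       # "15394 out-queue"
timed_out = 17610       # "17610 request(s) timed out"
uptime_ms = 100166      # "after 100166ms in state UP"
avg_rtt_ms = 159.627    # "average rtt 159.627ms"

# The in-flight and queued requests together account for every timed-out request.
pending = in_flight + out_queue

# Average rate at which requests piled up while the connection was up.
# Assumption: these were (almost) all asynchronous offset commits.
requests_per_second = timed_out / (uptime_ms / 1000)

# Upper bound on commits per second if each commit blocks for one round trip.
round_trips_per_second = 1000 / avg_rtt_ms

print(pending == timed_out)
print(round(requests_per_second), round(round_trips_per_second, 1))
```

Under those assumptions, the asynchronous loop was queuing requests far faster than a loop that waits one round trip per commit could ever issue them, which is consistent with the queue growing until the requests timed out.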

The solution was to make both consumers commit synchronously. They should have been doing this already, since it is the safer option.

Example:

consumer.commit(asynchronous=False)

Note that I am using the Confluent-Kafka Python library.
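For illustration, here is a minimal sketch of what a two-consumer verification loop with synchronous commits might look like. It is not the original program, which was not posted. The function names (`make_consumer`, `next_message`, `verify`), the configuration values, and the choice to compare key and value pairwise are all assumptions. Pairwise comparison also assumes both clusters deliver the events in the same order, which generally only holds for a single partition.

```python
def make_consumer(bootstrap_servers, group_id, topic):
    """Build a confluent-kafka consumer with auto-commit disabled (illustrative settings)."""
    # Imported lazily so the comparison loop below can be exercised with stub consumers.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([topic])
    return consumer


def next_message(consumer, timeout=10.0):
    """Return the next non-error message, or None if poll() times out."""
    msg = consumer.poll(timeout)
    if msg is None:
        return None
    if msg.error():
        raise RuntimeError(f"consumer error: {msg.error()}")
    return msg


def verify(source, target, limit):
    """Compare up to `limit` messages pairwise and return the number of matching pairs."""
    matched = 0
    for _ in range(limit):
        a = next_message(source)
        b = next_message(target)
        if a is None or b is None:
            break
        if (a.key(), a.value()) != (b.key(), b.value()):
            raise ValueError(f"mismatch after {matched} matching messages")
        # Blocking commits: each call returns only after the coordinator responds,
        # so commit requests cannot pile up in the client's queue.
        source.commit(message=a, asynchronous=False)
        target.commit(message=b, asynchronous=False)
        matched += 1
    return matched
```

A caller would build one consumer per cluster with `make_consumer(...)`, using its own broker addresses, group ids, and topic name, pass both to `verify`, and call `close()` on each consumer afterwards. Committing synchronously after every message costs a round trip per commit on each cluster. Committing less often would be faster, but that is a separate design choice from the fix described above.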
