
cassandra - Writes fail when lightweight transactions cannot reach quorum - Stack Overflow


In a three-node Cassandra cluster, I consistently run into the same fatal situation on tables that are written solely using Cassandra's lightweight transactions (CAS).

Whenever a lightweight transaction fails to reach quorum (1/2), e.g. due to high load, every subsequent attempt to write data within a transaction fails, i.e. does not return "[applied]"=true.

Using select * from system.paxos where cf_id=<id of table>, I see that there are entries, which I assume to be pending transactions.

Further, in /var/log/Cassandra/system.log I see logs like:

INFO  [ScheduledTasks:1] 2025-01-12 21:46:53,005 UncommittedTableData.java:567 - \
  Scheduling uncommitted paxos data merge task for <any other table>
INFO  [OptionalTasks:1] 2025-01-12 21:46:53,006 PaxosCleanupLocalCoordinator.java:89 - \
  Completing uncommitted paxos instances for <table in stalled state> on ranges

However, I can't figure out how to resolve this state. Running nodetool repair -full <keyspace> (and variations), as well as restarting all nodes, did not resolve the issue.

Further information:

  • Cassandra version: 4.1.5
  • replication strategy: SimpleStrategy
  • replication factor: 3

asked Jan 13 at 7:10 by PeMa, edited Jan 24 at 4:56 by Erick Ramirez

1 Answer

Lightweight transactions (LWTs) are expensive operations since they require a read-before-write, meaning the data must be read to verify the conditional IF in the statement before the write is executed.
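To make the read-before-write point concrete, here is a minimal in-memory sketch in Python. It is only a toy model of compare-and-set semantics, not Cassandra's implementation: the class `ToyTable` and its methods are hypothetical names invented for illustration, and nothing here involves replicas or Paxos. It only shows that every conditional write pays for a read first, and that the caller gets back an applied flag comparable to the "[applied]" column.

```python
from typing import Dict, Optional, Tuple


class ToyTable:
    """In-memory stand-in for a table (hypothetical, NOT Cassandra's code)."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.reads = 0   # number of read-before-write lookups performed
        self.writes = 0  # number of writes that were actually applied

    def insert_if_not_exists(self, key: str, value: str) -> Tuple[bool, Optional[str]]:
        """Model of INSERT ... IF NOT EXISTS: read first, write only if absent."""
        self.reads += 1
        existing = self.rows.get(key)
        if existing is not None:
            # Condition failed: report not applied, plus the current value.
            return False, existing
        self.rows[key] = value
        self.writes += 1
        return True, None

    def update_if_equals(self, key: str, expected: str, new_value: str) -> Tuple[bool, Optional[str]]:
        """Model of UPDATE ... IF col = expected: read first, then compare."""
        self.reads += 1
        existing = self.rows.get(key)
        if existing != expected:
            return False, existing
        self.rows[key] = new_value
        self.writes += 1
        return True, None


table = ToyTable()
print(table.insert_if_not_exists("k", "a"))   # applied, key was absent
print(table.insert_if_not_exists("k", "b"))   # not applied, "a" already there
print(table.update_if_equals("k", "a", "c"))  # applied, condition matched
print(table.reads, table.writes)              # every attempt cost a read
```

In the toy model the read is a dictionary lookup; in a real cluster that read has to be coordinated across replicas, which is where the extra cost comes from.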

Prior to Paxos v2, which was added in Cassandra 4.1 (CASSANDRA-17164), LWTs required four round-trips for the [extended] Paxos phases: prepare/promise, serial read, propose/accept, commit. LWTs therefore add significantly more load than regular writes, so when nodes are overloaded it is expected that LWTs perform even worse and fail to reach a quorum of replicas.
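As a rough illustration of the arithmetic, the sketch below computes the quorum size for a given replication factor and lists the four phases named above. The function and constant names are made up for this example. With a replication factor of 3 a quorum is 2 replicas, so a single response is not enough, which may be what the "(1/2)" in the question refers to.

```python
# The four extended Paxos phases listed above, one round-trip each.
PAXOS_V1_PHASES = ("prepare/promise", "serial read", "propose/accept", "commit")


def quorum(replication_factor: int) -> int:
    """A quorum is a strict majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reaches_quorum(responses: int, replication_factor: int) -> bool:
    """True if enough replicas responded to form a quorum."""
    return responses >= quorum(replication_factor)


rf = 3  # replication factor from the question
print(f"quorum for RF={rf}: {quorum(rf)} replicas")
print(f"1 response is enough: {reaches_quorum(1, rf)}")
print(f"2 responses are enough: {reaches_quorum(2, rf)}")
print(f"round-trips per LWT before Paxos v2: {len(PAXOS_V1_PHASES)}")
```

Each of those round-trips needs a quorum of replicas to respond, so one slow or overloaded replica out of three leaves no margin if a second one also falls behind.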

Running a repair does not solve the underlying issue of the nodes being overloaded. In fact, repairs add even more load, like adding fuel to a cluster that's already on fire.

You should address the root cause of the problem. I recommend that you review the capacity of your cluster and analyse the utilisation of resources like disk, CPU and memory. It may be necessary for you to consider adding more nodes. Cheers!
