kubernetes - ActiveMQ Artemis: Primary Pod Restart Loop with Shared Store HA

I am running ActiveMQ Artemis on Kubernetes and trying to configure high availability (HA) with shared storage. However, I am facing an issue where the primary pod goes into a restart loop after enabling the shared store HA policy.

My question is an extension of this one, as I am experiencing the same issue but have also experimented with an alternative setup.

What I Tried

Configured HA with shared store:

Primary Pod

<ha-policy>
    <shared-store>
        <primary>
            <failover-on-shutdown>true</failover-on-shutdown>
        </primary>
    </shared-store>
</ha-policy>

Secondary Pod

<ha-policy>
    <shared-store>
        <backup>
            <allow-failback>false</allow-failback>
            <failover-on-shutdown>true</failover-on-shutdown>
        </backup>
    </shared-store>
</ha-policy>

Observed Issue:

ERROR [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=Lost NodeManager lock, message=NULL
java.io.IOException: lost lock

What I changed:

Tested running without an HA policy, but in clustered mode:

Instead of defining an HA policy, I simply booted two clustered Artemis nodes using the same PVC (Persistent Volume Claim) for data storage.

Behavior observed:
- One pod becomes active while the other becomes passive.
- This resembles an active-passive setup, even though no HA policy is explicitly defined.

Questions:

Why does the shared store HA setup cause the "Lost NodeManager lock" error, but a simple clustered setup with shared storage works fine?
If I continue using a clustered setup without an HA policy but with shared storage, is this an acceptable and recommended approach?
What are the risks of running a clustered ActiveMQ Artemis setup with shared storage but without an HA policy?

Asked Mar 14 at 13:38 by Subhidh Agarwal; edited Mar 14 at 15:04 by Justin Bertram.

1 Answer

You see "Lost NodeManager lock" when using a shared-store ha-policy because that configuration causes the broker to actively monitor the shared file lock while the broker is running.

Without a shared-store ha-policy, your primary broker might lose the shared file lock without realizing it. In that case the backup would activate, and both the primary and the backup would be operating simultaneously (i.e. split brain). Therefore, I would not recommend a simple clustered setup using shared storage without a shared-store ha-policy.

I recommend you inspect the configuration and features of the shared storage device to ensure it is able to support exclusive shared file locks. I also recommend you monitor the shared storage device to ensure there are no intermittent problems that would cause the primary broker to lose its lock.
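
One rough way to check the locking behavior is a small probe run at the same time from two pods that mount the same PVC. The sketch below is only an illustration, not how Artemis itself takes its lock: the class name SharedLockProbe, the default file name probe.lock, and the hold time are made up for the example. It relies on the standard java.nio FileChannel.tryLock() call, which returns null when another process already holds an overlapping lock.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical probe: run one copy per pod against a file on the shared volume.
final class SharedLockProbe {

    // Tries to take an exclusive lock on the whole file.
    // Returns null if another process already holds an overlapping lock.
    static FileLock tryExclusive(FileChannel channel) throws IOException {
        return channel.tryLock();
    }

    public static void main(String[] args) throws Exception {
        // Assumed arguments: [0] path on the shared volume, [1] hold time in ms.
        Path file = Paths.get(args.length > 0 ? args[0] : "probe.lock");
        long holdMillis = args.length > 1 ? Long.parseLong(args[1]) : 30_000L;

        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = tryExclusive(channel);
            if (lock == null) {
                System.out.println("lock REFUSED: another process holds it");
                return;
            }
            System.out.println("lock ACQUIRED; holding for " + holdMillis + " ms");
            Thread.sleep(holdMillis); // keep the lock so the other pod can test it
            lock.release();
            System.out.println("lock released");
        }
    }
}
```

Start it in the first pod, then start it in the second pod while the first is still holding. If the second copy also reports ACQUIRED, the storage is probably not enforcing exclusive locks across nodes. A clean result here does not rule out intermittent lock loss, so monitoring the storage is still worthwhile.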

You can enable TRACE logging for org.apache.activemq.artemis.core.server.impl.FileLockNodeManager to help you identify why the primary broker is losing its shared file lock.
