How MOCO maintains MySQL clusters
For each MySQLCluster, MOCO creates and maintains a set of mysqld instances.
The set contains one primary instance and may contain multiple replica instances depending on the spec.replicas value of MySQLCluster.
This document describes how MOCO does this job safely.
Terminology
- Replication: GTID-based replication between mysqld instances.
- Cluster: a group of mysqld instances that replicate data between them.
- Primary (instance): a single source instance of mysqld in a cluster.
- Replica (instance): a read-only instance of mysqld that synchronizes data with the primary instance.
- Intermediate primary: a special primary instance that replicates data from an external mysqld.
- Errant transaction: a transaction that exists only on a replica instance.
- Errant replica: a replica instance that has errant transactions.
- Switchover: operation to change a live primary to a replica and promote a replica to the new primary.
- Failover: operation to replace a dead primary with a replica.
Prerequisites
MySQLCluster allows only positive odd numbers for spec.replicas. If the value is 1, MOCO runs a single mysqld instance without configuring replication. If it is 3 or greater, MOCO chooses one mysqld instance as the writable primary and configures all other instances as replicas of the primary instance.
status.currentPrimaryIndex in MySQLCluster records the index of the currently chosen primary instance.
Initially, status.currentPrimaryIndex is zero and therefore the index of the primary instance is zero.
As a special case, if spec.replicationSourceSecretName is set for MySQLCluster, the primary instance is configured as a replica of an external MySQL server. In this case, the primary instance will not be writable. We call this type of primary instance an intermediate primary.
If spec.replicationSourceSecretName is not set, MOCO configures semi-synchronous replication between the primary and replicas. Otherwise, the replication is asynchronous.
For semi-synchronous replication, MOCO configures rpl_semi_sync_master_timeout long enough so that it never degrades to asynchronous replication.
Likewise, MOCO sets rpl_semi_sync_master_wait_for_slave_count to (spec.replicas - 1) / 2 to make sure that at least half of the replica instances have the same commit as the primary. For example, if spec.replicas is 5, rpl_semi_sync_master_wait_for_slave_count is set to 2.
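The arithmetic can be checked with a short sketch. The function name `semi_sync_wait_count` is hypothetical and is not part of MOCO; it only restates the rule described here.

```python
def semi_sync_wait_count(spec_replicas: int) -> int:
    """Return the number of replica acknowledgements the primary waits for.

    Hypothetical helper: it restates the documented rule
    (spec.replicas - 1) / 2 and is not MOCO's actual code.
    """
    if spec_replicas < 1 or spec_replicas % 2 == 0:
        raise ValueError("spec.replicas must be a positive odd number")
    # spec.replicas counts the primary too, so there are
    # spec_replicas - 1 replica instances; half of them must acknowledge.
    return (spec_replicas - 1) // 2


for n in (3, 5, 7):
    print(n, semi_sync_wait_count(n))
```

With 5 instances there are 4 replicas, so waiting for 2 acknowledgements means half of the replicas hold every committed transaction.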
MOCO also disables relay_log_recovery because enabling it would drop the relay logs on replicas.
mysqld always starts with super_read_only=1 to prevent erroneous writes, and with skip_replica_start to prevent misconfigured replication.
moco-agent, a sidecar container for MOCO, initializes MySQL users and plugins. At the end of the initialization, it issues RESET MASTER (or RESET BINARY LOGS AND GTIDS, depending on the MySQL version) to clear the executed GTID set.
moco-agent also provides a readiness probe for the mysqld container. If a replica instance has not started its replication threads or is too far behind in executing transactions, the container and the Pod are considered unready.
Limitations
Currently, MOCO does not re-initialize data after the primary instance fails.
After failover to a replica instance, the old primary may have errant transactions because it may recover unacknowledged transactions in its binary log. This is an inevitable limitation in MySQL semi-synchronous replication.
If this happens, MOCO detects the errant transaction and will not allow the old primary to rejoin the cluster as a replica.
Users need to delete the volume data (PersistentVolumeClaim) and the Pod of the old primary to re-initialize it.
Possible states
MySQLCluster
MySQLCluster can be in one of the following states.
The initial state is Cloning if spec.replicationSourceSecretName is set, or Restoring if spec.restore is set.
Otherwise, the initial state is Incomplete.
Note that if the primary Pod is ready, its mysqld is guaranteed to be writable.
Likewise, if a replica Pod is ready, its mysqld is guaranteed to be read-only and to be running replication threads without too much delay.
- Healthy
  - All Pods are ready.
  - All replicas have no errant transactions.
  - All replicas are read-only and connected to the primary.
  - For an intermediate primary instance, the primary works as a replica of an external mysqld and is read-only.
- Cloning
  - spec.replicationSourceSecretName is set.
  - status.cloned is false.
  - The cloning result exists and is not "Completed", or there is no cloning result and the instance has no data.
    - (note: if the primary has some data and has no cloning result, the instance used to be a replica and was then promoted to the primary.)
- Restoring
  - spec.restore is set.
  - status.restoredTime is not set.
- Degraded
  - The primary Pod is ready and has not lost data.
  - For an intermediate primary instance, the primary works as a replica of an external mysqld and is read-only.
  - Half or more replicas are ready, read-only, connected to the primary, and have no errant transactions. For example, if spec.replicas is 5, two or more such replicas are needed.
  - At least one replica has some problems.
    - This also includes cases where a replica's rpl_semi_sync_master_wait_sessions is greater than 0. See related issues. #813
- Failed
  - The primary instance is not running or has lost data.
  - More than half of replicas are running and have data without errant transactions. For example, if spec.replicas is 5, three or more such replicas are needed.
- Lost
  - The primary instance is not running or has lost data.
  - Half or more replicas are not running, have lost data, or have errant transactions.
- Incomplete
  - None of the above states applies.
MOCO can recover the cluster to Healthy from Degraded, Failed, or Incomplete if all Pods are running and there are no errant transactions.
MOCO can recover the cluster to Degraded from Failed when not all Pods are running. Recovering from Failed is called failover.
MOCO cannot recover the cluster from Lost. Users need to restore data from backups.
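As a rough illustration of how the state definitions partition the possibilities, here is a minimal sketch. It leaves out Cloning and Restoring and assumes the caller has already summarized the cluster into three inputs. The function `classify_cluster`, the three-valued `primary` argument, and the two replica counters are hypothetical simplifications, not MOCO's data model.

```python
def classify_cluster(spec_replicas: int, primary: str,
                     good_replicas: int, usable_replicas: int) -> str:
    """Classify a cluster, ignoring the Cloning and Restoring states.

    Assumed inputs (all hypothetical):
      primary         -- "ready" (Pod ready, no data loss),
                         "down" (not running or lost data), or "other"
      good_replicas   -- replicas that are ready, read-only, connected to
                         the primary, and free of errant transactions
      usable_replicas -- replicas that are running and have data without
                         errant transactions
    """
    replica_count = spec_replicas - 1  # spec.replicas includes the primary

    if primary == "down":
        # "More than half" of the replicas must survive for a failover.
        if 2 * usable_replicas > replica_count:
            return "Failed"
        return "Lost"

    if primary == "ready":
        if good_replicas == replica_count:
            return "Healthy"
        # "Half or more" good replicas, with at least one problematic one.
        if 2 * good_replicas >= replica_count:
            return "Degraded"

    return "Incomplete"


print(classify_cluster(5, "ready", 2, 2))
print(classify_cluster(5, "down", 0, 3))
```

Comparing doubled counts avoids fractional halves, and the two calls reproduce the documented thresholds for spec.replicas = 5: two good replicas are enough for Degraded, and three usable replicas are needed for Failed.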
Pod
mysqld is run as a container in a Pod.
Therefore, MOCO needs to be aware of the following conditions.
- Missing: the Pod does not exist.
- Exist: the Pod exists and is neither Terminating nor Demoting.
- Terminating: the Pod exists and metadata.deletionTimestamp is not null.
- Demoting: the Pod exists and has the moco.cybozu.com/demote: true annotation.
If there are missing Pods, MOCO does nothing for the MySQLCluster.
If a primary instance Pod is Terminating or Demoting, MOCO controller changes the primary to one of the replica instances. This operation is called switchover.
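The Pod conditions can be derived from a Pod manifest as in the following sketch. It treats the Pod as a plain dictionary shaped like its JSON representation. The function name `pod_condition` is hypothetical, and checking Terminating before Demoting is an assumption, since the text does not say which wins when both apply.

```python
from typing import Optional

DEMOTE_ANNOTATION = "moco.cybozu.com/demote"


def pod_condition(pod: Optional[dict]) -> str:
    """Map a Pod manifest (or None if the Pod is absent) to a condition."""
    if pod is None:
        return "Missing"
    metadata = pod.get("metadata") or {}
    if metadata.get("deletionTimestamp") is not None:
        return "Terminating"
    # Annotation values are always strings in Kubernetes.
    annotations = metadata.get("annotations") or {}
    if annotations.get(DEMOTE_ANNOTATION) == "true":
        return "Demoting"
    return "Exist"


print(pod_condition({"metadata": {"annotations": {DEMOTE_ANNOTATION: "true"}}}))
```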
MySQL data
MOCO checks whether replica instances have errant transactions compared to the primary instance. If it detects such an instance, MOCO records the instance in MySQLCluster and excludes it from the cluster.
The user needs to delete the Pod and the volume manually and let the StatefulSet controller re-create them. After a newly initialized instance gets created, MOCO will allow it to rejoin the cluster.
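Conceptually, a replica is errant when its executed GTID set contains transactions that the primary's set lacks. The sketch below shows that set difference on the textual GTID set format. It is only an illustration: the helper names are hypothetical, tagged GTIDs are not handled, and expanding intervals into Python sets is impractical for large transaction counts. MySQL itself provides the GTID_SUBTRACT() function for this comparison.

```python
def parse_gtid_set(text: str) -> dict:
    """Parse 'uuid:1-5:8,uuid2:3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single number such as "8" is an interval of length one.
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def errant_transactions(replica_gtids: str, primary_gtids: str) -> dict:
    """Return the transactions present on the replica but not on the primary."""
    primary = parse_gtid_set(primary_gtids)
    errant = {}
    for uuid, numbers in parse_gtid_set(replica_gtids).items():
        extra = numbers - primary.get(uuid, set())
        if extra:
            errant[uuid] = sorted(extra)
    return errant


print(errant_transactions("aaaa:1-12", "aaaa:1-10"))
```

An empty result means the replica is merely behind or in sync, which is normal. Any non-empty result marks the replica as errant.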
Invariants
- By definition, the primary instance recorded in MySQLCluster has no errant transactions. It is always the single source of truth.
- Errant replicas are not treated as ready even if their Pod status is ready.
The maintenance flow
MOCO runs the following infinite loop for each MySQLCluster. It stops when MySQLCluster resource is deleted.
1. Gather the current status
2. Update status of MySQLCluster
3. Determine what MOCO should do for the cluster
4. If there is nothing to do, wait a while and go to 1
5. Do the determined operation then go to 1
The following sub-sections describe steps 1 to 3.
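The shape of this loop can be sketched as follows. All parameter names (`gather`, `update_status`, `decide`, `is_deleted`) are hypothetical callables supplied by the caller; this is not MOCO's controller code, only the control flow described above.

```python
import time


def maintain(gather, update_status, decide, is_deleted,
             interval=1.0, sleep=time.sleep):
    """Run the maintenance loop until the MySQLCluster resource is deleted."""
    while not is_deleted():
        status = gather()            # 1. gather the current status
        update_status(status)        # 2. update status of MySQLCluster
        operation = decide(status)   # 3. determine what to do
        if operation is None:
            sleep(interval)          # 4. nothing to do: wait, then go to 1
            continue
        operation()                  # 5. do the operation, then go to 1
```

Passing `sleep` as a parameter is a design choice made here so that the loop can be exercised in tests without real delays.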
Gather the current status
MOCO gathers the information from kube-apiserver and mysqld as follows:
- MySQLCluster resource
- Pod resources
  - If some of the Pods are missing, MOCO does nothing.
- mysqld
  - SHOW REPLICAS (on the primary)
  - SHOW REPLICA STATUS (on the replicas)
  - Global variables such as gtid_executed or super_read_only
  - Result of CLONE from performance_schema.clone_status table
If MOCO cannot connect to an instance for a certain period, that instance is considered failed.
Update status of MySQLCluster
In this phase, MOCO updates status field of MySQLCluster as follows:
- Determine the current MySQLCluster state.
- Add or update type=Initialized condition to status.conditions as
  - True if the cluster state is not Cloning.
  - otherwise, False.
- Add or update type=Available condition to status.conditions as
  - True if the cluster state is Healthy or Degraded.
  - otherwise, False.
- Add or update type=Healthy condition to status.conditions as
  - True if the cluster state is Healthy.
  - otherwise, False.
  - The Reason field is set to the cluster state such as "Failed" or "Incomplete".
- Set the number of ready replica Pods to status.syncedReplicas.
- Add newly found errant replicas to status.errantReplicaList.
- Remove re-initialized and/or no-longer errant replicas from status.errantReplicaList.
- Set status.errantReplicas to the length of status.errantReplicaList.
- Set status.cloned to true if spec.replicationSourceSecretName is set and the state is not Cloning.
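The mapping from cluster state to the three conditions is mechanical, as this sketch shows. The function `conditions_for` and the dictionary layout are illustrative assumptions; only the condition types, their truth rules, and the use of the state as the reason come from the list above.

```python
def conditions_for(state: str) -> list:
    """Build the Initialized, Available, and Healthy conditions for a state."""
    def condition(type_: str, value: bool) -> dict:
        # Kubernetes condition statuses are the strings "True" and "False".
        return {"type": type_, "status": "True" if value else "False"}

    healthy = condition("Healthy", state == "Healthy")
    healthy["reason"] = state  # e.g. "Failed" or "Incomplete"

    return [
        condition("Initialized", state != "Cloning"),
        condition("Available", state in ("Healthy", "Degraded")),
        healthy,
    ]


for c in conditions_for("Degraded"):
    print(c)
```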
Determine what MOCO should do for the cluster
The operation depends on the current cluster state.
The operation and its result are recorded as Events of MySQLCluster resource.
cf. Application Introspection and Debugging
Healthy
If the primary instance Pod is Terminating or Demoting, switch the primary instance to another replica. Otherwise, just wait a while.
The switchover is done as follows. It takes at least several seconds for a new primary to become writable.
- Make the primary instance super_read_only=1.
- Kill all existing connections except ones from localhost and ones for MOCO.
- Wait for a replica to catch up with the executed GTID set of the primary instance.
- Set status.currentPrimaryIndex to the replica's index.
- If the old primary is Demoting, remove the moco.cybozu.com/demote annotation from the Pod.
Cloning
Execute CLONE INSTANCE on the intermediate primary instance to clone data from an external MySQL instance.
If the cloning succeeds, do the same as in the Intermediate case.
Restoring
Do nothing.
Degraded
First, check if the primary instance Pod is Terminating or Demoting, and if it is, do the switchover as in the Healthy case.
Then, do the same as in the Intermediate case to try to fix the problems. It is not possible to recover the cluster to Healthy if there are errant or stopped replicas, though.
Failed
MOCO chooses the most advanced instance as the new primary instance. "Most advanced" means that its retrieved GTID set is a superset of those of all other replicas, except for those that have errant transactions.
To prevent accidental writes to the old primary instance (so-called split-brain), MOCO stops replication IO_THREAD for all replicas. This way, the old primary cannot get necessary acks from replicas to write further transactions.
The failover is done as follows:
- Stop IO_THREAD on all replicas.
- Choose the most advanced replica as the new primary. Errant replicas recorded in MySQLCluster are excluded from the candidates.
- Wait for the replica to execute its entire retrieved GTID set.
- Update status.currentPrimaryIndex to the new primary's index.
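The selection of the most advanced replica can be illustrated with a small sketch. To stay short, it assumes each replica's retrieved GTID set has already been expanded into a Python set of (server UUID, transaction number) pairs. The function `choose_new_primary` and this representation are hypothetical, and returning the lowest index among equally advanced candidates is an arbitrary choice made here.

```python
def choose_new_primary(retrieved: dict, errant: set):
    """Pick the replica whose retrieved GTID set covers all other candidates.

    retrieved -- {instance index: set of (uuid, transaction number)}
    errant    -- indexes of replicas recorded as errant; they are excluded
    Returns the chosen index, or None if no candidate covers the others.
    """
    candidates = {i: g for i, g in retrieved.items() if i not in errant}
    for index in sorted(candidates):
        # ">=" on Python sets tests for superset-or-equal.
        if all(candidates[index] >= other for other in candidates.values()):
            return index
    return None


retrieved = {
    1: {("aaaa", 1), ("aaaa", 2)},
    2: {("aaaa", 1), ("aaaa", 2), ("aaaa", 3)},
    3: {("aaaa", 1), ("aaaa", 2), ("aaaa", 3), ("bbbb", 1)},
}
print(choose_new_primary(retrieved, errant={3}))
```

In this example instance 3 holds the most transactions, but it is excluded as errant, so instance 2 is chosen.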
Lost
Nothing can be done.
Intermediate
- On the primary that was an intermediate primary, wait for all the retrieved GTID set to be executed.
- Start replication between the primary and non-errant replicas.
  - If a replica has no data, MOCO clones the primary data to the replica first.
- Stop replication of errant replicas.
- Set super_read_only=1 for replica instances that are writable.
- Adjust the moco.cybozu.com/role label of Pods according to their roles.
  - For errant replicas, the label is removed to prevent users from reading inconsistent data.
- Finally, make the primary mysqld writable if the primary is not an intermediate primary.