MOCO documentation

moco logo

This is the documentation site for MOCO. MOCO is a Kubernetes operator for MySQL created and maintained by Cybozu.

Getting started

Setup

Quick setup

You can choose between two installation methods.

MOCO depends on cert-manager. If cert-manager is not installed on your cluster, install it as follows:

$ curl -fsLO https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml
$ kubectl apply -f cert-manager.yaml

Install using raw manifests:

$ curl -fsLO https://github.com/cybozu-go/moco/releases/latest/download/moco.yaml
$ kubectl apply -f moco.yaml

Install using Helm chart:

$ helm repo add moco https://cybozu-go.github.io/moco/
$ helm repo update
$ helm install --create-namespace --namespace moco-system moco moco/moco

Customize manifests

If you want to edit the manifests, the config/ directory contains the source YAML for kustomize.

Next step

Read usage.md and create your first MySQL cluster!

MOCO Helm Chart

How to use MOCO Helm repository

You need to add this repository to your Helm repositories:

$ helm repo add moco https://cybozu-go.github.io/moco/
$ helm repo update

Quick start

Installing cert-manager

$ curl -fsL https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml | kubectl apply -f -

Installing the Chart

NOTE:

This installation method requires cert-manager to be installed beforehand.

To install the chart with the release name moco in a dedicated namespace (recommended):

$ helm install --create-namespace --namespace moco-system moco moco/moco

Specify parameters using the --set key=value[,key=value] argument to helm install.

Alternatively, a YAML file that specifies the values for the parameters can be provided like this:

$ helm install --create-namespace --namespace moco-system moco -f values.yaml moco/moco
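For example, a minimal values.yaml overriding the parameters listed under Values might look like the following (the tag value is hypothetical):

```yaml
image:
  repository: ghcr.io/cybozu-go/moco
  tag: "0.10.0"   # hypothetical tag; defaults to the chart's appVersion
```

The same values can also be given on the command line with --set image.tag=0.10.0.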

Values

  • image.repository (string, default "ghcr.io/cybozu-go/moco"): MOCO image repository to use.
  • image.tag (string, default {{ .Chart.AppVersion }}): MOCO image tag to use.

Generate Manifests

You can use the helm template command to render manifests.

$ helm template --namespace moco-system moco moco/moco

Upgrade CRDs

There is no support at this time for upgrading or deleting CRDs using Helm. Users must manually upgrade the CRD if there is a change in the CRD used by MOCO.

https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#install-a-crd-declaration-before-using-the-resource

Release Chart

The MOCO Helm Chart is released independently of MOCO itself. This keeps the MOCO version from being bumped merely because the Helm Chart changed.

You must change the version of Chart.yaml when making changes to the Helm Chart.

Pushing a tag like chart-v<chart version> will cause GitHub Actions to release the chart. Chart versions are expected to follow Semantic Versioning. If the chart version in the tag does not match the version listed in Chart.yaml, the release will fail.
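For example, assuming Chart.yaml declares version: 0.1.0 (a hypothetical version), the release would be triggered like this:

```shell
$ git tag chart-v0.1.0          # must match the version in Chart.yaml
$ git push origin chart-v0.1.0
```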

Installing kubectl-moco

kubectl-moco is a plugin for kubectl to control MOCO MySQL clusters.

Pre-built binaries are available on GitHub releases for Windows, Linux, and macOS.

Download one of the binaries for your OS and place it in a directory in your PATH.

$ curl -fsL -o /path/to/bin/kubectl-moco https://github.com/cybozu-go/moco/releases/latest/download/kubectl-moco-linux-amd64
$ chmod a+x /path/to/bin/kubectl-moco

Check the installation by running kubectl moco -h.

$ kubectl moco -h
the utility command for MOCO.

Usage:
  kubectl-moco [command]

Available Commands:
  credential  Fetch the credential of a specified user
  help        Help about any command
  mysql       Run mysql command in a specified MySQL instance
  switchover  Switch the primary instance

...

How to use MOCO

After setting up MOCO, you can create MySQL clusters with a custom resource called MySQLCluster.

Basics

MOCO creates a cluster of mysqld instances for each MySQLCluster. A cluster can consist of 1, 3, or 5 mysqld instances.

MOCO configures semi-synchronous GTID-based replication between mysqld instances in a cluster if the cluster size is 3 or 5. A 3-instance cluster can tolerate up to 1 replica failure, and a 5-instance cluster can tolerate up to 2 replica failures.

In a cluster, there is only one instance called the primary. The primary instance is the source of truth: it is the only writable instance in the cluster and the source of the replication. All other instances are called replicas. A replica is a read-only instance and replicates data from the primary.

Limitations

Errant replicas

An inherent limitation of GTID-based semi-synchronous replication is that a failed instance can be left with errant transactions. If this happens, the instance needs to be re-created by removing all its data.

MOCO does not re-create such an instance. It only detects instances having errant transactions and excludes them from the cluster. Users need to monitor them and re-create the instances.

Read-only primary

From time to time, MOCO sets the primary mysqld instance to read-only for a switchover or other reasons. Applications that use MOCO MySQL need to be aware of this.

Creating clusters

Creating an empty cluster

An empty cluster always has a writable instance called the primary. All other instances are called replicas. Replicas are read-only and replicate data from the primary.

The following YAML creates a three-instance cluster. It sets a Pod anti-affinity so that all instances are scheduled to different Nodes, and sets limits for memory and CPU to give the Pods the Guaranteed QoS class.

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  # replicas is the number of mysqld Pods.  The default is 1.
  replicas: 3
  podTemplate:
    spec:
      # Make the data directory writable. If moco-init fails with "Permission denied", uncomment the following settings.
      # securityContext:
      #   fsGroup: 10000
      #   fsGroupChangePolicy: "OnRootMismatch"  # available since k8s 1.20
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - mysql
              - key: app.kubernetes.io/instance
                operator: In
                values:
                - test
            topologyKey: "kubernetes.io/hostname"
      containers:
      # At least a container named "mysqld" must be defined.
      - name: mysqld
        image: quay.io/cybozu/mysql:8.0.26
        # By limiting CPU and memory, Pods will have Guaranteed QoS class.
        # requests can be omitted; it will be set to the same value as limits.
        resources:
          limits:
            cpu: "10"
            memory: "10Gi"
  volumeClaimTemplates:
  # At least a PVC named "mysql-data" must be defined.
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

There are other example manifests in the examples directory.

The complete reference of MySQLCluster is crd_mysqlcluster.md.

Creating a cluster that replicates data from an external mysqld

Let's call the source mysqld instance donor.

We use the clone plugin to copy the whole data quickly. After the cloning, MOCO needs to create some user accounts and install plugins.

On the donor, you need to install the plugin and create two user accounts as follows:

mysql> INSTALL PLUGIN clone SONAME mysql_clone.so;
mysql> CREATE USER 'clone-donor'@'%' IDENTIFIED BY 'xxxxxxxxxxx';
mysql> GRANT BACKUP_ADMIN, REPLICATION SLAVE ON *.* TO 'clone-donor'@'%';
mysql> CREATE USER 'clone-init'@'localhost' IDENTIFIED BY 'yyyyyyyyyyy';
mysql> GRANT ALL ON *.* TO 'clone-init'@'localhost' WITH GRANT OPTION;

You may change the user names and should change their passwords.

Then create a Secret in the same namespace as MySQLCluster:

$ kubectl -n <namespace> create secret generic donor-secret \
    --from-literal=HOST=<donor-host> \
    --from-literal=PORT=<donor-port> \
    --from-literal=USER=clone-donor \
    --from-literal=PASSWORD=xxxxxxxxxxx \
    --from-literal=INIT_USER=clone-init \
    --from-literal=INIT_PASSWORD=yyyyyyyyyyy

You may change the secret name.

Finally, create MySQLCluster with spec.replicationSourceSecretName set to the Secret name as follows. The mysql image must be the same version as the donor's.

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  replicationSourceSecretName: donor-secret
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: quay.io/cybozu/mysql:8.0.26  # must be the same version as the donor
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

To stop the replication from the donor, update MySQLCluster with spec.replicationSourceSecretName: null.
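As a sketch, this can be done with kubectl patch (the cluster name test and namespace foo follow the example above):

```shell
$ kubectl -n foo patch mysqlcluster test --type merge \
    --patch '{"spec": {"replicationSourceSecretName": null}}'
```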

Bring your own image

We provide pre-built MySQL container images at quay.io/cybozu/mysql. If you want to build and use your own image, read custom-mysqld.md.

Configurations

The default and constant configuration values for mysqld are available on pkg.go.dev. The settings in ConstMycnf cannot be changed while the settings in DefaultMycnf can be overridden.

You can change the default values or set undefined values by creating a ConfigMap in the same namespace as MySQLCluster, and setting spec.mysqlConfigMapName in MySQLCluster to the name of the ConfigMap as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: foo
  name: mycnf
data:
  long_query_time: "5"
  innodb_buffer_pool_size: "10G"
---
apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  # set this to the name of ConfigMap
  mysqlConfigMapName: mycnf
  ...

InnoDB buffer pool size

If innodb_buffer_pool_size is not specified, MOCO automatically sets it to 70% of the value of resources.requests.memory (or resources.limits.memory) of the mysqld container.

If neither resources.requests.memory nor resources.limits.memory is set, innodb_buffer_pool_size will be set to 128M.
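As a quick sanity check of that rule, 70% of a 10Gi memory request works out to about 7.5Gi:

```shell
# 70% of a 10Gi memory request, in bytes
bytes=$(( 10 * 1024 * 1024 * 1024 * 70 / 100 ))
echo "$bytes"   # prints 7516192768
```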

Opaque configuration

Some configuration variables cannot be fully configured with ConfigMap values. For example, --performance-schema-instrument needs to be specified multiple times.

You may set them through a special config key _include. The value of _include will be included in my.cnf as-is.

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: foo
  name: mycnf
data:
  _include: |
    performance-schema-instrument='memory/%=ON'
    performance-schema-instrument='wait/synch/%/innodb/%=ON'
    performance-schema-instrument='wait/lock/table/sql/handler=OFF'
    performance-schema-instrument='wait/lock/metadata/sql/mdl=OFF'

Take care not to overwrite critical configurations such as log_bin, since MOCO does not check the contents of _include.

Using the cluster

kubectl moco

From outside of your Kubernetes cluster, you can access MOCO MySQL instances using kubectl-moco. kubectl-moco is a plugin for kubectl. Pre-built binaries are available on GitHub releases.

The following example runs the mysql command interactively against the primary instance of the test MySQLCluster in the foo namespace.

$ kubectl moco -n foo mysql -it test

Read the reference manual of kubectl-moco for further details and examples.

MySQL users

MOCO prepares a set of users.

  • moco-readonly can read all tables of all databases.
  • moco-writable can create users, databases, or tables.
  • moco-admin is the super user.
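The passwords for these users can be fetched with the credential subcommand of kubectl-moco; for example, for a cluster named test (the namespace foo is hypothetical):

```shell
$ kubectl moco -n foo credential -u moco-readonly test
```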

The exact privileges that moco-readonly has are:

  • PROCESS
  • REPLICATION CLIENT
  • REPLICATION SLAVE
  • SELECT
  • SHOW DATABASES
  • SHOW VIEW

The exact privileges that moco-writable has are:

  • ALTER
  • ALTER ROUTINE
  • CREATE
  • CREATE ROLE
  • CREATE ROUTINE
  • CREATE TEMPORARY TABLES
  • CREATE USER
  • CREATE VIEW
  • DELETE
  • DROP
  • DROP ROLE
  • EVENT
  • EXECUTE
  • INDEX
  • INSERT
  • LOCK TABLES
  • PROCESS
  • REFERENCES
  • REPLICATION CLIENT
  • REPLICATION SLAVE
  • SELECT
  • SHOW DATABASES
  • SHOW VIEW
  • TRIGGER
  • UPDATE

moco-writable cannot edit tables in the mysql database, though.

You can create other users and grant them certain privileges as either moco-writable or moco-admin:

$ kubectl moco mysql -u moco-writable test -- -e "CREATE USER 'foo'@'%' IDENTIFIED BY 'bar'"
$ kubectl moco mysql -u moco-writable test -- -e "CREATE DATABASE db1"
$ kubectl moco mysql -u moco-writable test -- -e "GRANT ALL ON db1.* TO 'foo'@'%'"

Connecting to mysqld over network

MOCO prepares two Services for each MySQLCluster. For example, a MySQLCluster named test in foo Namespace has the following Services.

  • moco-test-primary (DNS name moco-test-primary.foo.svc): Connect to the primary instance.
  • moco-test-replica (DNS name moco-test-replica.foo.svc): Connect to replica instances.

moco-test-replica can be used only for read access.
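For example, you could connect from inside the Kubernetes cluster with a throwaway client Pod. This is only a sketch; the password would come from kubectl moco credential:

```shell
$ kubectl -n foo run -it --rm mysql-client \
    --image=quay.io/cybozu/mysql:8.0.26 --restart=Never -- \
    mysql -h moco-test-primary.foo.svc -u moco-writable -p
```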

The type of these Services is usually ClusterIP. The following is an example of changing the Service type to LoadBalancer and adding an annotation for MetalLB.

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  serviceTemplate:
    metadata:
      annotations:
        metallb.universe.tf/address-pool: production-public-ips
    spec:
      type: LoadBalancer
...

Backup and restore

MOCO can take full and incremental backups regularly. The backup data are stored in Amazon S3-compatible object storage.

You can restore data from a backup to a new MySQL cluster.

Object storage bucket

A bucket is a management unit of objects in S3. MOCO stores backups in a specified bucket.

MOCO does not remove backups. To remove old backups automatically, set a lifecycle configuration on the bucket.

ref: Setting lifecycle configuration on a bucket

A bucket can be shared safely across multiple MySQLClusters. Object keys are prefixed with moco/.
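As a sketch, assuming an S3 bucket managed with the AWS CLI, a lifecycle rule that expires backup objects under the moco/ prefix after 30 days (a hypothetical retention period) could look like:

```shell
$ cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-moco-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "moco/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF
$ aws s3api put-bucket-lifecycle-configuration \
    --bucket <bucket-name> --lifecycle-configuration file://lifecycle.json
```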

BackupPolicy

BackupPolicy is a custom resource to define a policy for taking backups.

The following is an example BackupPolicy to take a backup every day and store data in MinIO:

apiVersion: moco.cybozu.com/v1beta1
kind: BackupPolicy
metadata:
  namespace: backup
  name: daily
spec:
  # Backup schedule.  Any CRON format is allowed.
  schedule: "@daily"

  jobConfig:
    # An existing ServiceAccount name is required.
    serviceAccountName: backup-owner
    env:
    - name: AWS_ACCESS_KEY_ID
      value: minioadmin
    - name: AWS_SECRET_ACCESS_KEY
      value: minioadmin

    # bucketName is required.  Other fields are optional.
    bucketConfig:
      bucketName: moco
      endpointURL: http://minio.default.svc:9000
      usePathStyle: true

    # MOCO uses a filesystem volume to store data temporarily.
    workVolume:
      # Using emptyDir as a working directory is NOT recommended.
      # The recommended way is to use generic ephemeral volume with a provisioner
      # that can provide enough capacity.
      # https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes
      emptyDir: {}

To enable backup for a MySQLCluster, reference the BackupPolicy name like this:

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: default
  name: foo
spec:
  backupPolicyName: daily  # The policy name
...

MOCO creates a CronJob for each MySQLCluster that has spec.backupPolicyName.

The CronJob's name is moco-backup- + the name of MySQLCluster. For the above example, a CronJob named moco-backup-foo is created in default namespace.

Credentials to access S3 bucket

Depending on your Kubernetes service provider and object storage, there are various ways to give credentials to access the object storage bucket.

For Amazon's Elastic Kubernetes Service (EKS) and S3 users, the easiest way is probably to use IAM Roles for Service Accounts (IRSA).

ref: IAM ROLES FOR SERVICE ACCOUNTS

Another popular way is to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables as shown in the above example.

Taking an emergency backup

You can take an emergency backup by creating a Job from the CronJob for backup.

$ kubectl create job --from=cronjob/moco-backup-foo emergency-backup

Restore

To restore data from a backup, create a new MySQLCluster with the spec.restore field as follows:

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: backup
  name: target
spec:
  # restore field is not editable.
  # to modify parameters, delete and re-create MySQLCluster.
  restore:
    # The source MySQLCluster's name and namespace
    sourceName: source
    sourceNamespace: backup

    # The restore point-in-time in RFC3339 format.
    restorePoint: "2021-05-26T12:34:56Z"

    # jobConfig is the same as in BackupPolicy
    jobConfig:
      serviceAccountName: backup-owner
      env:
      - name: AWS_ACCESS_KEY_ID
        value: minioadmin
      - name: AWS_SECRET_ACCESS_KEY
        value: minioadmin
      bucketConfig:
        bucketName: moco
        endpointURL: http://minio.default.svc:9000
        usePathStyle: true
      workVolume:
        emptyDir: {}
...

Further details

Read backup.md for further details.

Deleting the cluster

Deleting a MySQLCluster automatically removes all its generated resources, including the PersistentVolumeClaims created from the templates.

If you want to keep the PersistentVolumeClaims, remove metadata.ownerReferences from them before you delete a MySQLCluster.
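A sketch of dropping the ownerReferences with a JSON patch (the PVC name follows the mysql-data-moco-<cluster>-<index> pattern; the cluster test and namespace foo are hypothetical):

```shell
$ kubectl -n foo patch pvc mysql-data-moco-test-0 --type json \
    --patch '[{"op": "remove", "path": "/metadata/ownerReferences"}]'
```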

Status, metrics, and logs

Cluster status

You can see the health and availability status of MySQLCluster as follows:

$ kubectl get mysqlcluster
NAME   AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS
test   True        True      0         3

  • The cluster is available when the primary Pod is running and ready.
  • The cluster is healthy when there are no problems.
  • PRIMARY is the index of the current primary instance Pod.
  • SYNCED REPLICAS is the number of ready Pods.
  • ERRANT REPLICAS is the number of instances having errant transactions.

You can also use kubectl describe mysqlcluster to see the recent events on the cluster.

Pod status

MOCO adds a liveness probe and a readiness probe to mysqld containers to check the replication status in addition to the process status.

A replica Pod is ready only when it is replicating data from the primary without a significant delay. The default threshold of the delay is 60 seconds. The threshold can be configured as follows.

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  maxDelaySeconds: 180
  ...

Unready replica Pods are automatically excluded from the load-balancing targets so that users will not read stale data.

Metrics

MOCO provides built-in support for collecting and exposing mysqld metrics using mysqld_exporter.

This is an example YAML to enable mysqld_exporter. spec.collectors is a list of mysqld_exporter flag names without the collect. prefix.

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  collectors:
  - engine_innodb_status
  - info_schema.innodb_metrics
  podTemplate:
    ...

See metrics.md for all available metrics and how to collect them using Prometheus.

Logs

Error logs from mysqld can be viewed as follows:

$ kubectl logs moco-test-0 mysqld

Slow logs from mysqld can be viewed as follows:

$ kubectl logs moco-test-0 slow-log

Maintenance

Increasing the number of instances in the cluster

Edit spec.replicas field of MySQLCluster:

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  replicas: 5
  ...

You can only increase the number of instances in a MySQLCluster from 1 to 3 or 5, or from 3 to 5. Decreasing the number of instances is not allowed.

Switchover

Switchover is an operation to change the live primary to one of the replicas.

MOCO automatically switches the primary when the Pod of the primary instance is about to be deleted.

Users can manually trigger a switchover with kubectl moco switchover CLUSTER_NAME. Read kubectl-moco.md for details.

Failover

Failover is an operation to replace the dead primary with the most advanced replica. MOCO automatically does this as soon as it detects that the primary is down.

The most advanced replica is the replica that has received the most up-to-date transactions from the dead primary. Since MOCO configures lossless semi-synchronous replication, a failover is guaranteed not to lose any user data.

After a failover, the old primary may become an errant replica as described above.

Upgrading mysql version

You can upgrade the MySQL version of a MySQL cluster as follows:

  1. Check that the cluster is healthy.
  2. Check release notes of MySQL for any incompatibilities between the current and the new versions.
  3. Edit the Pod template of the MySQLCluster and update the mysqld container image:

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  podTemplate:
    spec:
      containers:
      - name: mysqld
        # Edit the next line
        image: quay.io/cybozu/mysql:8.0.26

You are advised to make backups and/or create a replica cluster before starting the upgrade process. Read upgrading.md for further details.

Re-initializing an errant replica

Delete the PVC and Pod of the errant replica, like this:

$ kubectl delete --wait=false pvc mysql-data-moco-test-0
$ kubectl delete --grace-period=1 pods moco-test-0

Depending on your Kubernetes version, the StatefulSet controller may create a pending Pod before the PVC gets deleted. Delete such pending Pods until the PVC is actually removed.

Advanced topics

Building custom image of mysqld

There are pre-built mysqld container images for MOCO at quay.io/cybozu/mysql. Users can use one of these images for the mysqld container in MySQLCluster like:

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
spec:
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: quay.io/cybozu/mysql:8.0.26

If you want to build and use your own mysqld, read the rest of this document.

Dockerfile

The easiest way to build a custom mysqld for MOCO is to copy and edit our Dockerfile. You can find it under the mysql directory in github.com/cybozu/neco-containers.

Keep the following points in mind:

  • Build and install moco-init
  • Add directories for mysqld and moco-init to PATH
  • ENTRYPOINT should be ["mysqld"]
  • USER should be 10000:10000
  • sleep command must exist in one of the PATH directories.
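A skeleton Dockerfile illustrating these points might look like the following. The base images, paths, and the moco-init build step are assumptions for illustration; consult the actual Dockerfile in github.com/cybozu/neco-containers for a working build.

```dockerfile
# Hypothetical build stage for moco-init; the real build may differ.
FROM golang:1.17 AS build
RUN go install github.com/cybozu-go/moco/cmd/moco-init@latest

FROM ubuntu:20.04
# Copy your mysqld build output and moco-init (paths are illustrative).
COPY mysql /usr/local/mysql
COPY --from=build /go/bin/moco-init /usr/local/bin/moco-init

# Add the directories for mysqld and moco-init to PATH.
# /bin stays on PATH so that the required sleep command is found.
ENV PATH=/usr/local/mysql/bin:/usr/local/bin:/usr/bin:/bin

USER 10000:10000
ENTRYPOINT ["mysqld"]
```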

How to build mysqld

On Ubuntu 20.04, you can build the source code as follows:

$ sudo apt-get update
$ sudo apt-get -y --no-install-recommends install build-essential libssl-dev \
    cmake libncurses5-dev libjemalloc-dev libnuma-dev libaio-dev pkg-config
$ curl -fsSL -O https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-boost-8.0.20.tar.gz
$ tar -x -z -f mysql-boost-8.0.20.tar.gz
$ cd mysql-8.0.20
$ mkdir bld
$ cd bld
$ cmake .. -DBUILD_CONFIG=mysql_release -DCMAKE_BUILD_TYPE=Release \
    -DWITH_BOOST=$(ls -d ../boost/boost_*) -DWITH_NUMA=1 -DWITH_JEMALLOC=1
$ make -j $(nproc)
$ make install

Troubleshooting

Failed to initialize data directory for mysqld

If you see the following error message from an init container of mysqld Pod,

mysqld: Can't create directory '/var/lib/mysql/data/' (OS errno 13 - Permission denied)
2021-05-24T19:44:33.022939Z 0 [System] [MY-013169] [Server] /usr/local/mysql/bin/mysqld (mysqld 8.0.24) initializing of server in progress as process 12
2021-05-24T19:44:33.024090Z 0 [ERROR] [MY-013236] [Server] The designated data directory /var/lib/mysql/data/ is unusable. You can remove all files that the server added to it.
2021-05-24T19:44:33.024138Z 0 [ERROR] [MY-010119] [Server] Aborting
2021-05-24T19:44:33.024316Z 0 [System] [MY-010910] [Server] /usr/local/mysql/bin/mysqld: Shutdown complete (mysqld 8.0.24)  Source distribution.

the data directory is probably writable only by the root user.

To resolve the problem, add fsGroup: 10000 to MySQLCluster as follows:

apiVersion: moco.cybozu.com/v1beta1
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  podTemplate:
    spec:
      securityContext:
        fsGroup: 10000    # to make the data directory writable for `mysqld` container.
        fsGroupChangePolicy: "OnRootMismatch"  # available since k8s 1.20
...

Custom resources

Custom Resources

Sub Resources

BackupStatus

BackupStatus represents the status of the last successful backup.

  • time (metav1.Time, required): The time of the backup. This is used to generate object keys of backup files in a bucket.
  • elapsed (metav1.Duration, required): Elapsed is the time spent on the backup.
  • sourceIndex (int, required): SourceIndex is the ordinal of the backup source instance.
  • sourceUUID (string, required): SourceUUID is the server_uuid of the backup source instance.
  • binlogFilename (string, required): BinlogFilename is the binlog filename that the backup source instance was writing to at the backup.
  • gtidSet (string, required): GTIDSet is the GTID set of the full dump of the database.
  • dumpSize (int64, required): DumpSize is the size in bytes of a full dump of the database stored in an object storage bucket.
  • binlogSize (int64, required): BinlogSize is the size in bytes of a tarball of binlog files stored in an object storage bucket.
  • workDirUsage (int64, required): WorkDirUsage is the max usage in bytes of the working directory.
  • warnings ([]string, required): Warnings are the list of warnings from the last backup, if any.

Back to Custom Resources

MySQLCluster

MySQLCluster is the Schema for the mysqlclusters API

  • metadata (metav1.ObjectMeta, optional)
  • spec (MySQLClusterSpec, optional)
  • status (MySQLClusterStatus, optional)

Back to Custom Resources

MySQLClusterCondition

MySQLClusterCondition defines the condition of MySQLCluster.

  • type (MySQLClusterConditionType, required): Type is the type of the condition.
  • status (corev1.ConditionStatus, required): Status is the status of the condition.
  • reason (string, optional): Reason is a one-word CamelCase reason for the condition's last transition.
  • message (string, optional): Message is a human-readable message indicating details about the last transition.
  • lastTransitionTime (metav1.Time, required): LastTransitionTime is the last time the condition transitioned from one status to another.

Back to Custom Resources

MySQLClusterList

MySQLClusterList contains a list of MySQLCluster

  • metadata (metav1.ListMeta, optional)
  • items ([]MySQLCluster, required)

Back to Custom Resources

MySQLClusterSpec

MySQLClusterSpec defines the desired state of MySQLCluster

  • replicas (int32, optional): Replicas is the number of instances. Available values are positive odd numbers.
  • podTemplate (PodTemplateSpec, required): PodTemplate is a Pod template for the MySQL server container.
  • volumeClaimTemplates ([]PersistentVolumeClaim, required): VolumeClaimTemplates is a list of PersistentVolumeClaim templates for the MySQL server container. A claim named "mysql-data" must be included in the list.
  • serviceTemplate (*ServiceTemplate, optional): ServiceTemplate is a Service template for both primary and replicas.
  • mysqlConfigMapName (*string, optional): MySQLConfigMapName is a ConfigMap name of MySQL config.
  • replicationSourceSecretName (*string, optional): ReplicationSourceSecretName is a Secret name which contains replication source info. If this field is given, the MySQLCluster works as an intermediate primary.
  • collectors ([]string, optional): Collectors is the list of collector flag names of mysqld_exporter. If this field is not empty, MOCO adds mysqld_exporter as a sidecar to collect and export mysqld metrics in Prometheus format. See https://github.com/prometheus/mysqld_exporter/blob/master/README.md#collector-flags for flag names. Example: ["engine_innodb_status", "info_schema.innodb_metrics"]
  • serverIDBase (int32, optional): ServerIDBase, if set, will become the base number of server-id of each MySQL instance of this cluster. For example, if this is 100, the server-ids will be 100, 101, 102, and so on. If the field is not given or zero, MOCO automatically sets a random positive integer.
  • maxDelaySeconds (int, optional): MaxDelaySeconds, if set, configures the readiness probe of the mysqld container. For a replica mysqld instance, if it is delayed to apply transactions over this threshold, the mysqld instance will be marked as non-ready. The default is 60 seconds.
  • startupDelaySeconds (int32, optional): StartupWaitSeconds is the maximum duration to wait for the mysqld container to start working. The default is 3600 seconds.
  • logRotationSchedule (string, optional): LogRotationSchedule specifies the schedule to rotate MySQL logs. If not set, the default is to rotate logs every 5 minutes. See https://pkg.go.dev/github.com/robfig/cron/v3#hdr-CRON_Expression_Format for the field format.
  • backupPolicyName (*string, required): The name of a BackupPolicy custom resource in the same namespace. If this is set, MOCO creates a CronJob to take backups of this MySQL cluster periodically.
  • restore (*RestoreSpec, optional): Restore is the specification to perform Point-in-Time-Recovery from an existing cluster. If this field is not null, MOCO restores the data as specified and creates a new cluster with the data. This field is not editable.
  • disableSlowQueryLogContainer (bool, optional): DisableSlowQueryLogContainer controls whether to add a sidecar container named "slow-log" to output slow logs as the container's output. If set to true, the sidecar container is not added. The default is false.

Back to Custom Resources

MySQLClusterStatus

MySQLClusterStatus defines the observed state of MySQLCluster

  • conditions ([]MySQLClusterCondition, optional): Conditions is an array of conditions.
  • currentPrimaryIndex (int, required): CurrentPrimaryIndex is the index of the current primary Pod in the StatefulSet. Initially, this is zero.
  • syncedReplicas (int, optional): SyncedReplicas is the number of synced instances including the primary.
  • errantReplicas (int, optional): ErrantReplicas is the number of instances that have errant transactions.
  • errantReplicaList ([]int, optional): ErrantReplicaList is the list of indices of errant replicas.
  • backup (BackupStatus, required): Backup is the status of the last successful backup.
  • restoredTime (*metav1.Time, optional): RestoredTime is the time when the cluster data is restored.
  • cloned (bool, optional): Cloned indicates if the initial cloning from an external source has been completed.
  • reconcileInfo (ReconcileInfo, required): ReconcileInfo represents version information for the reconciler.

Back to Custom Resources

ObjectMeta

ObjectMeta is metadata of objects. This is partially copied from metav1.ObjectMeta.

  • name (string, optional): Name is the name of the object.
  • labels (map[string]string, optional): Labels is a map of string keys and values.
  • annotations (map[string]string, optional): Annotations is a map of string keys and values.

Back to Custom Resources

PersistentVolumeClaim

PersistentVolumeClaim is a user's request for and claim to a persistent volume. This is slightly modified from corev1.PersistentVolumeClaim.

  • metadata (ObjectMeta, required): Standard object's metadata.
  • spec (corev1.PersistentVolumeClaimSpec, required): Spec defines the desired characteristics of a volume requested by a pod author.

Back to Custom Resources

PodTemplateSpec

PodTemplateSpec describes the data a pod should have when created from a template. This is slightly modified from corev1.PodTemplateSpec.

  • metadata (ObjectMeta, optional): Standard object's metadata. The name in this metadata is ignored.
  • spec (corev1.PodSpec, required): Specification of the desired behavior of the pod. The name of the MySQL server container in this spec must be mysqld.

Back to Custom Resources

ReconcileInfo

ReconcileInfo is the type to record the last reconciliation information.

  • generation (int64, optional): Generation is the metadata.generation value of the last reconciliation. See also https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#status-subresource
  • reconcileVersion (int, required): ReconcileVersion is the version of the operator reconciler.

Back to Custom Resources

RestoreSpec

RestoreSpec represents a set of parameters for Point-in-Time Recovery.

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| sourceName | SourceName is the name of the source MySQLCluster. | string | true |
| sourceNamespace | SourceNamespace is the namespace of the source MySQLCluster. | string | true |
| restorePoint | RestorePoint is the target date and time to restore data. The format is RFC3339. e.g. "2006-01-02T15:04:05Z" | metav1.Time | true |
| jobConfig | Specifies parameters for the restore Pod. | JobConfig | true |
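An illustrative spec.restore stanza in a MySQLCluster manifest (all names, the bucket, and the restore point are placeholders):

spec:
  restore:
    sourceName: source
    sourceNamespace: production
    restorePoint: "2021-05-23T15:04:05Z"
    jobConfig:
      serviceAccountName: backup-owner
      bucketConfig:
        bucketName: moco-backup
      workVolume:
        emptyDir: {}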

Back to Custom Resources

ServiceTemplate

ServiceTemplate defines the desired spec and annotations of a Service.

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| metadata | Standard object's metadata. Only annotations and labels are valid. | ObjectMeta | false |
| spec | Spec is the ServiceSpec. | *corev1.ServiceSpec | false |
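For example, a serviceTemplate that exposes the cluster through a LoadBalancer Service (a sketch; the annotation shown is an arbitrary example):

spec:
  serviceTemplate:
    metadata:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
    spec:
      type: LoadBalancer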

Back to Custom Resources

BucketConfig

BucketConfig is a set of parameters to access an object storage bucket.

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| bucketName | The name of the bucket. | string | true |
| region | The region of the bucket. This can also be set through the AWS_REGION environment variable. | string | false |
| endpointURL | The API endpoint URL. Set this for non-S3 object storages. | string | false |
| usePathStyle | Allows you to enable the client to use path-style addressing, i.e., https?://ENDPOINT/BUCKET/KEY. By default, virtual-host addressing is used (https?://BUCKET.ENDPOINT/KEY). | bool | false |
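For instance, a bucketConfig pointing at a non-S3, path-style object storage such as MinIO (the bucket name and endpoint are hypothetical):

bucketConfig:
  bucketName: moco-backup
  region: us-east-1
  endpointURL: https://minio.example.com
  usePathStyle: true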

Back to Custom Resources

JobConfig

JobConfig is a set of parameters for backup and restore job Pods.

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| serviceAccountName | ServiceAccountName specifies the ServiceAccount to run the Pod. | string | true |
| bucketConfig | Specifies how to access an object storage bucket. | BucketConfig | true |
| workVolume | WorkVolume is the volume source for the working directory. Since the backup or restore task can use a lot of bytes in the working directory, you should always give a volume with enough capacity. The recommended volume source is a generic ephemeral volume. https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes | corev1.VolumeSource | true |
| threads | Threads is the number of threads used for backup or restoration. | int | false |
| memory | Memory is the amount of memory requested for the Pod. | *resource.Quantity | false |
| maxMemory | MaxMemory is the maximum amount of memory for the Pod. | *resource.Quantity | false |
| envFrom | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []corev1.EnvFromSource | false |
| env | List of environment variables to set in the container. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []corev1.EnvVar | false |
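A jobConfig using the recommended generic ephemeral volume as the working directory (a sketch; the ServiceAccount and bucket names, and the 100Gi size, are placeholders):

jobConfig:
  serviceAccountName: backup-owner
  bucketConfig:
    bucketName: moco-backup
  workVolume:
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi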

Back to Custom Resources

Custom Resources

Sub Resources

BackupPolicy

BackupPolicy is a namespaced resource that should be referenced from MySQLCluster.

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| metadata | | metav1.ObjectMeta | false |
| spec | | BackupPolicySpec | true |

Back to Custom Resources

BackupPolicyList

BackupPolicyList contains a list of BackupPolicy

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| metadata | | metav1.ListMeta | false |
| items | | []BackupPolicy | true |

Back to Custom Resources

BackupPolicySpec

BackupPolicySpec defines the configuration items for MySQLCluster backup.

The following fields will be copied to CronJob.spec:

  • Schedule
  • StartingDeadlineSeconds
  • ConcurrencyPolicy
  • SuccessfulJobsHistoryLimit
  • FailedJobsHistoryLimit

The following fields will be copied to CronJob.spec.jobTemplate:

  • ActiveDeadlineSeconds
  • BackoffLimit

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| schedule | The schedule in Cron format for periodic backups. See https://en.wikipedia.org/wiki/Cron | string | true |
| jobConfig | Specifies parameters for the backup Pod. | JobConfig | true |
| startingDeadlineSeconds | Optional deadline in seconds for starting the job if it misses the scheduled time for any reason. Missed job executions will be counted as failed ones. | *int64 | false |
| concurrencyPolicy | Specifies how to treat concurrent executions of a Job. Valid values are: "Allow" (default): allows CronJobs to run concurrently; "Forbid": forbids concurrent runs, skipping the next run if the previous run hasn't finished yet; "Replace": cancels the currently running job and replaces it with a new one. | batchv1beta1.ConcurrencyPolicy | false |
| activeDeadlineSeconds | Specifies the duration in seconds relative to the startTime that the job may be continuously active before the system tries to terminate it; the value must be a positive integer. If a Job is suspended (at creation or through an update), this timer will effectively be stopped and reset when the Job is resumed again. | *int64 | false |
| backoffLimit | Specifies the number of retries before marking this job failed. Defaults to 6. | *int32 | false |
| successfulJobsHistoryLimit | The number of successful finished jobs to retain. This is a pointer to distinguish between explicit zero and not specified. Defaults to 3. | *int32 | false |
| failedJobsHistoryLimit | The number of failed finished jobs to retain. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1. | *int32 | false |
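Putting the fields together, a daily BackupPolicy might look like the following sketch (the apiVersion should match your installed CRD version; all names are placeholders):

apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
metadata:
  namespace: default
  name: daily
spec:
  schedule: "@daily"
  jobConfig:
    serviceAccountName: backup-owner
    bucketConfig:
      bucketName: moco-backup
    workVolume:
      emptyDir: {}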

Back to Custom Resources

BucketConfig

BucketConfig is a set of parameters to access an object storage bucket.

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| bucketName | The name of the bucket. | string | true |
| region | The region of the bucket. This can also be set through the AWS_REGION environment variable. | string | false |
| endpointURL | The API endpoint URL. Set this for non-S3 object storages. | string | false |
| usePathStyle | Allows you to enable the client to use path-style addressing, i.e., https?://ENDPOINT/BUCKET/KEY. By default, virtual-host addressing is used (https?://BUCKET.ENDPOINT/KEY). | bool | false |

Back to Custom Resources

JobConfig

JobConfig is a set of parameters for backup and restore job Pods.

| Field | Description | Scheme | Required |
|-------|-------------|--------|----------|
| serviceAccountName | ServiceAccountName specifies the ServiceAccount to run the Pod. | string | true |
| bucketConfig | Specifies how to access an object storage bucket. | BucketConfig | true |
| workVolume | WorkVolume is the volume source for the working directory. Since the backup or restore task can use a lot of bytes in the working directory, you should always give a volume with enough capacity. The recommended volume source is a generic ephemeral volume. https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes | corev1.VolumeSource | true |
| threads | Threads is the number of threads used for backup or restoration. | int | false |
| memory | Memory is the amount of memory requested for the Pod. | *resource.Quantity | false |
| maxMemory | MaxMemory is the maximum amount of memory for the Pod. | *resource.Quantity | false |
| envFrom | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []corev1.EnvFromSource | false |
| env | List of environment variables to set in the container. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []corev1.EnvVar | false |

Back to Custom Resources

Commands

kubectl moco plugin

kubectl-moco is a kubectl plugin for MOCO.

kubectl moco [global options] <subcommand> [sub options] args...

Global options

Global options are compatible with kubectl. For example, the following options are available.

| Global options | Default value | Description |
|----------------|---------------|-------------|
| --kubeconfig | $HOME/.kube/config | Path to the kubeconfig file to use for CLI requests. |
| -n, --namespace | default | If present, the namespace scope for this CLI request. |

MySQL users

You can choose one of the following users for the --mysql-user option value.

| Name | Description |
|------|-------------|
| moco-readonly | A read-only user. |
| moco-writable | A user that can edit users, databases, and tables. |
| moco-admin | The super-user. |

kubectl moco mysql [options] CLUSTER_NAME [-- mysql args...]

Run the mysql command in a specified MySQL instance.

| Options | Default value | Description |
|---------|---------------|-------------|
| -u, --mysql-user | moco-readonly | Login as the specified user. |
| --index | index of the primary | Index of the target mysql instance. |
| -i, --stdin | false | Pass stdin to the mysql container. |
| -t, --tty | false | Stdin is a TTY. |

Examples

This executes SELECT VERSION() on the primary instance in mycluster in foo namespace:

$ kubectl moco -n foo mysql mycluster -- -N -e 'SELECT VERSION()'

To execute SQL from a file:

$ cat sample.sql | kubectl moco -n foo mysql -u moco-writable -i mycluster

To run mysql interactively for instance 2 in mycluster in the default namespace:

$ kubectl moco mysql --index 2 -it mycluster

kubectl moco credential [options] CLUSTER_NAME

Fetch the credential information of a specified user.

| Options | Default value | Description |
|---------|---------------|-------------|
| -u, --mysql-user | moco-readonly | Fetch the credential of the specified user. |
| --format | plain | Output format: plain or mycnf. |
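For example, to print the password of moco-admin, or a my.cnf-style snippet for moco-readonly (the cluster name mycluster is a placeholder):

$ kubectl moco credential -u moco-admin mycluster
$ kubectl moco credential -u moco-readonly --format=mycnf mycluster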

kubectl moco switchover CLUSTER_NAME

Switch the primary instance to one of the replicas.
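For example, to trigger a switchover for mycluster in the foo namespace (names are placeholders):

$ kubectl moco -n foo switchover mycluster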

moco-controller

moco-controller controls MySQL clusters on Kubernetes.

Environment variables

| Name | Required | Description |
|------|----------|-------------|
| POD_NAMESPACE | Yes | The namespace name where moco-controller runs. |

Command line flags

Flags:
      --add_dir_header                   If true, adds the file directory to the header
      --agent-image string               The image of moco-agent sidecar container
      --alsologtostderr                  log to standard error as well as files
      --backup-image string              The image of moco-backup container
      --cert-dir string                  webhook certificate directory
      --check-interval duration          Interval of cluster maintenance (default 1m0s)
      --fluent-bit-image string          The image of fluent-bit sidecar container
      --grpc-cert-dir string             gRPC certificate directory (default "/grpc-cert")
      --health-probe-addr string         Listen address for health probes (default ":8081")
  -h, --help                             help for moco-controller
      --leader-election-id string        ID for leader election by controller-runtime (default "moco")
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --log_file string                  If non-empty, use this log file
      --log_file_max_size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                      log to standard error instead of files (default true)
      --metrics-addr string              The address the metric endpoint binds to (default ":8080")
      --mysqld-exporter-image string     The image of mysqld_exporter sidecar container
      --skip_headers                     If true, avoid header prefixes in the log messages
      --skip_log_headers                 If true, avoid headers when opening log files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          number for the log level verbosity
      --version                          version for moco-controller
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging
      --webhook-addr string              Listen address for the webhook endpoint (default ":9443")
      --zap-devel                        Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
      --zap-encoder encoder              Zap log encoding (one of 'json' or 'console')
      --zap-log-level level              Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
      --zap-stacktrace-level level       Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').

moco-backup

moco-backup command is used in the ghcr.io/cybozu-go/moco-backup container. Normally, users do not need to run this command directly.

Environment variables

moco-backup takes configurations of the S3 API from environment variables. For details, read the documentation of EnvConfig in github.com/aws/aws-sdk-go-v2/config.

It also requires MYSQL_PASSWORD environment variable to be set.

Global command-line flags

Global Flags:
      --endpoint string   S3 API endpoint URL
      --region string     AWS region
      --threads int       The number of threads to be used (default 4)
      --use-path-style    Use path-style S3 API
      --work-dir string   The writable working directory (default "/work")

Subcommands

backup subcommand

Usage: moco-backup backup BUCKET NAMESPACE NAME

  • BUCKET: The bucket name.
  • NAMESPACE: The namespace of the MySQLCluster.
  • NAME: The name of the MySQLCluster.

restore subcommand

Usage: moco-backup restore BUCKET SOURCE_NAMESPACE SOURCE_NAME NAMESPACE NAME YYYYMMDD-hhmmss

  • BUCKET: The bucket name.
  • SOURCE_NAMESPACE: The source MySQLCluster's namespace.
  • SOURCE_NAME: The source MySQLCluster's name.
  • NAMESPACE: The target MySQLCluster's namespace.
  • NAME: The target MySQLCluster's name.
  • YYYYMMDD-hhmmss: The point-in-time to restore data. e.g. 20210523-150423
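For example, to restore the prod/source cluster's backup into prod/target at the given point in time (all names and the bucket are hypothetical):

$ moco-backup restore moco-backup prod source prod target 20210523-150423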

Metrics

moco-controller

moco-controller provides the following kinds of metrics in Prometheus format. Aside from the standard Go runtime and process metrics, it exposes metrics related to controller-runtime, MySQL clusters, and backups.

MySQL clusters

All these metrics are prefixed with moco_cluster_ and have name and namespace labels.

| Name | Description | Type |
|------|-------------|------|
| checks_total | The number of times MOCO checked the cluster | Counter |
| errors_total | The number of times MOCO encountered errors when managing the cluster | Counter |
| available | 1 if the cluster is available, 0 otherwise | Gauge |
| healthy | 1 if the cluster is running without any problems, 0 otherwise | Gauge |
| switchover_total | The number of times MOCO changed the live primary instance | Counter |
| failover_total | The number of times MOCO changed the failed primary instance | Counter |
| replicas | The number of mysqld instances in the cluster | Gauge |
| ready_replicas | The number of ready mysqld Pods in the cluster | Gauge |
| errant_replicas | The number of mysqld instances that have errant transactions | Gauge |

Backup

All these metrics are prefixed with moco_backup_ and have name and namespace labels.

| Name | Description | Type |
|------|-------------|------|
| timestamp | The number of seconds since January 1, 1970 UTC of the last successful backup | Gauge |
| elapsed_seconds | The number of seconds taken for the last backup | Gauge |
| dump_bytes | The size of compressed full backup data | Gauge |
| binlog_bytes | The size of compressed binlog files | Gauge |
| workdir_usage_bytes | The maximum usage of the working directory | Gauge |
| warnings | The number of warnings in the last successful backup | Gauge |
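As a sketch, a Prometheus alerting expression built on these metrics that fires when the last successful backup of a hypothetical cluster is older than one day:

time() - moco_backup_timestamp{namespace="foo", name="mycluster"} > 86400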

MySQL instance

For each mysqld instance, moco-agent exposes a set of metrics. Read github.com/cybozu-go/moco-agent/blob/main/docs/metrics.md for details.

Also, if you give a set of collector flag names to spec.collectors of MySQLCluster, a sidecar container running mysqld_exporter exposes the collected metrics for each mysqld instance.
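For example, spec.collectors might be set as follows; the collector names shown are assumed examples of mysqld_exporter collector flags, so consult mysqld_exporter's documentation for the valid set:

spec:
  collectors:
  - engine_innodb_status
  - info_schema.innodb_metrics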

Scrape rules

This is an example kubernetes_sd_config for Prometheus to collect all MOCO & MySQL metrics.

scrape_configs:
- job_name: 'moco-controller'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_label_app_kubernetes_io_component,__meta_kubernetes_pod_container_port_name]
    action: keep
    regex: moco-system;moco-controller;metrics

- job_name: 'moco-agent'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_container_port_name,__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: keep
    regex: mysql;agent-metrics;moco-.*
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace

- job_name: 'moco-mysql'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_container_port_name,__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: keep
    regex: mysql;mysqld-metrics;moco-.*
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
    action: replace
    target_label: name
  - source_labels: [__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: replace
    target_label: index
    regex: .*-([0-9])
  - source_labels: [__meta_kubernetes_pod_label_moco_cybozu_com_role]
    action: replace
    target_label: role

The collected metrics should have these labels:

  • namespace: MySQLCluster's metadata.namespace
  • name: MySQLCluster's metadata.name
  • index: The ordinal of MySQL instance Pod
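With these labels, metrics can be filtered per cluster and per instance. For example, a PromQL selector for one instance of a hypothetical cluster (the metric name is a common mysqld_exporter metric, shown for illustration):

mysql_global_status_threads_running{namespace="foo", name="mycluster", index="0"}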

Design notes

Design notes

The purpose of this document is to describe the backgrounds and the goals of MOCO. Implementation details are described in other documents.

Motivation

We are creating our own Kubernetes operator for clustering MySQL instances for the following reasons:

Firstly, our application requires strict compatibility with traditional MySQL. Although recent MySQL provides an advanced clustering solution called group replication that is based on Paxos, we cannot use it because of its various limitations.

Secondly, we want a Kubernetes-native and simple operator. For example, we can use a Kubernetes Service to load-balance read queries to multiple replicas. Also, we do not want to support non-GTID-based replication.

Lastly, none of the existing operators could satisfy our requirements.

Goals

  • Manage primary-replica style clustering of MySQL instances.
    • The primary instance is the only instance that allows writes.
    • Replica instances replicate data from the primary and are read-only.
  • Support replication from an external MySQL instance.
  • Support all four transaction isolation levels.
  • No split-brain.
  • Allow large transactions.
  • Upgrade the operator without restarting MySQL Pods.
  • Safe and automatic upgrading of MySQL version.
  • Support automatic primary selection and switchover.
  • Support automatic failover.
  • Backup and restore features.
    • Support point-in-time recovery (PiTR).
  • Tenant users can specify the following parameters:
    • The version of MySQL instances.
    • The number of processor cores for each MySQL instance.
    • The amount of memory for each MySQL instance.
    • The amount of backing storage for each MySQL instance.
    • The number of replicas in the MySQL cluster.
    • Custom configuration parameters.
  • Allow CREATE / DROP TEMPORARY TABLE during a transaction.
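To illustrate the tenant-facing parameters above, a minimal MySQLCluster manifest might look like the following sketch (the apiVersion should match your installed CRD version; the image name and sizes are placeholders):

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  replicas: 3                  # the number of instances in the cluster
  podTemplate:
    spec:
      containers:
      - name: mysqld           # the MySQL server container must be named mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.28
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
  volumeClaimTemplates:
  - metadata:
      name: mysql-data         # backing storage for each instance
    spec:
      resources:
        requests:
          storage: 100Gi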

Non-goals

  • Support for older MySQL versions (5.6, 5.7)

    As a late comer, we focus our development effort on the latest MySQL. This simplifies things and allows us to use advanced mechanisms such as CLONE INSTANCE.

  • Node fencing

    Fencing is a technique to safely isolate a failed Node. MOCO does not rely on Node fencing, as fencing should be done externally.

    We can still implement failover in a safe way by configuring semi-sync parameters appropriately.

How MOCO reconciles MySQLCluster

MOCO creates and updates a StatefulSet and related resources for each MySQLCluster custom resource. This document describes how and when MOCO updates them.

Reconciler versions

MOCO's reconciliation routine should be consistent to avoid frequent updates.

That said, we may need to modify the reconciliation process in the future. To avoid unnecessary updates to StatefulSets created by older reconcilers, MOCO keeps multiple versions of its reconciler.

For example, if a MySQLCluster is reconciled with version 1 of the reconciler, MOCO will keep using the version 1 reconciler to reconcile the MySQLCluster.

If the user edits MySQLCluster's spec field, MOCO can reconcile the MySQLCluster with the latest reconciler, for example version 2, because the user is expected to be prepared for mysqld restarts.

The update policy of moco-agent container

We shall try to avoid updating moco-agent as much as possible.

The figure below illustrates the overview of resources related to clustering MySQL instances.

Overview of clustering related resources

StatefulSet

MOCO tries not to update the StatefulSet frequently. It updates the StatefulSet only when an update is strictly necessary.

The conditions for StatefulSet update

The StatefulSet will be updated when:

  • Some fields under spec of MySQLCluster are modified.
  • my.cnf for mysqld is updated.
  • the version of the reconciler used to reconcile the StatefulSet is obsolete.
  • the image of moco-agent given to the controller is updated.
  • the image of mysqld_exporter given to the controller is updated.

When the StatefulSet is not updated

  • the image of fluent-bit given to the controller is changed.
    • because the controller does not depend on fluent-bit.

The fluent-bit sidecar container is updated only when some fields under spec of MySQLCluster are modified.

Secrets

MOCO generates random passwords for users that MOCO uses to access MySQL.

The generated passwords are stored in two Secrets. One is in the same namespace as moco-controller, and the other is in the namespace of MySQLCluster.

Certificate

MOCO creates a Certificate in the same namespace as moco-controller to issue a TLS certificate for moco-agent.

After cert-manager issues a TLS certificate and creates a Secret for it, MOCO copies the Secret to the namespace of MySQLCluster. For details, read security.md.

Service

MOCO creates three Services for each MySQLCluster, that is:

  • A headless Service, required for every StatefulSet
  • A Service for the primary mysqld instance
  • A Service for replica mysqld instances

The Services' labels, annotations, and spec fields can be customized with MySQLCluster's spec.serviceTemplate field.

The following fields in Service spec may not be customized, though.

  • clusterIP
  • ports
  • selector

ConfigMap

MOCO creates and updates a ConfigMap for my.cnf. The name of this ConfigMap is calculated from the contents of my.cnf that may be changed by users.

MOCO deletes old ConfigMaps of my.cnf after a new ConfigMap for my.cnf is created.

If the cluster does not disable a sidecar container for slow query logs, MOCO creates a ConfigMap for the sidecar.

PodDisruptionBudget

MOCO creates a PodDisruptionBudget for each MySQLCluster to prevent the number of available semi-sync replica servers from becoming too low.

The spec.maxUnavailable value is calculated from MySQLCluster's spec.replicas as follows:

`spec.maxUnavailable` = floor(`spec.replicas` / 2)

If spec.replicas is 1, MOCO does not create a PDB.
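As a worked example, with spec.replicas: 5 the formula gives maxUnavailable = floor(5/2) = 2, so the generated PDB would look roughly like the following sketch (the name and selector labels are illustrative, not MOCO's exact naming):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: moco-mycluster
spec:
  maxUnavailable: 2            # floor(spec.replicas / 2) for replicas = 5
  selector:
    matchLabels:
      app.kubernetes.io/instance: mycluster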

ServiceAccount

MOCO creates a ServiceAccount for Pods of the StatefulSet. The ServiceAccount is not bound to any Roles/ClusterRoles.

See backup.md for the overview of the backup and restoration mechanism.

CronJob

This is the only resource created when backup is enabled for MySQLCluster.

If the backup is disabled, the CronJob is deleted.

Job

To restore data from a backup, MOCO creates a Job. MOCO deletes the Job after the Job finishes successfully.

If the Job fails, MOCO leaves the Job.

How MOCO maintains MySQL clusters

For each MySQLCluster, MOCO creates and maintains a set of mysqld instances. The set contains one primary instance and may contain multiple replica instances depending on the spec.replicas value of MySQLCluster.

This document describes how MOCO does this job safely.

Terminology

  • Replication: GTID-based replication between mysqld instances.
  • Cluster: a group of mysqld instances that replicate data between them.
  • Primary (instance): a single source instance of mysqld in a cluster.
  • Replica (instance): a read-only instance of mysqld that synchronizes data with the primary instance.
  • Intermediate primary: a special primary instance that replicates data from an external mysqld.
  • Errant transaction: a transaction that exists only on a replica instance.
  • Errant replica: a replica instance that has errant transactions.
  • Switchover: operation to change a live primary to a replica and promote a replica to the new primary.
  • Failover: operation to replace a dead primary with a replica.

Prerequisites

MySQLCluster allows positive odd numbers for its spec.replicas value. If 1, MOCO runs a single mysqld instance without configuring replication. If 3 or greater, MOCO chooses a mysqld instance as the primary, writable instance and configures all other instances as replicas of the primary instance.

status.currentPrimaryIndex in MySQLCluster is used to record the current chosen primary instance. Initially, status.currentPrimaryIndex is zero and therefore the index of the primary instance is zero.

As a special case, if spec.replicationSourceSecretName is set for MySQLCluster, the primary instance is configured as a replica of an external MySQL server. In this case, the primary instance will not be writable. We call this type of primary instance intermediate primary.

If spec.replicationSourceSecretName is not set, MOCO configures semisynchronous replication between the primary and replicas. Otherwise, the replication is asynchronous.

For semi-synchronous replication, MOCO configures rpl_semi_sync_master_timeout long enough so that it never degrades to asynchronous replication.

Likewise, MOCO configures rpl_semi_sync_master_wait_for_slave_count to (spec.replicas - 1) / 2 to make sure that at least half of the replica instances have the same commit as the primary. e.g., if spec.replicas is 5, rpl_semi_sync_master_wait_for_slave_count will be set to 2.

MOCO also disables relay_log_recovery because enabling it would drop the relay logs on replicas.

mysqld always starts with super_read_only=1 to prevent erroneous writes, and with skip_slave_start to prevent misconfigured replication.

moco-agent, a sidecar container for MOCO, initializes MySQL users and plugins. At the end of the initialization, it issues RESET MASTER to clear the executed GTID set.

moco-agent also provides a readiness probe for the mysqld container. If a replica instance does not start replication threads or lags too far behind in executing transactions, the container and the Pod are determined to be unready.

Limitations

Currently, MOCO does not re-initialize data after the primary instance fails.

After failover to a replica instance, the old primary may have errant transactions because it may recover unacknowledged transactions in its binary log. This is an inevitable limitation in MySQL semi-synchronous replication.

If this happens, MOCO detects the errant transaction and will not allow the old primary to rejoin the cluster as a replica.

Users need to delete the volume data (PersistentVolumeClaim) and the pod of the old primary to re-initialize it.

Possible states

MySQLCluster

MySQLCluster can be one of the following states.

The initial state is Cloning if spec.replicationSourceSecretName is set, or Restoring if spec.restore is set. Otherwise, the initial state is Incomplete.

Note that, if the primary Pod is ready, mysqld is guaranteed to be writable. Likewise, if a replica Pod is ready, mysqld is guaranteed to be read-only and running replication threads without too much delay.

  1. Healthy
    • All Pods are ready.
    • All replicas have no errant transactions.
    • All replicas are read-only and connected to the primary.
    • For intermediate primary instance, the primary works as a replica for an external mysqld and is read-only.
  2. Cloning
    • spec.replicationSourceSecretName is set.
    • status.cloned is false.
    • The cloning result exists and is not "Completed", or there is no cloning result and the instance has no data.
    • (note: if the primary has some data and no cloning result, the instance used to be a replica and was then promoted to the primary.)
  3. Restoring
    • spec.restore is set.
    • status.restoredTime is not set.
  4. Degraded
    • The primary Pod is ready and does not lose data.
    • For intermediate primary instance, the primary works as a replica for an external mysqld and is read-only.
    • Half or more replicas are ready, read-only, connected to the primary, and have no errant transactions. For example, if spec.replicas is 5, two or more such replicas are needed.
    • At least one replica has some problems.
  5. Failed
    • The primary instance is not running or lost data.
    • More than half of replicas are running and have data without errant transactions. For example, if spec.replicas is 5, three or more such replicas are needed.
  6. Lost
    • The primary instance is not running or lost data.
    • Half or more replicas are not running or lost data or have errant transactions.
  7. Incomplete
    • None of the above states applies.

MOCO can recover the cluster to Healthy from Degraded, Failed, or Incomplete if all Pods are running and there are no errant transactions.

MOCO can recover the cluster to Degraded from Failed when not all Pods are running. Recovering from Failed is called failover.

MOCO cannot recover the cluster from Lost. Users need to restore data from backups.

Pod

mysqld is run as a container in a Pod. Therefore, MOCO needs to be aware of the following conditions.

  1. Missing: the Pod does not exist.
  2. Exist: the Pod exists and is not Terminating or Demoting.
  3. Terminating: The Pod exists and metadata.deletionTimestamp is not null.
  4. Demoting: The Pod exists and has moco.cybozu.com/demote: true annotation.

If there are missing Pods, MOCO does nothing for the MySQLCluster.

If a primary instance Pod is Terminating or Demoting, the MOCO controller changes the primary to one of the replica instances. This operation is called switchover.

MySQL data

MOCO checks whether replica instances have errant transactions compared to the primary instance. If it detects such an instance, MOCO records the instance in MySQLCluster and excludes it from the cluster.

The user needs to delete the Pod and the volume manually and let the StatefulSet controller re-create them. After a newly initialized instance gets created, MOCO will allow it to rejoin the cluster.

Invariants

  • By definition, the primary instance recorded in MySQLCluster has no errant transactions. It is always the single source of truth.
  • Errant replicas are not treated as ready even if their Pod status is ready.

The maintenance flow

MOCO runs the following infinite loop for each MySQLCluster. It stops when the MySQLCluster resource is deleted.

  1. Gather the current status
  2. Update status of MySQLCluster
  3. Determine what MOCO should do for the cluster
  4. If there is nothing to do, wait a while and go to 1
  5. Do the determined operation then go to 1

Read the following sub-sections on steps 1 to 3.

Gather the current status

MOCO gathers the information from kube-apiserver and mysqld as follows:

  • MySQLCluster resource
  • Pod resources
    • If some of the Pods are missing, MOCO does nothing.
  • mysqld
    • SHOW SLAVE HOSTS (on the primary)
    • SHOW SLAVE STATUS (on the replicas)
    • Global variables such as gtid_executed or super_read_only
    • Result of CLONE from performance_schema.clone_status table

If MOCO cannot connect to an instance for a certain period, the instance is considered failed.

Update status of MySQLCluster

In this phase, MOCO updates the status field of MySQLCluster as follows:

  1. Determine the current MySQLCluster state.
  2. Add or update type=Initialized condition to status.conditions as
    • True if the cluster state is not Cloning.
    • otherwise, False.
  3. Add or update type=Available condition to status.conditions as
    • True if the cluster state is Healthy or Degraded.
    • otherwise, False.
  4. Add or update type=Healthy condition to status.conditions as
    • True if the cluster state is Healthy.
    • otherwise, False.
    • The Reason field is set to the cluster state such as "Failed" or "Incomplete".
  5. Set the number of ready replica Pods to status.syncedReplicas.
  6. Add newly found errant replicas to status.errantReplicaList.
  7. Remove re-initialized and/or no-longer-errant replicas from status.errantReplicaList.
  8. Set status.errantReplicas to the length of status.errantReplicaList.
  9. Set status.cloned to true if spec.replicationSourceSecret is not nil and the state is not Cloning.
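Putting the steps above together, the status of a degraded cluster with one errant replica might look like the following. The values are illustrative, and the exact contents of each condition may differ from this sketch:

```yaml
status:
  currentPrimaryIndex: 0
  syncedReplicas: 2
  errantReplicaList:
    - 2
  errantReplicas: 1
  cloned: true
  conditions:
    - type: Initialized
      status: "True"
    - type: Available
      status: "True"
    - type: Healthy
      status: "False"
      reason: Degraded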

Determine what MOCO should do for the cluster

The operation depends on the current cluster state.

The operation and its result are recorded as Events of MySQLCluster resource.

cf. Application Introspection and Debugging

Healthy

If the primary instance Pod is Terminating or Demoting, switch the primary instance to another replica. Otherwise, just wait a while.

The switchover is done as follows. It takes at least several seconds for a new primary to become writable.

  1. Set super_read_only=1 on the primary instance.
  2. Kill all existing connections except those from localhost and those for MOCO.
  3. Wait for a replica to catch up with the executed GTID set of the primary instance.
  4. Set status.currentPrimaryIndex to the replica's index.
  5. If the old primary is Demoting, remove moco.cybozu.com/demote annotation from the Pod.
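The core of steps 1 and 3 corresponds to SQL statements like these. The GTID set is a placeholder, and whether MOCO issues exactly these statements should be checked against its source:

```sql
-- Step 1: on the old primary, block any further writes.
SET GLOBAL super_read_only = 1;

-- Step 3: on the candidate replica, wait until the old primary's
-- gtid_executed set (placeholder below) has been fully executed,
-- with a 10-second timeout per attempt.
SELECT WAIT_FOR_EXECUTED_GTID_SET('<primary gtid_executed>', 10);
```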

Cloning

Execute CLONE INSTANCE on the intermediate primary instance to clone data from an external MySQL instance.

If the cloning succeeds, proceed as in the Intermediate case.

Restoring

Do nothing.

Degraded

First, check if the primary instance Pod is Terminating or Demoting; if it is, do the switchover just as in the Healthy case.

Then, do the same as in the Intermediate case to try to fix the problems. Note, though, that the cluster cannot recover to Healthy while there are errant or stopped replicas.

Failed

MOCO chooses the most advanced instance as the new primary. The most advanced instance is the one whose retrieved GTID set is a superset of every other replica's, excluding replicas that have errant transactions.

To prevent accidental writes to the old primary instance (so-called split-brain), MOCO stops the replication IO_THREAD on all replicas. This way, the old primary cannot get the acks it needs from replicas to write further transactions.

The failover is done as follows:

  1. Stop IO_THREAD on all replicas.
  2. Choose the most advanced replica as the new primary. Errant replicas recorded in MySQLCluster are excluded from the candidates.
  3. Wait for the replica to execute all of its retrieved GTID set.
  4. Update status.currentPrimaryIndex to the new primary's index.
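Steps 1 and 3 correspond to SQL statements like the following. The GTID set is a placeholder, and the exact statements MOCO issues should be checked against its source:

```sql
-- Step 1: on every replica, stop only the I/O thread; the SQL
-- thread keeps applying transactions that were already retrieved.
STOP SLAVE IO_THREAD;

-- Step 3: on the chosen replica, wait until its retrieved GTID
-- set has been fully executed (10-second timeout per attempt).
SELECT WAIT_FOR_EXECUTED_GTID_SET('<retrieved gtid set>', 10);
```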

Lost

There is nothing that can be done.

Intermediate

  • On the primary that was an intermediate primary, wait for all the retrieved GTID set to be executed.
  • Start replication between the primary and non-errant replicas.
    • If a replica has no data, MOCO clones the primary data to the replica first.
  • Stop replication of errant replicas.
  • Set super_read_only=1 for replica instances that are writable.
  • Adjust the moco.cybozu.com/role label on Pods according to their roles.
    • For errant replicas, the label is removed to prevent users from reading inconsistent data.
  • Finally, make the primary mysqld writable if the primary is not an intermediate primary.

Backup and restore

This document describes how MOCO takes a backup of MySQLCluster data and restores a cluster from a backup.

Overview

A MySQLCluster can be configured to take backups regularly by referencing a BackupPolicy in spec.backupPolicyName. For each MySQLCluster associated with a BackupPolicy, moco-controller creates a CronJob. The CronJob creates a Job to take a full backup periodically. The Job also takes a backup of binary logs for Point-in-Time Recovery (PiTR). The backups are stored in an S3-compatible object storage bucket.

This figure illustrates how MOCO takes a backup of a MySQLCluster.

Backup

  1. moco-controller creates a CronJob and Role/RoleBinding to allow access to MySQLCluster for the Job Pod.
  2. At each configured interval, CronJob creates a Job.
  3. The Job dumps all data from a mysqld using MySQL shell's dump instance utility.
  4. The Job creates a tarball of the dumped data and puts it in a bucket of S3-compatible object storage.
  5. The Job also dumps binlogs generated since the last backup and puts them in the same bucket (under a different key, of course).
  6. The Job finally updates the MySQLCluster status to record the last successful backup.

To restore from a backup, users need to create a new MySQLCluster with spec.restore filled with necessary information such as the bucket name of the object storage, the object key, and so on.

The next figure illustrates how MOCO restores MySQL cluster from a backup.

Restore

  1. moco-controller creates a Job and Role/RoleBinding for restoration.
  2. The Job downloads a tarball of dumped files of the specified backup.
  3. The Job loads data into an empty mysqld using MySQL shell's dump loading utility.
  4. If the user wants to restore data to a point-in-time, the Job downloads the saved binlogs.
  5. The Job applies binlogs up to the specified point-in-time using mysqlbinlog.
  6. The Job finally updates MySQLCluster status to record the restoration time.

Design goals

Must:

  • Users must be able to configure different backup policies for each MySQLCluster.
  • Users must be able to restore MySQL data at a point-in-time from backups.
  • Users must be able to restore MySQL data without the original MySQLCluster resource.
  • moco-controller must export metrics about backups.

Should:

  • Backup data should be compressed to save storage space.
  • Backup data should be stored in an object storage.
  • Backups should be taken from a replica instance as much as possible.

These "should"s are mostly about cost or performance.

Implementation

Backup file keys

Backup files are stored in an object storage bucket with the following keys.

  • Key for a tarball of a fully dumped MySQL: moco/<namespace>/<name>/YYYYMMDD-hhmmss/dump.tar
  • Key for a compressed tarball of binlog files: moco/<namespace>/<name>/YYYYMMDD-hhmmss/binlog.tar.zst

<namespace> is the namespace of MySQLCluster, and <name> is the name of MySQLCluster. YYYYMMDD-hhmmss is the date and time of the backup where YYYY is the year, MM is two-digit month, DD is two-digit day, hh is two-digit hour in 24-hour format, mm is two-digit minute, and ss is two-digit second.

Example: moco/foo/bar/20210515-230003/dump.tar

This allows multiple MySQLClusters to share the same bucket.

Timestamps

Internally, the time for PiTR is formatted in UTC timezone.

The restore Job runs mysqlbinlog with TZ=Etc/UTC.

Backup

As described in Overview, the backup process is implemented with CronJob and Job. In addition, users need to provide a ServiceAccount for the Job.

The ServiceAccount is often used to grant access to the object storage bucket where the backup files will be stored. For instance, Amazon Elastic Kubernetes Service (EKS) has a feature to create such a ServiceAccount. Kubernetes itself is also developing such an enhancement called Container Object Storage Interface (COSI).

To allow the backup Job to update the MySQLCluster status, MOCO creates a Role and RoleBinding. The RoleBinding grants access to the given ServiceAccount.

For the time being, MOCO supports only the AWS S3 API, as it is the most prevalent object storage API. We intend to extend the support to S3-compatible object storages such as MinIO and Ceph.

For the first backup, the Job chooses a replica instance as the backup source if one is available. For the second and subsequent backups, the Job chooses the previously chosen instance as long as it is still an available replica.

The backups are divided into two: a full dump and binlogs. A full dump is a snapshot of the entire MySQL database. Binlogs are records of transactions. With mysqlbinlog, binlogs can be used to apply transactions to a database restored from a full dump for PiTR.

For the first backup, MOCO takes only a full dump of a MySQL instance and records the GTID set at the time of the backup. For the second and subsequent backups, MOCO also retrieves the binlogs recorded since the GTID set of the last backup.

To take a full dump, MOCO uses MySQL shell's dump instance utility. It performs significantly faster than mysqldump or mysqlpump. The dump is compressed with the zstd compression algorithm.

MOCO then creates a tarball of the dump and puts it in an object storage bucket.

To retrieve the transactions generated since the last backup, mysqlbinlog is used with flags that read binlogs from the remote server while excluding the GTID sets already contained in the last backup.
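Such an invocation might look like the following sketch. The host, user, GTID set, and starting binlog filename are placeholders, and the exact flag set MOCO uses is defined in its backup code:

```shell
# Fetch raw binlog files from the backup source, letting the server
# skip transactions whose GTIDs were already saved by the last backup.
$ mysqlbinlog \
    --read-from-remote-master=BINLOG-DUMP-GTIDS \
    --exclude-gtids='<gtid set of the last backup>' \
    --host=<backup source host> --user=<admin user> \
    --raw --to-last-log \
    <binlog filename recorded at the last backup>
```

With --read-from-remote-master=BINLOG-DUMP-GTIDS, the excluded GTIDs are filtered on the server side, so the filter still applies even though --raw copies events without decoding them.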

The retrieved binlog files are packed into a tarball, compressed with zstd, and put in the object storage bucket.

Finally, the Job updates MySQLCluster status field with the following information:

  • The time of backup
  • The time spent on the backup
  • The ordinal of the backup source instance
  • server_uuid of the instance (to check whether the instance was re-initialized or not)
  • The binlog filename in the SHOW MASTER STATUS output
  • The size of the tarball of the dumped files
  • The size of the tarball of the binlog files
  • The maximum usage of the working directory
  • Warnings, if any

Restore

To restore MySQL data from a backup, users need to create a new MySQLCluster with an appropriate spec.restore field. spec.restore needs to provide at least the following information:

  • The bucket name
  • Namespace and name of the original MySQLCluster
  • A point-in-time in RFC3339 format
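A minimal manifest might look like the following sketch. Other required fields such as the Pod template are omitted, and the API version and field names should be checked against the MySQLCluster CRD reference for your MOCO version:

```yaml
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: restored
  name: bar
spec:
  restore:
    sourceNamespace: foo                  # namespace of the original MySQLCluster
    sourceName: bar                       # name of the original MySQLCluster
    restorePoint: "2021-05-15T23:30:00Z"  # point-in-time in RFC3339 format
    jobConfig:
      serviceAccountName: backup-sa       # grants access to the bucket
      bucketConfig:
        bucketName: moco-backup           # the bucket holding the backups
```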

After moco-controller confirms that mysqld is running, it creates a Job to retrieve the backup files and load them into mysqld.

The Job looks for the most recent tarball of the dumped files that is older than the specified point-in-time in the bucket, and retrieves it. The dumped files are then loaded to mysqld using MySQL shell's load dump utility.

If the point-in-time is different from the time of the dump file, and if there is a compressed tarball of binlog files, then the Job retrieves binlog files and applies transactions up to the point-in-time.

After the restoration process finishes, the Job updates the MySQLCluster status to record the restoration time. moco-controller then configures clustering as usual.

If the Job fails, moco-controller leaves the Job as is. The restored MySQL cluster will also be left read-only. If some of the data have been restored, they can be read from the cluster.

If a failed Job is deleted, moco-controller will create a new Job to give it another chance. Users can safely delete a successful Job.

Caveats

  • No automatic deletion of backup files

    MOCO does not delete old backup files from object storage. Users should configure a bucket lifecycle policy to delete old backups automatically.

  • Duplicated backup Jobs

    CronJob may create two or more Jobs at a time. If this happens, only one Job can update MySQLCluster status.

  • Lost binlog files

    If binlog_expire_logs_seconds or expire_logs_days is set to a value shorter than the interval between backups, MOCO cannot save binlogs correctly. Users are responsible for configuring binlog_expire_logs_seconds appropriately.
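For example, a retention of one week comfortably covers daily backups. The following is a my.cnf-style sketch with an illustrative value; apply it through whatever mechanism your cluster uses to configure mysqld:

```ini
[mysqld]
# Retain binlogs for 7 days; this must be longer than the
# interval between full backups.
binlog_expire_logs_seconds = 604800
```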

Considered options

There were many design choices and alternative methods for implementing the backup/restore feature for MySQL. Here is why we settled on the current design.

Why do we use S3-compatible object storage to store backups?

Compared to file systems, object storage is generally more cost-effective. It also has many useful features such as object lifecycle management.

AWS S3 API is the most prevailing API for object storages.

Why do we use Jobs for backup and restoration?

Backup and restoration can be CPU- and memory-consuming tasks. Running such a task in moco-controller is dangerous because moco-controller manages many MySQLClusters.

moco-agent is also not a safe place to run backup jobs because it is a sidecar of the mysqld Pod. If a backup ran in the mysqld Pod, it would interfere with the mysqld process.

Why do we prefer mysqlsh to mysqldump?

The biggest reason is the difference in how these tools lock the instance.

mysqlsh uses LOCK INSTANCE FOR BACKUP, which blocks DDL until the lock is released. mysqldump, on the other hand, allows DDL to be executed during the backup. Once a DDL statement is executed, it acquires a metadata lock, which blocks any DML for the table modified by the DDL.

Blocking DML during backup is not desirable, especially when the only available backup source is the primary instance.

Another reason is that mysqlsh is much faster than mysqldump / mysqlpump.

Why don't we do continuous backup?

Continuous backup is a technique to save executed transactions in real time. For MySQL, this can be done with mysqlbinlog --stop-never. This command continuously retrieves transactions from binary logs and outputs them to stdout.

MOCO does not adopt this technique for the following reasons:

  • We assume MOCO clusters have replica instances in most cases.

    When the data of the primary instance is lost, one of the replicas can be promoted as a new primary.

  • It is troublesome to control the continuous backup process on Kubernetes.

    The process needs to be kept running between full backups. If we did so, the entire backup process would have to be a persistent workload, not a (Cron)Job.

Upgrading mysqld

This document describes how mysqld upgrades its data and what MOCO has to do about it.

Preconditions

MySQL data

Beginning with 8.0.16, mysqld updates all data that needs to be updated when it starts running. This means that MOCO itself needs to do nothing about MySQL data.

One thing that we should care about is that the update process may take a long time. The startup probe of mysqld container should be configured to wait for mysqld to complete updating data.

ref: https://dev.mysql.com/doc/refman/8.0/en/upgrading-what-is-upgraded.html

Downgrading

MySQL 8.0 does not support any kind of downgrading.

ref: https://dev.mysql.com/doc/refman/8.0/en/downgrading.html

Internally, MySQL has a version called "data dictionary (DD) version". If two MySQL versions have the same DD version, they are considered to have data compatibility.

ref: https://github.com/mysql/mysql-server/blob/mysql-8.0.24/sql/dd/dd_version.h#L209

Nevertheless, DD versions do change from time to time between revisions of MySQL 8.0. Therefore, the simplest way to avoid DD version mismatch is to not downgrade MySQL.

Upgrading a replication setup

In a nutshell, replica MySQL instances should be the same version as or newer than the source MySQL instance.

refs:

  • https://dev.mysql.com/doc/refman/8.0/en/replication-compatibility.html
  • https://dev.mysql.com/doc/refman/8.0/en/replication-upgrade.html

StatefulSet behavior

When the Pod template of a StatefulSet is updated, Kubernetes updates the Pods. With the default update strategy RollingUpdate, the Pods are updated one by one from the largest ordinal to the smallest.

The StatefulSet controller keeps the old Pod template until it completes the rolling update. If a Pod that is not being updated is deleted, the StatefulSet controller restores it from the old template.

This means that, if the cluster is Healthy, MySQL is assured to be updated one by one from the instance of the largest ordinal to the smallest.

refs:

  • https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates
  • https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update

Automatic switchover

MOCO switches the primary instance when the Pod of the instance is being deleted. Read clustering.md for details.

MOCO implementation

With the preconditions listed above, MOCO can upgrade mysqld in MySQLCluster safely as follows.

  1. Set .spec.updateStrategy field in StatefulSet to RollingUpdate.
  2. Choose the lowest ordinal Pod as the next primary upon a switchover.
  3. Configure the startup probe of mysqld container to wait long enough.
    • By default, MOCO configures the probe to wait up to one hour.
    • Users can adjust the duration for each MySQLCluster.

Example

Suppose that we are updating a three-instance cluster. The mysqld instances in the cluster have ordinals 0, 1, and 2, and the current primary instance is instance 1.

After MOCO updates the Pod template of the StatefulSet created for the cluster, Kubernetes starts re-creating Pods, beginning with instance 2.

Instance 2 is a replica and therefore is safe for an update.

After instance 2, the instance 1 Pod is deleted. The deletion triggers an automatic switchover, so MOCO changes the primary to instance 0 because it has the lowest ordinal. Because instance 0 is running the old mysqld, the preconditions are kept.

Finally, instance 0 is re-created in the same way. This time, MOCO switches the primary to instance 1. Since instances 1 and 2 have both been updated and instance 0 is being deleted, the preconditions are kept.

Limitations

If an instance is down during an upgrade, MOCO may choose an already updated instance as the new primary even though some instances are still running an old version.

If this happens, users may need to manually delete the old replica data and re-initialize the replica to restore the cluster health.

User's responsibility

Security considerations

gRPC API

moco-agent, a sidecar container in the mysqld Pod, provides a gRPC API to execute CLONE INSTANCE and the required operations after CLONE. Importantly, the request contains credentials to access the source database.

To protect the credentials and prevent abuse of API, MOCO configures mTLS between moco-agent and moco-controller as follows:

  1. Create an Issuer resource in moco-system namespace as the Certificate Authority.
  2. Create a Certificate resource to issue the certificate for moco-controller.
  3. moco-controller issues certificates for each MySQLCluster by creating Certificate resources.
  4. moco-controller copies Secret resources created by cert-manager to the namespaces of MySQLCluster.
  5. Both moco-controller and moco-agent verify the peer certificate with the CA certificate.
    • The CA certificate is embedded in the Secret resources.
  6. moco-agent additionally verifies that the certificate from moco-controller has the Common Name moco-controller.

MySQL passwords

MOCO generates its user passwords randomly with the OS random device. The passwords are then stored as Secret resources.

As for communication between moco-controller and mysqld, it is not (yet) over TLS. That said, the password is encrypted anyway thanks to caching_sha2_password authentication.
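Password generation with the OS random device can be sketched in Go as follows. The character set and length are illustrative assumptions; the real generator lives in the MOCO source tree:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// passwordChars is an illustrative character set for generated passwords.
const passwordChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// genPassword draws each character from the OS random device via
// crypto/rand, using rejection-free uniform sampling over the
// character set. A sketch of MOCO's approach, not its actual code.
func genPassword(n int) (string, error) {
	buf := make([]byte, n)
	m := big.NewInt(int64(len(passwordChars)))
	for i := range buf {
		r, err := rand.Int(rand.Reader, m)
		if err != nil {
			return "", err
		}
		buf[i] = passwordChars[r.Int64()]
	}
	return string(buf), nil
}

func main() {
	p, err := genPassword(32)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(p)) // prints 32
}
```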