MOCO documentation

moco logo

This is the documentation site for MOCO. MOCO is a Kubernetes operator for MySQL created and maintained by Cybozu.

Getting started

Setup

Quick setup

You can choose between two installation methods.

MOCO depends on cert-manager. If cert-manager is not installed on your cluster, install it as follows:

$ curl -fsLO https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml
$ kubectl apply -f cert-manager.yaml

Install using raw manifests:

$ curl -fsLO https://github.com/cybozu-go/moco/releases/latest/download/moco.yaml
$ kubectl apply -f moco.yaml

Install using Helm chart:

$ helm repo add moco https://cybozu-go.github.io/moco/
$ helm repo update
$ helm install --create-namespace --namespace moco-system moco moco/moco

Customize manifests

If you want to edit the manifest, the config/ directory contains the source YAML for kustomize.
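
For example (a sketch, assuming the standard kustomize layout with config/default as the entry point), you can render and apply a customized manifest like this:

$ git clone https://github.com/cybozu-go/moco.git
$ cd moco
$ # edit files under config/ as needed, then:
$ kustomize build config/default | kubectl apply -f -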

Next step

Read usage.md and create your first MySQL cluster!

MOCO Helm Chart

How to use MOCO Helm repository

You need to add this repository to your Helm repositories:

$ helm repo add moco https://cybozu-go.github.io/moco/
$ helm repo update

Quick start

Installing cert-manager

$ curl -fsL https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml | kubectl apply -f -

Installing the Chart

NOTE: This installation method requires cert-manager to be installed beforehand.

To install the chart with the release name moco in a dedicated namespace (recommended):

$ helm install --create-namespace --namespace moco-system moco moco/moco

Specify parameters using the --set key=value[,key=value] argument to helm install.
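
For example, to override the CPU request of moco-controller:

$ helm install --create-namespace --namespace moco-system \
    --set resources.requests.cpu=200m moco moco/moco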

Alternatively, a YAML file that specifies values for the parameters can be provided as follows:

$ helm install --create-namespace --namespace moco-system moco -f values.yaml moco/moco
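
The following is a minimal sketch of such a values.yaml; the keys are taken from the Values table below, and the flag value is only an example:

$ cat > values.yaml <<'EOF'
resources:
  requests:
    cpu: 200m
    memory: 50Mi
extraArgs:
  - --apiserver-qps-throttle=30
EOF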

Values

Key | Type | Default | Description
image.repository | string | "ghcr.io/cybozu-go/moco" | MOCO image repository to use.
image.tag | string | {{ .Chart.AppVersion }} | MOCO image tag to use.
resources | object | {"requests":{"cpu":"100m","memory":"20Mi"}} | resources used by moco-controller.
extraArgs | list | [] | Additional command line flags to pass to the moco-controller binary.
nodeSelector | object | {} | nodeSelector used by moco-controller.
affinity | object | {} | affinity used by moco-controller.
tolerations | list | [] | tolerations used by moco-controller.
topologySpreadConstraints | list | [] | topologySpreadConstraints used by moco-controller.
priorityClassName | string | "" | PriorityClass used by moco-controller.

Generate Manifests

You can use the helm template command to render manifests.

$ helm template --namespace moco-system moco moco/moco

Upgrade CRDs

There is no support at this time for upgrading or deleting CRDs using Helm. Users must manually upgrade the CRDs if there is a change in the CRDs used by MOCO.

https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#install-a-crd-declaration-before-using-the-resource
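
For example, one way to upgrade the CRDs manually is to extract the CustomResourceDefinition documents from the release manifest and apply them. This is only a sketch; it assumes the release's moco.yaml contains the CRDs and that yq (v4) is available:

$ curl -fsLO https://github.com/cybozu-go/moco/releases/latest/download/moco.yaml
$ yq 'select(.kind == "CustomResourceDefinition")' moco.yaml | kubectl apply --server-side -f -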

Migrate to v0.3.0

Chart version v0.3.0 contains breaking changes: the .metadata.name of the resources generated by the chart has changed.

e.g.

  • {{ template "moco.fullname" . }}-foo-resources -> moco-foo-resources

Related Issue: cybozu-go/moco#426

If you are using a release name other than moco, you need to migrate.

The migration steps delete and re-create each MOCO resource once, except the CRDs. Since the CRDs are not deleted, the Pods running existing MySQL clusters are not deleted either, so there is no downtime. However, complete the migration quickly: while the moco-controller is deleted, no control over the clusters is available.

Migration steps
  1. Show the installed chart

    $ helm list -n <YOUR NAMESPACE>
    NAME    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
    moco    moco-system     1               2022-08-17 11:28:23.418752 +0900 JST    deployed        moco-0.2.3      0.12.1
    
  2. Render the manifests

    $ helm template --namespace moco-system --version <YOUR CHART VERSION> <YOUR INSTALL NAME> moco/moco > render.yaml
    
  3. Setup kustomize

    $ cat > kustomization.yaml <<'EOF'
    resources:
      - render.yaml
    patches:
      - crd-patch.yaml
    EOF
    
    $ cat > crd-patch.yaml <<'EOF'
    $patch: delete
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: backuppolicies.moco.cybozu.com
    ---
    $patch: delete
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: mysqlclusters.moco.cybozu.com
    EOF
    
  4. Delete resources

    $ kustomize build ./ | kubectl delete -f -
    serviceaccount "moco-controller-manager" deleted
    role.rbac.authorization.k8s.io "moco-leader-election-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-backuppolicy-editor-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-backuppolicy-viewer-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-manager-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-mysqlcluster-editor-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-mysqlcluster-viewer-role" deleted
    rolebinding.rbac.authorization.k8s.io "moco-leader-election-rolebinding" deleted
    clusterrolebinding.rbac.authorization.k8s.io "moco-manager-rolebinding" deleted
    service "moco-webhook-service" deleted
    deployment.apps "moco-controller" deleted
    certificate.cert-manager.io "moco-controller-grpc" deleted
    certificate.cert-manager.io "moco-grpc-ca" deleted
    certificate.cert-manager.io "moco-serving-cert" deleted
    issuer.cert-manager.io "moco-grpc-issuer" deleted
    issuer.cert-manager.io "moco-selfsigned-issuer" deleted
    mutatingwebhookconfiguration.admissionregistration.k8s.io "moco-mutating-webhook-configuration" deleted
    validatingwebhookconfiguration.admissionregistration.k8s.io "moco-validating-webhook-configuration" deleted
    
  5. Delete Secret

    $ kubectl delete secret sh.helm.release.v1.<YOUR INSTALL NAME>.v1 -n <YOUR NAMESPACE>
    
  6. Re-install the v0.3.0 chart

    $ helm install --create-namespace --namespace moco-system --version 0.3.0 moco moco/moco
    

Release Chart

See RELEASE.md.

Installing kubectl-moco

kubectl-moco is a plugin for kubectl to control MySQL clusters managed by MOCO.

Pre-built binaries are available on GitHub releases for Windows, Linux, and macOS.

Installing using Krew

Krew is the plugin manager for the kubectl command-line tool.

See the documentation for how to install Krew.

$ kubectl krew update
$ kubectl krew install moco

Installing manually

  1. Set OS to the operating system name

    OS is one of linux, windows, or darwin (macOS).

    If Go is available, OS can be set automatically as follows:

    $ OS=$(go env GOOS)
    
  2. Set ARCH to the architecture name

    ARCH is one of amd64 or arm64.

    If Go is available, ARCH can be set automatically as follows:

    $ ARCH=$(go env GOARCH)
    
  3. Set VERSION to the MOCO version

    See the MOCO release page: https://github.com/cybozu-go/moco/releases

    $ VERSION=< The version you want to install >
    
  4. Download the binary and put it in a directory of your PATH.

    The following is an example to install the plugin in /usr/local/bin.

    $ curl -L -sS https://github.com/cybozu-go/moco/releases/download/${VERSION}/kubectl-moco_${VERSION}_${OS}_${ARCH}.tar.gz \
      | tar xz -C /usr/local/bin kubectl-moco
    
  5. Check the installation by running kubectl moco -h.

    $ kubectl moco -h
    the utility command for MOCO.
    
    Usage:
      kubectl-moco [command]
    
    Available Commands:
      credential  Fetch the credential of a specified user
      help        Help about any command
      mysql       Run mysql command in a specified MySQL instance
      switchover  Switch the primary instance
    
    ...
    

How to use MOCO

After setting up MOCO, you can create MySQL clusters with a custom resource called MySQLCluster.

Basics

MOCO creates a cluster of mysqld instances for each MySQLCluster. A cluster can consist of 1, 3, or 5 mysqld instances.

MOCO configures semi-synchronous GTID-based replication between mysqld instances in a cluster if the cluster size is 3 or 5. A 3-instance cluster can tolerate up to 1 replica failure, and a 5-instance cluster can tolerate up to 2 replica failures.

In a cluster, there is only one instance called primary. The primary instance is the source of truth. It is the only writable instance in the cluster, and the source of the replication. All other instances are called replica. A replica is a read-only instance and replicates data from the primary.

Limitations

Errant replicas

An inherent limitation of GTID-based semi-synchronous replication is that a failed instance may have errant transactions. If this happens, the instance needs to be re-created after removing all of its data.

MOCO does not re-create such an instance. It only detects instances having errant transactions and excludes them from the cluster. Users need to monitor them and re-create the instances.

Read-only primary

From time to time, MOCO sets the primary mysqld instance read-only, for example during a switchover. Applications that use MOCO MySQL need to be aware of this.

Creating clusters

Creating an empty cluster

An empty cluster always has a writable instance called the primary. All other instances are called replicas. Replicas are read-only and replicate data from the primary.

The following YAML creates a three-instance cluster. It sets a Pod anti-affinity so that all instances are scheduled to different Nodes, and sets memory and CPU limits so that the Pods get the Guaranteed QoS class.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  # replicas is the number of mysqld Pods.  The default is 1.
  replicas: 3
  podTemplate:
    spec:
      # Make the data directory writable. If moco-init fails with "Permission denied", uncomment the following settings.
      # securityContext:
      #   fsGroup: 10000
      #   fsGroupChangePolicy: "OnRootMismatch"  # available since k8s 1.20
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - mysql
              - key: app.kubernetes.io/instance
                operator: In
                values:
                - test
            topologyKey: "kubernetes.io/hostname"
      containers:
      # At least a container named "mysqld" must be defined.
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.35
        # By limiting CPU and memory, Pods will have Guaranteed QoS class.
        # requests can be omitted; it will be set to the same value as limits.
        resources:
          limits:
            cpu: "10"
            memory: "10Gi"
  volumeClaimTemplates:
  # At least a PVC named "mysql-data" must be defined.
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

By default, MOCO uses preferredDuringSchedulingIgnoredDuringExecution to prevent Pods from being placed on the same Node.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: moco-<MYSQLCLUSTER_NAME>
  namespace: default
...
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - mysql
                - key: app.kubernetes.io/created-by
                  operator: In
                  values:
                  - moco
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                  - <MYSQLCLUSTER_NAME>
              topologyKey: kubernetes.io/hostname
            weight: 100
...

There are other example manifests in the examples directory.

The complete reference of MySQLCluster is crd_mysqlcluster_v1beta2.md.

Creating a cluster that replicates data from an external mysqld

Let's call the source mysqld instance donor.

First, make sure partial_revokes is enabled on the donor; replicating data from a donor with partial_revokes disabled will result in replication inconsistencies or errors, since MOCO uses the partial_revokes functionality.

We use the clone plugin to copy all the data quickly. After the cloning, MOCO needs to create some user accounts and install plugins.

On the donor, you need to install the plugin and create two user accounts as follows:

mysql> INSTALL PLUGIN clone SONAME 'mysql_clone.so';
mysql> CREATE USER 'clone-donor'@'%' IDENTIFIED BY 'xxxxxxxxxxx';
mysql> GRANT BACKUP_ADMIN, REPLICATION SLAVE ON *.* TO 'clone-donor'@'%';
mysql> CREATE USER 'clone-init'@'localhost' IDENTIFIED BY 'yyyyyyyyyyy';
mysql> GRANT ALL ON *.* TO 'clone-init'@'localhost' WITH GRANT OPTION;
mysql> GRANT PROXY ON ''@'' TO 'clone-init'@'localhost' WITH GRANT OPTION;

You may change the user names and should change their passwords.

Then create a Secret in the same namespace as MySQLCluster:

$ kubectl -n <namespace> create secret generic donor-secret \
    --from-literal=HOST=<donor-host> \
    --from-literal=PORT=<donor-port> \
    --from-literal=USER=clone-donor \
    --from-literal=PASSWORD=xxxxxxxxxxx \
    --from-literal=INIT_USER=clone-init \
    --from-literal=INIT_PASSWORD=yyyyyyyyyyy

You may change the secret name.

Finally, create MySQLCluster with spec.replicationSourceSecretName set to the Secret name as follows. The mysql image must be the same version as the donor's.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  replicationSourceSecretName: donor-secret
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.35  # must be the same version as the donor
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

To stop the replication from the donor, update MySQLCluster with spec.replicationSourceSecretName: null.
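
For example, you can do this with a merge patch (a sketch; the cluster name and namespace follow the example above):

$ kubectl -n foo patch mysqlcluster test --type merge \
    -p '{"spec":{"replicationSourceSecretName":null}}'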

Bring your own image

We provide pre-built MySQL container images at ghcr.io/cybozu-go/moco/mysql. If you want to build and use your own image, read custom-mysqld.md.

Configurations

The default and constant configuration values for mysqld are available on pkg.go.dev. The settings in ConstMycnf cannot be changed while the settings in DefaultMycnf can be overridden.

You can change the default values or set undefined values by creating a ConfigMap in the same namespace as MySQLCluster, and setting spec.mysqlConfigMapName in MySQLCluster to the name of the ConfigMap as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: foo
  name: mycnf
data:
  long_query_time: "5"
  innodb_buffer_pool_size: "10G"
---
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  # set this to the name of ConfigMap
  mysqlConfigMapName: mycnf
  ...

InnoDB buffer pool size

If innodb_buffer_pool_size is not specified, MOCO automatically sets it to 70% of the value of resources.requests.memory (or resources.limits.memory) for the mysqld container. For example, with resources.requests.memory: 10Gi, innodb_buffer_pool_size is set to about 7G.

If neither resources.requests.memory nor resources.limits.memory is set, innodb_buffer_pool_size will be set to 128M.

Opaque configuration

Some configuration variables cannot be fully configured with ConfigMap values. For example, --performance-schema-instrument needs to be specified multiple times.

You can set them through the special config key _include. The value of _include is included in my.cnf as-is.

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: foo
  name: mycnf
data:
  _include: |
    performance-schema-instrument='memory/%=ON'
    performance-schema-instrument='wait/synch/%/innodb/%=ON'
    performance-schema-instrument='wait/lock/table/sql/handler=OFF'
    performance-schema-instrument='wait/lock/metadata/sql/mdl=OFF'

Take care not to overwrite critical configurations such as log_bin, since MOCO does not check the contents of _include.

Using the cluster

kubectl moco

From outside of your Kubernetes cluster, you can access MOCO MySQL instances using kubectl-moco. kubectl-moco is a plugin for kubectl. Pre-built binaries are available on GitHub releases.

The following example runs the mysql command interactively against the primary instance of the test MySQLCluster in the foo namespace.

$ kubectl moco -n foo mysql -it test

Read the reference manual of kubectl-moco for further details and examples.

MySQL users

MOCO prepares a set of users.

  • moco-readonly can read all tables of all databases.
  • moco-writable can create users, databases, or tables.
  • moco-admin is the super user.

The exact privileges that moco-readonly has are:

  • PROCESS
  • REPLICATION CLIENT
  • REPLICATION SLAVE
  • SELECT
  • SHOW DATABASES
  • SHOW VIEW

The exact privileges that moco-writable has are:

  • ALTER
  • ALTER ROUTINE
  • CREATE
  • CREATE ROLE
  • CREATE ROUTINE
  • CREATE TEMPORARY TABLES
  • CREATE USER
  • CREATE VIEW
  • DELETE
  • DROP
  • DROP ROLE
  • EVENT
  • EXECUTE
  • INDEX
  • INSERT
  • LOCK TABLES
  • PROCESS
  • REFERENCES
  • REPLICATION CLIENT
  • REPLICATION SLAVE
  • SELECT
  • SHOW DATABASES
  • SHOW VIEW
  • TRIGGER
  • UPDATE

moco-writable cannot edit tables in the mysql database, though.

You can create other users and grant them certain privileges as either moco-writable or moco-admin.

$ kubectl moco mysql -u moco-writable test -- -e "CREATE USER 'foo'@'%' IDENTIFIED BY 'bar'"
$ kubectl moco mysql -u moco-writable test -- -e "CREATE DATABASE db1"
$ kubectl moco mysql -u moco-writable test -- -e "GRANT ALL ON db1.* TO 'foo'@'%'"

Connecting to mysqld over network

MOCO prepares two Services for each MySQLCluster. For example, a MySQLCluster named test in foo Namespace has the following Services.

Service Name | DNS Name | Description
moco-test-primary | moco-test-primary.foo.svc | Connect to the primary instance.
moco-test-replica | moco-test-replica.foo.svc | Connect to replica instances.

moco-test-replica can be used only for read access.

The type of these Services is usually ClusterIP. The following is an example to change Service type to LoadBalancer and add an annotation for MetalLB.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  primaryServiceTemplate:
    metadata:
      annotations:
        metallb.universe.tf/address-pool: production-public-ips
    spec:
      type: LoadBalancer
...

Backup and restore

MOCO can take full and incremental backups regularly. The backup data are stored in Amazon S3 compatible object storages.

You can restore data from a backup to a new MySQL cluster.

Object storage bucket

A bucket is the management unit of objects in S3. MOCO stores backups in a specified bucket.

MOCO does not remove backups. To remove old backups automatically, you can set a lifecycle configuration on the bucket.

ref: Setting lifecycle configuration on a bucket

A bucket can be shared safely across multiple MySQLClusters. Object keys are prefixed with moco/.
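
The following is a sketch of such a lifecycle rule using the AWS CLI. It expires objects under the moco/ prefix after 30 days; the retention period is arbitrary, so adjust it to your backup schedule:

$ cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-moco-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "moco/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF
$ aws s3api put-bucket-lifecycle-configuration --bucket <YOUR BUCKET> \
    --lifecycle-configuration file://lifecycle.json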

BackupPolicy

BackupPolicy is a custom resource to define a policy for taking backups.

The following is an example BackupPolicy to take a backup every day and store data in MinIO:

apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
metadata:
  namespace: backup
  name: daily
spec:
  # Backup schedule.  Any CRON format is allowed.
  schedule: "@daily"

  jobConfig:
    # An existing ServiceAccount name is required.
    serviceAccountName: backup-owner
    env:
    - name: AWS_ACCESS_KEY_ID
      value: minioadmin
    - name: AWS_SECRET_ACCESS_KEY
      value: minioadmin

    # bucketName is required.  Other fields are optional.
    bucketConfig:
      bucketName: moco
      endpointURL: http://minio.default.svc:9000
      usePathStyle: true

    # MOCO uses a filesystem volume to store data temporarily.
    workVolume:
      # Using emptyDir as a working directory is NOT recommended.
      # The recommended way is to use generic ephemeral volume with a provisioner
      # that can provide enough capacity.
      # https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes
      emptyDir: {}

To enable backup for a MySQLCluster, reference the BackupPolicy name like this:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: foo
spec:
  backupPolicyName: daily  # The policy name
...

Note: If you want to specify the bucket name in a ConfigMap or Secret, you can use envFrom and reference the environment variable name in jobConfig.bucketConfig.bucketName as follows. This behavior is tested.

apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
metadata:
  namespace: backup
  name: daily
spec:
  jobConfig:
    bucketConfig:
      bucketName: "$(BUCKET_NAME)"
      endpointURL: http://minio.default.svc:9000
      usePathStyle: true
    envFrom:
    - configMapRef:
        name: bucket-name
...
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: backup
  name: bucket-name
data:
  BUCKET_NAME: moco

MOCO creates a CronJob for each MySQLCluster that has spec.backupPolicyName.

The CronJob's name is moco-backup- + the name of the MySQLCluster. For the above example, a CronJob named moco-backup-foo is created in the default namespace.
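
You can check the generated CronJob as follows:

$ kubectl -n default get cronjob moco-backup-foo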

The following podAntiAffinity is set by default for CronJob. If you want to override it, set BackupPolicy.spec.jobConfig.affinity.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: moco-backup-foo
spec:
...
  jobTemplate:
    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                        - key: app.kubernetes.io/name
                          operator: In
                          values:
                            - mysql-backup
                        - key: app.kubernetes.io/created-by
                          operator: In
                          values:
                            - moco
                    topologyKey: kubernetes.io/hostname
                  weight: 100
...

Credentials to access S3 bucket

Depending on your Kubernetes service provider and object storage, there are various ways to give credentials to access the object storage bucket.

For Amazon's Elastic Kubernetes Service (EKS) and S3 users, the easiest way is probably to use IAM Roles for Service Accounts (IRSA).

ref: IAM ROLES FOR SERVICE ACCOUNTS

Another popular way is to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables as shown in the above example.

Taking an emergency backup

You can take an emergency backup by creating a Job from the CronJob for backup.

$ kubectl create job --from=cronjob/moco-backup-foo emergency-backup

Restore

To restore data from a backup, create a new MySQLCluster with the spec.restore field as follows:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: backup
  name: target
spec:
  # restore field is not editable.
  # to modify parameters, delete and re-create MySQLCluster.
  restore:
    # The source MySQLCluster's name and namespace
    sourceName: source
    sourceNamespace: backup

    # The restore point-in-time in RFC3339 format.
    restorePoint: "2021-05-26T12:34:56Z"

    # jobConfig is the same in BackupPolicy
    jobConfig:
      serviceAccountName: backup-owner
      env:
      - name: AWS_ACCESS_KEY_ID
        value: minioadmin
      - name: AWS_SECRET_ACCESS_KEY
        value: minioadmin
      bucketConfig:
        bucketName: moco
        endpointURL: http://minio.default.svc:9000
        usePathStyle: true
      workVolume:
        emptyDir: {}
...

Further details

Read backup.md for further details.

Deleting the cluster

When you delete a MySQLCluster, all generated resources, including PersistentVolumeClaims created from the templates, are automatically removed.

If you want to keep the PersistentVolumeClaims, remove metadata.ownerReferences from them before you delete a MySQLCluster.
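
For example (a sketch; the PVC name follows the moco-test cluster used elsewhere in this document):

$ kubectl patch pvc mysql-data-moco-test-0 --type=json \
    -p '[{"op": "remove", "path": "/metadata/ownerReferences"}]'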

Status, metrics, and logs

Cluster status

You can see the health and availability status of MySQLCluster as follows:

$ kubectl get mysqlcluster
NAME   AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS
test   True        True      0         3

  • The cluster is available when the primary Pod is running and ready.
  • The cluster is healthy when there are no problems.
  • PRIMARY is the index of the current primary instance Pod.
  • SYNCED REPLICAS is the number of ready Pods.
  • ERRANT REPLICAS is the number of instances having errant transactions.

You can also use kubectl describe mysqlcluster to see the recent events on the cluster.

Pod status

MOCO adds a liveness probe and a readiness probe to the mysqld container to check the replication status in addition to the process status.

A replica Pod is ready only when it is replicating data from the primary without a significant delay. The default threshold of the delay is 60 seconds. The threshold can be configured as follows.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  maxDelaySeconds: 180
  ...

Unready replica Pods are automatically excluded from the load-balancing targets so that users will not read overly stale data.

Metrics

MOCO provides built-in support for collecting and exposing mysqld metrics using mysqld_exporter.

The following example YAML enables mysqld_exporter. spec.collectors is a list of mysqld_exporter flag names without the collect. prefix.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  collectors:
  - engine_innodb_status
  - info_schema.innodb_metrics
  podTemplate:
    ...

See metrics.md for all available metrics and how to collect them using Prometheus.

Logs

Error logs from mysqld can be viewed as follows:

$ kubectl logs moco-test-0 mysqld

Slow logs from mysqld can be viewed as follows:

$ kubectl logs moco-test-0 slow-log

Maintenance

Increasing the number of instances in the cluster

Edit spec.replicas field of MySQLCluster:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  replicas: 5
  ...

You can only increase the number of instances in a MySQLCluster from 1 to 3 or 5, or from 3 to 5. Decreasing the number of instances is not allowed.

Switchover

Switchover is an operation to change the live primary to one of the replicas.

MOCO automatically switches the primary when the Pod of the primary instance is about to be deleted.

Users can manually trigger a switchover with kubectl moco switchover CLUSTER_NAME. Read kubectl-moco.md for details.
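
For example, to switch over the primary of the test cluster in the foo namespace:

$ kubectl moco -n foo switchover test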

Failover

Failover is an operation to replace the dead primary with the most advanced replica. MOCO automatically does this as soon as it detects that the primary is down.

The most advanced replica is the replica that has received the most up-to-date transactions from the dead primary. Since MOCO configures lossless semi-synchronous replication, the failover is guaranteed not to lose any user data.

After a failover, the old primary may become an errant replica as described above.

Upgrading mysql version

You can upgrade the MySQL version of a MySQL cluster as follows:

  1. Check that the cluster is healthy.
  2. Check release notes of MySQL for any incompatibilities between the current and the new versions.
  3. Edit the Pod template of the MySQLCluster and update mysqld container image:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  podTemplate:
    spec:
      containers:
      - name: mysqld
        # Edit the next line
        image: ghcr.io/cybozu-go/moco/mysql:8.0.35

You are advised to make backups and/or create a replica cluster before starting the upgrade process. Read upgrading.md for further details.

Re-initializing an errant replica

Delete the PVC and Pod of the errant replica, like this:

$ kubectl delete --wait=false pvc mysql-data-moco-test-0
$ kubectl delete --grace-period=1 pods moco-test-0

Depending on your Kubernetes version, the StatefulSet controller may create a pending Pod before the PVC gets deleted. Delete such pending Pods until the PVC is actually removed.

Stop Clustering and Reconciliation

In MOCO, you can optionally stop the clustering and reconciliation of a MySQLCluster.

To stop clustering and reconciliation, use the following commands.

$ kubectl moco stop clustering <CLUSTER_NAME>
$ kubectl moco stop reconciliation <CLUSTER_NAME>

To resume the stopped clustering and reconciliation, use the following commands.

$ kubectl moco start clustering <CLUSTER_NAME>
$ kubectl moco start reconciliation <CLUSTER_NAME>

You could use this feature in the following cases:

  1. To stop the replication of a MySQLCluster and perform a manual operation to align the GTID
    • Run the kubectl moco stop clustering command on the MySQLCluster where you want to stop the replication
  2. To suppress the full update of MySQLCluster that occurs during the upgrade of MOCO
    • Run the kubectl moco stop reconciliation command on the MySQLCluster on which you want to suppress the update

To check whether clustering and reconciliation are stopped, use kubectl get mysqlcluster. Moreover, while clustering is stopped, AVAILABLE and HEALTHY values will be Unknown.

$ kubectl get mysqlcluster
NAME   AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS   CLUSTERING ACTIVE   RECONCILE ACTIVE   LAST BACKUP
test   Unknown     Unknown   0         3                                   False               False              <no value>

The MOCO controller exposes the following metrics to indicate that clustering or reconciliation has been stopped. The value is 1 if clustering or reconciliation is stopped for the cluster, and 0 otherwise.

moco_cluster_clustering_stopped{name="mycluster", namespace="mynamespace"} 1
moco_cluster_reconciliation_stopped{name="mycluster", namespace="mynamespace"} 1

While clustering is stopped, MOCO halts monitoring of the cluster, and the values of the following metrics become NaN.

moco_cluster_available{name="test",namespace="default"} NaN
moco_cluster_healthy{name="test",namespace="default"} NaN
moco_cluster_ready_replicas{name="test",namespace="default"} NaN
moco_cluster_errant_replicas{name="test",namespace="default"} NaN

Advanced topics

Building custom image of mysqld

There are pre-built mysqld container images for MOCO at ghcr.io/cybozu-go/moco/mysql. Users can use one of these images to supply the mysqld container in MySQLCluster like this:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
spec:
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.35

If you want to build and use your own mysqld, read the rest of this document.

Dockerfile

The easiest way to build a custom mysqld for MOCO is to copy and edit our Dockerfile. You can find it under containers/mysql directory in github.com/cybozu-go/moco.

Keep the following points in mind:

  • ENTRYPOINT should be ["mysqld"]
  • USER should be 10000:10000
  • The sleep command must exist in one of the PATH directories.

How to build mysqld

On Ubuntu 20.04, you can build the source code as follows:

$ sudo apt-get update
$ sudo apt-get -y --no-install-recommends install build-essential libssl-dev \
    cmake libncurses5-dev libjemalloc-dev libnuma-dev libaio-dev pkg-config
$ curl -fsSL -O https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-boost-8.0.20.tar.gz
$ tar -x -z -f mysql-boost-8.0.20.tar.gz
$ cd mysql-8.0.20
$ mkdir bld
$ cd bld
$ cmake .. -DBUILD_CONFIG=mysql_release -DCMAKE_BUILD_TYPE=Release \
    -DWITH_BOOST=$(ls -d ../boost/boost_*) -DWITH_NUMA=1 -DWITH_JEMALLOC=1
$ make -j $(nproc)
$ make install

Customize default container

In addition to the containers added by the user, MOCO automatically adds its own containers to the Pod (e.g. agent, moco-init, and so on).

The MySQLCluster.spec.podTemplate.overwriteContainers field can be used to overwrite such containers. Currently, only container resources can be overwritten. overwriteContainers is only available in MySQLCluster v1beta2.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.30
    overwriteContainers:
    - name: agent
      resources:
        requests:
          cpu: 50m

System containers

The following is a list of the system containers used by MOCO. Specifying a container name in overwriteContainers that is not listed here will result in an API validation error.

Name | Default CPU Requests/Limits | Default Memory Requests/Limits | Description
agent | 100m / 100m | 100Mi / 100Mi | MOCO's agent container running as a sidecar. refs: https://github.com/cybozu-go/moco-agent
moco-init | 100m / 100m | 300Mi / 300Mi | Initializes the MySQL data directory and creates a configuration snippet to give instance-specific configuration values such as server_id and admin_address.
slow-log | 100m / 100m | 20Mi / 20Mi | Sidecar container for outputting slow query logs.
mysqld-exporter | 200m / 200m | 100Mi / 100Mi | MySQL server exporter sidecar container.

Change the volumeClaimTemplates

MOCO supports changes to MySQLCluster .spec.volumeClaimTemplates.

When .spec.volumeClaimTemplates is changed, moco-controller will try to recreate the StatefulSet. This is because modification of volumeClaimTemplates in StatefulSet is currently not allowed.

The StatefulSet is re-created with the same behavior as kubectl delete sts moco-xxx --cascade=orphan, that is, without removing the running Pods.

NOTE: It may be possible to edit the StatefulSet directly in the future.

ref: https://github.com/kubernetes/enhancements/issues/661

When re-creating a StatefulSet, moco-controller performs no operation other than the volume expansion described below; it simply re-creates the StatefulSet. However, by specifying the --pvc-sync-annotation-keys and --pvc-sync-label-keys flags on the controller, you can designate annotations and labels to be synchronized from .spec.volumeClaimTemplates to the PVCs during the re-creation of the StatefulSet.
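
For example, a sketch that passes these flags through the Helm chart's extraArgs value; the annotation and label keys here are hypothetical:

$ cat > values.yaml <<'EOF'
extraArgs:
  - --pvc-sync-annotation-keys=example.com/backup-tier
  - --pvc-sync-label-keys=example.com/team
EOF
$ helm upgrade --namespace moco-system moco -f values.yaml moco/moco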

All other labels and annotations must be updated on the PVCs by the user. This restriction prevents unintended side effects when entities other than the moco-controller manipulate the PVC's metadata.

Metrics

The success or failure of re-creating a StatefulSet is reported to the user through the following metrics:

moco_cluster_statefulset_recreate_total{name="mycluster", namespace="mynamespace"} 3
moco_cluster_statefulset_recreate_errors_total{name="mycluster", namespace="mynamespace"} 1

If a StatefulSet fails to be re-created, moco_cluster_statefulset_recreate_errors_total is incremented on each reconciliation, so users can detect anomalies by monitoring this metric.

See the metrics documentation for more details.

Volume expansion

moco-controller automatically resizes the PVC when the size of the MySQLCluster volume claim is extended. If the volume plugin supports online file system expansion, the PVs used by the Pod will be expanded online.

To expand a volume, allowVolumeExpansion of the StorageClass must be true. moco-controller validates the request with an admission webhook and rejects it if volume expansion is not allowed.

If the volume plugin does not support online file system expansion, the Pod must be restarted for the volume expansion to take effect. This must be done manually by the user.
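
For example (a sketch), delete the Pods one at a time and wait for the cluster to become healthy before deleting the next one:

$ kubectl delete pod moco-test-0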

When moco-controller resizes a PVC, there may be a discrepancy between the size defined in the MySQLCluster and the actual PVC size, for example if you are using github.com/topolvm/pvc-autoresizer. In this case, moco-controller only updates the PVC if its actual size is smaller than the newly requested size.

Metrics

The success or failure of the PVC resizing is reported to the user through the following metrics:

moco_cluster_volume_resized_total{name="mycluster", namespace="mynamespace"} 4
moco_cluster_volume_resized_errors_total{name="mycluster", namespace="mynamespace"} 1

These metrics are incremented when a volume size change succeeds or fails. If the volume size change fails, moco_cluster_volume_resized_errors_total is incremented on each reconciliation, so users can detect anomalies by monitoring this metric.

See the metrics documentation for more details.

Volume reduction

MOCO supports PVC reduction, but unlike PVC expansion, the user must perform the operation manually.

The steps are as follows:

  1. The user modifies the .spec.volumeClaimTemplates of the MySQLCluster and sets a smaller volume size.
  2. MOCO updates the .spec.volumeClaimTemplates of the StatefulSet. This does not propagate to existing Pods, PVCs, or PVs.
  3. The user manually deletes the MySQL Pod & PVC.
  4. Wait for the Pod & PVC to be recreated by the statefulset-controller, and for MOCO to clone the data.
  5. Once the cluster becomes Healthy, the user deletes the next Pod and PVC.
  6. It is completed when all Pods and PVCs are recreated.

1. The user modifies the .spec.volumeClaimTemplates of the MySQLCluster and sets a smaller volume size

For example, the user modifies the .spec.volumeClaimTemplates of the MySQLCluster as follows:

  apiVersion: moco.cybozu.com/v1beta2
  kind: MySQLCluster
  metadata:
    namespace: default
    name: test
  spec:
    replicas: 3
    podTemplate:
      spec:
        containers:
        - name: mysqld
          image: ghcr.io/cybozu-go/moco/mysql:8.0.30
    volumeClaimTemplates:
    - metadata:
        name: mysql-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
-           storage: 1Gi
+           storage: 500Mi

2. MOCO updates the .spec.volumeClaimTemplates of the StatefulSet. This does not propagate to existing Pods, PVCs, or PVs

The moco-controller updates the .spec.volumeClaimTemplates of the StatefulSet. Direct modification of a StatefulSet's .spec.volumeClaimTemplates is not allowed, so this change is achieved by re-creating the StatefulSet. At this time, only the StatefulSet is re-created; the Pods and PVCs are not deleted.

3. The user manually deletes the MySQL Pod & PVC

The user manually deletes the PVC and Pod. Use the following command to delete them:

$ kubectl delete --wait=false pvc <pvc-name>
$ kubectl delete --grace-period=1 pod <pod-name>

4. Wait for the Pod & PVC to be recreated by the statefulset-controller, and for MOCO to clone the data

The statefulset-controller recreates the Pod and PVC, creating a new PVC with the reduced size. Once MOCO successfully starts the Pod, it begins cloning the data.

$ kubectl get mysqlcluster,po,pvc
NAME                                AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS   LAST BACKUP
mysqlcluster.moco.cybozu.com/test   True        False     0         2                                   <no value>

NAME              READY   STATUS     RESTARTS   AGE
pod/moco-test-0   3/3     Running    0          2m14s
pod/moco-test-1   3/3     Running    0          114s
pod/moco-test-2   0/3     Init:1/2   0          7s

NAME                                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mysql-data-moco-test-0   Bound    pvc-03c73525-0d6d-49de-b68a-f8af4c4c7faa   1Gi        RWO            standard       2m14s
persistentvolumeclaim/mysql-data-moco-test-1   Bound    pvc-73c26baa-3432-4c85-b5b6-875ffd2456d9   1Gi        RWO            standard       114s
persistentvolumeclaim/mysql-data-moco-test-2   Bound    pvc-779b5b3c-3efc-4048-a549-a4bd2d74ed4e   500Mi      RWO            standard       7s

5. Once the cluster becomes Healthy, the user deletes the next Pod and PVC

The user waits until the MySQLCluster state becomes Healthy, and then deletes the next Pod and PVC.

$ kubectl get mysqlcluster
NAME                                AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS   LAST BACKUP
mysqlcluster.moco.cybozu.com/test   True        True      1         3                                   <no value>

6. It is completed when all Pods and PVCs are recreated

Repeat steps 3 to 5 until all Pods and PVCs are recreated.

References

Known issues

This document lists the known issues of MOCO.

Multi-threaded replication

Status: not fixed as of MOCO v0.9.5

If you use MOCO with MySQL version 8.0.25 or earlier, you should not configure the replicas with slave_parallel_workers > 1. Multi-threaded replication will cause the replica to fail to resume after a crash.

This issue is registered as #322 and will be addressed in the near future.

Custom resources

Custom Resources

Sub Resources

BackupStatus

BackupStatus represents the status of the last successful backup.

Field | Description | Scheme | Required
time | The time of the backup. This is used to generate object keys of backup files in a bucket. | metav1.Time | true
elapsed | Elapsed is the time spent on the backup. | metav1.Duration | true
sourceIndex | SourceIndex is the ordinal of the backup source instance. | int | true
sourceUUID | SourceUUID is the server_uuid of the backup source instance. | string | true
uuidSet | UUIDSet is the server_uuid set of all candidate instances for the backup source. | map[string]string | true
binlogFilename | BinlogFilename is the binlog filename that the backup source instance was writing to at the backup. | string | true
gtidSet | GTIDSet is the GTID set of the full dump of database. | string | true
dumpSize | DumpSize is the size in bytes of a full dump of database stored in an object storage bucket. | int64 | true
binlogSize | BinlogSize is the size in bytes of a tarball of binlog files stored in an object storage bucket. | int64 | true
workDirUsage | WorkDirUsage is the max usage in bytes of the working directory. | int64 | true
warnings | Warnings are list of warnings from the last backup, if any. | []string | true

Back to Custom Resources

MySQLCluster

MySQLCluster is the Schema for the mysqlclusters API

Field | Description | Scheme | Required
metadata |  | metav1.ObjectMeta | false
spec |  | MySQLClusterSpec | false
status |  | MySQLClusterStatus | false

Back to Custom Resources

MySQLClusterList

MySQLClusterList contains a list of MySQLCluster

Field | Description | Scheme | Required
metadata |  | metav1.ListMeta | false
items |  | []MySQLCluster | true

Back to Custom Resources

MySQLClusterSpec

MySQLClusterSpec defines the desired state of MySQLCluster

Field | Description | Scheme | Required
replicas | Replicas is the number of instances. Available values are positive odd numbers. | int32 | false
podTemplate | PodTemplate is a Pod template for MySQL server container. | PodTemplateSpec | true
volumeClaimTemplates | VolumeClaimTemplates is a list of PersistentVolumeClaim templates for MySQL server container. A claim named "mysql-data" must be included in the list. | []PersistentVolumeClaim | true
primaryServiceTemplate | PrimaryServiceTemplate is a Service template for primary. | *ServiceTemplate | false
replicaServiceTemplate | ReplicaServiceTemplate is a Service template for replica. | *ServiceTemplate | false
mysqlConfigMapName | MySQLConfigMapName is a ConfigMap name of MySQL config. | *string | false
replicationSourceSecretName | ReplicationSourceSecretName is a Secret name which contains replication source info. If this field is given, the MySQLCluster works as an intermediate primary. | *string | false
collectors | Collectors is the list of collector flag names of mysqld_exporter. If this field is not empty, MOCO adds mysqld_exporter as a sidecar to collect and export mysqld metrics in Prometheus format. See https://github.com/prometheus/mysqld_exporter/blob/master/README.md#collector-flags for flag names. Example: ["engine_innodb_status", "info_schema.innodb_metrics"] | []string | false
serverIDBase | ServerIDBase, if set, will become the base number of server-id of each MySQL instance of this cluster. For example, if this is 100, the server-ids will be 100, 101, 102, and so on. If the field is not given or zero, MOCO automatically sets a random positive integer. | int32 | false
maxDelaySeconds | MaxDelaySeconds configures the readiness probe of mysqld container. For a replica mysqld instance, if it is delayed to apply transactions over this threshold, the mysqld instance will be marked as non-ready. The default is 60 seconds. Setting this field to 0 disables the delay check in the probe. | *int | false
startupWaitSeconds | StartupWaitSeconds is the maximum duration to wait for mysqld container to start working. The default is 3600 seconds. | int32 | false
logRotationSchedule | LogRotationSchedule specifies the schedule to rotate MySQL logs. If not set, the default is to rotate logs every 5 minutes. See https://pkg.go.dev/github.com/robfig/cron/v3#hdr-CRON_Expression_Format for the field format. | string | false
backupPolicyName | The name of BackupPolicy custom resource in the same namespace. If this is set, MOCO creates a CronJob to take backup of this MySQL cluster periodically. | *string | false
restore | Restore is the specification to perform Point-in-Time-Recovery from existing cluster. If this field is not null, MOCO restores the data as specified and create a new cluster with the data. This field is not editable. | *RestoreSpec | false
disableSlowQueryLogContainer | DisableSlowQueryLogContainer controls whether to add a sidecar container named "slow-log" to output slow logs as the containers output. If set to true, the sidecar container is not added. The default is false. | bool | false

Back to Custom Resources

MySQLClusterStatus

MySQLClusterStatus defines the observed state of MySQLCluster

Field | Description | Scheme | Required
conditions | Conditions is an array of conditions. | []metav1.Condition | false
currentPrimaryIndex | CurrentPrimaryIndex is the index of the current primary Pod in StatefulSet. Initially, this is zero. | int | true
syncedReplicas | SyncedReplicas is the number of synced instances including the primary. | int | false
errantReplicas | ErrantReplicas is the number of instances that have errant transactions. | int | false
errantReplicaList | ErrantReplicaList is the list of indices of errant replicas. | []int | false
backup | Backup is the status of the last successful backup. | BackupStatus | true
restoredTime | RestoredTime is the time when the cluster data is restored. | *metav1.Time | false
cloned | Cloned indicates if the initial cloning from an external source has been completed. | bool | false
reconcileInfo | ReconcileInfo represents version information for reconciler. | ReconcileInfo | true

Back to Custom Resources

ObjectMeta

ObjectMeta is metadata of objects. This is partially copied from metav1.ObjectMeta.

Field | Description | Scheme | Required
name | Name is the name of the object. | string | false
labels | Labels is a map of string keys and values. | map[string]string | false
annotations | Annotations is a map of string keys and values. | map[string]string | false

Back to Custom Resources

OverwriteContainer

OverwriteContainer defines the container spec used for overwriting.

Field | Description | Scheme | Required
name | Name of the container to overwrite. | OverwriteableContainerName | true
resources | Resources is the container resource to be overwritten. | *ResourceRequirementsApplyConfiguration | false

Back to Custom Resources

PersistentVolumeClaim

PersistentVolumeClaim is a user's request for and claim to a persistent volume. This is slightly modified from corev1.PersistentVolumeClaim.

Field | Description | Scheme | Required
metadata | Standard object's metadata. | ObjectMeta | true
spec | Spec defines the desired characteristics of a volume requested by a pod author. | PersistentVolumeClaimSpecApplyConfiguration | true

Back to Custom Resources

PodTemplateSpec

PodTemplateSpec describes the data a pod should have when created from a template. This is slightly modified from corev1.PodTemplateSpec.

Field | Description | Scheme | Required
metadata | Standard object's metadata. The name in this metadata is ignored. | ObjectMeta | false
spec | Specification of the desired behavior of the pod. The name of the MySQL server container in this spec must be mysqld. | PodSpecApplyConfiguration | true
overwriteContainers | OverwriteContainers overwrites the container definitions provided by default by the system. | []OverwriteContainer | false

Back to Custom Resources

ReconcileInfo

ReconcileInfo is the type to record the last reconciliation information.

Field | Description | Scheme | Required
generation | Generation is the metadata.generation value of the last reconciliation. See also https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#status-subresource | int64 | false
reconcileVersion | ReconcileVersion is the version of the operator reconciler. | int | true

Back to Custom Resources

RestoreSpec

RestoreSpec represents a set of parameters for Point-in-Time Recovery.

Field | Description | Scheme | Required
sourceName | SourceName is the name of the source MySQLCluster. | string | true
sourceNamespace | SourceNamespace is the namespace of the source MySQLCluster. | string | true
restorePoint | RestorePoint is the target date and time to restore data. The format is RFC3339. e.g. "2006-01-02T15:04:05Z" | metav1.Time | true
jobConfig | Specifies parameters for restore Pod. | JobConfig | true

Back to Custom Resources

ServiceTemplate

ServiceTemplate defines the desired spec and annotations of Service

Field | Description | Scheme | Required
metadata | Standard object's metadata. Only annotations and labels are valid. | ObjectMeta | false
spec | Spec is the ServiceSpec | *ServiceSpecApplyConfiguration | false

Back to Custom Resources

BucketConfig

BucketConfig is a set of parameters to access an object storage bucket.

Field | Description | Scheme | Required
bucketName | The name of the bucket | string | true
region | The region of the bucket. This can also be set through AWS_REGION environment variable. | string | false
endpointURL | The API endpoint URL. Set this for non-S3 object storages. | string | false
usePathStyle | Allows you to enable the client to use path-style addressing, i.e., https?://ENDPOINT/BUCKET/KEY. By default, a virtual-host addressing is used (https?://BUCKET.ENDPOINT/KEY). | bool | false
backendType | BackendType is an identifier for the object storage to be used. | string | false
caCert | Path to SSL CA certificate file used in addition to system default. | string | false

Back to Custom Resources

JobConfig

JobConfig is a set of parameters for backup and restore job Pods.

Field | Description | Scheme | Required
serviceAccountName | ServiceAccountName specifies the ServiceAccount to run the Pod. | string | true
bucketConfig | Specifies how to access an object storage bucket. | BucketConfig | true
workVolume | WorkVolume is the volume source for the working directory. Since the backup or restore task can use a lot of bytes in the working directory, you should always give a volume with enough capacity. The recommended volume source is a generic ephemeral volume. https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes | VolumeSourceApplyConfiguration | true
threads | Threads is the number of threads used for backup or restoration. | int | false
cpu | CPU is the amount of CPU requested for the Pod. | *resource.Quantity | false
maxCpu | MaxCPU is the amount of maximum CPU for the Pod. | *resource.Quantity | false
memory | Memory is the amount of memory requested for the Pod. | *resource.Quantity | false
maxMemory | MaxMemory is the amount of maximum memory for the Pod. | *resource.Quantity | false
envFrom | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvFromSourceApplyConfiguration | false
env | List of environment variables to set in the container. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvVarApplyConfiguration | false
affinity | If specified, the pod's scheduling constraints. | *AffinityApplyConfiguration | false
volumes | Volumes defines the list of volumes that can be mounted by containers in the Pod. | []VolumeApplyConfiguration | false
volumeMounts | VolumeMounts describes a list of volume mounts that are to be mounted in a container. | []VolumeMountApplyConfiguration | false

Back to Custom Resources

Custom Resources

Sub Resources

BackupPolicy

BackupPolicy is a namespaced resource that should be referenced from MySQLCluster.

Field | Description | Scheme | Required
metadata |  | metav1.ObjectMeta | false
spec |  | BackupPolicySpec | true

Back to Custom Resources

BackupPolicyList

BackupPolicyList contains a list of BackupPolicy

Field | Description | Scheme | Required
metadata |  | metav1.ListMeta | false
items |  | []BackupPolicy | true

Back to Custom Resources

BackupPolicySpec

BackupPolicySpec defines the configuration items for MySQLCluster backup.

The following fields will be copied to CronJob.spec:

  • Schedule
  • StartingDeadlineSeconds
  • ConcurrencyPolicy
  • SuccessfulJobsHistoryLimit
  • FailedJobsHistoryLimit

The following fields will be copied to CronJob.spec.jobTemplate:

  • ActiveDeadlineSeconds
  • BackoffLimit

Field | Description | Scheme | Required
schedule | The schedule in Cron format for periodic backups. See https://en.wikipedia.org/wiki/Cron | string | true
jobConfig | Specifies parameters for backup Pod. | JobConfig | true
startingDeadlineSeconds | Optional deadline in seconds for starting the job if it misses scheduled time for any reason. Missed jobs executions will be counted as failed ones. | *int64 | false
concurrencyPolicy | Specifies how to treat concurrent executions of a Job. Valid values are: "Allow" (default): allows CronJobs to run concurrently; "Forbid": forbids concurrent runs, skipping next run if previous run hasn't finished yet; "Replace": cancels currently running job and replaces it with a new one | batchv1.ConcurrencyPolicy | false
activeDeadlineSeconds | Specifies the duration in seconds relative to the startTime that the job may be continuously active before the system tries to terminate it; value must be positive integer. If a Job is suspended (at creation or through an update), this timer will effectively be stopped and reset when the Job is resumed again. | *int64 | false
backoffLimit | Specifies the number of retries before marking this job failed. Defaults to 6 | *int32 | false
successfulJobsHistoryLimit | The number of successful finished jobs to retain. This is a pointer to distinguish between explicit zero and not specified. Defaults to 3. | *int32 | false
failedJobsHistoryLimit | The number of failed finished jobs to retain. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1. | *int32 | false

Back to Custom Resources

BucketConfig

BucketConfig is a set of parameters to access an object storage bucket.

Field | Description | Scheme | Required
bucketName | The name of the bucket | string | true
region | The region of the bucket. This can also be set through AWS_REGION environment variable. | string | false
endpointURL | The API endpoint URL. Set this for non-S3 object storages. | string | false
usePathStyle | Allows you to enable the client to use path-style addressing, i.e., https?://ENDPOINT/BUCKET/KEY. By default, a virtual-host addressing is used (https?://BUCKET.ENDPOINT/KEY). | bool | false
backendType | BackendType is an identifier for the object storage to be used. | string | false
caCert | Path to SSL CA certificate file used in addition to system default. | string | false

Back to Custom Resources

JobConfig

JobConfig is a set of parameters for backup and restore job Pods.

| Field | Description | Scheme | Required |
| ----- | ----------- | ------ | -------- |
| serviceAccountName | ServiceAccountName specifies the ServiceAccount to run the Pod. | string | true |
| bucketConfig | Specifies how to access an object storage bucket. | BucketConfig | true |
| workVolume | WorkVolume is the volume source for the working directory. Since the backup or restore task can use a lot of bytes in the working directory, you should always give a volume with enough capacity. The recommended volume source is a generic ephemeral volume. See https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes | VolumeSourceApplyConfiguration | true |
| threads | Threads is the number of threads used for backup or restoration. | int | false |
| cpu | CPU is the amount of CPU requested for the Pod. | *resource.Quantity | false |
| maxCpu | MaxCPU is the amount of maximum CPU for the Pod. | *resource.Quantity | false |
| memory | Memory is the amount of memory requested for the Pod. | *resource.Quantity | false |
| maxMemory | MaxMemory is the amount of maximum memory for the Pod. | *resource.Quantity | false |
| envFrom | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvFromSourceApplyConfiguration | false |
| env | List of environment variables to set in the container. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvVarApplyConfiguration | false |
| affinity | If specified, the pod's scheduling constraints. | *AffinityApplyConfiguration | false |
| volumes | Volumes defines the list of volumes that can be mounted by containers in the Pod. | []VolumeApplyConfiguration | false |
| volumeMounts | VolumeMounts describes a list of volume mounts that are to be mounted in a container. | []VolumeMountApplyConfiguration | false |
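
The following is a sketch of a jobConfig that uses the recommended generic ephemeral volume as workVolume; the storage class, capacity, and resource values are placeholders:

jobConfig:
  serviceAccountName: backup-owner
  threads: 4
  memory: 1Gi
  bucketConfig:
    bucketName: moco-backups
  workVolume:
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: standard
          resources:
            requests:
              storage: 100Gi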

Back to Custom Resources

Commands

kubectl moco plugin

kubectl-moco is a kubectl plugin for MOCO.

kubectl moco [global options] <subcommand> [sub options] args...

Global options

Global options are compatible with kubectl. For example, the following options are available.

| Global options | Default value | Description |
| -------------- | ------------- | ----------- |
| --kubeconfig | $HOME/.kube/config | Path to the kubeconfig file to use for CLI requests. |
| -n, --namespace | default | If present, the namespace scope for this CLI request. |

MySQL users

You can choose one of the following users as the --mysql-user option value.

| Name | Description |
| ---- | ----------- |
| moco-readonly | A read-only user. |
| moco-writable | A user that can edit users, databases, and tables. |
| moco-admin | The super-user. |

kubectl moco mysql [options] CLUSTER_NAME [-- mysql args...]

Run mysql command in a specified MySQL instance.

| Options | Default value | Description |
| ------- | ------------- | ----------- |
| -u, --mysql-user | moco-readonly | Login as the specified user |
| --index | index of the primary | Index of the target mysql instance |
| -i, --stdin | false | Pass stdin to the mysql container |
| -t, --tty | false | Stdin is a TTY |

Examples

This executes SELECT VERSION() on the primary instance in mycluster in foo namespace:

$ kubectl moco -n foo mysql mycluster -- -N -e 'SELECT VERSION()'

To execute SQL from a file:

$ cat sample.sql | kubectl moco -n foo mysql -u moco-writable -i mycluster

To run mysql interactively for the instance 2 in mycluster in the default namespace:

$ kubectl moco mysql --index 2 -it mycluster

kubectl moco credential [options] CLUSTER_NAME

Fetch the credential information of a specified user

| Options | Default value | Description |
| ------- | ------------- | ----------- |
| -u, --mysql-user | moco-readonly | Fetch the credential of the specified user |
| --format | plain | Output format: plain or mycnf |
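
For example, to print the moco-admin credential for mycluster in the foo namespace in my.cnf format:

$ kubectl moco -n foo credential -u moco-admin --format mycnf mycluster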

kubectl moco switchover CLUSTER_NAME

Switch the primary instance to one of the replicas.
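
For example, to switch the primary instance of mycluster in the foo namespace to one of its replicas:

$ kubectl moco -n foo switchover mycluster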

moco-controller

moco-controller controls MySQL clusters on Kubernetes.

Environment variables

| Name | Required | Description |
| ---- | -------- | ----------- |
| POD_NAMESPACE | Yes | The namespace name where moco-controller runs. |

Command line flags

Flags:
      --add_dir_header                    If true, adds the file directory to the header of the log messages
      --agent-image string                The image of moco-agent sidecar container
      --alsologtostderr                   log to standard error as well as files (no effect when -logtostderr=true)
      --apiserver-qps-throttle int        The maximum QPS to the API server. (default 20)
      --backup-image string               The image of moco-backup container
      --cert-dir string                   webhook certificate directory
      --check-interval duration           Interval of cluster maintenance (default 1m0s)
      --fluent-bit-image string           The image of fluent-bit sidecar container
      --grpc-cert-dir string              gRPC certificate directory (default "/grpc-cert")
      --health-probe-addr string          Listen address for health probes (default ":8081")
  -h, --help                              help for moco-controller
      --leader-election-id string         ID for leader election by controller-runtime (default "moco")
      --log_backtrace_at traceLocation    when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                    If non-empty, write log files in this directory (no effect when -logtostderr=true)
      --log_file string                   If non-empty, use this log file (no effect when -logtostderr=true)
      --log_file_max_size uint            Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                       log to standard error instead of files (default true)
      --max-concurrent-reconciles int     The maximum number of concurrent reconciles which can be run (default 8)
      --metrics-addr string               Listen address for metric endpoint (default ":8080")
      --mysqld-exporter-image string      The image of mysqld_exporter sidecar container
      --one_output                        If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
      --pprof-addr string                 Listen address for pprof endpoints. pprof is disabled by default
      --skip_headers                      If true, avoid header prefixes in the log messages
      --skip_log_headers                  If true, avoid headers when opening log files (no effect when -logtostderr=true)
      --stderrthreshold severity          logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
  -v, --v Level                           number for the log level verbosity
      --version                           version for moco-controller
      --vmodule moduleSpec                comma-separated list of pattern=N settings for file-filtered logging
      --webhook-addr string               Listen address for the webhook endpoint (default ":9443")
      --zap-devel                         Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
      --zap-encoder encoder               Zap log encoding (one of 'json' or 'console')
      --zap-log-level level               Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
      --zap-stacktrace-level level        Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
      --zap-time-encoding time-encoding   Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.

moco-backup

moco-backup command is used in ghcr.io/cybozu-go/moco-backup container. Normally, users need not take care of this command.

Environment variables

moco-backup takes the configuration for the S3 API from environment variables. For details, read the documentation of EnvConfig in github.com/aws/aws-sdk-go-v2/config.

It also requires MYSQL_PASSWORD environment variable to be set.

Global command-line flags

Global Flags:
      --endpoint string   S3 API endpoint URL
      --region string     AWS region
      --threads int       The number of threads to be used (default 4)
      --use-path-style    Use path-style S3 API
      --work-dir string   The writable working directory (default "/work")
      --ca-cert string    Path to SSL CA certificate file used in addition to system default

Subcommands

backup subcommand

Usage: moco-backup backup BUCKET NAMESPACE NAME

  • BUCKET: The bucket name.
  • NAMESPACE: The namespace of the MySQLCluster.
  • NAME: The name of the MySQLCluster.

restore subcommand

Usage: moco-backup restore BUCKET SOURCE_NAMESPACE SOURCE_NAME NAMESPACE NAME YYYYMMDD-hhmmss

  • BUCKET: The bucket name.
  • SOURCE_NAMESPACE: The source MySQLCluster's namespace.
  • SOURCE_NAME: The source MySQLCluster's name.
  • NAMESPACE: The target MySQLCluster's namespace.
  • NAME: The target MySQLCluster's name.
  • YYYYMMDD-hhmmss: The point-in-time to restore data. e.g. 20210523-150423

Metrics

moco-controller

moco-controller provides the following kinds of metrics in Prometheus format. Aside from the standard Go runtime and process metrics, it exposes metrics related to controller-runtime, MySQL clusters, and backups.

MySQL clusters

All these metrics are prefixed with moco_cluster_ and have name and namespace labels.

| Name | Description | Type |
| ---- | ----------- | ---- |
| checks_total | The number of times MOCO checked the cluster | Counter |
| errors_total | The number of times MOCO encountered errors when managing the cluster | Counter |
| available | 1 if the cluster is available, 0 otherwise | Gauge |
| healthy | 1 if the cluster is running without any problems, 0 otherwise | Gauge |
| switchover_total | The number of times MOCO changed the live primary instance | Counter |
| failover_total | The number of times MOCO changed the failed primary instance | Counter |
| replicas | The number of mysqld instances in the cluster | Gauge |
| ready_replicas | The number of ready mysqld Pods in the cluster | Gauge |
| clustering_stopped | 1 if clustering of the cluster is stopped, 0 otherwise | Gauge |
| reconciliation_stopped | 1 if reconciliation of the cluster is stopped, 0 otherwise | Gauge |
| errant_replicas | The number of mysqld instances that have errant transactions | Gauge |
| processing_time_seconds | The length of time in seconds spent processing the cluster | Histogram |
| volume_resized_total | The number of successful volume resizes | Counter |
| volume_resized_errors_total | The number of failed volume resizes | Counter |
| statefulset_recreate_total | The number of successful StatefulSet recreates | Counter |
| statefulset_recreate_errors_total | The number of failed StatefulSet recreates | Counter |

Backup

All these metrics are prefixed with moco_backup_ and have name and namespace labels.

| Name | Description | Type |
| ---- | ----------- | ---- |
| timestamp | The number of seconds since January 1, 1970 UTC of the last successful backup | Gauge |
| elapsed_seconds | The number of seconds taken for the last backup | Gauge |
| dump_bytes | The size of compressed full backup data | Gauge |
| binlog_bytes | The size of compressed binlog files | Gauge |
| workdir_usage_bytes | The maximum usage of the working directory | Gauge |
| warnings | The number of warnings in the last successful backup | Gauge |
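
As an illustration, these metrics can drive simple Prometheus alerting rules. The following sketch is not part of MOCO; the expressions and thresholds are arbitrary examples:

groups:
- name: moco
  rules:
  - alert: MySQLClusterUnavailable
    expr: moco_cluster_available == 0
    for: 5m
    labels:
      severity: critical
  - alert: MySQLBackupTooOld
    expr: time() - moco_backup_timestamp > 2 * 86400
    labels:
      severity: warning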

MySQL instance

For each mysqld instance, moco-agent exposes a set of metrics. Read github.com/cybozu-go/moco-agent/blob/main/docs/metrics.md for details.

Also, if you give a set of collector flag names to spec.collectors of MySQLCluster, a sidecar container running mysqld_exporter exposes the collected metrics for each mysqld instance.
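
A sketch of such a configuration follows; the collector names are examples of mysqld_exporter collector flag names and should be adjusted to your needs:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
...
spec:
  collectors:
  - engine_innodb_status
  - info_schema.innodb_metrics
  ...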

Scrape rules

This is an example kubernetes_sd_config for Prometheus to collect all MOCO & MySQL metrics.

scrape_configs:
- job_name: 'moco-controller'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_label_app_kubernetes_io_component,__meta_kubernetes_pod_container_port_name]
    action: keep
    regex: moco-system;moco-controller;metrics

- job_name: 'moco-agent'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_container_port_name,__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: keep
    regex: mysql;agent-metrics;moco-.*
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace

- job_name: 'moco-mysql'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_container_port_name,__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: keep
    regex: mysql;mysqld-metrics;moco-.*
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
    action: replace
    target_label: name
  - source_labels: [__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: replace
    target_label: index
    regex: .*-([0-9])
  - source_labels: [__meta_kubernetes_pod_label_moco_cybozu_com_role]
    action: replace
    target_label: role

The collected metrics should have these labels:

  • namespace: MySQLCluster's metadata.namespace
  • name: MySQLCluster's metadata.name
  • index: The ordinal of MySQL instance Pod

Design notes


The purpose of this document is to describe the backgrounds and the goals of MOCO. Implementation details are described in other documents.

Motivation

We are creating our own Kubernetes operator for clustering MySQL instances for the following reasons:

Firstly, our application requires strict compatibility with traditional MySQL. Although recent MySQL provides an advanced clustering solution called group replication that is based on Paxos, we cannot use it because of the various limitations that come with group replication.

Secondly, we want a Kubernetes-native operator that is as simple as possible. For example, we can use a Kubernetes Service to load-balance read queries to multiple replicas. Also, we do not want to support non-GTID based replication.

Lastly, none of the existing operators could satisfy our requirements.

Goals

  • Manage primary-replica style clustering of MySQL instances.
    • The primary instance is the only instance that allows writes.
    • Replica instances replicate data from the primary and are read-only.
  • Support replication from an external MySQL instance.
  • Support all the four transaction isolation levels.
  • No split-brain.
  • Allow large transactions.
  • Upgrade the operator without restarting MySQL Pods.
  • Safe and automatic upgrading of MySQL version.
  • Support automatic primary selection and switchover.
  • Support automatic failover.
  • Backup and restore features.
    • Support point-in-time recovery (PiTR).
  • Tenant users can specify the following parameters:
    • The version of MySQL instances.
    • The number of processor cores for each MySQL instance.
    • The amount of memory for each MySQL instance.
    • The amount of backing storage for each MySQL instance.
    • The number of replicas in the MySQL cluster.
    • Custom configuration parameters.
  • Allow CREATE / DROP TEMPORARY TABLE during a transaction.

Non-goals

  • Support for older MySQL versions (5.6, 5.7)

    As a late comer, we focus our development effort on the latest MySQL. This simplifies things and allows us to use advanced mechanisms such as CLONE INSTANCE.

  • Node fencing

    Fencing is a technique to safely isolate a failed Node. MOCO does not rely on Node fencing, as fencing should be done externally.

    We can still implement failover in a safe way by configuring semi-sync parameters appropriately.

How MOCO reconciles MySQLCluster

MOCO creates and updates a StatefulSet and related resources for each MySQLCluster custom resource. This document describes how and when MOCO updates them.

Reconciler versions

MOCO's reconciliation routine should be consistent to avoid frequent updates.

That said, we may need to modify the reconciliation process in the future. To avoid updating the StatefulSet every time the process changes, MOCO keeps multiple versions of reconcilers.

For example, if a MySQLCluster is reconciled with version 1 of the reconciler, MOCO will keep using the version 1 reconciler to reconcile the MySQLCluster.

If the user edits MySQLCluster's spec field, MOCO can reconcile the MySQLCluster with the latest reconciler, for example version 2, because the user shall be ready for mysqld restarts.

The update policy of moco-agent container

We shall try to avoid updating moco-agent as much as possible.

The figure below illustrates the overview of resources related to clustering MySQL instances.

Overview of clustering related resources

StatefulSet

MOCO tries not to update the StatefulSet frequently. It updates the StatefulSet only when the update is a must.

The conditions for StatefulSet update

The StatefulSet will be updated when:

  • Some fields under spec of MySQLCluster are modified.
  • my.cnf for mysqld is updated.
  • the version of the reconciler used to reconcile the StatefulSet is obsolete.
  • the image of moco-agent given to the controller is updated.
  • the image of mysqld_exporter given to the controller is updated.

When the StatefulSet is not updated

  • the image of fluent-bit given to the controller is changed.
    • because the controller does not depend on fluent-bit.

The fluent-bit sidecar container is updated only when some fields under spec of MySQLCluster are modified.

Status about StatefulSet

  • In MySQLCluster.Status.Condition, there is a condition named StatefulSetReady.
  • This indicates the readiness of the StatefulSet.
  • The condition will be True when the rolling update of the StatefulSet has completely finished.

Secrets

MOCO generates random passwords for users that MOCO uses to access MySQL.

The generated passwords are stored in two Secrets. One is in the same namespace as moco-controller, and the other is in the namespace of MySQLCluster.

Certificate

MOCO creates a Certificate in the same namespace as moco-controller to issue a TLS certificate for moco-agent.

After cert-manager issues a TLS certificate and creates a Secret for it, MOCO copies the Secret to the namespace of MySQLCluster. For details, read security.md.

Service

MOCO creates three Services for each MySQLCluster, that is:

  • A headless Service, required for every StatefulSet
  • A Service for the primary mysqld instance
  • A Service for replica mysqld instances

The Services' labels, annotations, and spec fields can be customized with MySQLCluster's spec.primaryServiceTemplate and spec.replicaServiceTemplate field. The spec.primaryServiceTemplate configures the Service for the primary mysqld instance and the spec.replicaServiceTemplate configures the Service for the replica mysqld instances.

The following fields in Service spec may not be customized, though.

  • clusterIP
  • ports
  • selector

ConfigMap

MOCO creates and updates a ConfigMap for my.cnf. The name of this ConfigMap is calculated from the contents of my.cnf that may be changed by users.

MOCO deletes old ConfigMaps of my.cnf after a new ConfigMap for my.cnf is created.

If the cluster does not disable a sidecar container for slow query logs, MOCO creates a ConfigMap for the sidecar.

PodDisruptionBudget

MOCO creates a PodDisruptionBudget for each MySQLCluster to prevent the number of available semi-sync replica servers from becoming too small.

The spec.maxUnavailable value is calculated from MySQLCluster's spec.replicas as follows:

`spec.maxUnavailable` = floor(`spec.replicas` / 2)
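
For example, a MySQLCluster with spec.replicas set to 3 gets a PodDisruptionBudget with spec.maxUnavailable set to 1, and one with 5 replicas gets spec.maxUnavailable set to 2.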

If spec.replicas is 1, MOCO does not create a PDB.

ServiceAccount

MOCO creates a ServiceAccount for Pods of the StatefulSet. The ServiceAccount is not bound to any Roles/ClusterRoles.

See backup.md for the overview of the backup and restoration mechanism.

CronJob

This is the only resource created when backup is enabled for MySQLCluster.

If the backup is disabled, the CronJob is deleted.

Job

To restore data from a backup, MOCO creates a Job. MOCO deletes the Job after the Job finishes successfully.

If the Job fails, MOCO leaves the Job.

Status of Reconciliation

  • In MySQLCluster.Status.Condition, there is a condition named ReconcileSuccess.
  • This indicates the status of reconciliation.
  • The condition will be True when the reconcile function successfully finishes.

How MOCO maintains MySQL clusters

For each MySQLCluster, MOCO creates and maintains a set of mysqld instances. The set contains one primary instance and may contain multiple replica instances depending on the spec.replicas value of MySQLCluster.

This document describes how MOCO does this job safely.

Terminology

  • Replication: GTID-based replication between mysqld instances.
  • Cluster: a group of mysqld instances that replicate data between them.
  • Primary (instance): a single source instance of mysqld in a cluster.
  • Replica (instance): a read-only instance of mysqld that synchronizes data with the primary instance.
  • Intermediate primary: a special primary instance that replicates data from an external mysqld.
  • Errant transaction: a transaction that exists only on a replica instance.
  • Errant replica: a replica instance that has errant transactions.
  • Switchover: operation to change a live primary to a replica and promote a replica to the new primary.
  • Failover: operation to replace a dead primary with a replica.

Prerequisites

MySQLCluster allows positive odd numbers for spec.replicas value. If 1, MOCO runs a single mysqld instance without configuring replication. If 3 or greater, MOCO chooses a mysqld instance as a primary, writable instance and configures all other instances as replicas of the primary instance.

status.currentPrimaryIndex in MySQLCluster is used to record the current chosen primary instance. Initially, status.currentPrimaryIndex is zero and therefore the index of the primary instance is zero.

As a special case, if spec.replicationSourceSecretName is set for MySQLCluster, the primary instance is configured as a replica of an external MySQL server. In this case, the primary instance will not be writable. We call this type of primary instance intermediate primary.

If spec.replicationSourceSecretName is not set, MOCO configures semisynchronous replication between the primary and replicas. Otherwise, the replication is asynchronous.

For semi-synchronous replication, MOCO configures rpl_semi_sync_master_timeout long enough so that it never degrades to asynchronous replication.

Likewise, MOCO configures rpl_semi_sync_master_wait_for_slave_count to (spec.replicas - 1) / 2 to make sure that at least half of the replica instances have the same commit as the primary. e.g., if spec.replicas is 5, rpl_semi_sync_master_wait_for_slave_count will be set to 2.

MOCO also disables relay_log_recovery because enabling it would drop the relay logs on replicas.

mysqld always starts with super_read_only=1 to prevent erroneous writes, and with skip_slave_start to prevent misconfigured replication.

moco-agent, a sidecar container for MOCO, initializes MySQL users and plugins. At the end of the initialization, it issues RESET MASTER to clear executed GTID set.

moco-agent also provides a readiness probe for mysqld container. If a replica instance does not start replication threads or is too delayed to execute transactions, the container and the Pod will be determined as unready.

Limitations

Currently, MOCO does not re-initialize data after the primary instance fails.

After failover to a replica instance, the old primary may have errant transactions because it may recover unacknowledged transactions in its binary log. This is an inevitable limitation in MySQL semi-synchronous replication.

If this happens, MOCO detects the errant transaction and will not allow the old primary to rejoin the cluster as a replica.

Users need to delete the volume data (PersistentVolumeClaim) and the pod of the old primary to re-initialize it.

Possible states

MySQLCluster

MySQLCluster can be in one of the following states.

The initial state is Cloning if spec.replicationSourceSecretName is set, or Restoring if spec.restore is set. Otherwise, the initial state is Incomplete.

Note that, if the primary Pod is ready, mysqld is assured to be writable. Likewise, if a replica Pod is ready, mysqld is assured to be read-only and running replication threads without too much delay.

  1. Healthy
    • All Pods are ready.
    • All replicas have no errant transactions.
    • All replicas are read-only and connected to the primary.
    • For intermediate primary instance, the primary works as a replica for an external mysqld and is read-only.
  2. Cloning
    • spec.replicationSourceSecretName is set.
    • status.cloned is false.
    • Either a cloning result exists and is not "Completed", or there is no cloning result and the instance has no data.
    • (note: if the primary has some data and has no cloning result, the instance used to be a replica and was then promoted to the primary.)
  3. Restoring
    • spec.restore is set.
    • status.restoredTime is not set.
  4. Degraded
    • The primary Pod is ready and has not lost data.
    • For intermediate primary instance, the primary works as a replica for an external mysqld and is read-only.
    • Half or more replicas are ready, read-only, connected to the primary, and have no errant transactions. For example, if spec.replicas is 5, two or more such replicas are needed.
    • At least one replica has some problems.
  5. Failed
    • The primary instance is not running or lost data.
    • More than half of replicas are running and have data without errant transactions. For example, if spec.replicas is 5, three or more such replicas are needed.
  6. Lost
    • The primary instance is not running or lost data.
    • Half or more replicas are not running or lost data or have errant transactions.
  7. Incomplete
    • None of the above states applies.

MOCO can recover the cluster to Healthy from Degraded, Failed, or Incomplete if all Pods are running and there are no errant transactions.

MOCO can recover the cluster to Degraded from Failed when not all Pods are running. Recovering from Failed is called failover.

MOCO cannot recover the cluster from Lost. Users need to restore data from backups.

Pod

mysqld is run as a container in a Pod. Therefore, MOCO needs to be aware of the following conditions.

  1. Missing: the Pod does not exist.
  2. Exist: the Pod exists and is neither Terminating nor Demoting.
  3. Terminating: The Pod exists and metadata.deletionTimestamp is not null.
  4. Demoting: The Pod exists and has moco.cybozu.com/demote: true annotation.
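
For illustration only, a Pod would be considered Demoting once the annotation above is added, for example with the hypothetical Pod name below; in practice, use kubectl moco switchover rather than annotating Pods by hand:

$ kubectl annotate pod moco-mycluster-1 moco.cybozu.com/demote=true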

If there are missing Pods, MOCO does nothing for the MySQLCluster.

If a primary instance Pod is Terminating or Demoting, MOCO controller changes the primary to one of the replica instances. This operation is called switchover.

MySQL data

MOCO checks whether replica instances have errant transactions compared to the primary instance. If it detects such an instance, MOCO records the instance in MySQLCluster and excludes it from the cluster.

The user needs to delete the Pod and the volume manually and let the StatefulSet controller re-create them. After a newly initialized instance gets created, MOCO will allow it to rejoin the cluster.

Invariants

  • By definition, the primary instance recorded in MySQLCluster has no errant transactions. It is always the single source of truth.
  • Errant replicas are not treated as ready even if their Pod status is ready.

The maintenance flow

MOCO runs the following infinite loop for each MySQLCluster. It stops when MySQLCluster resource is deleted.

  1. Gather the current status
  2. Update status of MySQLCluster
  3. Determine what MOCO should do for the cluster
  4. If there is nothing to do, wait a while and go to 1
  5. Do the determined operation then go to 1

Read the following sub-sections for details about steps 1 to 3.

Gather the current status

MOCO gathers the information from kube-apiserver and mysqld as follows:

  • MySQLCluster resource
  • Pod resources
    • If some of the Pods are missing, MOCO does nothing.
  • mysqld
    • SHOW SLAVE HOSTS (on the primary)
    • SHOW SLAVE STATUS (on the replicas)
    • Global variables such as gtid_executed or super_read_only
    • Result of CLONE from performance_schema.clone_status table

If MOCO cannot connect to an instance for a certain period, that instance is determined as failed.

Update status of MySQLCluster

In this phase, MOCO updates status field of MySQLCluster as follows:

  1. Determine the current MySQLCluster state.
  2. Add or update type=Initialized condition to status.conditions as
    • True if the cluster state is not Cloning.
    • otherwise, False.
  3. Add or update type=Available condition to status.conditions as
    • True if the cluster state is Healthy or Degraded.
    • otherwise, False.
  4. Add or update type=Healthy condition to status.conditions as
    • True if the cluster state is Healthy.
    • otherwise, False.
    • The Reason field is set to the cluster state such as "Failed" or "Incomplete".
  5. Set the number of ready replica Pods to status.syncedReplicas.
  6. Add newly found errant replicas to status.errantReplicaList.
  7. Remove re-initialized and/or no-longer errant replicas from status.errantReplicaList
  8. Set status.errantReplicas to the length of status.errantReplicaList.
  9. Set status.cloned to true if spec.replicationSourceSecretName is not nil and the state is not Cloning.

Determine what MOCO should do for the cluster

The operation depends on the current cluster state.

The operation and its result are recorded as Events of MySQLCluster resource.

cf. Application Introspection and Debugging

Healthy

If the primary instance Pod is Terminating or Demoting, switch the primary instance to another replica. Otherwise, just wait a while.

The switchover is done as follows. It takes at least several seconds for a new primary to become writable.

  1. Make the primary instance super_read_only=1.
  2. Kill all existing connections except ones from localhost and ones for MOCO.
  3. Wait for a replica to catch up with the executed GTID set of the primary instance.
  4. Set status.currentPrimaryIndex to the replica's index.
  5. If the old primary is Demoting, remove moco.cybozu.com/demote annotation from the Pod.

Cloning

Execute CLONE INSTANCE on the intermediate primary instance to clone data from an external MySQL instance.

If the cloning succeeds, do the same as in the Intermediate case.

Restoring

Do nothing.

Degraded

First, check if the primary instance Pod is Terminating or Demoting; if it is, do the switchover just as in the Healthy case.

Then, do the same as in the Intermediate case to try to fix the problems. It is not possible to recover the cluster to Healthy if there are errant or stopped replicas, though.

Failed

MOCO chooses the most advanced instance as the new primary instance. The most advanced instance is the one whose retrieved GTID set is a superset of every other replica's, excluding replicas that have errant transactions.

To prevent accidental writes to the old primary instance (so-called split-brain), MOCO stops replication IO_THREAD for all replicas. This way, the old primary cannot get necessary acks from replicas to write further transactions.

The failover is done as follows:

  1. Stop IO_THREAD on all replicas.
  2. Choose the most advanced replica as the new primary. Errant replicas recorded in MySQLCluster are excluded from the candidates.
  3. Wait for the replica to execute all of its retrieved GTID set.
  4. Update status.currentPrimaryIndex to the new primary's index.

Lost

There is nothing that can be done.

Intermediate

  • On the primary that was an intermediate primary, wait for all the retrieved GTID set to be executed.
  • Start replication between the primary and non-errant replicas.
    • If a replica has no data, MOCO clones the primary data to the replica first.
  • Stop replication of errant replicas.
  • Set super_read_only=1 for replica instances that are writable.
  • Adjust the moco.cybozu.com/role label on Pods according to their roles.
    • For errant replicas, the label is removed to prevent users from reading inconsistent data.
  • Finally, make the primary mysqld writable if the primary is not an intermediate primary.

Backup and restore

This document describes how MOCO takes a backup of MySQLCluster data and restores a cluster from a backup.

Overview

A MySQLCluster can be configured to take backups regularly by referencing a BackupPolicy in spec.backupPolicyName. For each MySQLCluster associated with a BackupPolicy, moco-controller creates a CronJob. The CronJob creates a Job to take a full backup periodically. The Job also takes a backup of binary logs for Point-in-Time Recovery (PiTR). The backups are stored in an S3-compatible object storage bucket.
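
For example, a MySQLCluster opts in to backups simply by referencing a BackupPolicy by name; everything except the reference is elided here, and the names are placeholders:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: mycluster
spec:
  backupPolicyName: daily
  ...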

This figure illustrates how MOCO takes a backup of a MySQLCluster.

Backup

  1. moco-controller creates a CronJob and Role/RoleBinding to allow access to MySQLCluster for the Job Pod.
  2. At each configured interval, CronJob creates a Job.
  3. The Job dumps all data from a mysqld using MySQL shell's dump instance utility.
  4. The Job creates a tarball of the dumped data and puts it in a bucket of an S3-compatible object storage.
  5. The Job also dumps binlogs since the last backup and puts them in the same bucket (with a different name, of course).
  6. The Job finally updates MySQLCluster status to record the last successful backup.

To restore from a backup, users need to create a new MySQLCluster with spec.restore filled with necessary information such as the bucket name of the object storage, the object key, and so on.

The next figure illustrates how MOCO restores MySQL cluster from a backup.

Restore

  1. moco-controller creates a Job and Role/RoleBinding for restoration.
  2. The Job downloads a tarball of dumped files of the specified backup.
  3. The Job loads data into an empty mysqld using MySQL shell's dump loading utility.
  4. If the user wants to restore data at a point-in-time, the Job downloads the saved binlogs.
  5. The Job applies binlogs up to the specified point-in-time using mysqlbinlog.
  6. The Job finally updates MySQLCluster status to record the restoration time.

Design goals

Must:

  • Users must be able to configure different backup policies for each MySQLCluster.
  • Users must be able to restore MySQL data at a point-in-time from backups.
  • Users must be able to restore MySQL data without the original MySQLCluster resource.
  • moco-controller must export metrics about backups.

Should:

  • Backup data should be compressed to save the storage space.
  • Backup data should be stored in an object storage.
  • Backups should be taken from a replica instance as much as possible.

These "should's" are mostly in terms of money or performance.

Implementation

Backup file keys

Backup files are stored in an object storage bucket with the following keys.

  • Key for a tarball of a fully dumped MySQL: moco/<namespace>/<name>/YYYYMMDD-hhmmss/dump.tar
  • Key for a compressed tarball of binlog files: moco/<namespace>/<name>/YYYYMMDD-hhmmss/binlog.tar.zst

<namespace> is the namespace of MySQLCluster, and <name> is the name of MySQLCluster. YYYYMMDD-hhmmss is the date and time of the backup where YYYY is the year, MM is two-digit month, DD is two-digit day, hh is two-digit hour in 24-hour format, mm is two-digit minute, and ss is two-digit second.

Example: moco/foo/bar/20210515-230003/dump.tar

This allows multiple MySQLClusters to share the same bucket.

Timestamps

Internally, the time for PiTR is formatted in UTC timezone.

The restore Job runs mysqlbinlog with TZ=Etc/UTC timezone.

Backup

As described in Overview, the backup process is implemented with CronJob and Job. In addition, users need to provide a ServiceAccount for the Job.

The ServiceAccount is often used to grant access to the object storage bucket where the backup files will be stored. For instance, Amazon Elastic Kubernetes Service (EKS) has a feature to create such a ServiceAccount. Kubernetes itself is also developing such an enhancement called Container Object Storage Interface (COSI).

To allow the backup Job to update MySQLCluster status, MOCO creates Role and RoleBinding. The RoleBinding grants the access to the given ServiceAccount.

By default, MOCO uses the Amazon S3 API, the most popular object storage API. Therefore, it also works with object storage that has an S3-compatible API, such as MinIO and Ceph. Object storage that uses non-S3 compatible APIs is only partially supported.

Currently supported object storage includes:

  • Amazon S3-compatible API
  • Google Cloud Storage API

For the first backup, the Job chooses a replica instance as the backup source if one is available. For the second and subsequent backups, the Job will choose the last chosen instance as long as it is still a replica and available.

The backups are divided into two: a full dump and binlogs. A full dump is a snapshot of the entire MySQL database. Binlogs are records of transactions. With mysqlbinlog, binlogs can be used to apply transactions to a database restored from a full dump for PiTR.

For the first backup, MOCO only takes a full dump of a MySQL instance and records the GTID set at the time of the backup. For the second and subsequent backups, MOCO also retrieves the binlogs generated since the GTID of the last backup.

To take a full dump, MOCO uses MySQL shell's dump instance utility. It performs significantly faster than mysqldump or mysqlpump. The dump is compressed with zstd compression algorithm.

MOCO then creates a tarball of the dump and puts it in an object storage bucket.

To retrieve the transactions executed since the last backup, mysqlbinlog is used.

The retrieved binlog files are packed into a tarball, compressed with zstd, and put in an object storage bucket.

Finally, the Job updates MySQLCluster status field with the following information:

  • The time of backup
  • The time spent on the backup
  • The ordinal of the backup source instance
  • server_uuid of the instance (to check whether the instance was re-initialized or not)
  • The binlog filename in SHOW MASTER STATUS output.
  • The size of the tarball of the dumped files
  • The size of the tarball of the binlog files
  • The maximum usage of the working directory
  • Warnings, if any

When executing an incremental backup, the backup source must be a pod whose server_uuid has not changed since the last backup. If the server_uuid has changed, the pod may be missing some of the binlogs generated since the last backup.

The following flowchart shows how the pod used as the backup source is chosen.

flowchart TD
A{"first time?"}
A -->|"yes"| B
A -->|"no"| C["x ← Get the indexes of the pod whose server_uuid has not changed"] --> D

B{Are replicas available?}
B -->|"yes"| B1["return\nreplicaIdx\ndoBackupBinlog=false"]
style B1 fill:#c1ffff
B -->|"no"| B2["return\nprimaryIdx\ndoBackupBinlog=false"]
style B2 fill:#ffffc1

D{"Is x empty?"}
D -->|"yes"| E["add warning to bm.warnings"] --> F
style E fill:#ffc1c1
D -->|"no"| G

F{"Are replicas available?"}
F -->|"yes"| F1["return\nreplicaIdx\ndoBackupBinlog=false"]
style F1 fill:#ffc1c1
F -->|"no"| F2["return\nprimaryIdx\ndoBackupBinlog=false"]
style F2 fill:#ffc1c1

G{"Are there replica indexes in x?"}
G -->|"yes"| H
G -->|"no"| G1["return\nprimaryIdx\ndoBackupBinlog=true"]
style G1 fill:#ffffc1

H{"Is lastIndex included in x?"}
H -->|"yes"| I
H -->|"no"| H1["return\nreplicaIdx\ndoBackupBinlog=true"]
style H1 fill:#c1ffff

I{"Is lastIndex primary?"}
I -->|"yes"| I1["return\nreplicaIdx\ndoBackupBinlog=true"]
style I1 fill:#c1ffff
I -->|"no"| I2["return\nlastIdx\ndoBackupBinlog=true"]
style I2 fill:#c1ffff

Restore

To restore MySQL data from a backup, users need to create a new MySQLCluster with appropriate spec.restore field. spec.restore needs to provide at least the following information:

  • The bucket name
  • Namespace and name of the original MySQLCluster
  • A point-in-time in RFC3339 format
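
A sketch of such a MySQLCluster follows. The field names under spec.restore mirror the items above but should be checked against the custom resource reference; all concrete values are placeholders:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: bar
  name: restored
spec:
  restore:
    sourceName: mycluster
    sourceNamespace: foo
    restorePoint: "2021-05-23T15:04:23Z"
    jobConfig:
      serviceAccountName: backup-owner
      bucketConfig:
        bucketName: moco-backups
      workVolume:
        emptyDir: {}
  ...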

After moco-controller confirms that mysqld is running, it creates a Job to retrieve backup files and load them into mysqld.

The Job looks for the most recent tarball of the dumped files that is older than the specified point-in-time in the bucket, and retrieves it. The dumped files are then loaded to mysqld using MySQL shell's load dump utility.

If the point-in-time is different from the time of the dump file, and if there is a compressed tarball of binlog files, then the Job retrieves binlog files and applies transactions up to the point-in-time.

After the restoration process finishes, the Job updates the MySQLCluster status to record the restoration time. moco-controller then configures the clustering as usual.

If the Job fails, moco-controller leaves the Job as is. The restored MySQL cluster will also be left read-only. If some of the data have been restored, they can be read from the cluster.

If a failed Job is deleted, moco-controller will create a new Job to give it another chance. Users can safely delete a successful Job.

Caveats

  • No automatic deletion of backup files

    MOCO does not delete old backup files from object storage. Users should configure a bucket lifecycle policy to delete old backups automatically.

  • Duplicated backup Jobs

    CronJob may create two or more Jobs at a time. If this happens, only one Job can update MySQLCluster status.

  • Lost binlog files

    If binlog_expire_logs_seconds or expire_logs_days is set to a shorter value than the interval of backups, MOCO cannot save binlogs correctly. Users are responsible to configure binlog_expire_logs_seconds appropriately.
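
To illustrate the last caveat, binlog retention can be set through the cluster's my.cnf ConfigMap. The value below (14 days) is only an example, and the assumption here is that the MySQLCluster references this ConfigMap through its my.cnf customization field (spec.mysqlConfigMapName):

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: foo
  name: mycnf
data:
  binlog_expire_logs_seconds: "1209600"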

Considered options

There were many design choices and alternative methods to implement backup/restore feature for MySQL. Here are descriptions of why we determined the current design.

Why do we use S3-compatible object storage to store backups?

Compared to file systems, object storage is generally more cost-effective. It also has many useful features such as object lifecycle management.

The AWS S3 API is the most prevalent API for object storage.

What object storage is supported?

MOCO currently supports the following object storage APIs:

  • Amazon S3
  • Google Cloud Storage

MOCO uses the Amazon S3 API by default. You can set BackupPolicy.spec.jobConfig.bucketConfig.backendType to specify the object storage API to use. Currently, two identifiers can be specified: s3 or gcs. If backendType is not specified, it defaults to s3.

The following is an example of a backup setup using Google Cloud Storage:

apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
...
spec:
  schedule: "@daily"
  jobConfig:
    serviceAccountName: backup-owner
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: <dummy>
    bucketConfig:
      bucketName: moco
      endpointURL: https://storage.googleapis.com
      backendType: gcs
    workVolume:
      emptyDir: {}

Why do we use Jobs for backup and restoration?

Backup and restoration can be CPU- and memory-consuming tasks. Running such a task in moco-controller is dangerous because moco-controller manages a lot of MySQLClusters.

moco-agent is also not a safe place to run a backup job because it is a sidecar of the mysqld Pod. If a backup were run in the mysqld Pod, it would interfere with the mysqld process.

Why do we prefer mysqlsh to mysqldump?

The biggest reason is the difference in how these tools lock the instance.

mysqlsh uses LOCK INSTANCE FOR BACKUP, which blocks DDL until the lock is released. mysqldump, on the other hand, allows DDL to be executed. Once a DDL statement is executed, it acquires a metadata lock, which means that any DML for the table modified by the DDL will be blocked.

Blocking DML during backup is not desirable, especially when the only available backup source is the primary instance.

Another reason is that mysqlsh is much faster than mysqldump / mysqlpump.

Why don't we do continuous backup?

Continuous backup is a technique to save executed transactions in real time. For MySQL, this can be done with mysqlbinlog --stop-never. This command continuously retrieves transactions from binary logs and outputs them to stdout.

MOCO does not adopt this technique for the following reasons:

  • We assume MOCO clusters have replica instances in most cases.

    When the data of the primary instance is lost, one of the replicas can be promoted to be the new primary.

  • It is troublesome to control the continuous backup process on Kubernetes.

    The process needs to be kept running between full backups. If we do so, the entire backup process should be a persistent workload, not a (Cron)Job.

Upgrading mysqld

This document describes how mysqld upgrades its data and what MOCO has to do about it.

Preconditions

MySQL data

Beginning with 8.0.16, mysqld can update all data that needs to be updated when it starts running. This means that MOCO needs to do nothing about MySQL data.

One thing that we should care about is that the update process may take a long time. The startup probe of mysqld container should be configured to wait for mysqld to complete updating data.

ref: https://dev.mysql.com/doc/refman/8.0/en/upgrading-what-is-upgraded.html

Downgrading

MySQL 8.0 does not support any kind of downgrading.

ref: https://dev.mysql.com/doc/refman/8.0/en/downgrading.html

Internally, MySQL has a version called "data dictionary (DD) version". If two MySQL versions have the same DD version, they are considered to have data compatibility.

ref: https://github.com/mysql/mysql-server/blob/mysql-8.0.24/sql/dd/dd_version.h#L209

Nevertheless, DD versions do change from time to time between revisions of MySQL 8.0. Therefore, the simplest way to avoid DD version mismatch is to not downgrade MySQL.

Upgrading a replication setup

In a nutshell, replica MySQL instances should be the same version as, or newer than, the source MySQL instance.

refs:

  • https://dev.mysql.com/doc/refman/8.0/en/replication-compatibility.html
  • https://dev.mysql.com/doc/refman/8.0/en/replication-upgrade.html

StatefulSet behavior

When the Pod template of a StatefulSet is updated, Kubernetes updates the Pods. With the default update strategy RollingUpdate, the Pods are updated one by one from the largest ordinal to the smallest.

The StatefulSet controller keeps the old Pod template until it completes the rolling update. If a Pod that is not being updated is deleted, the StatefulSet controller restores the Pod from the old template.

This means that, if the cluster is Healthy, MySQL is assured to be updated one by one from the instance of the largest ordinal to the smallest.

refs:

  • https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates
  • https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update

Automatic switchover

MOCO switches the primary instance when the Pod of the instance is being deleted. Read clustering.md for details.

MOCO implementation

With the preconditions listed above, MOCO can upgrade mysqld in MySQLCluster safely as follows.

  1. Set .spec.updateStrategy field in StatefulSet to RollingUpdate.
  2. Choose the lowest ordinal Pod as the next primary upon a switchover.
  3. Configure the startup probe of mysqld container to wait long enough.
    • By default, MOCO configures the probe to wait up to one hour.
    • Users can adjust the duration for each MySQLCluster.
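
As a sketch of the last point, the wait can be lengthened per cluster. The field name spec.startupWaitSeconds is an assumption here and should be verified against the MySQLCluster reference:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
...
spec:
  startupWaitSeconds: 7200
  ...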

Example

Suppose that we are updating a three-instance cluster. The mysqld instances in the cluster have ordinals 0, 1, and 2, and the current primary instance is instance 1.

After MOCO updates the Pod template of the StatefulSet created for the cluster, Kubernetes starts re-creating Pods, beginning with instance 2.

Instance 2 is a replica and therefore is safe for an update.

After instance 2, the instance 1 Pod is deleted. The deletion triggers an automatic switchover, so MOCO changes the primary to instance 0 because it has the lowest ordinal. Because instance 0 is running an old mysqld, the preconditions are kept.

Finally, instance 0 is re-created in the same way. This time, MOCO switches the primary to instance 1. Since both instances 1 and 2 have already been updated and instance 0 is being deleted, the preconditions are kept.

Limitations

If an instance is down during an upgrade, MOCO may choose an already updated instance as the new primary even though some instances are still running an old version.

If this happens, users may need to manually delete the old replica data and re-initialize the replica to restore the cluster health.

User's responsibility

Security considerations

gRPC API

moco-agent, a sidecar container in the mysqld Pod, provides a gRPC API to execute CLONE INSTANCE and the operations required after CLONE. More importantly, the request contains credentials to access the source database.

To protect the credentials and prevent abuse of API, MOCO configures mTLS between moco-agent and moco-controller as follows:

  1. Create an Issuer resource in moco-system namespace as the Certificate Authority.
  2. Create a Certificate resource to issue the certificate for moco-controller.
  3. moco-controller issues certificates for each MySQLCluster by creating Certificate resources.
  4. moco-controller copies Secret resources created by cert-manager to the namespaces of MySQLCluster.
  5. Both moco-controller and moco-agent verify the certificate with the CA certificate.
    • The CA certificate is embedded in the Secret resources.
  6. moco-agent additionally verifies the certificate from moco-controller by checking that its Common Name is moco-controller.

MySQL passwords

MOCO generates its user passwords randomly using the OS random device. The passwords are then stored as Secret resources.

As for the communication between moco-controller and mysqld, it is not (yet) over TLS. That said, the password is encrypted anyway thanks to caching_sha2_password authentication.