MOCO documentation

moco logo

This is the documentation site for MOCO. MOCO is a Kubernetes operator for MySQL created and maintained by Cybozu.

Getting started

Setup

Quick setup

You can choose between two installation methods.

MOCO depends on cert-manager. If cert-manager is not installed on your cluster, install it as follows:

$ curl -fsLO https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml
$ kubectl apply -f cert-manager.yaml

Install using raw manifests:

$ curl -fsLO https://github.com/cybozu-go/moco/releases/latest/download/moco.yaml
$ kubectl apply -f moco.yaml

Install using Helm chart:

$ helm repo add moco https://cybozu-go.github.io/moco/
$ helm repo update
$ helm install --create-namespace --namespace moco-system moco moco/moco

Customize manifests

If you want to edit the manifest, the config/ directory contains the source YAML for kustomize.
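
For example (a sketch, assuming the standard kustomize layout with config/default as the entry point), you can render and apply a customized manifest like this:

$ git clone https://github.com/cybozu-go/moco.git
$ cd moco
$ # edit files under config/ as needed, then:
$ kustomize build config/default | kubectl apply -f -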

Next step

Read usage.md and create your first MySQL cluster!

MOCO Helm Chart

How to use MOCO Helm repository

You need to add this repository to your Helm repositories:

$ helm repo add moco https://cybozu-go.github.io/moco/
$ helm repo update

Quick start

Installing cert-manager

$ curl -fsL https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml | kubectl apply -f -

Installing the Chart

NOTE: This installation method requires cert-manager to be installed beforehand.

To install the chart with the release name moco in a dedicated namespace (recommended):

$ helm install --create-namespace --namespace moco-system moco moco/moco

Specify parameters using the --set key=value[,key=value] argument to helm install.
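
For example, to override the CPU request of moco-controller:

$ helm install --create-namespace --namespace moco-system \
    --set resources.requests.cpu=200m moco moco/moco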

Alternatively, a YAML file that specifies values for the parameters can be provided as follows:

$ helm install --create-namespace --namespace moco-system moco -f values.yaml moco/moco
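
The following is a minimal sketch of such a values.yaml; the keys are taken from the Values table below, and the flag value is only an example:

$ cat > values.yaml <<'EOF'
resources:
  requests:
    cpu: 200m
    memory: 50Mi
extraArgs:
  - --apiserver-qps-throttle=30
EOF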

Values

Key | Type | Default | Description
image.repository | string | "ghcr.io/cybozu-go/moco" | MOCO image repository to use.
image.tag | string | {{ .Chart.AppVersion }} | MOCO image tag to use.
resources | object | {"requests":{"cpu":"100m","memory":"20Mi"}} | resources used by moco-controller.
extraArgs | list | [] | Additional command line flags to pass to the moco-controller binary.
nodeSelector | object | {} | nodeSelector used by moco-controller.
affinity | object | {} | affinity used by moco-controller.
tolerations | list | [] | tolerations used by moco-controller.
topologySpreadConstraints | list | [] | topologySpreadConstraints used by moco-controller.
priorityClassName | string | "" | PriorityClass used by moco-controller.

Generate Manifests

You can use the helm template command to render manifests.

$ helm template --namespace moco-system moco moco/moco

Upgrade CRDs

There is no support at this time for upgrading or deleting CRDs using Helm. Users must manually upgrade the CRDs if there is a change in the CRDs used by MOCO.

https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#install-a-crd-declaration-before-using-the-resource
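
For example, one way to upgrade the CRDs manually is to extract the CustomResourceDefinition documents from the release manifest and apply them. This is only a sketch; it assumes the release's moco.yaml contains the CRDs and that yq (v4) is available:

$ curl -fsLO https://github.com/cybozu-go/moco/releases/latest/download/moco.yaml
$ yq 'select(.kind == "CustomResourceDefinition")' moco.yaml | kubectl apply --server-side -f -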

Migrate to v0.3.0

Chart version v0.3.0 contains breaking changes: the .metadata.name of the resources generated by the chart has changed.

e.g.

  • {{ template "moco.fullname" . }}-foo-resources -> moco-foo-resources

Related Issue: cybozu-go/moco#426

If you are using a release name other than moco, you need to migrate.

The migration steps delete and re-create each MOCO resource once, except the CRDs. Since the CRDs are not deleted, the Pods running existing MySQL clusters are not deleted either, so there is no downtime. However, complete the migration quickly: while the moco-controller is deleted, no control over the clusters is available.

Migration steps
  1. Show the installed chart

    $ helm list -n <YOUR NAMESPACE>
    NAME    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
    moco    moco-system     1               2022-08-17 11:28:23.418752 +0900 JST    deployed        moco-0.2.3      0.12.1
    
  2. Render the manifests

    $ helm template --namespace moco-system --version <YOUR CHART VERSION> <YOUR INSTALL NAME> moco/moco > render.yaml
    
  3. Setup kustomize

    $ cat > kustomization.yaml <<'EOF'
    resources:
      - render.yaml
    patches:
      - crd-patch.yaml
    EOF
    
    $ cat > crd-patch.yaml <<'EOF'
    $patch: delete
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: backuppolicies.moco.cybozu.com
    ---
    $patch: delete
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: mysqlclusters.moco.cybozu.com
    EOF
    
  4. Delete resources

    $ kustomize build ./ | kubectl delete -f -
    serviceaccount "moco-controller-manager" deleted
    role.rbac.authorization.k8s.io "moco-leader-election-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-backuppolicy-editor-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-backuppolicy-viewer-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-manager-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-mysqlcluster-editor-role" deleted
    clusterrole.rbac.authorization.k8s.io "moco-mysqlcluster-viewer-role" deleted
    rolebinding.rbac.authorization.k8s.io "moco-leader-election-rolebinding" deleted
    clusterrolebinding.rbac.authorization.k8s.io "moco-manager-rolebinding" deleted
    service "moco-webhook-service" deleted
    deployment.apps "moco-controller" deleted
    certificate.cert-manager.io "moco-controller-grpc" deleted
    certificate.cert-manager.io "moco-grpc-ca" deleted
    certificate.cert-manager.io "moco-serving-cert" deleted
    issuer.cert-manager.io "moco-grpc-issuer" deleted
    issuer.cert-manager.io "moco-selfsigned-issuer" deleted
    mutatingwebhookconfiguration.admissionregistration.k8s.io "moco-mutating-webhook-configuration" deleted
    validatingwebhookconfiguration.admissionregistration.k8s.io "moco-validating-webhook-configuration" deleted
    
  5. Delete Secret

    $ kubectl delete secret sh.helm.release.v1.<YOUR INSTALL NAME>.v1 -n <YOUR NAMESPACE>
    
  6. Re-install the v0.3.0 chart

    $ helm install --create-namespace --namespace moco-system --version 0.3.0 moco moco/moco
    

Release Chart

See RELEASE.md.

Installing kubectl-moco

kubectl-moco is a plugin for kubectl to control MySQL clusters managed by MOCO.

Pre-built binaries are available on GitHub releases for Windows, Linux, and macOS.

Installing using Krew

Krew is the plugin manager for the kubectl command-line tool.

See the documentation for how to install Krew.

$ kubectl krew update
$ kubectl krew install moco

Installing manually

  1. Set OS to the operating system name

    OS is one of linux, windows, or darwin (macOS).

    If Go is available, OS can be set automatically as follows:

    $ OS=$(go env GOOS)
    
  2. Set ARCH to the architecture name

    ARCH is one of amd64 or arm64.

    If Go is available, ARCH can be set automatically as follows:

    $ ARCH=$(go env GOARCH)
    
  3. Set VERSION to the MOCO version

    See the MOCO release page: https://github.com/cybozu-go/moco/releases

    $ VERSION=< The version you want to install >
    
  4. Download the binary and put it in a directory of your PATH.

    The following is an example to install the plugin in /usr/local/bin.

    $ curl -L -sS https://github.com/cybozu-go/moco/releases/download/${VERSION}/kubectl-moco_${VERSION}_${OS}_${ARCH}.tar.gz \
      | tar xz -C /usr/local/bin kubectl-moco
    
  5. Check the installation by running kubectl moco -h.

    $ kubectl moco -h
    the utility command for MOCO.
    
    Usage:
      kubectl-moco [command]
    
    Available Commands:
      credential  Fetch the credential of a specified user
      help        Help about any command
      mysql       Run mysql command in a specified MySQL instance
      switchover  Switch the primary instance
    
    ...
    

How to use MOCO

After setting up MOCO, you can create MySQL clusters with a custom resource called MySQLCluster.

Basics

MOCO creates a cluster of mysqld instances for each MySQLCluster. A cluster can consist of 1, 3, or 5 mysqld instances.

MOCO configures semi-synchronous GTID-based replication between mysqld instances in a cluster if the cluster size is 3 or 5. A 3-instance cluster can tolerate up to 1 replica failure, and a 5-instance cluster can tolerate up to 2 replica failures.

In a cluster, there is only one instance called primary. The primary instance is the source of truth. It is the only writable instance in the cluster, and the source of the replication. All other instances are called replica. A replica is a read-only instance and replicates data from the primary.

Limitations

Errant replicas

An inherent limitation of GTID-based semi-synchronous replication is that a failed instance may have errant transactions. If this happens, the instance needs to be re-created after removing all of its data.

MOCO does not re-create such an instance. It only detects instances having errant transactions and excludes them from the cluster. Users need to monitor them and re-create the instances.

Read-only primary

From time to time, MOCO sets the primary mysqld instance read-only, for example during a switchover. Applications that use MOCO MySQL need to be aware of this.

Creating clusters

Creating an empty cluster

An empty cluster always has a writable instance called the primary. All other instances are called replicas. Replicas are read-only and replicate data from the primary.

The following YAML creates a three-instance cluster. It sets a Pod anti-affinity so that all instances are scheduled to different Nodes, and sets memory and CPU limits so that the Pods get the Guaranteed QoS class.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  # replicas is the number of mysqld Pods.  The default is 1.
  replicas: 3
  podTemplate:
    spec:
      # Make the data directory writable. If moco-init fails with "Permission denied", uncomment the following settings.
      # securityContext:
      #   fsGroup: 10000
      #   fsGroupChangePolicy: "OnRootMismatch"  # available since k8s 1.20
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - mysql
              - key: app.kubernetes.io/instance
                operator: In
                values:
                - test
            topologyKey: "kubernetes.io/hostname"
      containers:
      # At least a container named "mysqld" must be defined.
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.35
        # By limiting CPU and memory, Pods will have Guaranteed QoS class.
        # requests can be omitted; it will be set to the same value as limits.
        resources:
          limits:
            cpu: "10"
            memory: "10Gi"
  volumeClaimTemplates:
  # At least a PVC named "mysql-data" must be defined.
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

By default, MOCO uses preferredDuringSchedulingIgnoredDuringExecution to prevent Pods from being placed on the same Node.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: moco-<MYSQLCLUSTER_NAME>
  namespace: default
...
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - mysql
                - key: app.kubernetes.io/created-by
                  operator: In
                  values:
                  - moco
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                  - <MYSQLCLUSTER_NAME>
              topologyKey: kubernetes.io/hostname
            weight: 100
...

There are other example manifests in the examples directory.

The complete reference of MySQLCluster is crd_mysqlcluster_v1beta2.md.

Creating a cluster that replicates data from an external mysqld

Let's call the source mysqld instance donor.

First, make sure partial_revokes is enabled on the donor; replicating data from a donor with partial_revokes disabled will result in replication inconsistencies or errors, since MOCO uses the partial_revokes functionality.

We use the clone plugin to copy all the data quickly. After the cloning, MOCO needs to create some user accounts and install plugins.

On the donor, you need to install the plugin and create two user accounts as follows:

mysql> INSTALL PLUGIN clone SONAME 'mysql_clone.so';
mysql> CREATE USER 'clone-donor'@'%' IDENTIFIED BY 'xxxxxxxxxxx';
mysql> GRANT BACKUP_ADMIN, REPLICATION SLAVE ON *.* TO 'clone-donor'@'%';
mysql> CREATE USER 'clone-init'@'localhost' IDENTIFIED BY 'yyyyyyyyyyy';
mysql> GRANT ALL ON *.* TO 'clone-init'@'localhost' WITH GRANT OPTION;
mysql> GRANT PROXY ON ''@'' TO 'clone-init'@'localhost' WITH GRANT OPTION;

You may change the user names and should change their passwords.

Then create a Secret in the same namespace as MySQLCluster:

$ kubectl -n <namespace> create secret generic donor-secret \
    --from-literal=HOST=<donor-host> \
    --from-literal=PORT=<donor-port> \
    --from-literal=USER=clone-donor \
    --from-literal=PASSWORD=xxxxxxxxxxx \
    --from-literal=INIT_USER=clone-init \
    --from-literal=INIT_PASSWORD=yyyyyyyyyyy

You may change the secret name.

Finally, create MySQLCluster with spec.replicationSourceSecretName set to the Secret name as follows. The mysql image must be the same version as the donor's.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  replicationSourceSecretName: donor-secret
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.35  # must be the same version as the donor
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

To stop the replication from the donor, update MySQLCluster with spec.replicationSourceSecretName: null.
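
For example, you can do this with a merge patch (a sketch; the cluster name and namespace follow the example above):

$ kubectl -n foo patch mysqlcluster test --type merge \
    -p '{"spec":{"replicationSourceSecretName":null}}'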

Bring your own image

We provide pre-built MySQL container images at ghcr.io/cybozu-go/moco/mysql. If you want to build and use your own image, read custom-mysqld.md.

Configurations

The default and constant configuration values for mysqld are available on pkg.go.dev. The settings in ConstMycnf cannot be changed while the settings in DefaultMycnf can be overridden.

You can change the default values or set undefined values by creating a ConfigMap in the same namespace as MySQLCluster, and setting spec.mysqlConfigMapName in MySQLCluster to the name of the ConfigMap as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: foo
  name: mycnf
data:
  long_query_time: "5"
  innodb_buffer_pool_size: "10G"
---
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  # set this to the name of ConfigMap
  mysqlConfigMapName: mycnf
  ...

InnoDB buffer pool size

If innodb_buffer_pool_size is not specified, MOCO automatically sets it to 70% of the value of resources.requests.memory (or resources.limits.memory) for the mysqld container. For example, with resources.requests.memory: 10Gi, innodb_buffer_pool_size is set to about 7G.

If neither resources.requests.memory nor resources.limits.memory is set, innodb_buffer_pool_size will be set to 128M.

Opaque configuration

Some configuration variables cannot be fully configured with ConfigMap values. For example, --performance-schema-instrument needs to be specified multiple times.

You can set them through the special config key _include. The value of _include is included in my.cnf as-is.

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: foo
  name: mycnf
data:
  _include: |
    performance-schema-instrument='memory/%=ON'
    performance-schema-instrument='wait/synch/%/innodb/%=ON'
    performance-schema-instrument='wait/lock/table/sql/handler=OFF'
    performance-schema-instrument='wait/lock/metadata/sql/mdl=OFF'

Take care not to overwrite critical configurations such as log_bin, since MOCO does not check the contents of _include.

Using the cluster

kubectl moco

From outside of your Kubernetes cluster, you can access MOCO MySQL instances using kubectl-moco. kubectl-moco is a plugin for kubectl. Pre-built binaries are available on GitHub releases.

The following example runs the mysql command interactively against the primary instance of the test MySQLCluster in the foo namespace.

$ kubectl moco -n foo mysql -it test

Read the reference manual of kubectl-moco for further details and examples.

MySQL users

MOCO prepares a set of users.

  • moco-readonly can read all tables of all databases.
  • moco-writable can create users, databases, or tables.
  • moco-admin is the super user.

The exact privileges that moco-readonly has are:

  • PROCESS
  • REPLICATION CLIENT
  • REPLICATION SLAVE
  • SELECT
  • SHOW DATABASES
  • SHOW VIEW

The exact privileges that moco-writable has are:

  • ALTER
  • ALTER ROUTINE
  • CREATE
  • CREATE ROLE
  • CREATE ROUTINE
  • CREATE TEMPORARY TABLES
  • CREATE USER
  • CREATE VIEW
  • DELETE
  • DROP
  • DROP ROLE
  • EVENT
  • EXECUTE
  • INDEX
  • INSERT
  • LOCK TABLES
  • PROCESS
  • REFERENCES
  • REPLICATION CLIENT
  • REPLICATION SLAVE
  • SELECT
  • SHOW DATABASES
  • SHOW VIEW
  • TRIGGER
  • UPDATE

moco-writable cannot edit tables in the mysql database, though.

You can create other users and grant them certain privileges as either moco-writable or moco-admin.

$ kubectl moco mysql -u moco-writable test -- -e "CREATE USER 'foo'@'%' IDENTIFIED BY 'bar'"
$ kubectl moco mysql -u moco-writable test -- -e "CREATE DATABASE db1"
$ kubectl moco mysql -u moco-writable test -- -e "GRANT ALL ON db1.* TO 'foo'@'%'"

Connecting to mysqld over network

MOCO prepares two Services for each MySQLCluster. For example, a MySQLCluster named test in foo Namespace has the following Services.

Service Name | DNS Name | Description
moco-test-primary | moco-test-primary.foo.svc | Connect to the primary instance.
moco-test-replica | moco-test-replica.foo.svc | Connect to replica instances.

moco-test-replica can be used only for read access.

The type of these Services is usually ClusterIP. The following is an example to change Service type to LoadBalancer and add an annotation for MetalLB.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  primaryServiceTemplate:
    metadata:
      annotations:
        metallb.universe.tf/address-pool: production-public-ips
    spec:
      type: LoadBalancer
...

Backup and restore

MOCO can take full and incremental backups regularly. The backup data are stored in Amazon S3 compatible object storages.

You can restore data from a backup to a new MySQL cluster.

Object storage bucket

A bucket is the management unit of objects in S3. MOCO stores backups in a specified bucket.

MOCO does not remove backups. To remove old backups automatically, you can set a lifecycle configuration on the bucket.

ref: Setting lifecycle configuration on a bucket

A bucket can be shared safely across multiple MySQLClusters. Object keys are prefixed with moco/.
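
The following is a sketch of such a lifecycle rule using the AWS CLI. It expires objects under the moco/ prefix after 30 days; the retention period is arbitrary, so adjust it to your backup schedule:

$ cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-moco-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "moco/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF
$ aws s3api put-bucket-lifecycle-configuration --bucket <YOUR BUCKET> \
    --lifecycle-configuration file://lifecycle.json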

BackupPolicy

BackupPolicy is a custom resource to define a policy for taking backups.

The following is an example BackupPolicy to take a backup every day and store data in MinIO:

apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
metadata:
  namespace: backup
  name: daily
spec:
  # Backup schedule.  Any CRON format is allowed.
  schedule: "@daily"

  jobConfig:
    # An existing ServiceAccount name is required.
    serviceAccountName: backup-owner
    env:
    - name: AWS_ACCESS_KEY_ID
      value: minioadmin
    - name: AWS_SECRET_ACCESS_KEY
      value: minioadmin

    # bucketName is required.  Other fields are optional.
    bucketConfig:
      bucketName: moco
      endpointURL: http://minio.default.svc:9000
      usePathStyle: true

    # MOCO uses a filesystem volume to store data temporarily.
    workVolume:
      # Using emptyDir as a working directory is NOT recommended.
      # The recommended way is to use generic ephemeral volume with a provisioner
      # that can provide enough capacity.
      # https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes
      emptyDir: {}

To enable backup for a MySQLCluster, reference the BackupPolicy name like this:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: foo
spec:
  backupPolicyName: daily  # The policy name
...

Note: If you want to specify the bucket name in a ConfigMap or Secret, you can use envFrom and reference the environment variable name in jobConfig.bucketConfig.bucketName as follows. This behavior is tested.

apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
metadata:
  namespace: backup
  name: daily
spec:
  jobConfig:
    bucketConfig:
      bucketName: "$(BUCKET_NAME)"
      endpointURL: http://minio.default.svc:9000
      usePathStyle: true
    envFrom:
    - configMapRef:
        name: bucket-name
...
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: backup
  name: bucket-name
data:
  BUCKET_NAME: moco

MOCO creates a CronJob for each MySQLCluster that has spec.backupPolicyName.

The CronJob's name is moco-backup- + the name of the MySQLCluster. For the above example, a CronJob named moco-backup-foo is created in the default namespace.
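
You can check the generated CronJob as follows:

$ kubectl -n default get cronjob moco-backup-foo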

The following podAntiAffinity is set by default for CronJob. If you want to override it, set BackupPolicy.spec.jobConfig.affinity.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: moco-backup-foo
spec:
...
  jobTemplate:
    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                        - key: app.kubernetes.io/name
                          operator: In
                          values:
                            - mysql-backup
                        - key: app.kubernetes.io/created-by
                          operator: In
                          values:
                            - moco
                    topologyKey: kubernetes.io/hostname
                  weight: 100
...

Credentials to access S3 bucket

Depending on your Kubernetes service provider and object storage, there are various ways to give credentials to access the object storage bucket.

For Amazon's Elastic Kubernetes Service (EKS) and S3 users, the easiest way is probably to use IAM Roles for Service Accounts (IRSA).

ref: IAM ROLES FOR SERVICE ACCOUNTS

Another popular way is to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables as shown in the above example.

Taking an emergency backup

You can take an emergency backup by creating a Job from the CronJob for backup.

$ kubectl create job --from=cronjob/moco-backup-foo emergency-backup

Restore

To restore data from a backup, create a new MySQLCluster with the spec.restore field as follows:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: backup
  name: target
spec:
  # restore field is not editable.
  # to modify parameters, delete and re-create MySQLCluster.
  restore:
    # The source MySQLCluster's name and namespace
    sourceName: source
    sourceNamespace: backup

    # The restore point-in-time in RFC3339 format.
    restorePoint: "2021-05-26T12:34:56Z"

    # jobConfig is the same in BackupPolicy
    jobConfig:
      serviceAccountName: backup-owner
      env:
      - name: AWS_ACCESS_KEY_ID
        value: minioadmin
      - name: AWS_SECRET_ACCESS_KEY
        value: minioadmin
      bucketConfig:
        bucketName: moco
        endpointURL: http://minio.default.svc:9000
        usePathStyle: true
      workVolume:
        emptyDir: {}
...

Further details

Read backup.md for further details.

Deleting the cluster

When you delete a MySQLCluster, all generated resources, including PersistentVolumeClaims created from the templates, are automatically removed.

If you want to keep the PersistentVolumeClaims, remove metadata.ownerReferences from them before you delete a MySQLCluster.
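
For example (a sketch; the PVC name follows the moco-test cluster used elsewhere in this document):

$ kubectl patch pvc mysql-data-moco-test-0 --type=json \
    -p '[{"op": "remove", "path": "/metadata/ownerReferences"}]'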

Status, metrics, and logs

Cluster status

You can see the health and availability status of MySQLCluster as follows:

$ kubectl get mysqlcluster
NAME   AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS
test   True        True      0         3

  • The cluster is available when the primary Pod is running and ready.
  • The cluster is healthy when there are no problems.
  • PRIMARY is the index of the current primary instance Pod.
  • SYNCED REPLICAS is the number of ready Pods.
  • ERRANT REPLICAS is the number of instances having errant transactions.

You can also use kubectl describe mysqlcluster to see the recent events on the cluster.

Pod status

MOCO adds a liveness probe and a readiness probe to the mysqld container to check the replication status in addition to the process status.

A replica Pod is ready only when it is replicating data from the primary without a significant delay. The default threshold of the delay is 60 seconds. The threshold can be configured as follows.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  maxDelaySeconds: 180
  ...

Unready replica Pods are automatically excluded from the load-balancing targets so that users will not read overly stale data.

Metrics

MOCO provides built-in support for collecting and exposing mysqld metrics using mysqld_exporter.

The following example YAML enables mysqld_exporter. spec.collectors is a list of mysqld_exporter flag names without the collect. prefix.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  collectors:
  - engine_innodb_status
  - info_schema.innodb_metrics
  podTemplate:
    ...

See metrics.md for all available metrics and how to collect them using Prometheus.

Logs

Error logs from mysqld can be viewed as follows:

$ kubectl logs moco-test-0 mysqld

Slow logs from mysqld can be viewed as follows:

$ kubectl logs moco-test-0 slow-log

Maintenance

Increasing the number of instances in the cluster

Edit spec.replicas field of MySQLCluster:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: test
spec:
  replicas: 5
  ...

You can only increase the number of instances in a MySQLCluster from 1 to 3 or 5, or from 3 to 5. Decreasing the number of instances is not allowed.

Switchover

Switchover is an operation to change the live primary to one of the replicas.

MOCO automatically switches the primary when the Pod of the primary instance is about to be deleted.

Users can manually trigger a switchover with kubectl moco switchover CLUSTER_NAME. Read kubectl-moco.md for details.
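
For example, to switch over the primary of the test cluster in the foo namespace:

$ kubectl moco -n foo switchover test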

Failover

Failover is an operation to replace the dead primary with the most advanced replica. MOCO automatically does this as soon as it detects that the primary is down.

The most advanced replica is the replica that has received the most up-to-date transactions from the dead primary. Since MOCO configures lossless semi-synchronous replication, the failover is guaranteed not to lose any user data.

After a failover, the old primary may become an errant replica as described above.

Upgrading mysql version

You can upgrade the MySQL version of a MySQL cluster as follows:

  1. Check that the cluster is healthy.
  2. Check release notes of MySQL for any incompatibilities between the current and the new versions.
  3. Edit the Pod template of the MySQLCluster and update mysqld container image:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  podTemplate:
    spec:
      containers:
      - name: mysqld
        # Edit the next line
        image: ghcr.io/cybozu-go/moco/mysql:8.0.35

You are advised to make backups and/or create a replica cluster before starting the upgrade process. Read upgrading.md for further details.

Re-initializing an errant replica

Delete the PVC and Pod of the errant replica, like this:

$ kubectl delete --wait=false pvc mysql-data-moco-test-0
$ kubectl delete --grace-period=1 pods moco-test-0

Depending on your Kubernetes version, the StatefulSet controller may create a pending Pod before the PVC gets deleted. Delete such pending Pods until the PVC is actually removed.

Stop Clustering and Reconciliation

In MOCO, you can optionally stop the clustering and reconciliation of a MySQLCluster.

To stop clustering and reconciliation, use the following commands.

$ kubectl moco stop clustering <CLUSTER_NAME>
$ kubectl moco stop reconciliation <CLUSTER_NAME>

To resume the stopped clustering and reconciliation, use the following commands.

$ kubectl moco start clustering <CLUSTER_NAME>
$ kubectl moco start reconciliation <CLUSTER_NAME>

You could use this feature in the following cases:

  1. To stop the replication of a MySQLCluster and perform a manual operation to align the GTID
    • Run the kubectl moco stop clustering command on the MySQLCluster where you want to stop the replication
  2. To suppress the full update of MySQLCluster that occurs during the upgrade of MOCO
    • Run the kubectl moco stop reconciliation command on the MySQLCluster on which you want to suppress the update

To check whether clustering and reconciliation are stopped, use kubectl get mysqlcluster. Moreover, while clustering is stopped, AVAILABLE and HEALTHY values will be Unknown.

$ kubectl get mysqlcluster
NAME   AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS   CLUSTERING ACTIVE   RECONCILE ACTIVE   LAST BACKUP
test   Unknown     Unknown   0         3                                   False               False              <no value>

The MOCO controller exposes the following metrics to indicate that clustering or reconciliation has been stopped. The value is 1 if clustering or reconciliation is stopped for the cluster, and 0 otherwise.

moco_cluster_clustering_stopped{name="mycluster", namespace="mynamespace"} 1
moco_cluster_reconciliation_stopped{name="mycluster", namespace="mynamespace"} 1

While clustering is stopped, MOCO halts monitoring of the cluster, and the values of the following metrics become NaN.

moco_cluster_available{name="test",namespace="default"} NaN
moco_cluster_healthy{name="test",namespace="default"} NaN
moco_cluster_ready_replicas{name="test",namespace="default"} NaN
moco_cluster_errant_replicas{name="test",namespace="default"} NaN

Advanced topics

Building custom image of mysqld

There are pre-built mysqld container images for MOCO at ghcr.io/cybozu-go/moco/mysql. Users can use one of these images to supply the mysqld container in MySQLCluster like this:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
spec:
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.35

If you want to build and use your own mysqld, read the rest of this document.

Dockerfile

The easiest way to build a custom mysqld for MOCO is to copy and edit our Dockerfile. You can find it under containers/mysql directory in github.com/cybozu-go/moco.

Keep the following points in mind:

  • ENTRYPOINT should be ["mysqld"]
  • USER should be 10000:10000
  • The sleep command must exist in one of the PATH directories.

How to build mysqld

On Ubuntu 20.04, you can build the source code as follows:

$ sudo apt-get update
$ sudo apt-get -y --no-install-recommends install build-essential libssl-dev \
    cmake libncurses5-dev libjemalloc-dev libnuma-dev libaio-dev pkg-config
$ curl -fsSL -O https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-boost-8.0.20.tar.gz
$ tar -x -z -f mysql-boost-8.0.20.tar.gz
$ cd mysql-8.0.20
$ mkdir bld
$ cd bld
$ cmake .. -DBUILD_CONFIG=mysql_release -DCMAKE_BUILD_TYPE=Release \
    -DWITH_BOOST=$(ls -d ../boost/boost_*) -DWITH_NUMA=1 -DWITH_JEMALLOC=1
$ make -j $(nproc)
$ make install

Customize default container

In addition to the containers added by the user, MOCO automatically adds its own containers to the Pod (e.g. agent, moco-init, and so on).

The MySQLCluster.spec.podTemplate.overwriteContainers field can be used to overwrite such containers. Currently, only container resources can be overwritten. overwriteContainers is only available in MySQLCluster v1beta2.

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: test
spec:
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.30
    overwriteContainers:
    - name: agent
      resources:
        requests:
          cpu: 50m

System containers

The following is a list of the system containers used by MOCO. Specifying a container name in overwriteContainers that is not listed here will result in an API validation error.

Name | Default CPU Requests/Limits | Default Memory Requests/Limits | Description
agent | 100m / 100m | 100Mi / 100Mi | MOCO's agent container running as a sidecar. refs: https://github.com/cybozu-go/moco-agent
moco-init | 100m / 100m | 300Mi / 300Mi | Initializes the MySQL data directory and creates a configuration snippet to give instance-specific configuration values such as server_id and admin_address.
slow-log | 100m / 100m | 20Mi / 20Mi | Sidecar container for outputting slow query logs.
mysqld-exporter | 200m / 200m | 100Mi / 100Mi | MySQL server exporter sidecar container.

Change the volumeClaimTemplates

MOCO supports changes to MySQLCluster .spec.volumeClaimTemplates.

When .spec.volumeClaimTemplates is changed, moco-controller will try to recreate the StatefulSet. This is because modification of volumeClaimTemplates in StatefulSet is currently not allowed.

The StatefulSet is re-created with the same behavior as kubectl delete sts moco-xxx --cascade=orphan, that is, without removing the running Pods.

NOTE: It may be possible to edit the StatefulSet directly in the future.

ref: https://github.com/kubernetes/enhancements/issues/661

When re-creating a StatefulSet, moco-controller performs no operation other than the volume expansion described below; it simply re-creates the StatefulSet. However, by specifying the --pvc-sync-annotation-keys and --pvc-sync-label-keys flags on the controller, you can designate annotations and labels to be synchronized from .spec.volumeClaimTemplates to the PVCs during the re-creation of the StatefulSet.
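
For example, a sketch that passes these flags through the Helm chart's extraArgs value; the annotation and label keys here are hypothetical:

$ cat > values.yaml <<'EOF'
extraArgs:
  - --pvc-sync-annotation-keys=example.com/backup-tier
  - --pvc-sync-label-keys=example.com/team
EOF
$ helm upgrade --namespace moco-system moco -f values.yaml moco/moco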

All other labels and annotations must be updated on the PVCs by the user. This restriction prevents unintended side effects when entities other than the moco-controller manipulate the PVC's metadata.

Metrics

The success or failure of re-creating a StatefulSet is reported to the user through the following metrics:

moco_cluster_statefulset_recreate_total{name="mycluster", namespace="mynamespace"} 3
moco_cluster_statefulset_recreate_errors_total{name="mycluster", namespace="mynamespace"} 1

If a StatefulSet fails to be re-created, moco_cluster_statefulset_recreate_errors_total is incremented on each reconciliation, so users can detect anomalies by monitoring this metric.

See the metrics documentation for more details.

Volume expansion

moco-controller automatically resizes the PVC when the size of the MySQLCluster volume claim is extended. If the volume plugin supports online file system expansion, the PVs used by the Pod will be expanded online.

To expand a volume, allowVolumeExpansion of the StorageClass must be true. moco-controller validates the request with an admission webhook and rejects it if volume expansion is not allowed.

If the volume plugin does not support online file system expansion, the Pod must be restarted for the volume expansion to take effect. This must be done manually by the user.
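
For example (a sketch), delete the Pods one at a time and wait for the cluster to become healthy before deleting the next one:

$ kubectl delete pod moco-test-0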

When moco-controller resizes a PVC, there may be a discrepancy between the size defined in the MySQLCluster and the actual PVC size, for example if you are using github.com/topolvm/pvc-autoresizer. In this case, moco-controller only updates the PVC if its actual size is smaller than the newly requested size.

Metrics

The success or failure of the PVC resizing is reported to the user through the following metrics:

moco_cluster_volume_resized_total{name="mycluster", namespace="mynamespace"} 4
moco_cluster_volume_resized_errors_total{name="mycluster", namespace="mynamespace"} 1

These metrics are incremented when a volume size change succeeds or fails. If the volume size change fails, moco_cluster_volume_resized_errors_total is incremented on each reconciliation, so users can detect anomalies by monitoring this metric.

See the metrics documentation for more details.

Volume reduction

MOCO supports PVC reduction, but unlike PVC expansion, the user must perform the operation manually.

The steps are as follows:

  1. The user modifies the .spec.volumeClaimTemplates of the MySQLCluster and sets a smaller volume size.
  2. MOCO updates the .spec.volumeClaimTemplates of the StatefulSet. This does not propagate to existing Pods, PVCs, or PVs.
  3. The user manually deletes the MySQL Pod & PVC.
  4. Wait for the Pod & PVC to be recreated by the statefulset-controller, and for MOCO to clone the data.
  5. Once the cluster becomes Healthy, the user deletes the next Pod and PVC.
  6. It is completed when all Pods and PVCs are recreated.

1. The user modifies the .spec.volumeClaimTemplates of the MySQLCluster and sets a smaller volume size

For example, the user modifies the .spec.volumeClaimTemplates of the MySQLCluster as follows:

  apiVersion: moco.cybozu.com/v1beta2
  kind: MySQLCluster
  metadata:
    namespace: default
    name: test
  spec:
    replicas: 3
    podTemplate:
      spec:
        containers:
        - name: mysqld
          image: ghcr.io/cybozu-go/moco/mysql:8.0.30
    volumeClaimTemplates:
    - metadata:
        name: mysql-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
-           storage: 1Gi
+           storage: 500Mi

2. MOCO updates the .spec.volumeClaimTemplates of the StatefulSet. This does not propagate to existing Pods, PVCs, or PVs

The moco-controller updates the .spec.volumeClaimTemplates of the StatefulSet. Direct modification of a StatefulSet's .spec.volumeClaimTemplates is not allowed, so this change is achieved by re-creating the StatefulSet. At this time, only the StatefulSet is re-created; the Pods and PVCs are not deleted.

3. The user manually deletes the MySQL Pod & PVC

The user manually deletes the PVC and Pod. Use the following command to delete them:

$ kubectl delete --wait=false pvc <pvc-name>
$ kubectl delete --grace-period=1 pod <pod-name>

4. Wait for the Pod & PVC to be recreated by the statefulset-controller, and for MOCO to clone the data

The statefulset-controller recreates the Pod and PVC, creating a new PVC with the reduced size. Once MOCO successfully starts the Pod, it begins cloning the data.

$ kubectl get mysqlcluster,po,pvc
NAME                                AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS   LAST BACKUP
mysqlcluster.moco.cybozu.com/test   True        False     0         2                                   <no value>

NAME              READY   STATUS     RESTARTS   AGE
pod/moco-test-0   3/3     Running    0          2m14s
pod/moco-test-1   3/3     Running    0          114s
pod/moco-test-2   0/3     Init:1/2   0          7s

NAME                                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mysql-data-moco-test-0   Bound    pvc-03c73525-0d6d-49de-b68a-f8af4c4c7faa   1Gi        RWO            standard       2m14s
persistentvolumeclaim/mysql-data-moco-test-1   Bound    pvc-73c26baa-3432-4c85-b5b6-875ffd2456d9   1Gi        RWO            standard       114s
persistentvolumeclaim/mysql-data-moco-test-2   Bound    pvc-779b5b3c-3efc-4048-a549-a4bd2d74ed4e   500Mi      RWO            standard       7s

5. Once the cluster becomes Healthy, the user deletes the next Pod and PVC

The user waits until the MySQLCluster state becomes Healthy, and then deletes the next Pod and PVC.

$ kubectl get mysqlcluster
NAME                                AVAILABLE   HEALTHY   PRIMARY   SYNCED REPLICAS   ERRANT REPLICAS   LAST BACKUP
mysqlcluster.moco.cybozu.com/test   True        True      1         3                                   <no value>

6. It is completed when all Pods and PVCs are recreated

Repeat steps 3 to 5 until all Pods and PVCs are recreated.

References

Known issues

This document lists the known issues of MOCO.

Multi-threaded replication

Status: not fixed as of MOCO v0.9.5

If you use MOCO with MySQL version 8.0.25 or earlier, you should not configure the replicas with slave_parallel_workers > 1. Multi-threaded replication will cause the replica to fail to resume after a crash.

This issue is registered as #322 and will be addressed in the near future.

Custom resources

Custom Resources

Sub Resources

BackupStatus

BackupStatus represents the status of the last successful backup.

Field | Description | Scheme | Required
time | The time of the backup. This is used to generate object keys of backup files in a bucket. | metav1.Time | true
elapsed | Elapsed is the time spent on the backup. | metav1.Duration | true
sourceIndex | SourceIndex is the ordinal of the backup source instance. | int | true
sourceUUID | SourceUUID is the server_uuid of the backup source instance. | string | true
uuidSet | UUIDSet is the server_uuid set of all candidate instances for the backup source. | map[string]string | true
binlogFilename | BinlogFilename is the binlog filename that the backup source instance was writing to at the backup. | string | true
gtidSet | GTIDSet is the GTID set of the full dump of database. | string | true
dumpSize | DumpSize is the size in bytes of a full dump of database stored in an object storage bucket. | int64 | true
binlogSize | BinlogSize is the size in bytes of a tarball of binlog files stored in an object storage bucket. | int64 | true
workDirUsage | WorkDirUsage is the max usage in bytes of the working directory. | int64 | true
warnings | Warnings are list of warnings from the last backup, if any. | []string | true

Back to Custom Resources

MySQLCluster

MySQLCluster is the Schema for the mysqlclusters API

Field | Description | Scheme | Required
metadata |  | metav1.ObjectMeta | false
spec |  | MySQLClusterSpec | false
status |  | MySQLClusterStatus | false

Back to Custom Resources

MySQLClusterList

MySQLClusterList contains a list of MySQLCluster

Field | Description | Scheme | Required
metadata |  | metav1.ListMeta | false
items |  | []MySQLCluster | true

Back to Custom Resources

MySQLClusterSpec

MySQLClusterSpec defines the desired state of MySQLCluster

Field | Description | Scheme | Required
replicas | Replicas is the number of instances. Available values are positive odd numbers. | int32 | false
podTemplate | PodTemplate is a Pod template for MySQL server container. | PodTemplateSpec | true
volumeClaimTemplates | VolumeClaimTemplates is a list of PersistentVolumeClaim templates for MySQL server container. A claim named "mysql-data" must be included in the list. | []PersistentVolumeClaim | true
primaryServiceTemplate | PrimaryServiceTemplate is a Service template for primary. | *ServiceTemplate | false
replicaServiceTemplate | ReplicaServiceTemplate is a Service template for replica. | *ServiceTemplate | false
mysqlConfigMapName | MySQLConfigMapName is a ConfigMap name of MySQL config. | *string | false
replicationSourceSecretName | ReplicationSourceSecretName is a Secret name which contains replication source info. If this field is given, the MySQLCluster works as an intermediate primary. | *string | false
collectors | Collectors is the list of collector flag names of mysqld_exporter. If this field is not empty, MOCO adds mysqld_exporter as a sidecar to collect and export mysqld metrics in Prometheus format. See https://github.com/prometheus/mysqld_exporter/blob/master/README.md#collector-flags for flag names. Example: ["engine_innodb_status", "info_schema.innodb_metrics"] | []string | false
serverIDBase | ServerIDBase, if set, will become the base number of server-id of each MySQL instance of this cluster. For example, if this is 100, the server-ids will be 100, 101, 102, and so on. If the field is not given or zero, MOCO automatically sets a random positive integer. | int32 | false
maxDelaySeconds | MaxDelaySeconds configures the readiness probe of mysqld container. For a replica mysqld instance, if it is delayed to apply transactions over this threshold, the mysqld instance will be marked as non-ready. The default is 60 seconds. Setting this field to 0 disables the delay check in the probe. | *int | false
startupWaitSeconds | StartupWaitSeconds is the maximum duration to wait for mysqld container to start working. The default is 3600 seconds. | int32 | false
logRotationSchedule | LogRotationSchedule specifies the schedule to rotate MySQL logs. If not set, the default is to rotate logs every 5 minutes. See https://pkg.go.dev/github.com/robfig/cron/v3#hdr-CRON_Expression_Format for the field format. | string | false
backupPolicyName | The name of BackupPolicy custom resource in the same namespace. If this is set, MOCO creates a CronJob to take backup of this MySQL cluster periodically. | *string | false
restore | Restore is the specification to perform Point-in-Time-Recovery from existing cluster. If this field is not null, MOCO restores the data as specified and create a new cluster with the data. This field is not editable. | *RestoreSpec | false
disableSlowQueryLogContainer | DisableSlowQueryLogContainer controls whether to add a sidecar container named "slow-log" to output slow logs as the containers output. If set to true, the sidecar container is not added. The default is false. | bool | false

Back to Custom Resources

MySQLClusterStatus

MySQLClusterStatus defines the observed state of MySQLCluster

Field | Description | Scheme | Required
conditions | Conditions is an array of conditions. | []metav1.Condition | false
currentPrimaryIndex | CurrentPrimaryIndex is the index of the current primary Pod in StatefulSet. Initially, this is zero. | int | true
syncedReplicas | SyncedReplicas is the number of synced instances including the primary. | int | false
errantReplicas | ErrantReplicas is the number of instances that have errant transactions. | int | false
errantReplicaList | ErrantReplicaList is the list of indices of errant replicas. | []int | false
backup | Backup is the status of the last successful backup. | BackupStatus | true
restoredTime | RestoredTime is the time when the cluster data is restored. | *metav1.Time | false
cloned | Cloned indicates if the initial cloning from an external source has been completed. | bool | false
reconcileInfo | ReconcileInfo represents version information for reconciler. | ReconcileInfo | true

Back to Custom Resources

ObjectMeta

ObjectMeta is metadata of objects. This is partially copied from metav1.ObjectMeta.

Field | Description | Scheme | Required
name | Name is the name of the object. | string | false
labels | Labels is a map of string keys and values. | map[string]string | false
annotations | Annotations is a map of string keys and values. | map[string]string | false

Back to Custom Resources

OverwriteContainer

OverwriteContainer defines the container spec used for overwriting.

Field | Description | Scheme | Required
name | Name of the container to overwrite. | OverwriteableContainerName | true
resources | Resources is the container resource to be overwritten. | *ResourceRequirementsApplyConfiguration | false

Back to Custom Resources

PersistentVolumeClaim

PersistentVolumeClaim is a user's request for and claim to a persistent volume. This is slightly modified from corev1.PersistentVolumeClaim.

Field | Description | Scheme | Required
metadata | Standard object's metadata. | ObjectMeta | true
spec | Spec defines the desired characteristics of a volume requested by a pod author. | PersistentVolumeClaimSpecApplyConfiguration | true

Back to Custom Resources

PodTemplateSpec

PodTemplateSpec describes the data a pod should have when created from a template. This is slightly modified from corev1.PodTemplateSpec.

Field | Description | Scheme | Required
metadata | Standard object's metadata. The name in this metadata is ignored. | ObjectMeta | false
spec | Specification of the desired behavior of the pod. The name of the MySQL server container in this spec must be mysqld. | PodSpecApplyConfiguration | true
overwriteContainers | OverwriteContainers overwrites the container definitions provided by default by the system. | []OverwriteContainer | false

Back to Custom Resources

ReconcileInfo

ReconcileInfo is the type to record the last reconciliation information.

Field | Description | Scheme | Required
generation | Generation is the metadata.generation value of the last reconciliation. See also https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#status-subresource | int64 | false
reconcileVersion | ReconcileVersion is the version of the operator reconciler. | int | true

Back to Custom Resources

RestoreSpec

RestoreSpec represents a set of parameters for Point-in-Time Recovery.

Field | Description | Scheme | Required
sourceName | SourceName is the name of the source MySQLCluster. | string | true
sourceNamespace | SourceNamespace is the namespace of the source MySQLCluster. | string | true
restorePoint | RestorePoint is the target date and time to restore data. The format is RFC3339. e.g. "2006-01-02T15:04:05Z" | metav1.Time | true
jobConfig | Specifies parameters for restore Pod. | JobConfig | true

Back to Custom Resources

ServiceTemplate

ServiceTemplate defines the desired spec and annotations of Service

Field | Description | Scheme | Required
metadata | Standard object's metadata. Only annotations and labels are valid. | ObjectMeta | false
spec | Spec is the ServiceSpec | *ServiceSpecApplyConfiguration | false

Back to Custom Resources

BucketConfig

BucketConfig is a set of parameters to access an object storage bucket.

Field | Description | Scheme | Required
bucketName | The name of the bucket | string | true
region | The region of the bucket. This can also be set through AWS_REGION environment variable. | string | false
endpointURL | The API endpoint URL. Set this for non-S3 object storages. | string | false
usePathStyle | Allows you to enable the client to use path-style addressing, i.e., https?://ENDPOINT/BUCKET/KEY. By default, a virtual-host addressing is used (https?://BUCKET.ENDPOINT/KEY). | bool | false
backendType | BackendType is an identifier for the object storage to be used. | string | false
caCert | Path to SSL CA certificate file used in addition to system default. | string | false

Back to Custom Resources

JobConfig

JobConfig is a set of parameters for backup and restore job Pods.

Field | Description | Scheme | Required
serviceAccountName | ServiceAccountName specifies the ServiceAccount to run the Pod. | string | true
bucketConfig | Specifies how to access an object storage bucket. | BucketConfig | true
workVolume | WorkVolume is the volume source for the working directory. Since the backup or restore task can use a lot of bytes in the working directory, you should always give a volume with enough capacity. The recommended volume source is a generic ephemeral volume. https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes | VolumeSourceApplyConfiguration | true
threads | Threads is the number of threads used for backup or restoration. | int | false
cpu | CPU is the amount of CPU requested for the Pod. | *resource.Quantity | false
maxCpu | MaxCPU is the amount of maximum CPU for the Pod. | *resource.Quantity | false
memory | Memory is the amount of memory requested for the Pod. | *resource.Quantity | false
maxMemory | MaxMemory is the amount of maximum memory for the Pod. | *resource.Quantity | false
envFrom | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvFromSourceApplyConfiguration | false
env | List of environment variables to set in the container. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvVarApplyConfiguration | false
affinity | If specified, the pod's scheduling constraints. | *AffinityApplyConfiguration | false
volumes | Volumes defines the list of volumes that can be mounted by containers in the Pod. | []VolumeApplyConfiguration | false
volumeMounts | VolumeMounts describes a list of volume mounts that are to be mounted in a container. | []VolumeMountApplyConfiguration | false

Back to Custom Resources

Custom Resources

Sub Resources

BackupPolicy

BackupPolicy is a namespaced resource that should be referenced from MySQLCluster.

Field | Description | Scheme | Required
metadata |  | metav1.ObjectMeta | false
spec |  | BackupPolicySpec | true

Back to Custom Resources

BackupPolicyList

BackupPolicyList contains a list of BackupPolicy

Field | Description | Scheme | Required
metadata |  | metav1.ListMeta | false
items |  | []BackupPolicy | true

Back to Custom Resources

BackupPolicySpec

BackupPolicySpec defines the configuration items for MySQLCluster backup.

The following fields will be copied to CronJob.spec:

  • Schedule
  • StartingDeadlineSeconds
  • ConcurrencyPolicy
  • SuccessfulJobsHistoryLimit
  • FailedJobsHistoryLimit

The following fields will be copied to CronJob.spec.jobTemplate:

  • ActiveDeadlineSeconds
  • BackoffLimit

Field | Description | Scheme | Required
schedule | The schedule in Cron format for periodic backups. See https://en.wikipedia.org/wiki/Cron | string | true
jobConfig | Specifies parameters for backup Pod. | JobConfig | true
startingDeadlineSeconds | Optional deadline in seconds for starting the job if it misses scheduled time for any reason. Missed jobs executions will be counted as failed ones. | *int64 | false
concurrencyPolicy | Specifies how to treat concurrent executions of a Job. Valid values are: "Allow" (default): allows CronJobs to run concurrently; "Forbid": forbids concurrent runs, skipping next run if previous run hasn't finished yet; "Replace": cancels currently running job and replaces it with a new one | batchv1.ConcurrencyPolicy | false
activeDeadlineSeconds | Specifies the duration in seconds relative to the startTime that the job may be continuously active before the system tries to terminate it; value must be positive integer. If a Job is suspended (at creation or through an update), this timer will effectively be stopped and reset when the Job is resumed again. | *int64 | false
backoffLimit | Specifies the number of retries before marking this job failed. Defaults to 6 | *int32 | false
successfulJobsHistoryLimit | The number of successful finished jobs to retain. This is a pointer to distinguish between explicit zero and not specified. Defaults to 3. | *int32 | false
failedJobsHistoryLimit | The number of failed finished jobs to retain. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1. | *int32 | false

Back to Custom Resources

BucketConfig

BucketConfig is a set of parameters to access an object storage bucket.

Field | Description | Scheme | Required
bucketName | The name of the bucket | string | true
region | The region of the bucket. This can also be set through AWS_REGION environment variable. | string | false
endpointURL | The API endpoint URL. Set this for non-S3 object storages. | string | false
usePathStyle | Allows you to enable the client to use path-style addressing, i.e., https?://ENDPOINT/BUCKET/KEY. By default, a virtual-host addressing is used (https?://BUCKET.ENDPOINT/KEY). | bool | false
backendType | BackendType is an identifier for the object storage to be used. | string | false
caCert | Path to SSL CA certificate file used in addition to system default. | string | false

Back to Custom Resources

JobConfig

JobConfig is a set of parameters for backup and restore job Pods.

| Field | Description | Scheme | Required |
| ----- | ----------- | ------ | -------- |
| serviceAccountName | ServiceAccountName specifies the ServiceAccount to run the Pod. | string | true |
| bucketConfig | Specifies how to access an object storage bucket. | BucketConfig | true |
| workVolume | WorkVolume is the volume source for the working directory. Since the backup or restore task can use a lot of bytes in the working directory, you should always give a volume with enough capacity. The recommended volume source is a generic ephemeral volume. See https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes | VolumeSourceApplyConfiguration | true |
| threads | Threads is the number of threads used for backup or restoration. | int | false |
| cpu | CPU is the amount of CPU requested for the Pod. | *resource.Quantity | false |
| maxCpu | MaxCPU is the amount of maximum CPU for the Pod. | *resource.Quantity | false |
| memory | Memory is the amount of memory requested for the Pod. | *resource.Quantity | false |
| maxMemory | MaxMemory is the amount of maximum memory for the Pod. | *resource.Quantity | false |
| envFrom | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvFromSourceApplyConfiguration | false |
| env | List of environment variables to set in the container. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvVarApplyConfiguration | false |
| affinity | If specified, the pod's scheduling constraints. | *AffinityApplyConfiguration | false |
| volumes | Volumes defines the list of volumes that can be mounted by containers in the Pod. | []VolumeApplyConfiguration | false |
| volumeMounts | VolumeMounts describes a list of volume mounts that are to be mounted in a container. | []VolumeMountApplyConfiguration | false |
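
The following is a sketch of a jobConfig that uses the recommended generic ephemeral volume as workVolume; the storage class, capacity, and resource values are placeholders:

jobConfig:
  serviceAccountName: backup-owner
  threads: 4
  memory: 1Gi
  bucketConfig:
    bucketName: moco-backups
  workVolume:
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: standard
          resources:
            requests:
              storage: 100Gi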

Back to Custom Resources

Commands

kubectl moco plugin

kubectl-moco is a kubectl plugin for MOCO.

kubectl moco [global options] <subcommand> [sub options] args...

Global options

Global options are compatible with kubectl. For example, the following options are available.

| Global options | Default value | Description |
| -------------- | ------------- | ----------- |
| --kubeconfig | $HOME/.kube/config | Path to the kubeconfig file to use for CLI requests. |
| -n, --namespace | default | If present, the namespace scope for this CLI request. |

MySQL users

You can choose one of the following users as the --mysql-user option value.

| Name | Description |
| ---- | ----------- |
| moco-readonly | A read-only user. |
| moco-writable | A user that can edit users, databases, and tables. |
| moco-admin | The super-user. |

kubectl moco mysql [options] CLUSTER_NAME [-- mysql args...]

Run mysql command in a specified MySQL instance.

| Options | Default value | Description |
| ------- | ------------- | ----------- |
| -u, --mysql-user | moco-readonly | Login as the specified user |
| --index | index of the primary | Index of the target mysql instance |
| -i, --stdin | false | Pass stdin to the mysql container |
| -t, --tty | false | Stdin is a TTY |

Examples

This executes SELECT VERSION() on the primary instance in mycluster in foo namespace:

$ kubectl moco -n foo mysql mycluster -- -N -e 'SELECT VERSION()'

To execute SQL from a file:

$ cat sample.sql | kubectl moco -n foo mysql -u moco-writable -i mycluster

To run mysql interactively for the instance 2 in mycluster in the default namespace:

$ kubectl moco mysql --index 2 -it mycluster

kubectl moco credential [options] CLUSTER_NAME

Fetch the credential information of a specified user

| Options | Default value | Description |
| ------- | ------------- | ----------- |
| -u, --mysql-user | moco-readonly | Fetch the credential of the specified user |
| --format | plain | Output format: plain or mycnf |
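
For example, to print the moco-admin credential for mycluster in the foo namespace in my.cnf format:

$ kubectl moco -n foo credential -u moco-admin --format mycnf mycluster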

kubectl moco switchover CLUSTER_NAME

Switch the primary instance to one of the replicas.
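
For example, to switch the primary instance of mycluster in the foo namespace to one of its replicas:

$ kubectl moco -n foo switchover mycluster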

moco-controller

moco-controller controls MySQL clusters on Kubernetes.

Environment variables

| Name | Required | Description |
| ---- | -------- | ----------- |
| POD_NAMESPACE | Yes | The namespace name where moco-controller runs. |

Command line flags

Flags:
      --add_dir_header                    If true, adds the file directory to the header of the log messages
      --agent-image string                The image of moco-agent sidecar container
      --alsologtostderr                   log to standard error as well as files (no effect when -logtostderr=true)
      --apiserver-qps-throttle int        The maximum QPS to the API server. (default 20)
      --backup-image string               The image of moco-backup container
      --cert-dir string                   webhook certificate directory
      --check-interval duration           Interval of cluster maintenance (default 1m0s)
      --fluent-bit-image string           The image of fluent-bit sidecar container
      --grpc-cert-dir string              gRPC certificate directory (default "/grpc-cert")
      --health-probe-addr string          Listen address for health probes (default ":8081")
  -h, --help                              help for moco-controller
      --leader-election-id string         ID for leader election by controller-runtime (default "moco")
      --log_backtrace_at traceLocation    when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                    If non-empty, write log files in this directory (no effect when -logtostderr=true)
      --log_file string                   If non-empty, use this log file (no effect when -logtostderr=true)
      --log_file_max_size uint            Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                       log to standard error instead of files (default true)
      --max-concurrent-reconciles int     The maximum number of concurrent reconciles which can be run (default 8)
      --metrics-addr string               Listen address for metric endpoint (default ":8080")
      --mysqld-exporter-image string      The image of mysqld_exporter sidecar container
      --one_output                        If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
      --pprof-addr string                 Listen address for pprof endpoints. pprof is disabled by default
      --skip_headers                      If true, avoid header prefixes in the log messages
      --skip_log_headers                  If true, avoid headers when opening log files (no effect when -logtostderr=true)
      --stderrthreshold severity          logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=false) (default 2)
  -v, --v Level                           number for the log level verbosity
      --version                           version for moco-controller
      --vmodule moduleSpec                comma-separated list of pattern=N settings for file-filtered logging
      --webhook-addr string               Listen address for the webhook endpoint (default ":9443")
      --zap-devel                         Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
      --zap-encoder encoder               Zap log encoding (one of 'json' or 'console')
      --zap-log-level level               Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
      --zap-stacktrace-level level        Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
      --zap-time-encoding time-encoding   Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.

moco-backup

moco-backup command is used in ghcr.io/cybozu-go/moco-backup container. Normally, users need not take care of this command.

Environment variables

moco-backup takes the configuration for the S3 API from environment variables. For details, read the documentation of EnvConfig in github.com/aws/aws-sdk-go-v2/config.

It also requires MYSQL_PASSWORD environment variable to be set.

Global command-line flags

Global Flags:
      --endpoint string   S3 API endpoint URL
      --region string     AWS region
      --threads int       The number of threads to be used (default 4)
      --use-path-style    Use path-style S3 API
      --work-dir string   The writable working directory (default "/work")
      --ca-cert string    Path to SSL CA certificate file used in addition to system default

Subcommands

backup subcommand

Usage: moco-backup backup BUCKET NAMESPACE NAME

  • BUCKET: The bucket name.
  • NAMESPACE: The namespace of the MySQLCluster.
  • NAME: The name of the MySQLCluster.

restore subcommand

Usage: moco-backup restore BUCKET SOURCE_NAMESPACE SOURCE_NAME NAMESPACE NAME YYYYMMDD-hhmmss

  • BUCKET: The bucket name.
  • SOURCE_NAMESPACE: The source MySQLCluster's namespace.
  • SOURCE_NAME: The source MySQLCluster's name.
  • NAMESPACE: The target MySQLCluster's namespace.
  • NAME: The target MySQLCluster's name.
  • YYYYMMDD-hhmmss: The point-in-time to restore data. e.g. 20210523-150423

Metrics

moco-controller

moco-controller provides the following kinds of metrics in Prometheus format. Aside from the standard Go runtime and process metrics, it exposes metrics related to controller-runtime, MySQL clusters, and backups.

MySQL clusters

All these metrics are prefixed with moco_cluster_ and have name and namespace labels.

| Name | Description | Type |
| ---- | ----------- | ---- |
| checks_total | The number of times MOCO checked the cluster | Counter |
| errors_total | The number of times MOCO encountered errors when managing the cluster | Counter |
| available | 1 if the cluster is available, 0 otherwise | Gauge |
| healthy | 1 if the cluster is running without any problems, 0 otherwise | Gauge |
| switchover_total | The number of times MOCO changed the live primary instance | Counter |
| failover_total | The number of times MOCO changed the failed primary instance | Counter |
| replicas | The number of mysqld instances in the cluster | Gauge |
| ready_replicas | The number of ready mysqld Pods in the cluster | Gauge |
| clustering_stopped | 1 if clustering of the cluster is stopped, 0 otherwise | Gauge |
| reconciliation_stopped | 1 if reconciliation of the cluster is stopped, 0 otherwise | Gauge |
| errant_replicas | The number of mysqld instances that have errant transactions | Gauge |
| processing_time_seconds | The length of time in seconds spent processing the cluster | Histogram |
| volume_resized_total | The number of successful volume resizes | Counter |
| volume_resized_errors_total | The number of failed volume resizes | Counter |
| statefulset_recreate_total | The number of successful StatefulSet recreates | Counter |
| statefulset_recreate_errors_total | The number of failed StatefulSet recreates | Counter |

Backup

All these metrics are prefixed with moco_backup_ and have name and namespace labels.

| Name | Description | Type |
| ---- | ----------- | ---- |
| timestamp | The number of seconds since January 1, 1970 UTC of the last successful backup | Gauge |
| elapsed_seconds | The number of seconds taken for the last backup | Gauge |
| dump_bytes | The size of compressed full backup data | Gauge |
| binlog_bytes | The size of compressed binlog files | Gauge |
| workdir_usage_bytes | The maximum usage of the working directory | Gauge |
| warnings | The number of warnings in the last successful backup | Gauge |
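
As an illustration, these metrics can drive simple Prometheus alerting rules. The following sketch is not part of MOCO; the expressions and thresholds are arbitrary examples:

groups:
- name: moco
  rules:
  - alert: MySQLClusterUnavailable
    expr: moco_cluster_available == 0
    for: 5m
    labels:
      severity: critical
  - alert: MySQLBackupTooOld
    expr: time() - moco_backup_timestamp > 2 * 86400
    labels:
      severity: warning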

MySQL instance

For each mysqld instance, moco-agent exposes a set of metrics. Read github.com/cybozu-go/moco-agent/blob/main/docs/metrics.md for details.

Also, if you give a set of collector flag names to spec.collectors of MySQLCluster, a sidecar container running mysqld_exporter exposes the collected metrics for each mysqld instance.
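
A sketch of such a configuration follows; the collector names are examples of mysqld_exporter collector flag names and should be adjusted to your needs:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
...
spec:
  collectors:
  - engine_innodb_status
  - info_schema.innodb_metrics
  ...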

Scrape rules

This is an example kubernetes_sd_config for Prometheus to collect all MOCO & MySQL metrics.

scrape_configs:
- job_name: 'moco-controller'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_label_app_kubernetes_io_component,__meta_kubernetes_pod_container_port_name]
    action: keep
    regex: moco-system;moco-controller;metrics

- job_name: 'moco-agent'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_container_port_name,__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: keep
    regex: mysql;agent-metrics;moco-.*
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace

- job_name: 'moco-mysql'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_container_port_name,__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: keep
    regex: mysql;mysqld-metrics;moco-.*
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
    action: replace
    target_label: name
  - source_labels: [__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
    action: replace
    target_label: index
    regex: .*-([0-9])
  - source_labels: [__meta_kubernetes_pod_label_moco_cybozu_com_role]
    action: replace
    target_label: role

The collected metrics should have these labels:

  • namespace: MySQLCluster's metadata.namespace
  • name: MySQLCluster's metadata.name
  • index: The ordinal of MySQL instance Pod

Design notes


The purpose of this document is to describe the backgrounds and the goals of MOCO. Implementation details are described in other documents.

Motivation

We are creating our own Kubernetes operator for clustering MySQL instances for the following reasons:

Firstly, our application requires strict compatibility with traditional MySQL. Although recent MySQL provides an advanced clustering solution called group replication that is based on Paxos, we cannot use it because of the various limitations that come with group replication.

Secondly, we want a Kubernetes-native operator that is as simple as possible. For example, we can use a Kubernetes Service to load-balance read queries to multiple replicas. Also, we do not want to support non-GTID based replication.

Lastly, none of the existing operators could satisfy our requirements.

Goals

  • Manage primary-replica style clustering of MySQL instances.
    • The primary instance is the only instance that allows writes.
    • Replica instances replicate data from the primary and are read-only.
  • Support replication from an external MySQL instance.
  • Support all the four transaction isolation levels.
  • No split-brain.
  • Allow large transactions.
  • Upgrade the operator without restarting MySQL Pods.
  • Safe and automatic upgrading of MySQL version.
  • Support automatic primary selection and switchover.
  • Support automatic failover.
  • Backup and restore features.
    • Support point-in-time recovery (PiTR).
  • Tenant users can specify the following parameters:
    • The version of MySQL instances.
    • The number of processor cores for each MySQL instance.
    • The amount of memory for each MySQL instance.
    • The amount of backing storage for each MySQL instance.
    • The number of replicas in the MySQL cluster.
    • Custom configuration parameters.
  • Allow CREATE / DROP TEMPORARY TABLE during a transaction.

Non-goals

  • Support for older MySQL versions (5.6, 5.7)

    As a late comer, we focus our development effort on the latest MySQL. This simplifies things and allows us to use advanced mechanisms such as CLONE INSTANCE.

  • Node fencing

    Fencing is a technique to safely isolate a failed Node. MOCO does not rely on Node fencing, as fencing should be done externally.

    We can still implement failover in a safe way by configuring semi-sync parameters appropriately.

How MOCO reconciles MySQLCluster

MOCO creates and updates a StatefulSet and related resources for each MySQLCluster custom resource. This document describes how and when MOCO updates them.

Reconciler versions

MOCO's reconciliation routine should be consistent to avoid frequent updates.

That said, we may need to modify the reconciliation process in the future. To avoid updating the StatefulSet every time the process changes, MOCO keeps multiple versions of reconcilers.

For example, if a MySQLCluster is reconciled with version 1 of the reconciler, MOCO will keep using the version 1 reconciler to reconcile the MySQLCluster.

If the user edits MySQLCluster's spec field, MOCO can reconcile the MySQLCluster with the latest reconciler, for example version 2, because the user shall be ready for mysqld restarts.

The update policy of moco-agent container

We shall try to avoid updating moco-agent as much as possible.

The figure below illustrates the overview of resources related to clustering MySQL instances.

Overview of clustering related resources

StatefulSet

MOCO tries not to update the StatefulSet frequently. It updates the StatefulSet only when the update is a must.

The conditions for StatefulSet update

The StatefulSet will be updated when:

  • Some fields under spec of MySQLCluster are modified.
  • my.cnf for mysqld is updated.
  • the version of the reconciler used to reconcile the StatefulSet is obsolete.
  • the image of moco-agent given to the controller is updated.
  • the image of mysqld_exporter given to the controller is updated.

When the StatefulSet is not updated

  • the image of fluent-bit given to the controller is changed.
    • because the controller does not depend on fluent-bit.

The fluent-bit sidecar container is updated only when some fields under spec of MySQLCluster are modified.

Status about StatefulSet

  • In MySQLCluster.Status.Condition, there is a condition named StatefulSetReady.
  • This indicates the readiness of the StatefulSet.
  • The condition will be True when the rolling update of the StatefulSet has completely finished.

Secrets

MOCO generates random passwords for users that MOCO uses to access MySQL.

The generated passwords are stored in two Secrets. One is in the same namespace as moco-controller, and the other is in the namespace of MySQLCluster.

Certificate

MOCO creates a Certificate in the same namespace as moco-controller to issue a TLS certificate for moco-agent.

After cert-manager issues a TLS certificate and creates a Secret for it, MOCO copies the Secret to the namespace of MySQLCluster. For details, read security.md.

Service

MOCO creates three Services for each MySQLCluster, that is:

  • A headless Service, required for every StatefulSet
  • A Service for the primary mysqld instance
  • A Service for replica mysqld instances

The Services' labels, annotations, and spec fields can be customized with MySQLCluster's spec.primaryServiceTemplate and spec.replicaServiceTemplate field. The spec.primaryServiceTemplate configures the Service for the primary mysqld instance and the spec.replicaServiceTemplate configures the Service for the replica mysqld instances.

The following fields in Service spec may not be customized, though.

  • clusterIP
  • ports
  • selector

ConfigMap

MOCO creates and updates a ConfigMap for my.cnf. The name of this ConfigMap is calculated from the contents of my.cnf that may be changed by users.

MOCO deletes old ConfigMaps of my.cnf after a new ConfigMap for my.cnf is created.

If the cluster does not disable a sidecar container for slow query logs, MOCO creates a ConfigMap for the sidecar.

PodDisruptionBudget

MOCO creates a PodDisruptionBudget for each MySQLCluster to prevent the number of available semi-sync replica servers from becoming too small.

The spec.maxUnavailable value is calculated from MySQLCluster's spec.replicas as follows:

`spec.maxUnavailable` = floor(`spec.replicas` / 2)
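
For example, a MySQLCluster with spec.replicas set to 3 gets a PodDisruptionBudget with spec.maxUnavailable set to 1, and one with 5 replicas gets spec.maxUnavailable set to 2.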

If spec.replicas is 1, MOCO does not create a PDB.

ServiceAccount

MOCO creates a ServiceAccount for Pods of the StatefulSet. The ServiceAccount is not bound to any Roles/ClusterRoles.

See backup.md for the overview of the backup and restoration mechanism.

CronJob

This is the only resource created when backup is enabled for MySQLCluster.

If the backup is disabled, the CronJob is deleted.

Job

To restore data from a backup, MOCO creates a Job. MOCO deletes the Job after the Job finishes successfully.

If the Job fails, MOCO leaves the Job.

Status of Reconciliation

  • In MySQLCluster.Status.Condition, there is a condition named ReconcileSuccess.
  • This indicates the status of reconciliation.
  • The condition will be True when the reconcile function successfully finishes.

How MOCO maintains MySQL clusters

For each MySQLCluster, MOCO creates and maintains a set of mysqld instances. The set contains one primary instance and may contain multiple replica instances depending on the spec.replicas value of MySQLCluster.

This document describes how MOCO does this job safely.

Terminology

  • Replication: GTID-based replication between mysqld instances.
  • Cluster: a group of mysqld instances that replicate data between them.
  • Primary (instance): a single source instance of mysqld in a cluster.
  • Replica (instance): a read-only instance of mysqld that synchronizes data with the primary instance.
  • Intermediate primary: a special primary instance that replicates data from an external mysqld.
  • Errant transaction: a transaction that exists only on a replica instance.
  • Errant replica: a replica instance that has errant transactions.
  • Switchover: operation to change a live primary to a replica and promote a replica to the new primary.
  • Failover: operation to replace a dead primary with a replica.

Prerequisites

MySQLCluster allows positive odd numbers for spec.replicas value. If 1, MOCO runs a single mysqld instance without configuring replication. If 3 or greater, MOCO chooses a mysqld instance as a primary, writable instance and configures all other instances as replicas of the primary instance.

status.currentPrimaryIndex in MySQLCluster is used to record the current chosen primary instance. Initially, status.currentPrimaryIndex is zero and therefore the index of the primary instance is zero.

As a special case, if spec.replicationSourceSecretName is set for MySQLCluster, the primary instance is configured as a replica of an external MySQL server. In this case, the primary instance will not be writable. We call this type of primary instance intermediate primary.

If spec.replicationSourceSecretName is not set, MOCO configures semisynchronous replication between the primary and replicas. Otherwise, the replication is asynchronous.

For semi-synchronous replication, MOCO configures rpl_semi_sync_master_timeout long enough so that it never degrades to asynchronous replication.

Likewise, MOCO configures rpl_semi_sync_master_wait_for_slave_count to (spec.replicas - 1) / 2 to make sure that at least half of the replica instances have the same commit as the primary. e.g., if spec.replicas is 5, rpl_semi_sync_master_wait_for_slave_count will be set to 2.

MOCO also disables relay_log_recovery because enabling it would drop the relay logs on replicas.

mysqld always starts with super_read_only=1 to prevent erroneous writes, and with skip_slave_start to prevent misconfigured replication.

moco-agent, a sidecar container for MOCO, initializes MySQL users and plugins. At the end of the initialization, it issues RESET MASTER to clear executed GTID set.

moco-agent also provides a readiness probe for mysqld container. If a replica instance does not start replication threads or is too delayed to execute transactions, the container and the Pod will be determined as unready.

Limitations

Currently, MOCO does not re-initialize data after the primary instance fails.

After failover to a replica instance, the old primary may have errant transactions because it may recover unacknowledged transactions in its binary log. This is an inevitable limitation in MySQL semi-synchronous replication.

If this happens, MOCO detects the errant transaction and will not allow the old primary to rejoin the cluster as a replica.

Users need to delete the volume data (PersistentVolumeClaim) and the pod of the old primary to re-initialize it.

Possible states

MySQLCluster

MySQLCluster can be in one of the following states.

The initial state is Cloning if spec.replicationSourceSecretName is set, or Restoring if spec.restore is set. Otherwise, the initial state is Incomplete.

Note that, if the primary Pod is ready, mysqld is assured to be writable. Likewise, if a replica Pod is ready, mysqld is assured to be read-only and running replication threads without too much delay.

  1. Healthy
    • All Pods are ready.
    • All replicas have no errant transactions.
    • All replicas are read-only and connected to the primary.
    • For intermediate primary instance, the primary works as a replica for an external mysqld and is read-only.
  2. Cloning
    • spec.replicationSourceSecretName is set.
    • status.cloned is false.
    • Either a cloning result exists and is not "Completed", or there is no cloning result and the instance has no data.
    • (note: if the primary has some data and has no cloning result, the instance used to be a replica and was then promoted to the primary.)
  3. Restoring
    • spec.restore is set.
    • status.restoredTime is not set.
  4. Degraded
    • The primary Pod is ready and has not lost data.
    • For intermediate primary instance, the primary works as a replica for an external mysqld and is read-only.
    • Half or more replicas are ready, read-only, connected to the primary, and have no errant transactions. For example, if spec.replicas is 5, two or more such replicas are needed.
    • At least one replica has some problems.
  5. Failed
    • The primary instance is not running or lost data.
    • More than half of replicas are running and have data without errant transactions. For example, if spec.replicas is 5, three or more such replicas are needed.
  6. Lost
    • The primary instance is not running or lost data.
    • Half or more replicas are not running or lost data or have errant transactions.
  7. Incomplete
    • None of the above states applies.

MOCO can recover the cluster to Healthy from Degraded, Failed, or Incomplete if all Pods are running and there are no errant transactions.

MOCO can recover the cluster to Degraded from Failed when not all Pods are running. Recovering from Failed is called failover.

MOCO cannot recover the cluster from Lost. Users need to restore data from backups.

Pod

mysqld is run as a container in a Pod. Therefore, MOCO needs to be aware of the following conditions.

  1. Missing: the Pod does not exist.
  2. Exist: the Pod exists and is neither Terminating nor Demoting.
  3. Terminating: The Pod exists and metadata.deletionTimestamp is not null.
  4. Demoting: The Pod exists and has moco.cybozu.com/demote: true annotation.
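
For illustration only, a Pod would be considered Demoting once the annotation above is added, for example with the hypothetical Pod name below; in practice, use kubectl moco switchover rather than annotating Pods by hand:

$ kubectl annotate pod moco-mycluster-1 moco.cybozu.com/demote=true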

If there are missing Pods, MOCO does nothing for the MySQLCluster.

If a primary instance Pod is Terminating or Demoting, MOCO controller changes the primary to one of the replica instances. This operation is called switchover.

MySQL data

MOCO checks whether replica instances have errant transactions compared to the primary instance. If it detects such an instance, MOCO records the instance in MySQLCluster and excludes it from the cluster.

The user needs to delete the Pod and the volume manually and let the StatefulSet controller re-create them. After a newly initialized instance gets created, MOCO will allow it to rejoin the cluster.

Invariants

  • By definition, the primary instance recorded in MySQLCluster has no errant transactions. It is always the single source of truth.
  • Errant replicas are not treated as ready even if their Pod status is ready.

The maintenance flow

MOCO runs the following infinite loop for each MySQLCluster. It stops when MySQLCluster resource is deleted.

  1. Gather the current status
  2. Update status of MySQLCluster
  3. Determine what MOCO should do for the cluster
  4. If there is nothing to do, wait a while and go to 1
  5. Do the determined operation then go to 1

Read the following sub-sections for details about steps 1 to 3.

Gather the current status

MOCO gathers the information from kube-apiserver and mysqld as follows:

  • MySQLCluster resource
  • Pod resources
    • If some of the Pods are missing, MOCO does nothing.
  • mysqld
    • SHOW SLAVE HOSTS (on the primary)
    • SHOW SLAVE STATUS (on the replicas)
    • Global variables such as gtid_executed or super_read_only
    • Result of CLONE from performance_schema.clone_status table

If MOCO cannot connect to an instance for a certain period, that instance is determined as failed.

Update status of MySQLCluster

In this phase, MOCO updates status field of MySQLCluster as follows:

  1. Determine the current MySQLCluster state.
  2. Add or update type=Initialized condition to status.conditions as
    • True if the cluster state is not Cloning.
    • otherwise, False.
  3. Add or update type=Available condition to status.conditions as
    • True if the cluster state is Healthy or Degraded.
    • otherwise, False.
  4. Add or update type=Healthy condition to status.conditions as
    • True if the cluster state is Healthy.
    • otherwise, False.
    • The Reason field is set to the cluster state such as "Failed" or "Incomplete".
  5. Set the number of ready replica Pods to status.syncedReplicas.
  6. Add newly found errant replicas to status.errantReplicaList.
  7. Remove re-initialized and/or no-longer errant replicas from status.errantReplicaList
  8. Set status.errantReplicas to the length of status.errantReplicaList.
  9. Set status.cloned to true if spec.replicationSourceSecretName is not nil and the state is not Cloning.

Determine what MOCO should do for the cluster

The operation depends on the current cluster state.

The operation and its result are recorded as Events of MySQLCluster resource.

cf. Application Introspection and Debugging

Healthy

If the primary instance Pod is Terminating or Demoting, switch the primary instance to another replica. Otherwise, just wait a while.

The switchover is done as follows. It takes at least several seconds for a new primary to become writable.

  1. Make the primary instance super_read_only=1.
  2. Kill all existing connections except ones from localhost and ones for MOCO.
  3. Wait for a replica to catch up with the executed GTID set of the primary instance.
  4. Set status.currentPrimaryIndex to the replica's index.
  5. If the old primary is Demoting, remove moco.cybozu.com/demote annotation from the Pod.

Cloning

Execute CLONE INSTANCE on the intermediate primary instance to clone data from an external MySQL instance.

If the cloning succeeds, do the same as in the Intermediate case.

Restoring

Do nothing.

Degraded

First, check if the primary instance Pod is Terminating or Demoting; if it is, do the switchover just as in the Healthy case.

Then, do the same as in the Intermediate case to try to fix the problems. It is not possible to recover the cluster to Healthy if there are errant or stopped replicas, though.

Failed

MOCO chooses the most advanced instance as the new primary instance. The most advanced instance is the one whose retrieved GTID set is a superset of every other replica's, excluding replicas that have errant transactions.

To prevent accidental writes to the old primary instance (so-called split-brain), MOCO stops replication IO_THREAD for all replicas. This way, the old primary cannot get necessary acks from replicas to write further transactions.

The failover is done as follows:

  1. Stop IO_THREAD on all replicas.
  2. Choose the most advanced replica as the new primary. Errant replicas recorded in MySQLCluster are excluded from the candidates.
  3. Wait for the replica to execute all of its retrieved GTID set.
  4. Update status.currentPrimaryIndex to the new primary's index.

Lost

There is nothing that can be done.

Intermediate

  • On the primary that was an intermediate primary, wait for all the retrieved GTID set to be executed.
  • Start replication between the primary and non-errant replicas.
    • If a replica has no data, MOCO clones the primary data to the replica first.
  • Stop replication of errant replicas.
  • Set super_read_only=1 for replica instances that are writable.
  • Adjust the moco.cybozu.com/role label on Pods according to their roles.
    • For errant replicas, the label is removed to prevent users from reading inconsistent data.
  • Finally, make the primary mysqld writable if the primary is not an intermediate primary.

Backup and restore

This document describes how MOCO takes a backup of MySQLCluster data and restores a cluster from a backup.

Overview

A MySQLCluster can be configured to take backups regularly by referencing a BackupPolicy in spec.backupPolicyName. For each MySQLCluster associated with a BackupPolicy, moco-controller creates a CronJob. The CronJob creates a Job to take a full backup periodically. The Job also takes a backup of binary logs for Point-in-Time Recovery (PiTR). The backups are stored in an S3-compatible object storage bucket.
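
For example, a MySQLCluster opts in to backups simply by referencing a BackupPolicy by name; everything except the reference is elided here, and the names are placeholders:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: foo
  name: mycluster
spec:
  backupPolicyName: daily
  ...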

This figure illustrates how MOCO takes a backup of a MySQLCluster.

Backup

  1. moco-controller creates a CronJob and Role/RoleBinding to allow access to MySQLCluster for the Job Pod.
  2. At each configured interval, CronJob creates a Job.
  3. The Job dumps all data from a mysqld using MySQL shell's dump instance utility.
  4. The Job creates a tarball of the dumped data and puts it in a bucket of an S3-compatible object storage.
  5. The Job also dumps binlogs since the last backup and puts them in the same bucket (with a different name, of course).
  6. The Job finally updates MySQLCluster status to record the last successful backup.

To restore from a backup, users need to create a new MySQLCluster with spec.restore filled with necessary information such as the bucket name of the object storage, the object key, and so on.

The next figure illustrates how MOCO restores MySQL cluster from a backup.

Restore

  1. moco-controller creates a Job and Role/RoleBinding for restoration.
  2. The Job downloads a tarball of dumped files of the specified backup.
  3. The Job loads data into an empty mysqld using MySQL shell's dump loading utility.
  4. If the user wants to restore data at a point-in-time, the Job downloads the saved binlogs.
  5. The Job applies binlogs up to the specified point-in-time using mysqlbinlog.
  6. The Job finally updates MySQLCluster status to record the restoration time.

Design goals

Must:

  • Users must be able to configure different backup policies for each MySQLCluster.
  • Users must be able to restore MySQL data at a point-in-time from backups.
  • Users must be able to restore MySQL data without the original MySQLCluster resource.
  • moco-controller must export metrics about backups.

Should:

  • Backup data should be compressed to save the storage space.
  • Backup data should be stored in an object storage.
  • Backups should be taken from a replica instance as much as possible.

These "should's" are mostly in terms of money or performance.

Implementation

Backup file keys

Backup files are stored in an object storage bucket with the following keys.

  • Key for a tarball of a fully dumped MySQL: moco/<namespace>/<name>/YYYYMMDD-hhmmss/dump.tar
  • Key for a compressed tarball of binlog files: moco/<namespace>/<name>/YYYYMMDD-hhmmss/binlog.tar.zst

<namespace> is the namespace of MySQLCluster, and <name> is the name of MySQLCluster. YYYYMMDD-hhmmss is the date and time of the backup where YYYY is the year, MM is two-digit month, DD is two-digit day, hh is two-digit hour in 24-hour format, mm is two-digit minute, and ss is two-digit second.

Example: moco/foo/bar/20210515-230003/dump.tar

This allows multiple MySQLClusters to share the same bucket.

Timestamps

Internally, the time for PiTR is formatted in UTC timezone.

The restore Job runs mysqlbinlog with TZ=Etc/UTC timezone.

Backup

As described in Overview, the backup process is implemented with CronJob and Job. In addition, users need to provide a ServiceAccount for the Job.

The ServiceAccount is often used to grant access to the object storage bucket where the backup files will be stored. For instance, Amazon Elastic Kubernetes Service (EKS) has a feature to create such a ServiceAccount. Kubernetes itself is also developing such an enhancement called Container Object Storage Interface (COSI).

To allow the backup Job to update MySQLCluster status, MOCO creates Role and RoleBinding. The RoleBinding grants the access to the given ServiceAccount.

By default, MOCO uses the Amazon S3 API, the most popular object storage API. Therefore, it also works with object storage that has an S3-compatible API, such as MinIO and Ceph. Object storage that uses non-S3 compatible APIs is only partially supported.

Currently supported object storage includes:

  • Amazon S3-compatible API
  • Google Cloud Storage API

For the first backup, the Job chooses a replica instance as the backup source if one is available. For the second and subsequent backups, the Job will choose the last chosen instance as long as it is still a replica and available.

The backups are divided into two: a full dump and binlogs. A full dump is a snapshot of the entire MySQL database. Binlogs are records of transactions. With mysqlbinlog, binlogs can be used to apply transactions to a database restored from a full dump for PiTR.

For the first backup, MOCO only takes a full dump of a MySQL instance and records the GTID set at the time of the backup. For the second and subsequent backups, MOCO also retrieves the binlogs generated since the GTID of the last backup.

To take a full dump, MOCO uses MySQL shell's dump instance utility. It performs significantly faster than mysqldump or mysqlpump. The dump is compressed with zstd compression algorithm.

MOCO then creates a tarball of the dump and puts it in an object storage bucket.

To retrieve the transactions executed since the last backup, mysqlbinlog is used.

The retrieved binlog files are packed into a tarball, compressed with zstd, and put in an object storage bucket.

Finally, the Job updates MySQLCluster status field with the following information:

  • The time of backup
  • The time spent on the backup
  • The ordinal of the backup source instance
  • server_uuid of the instance (to check whether the instance was re-initialized or not)
  • The binlog filename in SHOW MASTER STATUS output.
  • The size of the tarball of the dumped files
  • The size of the tarball of the binlog files
  • The maximum usage of the working directory
  • Warnings, if any

When executing an incremental backup, the backup source must be a pod whose server_uuid has not changed since the last backup. If the server_uuid has changed, the pod may be missing some of the binlogs generated since the last backup.

The following flowchart shows how the pod used as the backup source is chosen.

flowchart TD
A{"first time?"}
A -->|"yes"| B
A -->|"no"| C["x ← Get the indexes of the pod whose server_uuid has not changed"] --> D

B{Are replicas available?}
B -->|"yes"| B1["return\nreplicaIdx\ndoBackupBinlog=false"]
style B1 fill:#c1ffff
B -->|"no"| B2["return\nprimaryIdx\ndoBackupBinlog=false"]
style B2 fill:#ffffc1

D{"Is x empty?"}
D -->|"yes"| E["add warning to bm.warnings"] --> F
style E fill:#ffc1c1
D -->|"no"| G

F{"Are replicas available?"}
F -->|"yes"| F1["return\nreplicaIdx\ndoBackupBinlog=false"]
style F1 fill:#ffc1c1
F -->|"no"| F2["return\nprimaryIdx\ndoBackupBinlog=false"]
style F2 fill:#ffc1c1

G{"Are there replica indexes in x?"}
G -->|"yes"| H
G -->|"no"| G1["return\nprimaryIdx\ndoBackupBinlog=true"]
style G1 fill:#ffffc1

H{"Is lastIndex included in x?"}
H -->|"yes"| I
H -->|"no"| H1["return\nreplicaIdx\ndoBackupBinlog=true"]
style H1 fill:#c1ffff

I{"Is lastIndex primary?"}
I -->|"yes"| I1["return\nreplicaIdx\ndoBackupBinlog=true"]
style I1 fill:#c1ffff
I -->|"no"| I2["return\nlastIdx\ndoBackupBinlog=true"]
style I2 fill:#c1ffff

Restore

To restore MySQL data from a backup, users need to create a new MySQLCluster with appropriate spec.restore field. spec.restore needs to provide at least the following information:

  • The bucket name
  • Namespace and name of the original MySQLCluster
  • A point-in-time in RFC3339 format
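
A sketch of such a MySQLCluster follows. The field names under spec.restore mirror the items above but should be checked against the custom resource reference; all concrete values are placeholders:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: bar
  name: restored
spec:
  restore:
    sourceName: mycluster
    sourceNamespace: foo
    restorePoint: "2021-05-23T15:04:23Z"
    jobConfig:
      serviceAccountName: backup-owner
      bucketConfig:
        bucketName: moco-backups
      workVolume:
        emptyDir: {}
  ...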

After moco-controller confirms that mysqld is running, it creates a Job to retrieve backup files and load them into mysqld.

The Job looks for the most recent tarball of the dumped files that is older than the specified point-in-time in the bucket, and retrieves it. The dumped files are then loaded to mysqld using MySQL shell's load dump utility.

If the point-in-time is different from the time of the dump file, and if there is a compressed tarball of binlog files, then the Job retrieves binlog files and applies transactions up to the point-in-time.

After the restoration process finishes, the Job updates the MySQLCluster status to record the restoration time. moco-controller then configures the clustering as usual.

If the Job fails, moco-controller leaves the Job as is. The restored MySQL cluster will also be left read-only. If some of the data have been restored, they can be read from the cluster.

If a failed Job is deleted, moco-controller will create a new Job to give it another chance. Users can safely delete a successful Job.

Caveats

  • No automatic deletion of backup files

    MOCO does not delete old backup files from object storage. Users should configure a bucket lifecycle policy to delete old backups automatically.

  • Duplicated backup Jobs

    CronJob may create two or more Jobs at a time. If this happens, only one Job can update MySQLCluster status.

  • Lost binlog files

    If binlog_expire_logs_seconds or expire_logs_days is set to a shorter value than the interval of backups, MOCO cannot save binlogs correctly. Users are responsible to configure binlog_expire_logs_seconds appropriately.
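
To illustrate the last caveat, binlog retention can be set through the cluster's my.cnf ConfigMap. The value below (14 days) is only an example, and the assumption here is that the MySQLCluster references this ConfigMap through its my.cnf customization field (spec.mysqlConfigMapName):

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: foo
  name: mycnf
data:
  binlog_expire_logs_seconds: "1209600"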

Considered options

There were many design choices and alternative methods to implement backup/restore feature for MySQL. Here are descriptions of why we determined the current design.

Why do we use S3-compatible object storage to store backups?

Compared to file systems, object storage is generally more cost-effective. It also has many useful features such as object lifecycle management.

The AWS S3 API is the most prevalent API for object storage.

What object storage is supported?

MOCO currently supports the following object storage APIs:

  • Amazon S3
  • Google Cloud Storage

MOCO uses the Amazon S3 API by default. You can set BackupPolicy.spec.jobConfig.bucketConfig.backendType to specify the object storage API to use. Currently, two identifiers can be specified: s3 or gcs. If backendType is not specified, it defaults to s3.

The following is an example of a backup setup using Google Cloud Storage:

apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
...
spec:
  schedule: "@daily"
  jobConfig:
    serviceAccountName: backup-owner
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: <dummy>
    bucketConfig:
      bucketName: moco
      endpointURL: https://storage.googleapis.com
      backendType: gcs
    workVolume:
      emptyDir: {}

Why do we use Jobs for backup and restoration?

Backup and restoration can be CPU- and memory-consuming tasks. Running such a task in moco-controller is dangerous because moco-controller manages a lot of MySQLClusters.

moco-agent is also not a safe place to run a backup job because it is a sidecar of the mysqld Pod. If a backup were run in the mysqld Pod, it would interfere with the mysqld process.

Why do we prefer mysqlsh to mysqldump?

The biggest reason is the difference in how these tools lock the instance.

mysqlsh uses LOCK INSTANCE FOR BACKUP, which blocks DDL until the lock is released. mysqldump, on the other hand, allows DDL to be executed. Once a DDL statement is executed, it acquires a metadata lock, which means that any DML for the table modified by the DDL will be blocked.

Blocking DML during backup is not desirable, especially when the only available backup source is the primary instance.

Another reason is that mysqlsh is much faster than mysqldump / mysqlpump.

Why don't we do continuous backup?

Continuous backup is a technique to save executed transactions in real time. For MySQL, this can be done with mysqlbinlog --stop-never. This command continuously retrieves transactions from binary logs and outputs them to stdout.

MOCO does not adopt this technique for the following reasons:

  • We assume MOCO clusters have replica instances in most cases.

    When the data of the primary instance is lost, one of the replicas can be promoted to be the new primary.

  • It is troublesome to control the continuous backup process on Kubernetes.

    The process needs to be kept running between full backups. If we do so, the entire backup process should be a persistent workload, not a (Cron)Job.

Upgrading mysqld

This document describes how mysqld upgrades its data and what MOCO has to do about it.

Preconditions

MySQL data

Beginning with 8.0.16, mysqld can update all data that needs to be updated when it starts running. This means that MOCO needs to do nothing about MySQL data.

One thing that we should care about is that the update process may take a long time. The startup probe of mysqld container should be configured to wait for mysqld to complete updating data.

ref: https://dev.mysql.com/doc/refman/8.0/en/upgrading-what-is-upgraded.html

Downgrading

MySQL 8.0 does not support any kind of downgrading.

ref: https://dev.mysql.com/doc/refman/8.0/en/downgrading.html

Internally, MySQL has a version called "data dictionary (DD) version". If two MySQL versions have the same DD version, they are considered to have data compatibility.

ref: https://github.com/mysql/mysql-server/blob/mysql-8.0.24/sql/dd/dd_version.h#L209

Nevertheless, DD versions do change from time to time between revisions of MySQL 8.0. Therefore, the simplest way to avoid DD version mismatch is to not downgrade MySQL.

Upgrading a replication setup

In a nutshell, replica MySQL instances should be the same version as, or newer than, the source MySQL instance.

refs:

  • https://dev.mysql.com/doc/refman/8.0/en/replication-compatibility.html
  • https://dev.mysql.com/doc/refman/8.0/en/replication-upgrade.html

StatefulSet behavior

When the Pod template of a StatefulSet is updated, Kubernetes updates the Pods. With the default update strategy RollingUpdate, the Pods are updated one by one from the largest ordinal to the smallest.

The StatefulSet controller keeps the old Pod template until it completes the rolling update. If a Pod that is not being updated is deleted, the StatefulSet controller restores the Pod from the old template.

This means that, if the cluster is Healthy, MySQL is assured to be updated one by one from the instance of the largest ordinal to the smallest.

refs:

  • https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates
  • https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update

Automatic switchover

MOCO switches the primary instance when the Pod of the instance is being deleted. Read clustering.md for details.

MOCO implementation

With the preconditions listed above, MOCO can upgrade mysqld in MySQLCluster safely as follows.

  1. Set .spec.updateStrategy field in StatefulSet to RollingUpdate.
  2. Choose the lowest ordinal Pod as the next primary upon a switchover.
  3. Configure the startup probe of mysqld container to wait long enough.
    • By default, MOCO configures the probe to wait up to one hour.
    • Users can adjust the duration for each MySQLCluster.
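
As a sketch of the last point, the wait can be lengthened per cluster. The field name spec.startupWaitSeconds is an assumption here and should be verified against the MySQLCluster reference:

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
...
spec:
  startupWaitSeconds: 7200
  ...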

Example

Suppose that we are updating a three-instance cluster. The mysqld instances in the cluster have ordinals 0, 1, and 2, and the current primary instance is instance 1.

After MOCO updates the Pod template of the StatefulSet created for the cluster, Kubernetes starts re-creating Pods, beginning with instance 2.

Instance 2 is a replica and therefore is safe for an update.

After instance 2, the instance 1 Pod is deleted. The deletion triggers an automatic switchover, so MOCO changes the primary to instance 0 because it has the lowest ordinal. Because instance 0 is running an old mysqld, the preconditions are kept.

Finally, instance 0 is re-created in the same way. This time, MOCO switches the primary to instance 1. Since both instances 1 and 2 have already been updated and instance 0 is being deleted, the preconditions are kept.

Limitations

If an instance is down during an upgrade, MOCO may choose an already updated instance as the new primary even though some instances are still running an old version.

If this happens, users may need to manually delete the old replica data and re-initialize the replica to restore the cluster health.

User's responsibility

Security considerations

gRPC API

moco-agent, a sidecar container in the mysqld Pod, provides a gRPC API to execute CLONE INSTANCE and the operations required after CLONE. More importantly, the request contains credentials to access the source database.

To protect the credentials and prevent abuse of API, MOCO configures mTLS between moco-agent and moco-controller as follows:

  1. Create an Issuer resource in moco-system namespace as the Certificate Authority.
  2. Create a Certificate resource to issue the certificate for moco-controller.
  3. moco-controller issues certificates for each MySQLCluster by creating Certificate resources.
  4. moco-controller copies Secret resources created by cert-manager to the namespaces of MySQLCluster.
  5. Both moco-controller and moco-agent verify the certificate with the CA certificate.
    • The CA certificate is embedded in the Secret resources.
  6. moco-agent additionally verifies the certificate from moco-controller by checking that its Common Name is moco-controller.

MySQL passwords

MOCO generates its user passwords randomly using the OS random device. The passwords are then stored as Secret resources.

As for the communication between moco-controller and mysqld, it is not (yet) over TLS. That said, the password is encrypted anyway thanks to caching_sha2_password authentication.