MOCO documentation
This is the documentation site for MOCO. MOCO is a Kubernetes operator for MySQL created and maintained by Cybozu.
Getting started
Setup
Quick setup
You can choose between two installation methods.
MOCO depends on cert-manager. If cert-manager is not installed on your cluster, install it as follows:
$ curl -fsLO https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml
$ kubectl apply -f cert-manager.yaml
Install using raw manifests:
$ curl -fsLO https://github.com/cybozu-go/moco/releases/latest/download/moco.yaml
$ kubectl apply -f moco.yaml
Install using Helm chart:
$ helm repo add moco https://cybozu-go.github.io/moco/
$ helm repo update
$ helm install --create-namespace --namespace moco-system moco moco/moco
Customize manifests
If you want to edit the manifest, the config/ directory contains the source YAML for kustomize.
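For example, a minimal kustomization.yaml could use the released manifest as a base and patch the moco-controller Deployment. The following is only a sketch; the nodeSelector patch is an illustration, not a recommended setting:

# kustomization.yaml
resources:
  - moco.yaml   # the manifest downloaded from the release page as shown above
patches:
  - target:
      kind: Deployment
      name: moco-controller
    patch: |-
      - op: add
        path: /spec/template/spec/nodeSelector
        value:
          kubernetes.io/os: linux

$ kustomize build . | kubectl apply -f -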
Next step
Read usage.md
and create your first MySQL cluster!
MOCO Helm Chart
How to use MOCO Helm repository
You need to add this repository to your Helm repositories:
$ helm repo add moco https://cybozu-go.github.io/moco/
$ helm repo update
Quick start
Installing cert-manager
$ curl -fsL https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml | kubectl apply -f -
Installing the Chart
NOTE:
This installation method requires cert-manager to be installed beforehand. To install the chart with the release name
moco
using a dedicated namespace (recommended):
$ helm install --create-namespace --namespace moco-system moco moco/moco
Specify parameters using the --set key=value[,key=value] argument to helm install.
Alternatively, a YAML file that specifies the values for the parameters can be provided like this:
$ helm install --create-namespace --namespace moco-system moco -f values.yaml moco/moco
Values
Key | Type | Default | Description |
---|---|---|---|
replicaCount | number | 2 | Number of controller replicas. |
image.repository | string | "ghcr.io/cybozu-go/moco" | MOCO image repository to use. |
image.pullPolicy | string | IfNotPresent | MOCO image pulling policy. |
image.tag | string | {{ .Chart.AppVersion }} | MOCO image tag to use. |
imagePullSecrets | list | [] | Secrets for pulling MOCO image from private repository. |
resources | object | {"requests":{"cpu":"100m","memory":"20Mi"}} | resources used by moco-controller. |
crds.enabled | bool | true | Install and update CRDs as part of the Helm chart. |
extraArgs | list | [] | Additional command line flags to pass to moco-controller binary. |
nodeSelector | object | {} | nodeSelector used by moco-controller. |
affinity | object | {} | affinity used by moco-controller. |
tolerations | list | [] | tolerations used by moco-controller. |
topologySpreadConstraints | list | [] | topologySpreadConstraints used by moco-controller. |
priorityClassName | string | "" | PriorityClass used by moco-controller. |
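For reference, a small values.yaml that overrides a few of the parameters above might look like the following sketch (the nodeSelector value is only an example):

# values.yaml
replicaCount: 2
resources:
  requests:
    cpu: 100m
    memory: 20Mi
nodeSelector:
  kubernetes.io/os: linux
crds:
  enabled: true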
Generate Manifests
You can use the helm template
command to render manifests.
$ helm template --namespace moco-system moco moco/moco
CRD considerations
Installing or updating CRDs
MOCO Helm Chart installs or updates CRDs by default. If you want to manage CRDs on your own, turn off the crds.enabled
parameter.
Removing CRDs
Helm does not remove the CRDs due to the helm.sh/resource-policy: keep
annotation.
When uninstalling, please remove the CRDs manually.
Migrate to v0.11.0 or higher
Chart version v0.11.0 introduces the crds.enabled
parameter.
When updating to a new chart from chart v0.10.x or lower, you MUST leave this parameter true
(the default value).
If you turn off this option when updating, the CRD will be removed, causing data loss.
Migrate to v0.3.0
Chart version v0.3.0 has breaking changes.
The .metadata.name
of the resource generated by Chart is changed.
e.g.
{{ template "moco.fullname" . }}-foo-resources
-> moco-foo-resources
Related Issue: cybozu-go/moco#426
If you are using a release name other than moco
, you need to migrate.
The migration steps involve deleting and recreating each MOCO resource once, except CRDs. Since the CRDs are not deleted, the pods running existing MySQL clusters are not deleted, so there is no downtime. However, the migration process should be completed in a short time since the moco-controller will be temporarily deleted and no control over the cluster will be available.
migration steps
-
Show the installed chart
$ helm list -n <YOUR NAMESPACE>
NAME  NAMESPACE    REVISION  UPDATED                               STATUS    CHART       APP VERSION
moco  moco-system  1         2022-08-17 11:28:23.418752 +0900 JST  deployed  moco-0.2.3  0.12.1
-
Render the manifests
$ helm template --namespace moco-system --version <YOUR CHART VERSION> <YOUR INSTALL NAME> moco/moco > render.yaml
-
Setup kustomize
$ cat > kustomization.yaml <<'EOF'
resources:
  - render.yaml
patches:
  - crd-patch.yaml
EOF
$ cat > crd-patch.yaml <<'EOF'
$patch: delete
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backuppolicies.moco.cybozu.com
---
$patch: delete
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: mysqlclusters.moco.cybozu.com
EOF
-
Delete resources
$ kustomize build ./ | kubectl delete -f -
serviceaccount "moco-controller-manager" deleted
role.rbac.authorization.k8s.io "moco-leader-election-role" deleted
clusterrole.rbac.authorization.k8s.io "moco-backuppolicy-editor-role" deleted
clusterrole.rbac.authorization.k8s.io "moco-backuppolicy-viewer-role" deleted
clusterrole.rbac.authorization.k8s.io "moco-manager-role" deleted
clusterrole.rbac.authorization.k8s.io "moco-mysqlcluster-editor-role" deleted
clusterrole.rbac.authorization.k8s.io "moco-mysqlcluster-viewer-role" deleted
rolebinding.rbac.authorization.k8s.io "moco-leader-election-rolebinding" deleted
clusterrolebinding.rbac.authorization.k8s.io "moco-manager-rolebinding" deleted
service "moco-webhook-service" deleted
deployment.apps "moco-controller" deleted
certificate.cert-manager.io "moco-controller-grpc" deleted
certificate.cert-manager.io "moco-grpc-ca" deleted
certificate.cert-manager.io "moco-serving-cert" deleted
issuer.cert-manager.io "moco-grpc-issuer" deleted
issuer.cert-manager.io "moco-selfsigned-issuer" deleted
mutatingwebhookconfiguration.admissionregistration.k8s.io "moco-mutating-webhook-configuration" deleted
validatingwebhookconfiguration.admissionregistration.k8s.io "moco-validating-webhook-configuration" deleted
-
Delete Secret
$ kubectl delete secret sh.helm.release.v1.<YOUR INSTALL NAME>.v1 -n <YOUR NAMESPACE>
-
Re-install the v0.3.0 chart
$ helm install --create-namespace --namespace moco-system --version 0.3.0 moco moco/moco
Release Chart
See RELEASE.md.
Installing kubectl-moco
kubectl-moco is a plugin for kubectl
to control MySQL clusters of MOCO.
Pre-built binaries are available on GitHub releases for Windows, Linux, and macOS.
Installing using Krew
Krew is the plugin manager for kubectl command-line tool.
See the documentation for how to install Krew.
$ kubectl krew update
$ kubectl krew install moco
Installing manually
- Set OS to the operating system name.
  OS is one of linux, windows, or darwin (macOS).
  If Go is available, OS can be set automatically as follows:

  $ OS=$(go env GOOS)

- Set ARCH to the architecture name.
  ARCH is one of amd64 or arm64.
  If Go is available, ARCH can be set automatically as follows:

  $ ARCH=$(go env GOARCH)

- Set VERSION to the MOCO version.
  See the MOCO release page: https://github.com/cybozu-go/moco/releases

  $ VERSION=< The version you want to install >

- Download the binary and put it in a directory of your PATH.
  The following is an example to install the plugin in /usr/local/bin.

  $ curl -L -sS https://github.com/cybozu-go/moco/releases/download/${VERSION}/kubectl-moco_${VERSION}_${OS}_${ARCH}.tar.gz \
      | tar xz -C /usr/local/bin kubectl-moco

- Check the installation by running kubectl moco -h.

  $ kubectl moco -h
  the utility command for MOCO.

  Usage:
    kubectl-moco [command]

  Available Commands:
    credential  Fetch the credential of a specified user
    help        Help about any command
    mysql       Run mysql command in a specified MySQL instance
    switchover  Switch the primary instance
  ...
How to use MOCO
After setting up MOCO, you can create MySQL clusters with a custom resource called MySQLCluster.
- Basics
- Limitations
- Creating clusters
- Configurations
- Using the cluster
- Backup and restore
- Deleting the cluster
- Status, metrics, and logs
- Maintenance
Basics
MOCO creates a cluster of mysqld instances for each MySQLCluster. A cluster can consist of 1, 3, or 5 mysqld instances.
MOCO configures semi-synchronous GTID-based replication between mysqld instances in a cluster if the cluster size is 3 or 5. A 3-instance cluster can tolerate up to 1 replica failure, and a 5-instance cluster can tolerate up to 2 replica failures.
In a cluster, there is only one instance called the primary. The primary instance is the source of truth: it is the only writable instance in the cluster, and the source of replication. All other instances are called replicas. A replica is a read-only instance and replicates data from the primary.
Limitations
Errant replicas
An inherent limitation of GTID-based semi-synchronous replication is that a failed instance may have errant transactions. If this happens, the instance needs to be re-created by removing all of its data.
MOCO does not re-create such an instance. It only detects instances having errant transactions and excludes them from the cluster. Users need to monitor them and re-create the instances.
Read-only primary
MOCO from time to time sets the primary mysqld instance read-only for a switchover or other reasons. Applications that use MOCO MySQL need to be aware of this.
Creating clusters
Creating an empty cluster
An empty cluster always has a writable instance called the primary. All other instances are called replicas. Replicas are read-only and replicate data from the primary.
The following YAML is to create a three-instance cluster. It has an anti-affinity for Pods so that all instances will be scheduled to different Nodes. It also sets the limits for memory and CPU to make the Pod Guaranteed.
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: default
name: test
spec:
# replicas is the number of mysqld Pods. The default is 1.
replicas: 3
podTemplate:
spec:
# Make the data directory writable. If moco-init fails with "Permission denied", uncomment the following settings.
# securityContext:
# fsGroup: 10000
# fsGroupChangePolicy: "OnRootMismatch" # available since k8s 1.20
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- mysql
- key: app.kubernetes.io/instance
operator: In
values:
- test
topologyKey: "kubernetes.io/hostname"
containers:
# At least a container named "mysqld" must be defined.
- name: mysqld
image: ghcr.io/cybozu-go/moco/mysql:8.4.2
# By limiting CPU and memory, Pods will have Guaranteed QoS class.
# requests can be omitted; it will be set to the same value as limits.
resources:
limits:
cpu: "10"
memory: "10Gi"
volumeClaimTemplates:
# At least a PVC named "mysql-data" must be defined.
- metadata:
name: mysql-data
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi
By default, MOCO uses preferredDuringSchedulingIgnoredDuringExecution
to prevent Pods from being placed on the same Node.
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: moco-<MYSQLCLUSTER_NAME>
namespace: default
...
spec:
template:
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- mysql
- key: app.kubernetes.io/created-by
operator: In
values:
- moco
- key: app.kubernetes.io/instance
operator: In
values:
- <MYSQLCLUSTER_NAME>
topologyKey: kubernetes.io/hostname
weight: 100
...
There are other example manifests in the examples directory.
The complete reference of MySQLCluster is crd_mysqlcluster_v1beta2.md
.
Creating a cluster that replicates data from an external mysqld
Let's call the source mysqld instance donor.
First, make sure partial_revokes is enabled on the donor; replicating data from the donor with partial_revokes disabled will result in replication inconsistencies or errors since MOCO uses the partial_revokes functionality.
We use the clone plugin to copy the whole data quickly. After the cloning, MOCO needs to create some user accounts and install plugins.
On the donor, you need to install the plugin and create two user accounts as follows:
mysql> INSTALL PLUGIN clone SONAME 'mysql_clone.so';
mysql> CREATE USER 'clone-donor'@'%' IDENTIFIED BY 'xxxxxxxxxxx';
mysql> GRANT BACKUP_ADMIN, REPLICATION SLAVE ON *.* TO 'clone-donor'@'%';
mysql> CREATE USER 'clone-init'@'localhost' IDENTIFIED BY 'yyyyyyyyyyy';
mysql> GRANT ALL ON *.* TO 'clone-init'@'localhost' WITH GRANT OPTION;
mysql> GRANT PROXY ON ''@'' TO 'clone-init'@'localhost' WITH GRANT OPTION;
You may change the user names and should change their passwords.
Then create a Secret in the same namespace as MySQLCluster:
$ kubectl -n <namespace> create secret generic donor-secret \
--from-literal=HOST=<donor-host> \
--from-literal=PORT=<donor-port> \
--from-literal=USER=clone-donor \
--from-literal=PASSWORD=xxxxxxxxxxx \
--from-literal=INIT_USER=clone-init \
--from-literal=INIT_PASSWORD=yyyyyyyyyyy
You may change the secret name.
Finally, create MySQLCluster with spec.replicationSourceSecretName
set to the Secret name as follows.
The mysql image must be the same version as the donor's.
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: foo
name: test
spec:
replicationSourceSecretName: donor-secret
podTemplate:
spec:
containers:
- name: mysqld
image: ghcr.io/cybozu-go/moco/mysql:8.4.2 # must be the same version as the donor
volumeClaimTemplates:
- metadata:
name: mysql-data
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi
To stop the replication from the donor, update MySQLCluster with spec.replicationSourceSecretName: null
.
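One way to do this is with a merge patch; the following is a sketch for the example cluster above (you can also edit the resource with kubectl edit):

$ kubectl -n foo patch mysqlcluster test --type=merge -p '{"spec":{"replicationSourceSecretName":null}}'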
Bring your own image
We provide pre-built MySQL container images at ghcr.io/cybozu-go/moco/mysql.
If you want to build and use your own image, read custom-mysqld.md
.
Configurations
The default and constant configuration values for mysqld
are available on pkg.go.dev.
The settings in ConstMycnf
cannot be changed while the settings in DefaultMycnf
can be overridden.
You can change the default values or set undefined values by creating a ConfigMap in the same namespace as MySQLCluster, and setting spec.mysqlConfigMapName
in MySQLCluster to the name of the ConfigMap as follows:
apiVersion: v1
kind: ConfigMap
metadata:
namespace: foo
name: mycnf
data:
long_query_time: "5"
innodb_buffer_pool_size: "10G"
---
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: foo
name: test
spec:
# set this to the name of ConfigMap
mysqlConfigMapName: mycnf
...
InnoDB buffer pool size
If innodb_buffer_pool_size is not specified, MOCO sets it automatically to 70% of the value of resources.requests.memory (or resources.limits.memory) for the mysqld container.
If neither resources.requests.memory nor resources.limits.memory is set, innodb_buffer_pool_size will be set to 128M.
Opaque configuration
Some configuration variables cannot be fully configured with ConfigMap values.
For example, --performance-schema-instrument
needs to be specified multiple times.
You may set them through a special config key _include
.
The value of _include
will be included in my.cnf
as opaque.
apiVersion: v1
kind: ConfigMap
metadata:
namespace: foo
name: mycnf
data:
_include: |
performance-schema-instrument='memory/%=ON'
performance-schema-instrument='wait/synch/%/innodb/%=ON'
performance-schema-instrument='wait/lock/table/sql/handler=OFF'
performance-schema-instrument='wait/lock/metadata/sql/mdl=OFF'
Care must be taken not to overwrite critical configurations such as log_bin
since MOCO does not check the contents from _include
.
Using the cluster
kubectl moco
From outside of your Kubernetes cluster, you can access MOCO MySQL instances using kubectl-moco
.
kubectl-moco
is a plugin for kubectl
.
Pre-built binaries are available on GitHub releases.
The following is an example to run mysql
command interactively to access the primary instance of test
MySQLCluster in foo
namespace.
$ kubectl moco -n foo mysql -it test
Read the reference manual of kubectl-moco
for further details and examples.
MySQL users
MOCO prepares a set of users.
- moco-readonly can read all tables of all databases.
- moco-writable can create users, databases, or tables.
- moco-admin is the super user.
The exact privileges that moco-readonly
has are:
- PROCESS
- REPLICATION CLIENT
- REPLICATION SLAVE
- SELECT
- SHOW DATABASES
- SHOW VIEW
The exact privileges that moco-writable
has are:
- ALTER
- ALTER ROUTINE
- CREATE
- CREATE ROLE
- CREATE ROUTINE
- CREATE TEMPORARY TABLES
- CREATE USER
- CREATE VIEW
- DELETE
- DROP
- DROP ROLE
- EVENT
- EXECUTE
- INDEX
- INSERT
- LOCK TABLES
- PROCESS
- REFERENCES
- REPLICATION CLIENT
- REPLICATION SLAVE
- SELECT
- SHOW DATABASES
- SHOW VIEW
- TRIGGER
- UPDATE
moco-writable cannot edit tables in the mysql database, though.
You can create other users and grant them certain privileges as either moco-writable
or moco-admin
.
$ kubectl moco mysql -u moco-writable test -- -e "CREATE USER 'foo'@'%' IDENTIFIED BY 'bar'"
$ kubectl moco mysql -u moco-writable test -- -e "CREATE DATABASE db1"
$ kubectl moco mysql -u moco-writable test -- -e "GRANT ALL ON db1.* TO 'foo'@'%'"
Connecting to mysqld
over network
MOCO prepares two Services for each MySQLCluster.
For example, a MySQLCluster named test
in foo
Namespace has the following Services.
Service Name | DNS Name | Description |
---|---|---|
moco-test-primary | moco-test-primary.foo.svc | Connect to the primary instance. |
moco-test-replica | moco-test-replica.foo.svc | Connect to replica instances. |
moco-test-replica
can be used only for read access.
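For example, assuming a user foo and a database db1 have been created as shown earlier, you can connect to the primary from inside the Kubernetes cluster like this (a sketch; any image that contains the mysql client will do):

$ kubectl -n foo run mysql-client -it --rm --restart=Never \
    --image=ghcr.io/cybozu-go/moco/mysql:8.4.2 -- \
    mysql -h moco-test-primary.foo.svc -u foo -p db1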
The type of these Services is usually ClusterIP. The following is an example to change Service type to LoadBalancer and add an annotation for MetalLB.
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: foo
name: test
spec:
primaryServiceTemplate:
metadata:
annotations:
metallb.universe.tf/address-pool: production-public-ips
spec:
type: LoadBalancer
...
Backup and restore
MOCO can take full and incremental backups regularly. The backup data are stored in Amazon S3-compatible object storage.
You can restore data from a backup to a new MySQL cluster.
Object storage bucket
A bucket is a management unit of objects in S3. MOCO stores backups in a specified bucket.
MOCO does not remove backups. To remove old backups automatically, you can set a lifecycle configuration to the bucket.
ref: Setting lifecycle configuration on a bucket
A bucket can be shared safely across multiple MySQLClusters.
Object keys are prefixed with moco/
.
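For example, with an Amazon S3 bucket you could expire old backup objects under the moco/ prefix with a lifecycle rule like the following sketch (the 14-day retention is only an example):

$ cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-moco-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "moco/" },
      "Expiration": { "Days": 14 }
    }
  ]
}
EOF
$ aws s3api put-bucket-lifecycle-configuration --bucket <YOUR_BUCKET> \
    --lifecycle-configuration file://lifecycle.json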
BackupPolicy
BackupPolicy is a custom resource to define a policy for taking backups.
The following is an example BackupPolicy to take a backup every day and store data in MinIO:
apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
metadata:
namespace: backup
name: daily
spec:
# Backup schedule. Any CRON format is allowed.
schedule: "@daily"
jobConfig:
# An existing ServiceAccount name is required.
serviceAccountName: backup-owner
env:
- name: AWS_ACCESS_KEY_ID
value: minioadmin
- name: AWS_SECRET_ACCESS_KEY
value: minioadmin
# bucketName is required. Other fields are optional.
    # If backendType is s3 (default), specify the region of the bucket via the region field or AWS_REGION environment variable.
bucketConfig:
bucketName: moco
region: us-east-1
endpointURL: http://minio.default.svc:9000
usePathStyle: true
# MOCO uses a filesystem volume to store data temporarily.
workVolume:
# Using emptyDir as a working directory is NOT recommended.
# The recommended way is to use generic ephemeral volume with a provisioner
# that can provide enough capacity.
# https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes
emptyDir: {}
To enable backup for a MySQLCluster, reference the BackupPolicy name like this:
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: default
name: foo
spec:
backupPolicyName: daily # The policy name
...
Note: If you want to specify the ObjectBucket name in a ConfigMap or Secret, you can use
envFrom
and specify the environment variable name injobConfig.bucketConfig.bucketName
as follows. This behavior is tested.
apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
metadata:
namespace: backup
name: daily
spec:
jobConfig:
bucketConfig:
bucketName: "$(BUCKET_NAME)"
region: us-east-1
endpointURL: http://minio.default.svc:9000
usePathStyle: true
envFrom:
- configMapRef:
name: bucket-name
...
---
apiVersion: v1
kind: ConfigMap
metadata:
namespace: backup
name: bucket-name
data:
BUCKET_NAME: moco
MOCO creates a CronJob for each MySQLCluster that has spec.backupPolicyName
.
The CronJob's name is moco-backup-
+ the name of MySQLCluster.
For the above example, a CronJob named moco-backup-foo
is created in default
namespace.
The following podAntiAffinity is set by default for CronJob.
If you want to override it, set BackupPolicy.spec.jobConfig.affinity
.
apiVersion: batch/v1
kind: CronJob
metadata:
name: moco-backup-foo
spec:
...
jobTemplate:
spec:
template:
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- mysql-backup
- key: app.kubernetes.io/created-by
operator: In
values:
- moco
topologyKey: kubernetes.io/hostname
weight: 100
...
Credentials to access S3 bucket
Depending on your Kubernetes service provider and object storage, there are various ways to give credentials to access the object storage bucket.
For Amazon's Elastic Kubernetes Service (EKS) and S3 users, the easiest way is probably to use IAM Roles for Service Accounts (IRSA).
ref: IAM ROLES FOR SERVICE ACCOUNTS
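With IRSA, the ServiceAccount referenced by jobConfig.serviceAccountName carries an IAM role annotation. The following is a sketch; the role ARN is a placeholder:

apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: backup
  name: backup-owner
  annotations:
    # placeholder ARN; use the IAM role you created for the backup job
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/moco-backup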
Another popular way is to set AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
environment variables as shown in the above example.
Taking an emergency backup
You can take an emergency backup by creating a Job from the CronJob for backup.
$ kubectl create job --from=cronjob/moco-backup-foo emergency-backup
Restore
To restore data from a backup, create a new MySQLCluster with the spec.restore field as follows:
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: backup
name: target
spec:
# restore field is not editable.
# to modify parameters, delete and re-create MySQLCluster.
restore:
# The source MySQLCluster's name and namespace
sourceName: source
sourceNamespace: backup
# The restore point-in-time in RFC3339 format.
restorePoint: "2021-05-26T12:34:56Z"
    # jobConfig is the same as in BackupPolicy
jobConfig:
serviceAccountName: backup-owner
env:
- name: AWS_ACCESS_KEY_ID
value: minioadmin
- name: AWS_SECRET_ACCESS_KEY
value: minioadmin
bucketConfig:
bucketName: moco
region: us-east-1
endpointURL: http://minio.default.svc:9000
usePathStyle: true
workVolume:
emptyDir: {}
...
Further details
Read backup.md for further details.
Deleting the cluster
By deleting MySQLCluster, all resources including PersistentVolumeClaims generated from the templates are automatically removed.
If you want to keep the PersistentVolumeClaims, remove metadata.ownerReferences
from them before you delete a MySQLCluster.
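For example, the following JSON patch removes the ownerReferences from one PVC (a sketch; adjust the PVC name and namespace to your cluster):

$ kubectl -n default patch pvc mysql-data-moco-test-0 --type=json \
    -p '[{"op": "remove", "path": "/metadata/ownerReferences"}]'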
Status, metrics, and logs
Cluster status
You can see the health and availability status of MySQLCluster as follows:
$ kubectl get mysqlcluster
NAME AVAILABLE HEALTHY PRIMARY SYNCED REPLICAS ERRANT REPLICAS
test True True 0 3
- The cluster is available when the primary Pod is running and ready.
- The cluster is healthy when there are no problems.
- PRIMARY is the index of the current primary instance Pod.
- SYNCED REPLICAS is the number of ready Pods.
- ERRANT REPLICAS is the number of instances having errant transactions.
You can also use kubectl describe mysqlcluster
to see the recent events on the cluster.
Pod status
MOCO adds a liveness probe and a readiness probe to mysqld containers to check the replication status in addition to the process status.
A replica Pod is ready only when it is replicating data from the primary without a significant delay. The default threshold of the delay is 60 seconds. The threshold can be configured as follows.
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: foo
name: test
spec:
maxDelaySeconds: 180
...
Unready replica Pods are automatically excluded from the load-balancing targets so that users will not read stale data.
Metrics
MOCO provides built-in support to collect and expose mysqld metrics using mysqld_exporter.
This is an example YAML to enable mysqld_exporter
.
spec.collectors is a list of mysqld_exporter flag names without the collect. prefix.
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: foo
name: test
spec:
collectors:
- engine_innodb_status
- info_schema.innodb_metrics
podTemplate:
...
See metrics.md
for all available metrics and how to collect them using Prometheus.
Logs
Error logs from mysqld
can be viewed as follows:
$ kubectl logs moco-test-0 mysqld
Slow logs from mysqld
can be viewed as follows:
$ kubectl logs moco-test-0 slow-log
Maintenance
Increasing the number of instances in the cluster
Edit spec.replicas
field of MySQLCluster:
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: foo
name: test
spec:
replicas: 5
...
You can only increase the number of instances in a MySQLCluster from 1 to 3 or 5, or from 3 to 5. Decreasing the number of instances is not allowed.
Switchover
Switchover is an operation to change the live primary to one of the replicas.
MOCO automatically switches the primary when the Pod of the primary instance is about to be deleted.
Users can manually trigger a switchover with kubectl moco switchover CLUSTER_NAME
.
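For example, to switch the primary of the test cluster in the foo namespace:

$ kubectl moco -n foo switchover test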
Read kubectl-moco.md
for details.
Failover
Failover is an operation to replace the dead primary with the most advanced replica. MOCO automatically does this as soon as it detects that the primary is down.
The most advanced replica is a replica that has retrieved the most up-to-date transactions from the dead primary. Since MOCO configures lossless semi-synchronous replication, the failover is guaranteed not to lose any user data.
After a failover, the old primary may become an errant replica as described.
Upgrading mysql version
You can upgrade the MySQL version of a MySQL cluster as follows:
- Check that the cluster is healthy.
- Check release notes of MySQL for any incompatibilities between the current and the new versions.
- Edit the Pod template of the MySQLCluster and update
mysqld
container image:
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: default
name: test
spec:
  podTemplate:
    spec:
      containers:
        - name: mysqld
          # Edit the next line
          image: ghcr.io/cybozu-go/moco/mysql:8.4.2
You are advised to make backups and/or create a replica cluster before starting the upgrade process.
Read upgrading.md
for further details.
Re-initializing an errant replica
Delete the PVC and Pod of the errant replica, like this:
$ kubectl delete --wait=false pvc mysql-data-moco-test-0
$ kubectl delete --grace-period=1 pods moco-test-0
Depending on your Kubernetes version, the StatefulSet controller may create a pending Pod before the PVC gets deleted. Delete such pending Pods until the PVC is actually removed.
Stop Clustering and Reconciliation
In MOCO, you can optionally stop the clustering and reconciliation of a MySQLCluster.
To stop clustering and reconciliation, use the following commands.
$ kubectl moco stop clustering <CLUSTER_NAME>
$ kubectl moco stop reconciliation <CLUSTER_NAME>
To resume the stopped clustering and reconciliation, use the following commands.
$ kubectl moco start clustering <CLUSTER_NAME>
$ kubectl moco start reconciliation <CLUSTER_NAME>
You could use this feature in the following cases:
- To stop the replication of a MySQLCluster and perform a manual operation to align the GTID
  - Run the kubectl moco stop clustering command on the MySQLCluster where you want to stop the replication
- To suppress the full update of MySQLCluster that occurs during the upgrade of MOCO
  - Run the kubectl moco stop reconciliation command on the MySQLCluster on which you want to suppress the update
To check whether clustering and reconciliation are stopped, use kubectl get mysqlcluster
.
Moreover, while clustering is stopped, AVAILABLE
and HEALTHY
values will be Unknown
.
$ kubectl get mysqlcluster
NAME AVAILABLE HEALTHY PRIMARY SYNCED REPLICAS ERRANT REPLICAS CLUSTERING ACTIVE RECONCILE ACTIVE LAST BACKUP
test Unknown Unknown 0 3 False False <no value>
The MOCO controller outputs the following metrics to indicate that clustering or reconciliation has been stopped. The value is 1 if clustering or reconciliation is stopped, and 0 otherwise.
moco_cluster_clustering_stopped{name="mycluster", namespace="mynamespace"} 1
moco_cluster_reconciliation_stopped{name="mycluster", namespace="mynamespace"} 1
While clustering is stopped, monitoring of the cluster by MOCO is halted, and the value of the following metrics becomes NaN.
moco_cluster_available{name="test",namespace="default"} NaN
moco_cluster_healthy{name="test",namespace="default"} NaN
moco_cluster_ready_replicas{name="test",namespace="default"} NaN
moco_cluster_errant_replicas{name="test",namespace="default"} NaN
Set to Read Only
When you want to set MOCO's MySQL to read-only, use the following commands.
MOCO makes the primary instance writable in the clustering process. Therefore, please be sure to stop clustering when you set it to read-only.
$ kubectl moco stop clustering <CLUSTER_NAME>
$ kubectl moco mysql -u moco-admin <CLUSTER_NAME> -- -e "SET GLOBAL super_read_only=1"
You can check whether the cluster is read-only with the following command.
$ kubectl moco mysql -it <CLUSTER_NAME> -- -e "SELECT @@super_read_only"
+-------------------+
| @@super_read_only |
+-------------------+
| 1 |
+-------------------+
If you want to leave read-only mode, restart clustering as follows. Then, MOCO will make the cluster writable.
$ kubectl moco start clustering <CLUSTER_NAME>
Advanced topics
Building custom image of mysqld
There are pre-built mysqld
container images for MOCO on ghcr.io/cybozu-go/moco/mysql
.
Users can use one of these images to supply the mysqld container in MySQLCluster like:
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
spec:
podTemplate:
spec:
containers:
- name: mysqld
image: ghcr.io/cybozu-go/moco/mysql:8.4.2
If you want to build and use your own mysqld
, read the rest of this document.
Dockerfile
The easiest way to build a custom mysqld
for MOCO is to copy and edit our Dockerfile.
You can find it under containers/mysql
directory in github.com/cybozu-go/moco
.
You should keep the following points:
- ENTRYPOINT should be ["mysqld"]
- USER should be 10000:10000
- The sleep command must exist in one of the PATH directories.
How to build mysqld
On Ubuntu 22.04, you can build the source code as follows:
$ sudo apt-get update
$ sudo apt-get -y --no-install-recommends install build-essential libssl-dev \
cmake libncurses5-dev libjemalloc-dev libnuma-dev libaio-dev pkg-config
$ curl -fsSL -O https://dev.mysql.com/get/Downloads/MySQL-8.4/mysql-8.4.2.tar.gz
$ tar -x -z -f mysql-8.4.2.tar.gz
$ cd mysql-8.4.2
$ mkdir bld
$ cd bld
$ cmake .. -DBUILD_CONFIG=mysql_release -DCMAKE_BUILD_TYPE=Release \
-DWITH_NUMA=1 -DWITH_JEMALLOC=1
$ make -j $(nproc)
$ make install
Customize default container
MOCO has containers that are automatically added by the system in addition to containers added by the user (e.g. agent, moco-init, etc.).
The MySQLCluster.spec.podTemplate.overwriteContainers
field can be used to overwrite such containers.
Currently, only container resources can be overwritten.
overwriteContainers
is only available in MySQLCluster v1beta2.
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: default
name: test
spec:
podTemplate:
spec:
containers:
- name: mysqld
image: ghcr.io/cybozu-go/moco/mysql:8.0.30
overwriteContainers:
- name: agent
resources:
requests:
cpu: 50m
System containers
The following is a list of system containers used by MOCO.
Specifying container names in overwriteContainers
that are not listed here will result in an error in API validation.
Name | Default CPU Requests/Limits | Default Memory Requests/Limits | Description |
---|---|---|---|
agent | 100m / 100m | 100Mi / 100Mi | MOCO's agent container running in sidecar. refs: https://github.com/cybozu-go/moco-agent |
moco-init | 100m / 100m | 300Mi / 300Mi | Initializes the MySQL data directory and creates a configuration snippet to provide instance-specific configuration values such as server_id and admin_address. |
slow-log | 100m / 100m | 20Mi / 20Mi | Sidecar container for outputting slow query logs. |
mysqld-exporter | 200m / 200m | 100Mi / 100Mi | MySQL server exporter sidecar container. |
Change the volumeClaimTemplates
MOCO supports MySQLCluster .spec.volumeClaimTemplates
changes.
When .spec.volumeClaimTemplates
is changed, moco-controller will try to recreate the StatefulSet.
This is because modification of volumeClaimTemplates
in StatefulSet is currently not allowed.
Re-creation of the StatefulSet is done with the same behavior as kubectl delete sts moco-xxx --cascade=orphan, without removing the working Pods.
NOTE: It may be possible to edit the StatefulSet directly in the future.
ref: https://github.com/kubernetes/enhancements/issues/661
When re-creating a StatefulSet, moco-controller supports no operation except for volume expansion as described below.
It simply re-creates the StatefulSet.
However, by specifying the --pvc-sync-annotation-keys
and --pvc-sync-label-keys
flags in the controller, you can designate the annotations and labels to be synchronized from .spec.volumeClaimTemplates
to PVC during the recreation of the StatefulSet.
For all other labels and annotations, given the potential side effects, such updates must be performed by the user themselves. This guideline is essential to prevent potential side-effects if entities other than the moco-controller are manipulating the PVC's metadata.
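If you install MOCO with the Helm chart, such controller flags can be passed through the extraArgs value. The following is only a sketch; the annotation and label keys are placeholders, and the exact flag value format should be checked against the moco-controller help output:

# values.yaml
extraArgs:
  - --pvc-sync-annotation-keys=example.com/some-annotation
  - --pvc-sync-label-keys=example.com/some-label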
Metrics
The success or failure of re-creating a StatefulSet is reported to the user with the following metrics:
moco_cluster_statefulset_recreate_total{name="mycluster", namespace="mynamespace"} 3
moco_cluster_statefulset_recreate_errors_total{name="mycluster", namespace="mynamespace"} 1
If re-creating a StatefulSet fails, moco_cluster_statefulset_recreate_errors_total is incremented after each reconcile,
so users can notice anomalies by monitoring this metric.
See the metrics documentation for more details.
Volume expansion
moco-controller automatically resizes the PVC when the size of the MySQLCluster volume claim is extended. If the volume plugin supports online file system expansion, the PVs used by the Pod will be expanded online.
If a volume is to be expanded, .allowVolumeExpansion of the StorageClass must be true.
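A StorageClass that allows expansion looks like the following sketch (the name and provisioner are placeholders; use your CSI driver):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable
provisioner: example.com/your-csi-driver
allowVolumeExpansion: true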
moco-controller will validate with the admission webhook and reject the request if volume expansion is not allowed.
If the volume plugin does not support online file system expansion, the Pod must be restarted for the volume expansion to take effect. This must be done manually by the user.
When moco-controller resizes a PVC, there may be a discrepancy between the size defined in the MySQLCluster and the actual PVC size, for example if you are using github.com/topolvm/pvc-autoresizer. In this case, moco-controller will only update the PVC if its actual size is smaller than the size after the change.
Metrics
The success or failure of the PVC resizing is reported to the user with the following metrics:
moco_cluster_volume_resized_total{name="mycluster", namespace="mynamespace"} 4
moco_cluster_volume_resized_errors_total{name="mycluster", namespace="mynamespace"} 1
These metrics are incremented when the volume size change succeeds or fails, respectively.
If the resize fails, moco_cluster_volume_resized_errors_total is incremented after each reconcile,
so users can notice anomalies by monitoring this metric.
See the metrics documentation for more details.
Volume reduction
MOCO supports PVC reduction, but unlike PVC expansion, the user must perform the operation manually.
The steps are as follows:
1. The user modifies the .spec.volumeClaimTemplates of the MySQLCluster and sets a smaller volume size.
2. MOCO updates the .spec.volumeClaimTemplates of the StatefulSet. This does not propagate to existing Pods, PVCs, or PVs.
3. The user manually deletes the MySQL Pod & PVC.
4. Wait for the Pod & PVC to be recreated by the statefulset-controller, and for MOCO to clone the data.
5. Once the cluster becomes Healthy, the user deletes the next Pod and PVC.
6. It is completed when all Pods and PVCs are recreated.
1. The user modifies the .spec.volumeClaimTemplates
of the MySQLCluster and sets a smaller volume size
For example, the user modifies the .spec.volumeClaimTemplates
of the MySQLCluster as follows:
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: default
name: test
spec:
replicas: 3
podTemplate:
spec:
containers:
- name: mysqld
image: ghcr.io/cybozu-go/moco/mysql:8.4.2
volumeClaimTemplates:
- metadata:
name: mysql-data
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
- storage: 1Gi
+ storage: 500Mi
2. MOCO updates the .spec.volumeClaimTemplates
of the StatefulSet. This does not propagate to existing Pods, PVCs, or PVs
The moco-controller will update the .spec.volumeClaimTemplates
of the StatefulSet.
The actual modification of the StatefulSet's .spec.volumeClaimTemplates
is not allowed,
so this change is achieved by recreating the StatefulSet.
At this time, only the recreation of StatefulSet is performed, without deleting the Pods and PVCs.
3. The user manually deletes the MySQL Pod & PVC
The user manually deletes the PVC and Pod. Use the following command to delete them:
$ kubectl delete --wait=false pvc <pvc-name>
$ kubectl delete --grace-period=1 pod <pod-name>
4. Wait for the Pod & PVC to be recreated by the statefulset-controller, and for MOCO to clone the data
The statefulset-controller recreates Pods and PVCs, creating a new PVC with a reduced size. Once MOCO successfully starts the Pod, it begins cloning the data.
$ kubectl get mysqlcluster,po,pvc
NAME AVAILABLE HEALTHY PRIMARY SYNCED REPLICAS ERRANT REPLICAS LAST BACKUP
mysqlcluster.moco.cybozu.com/test True False 0 2 <no value>
NAME READY STATUS RESTARTS AGE
pod/moco-test-0 3/3 Running 0 2m14s
pod/moco-test-1 3/3 Running 0 114s
pod/moco-test-2 0/3 Init:1/2 0 7s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/mysql-data-moco-test-0 Bound pvc-03c73525-0d6d-49de-b68a-f8af4c4c7faa 1Gi RWO standard 2m14s
persistentvolumeclaim/mysql-data-moco-test-1 Bound pvc-73c26baa-3432-4c85-b5b6-875ffd2456d9 1Gi RWO standard 114s
persistentvolumeclaim/mysql-data-moco-test-2 Bound pvc-779b5b3c-3efc-4048-a549-a4bd2d74ed4e 500Mi RWO standard 7s
5. Once the cluster becomes Healthy, the user deletes the next Pod and PVC
The user waits until the MySQLCluster state becomes Healthy, and then deletes the next Pod and PVC.
$ kubectl get mysqlcluster
NAME AVAILABLE HEALTHY PRIMARY SYNCED REPLICAS ERRANT REPLICAS LAST BACKUP
mysqlcluster.moco.cybozu.com/test True True 1 3 <no value>
6. It is completed when all Pods and PVCs are recreated
Repeat steps 3 to 5 until all Pods and PVCs are recreated.
References
RollingUpdate strategy
MOCO manages MySQLCluster pods using StatefulSets.
MySQLCluster/test
└─StatefulSet/moco-test
├─ControllerRevision/moco-test-554c56f456
├─ControllerRevision/moco-test-5794c57c7c
├─Pod/moco-test-0
├─Pod/moco-test-1
└─Pod/moco-test-2
By default, StatefulSet's standard rolling update does not consider whether MySQLCluster is Healthy during Pod updates. This can sometimes cause problems, as a rolling update may proceed even if MySQLCluster becomes Unhealthy during the process.
To address this issue, MOCO controls StatefulSet partitions to perform rolling updates. This behavior is enabled by default.
Partitions
By setting a number in .spec.updateStrategy.rollingUpdate.partition
of a StatefulSet, you can divide the rolling update into partitions.
When a partition is specified, pods with a pod number equal to or greater than the partition value are updated.
Pods with a pod number smaller than the partition value are not updated, and even if those pods are deleted, they will be recreated with the previous version.
Architecture
When Creating a StatefulSet
When creating a StatefulSet, MOCO sets the partition of the StatefulSet to the same value as the number of replicas using a MutatingAdmissionWebhook.
When Updating a StatefulSet
When a StatefulSet is updated, MOCO determines the contents of the StatefulSet update and controls partitions using AdmissionWebhook.
- If the StatefulSet update is only the partition number
  - The MutatingAdmissionWebhook does nothing.
- If fields other than the partition of the StatefulSet are updated
  - The MutatingAdmissionWebhook updates the partition of the StatefulSet to the same value as the number of replicas.

    replicas: 3
    ...
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        partition: 3
    ...
Updating Partitions
MOCO monitors the rollout status of the StatefulSet and the status of MySQLCluster. If the update of pods based on the current partition value is completed successfully and the containers are Running, and the status of MySQLCluster is Healthy, MOCO decrements the partition of the StatefulSet by 1. This operation is repeated until the partition value reaches 0.
Forcefully Rolling Out
By setting the annotation moco.cybozu.com/force-rolling-update
to true
, you can update the StatefulSet without partition control.
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
namespace: default
name: test
annotations:
moco.cybozu.com/force-rolling-update: "true"
...
When creating or updating a StatefulSet with the annotation moco.cybozu.com/force-rolling-update
set, MOCO deletes the partition setting using MutatingAdmissionWebhook.
Metrics
MOCO outputs the following metrics related to rolling updates:
- moco_cluster_current_replicas
  - The same as .status.currentReplicas of the StatefulSet.
- moco_cluster_updated_replicas
  - The same as .status.updatedReplicas of the StatefulSet.
- moco_cluster_last_partition_updated
  - The time the partition was last updated.
By setting an alert with the condition that moco_cluster_updated_replicas
is not equal to moco_cluster_replicas
and a certain amount of time has passed since moco_cluster_last_partition_updated
, you can detect MySQLClusters where the rolling update is stopped.
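The following Prometheus alerting rule is a sketch of such an alert. It assumes moco_cluster_last_partition_updated is exposed as a Unix timestamp; the one-hour threshold and 15-minute hold time are placeholders:

groups:
  - name: moco
    rules:
      - alert: MOCORollingUpdateStalled
        expr: |
          moco_cluster_updated_replicas != moco_cluster_replicas
          and (time() - moco_cluster_last_partition_updated) > 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Rolling update of MySQLCluster {{ $labels.namespace }}/{{ $labels.name }} appears to be stalled.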
Known issues
This document lists the known issues of MOCO.
Multi-threaded replication
Status: not fixed as of MOCO v0.9.5
If you use MOCO with MySQL version 8.0.25 or earlier, you should not configure the replicas with replica_parallel_workers
> 1.
Multi-threaded replication will cause the replica to fail to resume after a crash.
This issue is registered as #322 and will be addressed in the near future.
Custom resources
Custom Resources
Sub Resources
- BackupStatus
- MySQLClusterList
- MySQLClusterSpec
- MySQLClusterStatus
- ObjectMeta
- OverwriteContainer
- PersistentVolumeClaim
- PodTemplateSpec
- ReconcileInfo
- RestoreSpec
- ServiceTemplate
- BucketConfig
- JobConfig
BackupStatus
BackupStatus represents the status of the last successful backup.
Field | Description | Scheme | Required |
---|---|---|---|
time | The time of the backup. This is used to generate object keys of backup files in a bucket. | metav1.Time | true |
elapsed | Elapsed is the time spent on the backup. | metav1.Duration | true |
sourceIndex | SourceIndex is the ordinal of the backup source instance. | int | true |
sourceUUID | SourceUUID is the server_uuid of the backup source instance. | string | true |
uuidSet | UUIDSet is the server_uuid set of all candidate instances for the backup source. | map[string]string | true |
binlogFilename | BinlogFilename is the binlog filename that the backup source instance was writing to at the backup. | string | true |
gtidSet | GTIDSet is the GTID set of the full dump of database. | string | true |
dumpSize | DumpSize is the size in bytes of a full dump of database stored in an object storage bucket. | int64 | true |
binlogSize | BinlogSize is the size in bytes of a tarball of binlog files stored in an object storage bucket. | int64 | true |
workDirUsage | WorkDirUsage is the max usage in bytes of the working directory. | int64 | true |
warnings | Warnings are list of warnings from the last backup, if any. | []string | true |
MySQLCluster
MySQLCluster is the Schema for the mysqlclusters API
Field | Description | Scheme | Required |
---|---|---|---|
metadata | metav1.ObjectMeta | false | |
spec | MySQLClusterSpec | false | |
status | MySQLClusterStatus | false |
MySQLClusterList
MySQLClusterList contains a list of MySQLCluster
Field | Description | Scheme | Required |
---|---|---|---|
metadata | metav1.ListMeta | false | |
items | []MySQLCluster | true |
MySQLClusterSpec
MySQLClusterSpec defines the desired state of MySQLCluster
Field | Description | Scheme | Required |
---|---|---|---|
replicas | Replicas is the number of instances. Available values are positive odd numbers. | int32 | false |
podTemplate | PodTemplate is a Pod template for MySQL server container. | PodTemplateSpec | true |
volumeClaimTemplates | VolumeClaimTemplates is a list of PersistentVolumeClaim templates for MySQL server container. A claim named "mysql-data" must be included in the list. | []PersistentVolumeClaim | true |
primaryServiceTemplate | PrimaryServiceTemplate is a Service template for primary. | *ServiceTemplate | false |
replicaServiceTemplate | ReplicaServiceTemplate is a Service template for replica. | *ServiceTemplate | false |
mysqlConfigMapName | MySQLConfigMapName is a ConfigMap name of MySQL config. | *string | false |
replicationSourceSecretName | ReplicationSourceSecretName is a Secret name which contains replication source info. If this field is given, the MySQLCluster works as an intermediate primary. | *string | false |
collectors | Collectors is the list of collector flag names of mysqld_exporter. If this field is not empty, MOCO adds mysqld_exporter as a sidecar to collect and export mysqld metrics in Prometheus format. See https://github.com/prometheus/mysqld_exporter/blob/master/README.md#collector-flags for flag names. Example: ["engine_innodb_status", "info_schema.innodb_metrics"] | []string | false |
serverIDBase | ServerIDBase, if set, will become the base number of server-id of each MySQL instance of this cluster. For example, if this is 100, the server-ids will be 100, 101, 102, and so on. If the field is not given or zero, MOCO automatically sets a random positive integer. | int32 | false |
maxDelaySeconds | MaxDelaySeconds configures the readiness probe of mysqld container. For a replica mysqld instance, if it is delayed to apply transactions over this threshold, the mysqld instance will be marked as non-ready. The default is 60 seconds. Setting this field to 0 disables the delay check in the probe. | *int | false |
startupWaitSeconds | StartupWaitSeconds is the maximum duration to wait for mysqld container to start working. The default is 3600 seconds. | int32 | false |
logRotationSchedule | LogRotationSchedule specifies the schedule to rotate MySQL logs. If not set, the default is to rotate logs every 5 minutes. See https://pkg.go.dev/github.com/robfig/cron/v3#hdr-CRON_Expression_Format for the field format. | string | false |
backupPolicyName | The name of BackupPolicy custom resource in the same namespace. If this is set, MOCO creates a CronJob to take backup of this MySQL cluster periodically. | *string | false |
restore | Restore is the specification to perform Point-in-Time-Recovery from existing cluster. If this field is not null, MOCO restores the data as specified and create a new cluster with the data. This field is not editable. | *RestoreSpec | false |
disableSlowQueryLogContainer | DisableSlowQueryLogContainer controls whether to add a sidecar container named "slow-log" to output slow logs as the containers output. If set to true, the sidecar container is not added. The default is false. | bool | false |
agentUseLocalhost | AgentUseLocalhost configures the mysqld interface to bind to and be accessed over localhost instead of the pod name. During container init, moco-agent binds the mysqld admin interface to localhost. The moco-agent will also communicate with mysqld over localhost when acting as a sidecar. | bool | false |
offline | Offline sets the cluster offline, releasing compute resources. Data is not removed. | bool | false |
MySQLClusterStatus
MySQLClusterStatus defines the observed state of MySQLCluster
Field | Description | Scheme | Required |
---|---|---|---|
conditions | Conditions is an array of conditions. | []metav1.Condition | false |
currentPrimaryIndex | CurrentPrimaryIndex is the index of the current primary Pod in StatefulSet. Initially, this is zero. | int | true |
syncedReplicas | SyncedReplicas is the number of synced instances including the primary. | int | false |
errantReplicas | ErrantReplicas is the number of instances that have errant transactions. | int | false |
errantReplicaList | ErrantReplicaList is the list of indices of errant replicas. | []int | false |
backup | Backup is the status of the last successful backup. | BackupStatus | true |
restoredTime | RestoredTime is the time when the cluster data is restored. | *metav1.Time | false |
cloned | Cloned indicates if the initial cloning from an external source has been completed. | bool | false |
reconcileInfo | ReconcileInfo represents version information for reconciler. | ReconcileInfo | true |
ObjectMeta
ObjectMeta is metadata of objects. This is partially copied from metav1.ObjectMeta.
Field | Description | Scheme | Required |
---|---|---|---|
name | Name is the name of the object. | string | false |
labels | Labels is a map of string keys and values. | map[string]string | false |
annotations | Annotations is a map of string keys and values. | map[string]string | false |
OverwriteContainer
OverwriteContainer defines the container spec used for overwriting.
Field | Description | Scheme | Required |
---|---|---|---|
name | Name of the container to overwrite. | OverwriteableContainerName | true |
resources | Resources is the container resource to be overwritten. | *ResourceRequirementsApplyConfiguration | false |
PersistentVolumeClaim
PersistentVolumeClaim is a user's request for and claim to a persistent volume. This is slightly modified from corev1.PersistentVolumeClaim.
Field | Description | Scheme | Required |
---|---|---|---|
metadata | Standard object's metadata. | ObjectMeta | true |
spec | Spec defines the desired characteristics of a volume requested by a pod author. | PersistentVolumeClaimSpecApplyConfiguration | true |
PodTemplateSpec
PodTemplateSpec describes the data a pod should have when created from a template. This is slightly modified from corev1.PodTemplateSpec.
Field | Description | Scheme | Required |
---|---|---|---|
metadata | Standard object's metadata. The name in this metadata is ignored. | ObjectMeta | false |
spec | Specification of the desired behavior of the pod. The name of the MySQL server container in this spec must be mysqld . | PodSpecApplyConfiguration | true |
overwriteContainers | OverwriteContainers overwrites the container definitions provided by default by the system. | []OverwriteContainer | false |
ReconcileInfo
ReconcileInfo is the type to record the last reconciliation information.
Field | Description | Scheme | Required |
---|---|---|---|
generation | Generation is the metadata.generation value of the last reconciliation. See also https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#status-subresource | int64 | false |
reconcileVersion | ReconcileVersion is the version of the operator reconciler. | int | true |
RestoreSpec
RestoreSpec represents a set of parameters for Point-in-Time Recovery.
Field | Description | Scheme | Required |
---|---|---|---|
sourceName | SourceName is the name of the source MySQLCluster . | string | true |
sourceNamespace | SourceNamespace is the namespace of the source MySQLCluster . | string | true |
restorePoint | RestorePoint is the target date and time to restore data. The format is RFC3339. e.g. "2006-01-02T15:04:05Z" | metav1.Time | true |
jobConfig | Specifies parameters for restore Pod. | JobConfig | true |
schema | Schema is the name of the schema to restore. If empty, all schemas are restored. This is used for mysqlbinlog option --database . Thus, this option changes behavior depending on binlog_format. For more information, please read the following documentation. https://dev.mysql.com/doc/refman/8.0/en/mysqlbinlog.html#option_mysqlbinlog_database | string | false |
ServiceTemplate
ServiceTemplate defines the desired spec and annotations of Service
Field | Description | Scheme | Required |
---|---|---|---|
metadata | Standard object's metadata. Only annotations and labels are valid. | ObjectMeta | false |
spec | Spec is the ServiceSpec | *ServiceSpecApplyConfiguration | false |
BucketConfig
BucketConfig is a set of parameters to access an object storage bucket.
Field | Description | Scheme | Required |
---|---|---|---|
bucketName | The name of the bucket | string | true |
region | The region of the bucket. This can also be set through AWS_REGION environment variable. | string | false |
endpointURL | The API endpoint URL. Set this for non-S3 object storages. | string | false |
usePathStyle | Allows you to enable the client to use path-style addressing, i.e., https?://ENDPOINT/BUCKET/KEY. By default, a virtual-host addressing is used (https?://BUCKET.ENDPOINT/KEY). | bool | false |
backendType | BackendType is an identifier for the object storage to be used. | string | false |
caCert | Path to SSL CA certificate file used in addition to system default. | string | false |
JobConfig
JobConfig is a set of parameters for backup and restore job Pods.
Field | Description | Scheme | Required |
---|---|---|---|
serviceAccountName | ServiceAccountName specifies the ServiceAccount to run the Pod. | string | true |
bucketConfig | Specifies how to access an object storage bucket. | BucketConfig | true |
workVolume | WorkVolume is the volume source for the working directory. Since the backup or restore task can use a lot of bytes in the working directory, you should always give a volume with enough capacity. The recommended volume source is a generic ephemeral volume. https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes | VolumeSourceApplyConfiguration | true |
threads | Threads is the number of threads used for backup or restoration. | int | false |
cpu | CPU is the amount of CPU requested for the Pod. | *resource.Quantity | false |
maxCpu | MaxCPU is the amount of maximum CPU for the Pod. | *resource.Quantity | false |
memory | Memory is the amount of memory requested for the Pod. | *resource.Quantity | false |
maxMemory | MaxMemory is the amount of maximum memory for the Pod. | *resource.Quantity | false |
envFrom | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvFromSourceApplyConfiguration | false |
env | List of environment variables to set in the container. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvVarApplyConfiguration | false |
affinity | If specified, the pod's scheduling constraints. | *AffinityApplyConfiguration | false |
volumes | Volumes defines the list of volumes that can be mounted by containers in the Pod. | []VolumeApplyConfiguration | false |
volumeMounts | VolumeMounts describes a list of volume mounts that are to be mounted in a container. | []VolumeMountApplyConfiguration | false |
Custom Resources
Sub Resources
BackupPolicy
BackupPolicy is a namespaced resource that should be referenced from MySQLCluster.
Field | Description | Scheme | Required |
---|---|---|---|
metadata | metav1.ObjectMeta | false | |
spec | BackupPolicySpec | true |
BackupPolicyList
BackupPolicyList contains a list of BackupPolicy
Field | Description | Scheme | Required |
---|---|---|---|
metadata | | metav1.ListMeta | false |
items | | []BackupPolicy | true |
BackupPolicySpec
BackupPolicySpec defines the configuration items for MySQLCluster backup. The following fields will be copied to CronJob.spec: Schedule, StartingDeadlineSeconds, ConcurrencyPolicy, SuccessfulJobsHistoryLimit, FailedJobsHistoryLimit. The following fields will be copied to CronJob.spec.jobTemplate: ActiveDeadlineSeconds, BackoffLimit.
Field | Description | Scheme | Required |
---|---|---|---|
schedule | The schedule in Cron format for periodic backups. See https://en.wikipedia.org/wiki/Cron | string | true |
jobConfig | Specifies parameters for backup Pod. | JobConfig | true |
startingDeadlineSeconds | Optional deadline in seconds for starting the job if it misses scheduled time for any reason. Missed jobs executions will be counted as failed ones. | *int64 | false |
concurrencyPolicy | Specifies how to treat concurrent executions of a Job. Valid values are: - "Allow" (default): allows CronJobs to run concurrently; - "Forbid": forbids concurrent runs, skipping next run if previous run hasn't finished yet; - "Replace": cancels currently running job and replaces it with a new one | batchv1.ConcurrencyPolicy | false |
activeDeadlineSeconds | Specifies the duration in seconds relative to the startTime that the job may be continuously active before the system tries to terminate it; value must be positive integer. If a Job is suspended (at creation or through an update), this timer will effectively be stopped and reset when the Job is resumed again. | *int64 | false |
backoffLimit | Specifies the number of retries before marking this job failed. Defaults to 6 | *int32 | false |
successfulJobsHistoryLimit | The number of successful finished jobs to retain. This is a pointer to distinguish between explicit zero and not specified. Defaults to 3. | *int32 | false |
failedJobsHistoryLimit | The number of failed finished jobs to retain. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1. | *int32 | false |
BucketConfig
BucketConfig is a set of parameters to access an object storage bucket.
Field | Description | Scheme | Required |
---|---|---|---|
bucketName | The name of the bucket | string | true |
region | The region of the bucket. This can also be set through AWS_REGION environment variable. | string | false |
endpointURL | The API endpoint URL. Set this for non-S3 object storages. | string | false |
usePathStyle | Allows you to enable the client to use path-style addressing, i.e., https?://ENDPOINT/BUCKET/KEY. By default, a virtual-host addressing is used (https?://BUCKET.ENDPOINT/KEY). | bool | false |
backendType | BackendType is an identifier for the object storage to be used. | string | false |
caCert | Path to SSL CA certificate file used in addition to system default. | string | false |
JobConfig
JobConfig is a set of parameters for backup and restore job Pods.
Field | Description | Scheme | Required |
---|---|---|---|
serviceAccountName | ServiceAccountName specifies the ServiceAccount to run the Pod. | string | true |
bucketConfig | Specifies how to access an object storage bucket. | BucketConfig | true |
workVolume | WorkVolume is the volume source for the working directory. Since the backup or restore task can use a lot of bytes in the working directory, you should always give a volume with enough capacity. The recommended volume source is a generic ephemeral volume. https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes | VolumeSourceApplyConfiguration | true |
threads | Threads is the number of threads used for backup or restoration. | int | false |
cpu | CPU is the amount of CPU requested for the Pod. | *resource.Quantity | false |
maxCpu | MaxCPU is the amount of maximum CPU for the Pod. | *resource.Quantity | false |
memory | Memory is the amount of memory requested for the Pod. | *resource.Quantity | false |
maxMemory | MaxMemory is the amount of maximum memory for the Pod. | *resource.Quantity | false |
envFrom | List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvFromSourceApplyConfiguration | false |
env | List of environment variables to set in the container. You can configure S3 bucket access parameters through environment variables. See https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/config#EnvConfig | []EnvVarApplyConfiguration | false |
affinity | If specified, the pod's scheduling constraints. | *AffinityApplyConfiguration | false |
volumes | Volumes defines the list of volumes that can be mounted by containers in the Pod. | []VolumeApplyConfiguration | false |
volumeMounts | VolumeMounts describes a list of volume mounts that are to be mounted in a container. | []VolumeMountApplyConfiguration | false |
Commands
kubectl moco plugin
kubectl-moco
is a kubectl plugin for MOCO.
kubectl moco [global options] <subcommand> [sub options] args...
Global options
Global options are compatible with kubectl. For example, the following options are available.
Global options | Default value | Description |
---|---|---|
--kubeconfig | $HOME/.kube/config | Path to the kubeconfig file to use for CLI requests. |
-n, --namespace | default | If present, the namespace scope for this CLI request. |
MySQL users
You can choose one of the following users for the --mysql-user option value.
Name | Description |
---|---|
moco-readonly | A read-only user. |
moco-writable | A user that can edit users, databases, and tables. |
moco-admin | The super-user. |
kubectl moco mysql [options] CLUSTER_NAME [-- mysql args...]
Run mysql
command in a specified MySQL instance.
Options | Default value | Description |
---|---|---|
-u, --mysql-user | moco-readonly | Login as the specified user |
--index | index of the primary | Index of the target mysql instance |
-i, --stdin | false | Pass stdin to the mysql container |
-t, --tty | false | Stdin is a TTY |
Examples
This executes SELECT VERSION()
on the primary instance in mycluster
in foo
namespace:
$ kubectl moco -n foo mysql mycluster -- -N -e 'SELECT VERSION()'
To execute SQL from a file:
$ cat sample.sql | kubectl moco -n foo mysql -u moco-writable -i mycluster
To run mysql
interactively for the instance 2 in mycluster
in the default namespace:
$ kubectl moco mysql --index 2 -it mycluster
kubectl moco credential [options] CLUSTER_NAME
Fetch the credential information of a specified user
Options | Default value | Description |
---|---|---|
-u, --mysql-user | moco-readonly | Fetch the credential of the specified user |
--format | plain | Output format: plain or mycnf |
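For example, to print the moco-admin credential in my.cnf format for a hypothetical cluster named mycluster:
$ kubectl moco credential -u moco-admin --format mycnf mycluster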
kubectl moco switchover CLUSTER_NAME
Switch the primary instance to one of the replicas.
Stop or start clustering and reconciliation
Read Stop Clustering and Reconciliation.
kubectl moco stop clustering CLUSTER_NAME
Stop the clustering of the specified MySQLCluster.
kubectl moco start clustering CLUSTER_NAME
Start the clustering of the specified MySQLCluster.
kubectl moco stop reconciliation CLUSTER_NAME
Stop the reconciliation of the specified MySQLCluster.
kubectl moco start reconciliation CLUSTER_NAME
Start the reconciliation of the specified MySQLCluster.
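These subcommands take the same global options as the others. For illustration (the cluster and namespace names are hypothetical):
$ kubectl moco -n foo switchover mycluster
$ kubectl moco -n foo stop clustering mycluster
$ kubectl moco -n foo start clustering mycluster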
moco-controller
moco-controller
controls MySQL clusters on Kubernetes.
Environment variables
Name | Required | Description |
---|---|---|
POD_NAMESPACE | Yes | The namespace name where moco-controller runs. |
Command line flags
Flags:
--add_dir_header If true, adds the file directory to the header of the log messages
--agent-image string The image of moco-agent sidecar container (default "ghcr.io/cybozu-go/moco-agent:0.12.1")
--alsologtostderr log to standard error as well as files (no effect when -logtostderr=true)
--apiserver-qps-throttle int The maximum QPS to the API server. (default 20)
--backup-image string The image of moco-backup container (default "ghcr.io/cybozu-go/moco-backup:0.23.2")
--cert-dir string webhook certificate directory
--check-interval duration Interval of cluster maintenance (default 1m0s)
--fluent-bit-image string The image of fluent-bit sidecar container (default "ghcr.io/cybozu-go/moco/fluent-bit:3.0.2.1")
--grpc-cert-dir string gRPC certificate directory (default "/grpc-cert")
--health-probe-addr string Listen address for health probes (default ":8081")
-h, --help help for moco-controller
--leader-election-id string ID for leader election by controller-runtime (default "moco")
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory (no effect when -logtostderr=true)
--log_file string If non-empty, use this log file (no effect when -logtostderr=true)
--log_file_max_size uint Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
--logtostderr log to standard error instead of files (default true)
--max-concurrent-reconciles int The maximum number of concurrent reconciles which can be run (default 8)
--metrics-addr string Listen address for metric endpoint (default ":8080")
--mysql-configmap-history-limit int The maximum number of MySQLConfigMap's history to be kept (default 10)
--mysqld-exporter-image string The image of mysqld_exporter sidecar container (default "ghcr.io/cybozu-go/moco/mysqld_exporter:0.15.1.2")
--one_output If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)
--pprof-addr string Listen address for pprof endpoints. pprof is disabled by default
--pvc-sync-annotation-keys strings The keys of annotations from MySQLCluster's volumeClaimTemplates to be synced to the PVC
--pvc-sync-label-keys strings The keys of labels from MySQLCluster's volumeClaimTemplates to be synced to the PVC
--skip_headers If true, avoid header prefixes in the log messages
--skip_log_headers If true, avoid headers when opening log files (no effect when -logtostderr=true)
--stderrthreshold severity logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=true) (default 2)
-v, --v Level number for the log level verbosity
--version version for moco-controller
--vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
--webhook-addr string Listen address for the webhook endpoint (default ":9443")
--zap-devel Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
--zap-encoder encoder Zap log encoding (one of 'json' or 'console')
--zap-log-level level Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
--zap-stacktrace-level level Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
--zap-time-encoding time-encoding Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.
moco-backup
moco-backup
command is used in ghcr.io/cybozu-go/moco-backup
container.
Normally, users need not take care of this command.
Environment variables
moco-backup
takes configurations of S3 API from environment variables.
For details, read documentation of EnvConfig
in github.com/aws/aws-sdk-go-v2/config.
It also requires MYSQL_PASSWORD
environment variable to be set.
Global command-line flags
Global Flags:
--endpoint string S3 API endpoint URL
--region string AWS region
--threads int The number of threads to be used (default 4)
--use-path-style Use path-style S3 API
--work-dir string The writable working directory (default "/work")
--ca-cert string Path to SSL CA certificate file used in addition to system default
Subcommands
backup
subcommand
Usage: moco-backup backup BUCKET NAMESPACE NAME
BUCKET: The bucket name.
NAMESPACE: The namespace of the MySQLCluster.
NAME: The name of the MySQLCluster.
restore subcommand
Usage: moco-backup restore BUCKET SOURCE_NAMESPACE SOURCE_NAME NAMESPACE NAME YYYYMMDD-hhmmss
BUCKET: The bucket name.
SOURCE_NAMESPACE: The source MySQLCluster's namespace.
SOURCE_NAME: The source MySQLCluster's name.
NAMESPACE: The target MySQLCluster's namespace.
NAME: The target MySQLCluster's name.
YYYYMMDD-hhmmss: The point-in-time to restore data, e.g. 20210523-150423
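For illustration only (the bucket, namespaces, and cluster names below are hypothetical; the timestamp reuses the format shown above):
$ moco-backup backup my-bucket foo mycluster
$ moco-backup restore my-bucket foo mycluster bar restored 20210523-150423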
Metrics
moco-controller
moco-controller
provides the following kinds of metrics in Prometheus format.
Aside from the standard Go runtime and process metrics, it exposes metrics related to controller-runtime, MySQL clusters, and backups.
MySQL clusters
All these metrics are prefixed with moco_cluster_
and have name
and namespace
labels.
Name | Description | Type |
---|---|---|
checks_total | The number of times MOCO checked the cluster | Counter |
errors_total | The number of times MOCO encountered errors when managing the cluster | Counter |
available | 1 if the cluster is available, 0 otherwise | Gauge |
healthy | 1 if the cluster is running without any problems, 0 otherwise | Gauge |
switchover_total | The number of times MOCO changed the live primary instance | Counter |
failover_total | The number of times MOCO changed the failed primary instance | Counter |
replicas | The number of mysqld instances in the cluster | Gauge |
ready_replicas | The number of ready mysqld Pods in the cluster | Gauge |
current_replicas | The number of current replicas | Gauge |
updated_replicas | The number of updated replicas | Gauge |
last_partition_updated | The timestamp of the last successful partition update | Gauge |
clustering_stopped | 1 if clustering of the cluster is stopped, 0 otherwise | Gauge |
reconciliation_stopped | 1 if reconciliation of the cluster is stopped, 0 otherwise | Gauge |
errant_replicas | The number of mysqld instances that have errant transactions | Gauge |
processing_time_seconds | The time in seconds taken to process the cluster | Histogram |
volume_resized_total | The number of successful volume resizes | Counter |
volume_resized_errors_total | The number of failed volume resizes | Counter |
statefulset_recreate_total | The number of successful StatefulSet recreates | Counter |
statefulset_recreate_errors_total | The number of failed StatefulSet recreates | Counter |
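As a minimal sketch (the alert name, duration, and rule-file layout are illustrative), these metrics can drive Prometheus alerting rules such as:
groups:
- name: moco-cluster
  rules:
  - alert: MySQLClusterUnavailable
    expr: moco_cluster_available == 0
    for: 5m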
Backup
All these metrics are prefixed with moco_backup_
and have name
and namespace
labels.
Name | Description | Type |
---|---|---|
timestamp | The number of seconds since January 1, 1970 UTC of the last successful backup | Gauge |
elapsed_seconds | The number of seconds taken for the last backup | Gauge |
dump_bytes | The size of compressed full backup data | Gauge |
binlog_bytes | The size of compressed binlog files | Gauge |
workdir_usage_bytes | The maximum usage of the working directory | Gauge |
warnings | The number of warnings in the last successful backup | Gauge |
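For example, a PromQL expression like the following (the two-day threshold is only an assumption) finds clusters whose last successful backup is too old:
time() - moco_backup_timestamp > 2 * 86400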
MySQL instance
For each mysqld
instance, moco-agent exposes a set of metrics.
Read github.com/cybozu-go/moco-agent/blob/main/docs/metrics.md for details.
Also, if you give a set of collector flag names to spec.collectors
of MySQLCluster, a sidecar container running mysqld_exporter exposes the collected metrics for each mysqld
instance.
Scrape rules
This is an example kubernetes_sd_config
for Prometheus to collect all MOCO & MySQL metrics.
scrape_configs:
- job_name: 'moco-controller'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_label_app_kubernetes_io_component,__meta_kubernetes_pod_container_port_name]
action: keep
regex: moco-system;moco-controller;metrics
- job_name: 'moco-agent'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_container_port_name,__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
action: keep
regex: mysql;agent-metrics;moco-.*
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- job_name: 'moco-mysql'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name,__meta_kubernetes_pod_container_port_name,__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
action: keep
regex: mysql;mysqld-metrics;moco-.*
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
action: replace
target_label: name
- source_labels: [__meta_kubernetes_pod_label_statefulset_kubernetes_io_pod_name]
action: replace
target_label: index
regex: .*-([0-9])
- source_labels: [__meta_kubernetes_pod_label_moco_cybozu_com_role]
action: replace
target_label: role
The collected metrics should have these labels:
- namespace: MySQLCluster's metadata.namespace
- name: MySQLCluster's metadata.name
- index: The ordinal of the MySQL instance Pod
Design notes
Design notes
The purpose of this document is to describe the background and the goals of MOCO. Implementation details are described in other documents.
Motivation
We are creating our own Kubernetes operator for clustering MySQL instances for the following reasons:
Firstly, our application requires strict compatibility with traditional MySQL. Although recent MySQL provides an advanced clustering solution called group replication that is based on Paxos, we cannot use it because of its various limitations.
Secondly, we want a Kubernetes-native and simple operator. For example, we can use a Kubernetes Service to load-balance read queries to multiple replicas. Also, we do not want to support non-GTID based replication.
Lastly, none of the existing operators could satisfy our requirements.
Goals
- Manage primary-replica style clustering of MySQL instances.
- The primary instance is the only instance that allows writes.
- Replica instances replicate data from the primary and are read-only.
- Support replication from an external MySQL instance.
- Support all four transaction isolation levels.
- No split-brain.
- Allow large transactions.
- Upgrade the operator without restarting MySQL Pods.
- Safe and automatic upgrading of MySQL version.
- Support automatic primary selection and switchover.
- Support automatic failover.
- Backup and restore features.
- Support point-in-time recovery (PiTR).
- Tenant users can specify the following parameters:
- The version of MySQL instances.
- The number of processor cores for each MySQL instance.
- The amount of memory for each MySQL instance.
- The amount of backing storage for each MySQL instance.
- The number of replicas in the MySQL cluster.
- Custom configuration parameters.
- Allow
CREATE / DROP TEMPORARY TABLE
during a transaction.
Non-goals
- Support for older MySQL versions (5.6, 5.7)
  As a late comer, we focus our development effort on the latest MySQL. This simplifies things and allows us to use advanced mechanisms such as CLONE INSTANCE.
- Node fencing
  Fencing is a technique to safely isolate a failed Node. MOCO does not rely on Node fencing as it should be done externally.
  We can still implement failover in a safe way by configuring semi-sync parameters appropriately.
How MOCO reconciles MySQLCluster
MOCO creates and updates a StatefulSet and related resources for each MySQLCluster custom resource. This document describes how and when MOCO updates them.
- Reconciler versions
- The update policy of moco-agent container
- Clustering related resources
- Backup and restore related resources
- Status of Reconciliation
Reconciler versions
MOCO's reconciliation routine should be consistent to avoid frequent updates.
That said, we may need to modify the reconciliation process in the future. To avoid unnecessary updates to the StatefulSet, MOCO has multiple versions of reconcilers.
For example, if a MySQLCluster is reconciled with version 1 of the reconciler, MOCO will keep using the version 1 reconciler to reconcile the MySQLCluster.
If the user edits MySQLCluster's spec
field, MOCO can reconcile the MySQLCluster with the latest reconciler, for example version 2, because the user shall be ready for mysqld restarts.
The update policy of moco-agent container
We shall try to avoid updating moco-agent as much as possible.
Clustering related resources
The figure below illustrates the overview of resources related to clustering MySQL instances.
StatefulSet
MOCO tries not to update the StatefulSet frequently. It updates the StatefulSet only when the update is a must.
The conditions for StatefulSet update
The StatefulSet will be updated when:
- Some fields under spec of MySQLCluster are modified.
- my.cnf for mysqld is updated.
- The version of the reconciler used to reconcile the StatefulSet is obsoleted.
- The image of moco-agent given to the controller is updated.
- The image of mysqld_exporter given to the controller is updated.
When the StatefulSet is not updated
- The image of fluent-bit given to the controller is changed.
  - This is because the controller does not depend on fluent-bit.
The fluent-bit sidecar container is updated only when some fields under spec of MySQLCluster are modified.
Status about StatefulSet
- In MySQLCluster.Status.Condition, there is a condition named StatefulSetReady.
- This indicates the readiness of the StatefulSet.
- The condition will be True when the rolling update of the StatefulSet completely finishes.
Secrets
MOCO generates random passwords for users that MOCO uses to access MySQL.
The generated passwords are stored in two Secrets.
One is in the same namespace as moco-controller
, and the other is in the namespace of MySQLCluster.
Certificate
MOCO creates a Certificate in the same namespace as moco-controller
to issue a TLS certificate for moco-agent
.
After cert-manager issues a TLS certificate and creates a Secret for it, MOCO copies the Secret to the namespace of MySQLCluster. For details, read security.md.
Service
MOCO creates three Services for each MySQLCluster, namely:
- A headless Service, required for every StatefulSet
- A Service for the primary mysqld instance
- A Service for replica mysqld instances
The Services' labels, annotations, and spec
fields can be customized with MySQLCluster's spec.primaryServiceTemplate
and spec.replicaServiceTemplate
field.
The spec.primaryServiceTemplate
configures the Service for the primary mysqld instance
and the spec.replicaServiceTemplate
configures the Service for the replica mysqld instances.
The following fields in Service spec
may not be customized, though.
clusterIP
selector
The ports field in the Service spec is also customizable. However, for the mysql and mysqlx ports, MOCO overwrites the port, protocol, and targetPort fields with fixed values.
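The following sketch customizes only the primary Service; the annotation key and Service type are illustrative, and required fields such as podTemplate and volumeClaimTemplates are omitted:
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  name: mycluster
spec:
  replicas: 3
  primaryServiceTemplate:
    metadata:
      annotations:
        example.com/load-balancer: "internal"
    spec:
      type: LoadBalancer
  ...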
ConfigMap
MOCO creates and updates a ConfigMap for my.cnf
.
The name of this ConfigMap is calculated from the contents of my.cnf
that may be changed by users.
MOCO deletes old ConfigMaps of my.cnf
after a new ConfigMap for my.cnf
is created.
If the cluster does not disable a sidecar container for slow query logs, MOCO creates a ConfigMap for the sidecar.
PodDisruptionBudget
MOCO creates a PodDisruptionBudget for each MySQLCluster to prevent too few semi-sync replica servers.
The spec.maxUnavailable
value is calculated from MySQLCluster's
spec.replicas
as follows:
`spec.maxUnavailable` = floor(`spec.replicas` / 2)
If spec.replicas
is 1, MOCO does not create a PDB.
ServiceAccount
MOCO creates a ServiceAccount for Pods of the StatefulSet. The ServiceAccount is not bound to any Roles/ClusterRoles.
Backup and restore related resources
See backup.md for the overview of the backup and restoration mechanism.
CronJob
This is the only resource created when backup is enabled for MySQLCluster.
If the backup is disabled, the CronJob is deleted.
Job
To restore data from a backup, MOCO creates a Job. MOCO deletes the Job after the Job finishes successfully.
If the Job fails, MOCO leaves the Job.
Status of Reconciliation
- In MySQLCluster.Status.Condition, there is a condition named ReconcileSuccess.
- This indicates the status of reconciliation.
- The condition will be True when the reconcile function successfully finishes.
How MOCO maintains MySQL clusters
For each MySQLCluster, MOCO creates and maintains a set of mysqld
instances.
The set contains one primary instance and may contain multiple replica instances depending on the spec.replicas
value of MySQLCluster.
This document describes how MOCO does this job safely.
Terminology
- Replication: GTID-based replication between mysqld instances.
- Cluster: a group of mysqld instances that replicate data between them.
- Primary (instance): a single source instance of mysqld in a cluster.
- Replica (instance): a read-only instance of mysqld that synchronizes data with the primary instance.
- Intermediate primary: a special primary instance that replicates data from an external mysqld.
- Errant transaction: a transaction that exists only on a replica instance.
- Errant replica: a replica instance that has errant transactions.
- Switchover: operation to change a live primary to a replica and promote a replica to the new primary.
- Failover: operation to replace a dead primary with a replica.
Prerequisites
MySQLCluster allows positive odd numbers for spec.replicas
value. If 1, MOCO runs a single mysqld
instance without configuring replication. If 3 or greater, MOCO chooses a mysqld
instance as a primary, writable instance and configures all other instances as replicas of the primary instance.
status.currentPrimaryIndex
in MySQLCluster is used to record the current chosen primary instance.
Initially, status.currentPrimaryIndex
is zero and therefore the index of the primary instance is zero.
As a special case, if spec.replicationSourceSecretName
is set for MySQLCluster, the primary instance is configured as a replica of an external MySQL server. In this case, the primary instance will not be writable. We call this type of primary instance intermediate primary.
If spec.replicationSourceSecretName
is not set, MOCO configures semisynchronous replication between the primary and replicas. Otherwise, the replication is asynchronous.
For semi-synchronous replication, MOCO configures rpl_semi_sync_master_timeout
long enough so that it never degrades to asynchronous replication.
Likewise, MOCO configures rpl_semi_sync_master_wait_for_slave_count to (spec.replicas - 1) / 2 to make sure that at least half of the replica instances have the same commit as the primary. For example, if spec.replicas is 5, rpl_semi_sync_master_wait_for_slave_count will be set to 2.
MOCO also disables relay_log_recovery
because enabling it would drop the relay logs on replicas.
mysqld
always starts with super_read_only=1
to prevent erroneous writes, and with skip_replica_start
to prevent misconfigured replication.
moco-agent
, a sidecar container for MOCO, initializes MySQL users and plugins. At the end of the initialization, it issues RESET MASTER | RESET BINARY LOGS AND GTIDS to clear the executed GTID set.
moco-agent
also provides a readiness probe for mysqld
container. If a replica instance does not start replication threads or is too delayed to execute transactions, the container and the Pod will be determined as unready.
Limitations
Currently, MOCO does not re-initialize data after the primary instance fails.
After failover to a replica instance, the old primary may have errant transactions because it may recover unacknowledged transactions in its binary log. This is an inevitable limitation in MySQL semi-synchronous replication.
If this happens, MOCO detects the errant transaction and will not allow the old primary to rejoin the cluster as a replica.
Users need to delete the volume data (PersistentVolumeClaim) and the pod of the old primary to re-initialize it.
Possible states
MySQLCluster
MySQLCluster can be one of the following states.
The initial state is Cloning if spec.replicationSourceSecretName
is set, or Restoring if spec.restore
is set.
Otherwise, the initial state is Incomplete.
Note that, if the primary Pod is ready, the mysqld is assured to be writable. Likewise, if a replica Pod is ready, the mysqld is assured to be read-only and to be running replication threads without too much delay.
- Healthy
- All Pods are ready.
- All replicas have no errant transactions.
- All replicas are read-only and connected to the primary.
- For intermediate primary instance, the primary works as a replica for an external
mysqld
and is read-only.
- Cloning
  - spec.replicationSourceSecretName is set.
  - status.cloned is false.
  - The cloning result exists and is not "Completed", or there is no cloning result and the instance has no data.
  - (note: if the primary has some data and has no cloning result, the instance used to be a replica and was then promoted to the primary.)
- Restoring
  - spec.restore is set.
  - status.restoredTime is not set.
- Degraded
- The primary Pod is ready and does not lose data.
- For intermediate primary instance, the primary works as a replica for an external
mysqld
and is read-only. - Half or more replicas are ready, read-only, connected to the primary, and have no errant transactions. For example, if
spec.replicas
is 5, two or more such replicas are needed. - At least one replica has some problems.
- Failed
- The primary instance is not running or lost data.
- More than half of replicas are running and have data without errant transactions. For example, if
spec.replicas
is 5, three or more such replicas are needed.
- Lost
- The primary instance is not running or lost data.
- Half or more replicas are not running or lost data or have errant transactions.
- Incomplete
- None of the above states applies.
MOCO can recover the cluster to Healthy from Degraded, Failed, or Incomplete if all Pods are running and there are no errant transactions.
MOCO can recover the cluster to Degraded from Failed when not all Pods are running. Recovering from Failed is called failover.
MOCO cannot recover the cluster from Lost. Users need to restore data from backups.
Pod
mysqld
is run as a container in a Pod.
Therefore, MOCO needs to be aware of the following conditions.
- Missing: the Pod does not exist.
- Exist: the Pod exists and is not Terminating or Demoting.
- Terminating: the Pod exists and metadata.deletionTimestamp is not null.
- Demoting: the Pod exists and has the moco.cybozu.com/demote: true annotation.
If there are missing Pods, MOCO does nothing for the MySQLCluster.
If a primary instance Pod is Terminating or Demoting, MOCO controller changes the primary to one of the replica instances. This operation is called switchover.
MySQL data
MOCO checks whether replica instances have errant transactions compared to the primary instance. If it detects such an instance, MOCO records the instance in MySQLCluster and excludes it from the cluster.
The user needs to delete the Pod and the volume manually and let the StatefulSet controller re-create them. After a newly initialized instance gets created, MOCO will allow it to rejoin the cluster.
Invariants
- By definition, the primary instance recorded in MySQLCluster has no errant transactions. It is always the single source of truth.
- Errant replicas are not treated as ready even if their Pod status is ready.
The maintenance flow
MOCO runs the following infinite loop for each MySQLCluster. It stops when MySQLCluster resource is deleted.
1. Gather the current status
2. Update status of MySQLCluster
3. Determine what MOCO should do for the cluster
4. If there is nothing to do, wait a while and go to 1
5. Do the determined operation, then go to 1
Gather the current status
MOCO gathers the information from kube-apiserver
and mysqld
as follows:
- MySQLCluster resource
- Pod resources
  - If some of the Pods are missing, MOCO does nothing.
- mysqld
  - SHOW REPLICAS (on the primary)
  - SHOW REPLICA STATUS (on the replicas)
  - Global variables such as gtid_executed or super_read_only
  - Result of CLONE from the performance_schema.clone_status table
If MOCO cannot connect to an instance for a certain period, that instance is determined as failed.
Update status
of MySQLCluster
In this phase, MOCO updates status
field of MySQLCluster as follows:
- Determine the current MySQLCluster state.
- Add or update type=Initialized condition to status.conditions as True if the cluster state is not Cloning.
  - otherwise, False.
- Add or update type=Available condition to status.conditions as True if the cluster state is Healthy or Degraded.
  - otherwise, False.
- Add or update type=Healthy condition to status.conditions as True if the cluster state is Healthy.
  - otherwise, False.
  - The Reason field is set to the cluster state such as "Failed" or "Incomplete".
- Set the number of ready replica Pods to status.syncedReplicas.
- Add newly found errant replicas to status.errantReplicaList.
- Remove re-initialized and/or no-longer errant replicas from status.errantReplicaList.
- Set status.errantReplicas to the length of status.errantReplicaList.
- Set status.cloned to true if spec.replicationSourceSecret is not nil and the state is not Cloning.
Determine what MOCO should do for the cluster
The operation depends on the current cluster state.
The operation and its result are recorded as Events of MySQLCluster resource.
cf. Application Introspection and Debugging
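For example, the recorded conditions and events can be inspected with plain kubectl (the names below are illustrative):
$ kubectl -n foo get mysqlcluster mycluster -o jsonpath='{.status.conditions}'
$ kubectl -n foo describe mysqlcluster mycluster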
Healthy
If the primary instance Pod is Terminating or Demoting, switch the primary instance to another replica. Otherwise, just wait a while.
The switchover is done as follows. It takes at least several seconds for a new primary to become writable.
- Make the primary instance super_read_only=1.
- Kill all existing connections except ones from localhost and ones for MOCO.
- Wait for a replica to catch up the executed GTID set of the primary instance.
- Set status.currentPrimaryIndex to the replica's index.
- If the old primary is Demoting, remove the moco.cybozu.com/demote annotation from the Pod.
Cloning
Execute CLONE INSTANCE
on the intermediate primary instance to clone data from an external MySQL instance.
If the cloning succeeds, do the same as the Intermediate case.
Restoring
Do nothing.
Degraded
First, check if the primary instance Pod is Terminating or Demoting, and if it is, do the switchover just like Healthy case.
Then, do the same as Intermediate case to try to fix the problems. It is not possible to recover the cluster to Healthy if there are errant or stopped replicas, though.
Failed
MOCO chooses the most advanced instance as the new primary instance. The most advanced means that its retrieved GTID set is a superset of all other replicas except for those that have errant transactions.
To prevent accidental writes to the old primary instance (so-called split-brain), MOCO stops replication IO_THREAD for all replicas. This way, the old primary cannot get necessary acks from replicas to write further transactions.
The failover is done as follows:
- Stop IO_THREAD on all replicas.
- Choose the most advanced replica as the new primary. Errant replicas recorded in MySQLCluster are excluded from the candidates.
- Wait for the replica to execute all of the retrieved GTID set.
- Update status.currentPrimaryIndex to the new primary's index.
Lost
There is nothing that can be done.
Intermediate
- On the primary that was an intermediate primary, wait for all the retrieved GTID set to be executed.
- Start replication between the primary and non-errant replicas.
  - If a replica has no data, MOCO clones the primary data to the replica first.
- Stop replication of errant replicas.
- Set super_read_only=1 for replica instances that are writable.
- Adjust the moco.cybozu.com/role label on Pods according to their roles.
  - For errant replicas, the label is removed to prevent users from reading inconsistent data.
- Finally, make the primary mysqld writable if the primary is not an intermediate primary.
Backup and restore
This document describes how MOCO takes a backup of MySQLCluster data and restores a cluster from a backup.
Overview
A MySQLCluster can be configured to take backups regularly by referencing a BackupPolicy in spec.backupPolicyName
. For each MySQLCluster associated with a BackupPolicy, moco-controller
creates a CronJob.
The CronJob creates a Job to take a full backup periodically.
The Job also takes a backup of binary logs for Point-in-Time Recovery (PiTR).
The backups are stored in an S3-compatible object storage bucket.
This figure illustrates how MOCO takes a backup of a MySQLCluster.
moco-controller
creates a CronJob and Role/RoleBinding to allow access to MySQLCluster for the Job Pod.
- At each configured interval, the CronJob creates a Job.
- The Job dumps all data from a
mysqld
using MySQL shell's dump instance utility. - The Job creates a tarball of the dumped data and put it in a bucket of S3 compatible object storage.
- The Job also dumps binlogs since the last backup and put it in the same bucket (with a different name, of course).
- The Job finally updates MySQLCluster status to record the last successful backup.
To restore from a backup, users need to create a new MySQLCluster with spec.restore
filled with necessary information such as the bucket name of the object storage, the object key, and so on.
The next figure illustrates how MOCO restores MySQL cluster from a backup.
moco-controller
creates a Job and Role/RoleBinding for restoration.- The Job downloads a tarball of dumped files of the specified backup.
- The Job loads data into an empty
mysqld
using MySQL shell's dump loading utility. - If the user wanted to restore data at a point-in-time, the Job downloads saved binlogs.
- The Job applies binlogs up to the specified point-in-time using
mysqlbinlog
. - The Job finally updates MySQLCluster status to record the restoration time.
Design goals
Must:
- Users must be able to configure different backup policies for each MySQLCluster.
- Users must be able to restore MySQL data at a point-in-time from backups.
- Users must be able to restore MySQL data without the original MySQLCluster resource.
moco-controller
must export metrics about backups.
Should:
- Backup data should be compressed to save the storage space.
- Backup data should be stored in an object storage.
- Backups should be taken from a replica instance as much as possible.
These "should's" are mostly in terms of money or performance.
Implementation
Backup file keys
Backup files are stored in an object storage bucket with the following keys.
- Key for a tarball of a fully dumped MySQL:
moco/<namespace>/<name>/YYYYMMDD-hhmmss/dump.tar
- Key for a compressed tarball of binlog files:
moco/<namespace>/<name>/YYYYMMDD-hhmmss/binlog.tar.zst
<namespace>
is the namespace of MySQLCluster, and <name>
is the name of MySQLCluster.
YYYYMMDD-hhmmss
is the date and time of the backup where YYYY
is the year, MM
is two-digit month, DD
is two-digit day, hh
is two-digit hour in 24-hour format, mm
is two-digit minute, and ss
is two-digit second.
Example: moco/foo/bar/20210515-230003/dump.tar
This allows multiple MySQLClusters to share the same bucket.
Timestamps
Internally, the time for PiTR is formatted in UTC timezone.
The restore Job runs mysqlbinlog
with TZ=Etc/UTC
timezone.
Backup
As described in Overview, the backup process is implemented with CronJob and Job. In addition, users need to provide a ServiceAccount for the Job.
The ServiceAccount is often used to grant access to the object storage bucket where the backup files will be stored. For instance, Amazon Elastic Kubernetes Service (EKS) has a feature to create such a ServiceAccount. Kubernetes itself is also developing such an enhancement called Container Object Storage Interface (COSI).
To allow the backup Job to update MySQLCluster status, MOCO creates Role and RoleBinding. The RoleBinding grants the access to the given ServiceAccount.
By default, MOCO uses the Amazon S3 API, the most popular object storage API. Therefore, it also works with object storage that has an S3-compatible API, such as MinIO and Ceph. Object storage that uses non-S3 compatible APIs is only partially supported.
Currently supported object storage includes:
- Amazon S3-compatible API
- Google Cloud Storage API
For the first backup, the backup Job chooses a replica instance as the backup source if available. For the second and subsequent backups, the Job will choose the previously chosen instance as long as it is still a replica and available.
The backups are divided into two: a full dump and binlogs.
A full dump is a snapshot of the entire MySQL database.
Binlogs are records of transactions.
With mysqlbinlog
, binlogs can be used to apply transactions to a database restored from a full dump for PiTR.
For the first backup, MOCO only takes a full dump of a MySQL instance and records the GTID set at the time of the backup. For the second and subsequent backups, MOCO will retrieve binlogs since the GTID of the last backup until now.
To take a full dump, MOCO uses MySQL shell's dump instance utility.
It performs significantly faster than mysqldump
or mysqlpump
.
The dump is compressed with zstd compression algorithm.
MOCO then creates a tarball of the dump and puts it to an object storage bucket.
To retrieve transactions since the last backup until now, mysqlbinlog
is used with these flags:
--read-from-remote-source=BINLOG-DUMP-GTIDS
--exclude-gtids=<the GTID of the last backup>
--to-last-log
The retrieved binlog files are packed into a tarball and compressed with zstd, then put to an object storage bucket.
Finally, the Job updates MySQLCluster status field with the following information:
- The time of backup
- The time spent on the backup
- The ordinal of the backup source instance
server_uuid
of the instance (to check whether the instance was re-initialized or not)
- The binlog filename in
SHOW MASTER STATUS | SHOW BINARY LOG STATUS
output. - The size of the tarball of the dumped files
- The size of the tarball of the binlog files
- The maximum usage of the working directory
- Warnings, if any
When executing an incremental backup, the backup source must be a pod whose server_uuid has not changed since the last backup. If the server_uuid has changed, the pod may be missing some of the binlogs generated since the last backup.
The following is how to choose a pod to be the backup source.
flowchart TD
A{"first time?"}
A -->|"yes"| B
A -->|"no"| C["x ← Get the indexes of the pod whose server_uuid has not changed"] --> D
B{Are replicas available?}
B -->|"yes"| B1["return\nreplicaIdx\ndoBackupBinlog=false"]
style B1 fill:#c1ffff
B -->|"no"| B2["return\nprimaryIdx\ndoBackupBinlog=false"]
style B2 fill:#ffffc1
D{"Is x empty?"}
D -->|"yes"| E["add warning to bm.warnings"] --> F
style E fill:#ffc1c1
D -->|"no"| G
F{"Are replicas available?"}
F -->|"yes"| F1["return\nreplicaIdx\ndoBackupBinlog=false"]
style F1 fill:#ffc1c1
F -->|"no"| F2["return\nprimaryIdx\ndoBackupBinlog=false"]
style F2 fill:#ffc1c1
G{"Are there replica indexes in x?"}
G -->|"yes"| H
G -->|"no"| G1["return\nprimaryIdx\ndoBackupBinlog=true"]
style G1 fill:#ffffc1
H{"Is lastIndex included in x?"}
H -->|"yes"| I
H -->|"no"| H1["return\nreplicaIdx\ndoBackupBinlog=true"]
style H1 fill:#c1ffff
I{"Is lastIndex primary?"}
I -->|"yes"| I1["return\nreplicaIdx\ndoBackupBinlog=true"]
style I1 fill:#c1ffff
I -->|"no"| I2["return\nlastIdx\ndoBackupBinlog=true"]
style I2 fill:#c1ffff
Restore
To restore MySQL data from a backup, users need to create a new MySQLCluster with appropriate spec.restore
field.
spec.restore
needs to provide at least the following information:
- The bucket name
- Namespace and name of the original MySQLCluster
- A point-in-time in RFC3339 format
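A minimal sketch of such a MySQLCluster follows; the restore field names should be checked against the MySQLCluster CRD reference, and the names, point-in-time, and jobConfig values are illustrative, with other required fields omitted:
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  name: restored
spec:
  replicas: 3
  restore:
    sourceName: mycluster
    sourceNamespace: foo
    restorePoint: "2021-05-23T15:04:23Z"
    jobConfig:
      serviceAccountName: backup-owner
      bucketConfig:
        bucketName: moco
      workVolume:
        emptyDir: {}
  ...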
After moco-controller
identifies mysqld
is running, it creates a Job to retrieve backup files and load them into mysqld
.
The Job looks for the most recent tarball of the dumped files that is older than the specified point-in-time in the bucket, and retrieves it.
The dumped files are then loaded to mysqld
using MySQL shell's load dump utility.
If the point-in-time is different from the time of the dump file, and if there is a compressed tarball of binlog files, then the Job retrieves binlog files and applies transactions up to the point-in-time.
After the restoration process finishes, the Job updates MySQLCluster status to record the restoration time.
moco-controller
then configures the clustering as usual.
If the Job fails, moco-controller
leaves the Job as is.
The restored MySQL cluster will also be left read-only.
If some of the data have been restored, they can be read from the cluster.
If a failed Job is deleted, moco-controller
will create a new Job to give it another chance.
Users can safely delete a successful Job.
Caveats
- No automatic deletion of backup files
  MOCO does not delete old backup files from object storage. Users should configure a bucket lifecycle policy to delete old backups automatically.
- Duplicated backup Jobs
  CronJob may create two or more Jobs at a time. If this happens, only one Job can update MySQLCluster status.
- Lost binlog files
  If binlog_expire_logs_seconds or expire_logs_days is set to a shorter value than the interval of backups, MOCO cannot save binlogs correctly. Users are responsible for configuring binlog_expire_logs_seconds appropriately.
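As an illustrative my.cnf fragment only (the value is an assumption; choose something longer than your backup interval):
binlog_expire_logs_seconds = 1209600  # 14 days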
Considered options
There were many design choices and alternative methods to implement backup/restore feature for MySQL. Here are descriptions of why we determined the current design.
Why do we use S3-compatible object storage to store backups?
Compared to file systems, object storage is generally more cost-effective. It also has many useful features such as object lifecycle management.
AWS S3 API is the most prevailing API for object storages.
What object storage is supported?
MOCO currently supports the following object storage APIs:
- Amazon S3
- Google Cloud Storage
MOCO uses the Amazon S3 API by default.
You can set BackupPolicy.spec.jobConfig.bucketConfig.backendType to specify the object storage API to use. Currently, two identifiers can be specified: s3 or gcs. If not specified, it defaults to s3.
The following is an example of a backup setup using Google Cloud Storage:
apiVersion: moco.cybozu.com/v1beta2
kind: BackupPolicy
...
spec:
schedule: "@daily"
jobConfig:
serviceAccountName: backup-owner
env:
- name: GOOGLE_APPLICATION_CREDENTIALS
value: <dummy>
bucketConfig:
bucketName: moco
endpointURL: https://storage.googleapis.com
backendType: gcs
workVolume:
emptyDir: {}
Why do we use Jobs for backup and restoration?
Backup and restoration can be a CPU- and memory-consuming task.
Running such a task in moco-controller
is dangerous because moco-controller
manages a lot of MySQLClusters.
moco-agent
is also not a safe place to run backup job because it is a sidecar of mysqld
Pod.
If backup is run in mysqld
Pod, it would interfere with the mysqld
process.
Why do we prefer mysqlsh
to mysqldump
?
The biggest reason is the difference in how these tools lock the instance.
mysqlsh
uses LOCK INSTANCE FOR BACKUP
which blocks DDL until the lock is released. mysqldump
, on the other hand, allows DDL to be executed. Once a DDL statement is executed and acquires a metadata lock, any DML for the table modified by that DDL will be blocked.
Blocking DML during backup is not desirable, especially when the only available backup source is the primary instance.
Another reason is that mysqlsh
is much faster than mysqldump
/ mysqlpump
.
Why don't we do continuous backup?
Continuous backup is a technique to save executed transactions in real time.
For MySQL, this can be done with mysqlbinlog --stop-never
. This command continuously retrieves transactions from binary logs and outputs them to stdout.
MOCO does not adopt this technique for the following reasons:
-
We assume MOCO clusters have replica instances in most cases.
When the data of the primary instance is lost, one of replicas can be promoted as a new primary.
-
It is troublesome to control the continuous backup process on Kubernetes.
The process needs to be kept running between full backups. If we do so, the entire backup process should be a persistent workload, not a (Cron)Job.
Upgrading mysqld
This document describes how mysqld upgrades its data and what MOCO has to do about it.
Preconditions
MySQL data
Beginning with 8.0.16, mysqld
can update all data that need to be updated when it starts running.
This means that MOCO itself needs to do nothing to upgrade MySQL data.
One thing that we should care about is that the update process may take a long time.
The startup probe of mysqld
container should be configured to wait for mysqld
to
complete updating data.
ref: https://dev.mysql.com/doc/refman/8.0/en/upgrading-what-is-upgraded.html
Downgrading
MySQL 8.0 does not support any kind of downgrading.
ref: https://dev.mysql.com/doc/refman/8.0/en/downgrading.html
Internally, MySQL has a version called "data dictionary (DD) version". If two MySQL versions have the same DD version, they are considered to have data compatibility.
ref: https://github.com/mysql/mysql-server/blob/mysql-8.0.24/sql/dd/dd_version.h#L209
Nevertheless, DD versions do change from time to time between revisions of MySQL 8.0. Therefore, the simplest way to avoid DD version mismatch is to not downgrade MySQL.
Upgrading a replication setup
In a nutshell, replica MySQL instances should be the same or newer than the source MySQL instance.
refs:
- https://dev.mysql.com/doc/refman/8.0/en/replication-compatibility.html
- https://dev.mysql.com/doc/refman/8.0/en/replication-upgrade.html
StatefulSet behavior
When the Pod template of a StatefulSet is updated, Kubernetes updates the Pods.
With the default update strategy RollingUpdate
, the Pods are updated one by one
from the largest ordinal to the smallest.
The StatefulSet controller keeps the old Pod template until it completes the rolling update. If a Pod that is not being updated is deleted, the StatefulSet controller restores the Pod from the old template.
This means that, if the cluster is Healthy, MySQL is assured to be updated one by one from the instance of the largest ordinal to the smallest.
refs:
- https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates
- https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update
Automatic switchover
MOCO switches the primary instance when the Pod of the instance is being deleted.
Read clustering.md
for details.
MOCO implementation
With the preconditions listed above, MOCO can upgrade mysqld
in MySQLCluster safely
as follows.
- Set the .spec.updateStrategy field in the StatefulSet to RollingUpdate.
- Choose the lowest ordinal Pod as the next primary upon a switchover.
- Configure the startup probe of the mysqld container to wait long enough.
  - By default, MOCO configures the probe to wait up to one hour.
  - Users can adjust the duration for each MySQLCluster.
Example
Suppose that we are updating a three-instance cluster.
The mysqld
instances in the cluster have ordinals 0, 1, and 2, and the
current primary instance is instance 1.
After MOCO updates the Pod template of the StatefulSet created for the cluster, Kubernetes starts re-creating Pods starting from instance 2.
Instance 2 is a replica and therefore is safe for an update.
Next to instance 2, the instance 1 Pod is deleted. The deletion triggers
an automatic switchover so that MOCO changes the primary to the instance 0
because it has the lowest ordinal. Because instance 0 is running an old
mysqld
, the preconditions are kept.
Finally, instance 0 is re-created in the same way. This time, MOCO switches the primary to instance 1. Since both instances 1 and 2 have been updated and instance 0 is being deleted, the preconditions are kept.
Limitations
If an instance is down during an upgrade, MOCO may choose an already updated instance as the new primary even though some instances are still running an old version.
If this happens, users may need to manually delete the old replica data and re-initialize the replica to restore the cluster health.
User's responsibility
- Make sure that the cluster is healthy before upgrading
- Check and prepare your installation for upgrade
- Do not attempt to downgrade MySQL
Security considerations
gRPC API
moco-agent, a sidecar container in mysqld Pod, provides gRPC API to
execute CLONE INSTANCE
and required operations after CLONE.
More importantly, the request contains credentials to access the source
database.
To protect the credentials and prevent abuse of API, MOCO configures mTLS between moco-agent and moco-controller as follows:
- Create an Issuer resource in
moco-system
namespace as the Certificate Authority. - Create a Certificate resource to issue the certificate for
moco-controller
. moco-controller
issues certificates for each MySQLCluster by creating Certificate resources.moco-controller
copies Secret resources created by cert-manager to the namespaces of MySQLCluster.- Both moco-controller and moco-agent verifies the certificate with the CA certificate.
- The CA certificate is embedded in the Secret resources.
- moco-agent additionally verifies that the Common Name of the certificate from moco-controller is moco-controller.
MySQL passwords
MOCO generates its user passwords randomly with the OS random device. The passwords are then stored as Secret resources.
As to communication between moco-controller and mysqld, it is not (yet) over TLS. That said, the password is encrypted anyway thanks to caching_sha2_password authentication.