Multi-cluster Strategy

Prerequisites

Reference Documents

Reference       Document                               Location
HELM            Helm
EXTERNALDNS     ExternalDNS in Multi-Cluster Setup
OBSERVABILITY   Observability in Multi-Cluster Setup
MANIFESTS       Services Deployment Manifests

Problem Description

In order to provide a faster Business Continuity Plan (BCP) response, a second Kubernetes (K8s) cluster for production is needed.

Background

Following a SEV-1 outage of the production Kubernetes cluster during an upgrade of the Container Network Interface (CNI) plugin, it was determined that the Disaster Recovery Plan (DRP) was unclear and had never been exercised in a drill. The DRP also could not be applied without tearing down the existing infrastructure, so recovering the existing cluster was judged to be faster. It took three hours to restore the service.

Dictionary

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Application deployment

Applications are deployed in the cluster with Helm charts. There is a reconciliation pipeline that installs all the charts defined in a manifest in a given cluster.

Traffic management

Application pods are exposed via a Service of type ClusterIP or NodePort, which receives all the traffic intended for the application represented by the Ingress. The Ingress controller configures and manages pods that run the Nginx web server as a reverse proxy. It translates each Ingress configuration into a standard Nginx configuration and injects it into the Nginx pods, which route the traffic to the application pods. The Nginx pods are exposed to external traffic via a Service of type LoadBalancer, which provisions an AWS Classic Load Balancer in the EKS public subnets.

Application concurrency and state

Workloads MUST support concurrency or implement mechanisms for avoiding concurrency that do not rely on the cluster. They usually run in a Deployment object with replica count greater than 1 to be resilient to worker node failure.

Workloads MUST NOT hold state inside the cluster. Any state (databases, caches, etc.) relies on AWS managed services (RDS, ElastiCache, S3, etc.) outside the cluster.

Solution

Provide a second, Standby cluster where Ebury Platform workloads can run and receive traffic, to be used in the event of an incident with the existing production Active cluster.

The second cluster will be created in the same AWS account, region, VPC and network as the current production one, with the same level of access to and from resources in Ebury Platform. We will avoid naming the clusters in a way that implies one is primary and the other is secondary. Both clusters will be equal, and either may be Active or Standby at any time.

Infrastructure and configuration changes and upgrades will never be applied to both clusters at the same time.

Traffic steering

Each cluster will have its own set of Load Balancers. Traffic will be directed to the Active cluster's Load Balancers by means of weights in DNS records.
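As an illustration of weighted DNS steering, the sketch below builds a change batch in the shape accepted by the Route 53 `ChangeResourceRecordSets` API, with one weighted CNAME per cluster. Route 53 as the DNS provider, the cluster names `blue` and `green`, the hostnames, the weight of 100 and the 60 second TTL are all assumptions made for the example, not details taken from this proposal.

```python
from typing import Dict, List


def weighted_record_changes(
    record_name: str,
    load_balancers: Dict[str, str],
    active_cluster: str,
    ttl: int = 60,  # hypothetical low TTL, in seconds
) -> Dict[str, List[dict]]:
    """Build a Route 53 style ChangeBatch with one weighted CNAME per cluster."""
    if active_cluster not in load_balancers:
        raise ValueError(f"unknown cluster: {active_cluster}")
    changes = []
    for cluster, lb_hostname in sorted(load_balancers.items()):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                # SetIdentifier distinguishes weighted records sharing a name.
                "SetIdentifier": cluster,
                # Only the Active cluster gets a non-zero weight.
                "Weight": 100 if cluster == active_cluster else 0,
                "TTL": ttl,
                "ResourceRecords": [{"Value": lb_hostname}],
            },
        })
    return {"Changes": changes}


batch = weighted_record_changes(
    "app.example.com.",
    {"blue": "blue-lb.example.com", "green": "green-lb.example.com"},
    active_cluster="green",
)
```

The function only builds data; applying the batch would be a separate step in whatever tooling manages the DNS records.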

Workload deployment

All Kubernetes objects comprising the Ebury Platform will be created and updated in both clusters. The reconciliation pipeline for Ebury Platform applications will run in parallel for both clusters. See HELM and MANIFESTS for more details.
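A minimal sketch of what the reconciliation could look like when fanned out over two clusters is given here. The manifest format (a list of releases with `name`, `chart` and `namespace`) and the kube-context names are hypothetical; the real formats are described in HELM and MANIFESTS. The sketch only builds `helm upgrade --install` command lines and does not execute them.

```python
from typing import Dict, List


def helm_commands(
    manifest: List[Dict[str, str]],
    kube_contexts: List[str],
) -> List[List[str]]:
    """Return one `helm upgrade --install` command per release per cluster."""
    commands = []
    for context in kube_contexts:
        for release in manifest:
            commands.append([
                "helm", "upgrade", "--install",
                release["name"], release["chart"],
                "--namespace", release["namespace"],
                # Selects the target cluster from the local kubeconfig.
                "--kube-context", context,
            ])
    return commands


# Hypothetical manifest and context names, for illustration only.
manifest = [{"name": "example", "chart": "charts/example", "namespace": "apps"}]
for command in helm_commands(manifest, ["blue", "green"]):
    print(" ".join(command))
```

Because every command carries its own `--kube-context`, the commands for each cluster are independent and a pipeline could run the two groups in parallel.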

By default, the Standby cluster will not start any application Pods, nor will it receive traffic, because every Ingress defined in it will have a DNS weight of 0. Mutation policies will enforce this behaviour.
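The effect of such mutation policies can be sketched as a pure function over Kubernetes objects represented as dictionaries. This is an illustration of the intended behaviour, not the policy engine itself. It assumes that DNS weights are set through the ExternalDNS annotation `external-dns.alpha.kubernetes.io/aws-weight` and that only Deployment and Ingress objects need mutating; both are assumptions.

```python
import copy

# Assumption: DNS weight is controlled through this ExternalDNS annotation.
WEIGHT_ANNOTATION = "external-dns.alpha.kubernetes.io/aws-weight"


def mutate_for_standby(obj: dict) -> dict:
    """Return a copy of a Kubernetes object adjusted for the Standby cluster."""
    mutated = copy.deepcopy(obj)  # never modify the caller's object
    kind = mutated.get("kind")
    if kind == "Deployment":
        # No application Pods are started on Standby by default.
        mutated.setdefault("spec", {})["replicas"] = 0
    elif kind == "Ingress":
        # A DNS weight of 0 keeps traffic away from the Standby cluster.
        metadata = mutated.setdefault("metadata", {})
        metadata.setdefault("annotations", {})[WEIGHT_ANNOTATION] = "0"
    return mutated


deployment = {"kind": "Deployment", "spec": {"replicas": 3}}
print(mutate_for_standby(deployment)["spec"]["replicas"])  # 0
```

Other object kinds pass through unchanged in this sketch; a real policy set would also need to decide how to treat autoscalers and scheduled jobs.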

Switchover

The Standby cluster will run with a minimal number of worker nodes. Cluster Autoscaler will add nodes to meet the increased demand if it becomes the Active cluster.

Switchover is to be done without downtime provided both clusters are healthy.

Tools for switchover and a runbook will be delivered. Regular drills will be performed at an agreed upon cadence.

Observability

Both clusters will stream logs and metrics to the common observability platform.

Existing metrics, alerting rules and dashboards will be modified to be multi-cluster aware; see OBSERVABILITY.

Every metric and alerting rule will be assessed to decide whether it should be evaluated on a per-cluster basis or as an aggregate over all clusters.
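The difference between the two evaluation modes can be shown with a small sketch. It assumes each sample carries a cluster label, uses a hypothetical "low traffic" threshold of 10 requests per second, and uses made-up sample values; none of these come from the observability platform itself.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# (cluster, application, requests per second); values are illustrative only.
Sample = Tuple[str, str, float]


def per_cluster(samples: Iterable[Sample]) -> Dict[Tuple[str, str], float]:
    """Sum traffic per (cluster, application) pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for cluster, app, rps in samples:
        totals[(cluster, app)] += rps
    return dict(totals)


def aggregate(samples: Iterable[Sample]) -> Dict[str, float]:
    """Sum traffic per application over all clusters."""
    totals: Dict[str, float] = defaultdict(float)
    for _cluster, app, rps in samples:
        totals[app] += rps
    return dict(totals)


def low_traffic(values: Dict[Hashable, float], threshold: float) -> List[Hashable]:
    """Return the keys whose traffic is below the alert threshold."""
    return sorted(key for key, rps in values.items() if rps < threshold)


samples = [("blue", "example", 120.0), ("green", "example", 0.0)]
print(low_traffic(per_cluster(samples), 10.0))  # [('green', 'example')]
print(low_traffic(aggregate(samples), 10.0))    # []
```

Evaluated per cluster, the rule fires for the idle cluster even though the application is healthy, while the aggregate view stays quiet. This is the kind of case the assessment needs to catch.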

Cost

The cost of an EKS cluster consists of the EKS control plane, EC2 instances for worker nodes, EBS volumes for worker nodes and Load Balancers for Ingresses. The Standby cluster will have the minimum number of worker nodes.

  • cost for EKS control plane - $72 / month
  • cost for worker nodes - $100 to $200 / month (rough estimate)
  • cost for Load Balancers - $150 / month (estimate)
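Summing the three line items above gives the expected monthly range for the Standby cluster. The sketch below is only the arithmetic on the estimates listed; fixed items are written as ranges with equal bounds.

```python
from typing import Tuple

CostRange = Tuple[int, int]  # (low, high) in USD per month


def monthly_total(*items: CostRange) -> CostRange:
    """Add up the low and high ends of each cost item."""
    return sum(low for low, _ in items), sum(high for _, high in items)


control_plane = (72, 72)
worker_nodes = (100, 200)    # rough estimate
load_balancers = (150, 150)  # estimate

print(monthly_total(control_plane, worker_nodes, load_balancers))  # (322, 422)
```

On these estimates the Standby cluster would cost roughly $322 to $422 per month.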

Alternatives

Multi-cluster setup with a common Load Balancer

Provide a second, Standby cluster where workloads will run and receive traffic. Each workload will have a single DNS record that steers traffic to a common Load Balancer with two target groups, one for each cluster. This solution provides faster switchover but represents a bigger change to the existing architecture, so it would be much harder to implement. See EXTERNALDNS.

  • cost for EKS control plane - $72 / month
  • cost for worker nodes - $300 to $500 / month (rough estimate)

Active-Active setup

Provide a second cluster where workloads will run and receive traffic. Both clusters will receive Internet traffic simultaneously. DNS round robin will split traffic equally between the clusters. We are not sure whether all workloads are ready for an Active-Active setup.

  • cost for EKS control plane - $72 / month
  • cost for worker nodes - $300 to $500 / month (rough estimate)
  • cost for Load Balancers - $150 / month (estimate)

Velero

Velero is an open source tool to back up and restore Kubernetes cluster resources.

Instead of deploying Helm charts, perform regular cluster backups with Velero and restore them in the Standby cluster if needed. Additional storage for backups will be needed and switchover would take longer.

We don't have practical experience with Velero. For cost calculation let's assume worker node storage will be backed up to S3 daily.

  • cost for S3 storage - $100 to $300 / month (rough estimate)

Improve DRP

Create a DRP for the Kubernetes cluster. A single Kubernetes cluster will remain a large failure domain when applying cluster-wide changes.

Caveats

Dependence on correct DNS caching

Each cluster will have a set of DNS records with low TTL for all its workloads. Switchover will activate a set of DNS records for the new Active cluster and then it will deactivate a set of DNS records for the new Standby cluster. Successful switchover depends on correct behaviour of DNS clients and DNS cache servers to query a DNS record again when its TTL expires. Correct implementation of DNS caching on the customer side is out of our control.
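The ordering described above (activate the new records first, deactivate the old ones afterwards) can be written down as a sequence of DNS weight states. The cluster names and the weight value of 100 are hypothetical; the sketch only encodes the order of operations.

```python
from typing import Dict, List


def switchover_weights(old_active: str, new_active: str) -> List[Dict[str, int]]:
    """Return the DNS weight states a switchover passes through, in order."""
    if old_active == new_active:
        raise ValueError("old and new Active clusters must differ")
    return [
        {old_active: 100, new_active: 0},    # steady state before switchover
        {old_active: 100, new_active: 100},  # records for the new Active cluster activated
        {old_active: 0, new_active: 100},    # records for the new Standby cluster deactivated
    ]


for state in switchover_weights("blue", "green"):
    # At every step at least one cluster is able to receive traffic.
    assert any(weight > 0 for weight in state.values())
    print(state)
```

The intermediate state, in which both clusters are resolvable, is what allows a switchover without downtime. Clients that ignore the TTL may keep using the old records after the final step, which is the risk this caveat describes.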

Underutilised resources

In a steady state, the Standby cluster will be idle and ready to take over traffic. Standby cluster resources will not be utilised most of the time.

How to determine Active cluster

Switchover can happen at any time. It would not be straightforward to determine which cluster is active at the moment.
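One possible way to answer the question is to derive the Active cluster from the DNS weights themselves. The sketch below assumes the current weights of every workload record have already been fetched into a dictionary; the record and cluster names are hypothetical, and how the weights are obtained is left out.

```python
from typing import Dict, Set


def active_clusters(records: Dict[str, Dict[str, int]]) -> Set[str]:
    """Return every cluster that has a non-zero weight on at least one record."""
    active: Set[str] = set()
    for weights in records.values():
        active.update(cluster for cluster, weight in weights.items() if weight > 0)
    return active


records = {
    "app.example.com.": {"blue": 0, "green": 100},
    "api.example.com.": {"blue": 0, "green": 100},
}
print(active_clusters(records))  # {'green'}
```

A result containing exactly one cluster is unambiguous. More than one cluster suggests a switchover in progress or inconsistent records, and an empty result means no record currently carries a non-zero weight.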

Operation

Standby cluster will be created as code and will be managed by Platform teams.

Security Impact

Security for Standby cluster will be at the same level as for existing cluster.

Performance Impact

When switchover is triggered, traffic will be gradually steered to the new Active cluster. The switchover time depends on TTL for cluster workload DNS records.

Developer Impact

Deployment of applications to both clusters will be transparent for developers.

Existing alerting rules will be modified to be multi-cluster aware.

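We will need to be careful with any alerting rules relating to "not enough traffic", because the Standby cluster receives no traffic by design and such rules would fire for it if evaluated per cluster.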

Data Contracts

N/A

Deployment

Deployment of applications to the Standby cluster will be staged in two phases:

  • First, the reconciliation pipeline will deploy only the Example application to the Standby cluster.
  • Once traffic steering, alerting and metrics collection have been tested for the Example application, deployment of all other applications to the Standby cluster will be enabled in the reconciliation pipeline.

Dependencies

N/A