Objective

This document aims to describe how Platform plans to design, deploy and manage production-ready Kubernetes clusters running on Amazon EKS, while also providing recommendations and open source tooling for developers.

Background

Kubernetes (K8s) has become the industry standard for container orchestration, although it can be difficult to administer. As a result, most organisations choose a hosted, managed Kubernetes service to streamline their deployment and management processes.

We have decided to deploy our Kubernetes clusters with Amazon Elastic Kubernetes Service (Amazon EKS) — a fully managed service that allows us to run Kubernetes on AWS, while also handling the Kubernetes control plane. This makes it easy for us to deploy, manage, and scale containerised applications.

This document provides an overview of the design and deployment of an Amazon EKS cluster. The goal of this project is to migrate Ebury applications, such as EBO from ECS and BOS from virtual machines, to container-based infrastructure for simpler scaling, better infrastructure management, and optimised resource usage, delivering one operating platform.

Solution

Kubernetes cluster service

Currently, AWS offers two options to host a Kubernetes cluster:

Self-hosted: This option deploys Kubernetes on Amazon EC2 instances. This gives full flexibility in designing and deploying the cluster. However, we would need to manage the entire control plane, which requires additional time, attention, and resources. We have ruled this option out to reduce the effort involved in deploying the cluster.

Elastic Kubernetes Service (EKS): This option deploys a Kubernetes cluster without the overhead of managing the control plane, as the control plane is fully hosted within AWS. The platform team only needs to manage the K8s worker nodes, removing some of the complexity of a self-hosted solution.

We need to consider that the EKS upgrade process has manual steps.

{ Irrespective of the service option, the worker nodes will be delivered in Auto Scaling Groups (ASGs), allowing us to treat nodes as groups instead of as individual instances. ASGs also offer features such as automatic node replacement, tag inheritance, and scaling in and out. }

Multi-tenant or single-tenant Kubernetes cluster

This decision varies and is open to debate, though a large multi-tenant EKS cluster could become problematic due to its size. Kubernetes is flexible when deploying a multi-tenant cluster, but we want to avoid situations where multiple tenants (environments) running on the same infrastructure end up obstructing each other.

We also want to future-proof our clusters and keep within the 150 worker node limit, while ensuring that we can scale out in the future should the need arise. There are other considerations regarding multi-tenancy, such as:

  • Deploying critical business applications on the same infrastructure as non-critical apps may result in one consuming the other’s resources.

  • The blast radius of an incident is harder to limit, potentially taking down the cluster along with the applications running on it.

  • Increased setup complexity that results in expanding resource quotas, pod security policies, monitoring and logging, authentication and authorisation.

We plan to initially have two clusters in two separate AWS accounts: Production and Development. This will address some of the above concerns.

We do have a question to solve. Potentially we will have ~4 customer-facing environments, including PROD, Preview (for integrators and early access to new releases) and Demo. We need to decide where to co-locate these environments. Being customer facing, the three support environments need to be treated in the same way as PROD, so there is an argument for co-locating them within the PROD K8s cluster.

Type of worker node instances

The type of node depends on the applications that will be deployed on the cluster. Some apps consume a lot of CPU, while others are bound by memory or other attributes. AWS provides several different types of EC2 compute nodes to support this requirement. Choosing multiple worker node flavors for smarter resource usage is an option, but it brings the following design considerations:

  • The need to manage multiple ASGs

  • Increased cost of hosting multiple instance types, with reservations that are harder to manage

  • Added complexity when designing the cluster

EKS allows multiple ASGs with different instance flavors, giving each ASG specific tags that can be used to schedule pods via label selectors. As a guide we will use m5.2xlarge and standardise the instances. To optimise costs we will look to reserve the 6 base instances for 1 year. Each instance will cost $197 per month when using a 1-year no-upfront reservation.
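As a quick sanity check on the reservation figures above, the arithmetic can be sketched as follows. The instance count and monthly price are taken from this document; the function name is illustrative only.

```python
def reserved_cost(instances: int, monthly_price_usd: int, months: int = 12) -> int:
    """Total reservation cost in USD over the given number of months."""
    return instances * monthly_price_usd * months


# Figures from this document: 6 base instances at $197 per month each.
monthly_total = reserved_cost(6, 197, months=1)
annual_total = reserved_cost(6, 197)
print(f"Monthly: ${monthly_total}, annual: ${annual_total}")
```

Under these figures the base fleet comes to $1,182 per month, or $14,184 over the 1-year term.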

Designing the cluster

Ingress controller

There are a number of ingress controller solutions, many of which are open source. There are a few aspects to consider when selecting the controller solution:

  • Scaling requirements, available resources, and resource use patterns.

  • Traffic type that will be served (HTTP, gRPC, WebSocket, etc.)

  • Number of requests, networking policies, monitoring, and logging.

Ebury will avoid exposing services using NodePort or ExternalName.

Ebury will start with AWS Elastic Load Balancing (ELB) for Kubernetes services of type LoadBalancer. Standard HTTP and HTTPS traffic will be routed via an ingress controller. To start with, we will use the default upstream-provided ingress controller and default backend versions. As part of the initial rollout, ingress resources will only use standard annotations as per the upstream documentation.

Exposing private/public endpoints

Endpoints will be exposed outside the Kubernetes cluster to allow other applications to communicate with it. Standard HTTP and HTTPS endpoints will be handled by an ingress controller, using a LoadBalancer type service.

Exposed HTTP endpoints will be SSL/TLS terminated and have a predictable, often static, hostname in the environment's domain, e.g. https://foo.ebury.com/. Other applications can use the public, encrypted route to reach the endpoint. Internal DNS for the endpoint's Service may also be used but may have different access requirements and restrictions.

We will need to create a communication link between our clusters and other AWS accounts hosting Kafka and other heritage services as well as the corporate network. We plan to use private subnets that are advertised to other accounts using AWS Transit Gateway service. This will enable any endpoint exposed on these subnets to be available to our internal network.

Deploying the cluster

Considerations when deploying Kubernetes clusters on Amazon EKS.

Cluster redundancy for a single region

This refers to deploying multiple Kubernetes clusters in a single region to limit downtime if a cluster should fail. Clusters may be deployed in two scenarios: active-active or failover.

  • Active-active: Multiple clusters serve the same traffic. If one of these clusters goes down, its processing power would be lost. Using the Kubernetes cluster autoscaler feature, the remaining clusters may scale up to cover the loss.

  • Failover: One or more clusters are on standby, ready to become available if the main cluster goes down. This is expensive in both resources and cost.

Best practice is to employ multiple Kubernetes clusters; this facilitates not only resilience but upgrade paths. For example, we would run two Kubernetes clusters with 6 nodes each instead of a single Kubernetes cluster with 12 nodes.

This raises the problem of cluster management. There are a number of open source continuous delivery platforms that offer simpler application management, cluster deployments as well as upgrades and replacements.

VPC Peering or AWS Transit Gateway

VPC peering and Transit Gateway (TGW) require non-overlapping IP blocks across all VPCs, which can raise the complexity. As part of the project we will deliver a clearly defined standard to manage VPC CIDRs. Public communication instead of VPC peering/TGW is not an option in our environment.

We are looking to use two CIDR blocks attached to the same VPC:

  • The first block is used to deploy the infrastructure for Kubernetes.

  • The second block is used to create private subnets that expose internal endpoints.

The CIDR block used by Kubernetes is blocked within TGW, allowing us to use the same internal CIDR block for all clusters.

VPC considerations:

When creating an Amazon EKS cluster, we specify the VPC subnets for the cluster to use. Amazon EKS requires subnets in at least two Availability Zones. A VPC with public and private subnets allows Kubernetes to create public load balancers in the public subnets that load balance traffic to pods running on nodes that are in private subnets.

When we create the cluster, we will also need to specify all of the subnets that host resources for the cluster, such as nodes and load balancers.

Nodes must be able to communicate with the control plane and other AWS services. If the nodes are deployed in a private subnet, then we will have to:

  • Set up a default route from the subnet to a NAT gateway. The NAT gateway must be assigned a public IP address to provide internet access for the nodes.

  • Configure the necessary settings for the subnet to allow the private subnet to communicate outside the cluster.

If nodes, whether self-managed or managed, are deployed to a public subnet, the subnet must be configured to auto-assign public IP addresses. Node instances are required to have a public IP address when they are launched, and without this setting they are not assigned one.

The largest CIDR block that can be used in AWS for any VPC is /16. While it seems big enough for most scenarios it is not always the case due to some constraints with the AWS CNI. We are not expecting it to be a problem initially.

In EKS, worker nodes consume a considerable number of IP addresses depending on the instance type. A single worker node can consume anywhere from one to over 200 IPs. Each Amazon EC2 node is deployed to one subnet and is assigned a private IP address from the CIDR block assigned to that subnet. If the subnets were created using one of the Amazon EKS provided AWS CloudFormation templates, then nodes deployed to public subnets are automatically assigned a public IP address by the subnet. Each node is deployed with the Pod networking plugin (CNI) which, by default, assigns each pod a private IP address from the CIDR block of the subnet that the node is in, and adds that IP address as a secondary IP address to one of the network interfaces attached to the instance.
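To make the IP consumption concrete, the sketch below estimates how many addresses a node can hold and how many such nodes fit in a subnet in the worst case. It assumes the default VPC CNI behaviour, where the pod limit is ENIs × (IPs per ENI − 1) + 2, and assumes the m5.2xlarge limits of 4 network interfaces with 15 IPv4 addresses each; these limits should be checked against current AWS documentation. The subnet CIDR and function names are hypothetical.

```python
import ipaddress


def max_ips_per_node(enis: int, ips_per_eni: int) -> int:
    """Upper bound on VPC addresses one node can hold across all its ENIs."""
    return enis * ips_per_eni


def max_pods_per_node(enis: int, ips_per_eni: int) -> int:
    """Default VPC CNI pod limit: the primary IP of each ENI is not handed
    to pods, and 2 is added for pods that use host networking."""
    return enis * (ips_per_eni - 1) + 2


def usable_subnet_ips(cidr: str) -> int:
    """AWS reserves 5 addresses in every subnet."""
    return ipaddress.ip_network(cidr).num_addresses - 5


# Assumed limits for m5.2xlarge: 4 ENIs with 15 IPv4 addresses each.
ENIS, IPS_PER_ENI = 4, 15
WORKER_SUBNET = "10.0.0.0/22"  # hypothetical worker node subnet

pods = max_pods_per_node(ENIS, IPS_PER_ENI)
worst_case_nodes = usable_subnet_ips(WORKER_SUBNET) // max_ips_per_node(ENIS, IPS_PER_ENI)
print(f"Pods per node: {pods}, worst-case nodes per subnet: {worst_case_nodes}")
```

This is a worst-case estimate: a node only holds its full allocation when the CNI has attached and filled every interface, so real capacity is usually higher.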

We have a requirement to segregate our networks to allow us to manage and contain the transit and use of data. For this reason traffic and resources will be separated into multiple subnet types.

Planned subnet categories:

  • Control plane subnets: EKS may handle the control plane, but it still needs to pull IP addresses from the VPC. This is crucial since we don’t manage the control plane instances, but we do provide the subnets where those machines are deployed. For this reason, we will create dedicated subnets for this component to isolate the worker nodes from the control plane. These subnets will also be distributed across multiple availability zones for redundancy.

  • Worker node subnets: The largest subnet category that we require. This is where pods are going to be deployed. These will be as large as the setup allows.

  • Public subnets: This is where the publicly accessible endpoints are deployed. In this category, we will use smaller CIDR blocks as only load balancers will be deployed within this range.

  • Private subnets: These provide privately accessible endpoints that offer an entrance to the cluster from the corporate network (VPN).

Managing services in the cluster

The specific deployment topology of applications on a cluster will be described in a separate document. This section outlines the underlying infrastructure required for it.

Namespace

A Namespace is a virtual cluster inside the Kubernetes cluster. We can have multiple namespaces inside a single Kubernetes cluster, and these are all logically isolated from each other. They can help teams with organisation, security, and even performance.

The EKS cluster comes out of the box with the three namespaces that Kubernetes ships with: default, kube-system (used for Kubernetes components), and kube-public (used for public resources). kube-public isn’t used and is there for future EKS enhancements; it will be left alone, as will kube-system. This leaves the default Namespace as the place where services and apps can be created.

The “default” namespace has no special characteristics, except that the Kubernetes tooling is set up out of the box to use this namespace and it can’t be deleted. It is recommended that “default” is not used in anything other than the smallest of implementations. We will follow this guidance as it is very easy for a team to accidentally overwrite or disrupt another service without even realising it.

Namespace strategy

We will create multiple namespaces and use them to segment our services into manageable chunks.

Basing namespaces around domains would appear to be optimal for ebury2.0. If we create too many Namespaces they will get in our way, but with too few we risk domains clashing and causing service events.

Proposed naming convention would be:

[environment]-[domain] or [environment]-[domain]-[subdomain]

For example:

  • production-payments-swift

  • preview-payments-aml

  • preview-fx
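The naming convention above could be checked automatically, for example in a CI step. The following is a minimal sketch, assuming a fixed list of environment names (the list below is hypothetical) and lowercase alphanumeric domain and subdomain segments. The 63-character limit is the Kubernetes limit for namespace names.

```python
import re

# Hypothetical list of environments; adjust to the environments actually agreed.
ENVIRONMENTS = ("production", "preview", "demo", "development")

# [environment]-[domain] or [environment]-[domain]-[subdomain]
NAMESPACE_PATTERN = re.compile(
    r"(?:" + "|".join(ENVIRONMENTS) + r")-[a-z0-9]+(?:-[a-z0-9]+)?"
)


def is_valid_namespace(name: str) -> bool:
    """Return True if the name follows the proposed convention and fits
    within the 63-character limit for Kubernetes namespace names."""
    return len(name) <= 63 and NAMESPACE_PATTERN.fullmatch(name) is not None


for candidate in ("production-payments-swift", "preview-fx", "default"):
    print(candidate, is_valid_namespace(candidate))
```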

Namespace and RBAC

Namespaces do not provide workload or user isolation on their own. K8s role-based access control (RBAC) provides a way to define who can do what on the Kubernetes API. The authorisation can be applied to the whole cluster via a ClusterRole* or it can be bound to one Namespace via a Role.

Amazon EKS provides an integration of RBAC with IAM via the AWS IAM Authenticator for Kubernetes, allowing us to map IAM users and roles to RBAC groups. RBAC is a central component that can also be used to provide control over the other layers of isolation.

*ClusterRoles will be avoided as much as possible - we work on the least privilege principle.
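To illustrate a namespace-scoped grant in line with the least privilege principle, the sketch below builds a read-only Role and a RoleBinding for one RBAC group and prints them as JSON, which kubectl accepts as well as YAML. The role name, group name and the chosen resources and verbs are assumptions for illustration, not an agreed policy.

```python
import json

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def namespaced_read_only(namespace: str, group: str) -> list:
    """Build a read-only Role and a RoleBinding for one group in one namespace."""
    role_name = "read-only"  # hypothetical role name
    role = {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["", "apps"],
                "resources": ["pods", "services", "deployments"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{group}", "namespace": namespace},
        "subjects": [{"kind": "Group", "name": group, "apiGroup": RBAC_API_GROUP}],
        "roleRef": {"kind": "Role", "name": role_name, "apiGroup": RBAC_API_GROUP},
    }
    return [role, binding]


# "payments-developers" is a hypothetical RBAC group mapped from IAM.
manifests = namespaced_read_only("preview-payments-aml", "payments-developers")
print(json.dumps(manifests, indent=2))
```

Because both objects are namespaced, the grant cannot leak into other namespaces, which avoids the need for a ClusterRole.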

Namespaces and Connection Control

By default, pods on the same cluster can communicate over the network across different namespaces. Kubernetes provides Network Policies that allow users to define fine-grained control over pod-to-pod communication. The implementation of the network policy is delegated to a network plugin. In a default EKS cluster, pod-to-pod networking is delegated to the Amazon VPC CNI plugin, which can be used together with Calico to enforce Kubernetes Network Policies. We will isolate tenants by namespace.

We will use Network Policies to segregate namespaces, following a pattern similar to the one used today for connectivity between ECS clusters. Mainly, we will restrict direct communication between namespaces that are externally exposed and those that run our core services, such as payment gateways. Asynchronous communication, such as SQS messages and Kafka events, is recommended instead.
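As an illustration of namespace segregation, the sketch below builds a NetworkPolicy that only admits ingress traffic from pods in the same namespace. It is a minimal example, not the final rule set: the policy name is hypothetical, and real policies would also need explicit allowances, for example for the ingress controller.

```python
import json


def same_namespace_only_policy(namespace: str) -> dict:
    """NetworkPolicy that allows ingress only from pods in the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace-only", "namespace": namespace},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector in "from" matches pods in this namespace only,
            # so traffic from all other namespaces is denied.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


print(json.dumps(same_namespace_only_policy("production-payments-swift"), indent=2))
```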

Fine-Grained Connection Control and Service Authentication

Service meshes allow us to define an additional model of protection that can even span outside of the single EKS cluster.

Ebury's requirement is for a service mesh to provide end-to-end communication encryption and handle service authentication on behalf of the apps using mTLS.

Istio is a popular open-source service mesh that provides features such as traffic management, security and observability. However, it is well known to be complex to deploy and manage. Linkerd is a simpler solution, but its main drawback is that it only works in Kubernetes.

AWS App Mesh is a managed service mesh that gives you consistent visibility and network traffic controls for every service in an application. AWS App Mesh is based on Envoy and provides a fully managed experience of the control plane of the service mesh. To test it out, the documentation provides a quickstart for Amazon EKS.

HashiCorp Consul has many integrations with Kubernetes. Consul can be installed on Kubernetes using the Helm chart, sync services between Consul and Kubernetes, run the Consul Connect service mesh, and more. There are a number of officially supported integrations between Consul and Kubernetes (EKS).

https://learn.hashicorp.com/tutorials/consul/kubernetes-eks-aws?utm_source=consul.io&utm_medium=docs

We will initially roll out Kubernetes without a service mesh, but we aim to introduce one as soon as the migration is completed in order to achieve authentication and encryption (via mTLS) between services. This would be in addition to Network Policies, not a replacement for them.

Resource and Affinity controls

Kubernetes allows the control of requests and limits for CPU and memory for Pods, to prevent contention for resources on a node and to improve resource allocation by the scheduler. We will require developers to always set them, and we will provide a number of templates to guide best-practice allocations.

Resource Quotas allow users to limit the amount of resources or Kubernetes objects that can be consumed within one namespace. From a compute perspective, Resource Quotas allow users to define limits on CPU and memory natively. With Resource Quotas, workloads can be assigned a limited range of resources to prevent tenants from interfering with each other. It is also possible to use Limit Ranges, which allow users to define a default, minimum and maximum request and limit per Pod or even per container.

Tenants will also be isolated from access to the underlying node instance using Kubernetes Pod Security Policy.

To prevent a CPU-heavy or memory-hungry workload from affecting cluster stability, Kubernetes limits must be configured from the beginning of the cluster's life. Every namespace must have CPU and memory limits, as well as default limit values.
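The per-namespace limits described above map onto two Kubernetes objects, a ResourceQuota and a LimitRange. The sketch below builds both for one namespace. All object names and quantities are placeholder assumptions, not recommended values; the requests quota is simply set equal to the limits quota for brevity.

```python
import json


def namespace_limits(namespace: str, cpu_quota: str, memory_quota: str) -> list:
    """Build a ResourceQuota and a LimitRange with default container limits."""
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu_quota,
                "requests.memory": memory_quota,
                "limits.cpu": cpu_quota,
                "limits.memory": memory_quota,
            }
        },
    }
    limit_range = {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied when a container declares no limit or request.
                    "default": {"cpu": "500m", "memory": "512Mi"},
                    "defaultRequest": {"cpu": "250m", "memory": "256Mi"},
                }
            ]
        },
    }
    return [quota, limit_range]


# Placeholder quota values for a hypothetical namespace.
print(json.dumps(namespace_limits("preview-fx", "8", "16Gi"), indent=2))
```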

Kubernetes allows teams to specify where Pods can be scheduled, relative to other pods or nodes. Pod Anti-Affinity allows users, for example, to define that Pods with specific labels cannot be scheduled on the same node. This is often used to provide high availability for services, by ensuring that not all pods from the same deployment are scheduled on the same node. Anti-Affinity requires that appropriate labels are applied to workloads, so this will be enforced. We need to investigate whether this configuration might become difficult to maintain and manage at scale. It is also possible to specify which pods need to be scheduled together.
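A minimal sketch of the anti-affinity rule described above is shown below. It builds the affinity stanza that would sit under spec.template.spec.affinity in a Deployment and assumes workloads carry an "app" label; the label key and value are illustrative.

```python
import json


def spread_across_nodes(app_label: str) -> dict:
    """Affinity stanza that keeps pods with the same 'app' label on different nodes."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    # One pod per node: the topology domain is the node hostname.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


# "payments-api" is a hypothetical workload label.
print(json.dumps(spread_across_nodes("payments-api"), indent=2))
```

Note that a required rule leaves pods unschedulable when there are fewer eligible nodes than replicas; the preferredDuringSchedulingIgnoredDuringExecution variant is the softer alternative.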

Inter-pod affinity and anti-affinity place a high workload on the control plane and are not recommended in clusters with several hundred nodes or more. We need to be aware of this as we grow, and build appropriate monitoring.

Storage Isolation

Tenants using a shared cluster will need different types of storage. Kubernetes provides a set of tools to manage storage; a Volume provides a way to connect a form of persistent storage to a Pod and manage its lifecycle. We will not allow access to local volumes. When BOS migrates into EKS we may have to approve an exception. (Persistent Volumes are declared at cluster level, as is the StorageClass.)

Managing Kubernetes clusters

Monitoring: Prometheus will be used as the monitoring tool for Kubernetes. In the longer term we may want to look at Thanos for a highly available and redundant setup.

Logging: We will continue to use the ELK stack. We will review the sharding to ensure it is optimised for our new configuration.

Allow access to AWS resources: We will use K8s ServiceAccounts to allow pods to use IAM Roles (IAM roles for service accounts).

Secrets management: We will continue to use Vault as it provides an API-oriented architecture, which allows us to programmatically interact with secrets.

Upgrading clusters: We will use blue-green deployments, creating a new ASG with the new cluster version in the same cluster. This will allow us to perform any validation and testing, and then switch to the new ASG by draining the old worker nodes. It also allows a quick rollback to the previous version if an upgrade error is detected. This method balances cost with confidence in our ability to upgrade and roll back quickly. The other option is to perform rolling upgrades of the ASG by replacing one node after another. This would keep spending lower, but upgrading or rolling back a cluster would become more time-consuming as the cluster grows, leading to potentially longer downtime events.
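The switch from the old ASG to the new one could be scripted along these lines. This is a sketch only: it builds the kubectl commands rather than running them, the node names are hypothetical, and a real drain may need further flags depending on the workloads.

```python
def switch_commands(old_nodes: list) -> list:
    """Commands to move workloads off the old nodes: cordon all of them first
    so evicted pods cannot land on another old node, then drain one by one."""
    commands = [["kubectl", "cordon", node] for node in old_nodes]
    commands += [
        ["kubectl", "drain", node, "--ignore-daemonsets"] for node in old_nodes
    ]
    return commands


def rollback_commands(old_nodes: list) -> list:
    """Commands to make the old nodes schedulable again if the upgrade fails."""
    return [["kubectl", "uncordon", node] for node in old_nodes]


# Hypothetical node names from the old ASG.
OLD_NODES = ["ip-10-0-1-10.ec2.internal", "ip-10-0-2-20.ec2.internal"]
for command in switch_commands(OLD_NODES):
    print(" ".join(command))
```

Keeping the old ASG in place until validation completes is what makes the rollback path a matter of uncordoning nodes rather than rebuilding them.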

As Code: We will continue to use configuration-driven tools for provisioning and managing Kubernetes infrastructure across AWS, leveraging YAML, Terraform and Python.

Service Discovery: Using Hashicorp Consul with sidecar will allow us to have a service discovery service that spans the K8s and EC2 world.

DNS: We will continue to use Route 53 as our Domain Name Service.

SSL Certificates: We will use DigiCert or AWS as our certificate authority. We will not use self-signed certificates or run an internal certificate service.

Detailed Design

Details of design, should offload contents from the overview. Despite the proposed structure feel free to use whatever layout and style that you may see fit

Design Topic

optional: some specific component or topic that needs to be addressed. Could also be listing implementation alternatives (subheadings recommended for those but YMMV) and serve as an entrypoint for discussion and decision making

# description may also include some pseudo or real code
def somecode():
    pass

Caveats

Shortcuts, trade-offs and limits that have been taken into consideration when defining the design. E.g. only supports USD and EUR currencies; won’t handle ESPECIFIC_ERRORS from third party service; will only be available for a subset of users

Operation

To achieve what is set out in this RFC we propose the following migration delivery plan:

  1. A single K8s cluster (EKS control plane) with workers (EKS node group).

  2. Use namespaces and CNI NetworkPolicy to isolate the applications, with ingress and egress rules for pods. Later we will introduce a service mesh to provide end-to-end encryption and authentication.

    1. Application communication inside the same K8s cluster will use internal DNS resolution and possibly a service mesh for discovery.

    2. Application access from the public internet will use an ELB/ALB in front of an ingress controller.

    3. If the selected CNI is not the one native to EKS (VPC CNI), we will need to arrange a support service for it.

  3. Move the current RDS/Redis networking linked to the current VPCs to a network reachable from the worker nodes' network.

  4. Use a K8s ServiceAccount to allow a pod to use an IAM Role (https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)

  5. Continue using Vault as a secrets store.

  6. Logging using ELK (e.g. in our current K8s cluster the Elasticsearch Filebeat daemon is working).
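For step 4, IAM roles for service accounts work by annotating a ServiceAccount with the ARN of the IAM role its pods should assume. The sketch below builds such a manifest. The account ID, role name and service account name are hypothetical placeholders, and the IAM role itself (with its trust policy for the cluster's OIDC provider) must be created separately.

```python
import json


def irsa_service_account(name: str, namespace: str, role_arn: str) -> dict:
    """ServiceAccount annotated so that pods using it can assume the IAM role."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


# Hypothetical role ARN; the account ID and role name are placeholders.
manifest = irsa_service_account(
    "payments-api",
    "production-payments-swift",
    "arn:aws:iam::111122223333:role/example-payments-api",
)
print(json.dumps(manifest, indent=2))
```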

Security

List any security aspects that this feature may affect and what measures are we considering to limit issues in this category

Legal

Will this be public code? What license? Should we call the lawyers? Add any relevant legal ramifications of this work

Future work

Confluence page with details of the implementation

References

optional: links to other relevant documents, tickets, etc