Demo and sandbox environments in Kubernetes
Problem Description
Besides the Core and EMP platforms, Ebury has a number of permanent environments (Demo, Solutions, Sandbox) and temporary testing environments (Mobile, EMP Migration). Each of those environments is currently deployed on a single EC2 instance, using EBOX primitives and docker-compose to run workloads.
In addition, in the Sandbox environment, API services are not part of the EBOX; they are deployed in an ECS cluster that uses the EBOX as a backend.
Deployment and configuration in those environments differ considerably from what is needed to deploy in production, so the advantages of migrating from ECS to Kubernetes are less obvious if development teams still need to manage releases to EBOX environments.
Background
EBOX deployments
Initially we had demo and sandbox with FXSuite, BOS, and EBO running as processes in a single EC2 instance. Deployment was time-consuming and error-prone, and required downtime. Those initial environments were replaced with new ones based on the EBOX concept, so deployment became consistent and no longer required downtime. Later, a new Demo environment for Solutions was created, as well as other temporary environments.
Although adding new services to the environments is feasible, it is a long and tedious process with many divergences from the production deployments. So far, only Account Details, Fee Tier, Document Service, Verify and API services are supported in addition to the traditional BOS/FXS/EBO core.
QFS is also supported, but it only runs on Sandbox. An additional constraint for QFS is that we need to report any change in IPs to the third-party provider.
Software updates in the environments follow a blue/green pattern: an EC2 instance is created from scratch, the platform is deployed with docker-compose, and traffic in the Load Balancer is switched to the green instance. Data persistence is guaranteed because all the workloads use a database hosted in an RDS instance outside the EC2 instance. The remaining AWS services (Redis, Elasticsearch, DocumentDB, SQS, SNS, etc.), however, run inside the EBOX EC2 instances and are wiped and recreated from scratch during deployment, so glitches and unexpected behavior may occur during the upgrade, though this is not considered critical.
Namespaces
For Kubernetes in production we create a namespace for each project or service. This makes it a challenge to run the same workload in two different environments in the same cluster, as it is not possible to have two namespaces with the same name, even if they belong to different hierarchies.
Virtual Private Clouds
Each demo environment has its own VPC, but all the VPCs are using the same CIDRs for their subnets. This means VPC Peering is not an option.
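The overlap problem can be illustrated with a short check. This is a minimal sketch using Python's standard `ipaddress` module; the CIDR values and the `k8s-environments` VPC name are made up for illustration and are not the real ranges.

```python
import ipaddress
from itertools import combinations

# Hypothetical CIDRs: the real ranges are not listed in this document.
vpc_cidrs = {
    "demo": "10.0.0.0/16",
    "solutions": "10.0.0.0/16",
    "sandbox": "10.0.0.0/16",
    "k8s-environments": "10.20.0.0/16",
}


def overlapping_pairs(cidrs):
    """Return sorted pairs of VPC names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    )


conflicts = overlapping_pairs(vpc_cidrs)
for a, b in conflicts:
    print(f"{a} and {b} share address space")
```

Any pair reported here has addresses that are indistinguishable from each other, which is why routing between them through peering is not workable.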
Multi-tenancy in Kubernetes
In order to host multiple environments (tenants) in Kubernetes, there are three models:
- Cluster as a service:

  Each tenant gets its own full cluster, provisioned and maintained with its own lifecycle.

  Caveats: a cluster for each tenant would be cost-intensive in both infrastructure and maintenance.

- Namespace as a service:

  One namespace per tenant, which may include multiple sub-namespaces in a hierarchy.

  - [Hierarchical Namespace Controller (HNC)](https://kubernetes.io/blog/2020/08/14/introducing-hierarchical-namespaces/)
  - [Kiosk](https://github.com/loft-sh/kiosk)

  Caveats: sub-namespace names still need to be unique, and the hierarchy is intended more for policies and security.

- Control plane as a service:

  Also known as Virtual Clusters, where each tenant gets its own Control Plane but shares worker node resources.

  - [vCluster](https://github.com/kubernetes-sigs/multi-tenancy/tree/master/incubator/virtualcluster)
  - [Kamaji](https://github.com/clastix/kamaji)

  Caveats: we would need to revisit and probably re-engineer all the capabilities provided by the platform (logging, monitoring, secrets management, ingress, etc.) so that they work inside the virtual cluster or in conjunction with it. Preliminary tests show that at least the Vault injector would cause a lot of problems, as it is implemented as a mutating admission webhook.
Solution
Create a new Kubernetes cluster for running this kind of environment, with deployment and configuration closer to production. However, as we want the environments to be cheaper than production, a single cluster will host all the environments in a "Namespace as a Service" model.
The cluster will be deployed in a VPC isolated from the production VPCs.
There will be a namespace for each environment, and all the workloads belonging to an environment will be deployed in the same namespace.
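As a rough illustration of the one-namespace-per-environment layout, the sketch below builds a `Namespace` manifest for each tenant. The environment names come from this proposal; the `tenant` label is a hypothetical convention, not an existing platform standard.

```python
import json

# Tenants named in this proposal.
environments = ["demo", "solutions", "sandbox"]


def namespace_manifest(environment: str) -> dict:
    """Build a Kubernetes Namespace manifest for one environment."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": environment,
            # Hypothetical label, useful for selecting all resources of a tenant.
            "labels": {"tenant": environment},
        },
    }


manifests = [namespace_manifest(env) for env in environments]
print(json.dumps(manifests, indent=2))
```

Every workload of an environment would then set `metadata.namespace` to that environment's name.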
Regarding AWS resources, for those that require compute (RDS, ElastiCache, Elasticsearch, DocumentDB, MSK), we will create one instance of the smallest type, with the minimum number of replicas, for each environment. The different workloads in each environment will share the instance, using different databases.
For other AWS services (SNS, SQS, DynamoDB, S3, etc.), we will replicate the resources existing in production for each environment, prefixing them with the environment name.
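The prefixing scheme could look like the following sketch. The queue names and the `-` separator are assumptions made for illustration; the actual naming convention is not defined in this document.

```python
# Hypothetical production resource names, for illustration only.
production_queues = ["payments-events", "notifications"]

# Tenants named in this proposal.
environments = ["demo", "solutions", "sandbox"]


def prefixed_name(environment: str, resource: str) -> str:
    """Build a per-environment resource name, assuming a '-' separator."""
    return f"{environment}-{resource}"


replicated = {
    env: [prefixed_name(env, queue) for queue in production_queues]
    for env in environments
}

for env, names in replicated.items():
    print(env, names)
```

Keeping the production name as the suffix means a service only needs the environment name as configuration to derive all of its resource names.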
Connections during the migration
While the migration is ongoing, services moved to Kubernetes will need access to the services still running on the EBOX. This will be done by allowing traffic into the EBOX from the NAT gateways in the VPC where the cluster lives. In the same way, services running inside the EBOX will need access to services running in Kubernetes, which will be done by allowing traffic from the Demo/Sandbox VPCs in the Kubernetes Ingress.
We will be using the current RDS instances in the Demo/Sandbox VPCs (at least until EBOX is fully deprecated). As VPC Peering with VPCs that have the same CIDRs is not possible, we will create VPC endpoints exposing the RDS instances in the Demo VPCs to the new VPC.
In the specific case of MSK, once a new minimal MSK cluster is available in the new VPC, the environment configuration will need to be changed to use that cluster instead of the Kafka running inside the EBOX.
We have both public addresses provided to our customers and private addresses used internally. The private addresses will be different in Kubernetes, but the public ones will be kept the same, pointing to the environment (EBOX or Kubernetes) where a given service is active.
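The per-service switch of public addresses can be pictured as a lookup from service to the platform where it is currently active. This is only a sketch: the backend identifiers, the migration state, and the choice of EBOX as the default are hypothetical.

```python
from enum import Enum


class Platform(Enum):
    EBOX = "ebox"
    KUBERNETES = "kubernetes"


# Hypothetical migration state: where each service is currently active.
active_platform = {
    "fee-tier-service": Platform.KUBERNETES,
    "bos": Platform.EBOX,
}

# Hypothetical load balancer backends for each platform.
backends = {
    Platform.EBOX: "ebox-target-group",
    Platform.KUBERNETES: "k8s-ingress-target-group",
}


def backend_for(service: str) -> str:
    """Return the backend that the public address of a service should point to.

    Services with no recorded state are assumed to still run on the EBOX.
    """
    return backends[active_platform.get(service, Platform.EBOX)]


print(backend_for("fee-tier-service"))
print(backend_for("bos"))
```

Under this model, migrating a service means changing a single entry, while the public address seen by customers stays the same.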
Alternatives
- Use a different multi-tenancy model like "Cluster as a Service" or "Control plane as a Service".
- Abandon the current EBOX environments completely: create new ones in a multi-tenant Kubernetes cluster with different URLs, restrict access to the old environments, schedule a maintenance window for the final replication of the current databases into the new ones, and provide the new URLs to the interested parties. There are some caveats with this approach:
  - We would need to wait for EBO, Account Details, Verify and QFS to be migrated to Kubernetes, at least in staging, and QFS is especially problematic.
  - We would need to run BOS and FXS in Kubernetes. Although we are already able to do that in the ED8K development environment, it would probably have limitations and unforeseen issues for anything more serious.
  - The rollout would be big bang and would require coordination with different development teams, operations and customers.
- Do not use AWS-hosted Kafka, Redis, etc., and run them inside the cluster instead. That would be cheaper, but it would also require the platform team to support and maintain two different stacks, in addition to running a stateful cluster, with which we do not have much experience.
Caveats
- Having a single namespace versus a namespace for each service/project in production is a big difference. Specifically, there may be name collisions in resources from different services which do not happen in staging or production.
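Such collisions could be detected before deployment by comparing the resources each service declares. The sketch below assumes a simplified inventory of `(kind, name)` pairs per service; the resource names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical inventory: resources declared by each service as (kind, name).
service_resources = {
    "fee-tier-service": [("ConfigMap", "app-config"), ("Service", "fee-tier-service")],
    "account-details": [("ConfigMap", "app-config"), ("Service", "account-details")],
}


def find_collisions(resources_by_service):
    """Map each (kind, name) declared by more than one service to its owners."""
    owners = defaultdict(list)
    for service, resources in resources_by_service.items():
        for kind, name in resources:
            owners[(kind, name)].append(service)
    return {key: sorted(svcs) for key, svcs in owners.items() if len(svcs) > 1}


collisions = find_collisions(service_resources)
for (kind, name), services in collisions.items():
    print(f"{kind}/{name} is declared by: {', '.join(services)}")
```

With one namespace per service, as in staging or production, both `app-config` ConfigMaps can coexist; in a shared environment namespace the second one applied would overwrite the first.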
Security Impact
Credentials (both credentials to third parties and to other Ebury systems and databases inside the environment) are currently managed as Jenkins credentials injected during deployment. This will be replaced with credentials injected by the Vault operator at runtime.
Data in the environments is considered sensitive, especially in the Sandbox environment, as it is user data. The proposed change uses the same existing databases.
Public access to the different endpoints will go through the same load balancers currently in use, routing the traffic to the cluster instead of the EBOX, so there is no impact in terms of exposed endpoints.
The different environments are now deployed in separate EC2 instances in separate VPCs. With the proposed approach, all environments will be deployed in the same Kubernetes cluster. The environments remain isolated, but at a lower level; this is a trade-off between infrastructure and operation costs on one side and isolation on the other. The environments, though still critical, are less critical than the production environment.
If there are environments holding sensitive data, multiple clusters can be provisioned anyway: one for long-term environments like Demo and Sandbox and a different one for ad-hoc environments created for specific projects.
Performance Impact
No big performance impact is expected. Workload in these environments is low and performance is less critical.
Our Kubernetes platform already supports automated scalability for worker nodes, so adding new environments or growing existing ones can be done with minimal intervention.
Developer Impact
As part of the ECS migration, development teams will need to:

- Define deployment configuration in the [Ebury manifest](https://github.com/Ebury/ebury-manifests/tree/master/k8s-environments) instead of the Demo manifest.
- Adapt their pipelines so that deployment targets Kubernetes instead of EBOX.
- Create secrets in Vault for some credentials that currently exist in Jenkins and are injected as part of deployment.
Deployment
A Proof of Concept will be conducted by creating a new cluster in the development account, in its own new VPC, with a demo-test tenant connected to the existing demo-test EBOX environment.
Once the cluster is ready, fee-tier-service, which is already Kubernetes-ready, will be deployed in the Kubernetes cluster, and the interaction between K8s and EBOX will be tested.
Afterwards, a new cluster will be created in the production account (also in its own VPC) with three tenants (demo, solutions and sandbox), and fee-tier-service will be migrated there as well.
As part of the ongoing ECS migration project, development teams will need to handle the migration from EBOX to Kubernetes for the affected services.
In parallel, Ebury API services (also Kubernetes-ready) will be deployed in the demo-test tenant, then in sandbox.
Other existing temporary test environments (like Mobile) will not be migrated by default; it is worth considering creating them from scratch instead.