EMP Infrastructure Parity Plan

Problem Description

When the Ebury EMP environment was created, it was built manually in a manner significantly different from how Core builds and manages its environments. This has resulted in:

  • Significant differences between Core and EMP infrastructure.
  • Different deployment mechanisms for applications between Core and EMP.
  • Some EMP-specific components running within Core.
  • A number of EMP components that were built manually and are not part of IaC.
  • A DR scenario that would involve significant time to rebuild the EMP environments manually.

Abbreviations, acronyms, and definitions

  • Ebury Core or Core: Current Ebury production data, infrastructure and services.
  • EMP, Ebury Mass Payments or Mass payments: Current Ebury mass payments production data and services (originally FrontierPay and/or TheFXFirm)
  • ECR: Elastic Container Registry
  • ECS: Elastic Container Service
  • EKS: Elastic Kubernetes Service.
  • IaC: Infrastructure as Code.
  • PQS: Payment Query Service.
  • EBO: Ebury Online
  • BOS: Back Office Systems
  • RDS: AWS Relational Database Service

Background

The EMP environment was originally built at the start of the relationship between Ebury and FrontierPay. It was built to allow FrontierPay to utilise Ebury's application stack. The build process was entirely manual and did not use IaC processes. The environment has never received any of the updates that were performed in Core, including containerisation and Kubernetes.

What is meant by parity?

EMP will be re-built on separate infrastructure in its own AWS account(s). EMP infrastructure is to be built using IaC processes and will re-use as many Core components as possible (TF modules, Ansible playbooks, Jenkins pipelines). Where it is not possible to use existing IaC components due to localisation or different business requirements, any custom EMP components are to be created as similar as possible to the Core components and processes. Application deployment is included in the parity plan and will follow the above principles.

Solution

The overall plan is to build out new, dedicated EMP environments in EMP-specific AWS accounts, with the goal of having EMP segregated from Core infrastructure.
These environments will be set up to match their Core equivalents as much as practical and will utilise IaC and CI/CD processes.

EMP has requested the creation of new dedicated environments:

  • Staging/pre-prod
  • Production

These will be split across the new AWS accounts, with staging/pre-prod hosted in one and production in the other.

Once the migration is finished there may be the possibility of removing (or running in an idle state) the staging/pre-prod environment depending on business needs.

Business Applications

The following application stacks will be hosted as part of the EMP platform. All application instances will be dedicated to handling EMP-related traffic only.

It is recommended that the applications be built using the existing Core CI infrastructure and then deployed into the EMP environments. This will require some effort, either to enable replication of artifacts into EMP or to set permissions for EMP to access them.

There are a number of applications such as BOS, EBO and FXSuite that are used in both Core and EMP but have different release mechanisms. The recommended approach is to bring these processes as close to parity as practical. This will for the most part mean converting the EMP process over to use the Core build pipelines.

There are a number of applications that are in the process of being migrated from ECS to EKS. The recommendation is to use EKS for an application only if its EKS deployment is currently taking traffic in production within Ebury Core.
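The ECS/EKS recommendation reduces to a simple rule per application. The following is a minimal sketch of that rule, assuming a hypothetical `AppStatus` record; the class, its field names and the example applications are illustrative only and do not come from any Ebury tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppStatus:
    """Hypothetical record of where an application runs in Core today."""
    name: str
    runs_on_ecs: bool
    eks_serving_prod_traffic: bool  # True only if EKS takes Core production traffic


def emp_deployment_target(app: AppStatus) -> str:
    """Pick EKS only when it already serves production traffic in Core."""
    if app.eks_serving_prod_traffic:
        return "EKS"
    if app.runs_on_ecs:
        return "ECS"
    raise ValueError(f"{app.name}: no container platform recorded for Core")


# Illustrative inputs; the real status of each application must be confirmed.
mid_migration = AppStatus("example-app-a", runs_on_ecs=True, eks_serving_prod_traffic=False)
already_on_eks = AppStatus("example-app-b", runs_on_ecs=False, eks_serving_prod_traffic=True)

print(emp_deployment_target(mid_migration))
print(emp_deployment_target(already_on_eks))
```

An application that has been targeted for EKS migration but is not yet serving production traffic there stays on ECS under this rule.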

Ebury Online

Ebury Online is fronted by WAF and the CloudFront CDN.

The EMP EBO architecture is different: it currently deploys to EC2 via the Ebury/eburyonline-ansible playbooks, whereas Core uses container-driven ECS. The recommendation is to change EMP to use ECS.

Deployment type: ECS
NOTE: EBO does not use the standard ECS release pipeline.

Dependencies: Foundations, ECS, ALB, S3, Route53, CloudFront, Vault, WAF

Core TF source

terraform-publicfrontal/src/app_ebo.tf EBO TF

Core Release

Jenkins jobs development/online/ebo.deploy.prod

FXSuite

The current EMP setup of FXS is simple: it is used only for quotes. EMP does not use any of the bridges that send payments to SWIFT, Faster Payments or SEPA, and it does not upload QTM files.

EMP currently deploys to EC2 via Ebury/fxsuite-ansible.

Deployment type: ECS
NOTE: FXS does not use the standard ECS release pipeline.

Dependencies: Foundations, ECS, SNS, SQS, ALB, Route53, Elasticache, NLB, S3, RDS Postgres, Audit / DocumentDB

Core TF source

terraform-backoffice/src/app_fxs.tf FXS TF

Core Release

Jenkins jobs development/online/fxs.deploy.prod

BOS

The EMP BOS architecture is different: it currently deploys via the Ebury/bos-ansible-git playbooks to a number of EC2 instances that were previously created manually.
Core uses a process that creates AMI images and then deploys them to AWS via a blue/green switchover model. Core is in the process of building out a container-driven ECS model; however, it is not yet ready for production deployment.

BOS code also includes a custom middleware component (ebury_audit.middleware.AuditMiddleware, which links to the Ebury/ebury-audit repo) that is released and installed to store the "audited" events as Celery tasks in a specific Redis database. A dedicated Celery worker picks up those tasks and stores them in an AWS DocumentDB.

The recommendation for the initial parity project is to replicate the current Core AMI process. Once the Core container-driven ECS project is complete, it should be relatively straightforward to adapt it for EMP. This may change if the containerisation process progresses significantly during the project.

Dependencies: Foundations, EC2, Elasticache, RDS Postgres, Vault, ALB, Route53, SQS, SNS, S3, Elasticsearch, DocumentDB for audit

Core Ansible

In Core, BOS is deployed via Ansible repos which deploy a prebuilt AMI image:

  • iac-ami-creation
  • iac-bos
  • iac-ec2-creation
  • iac-launcher

There are a number of BOS services running in Ebury Core that do not run in EMP. Effort will need to be spent comparing build processes between Core and EMP. Given this it is unlikely that EMP will be able to re-use these repos and will need EMP specific ones.

Ansible Repos

There are a number of common Ansible repos used by Core that handle the build and deployment of several apps (primarily BOS):

  • iac-ami-creation
  • iac-bos
  • iac-ec2-creation
  • iac-launcher

Sherlock

Anti-money-laundering lookup service.

Deployment type: ECS
The application has been targeted for EKS migration but is not yet at a stage where it can be considered ready for deployment.

Dependencies: Foundations, ECS, SQS, S3, ALB, MSK, Kafka Connect, RDS Postgres

Core TF source

terraform-backoffice/src/app_sherlock.tf BOS TF

Core Release

Jenkins jobs development/etc/sherlock

Verify

The EMP infrastructure uses the Ebury Verify service for 2FA, and BOS requires it for sending emails. Verify is an internal backend service that provides 2FA tokens for EBO and authenticates payments. It is used by BOS as well.

Deployment type: ECS
The application has been targeted for EKS migration but is not yet at a stage where it can be considered ready for deployment.

Dependencies: Foundations, ECS, S3, Secretmanager, ALB, Route53, Kafka Connect

Core TF source

terraform-backoffice/src/app_verify.tf Verify TF

Core Release

Jenkins jobs development/ata/verify

Ebury API

Note: There is a desire from Steve McHugh to include API/API Gateway in the scope of the parity project; however, it will not be done as part of the initial migration and will be treated as a separate task afterwards. To initially use API as a shared service, peering will need to be set up.

Ebury API has an endpoint that reaches directly to the BOS EMP frontapp instance, and to FXS.

Client traffic flow is: client -> Kong API gateway -> API

API deployment is controlled through Ebury/ebury-manifests.
Open question: will this work for EMP, given that it references specific environments in its environments folder?

Application composed of two repos; ebury-api-webapp and ebury-api-auth

API services managed by DXP or API teams

Deployment type: ECS/EKS

Dependencies: Foundations, EKS

IAM

The initial integration of the IAM tool with our systems is through the Ebury API, and as such integrations will exist with BOS, BOS EMP and Verify, in a similar way to the Ebury API today. This service will for now be an exception to parity and will work exactly as the Ebury API does, which means peering will need to be set up. All EMP-related functionality will live in separate realms, completely isolated from Core integrations, so that moving EMP integrations to a new separate deployment later should not be difficult.

IAM deployment controlled through Ebury/ebury-manifests.

Application composed of one repo; ebury-keycloak

Service managed by JAM.

Deployment type: EKS

Dependencies: Foundations, EKS, RDS Postgres, Vault, Kong (only used to expose the service to the internet; could be replaced with anything sensible)

Core TF source

API services are deployed differently from other infrastructure components and do not follow the same structure as other Terraform infrastructure modules.

ebury-api-iac/terraform

API Gateway (Kong)

Note: There is a desire from Steve McHugh to include API/API Gateway in the scope of the parity project; however, it will not be done as part of the initial migration and will be treated as a separate task afterwards. To initially use API as a shared service, peering will need to be set up.

Kong-based API Gateway platform that feeds traffic into the Ebury API service.

Deployment type: ECS

Dependencies: Foundations, ECS, Elasticache, RDS Postgres, WAF, ALB, Vault, Route53

Core TF source

terraform-publicfrontal/src/app_api_gateway.tf

Safeguarding

Core generates this report through Kafka Connect/Debezium: Debezium reads from the BOS database and the output is written to an S3 bucket.

Deployment type: ???

Dependencies: Foundations, MSK, Kafka Connect, S3, RDS Postgres, ?

Core TF source

??

Smart Date Service

Deployed on ECS in Core but not used by Core.

Deployment type: ECS

Dependencies: Foundations, ECS, ALB, Route53, Vault, Elasticache

Core TF source

terraform-natonly/src/app_smart_date_service.tf SDS TF

Payment Query Service

Currently deployed on EKS in Core but not used by Core.

Repo: https://github.com/Ebury/payment-query-service

Deployment type: EKS

Dependencies: Foundations, MSK, Kafka Connect, RDS Postgres

Core TF source

terraform-backoffice/src/app_pqs.tf (covers the database build but not the deployment)

QuickFix Service (QFS)

Talks to BarX (Barclays) over a FIX session.

Deployment type: ECS

Dependencies: Foundations, ECS, Secretmanager, Vault, Stunnel

Core TF source

terraform-natonly/src/app_qfs.tf image info

QuickFix Connect (QFC)

Service to handle FIX connections.

Deployment type: ECS

Dependencies: Foundations, ECS, Secretmanager, Vault

Core TF source

terraform-natonly/src/app_qfc.tf image info

Infrastructure components

The following infrastructure components are used by the applications.

Component | Used by | Comment | Core implementation
Vault | Foundation | | Currently defined in terraform-publicfrontal
Prometheus | Foundation | | Central Prometheus built out as part of terraform-global; federated nodes built out in terraform-publicfrontal
Grafana | Foundation | TBC | Foundation, built out as part of terraform-global, then Ebury/ansible-role-grafana-dashboard-deploy
AlertManager | Foundation | TBC | Foundation, built out as part of terraform-global
Kibana | Foundation | TBC | Foundation, built out as part of terraform-global
Jenkins | All | Full Jenkins master in EMP staging with agents in staging and production | TBC
MSK | Sherlock, Safeguard, PQS | terraform-kafka, but clone an EMP-specific one | Ebury/terraform-kafka and Ebury/ansible-playbook-kafka-topics
Kafka Connect | Sherlock, Safeguard, PQS | Runs within ECS and utilises IAM roles. There are some connectors in Core for EMP that will need to be ported over |
RDS Postgres | FXS, BOS, Sherlock, Safeguard, PQS, API Gateway | Separate DB for each app |
ECS | EBO, FXS, Sherlock, Verify, SDS, QFS, API Gateway | TBC | Built out as part of terraform-global. There might also be a dependency on the ebury-manifest repo to set up the ECS cluster
EC2 | BOS | Through Ansible repos via prebuilt AMI deployment |
EKS | PQS | EKS is under active development in the terraform-internal repo terraform-kubernetes-clusters (?). Possibly need to create a dedicated EMP one, as it references global_vars heavily | terraform-kubernetes-clusters
WAF | EBO, BOS, API Gateway | Managed by Security | Example implementation in terraform-publicfrontal/src/app_ebo.tf (module "ebo_wafv2")
ElasticSearch | BOS | TBC |
EC2 | BOS | TBC | TBC
Cloudfront | EBO | TBC | terraform-publicfrontal/src/app_ebo.tf (module "ebo_cloudfront")
DocumentDB | BOS | Think MongoDB is used in Core? Used for audit? | TBC
AWX | | SRE managed | terraform-global/accounts/prod/awx/terragrunt.hcl, then also Ebury/ansible-role-awx
Aquasec | ?? | Is there a licence requirement for this? Is it needed? | terraform-publicfrontal/src/aquasec_enforcer.tf
SQS/SNS | Sherlock, FXS, BOS | ?? |
Elasticache | FXS, BOS, SDS, API Gateway | |
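
The component list above is keyed by component, but when planning an application build it is often more useful to read it the other way round. The following is a minimal sketch that inverts the "Used by" column into a per-application view. The data is copied from the rows above that name specific applications; the function and variable names are illustrative assumptions, not existing tooling.

```python
from collections import defaultdict

# "Used by" column from the component list, limited to rows naming applications.
USED_BY = {
    "MSK": ["Sherlock", "Safeguard", "PQS"],
    "Kafka Connect": ["Sherlock", "Safeguard", "PQS"],
    "RDS Postgres": ["FXS", "BOS", "Sherlock", "Safeguard", "PQS", "API Gateway"],
    "ECS": ["EBO", "FXS", "Sherlock", "Verify", "SDS", "QFS", "API Gateway"],
    "EC2": ["BOS"],
    "EKS": ["PQS"],
    "WAF": ["EBO", "BOS", "API Gateway"],
    "ElasticSearch": ["BOS"],
    "Cloudfront": ["EBO"],
    "DocumentDB": ["BOS"],
    "SQS/SNS": ["Sherlock", "FXS", "BOS"],
    "Elasticache": ["FXS", "BOS", "SDS", "API Gateway"],
}


def components_by_app(used_by):
    """Invert component -> applications into application -> sorted components."""
    by_app = defaultdict(set)
    for component, apps in used_by.items():
        for app in apps:
            by_app[app].add(component)
    return {app: sorted(components) for app, components in sorted(by_app.items())}


BY_APP = components_by_app(USED_BY)
for app, components in BY_APP.items():
    print(f"{app}: {', '.join(components)}")
```

Rows marked "Foundation" or "All" are left out because they apply to every application.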

Monitoring

Monitoring is done through a combination of Nagios and Prometheus/Alert Manager.

Once the foundation infrastructure is set up, the following Ansible repo can be used to set up monitoring for the environment:

  • Ebury/ansible-playbook-monitoring

Nagios is currently running on manually created EC2 instances (called management) in all the environments, Core and EMP. The code for the Nagios configuration is in https://github.com/Ebury/ebury-infrastructure-scripts. In some cases the configuration has been changed manually on the EC2 instances and never pushed to the Git repository.

This process will need to be fixed so that it is fully pipeline/automation controlled.

Logging

Logs are shipped to a single central ELK stack. This was originally created manually and will need to be implemented through IaC.

AWS Environments

EMP is currently hosted within its own AWS account. This account only contains the EMP production resources. For the EMP infrastructure parity project it is anticipated that two new AWS accounts will be set up: one for development/staging and one for production.

Note: Need to discuss with DXP team how this could work in regards to credentials etc.

A number of VPCs have been created to contain various services. There is a VPC for each of:

  • publicfrontal
  • backoffice
  • natonly
  • legacy BOS
  • MSK
  • Global (contains support tools such as Jenkins/AWS/Prom)

If we maintain the VPC segregation of the environments, it will look like this:

[Figure: EMP Infrastructure]

Pipeline Automation

Where possible, the plan is to reuse as many Ebury Core TF modules as possible. This will assist with providing environment parity. Some EMP-specific TF modules will need to be created to retain environment independence.

Required new EMP Modules/Configurations

Two options were considered for creating the Terraform modules and infrastructure configurations required for EMP. The first option was to create EMP-specific modules based on forks/clones of the existing Core ones. The second option was to reuse as much of the Core modules as possible and insert EMP-related environment information into them.
It was decided to use the first option of creating EMP-specific modules, mainly because it allows the EMP environment to remain independent and avoids scenarios where an update to a shared module would trigger pipeline execution in both the EMP and Core environments.
There will still be shared Terraform modules that are dependencies of the infrastructure modules, but these perform generic functions and are not environment specific. The notable exception to this is terraform-module-emp-globalvars. This module is used heavily by a number of other lower-level Terraform modules, so cloning it would require cloning almost the entire Terraform module estate. Instead, EMP-specific variables will be inserted into terraform-module-emp-globalvars using an emp_ prefix, e.g. emp_devel, emp_prod, etc.
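
To illustrate the emp_ prefix convention, the following is a minimal sketch of how a shared variables lookup could resolve Core and EMP environments side by side. It is written in Python rather than Terraform purely for illustration. The `GLOBAL_VARS` mapping, the Core environment keys and every value in it are invented assumptions; only the emp_devel and emp_prod key names come from this plan.

```python
# Hypothetical stand-in for the shared globalvars module: one mapping keyed by
# environment name. Core keys and all values are invented for illustration;
# only the emp_ prefix convention (emp_devel, emp_prod) comes from the plan.
GLOBAL_VARS = {
    "devel": {"aws_account_alias": "core-devel"},
    "prod": {"aws_account_alias": "core-prod"},
    "emp_devel": {"aws_account_alias": "emp-devel"},
    "emp_prod": {"aws_account_alias": "emp-prod"},
}

EMP_PREFIX = "emp_"


def environment_vars(environment: str, emp: bool = False) -> dict:
    """Return the variables for an environment, using the emp_ key for EMP."""
    key = f"{EMP_PREFIX}{environment}" if emp else environment
    if key not in GLOBAL_VARS:
        raise KeyError(f"no variables defined for environment {key!r}")
    return GLOBAL_VARS[key]


def emp_environments() -> list:
    """List every environment key that belongs to EMP."""
    return sorted(key for key in GLOBAL_VARS if key.startswith(EMP_PREFIX))


print(environment_vars("prod"))
print(environment_vars("prod", emp=True))
print(emp_environments())
```

Because EMP entries are added under new keys, existing Core lookups are unchanged, which is the property that makes sharing this one module acceptable.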

terraform-module-globalvars

Not a new module, but a significant modification of the existing one so that it contains variables specific to the EMP environments.

terraform-emp-global

Clone of terraform-global, used to set up the core environment components. This module sets up generic components that are global to the AWS account, such as Aquasec, user policies, IAM users, IAM roles, Avoka, ECR repos and so on.

ebury-manifest / platform-manifest

Mechanism to trigger API version deployments.

Investigation will need to be performed to see if this can be reused or needs to be duplicated.

terraform-emp-backoffice|natonly|publicfrontal

Used to build out the main environment configuration. Currently this is in three separate repos: terraform-backoffice, terraform-natonly and terraform-publicfrontal. Adhering to the definition of parity above, it is recommended to maintain the three repos.

terraform-emp-kafka

Used to build out an MSK instance

terraform-emp-kubernetes-clusters

Used to build out a Kubernetes cluster

Jenkins

To maintain independence from Ebury Core it is recommended to have a separate Jenkins instance. The Jenkins configuration will follow the core layout of the master residing in the AWS staging account with agents deployed in the staging and production zones.

Migration Process

The migration process will be broken up into multiple phases.

The first phase will deliver an initial EMP environment that is functionally equivalent to the current one and has parity with the Core environment in terms of operations.

Phase 1 - Parity via IaC

The first phase will deliver the core components of the environment including networking, operations and observability as well as the application components required to provide the equivalent environment.

Basic networking and Foundations

This is the core AWS networking setup: creating VPCs, VPC peerings and subnets. It also installs a number of components required for the management of the environments:

  • Vault
  • Prometheus, Grafana and AlertManager
  • Kibana resizing
  • Jenkins master and agents
  • Database obfuscation process
  • Prometheus federated nodes
  • Alert runbooks and BCP review

Cluster / Shared dependent components

Creation of shared infrastructure components required to support applications:

  • ECS cluster creation
  • Redis/Elasticache
  • RDS Postgres

Equivalence Applications

Creation and setup of applications for the initial equivalence:

  • BOS
  • FXS
  • EBO

Phase 2 - Additional Infrastructure Work

The second phase will start to increase the capability of the EMP platform by adding new components.

Cluster / Shared dependent components

Creation of shared infrastructure components required to support phase 2 applications:

  • MSK
  • EKS
  • Kafka Connect

Phase 2 applications

The following applications are required based on other in-progress EMP RFCs:

  • PQS + SDS
  • Sherlock
  • QFS/QFC
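
The shared components that phase 2 introduces can be cross-checked against the per-application dependency lists given earlier in this document. The following is a minimal sketch of that check. The dependency sets are transcribed from those lists, with two assumptions made here: EKS is added to PQS because its deployment type is EKS, and the audit DocumentDB entries for FXS and BOS are recorded simply as "DocumentDB". The function names are illustrative.

```python
# Dependency lists for the phase 1 applications, as stated in their sections.
PHASE_1 = {
    "EBO": {"Foundations", "ECS", "ALB", "S3", "Route53", "CloudFront", "Vault", "WAF"},
    "FXS": {"Foundations", "ECS", "SNS", "SQS", "ALB", "Route53", "Elasticache",
            "NLB", "S3", "RDS Postgres", "DocumentDB"},
    "BOS": {"Foundations", "EC2", "Elasticache", "RDS Postgres", "Vault", "ALB",
            "Route53", "SQS", "SNS", "S3", "Elasticsearch", "DocumentDB"},
}

# Dependency lists for the phase 2 applications.
PHASE_2 = {
    # Assumption: EKS added because the PQS deployment type is EKS.
    "PQS": {"Foundations", "EKS", "MSK", "Kafka Connect", "RDS Postgres"},
    "SDS": {"Foundations", "ECS", "ALB", "Route53", "Vault", "Elasticache"},
    "Sherlock": {"Foundations", "ECS", "SQS", "S3", "ALB", "MSK", "Kafka Connect",
                 "RDS Postgres"},
    "QFS": {"Foundations", "ECS", "Secretmanager", "Vault", "Stunnel"},
    "QFC": {"Foundations", "ECS", "Secretmanager", "Vault"},
}


def required(apps):
    """Union of every dependency needed by a group of applications."""
    return set().union(*apps.values())


def new_components(previous, current):
    """Components needed by the current phase that the previous one did not build."""
    return sorted(required(current) - required(previous))


print(new_components(PHASE_1, PHASE_2))
```

Under these assumptions the result contains MSK, EKS and Kafka Connect, matching the phase 2 shared components, plus Secretmanager and Stunnel, which appear only in the QFS/QFC dependency lists and may need to be scheduled alongside them.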

Clean up of Core

Any existing EMP components that are currently hosted in Core, such as PQS and SDS, will need to be decommissioned.

Alternatives

An initial attempt to merge the two environments within Ebury was abandoned. Although there are several alternatives for how we can create and host this new environment, they can be considered implementation details compared to the overall requirement to have a managed, scalable infrastructure.