Technology Strategy

The main goal of the Technology Strategy is to enable Ebury to scale and deliver business value:

  • Quickly - by providing the teams with the tools and autonomy they need to build and run their own services with minimal dependencies.

  • Often - with an API-first focus that enables teams to release their services independently.

  • Safely - well-defined encapsulation improves testability, and ownership of the entire life-cycle of a service improves performance and reliability.

Problem Description

The mid-term goal of the Technology Strategy is to achieve:

  • Mean Time To Repair of 2 hours
  • Maximum of 5 Severity-1 incidents a year
  • Deployment of a new service in 3 days
  • 99.95% Service Level Objective
  • Significant deprecation or elimination of Legacy Systems
  • Migration to new UIs

As the Business scales, developers also need to be able to work independently and to quickly master the services their teams are responsible for.

This is currently not possible with our legacy monolithic architecture.

Solution

The solution is to migrate Ebury technology to a de-coupled service-oriented architecture.

Monoliths are split into many independent, stand-alone services, each with a well-defined API.

Each service is owned by one team, which is responsible for the entire lifecycle, from infrastructure provision to supporting the service in production.

Team Structure Based on Business Capabilities

Team Structure

“Any organisation that designs a system will produce a design whose structure is a copy of the organisation's communication structure.” - Conway's Law

A system architecture is constrained by the communication channels of the people building it and will inevitably mirror an organisation's structure. Consequently, the organisation of teams is of paramount importance and will define the overall architecture.

Small autonomous teams expert in their domains are the most productive and create the highest quality services.

Teams specialise in a small number of business capabilities and have the tools, skills and knowledge to deliver new functionality quickly and safely. Development is localised within a team with minimal cross-team dependencies.

This structure enables Ebury to reduce the time it takes to release new functionality whilst improving performance and reliability.

Chapters

The Business comprises a wide range of Capabilities - which can be grouped into Business Domains. The Ebury Domains and Capabilities Map defines the current mapping.

Teams specialise in the services providing the capabilities in one or two Domains.

Teams are self-contained and contain people with complementary technical skills. Individuals with skills in one area (e.g. databases) can form Chapters across multiple teams to share ideas.

Service Teams provide the platform as a service to support product teams.

Team Responsibilities

A team owns the entire lifecycle of a service: design, infrastructure definition, implementation, testing, release and support of the service in production.

A service is owned by one team, which is responsible for all aspects of the service, including its reliability and availability in production.

Service Based Architecture

Services

Componentisation

A component is a unit of software that is independently replaceable and upgradeable.

Sharing components via software libraries requires applications to be rebuilt and released when a component is upgraded.

Sharing components via services, however, means applications benefit immediately when a component is upgraded.

Running components in services enforces encapsulation and well-defined public interfaces.

Services

A Service is a cohesive component that does one thing really well. It is small enough to be owned and fully understood by one team. It has a strict API boundary and must be designed to handle failures.

A Service may support channels via a synchronous API and may access other Services in the same Domain with synchronous calls (but ideally should be as stand alone as possible). However, it should interact with Services outside its Domain using asynchronous events.
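The interaction rule above can be sketched as a small function. This is only an illustration under stated assumptions, not Ebury's implementation: the `Service` type, the example domain names and the `interaction_style` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """A service and the Business Domain it belongs to (names are illustrative)."""
    name: str
    domain: str


def interaction_style(caller: Service, callee: Service) -> str:
    """Return how `caller` should talk to `callee` under the rule above."""
    if caller.domain == callee.domain:
        # Same Domain: a synchronous call is permitted, though not required.
        return "synchronous-allowed"
    # Different Domains: communicate through asynchronous events only.
    return "asynchronous-events"


# Hypothetical services and domains, used only to exercise the rule.
payments = Service("payments", "Payments")
ledger = Service("ledger", "Payments")
onboarding = Service("onboarding", "Clients")

print(interaction_style(payments, ledger))
print(interaction_style(payments, onboarding))
```

A check like this could run in a design review or a lint step, but where the rule is enforced is a choice the strategy leaves open.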

Loosely coupled asynchronous Services tend to perform better and be more reliable, because a slow or failed Service does not block the Services that consume its events.

A Service allows a team to select the most appropriate technology stack. A fixed API enables them to release new versions without coordination and iterate designs quickly.

A Service provides granularity that practically enables performance improvement through horizontal and vertical scaling.

Most importantly, the self-contained nature of Services and the teams developing them allows the number of services and teams to readily scale as the Business grows.

Service Size


Services are ‘right sized’ on natural boundaries to support a Business Capability.

A Service should provide a cohesive unit of functionality and be easily and comfortably maintained by one team.

One team could support a larger service - or a collection of smaller services. A team should be very familiar with their services and be experts on running them in production.

Ebury Service Architecture

Service Architecture

Channels

These are the channels through which our clients and internal business users access the business capabilities of our systems (e.g. EBO, API, OPS UI, Mobile, etc). They represent our presentation layer and are fundamentally responsible for providing the user experience.

Domain Services

They are responsible for providing the Business Capabilities and exposing core business processes to the Channels.
These Services manage the life cycle of data and publish state changes as Domain Events which can be consumed by other services.

Kafka as the Enterprise Integration Platform

Kafka is a distributed event streaming platform and can be used as a highly scalable decentralised database to connect all systems.

Kafka is used to connect third party systems to Ebury.

Kafka is used to connect the Services providing internal Business Capabilities and the legacy systems in the process of being deprecated.

Decoupled Event Driven Architecture

Domains are decoupled and communicate via asynchronous events on Kafka.

Domains are self-contained as far as possible, maintaining local copies of the data they require.
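As a rough sketch of how a Domain might keep a local copy of data it needs, the example below subscribes a read model to events. Everything here is an assumption for illustration: `InMemoryEventBus` is a synchronous stand-in for Kafka so that the example runs without a broker, and the topic name, event fields and `ClientDirectory` class are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryEventBus:
    """Synchronous stand-in for Kafka topics; delivers events in publish order."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(event)


class ClientDirectory:
    """Local read model owned by a consuming Domain (hypothetical)."""

    def __init__(self) -> None:
        self.clients: Dict[str, dict] = {}

    def on_client_updated(self, event: dict) -> None:
        # Upsert: the latest event for a client replaces the previous copy.
        self.clients[event["client_id"]] = {"name": event["name"]}


bus = InMemoryEventBus()
directory = ClientDirectory()
bus.subscribe("clients.updated", directory.on_client_updated)

bus.publish("clients.updated", {"client_id": "c-1", "name": "Acme Ltd"})
bus.publish("clients.updated", {"client_id": "c-1", "name": "Acme Limited"})

print(directory.clients["c-1"]["name"])
```

The consuming Domain answers queries from `directory.clients` and never calls the owning Domain directly, so it keeps working if that Domain is unavailable.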

Within a domain, tightly coupled Services may communicate synchronously but care must be taken to avoid deep call stacks.

Ambassadors and Gateways

Kafka is the core tool for integrating different systems.

External services provided by third parties are integrated using Gateways to Kafka.

Internal legacy systems are integrated using Ambassadors to Kafka.
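One possible shape for an Ambassador is sketched below, assuming its job is to translate legacy-shaped records into events and hand them to a publisher. The strategy does not prescribe this design; the field names, the `TradeBooked` event, the topic name and the `Ambassador` class are hypothetical, and the publisher is a plain callable rather than a real Kafka client.

```python
from typing import Callable, List, Tuple


def legacy_row_to_event(row: dict) -> dict:
    """Translate a legacy-shaped record into an event (field names hypothetical)."""
    return {
        "type": "TradeBooked",
        "trade_id": str(row["TRADE_ID"]),
        "amount": row["AMT"],
        "currency": row["CCY"].upper(),
    }


class Ambassador:
    """Sits beside a legacy system and publishes its changes as events."""

    def __init__(self, publish: Callable[[str, dict], None], topic: str) -> None:
        self._publish = publish
        self._topic = topic

    def forward(self, row: dict) -> None:
        self._publish(self._topic, legacy_row_to_event(row))


# A list stands in for the event platform so the example runs without a broker.
published: List[Tuple[str, dict]] = []
ambassador = Ambassador(lambda topic, event: published.append((topic, event)), "trades.booked")
ambassador.forward({"TRADE_ID": 42, "AMT": "1000.00", "CCY": "eur"})

print(published[0])
```

Keeping the translation in one function means the legacy data shape does not leak into the services that consume the events.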

Internal or External Services

The architecture provides Ebury with a choice of using external services or building new internal services.

Build new services that:

  • Differentiate Ebury
  • Add value to Ebury
  • Support our core competencies

Buy third party services that:

  • Are ‘off the shelf’ commoditized services
  • Are specialist services that fulfil our requirements (e.g. Treasury Reporting, Cards, etc.)

Reliability

The mid-term goal is to achieve a Mean Time To Repair of 2 hours and 99.95% availability.
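To make the availability target concrete, the calculation below converts a percentage into a yearly downtime budget. It assumes availability is measured as the fraction of uptime over a 365-day year; the measurement window is not stated in this document, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of unavailability permitted by an availability target."""
    return period_minutes * (1 - availability_pct / 100)


budget = downtime_budget_minutes(99.95)
print(f"{budget:.1f} minutes per year")
print(f"{budget / 60:.2f} hours per year")
```

Under these assumptions, 99.95% allows roughly 263 minutes (about 4.4 hours) of unavailability per year, so two full outages that each took the full 2 hours to repair would consume most of the budget.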

To achieve this, failures must be detected quickly through real time monitoring at all levels of a service and real time alerts. The team that built the service is also responsible for the performance and reliability of the service in production.

Where possible, issues may be resolved by automatically restarting the components of a service. The team that built the service should have the necessary tools and knowledge to resolve incidents quickly.

Services should be isolated and have minimum dependencies. Services should have well defined behaviours when other services are not available.
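A well-defined behaviour for an unavailable dependency could look like the sketch below, which falls back to the last known value. This is one option among several (failing fast or queueing the request are others), and whether a stale value is acceptable depends on the business context. `DependencyUnavailable`, `get_rate_with_fallback` and the currency pairs are hypothetical.

```python
from typing import Callable, Dict, Optional


class DependencyUnavailable(Exception):
    """Raised by a client when a downstream service cannot be reached."""


def get_rate_with_fallback(
    fetch_live: Callable[[str], float],
    cached_rates: Dict[str, float],
    pair: str,
) -> Optional[float]:
    """Return the live rate, or the last known rate if the dependency is down."""
    try:
        rate = fetch_live(pair)
    except DependencyUnavailable:
        # Defined degraded behaviour: last known value, or None if never seen.
        return cached_rates.get(pair)
    cached_rates[pair] = rate  # remember the value for future outages
    return rate


def failing_fetch(pair: str) -> float:
    """Simulates a dependency that is currently unavailable."""
    raise DependencyUnavailable(pair)


cache: Dict[str, float] = {}
live = get_rate_with_fallback(lambda pair: 1.17, cache, "EURUSD")
stale = get_rate_with_fallback(failing_fetch, cache, "EURUSD")
unknown = get_rate_with_fallback(failing_fetch, cache, "GBPUSD")
print(live, stale, unknown)
```

The point is that the caller's behaviour during an outage is decided in advance and covered by tests, instead of emerging from an unhandled exception.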

Platform as a Service

In order for teams to fully own their services, they are provided with tools that allow them to ship quickly, often, and safely with minimal friction.

Platform tools give teams the autonomy to deliver and manage their services while controlling the risk.

Productivity: the platform must enable developers to be more productive. Autonomous teams own the infrastructure and the services they run. They define their required infrastructure declaratively for fulfilment by the platform-as-a-service.

Observability: performance issues must be easily identifiable and understood whilst maintaining the integrity of the platform and its data.

Availability: the platform must enable the reduction of the mean time to repair by supporting the automatic detection of issues and the automatic recovery of services in production.

Capability: the platform must provide the features the developers require, from running applications to disaster recovery. The platform ensures it is a secure environment for applications, taking care of network isolation and service authentication.

Reliability: the platform must enable the management of technical debt, and ensure capabilities are created – and removed – in a scalable way.

Performance: the platform must enable scalable performance by supporting load balancing and automatic scaling on demand of services in production.
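The Productivity point mentions teams defining their infrastructure declaratively. A minimal sketch of what such a declaration might look like follows; the `ServiceSpec` fields and validation rules are hypothetical and do not describe the platform's actual interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ServiceSpec:
    """What a team asks the platform for; the platform decides how to fulfil it."""
    name: str
    owner_team: str
    replicas: int = 2
    kafka_topics: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the spec is acceptable."""
        errors = []
        if not self.owner_team:
            errors.append("owner_team is required: every service has one owning team")
        if self.replicas < 1:
            errors.append("replicas must be at least 1")
        return errors


spec = ServiceSpec(
    name="payments-api",
    owner_team="payments",
    replicas=3,
    kafka_topics=["payments.events"],
)
incomplete = ServiceSpec(name="orphan", owner_team="", replicas=0)

print(spec.validate())
print(incomplete.validate())
```

Because the spec states what is needed and not how to provision it, the platform can change the underlying tooling without teams rewriting their definitions.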

Alternatives

The main alternative is to maintain the monolith - but to have stricter internal modularisation. Teams work on the same code-base - but their work is segregated with strict internal API boundaries between components.

This is not a realistic option. All parts of the monolith still share the same database, and the existing code base makes extensive use of that database to share data between components. Python also has poor support for modularisation, so it is difficult to prevent the work of one team from affecting the capabilities being developed by other teams.

There are many other impediments to improving the performance and reliability of the monolith, for example outdated and unsupported technology, poor support for concurrency, lack of granularity, etc.

Performance Impact

The Technology Strategy improves reliability and performance. It also enables services to scale with the Business.

  • Teams specialising in their services can scale them horizontally or vertically.
  • The services provide the right granularity, enabling performance improvements to be focussed, localised and released independently.
  • Services are stand-alone and use asynchronous message passing which provides scalable concurrency.
  • Services are self-contained and keep local copies of the data they require. This reduces access latency and scales with the number of services. It also improves reliability as the service is decoupled from failures in other parts of the system.

Developer Impact

A goal of the Technology Strategy is to enable teams to rapidly become experts in the services they own and work autonomously with minimal dependencies on other teams.

This means an API-first approach that enables developers to release service updates at any time with minimal coordination with other teams.

Teams can define the infrastructure for new services and choose their own technology stacks.

Teams support their services in production and the knowledge gained is used to improve reliability, performance and efficiency.

Teams work in one or two business Domains. The number of services owned by a team is constrained to avoid cognitive overload and minimise the cost of context switching. New members of a team become productive quickly.

Data Consumer Impact

Events published on Kafka are used as the main enterprise integration mechanism. Data Consumers subscribe to Events and persist them locally as required.

Data Schemas (Apache Avro) are used to define Events. The Schemas are used to ensure compatibility and manage data evolution.
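The kind of compatibility check a schema enables can be sketched as follows. Avro's schema resolution requires that a field present in the reader's schema but absent from the writer's has a default value; the function below checks only that one rule on record schemas written as Python dicts. It is a simplification, not the Avro library: it ignores type promotion, aliases and nested records, and the `PaymentCreated` schema is hypothetical.

```python
from typing import List


def missing_defaults(reader_schema: dict, writer_schema: dict) -> List[str]:
    """Fields the reader expects, the writer never wrote, and that lack a default."""
    written = {f["name"] for f in writer_schema["fields"]}
    return [
        f["name"]
        for f in reader_schema["fields"]
        if f["name"] not in written and "default" not in f
    ]


v1 = {"type": "record", "name": "PaymentCreated", "fields": [
    {"name": "payment_id", "type": "string"},
    {"name": "amount", "type": "string"},
]}

# New optional field with a default: events written with v1 remain readable.
v2_ok = {"type": "record", "name": "PaymentCreated", "fields": v1["fields"] + [
    {"name": "reference", "type": ["null", "string"], "default": None},
]}

# New required field without a default: events written with v1 cannot be read.
v2_bad = {"type": "record", "name": "PaymentCreated", "fields": v1["fields"] + [
    {"name": "reference", "type": "string"},
]}

print(missing_defaults(v2_ok, v1))
print(missing_defaults(v2_bad, v1))
```

A check along these lines, run before a new schema version is published, is one way to let producers evolve Events without breaking existing Data Consumers.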

Deprecated Systems

BOS and FXSuite are deprecated systems. A deprecated system is a component which we actively work to divest and retire from the platform. While this might take years to accomplish, any further development on a deprecated system must be limited to:

  • Addressing critical bugs and action points from Incident post-mortems
  • Removing code that is no longer needed
  • Exposing functionality to an Ambassador
  • Refactoring to call external services replacing legacy functionality
  • Tactical feature development with a strong justification (any new development in a deprecated system would require re-implementing at a later stage).

Tactical vs Strategic

Every new deliverable should be aligned with the Technology Strategy.

If this cannot be achieved (e.g. for time or regulatory reasons) an RFC is required stating the reasons why a Tactical solution not aligned with the Technology Strategy is justified.

The strategic solution should be defined at a high level in the RFC as an alternative implementation to the tactical solution. The tactical solution's deliverables should be clearly defined, including requirements and rules, so that however disposable the tactical solution is, the strategic version can be built from those requirements at a future date.

Implementing a tactical solution accrues new technical debt. While this may be done for valid reasons, the plan has to be explicitly approved, and removing that debt has to be part of the plan. The approval process should also help to expose incorrect assumptions, such as the tactical solution being more cost or resource effective than the strategic one, or an overstatement of the urgency and priority of short-term requirements over the benefits of the long-term solution.

If a tactical solution is approved, then a strategic implementation should be put in the engineering plan.
