Problem Description

Development teams increasingly ask the platform to provide security and observability capabilities that, in a microservices architecture, are usually solved with the sidecar and service mesh patterns.

Background

The Ebury Platform has been undergoing a transformation in recent years:

  • Moving to a distributed architecture, which reduces the blast radius of incidents and enables the development teams to be autonomous.
  • Providing a Platform as a Service with Kubernetes and Kafka at its core.

However, a distributed architecture comes with its own challenges:

  • Service to service authentication and authorization
  • Observability

There are different commercial and OSS solutions for implementing a service mesh. The candidates were shortlisted to:

  • Istio
  • Kuma
  • Linkerd

Ultimately, Linkerd was chosen for its simplicity and its licensing model: all features are included in the OSS version, and we are charged only for support and tools. A business case has already been approved and a contract has been signed with Buoyant, which includes support for the initial deployment.

Solution

Deploy the Linkerd service mesh in our staging and production clusters, and include all workloads in the mesh by default, with all enforcement capabilities disabled. Services may opt out if the mesh interferes with their operation.
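
The default-in, opt-out behaviour can be sketched as follows. Linkerd decides whether to inject its proxy from the linkerd.io/inject annotation, and a value set on the workload's pod template takes precedence over the value set on its namespace. The function name and the example values below are hypothetical and only illustrate that precedence; the sketch ignores the "ingress" injection mode.

```python
from typing import Mapping

INJECT_ANNOTATION = "linkerd.io/inject"


def is_meshed(namespace_annotations: Mapping[str, str],
              pod_annotations: Mapping[str, str]) -> bool:
    """Return True if the workload would get the Linkerd proxy injected.

    The pod-level annotation, when present, takes precedence over the
    namespace-level one. With no annotation at all, nothing is injected.
    Simplification: only "enabled" and "disabled" are considered here.
    """
    value = pod_annotations.get(
        INJECT_ANNOTATION,
        namespace_annotations.get(INJECT_ANNOTATION, "disabled"),
    )
    return value == "enabled"


# Namespace included in the mesh by default (hypothetical example values).
namespace = {INJECT_ANNOTATION: "enabled"}

default_workload = {}                                  # inherits from namespace
opted_out_workload = {INJECT_ANNOTATION: "disabled"}   # explicit opt-out

print(is_meshed(namespace, default_workload))    # True
print(is_meshed(namespace, opted_out_workload))  # False
```

Under this model, a service that the mesh interferes with only needs the opt-out annotation on its own pod template; the namespace default stays untouched.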

The development, demo/sandbox and tools clusters are out of scope for the initial deployment.

Deployment shall be observable and resilient.

How the capabilities provided by the mesh, or built on top of it, will be used is covered in a separate document.

Alternatives

  • Use any other implementation of service mesh.
  • Do nothing and let each service implement its own solution.

Caveats

A service mesh is an additional layer of abstraction and therefore increases the complexity that the platform team needs to manage. Malfunctions and errors in the mesh can potentially affect the whole platform and lead to SEV-1 incidents.

Operation

The service mesh deployment and operation will be managed as code, using the Terraform Helm provider. Although there are several caveats associated with deploying Helm charts with Terraform, it is the supported tool for deploying new platform components.

Security Impact

Just by including workloads in the mesh, we get authentication and encryption in transit for service-to-service traffic. This also lays the foundations for service-to-service authorization and zero trust.

Buoyant Cloud, a third-party service, will collect data from the control plane in order to provide observability and support. This sheet from Buoyant is a summary of the data collected. In addition, access to the observability portal will be configured with Single Sign-On (SSO).

Performance Impact

As the mesh adds a new hop for traffic, it is expected to have an impact on latency and throughput.
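
As a rough way to reason about the latency side, the sketch below estimates the overhead for one end-to-end request. It assumes both ends of every call are meshed, so each service-to-service call crosses two proxies (the caller's sidecar and the callee's sidecar). The function name is hypothetical and the per-proxy cost is a placeholder, not a measured figure; real values have to come from our own benchmarks.

```python
def added_latency_ms(meshed_calls: int, per_proxy_ms: float) -> float:
    """Estimate the latency the mesh adds to one end-to-end request.

    Assumption: both sides of every call are meshed, so each call passes
    through two proxies (caller sidecar and callee sidecar).
    """
    if meshed_calls < 0 or per_proxy_ms < 0:
        raise ValueError("meshed_calls and per_proxy_ms must be non-negative")
    proxies_per_call = 2
    return meshed_calls * proxies_per_call * per_proxy_ms


# Hypothetical request that chains three service-to-service calls, with a
# placeholder cost of 1.0 ms per proxy traversal (replace with measurements).
print(added_latency_ms(meshed_calls=3, per_proxy_ms=1.0))  # 6.0
```

The point of the model is that the overhead grows with the depth of the call chain, so deep chains are the first place to look when measuring the impact.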

Deployment

The Linkerd deployment will first be tested in a new cluster created ad hoc for this purpose. It will then be deployed in staging and production, initially without any workload included in the mesh and without any enforcement.

Workloads will be added progressively to the mesh in staging and production, following a rollout plan.

Linkerd and cert-manager (a prerequisite for certificate rotation) shall be deployed as Helm charts managed from Terraform code.

Integration with Buoyant Cloud shall be done by deploying its Helm chart, which is needed for exposing the necessary metadata. It requires a clientID and clientSecret pair that shall be stored in a Kubernetes Secret object.
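
For illustration, the shape of such a Secret can be sketched as below. The Secret name, namespace and key names are hypothetical (the keys the Buoyant Cloud chart actually expects must be taken from its documentation), and the credential values are placeholders that would come from a secret store rather than from source code. The only Kubernetes behaviour relied on is that values under a Secret's data field are base64-encoded.

```python
import base64
import json


def build_credentials_secret(name: str, namespace: str,
                             client_id: str, client_secret: str) -> dict:
    """Build a Kubernetes Secret manifest holding the credentials pair.

    Values under `data` must be base64-encoded strings.
    """
    def encode(value: str) -> str:
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            # Hypothetical key names, chosen for this sketch only.
            "client_id": encode(client_id),
            "client_secret": encode(client_secret),
        },
    }


# Placeholder name, namespace and values, for illustration only.
manifest = build_credentials_secret(
    name="buoyant-cloud-credentials",
    namespace="buoyant-cloud",
    client_id="example-client-id",
    client_secret="example-client-secret",
)
print(json.dumps(manifest, indent=2))
```

Note that base64 is an encoding, not encryption, so access to the Secret still has to be restricted through the usual cluster access controls.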

Certificate Management

Linkerd requires certificates to be set up prior to deploying the service mesh itself. This involves an initial self-signed certificate (root.linkerd.cluster.local) that acts as the trust anchor (trusted CA) for all the other certificates (identity.linkerd.cluster.local) that will be generated. If we adopt the Linkerd multi-cluster feature, the same trust anchor needs to be reused in all clusters in order to establish workload identity across clusters.

  • The trust anchor needs to be stored permanently and will be used to establish identity even in communication across clusters, so it only needs to be generated once. It is recommended that it be valid for a long duration (years). We will store it in Vault and expose it in all clusters as needed, using the External Secrets Operator.
  • The identity certificates are to be rotated frequently (days) and this shall be automated using cert-manager. This process is described in the Linkerd documentation.
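
The rotation timing can be reasoned about with a small sketch. cert-manager's Certificate resource takes a duration (the certificate lifetime) and a renewBefore (how long before expiry renewal is triggered), so renewal happens at issue time plus duration minus renewBefore. The function name and the 48 h / 25 h values below are illustrative assumptions, not settings this document prescribes.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(issued_at: datetime, duration: timedelta,
                 renew_before: timedelta) -> datetime:
    """Return when renewal is triggered for a certificate.

    Renewal starts `renew_before` ahead of expiry, where expiry is
    `issued_at + duration`.
    """
    if renew_before >= duration:
        raise ValueError("renew_before must be shorter than duration")
    return issued_at + duration - renew_before


# Illustrative values: a 48-hour certificate renewed 25 hours before expiry.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
renew_at = renewal_time(issued, timedelta(hours=48), timedelta(hours=25))
print(renew_at.isoformat())  # 2024-01-01T23:00:00+00:00
```

Keeping renewBefore above half the duration, as in this example, leaves more than one renewal attempt window before the certificate expires if a rotation fails.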

The creation of secrets and automated rotation will be translated to Terraform to keep all the certificate management steps as code.

Alerts

Linkerd exposes Prometheus metrics that can be used to define custom alerts.
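
As an example of what a custom alert could evaluate, the sketch below computes a success rate and compares it against a threshold. It assumes the success and failure counts for a time window are derived from the Linkerd proxy's response_total metric, split by its classification label. The function names and the 99% threshold are hypothetical, and in practice the rule would be expressed in the alerting system rather than in application code.

```python
from typing import Optional


def success_rate(success: float, failure: float) -> Optional[float]:
    """Return the fraction of successful responses, or None with no traffic."""
    total = success + failure
    if total == 0:
        return None
    return success / total


def should_alert(success: float, failure: float,
                 threshold: float = 0.99) -> bool:
    """Alert when there is traffic and the success rate is below threshold.

    The default threshold is a placeholder, not an agreed SLO.
    """
    rate = success_rate(success, failure)
    return rate is not None and rate < threshold


# Hypothetical counts for one evaluation window.
print(should_alert(success=950, failure=50))  # True
print(should_alert(success=999, failure=1))   # False
print(should_alert(success=0, failure=0))     # False
```

Treating a window with no traffic as "no alert" is a design choice of this sketch; a separate alert on missing traffic may be preferable for critical services.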

The Buoyant Cloud alerting system provides an integration with Slack. An integration with Splunk On-Call is not currently available out of the box. If necessary, the Platform team, together with Buoyant, will need to analyze alternative ways of connecting to Splunk On-Call.

Considerations for the Future

Additional features and capabilities provided by Linkerd will be considered in subsequent iterations:

  • Dynamic request routing
  • Circuit breaking
  • Ingress and egress control
  • Multi-cluster support
  • Service mesh extensions