Ebury Channels IAM

Identity and Access Management for Ebury Channels

Definitions

Backend: The part of the Ebury platform that needs to be protected, in the sense that it must only be accessed by trusted users who have the right to do so.
Identity and Access Management system (IAM): Part of the Ebury architecture that ensures that only authorized users can access each part of the backend. It needs an Identity Provider and an Access Management system.
Identity: Individuals (or applications) external to the system that claim to have the right to access.
Identity Provider (IdP): System that creates, maintains, and manages identity information and also provides authentication services.
User pool: Inside the Identity Provider, a subset of identities that is considered for an authentication attempt. This is useful for having multitenancy in the Identity Provider. We can have several pools and use them selectively for each different channel. Also known as "Realm", since some implementations use this name.
Credentials: Set of secrets and configurations that allow a legitimate user to prove their identity.
Federated Identity: Means of linking a user identity between different systems.
Authentication (Authn): Proved assertion about the identity of a user, in relationship with our system.
Authentication flow: Choreographed steps that will present the user with challenges to prove their claimed identity.
Multi-factor Authentication: The existence of more than one challenge in an authentication flow. Usually, providing a pre-shared secret is used as the first challenge, and then a computer-generated secret is requested as a second challenge. Also referred to as "Two-factor authentication". Also used with the meaning of the second challenge itself.
Mutual TLS (mTLS): The ability of TLS protocol to prove the HTTP client identity using client-side X.509 certificates. The underlying communication in HTTPS is encrypted with TLS protocol, and we will use this client-to-server authentication extension in some scenarios.
Mutual TLS Client Authentication (tls_client_auth): The ability to use the mTLS certificate as client credentials to get a valid token using OAuth2.0 authorization framework.
mTLS Certificate binding: When using mTLS as authentication method, the token generated is bound to the certificate. That means all the subsequent communication between HTTP client and server must keep using mTLS under the same certificate to be considered legitimate.
EU Revised Directive on Payment Services (PSD2): Regulatory directive on payment service providers within the European Economic Area. Note: This document does not cover any specific requirements about PSD2.
Strong customer authentication (SCA): Requirement of PSD2, that ensures that electronic payments are performed with multi-factor authentication. Note: This document does not cover the details of SCA enforcement and reporting requirements.
One time password (OTP): A password (code, pin) that is valid for only one login session or transaction. Usually, requested as a second factor of authentication.
Authorization (Authz): The right to enter a system or access a resource, by a proven identity. In this document, it is used with the meaning of "Service-level authorization", as opposed to "Resource-level authorization".
Service-level authorization: Right to see or use a path to our system, if it can be enforced before accepting the request. E.g. Is this user authorized to consume this API?
Resource-level authorization: Right to read, or write, a piece of information inside our system, especially if it cannot be known before accepting the request. E.g. Is this user authorized to see this payment? Note: This document does not cover the details of this type of authorization: it will be enforced using the ownership information managed by the services themselves.
Access Management: Definition of access right policies.
Access Control: Enforcement of access rights policies.
Gateway: Entrypoint for protected resources that will enforce Authentication and Authorization effectively protecting our backend. Authentication mechanisms will be exposed acting in conjunction with the Identity Provider. Authorization will be enforced by managing and executing Access Control policies against the identity claims.
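
To illustrate the "mTLS Certificate binding" definition above: RFC 8705 records the SHA-256 thumbprint of the client certificate inside the token, in the `x5t#S256` member of the `cnf` claim. The following Python sketch shows the comparison a gateway or resource server could make on each request. The function names and the sample bytes are hypothetical, and the token signature is assumed to have been validated already.

```python
import base64
import hashlib
import hmac


def cert_thumbprint(cert_der: bytes) -> str:
    """Base64url-encoded SHA-256 of the DER certificate, without padding (RFC 8705)."""
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def token_is_bound_to(claims: dict, cert_der: bytes) -> bool:
    """True when the token's cnf thumbprint matches the certificate seen on the mTLS connection."""
    cnf = claims.get("cnf")
    expected = cnf.get("x5t#S256") if isinstance(cnf, dict) else None
    if not isinstance(expected, str):
        return False  # the token carries no certificate binding
    actual = cert_thumbprint(cert_der)
    return hmac.compare_digest(expected.encode("utf-8"), actual.encode("utf-8"))


# Placeholder bytes standing in for a real DER-encoded client certificate.
client_cert = b"hypothetical DER bytes"
claims = {"sub": "client-123", "cnf": {"x5t#S256": cert_thumbprint(client_cert)}}
```

With these values, `token_is_bound_to(claims, client_cert)` holds, while the same token presented over a connection using any other certificate is rejected.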

Prerequisites

API flow requirements are covered in a technical document.

Reference Documents

Problem Description

Our implementations of authentication features, especially those needed to match standards and regulations, are very difficult and error-prone.

What?

Ebury authentication mechanisms in place barely cover our regulatory compliance as a financial company. The system was mainly developed in-house to support a limited set of user authentication flows. Our capacity to evolve it to match the incoming requirements is really compromised. Even some current regulations, e.g. SCA for electronic payments in the API channel, are only met through 3rd parties.

Why?

A more technical explanation could cover how the logic is currently scattered between several services. Some current logic (especially in bos) contains outdated in-house solutions. These components, owned by at least four different teams, make the implementation a bit tangled and difficult to upgrade. To make things harder: different services offer their own version or bridge of the same features, e.g. password login in ebury-api-auth, bos, and eburyonline.

How?

As an example: for bos to validate a user password, the password needs to be encrypted with a shared secret. The secret is spread across a set of different environment variables: BOS_API_PRIVATE_KEY in bos, BOS_API_AES_KEY in eburyonline, and ENCRYPTION_KEY in ebury-api-auth. Updating this secret is not even a planned case: we have no contingency plan in case it is leaked, and no list of the places where it would need to be updated. To make things worse, the library that bos uses to decrypt it checks that the value was generated with the same library and version, effectively forcing other projects to use the same implementation. This library was last maintained more than 5 years before this blueprint was written, and it does not support recent Python versions without a hack.
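
As a minimal sketch of the inventory we are missing today, the Python snippet below lists the three locations named above and compares the copies by fingerprint, so the values themselves are never printed. The functions and the `environments` argument (one dictionary of environment variables per service) are hypothetical; in practice each service runs separately and the values would have to be collected by some other means.

```python
import hashlib

# Where the shared secret lives today, per the paragraph above.
SECRET_LOCATIONS = {
    "bos": "BOS_API_PRIVATE_KEY",
    "eburyonline": "BOS_API_AES_KEY",
    "ebury-api-auth": "ENCRYPTION_KEY",
}


def fingerprint(secret: str) -> str:
    """Short SHA-256 fingerprint, so copies can be compared without exposing them."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12]


def rotation_report(environments: dict) -> dict:
    """Map each service to the fingerprint of its copy, or None when the copy is missing."""
    report = {}
    for service, var_name in SECRET_LOCATIONS.items():
        value = environments.get(service, {}).get(var_name)
        report[service] = fingerprint(value) if value else None
    return report


def is_consistent(report: dict) -> bool:
    """True when every service holds a copy and all copies are identical."""
    values = set(report.values())
    return None not in values and len(values) == 1
```

A check like this could run after a rotation to confirm that no service was left behind with the old value.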

So what?

These are serious problems, potentially leading to security issues.

Background

Current modernization of the IAM layer is being solved differently in every channel. The API authn/authz layer is being replaced by Kong as a gateway, but we lack an IdP to better manage identities and flows. Online is still using BOS legacy credentials and its own credentials. BOS 2.0 is using a Keycloak instance as a broker, federating the Google Identity Provider. For the Ops Dashboard, Keycloak is an OAuth 2.0 Resource Server for the channel and manages access policies for users against resources.

This document aims to unify these efforts and clarify the role and responsibilities of the IAM system in Ebury. The work in API (around Kong) and the work in BOS 2.0 (around Keycloak) strongly inspires this idea of unification and standardisation.

Solution

We have shown that we lack the ability to improve or extend the authentication flows on top of our current identity and authorization system. Since the system in place is outdated and too difficult to modify, we need to look for an alternative.

Establish a single way to manage users and their credentials in Ebury channels.

Integrate an Identity Provider to be the new core of the Ebury IAM system for all three channels.

The system of record must be shared among all the channels. Even if every channel needs a potentially different user base and has to support different authentication mechanisms, centralization could help us maintain the system and allow for better reuse.

NB: Of course we can shard the different user pools between different IdP instances. This can still provide most of the advantages of the purely centralized solution. Please do not focus on the deployment mechanism for now.

Solutions not provided in this document

Resource-level authorization is not covered. I.e., authorization of "contact on behalf of client" operations is to be kept in BOS. The logic, models and relationships that currently solve that will not be touched until future efforts can extract them to a dedicated domain.

SCA and PSD2 requirements are not covered. This proposal will probably be a good foundation for these features to be modernized in the future, but we are covering only authentication for login operations and identification purposes.

Service to service communication is not covered in this document.

Details

API and ONLINE channels will need to maintain (perhaps separated) user pools, while the OPS channel will use Google as a federated IdP. In other words: the Identity Provider will be configured to have Google as a Federated IdP for the OPS user realm, and API/ONLINE can have 1 or 2 user pools each.
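
The pool layout described above could be captured as data, as in the Python sketch below. The realm names and the `UserPool` type are hypothetical, and the sketch shows one pool per channel for API and ONLINE although each could have two.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UserPool:
    name: str                     # hypothetical realm name
    channel: str                  # "API", "ONLINE" or "OPS"
    federated_idp: Optional[str]  # None means credentials are stored in the pool itself


POOLS = [
    UserPool("ops", "OPS", federated_idp="google"),
    UserPool("api", "API", federated_idp=None),
    UserPool("online", "ONLINE", federated_idp=None),
]


def pools_for(channel: str) -> List[UserPool]:
    """User pools that are considered when a given channel authenticates a user."""
    return [pool for pool in POOLS if pool.channel == channel]
```

Keeping the channel-to-pool mapping explicit makes it easy to see which channels hold credentials themselves and which delegate to a federated IdP.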

For the API channel, this solution also includes integrating the Gateway with the Identity Provider using OpenID Connect.
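
Once the Gateway receives tokens issued by the Identity Provider, service-level authorization can be decided from the request path and the token claims alone. The Python sketch below illustrates the idea; the route table and scope names are hypothetical, the token is assumed to be already validated, and this is not how Kong or any specific plugin implements it.

```python
# Hypothetical route policy: path prefix -> scope required to consume it.
ROUTE_SCOPES = {
    "/payments": "payments:read",
    "/accounts": "accounts:read",
}


def granted_scopes(claims: dict) -> set:
    """OAuth 2.0 carries scopes as one space-delimited string in the `scope` claim."""
    return set(claims.get("scope", "").split())


def is_authorized(path: str, claims: dict) -> bool:
    """Service-level check only: decided before the request reaches the backend."""
    for prefix, scope in ROUTE_SCOPES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return scope in granted_scopes(claims)
    return False  # deny routes that have no policy


claims = {"sub": "user-1", "scope": "openid payments:read"}
```

Here `is_authorized("/payments/123", claims)` holds and `is_authorized("/accounts", claims)` does not. Whether this user may see payment 123 in particular remains a resource-level question for the owning service.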

Ebury is facing a modernization challenge. Since BOS is deprecated as part of the Ebury 2.0 architecture, we are encouraged to follow some distributed architecture patterns. Because of that, this solution not only addresses the problem, but also creates more advantages for Ebury developers.

We could deprecate some services that would not be needed anymore: app-proxy, auth-proxy, webhooks-proxy,... Also, we could reduce the usage or plan the deprecation of some other services:

auth-webapp (all except some PSD2 external endpoints, related to SCA, that is not covered here)
verify (all except the user_setup_payments part, about SCA, that is not covered here)

Ownership

The proposal in this blueprint is to have a shared ownership of IAM among the entire Ebury Channels area. The code and services ownerships can be evenly distributed between some members of all the area teams.

Requirements for the Identity Provider

It supports standard authentication flows, especially those mentioned in OpenID Connect and OAuth2.0.
It supports mTLS Client Authentication with certificate binding, as described in its RFC.
The Multi-Factor Authentication with TOTP can be integrated in the Identity Provider, including the onboarding phase. (We are talking about login/authentication stage, not SCA for transactions)
It can delegate into external user pools or federated IdP (Google).
It obviously includes user pools that are not federated, and we can add claims (attributes) to the users.
It has an admin API, so we can have flows that include our staff making modifications to the users.
It has a customizable/themable look and feel, so we can brand our pages.
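
For reference, the TOTP mechanism named in the requirements is small enough to sketch. The Python code below follows RFC 4226 (HOTP) and RFC 6238 (TOTP) with HMAC-SHA-1; the `verify_totp` helper and its drift window are assumptions for illustration, not a description of how any particular IdP validates codes.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP value (RFC 4226) for a shared secret and an 8-byte big-endian counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP over the number of `step`-second periods since the epoch."""
    return hotp(secret, unix_time // step, digits)


def verify_totp(secret: bytes, submitted: str, unix_time: int,
                step: int = 30, window: int = 1) -> bool:
    """Accept the current period and `window` periods either side, to tolerate clock drift."""
    return any(
        hmac.compare_digest(
            totp(secret, unix_time + drift * step, step).encode("utf-8"),
            submitted.encode("utf-8"),
        )
        for drift in range(-window, window + 1)
    )
```

With the RFC 6238 test secret `b"12345678901234567890"`, `totp(secret, 59, digits=8)` gives `"94287082"`, which matches the published test vector.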

Roadmap

Phase 0 - This blueprint

This document basically proposes to include a new service, that is an Identity Provider, so the channels can use it.

The approval of this proposal will need to follow the RFC process because this will make an architectural impact.

Phase 1 - Proof of concept (Plan A)

The 1st Proof of Concept could consist of integrating keycloak in the API channel, as part of the kong deployment.

All the flows expressed as requirements (email/password, TOTP, mTLS, refresh token,...) are to be tested using actual configurations.

The outcome of this phase is a video showing all the requirements being covered.

Phase 1 - Proof of concept (Plan B)

Only if the 1st PoC is unsuccessful, we can rethink it using a different alternative for the IdP.

But also, we will have a better understanding of the limitations.

Given AWS is already a used service in Ebury, using AWS Cognito for a potential 2nd PoC is reasonable.

That said, Plan A is probably good enough for us to stick with the selected technologies. No major trouble is expected.

Phase 2 - Technology selection

With all of what we learned from the PoC, we will check if all the requirements are met.

We will need another RFC to compare the proposed technology with other alternatives.

If we stick with keycloak, the RFC will be mainly about the details of the architectural proposal. Also, we will need to defend the interchangeability in the integration: Can we switch to another IdP maintaining the connections based on standards?
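
One way to argue for interchangeability is to make the integration depend only on the issuer URL and on what the provider publishes through OpenID Connect Discovery. The Python sketch below assumes exactly that; the issuer value is a hypothetical example and the metadata keys checked are the ones this sketch considers necessary, not an exhaustive list.

```python
# Metadata an OIDC client in this sketch relies on; all are published by the provider.
REQUIRED_METADATA = ("issuer", "authorization_endpoint", "token_endpoint", "jwks_uri")


def discovery_url(issuer: str) -> str:
    """OpenID Connect Discovery: metadata lives under the issuer at a well-known path."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def check_metadata(issuer: str, document: dict) -> list:
    """Return the problems that would block a standards-only integration."""
    problems = [f"missing {key}" for key in REQUIRED_METADATA if key not in document]
    if document.get("issuer") != issuer:
        problems.append("issuer mismatch")
    return problems


# Hypothetical issuer; switching IdP would mean changing only this value.
ISSUER = "https://idp.example.com/realms/api"
```

If every consumer resolves its endpoints and keys from `discovery_url(ISSUER)`, replacing the IdP should in principle be a configuration change rather than a code change, provided no vendor-specific features are used.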

If we decide to select a new provider or vendor, especially if not OSS, we would need to follow more steps: check the vendor, evaluate the capabilities, etc.

Phase 3 - Implementation

In this phase, we will need to spend time improving the PoC and turning all the tested configurations into definitive ones.

This also includes all the tasks needed to make the service production-ready: monitoring plan, rollout plan, security assessment, alerts, dashboards, logging configuration, CI/CD integration, code ownership, etc.

Phase 4 - Rollout (API channel)

Deploying the Identity Provider to production. If keycloak keeps being the option, we can use the already present keycloak instance in EKS.

That includes starting to port consumers from the old "api-auth" to this IdP, and writing good technical user documentation.

Phase 5 - Plan to move OPS and ONLINE

After the API channel has been successful in migrating the users to the new IdP, the rest of the channels can start proposing documents to port their IAM implementations to this service.

Alternatives

Selecting an Identity Provider for the API channel only

API channel will deploy a gateway and needs an Identity Provider to fully replace the authn/authz layer.

Limiting the choice of Identity Provider to this channel is an option.

However, it does not solve how we can deprecate BOS credentials management, which will be in use for other channels, leading to security problems. Also, the quality of the integration will be limited by the different parts of the system that will interact with it, and that are owned or developed by different teams.

The main con is that other channels will select equivalent solutions with different implementations when facing the same problem, which is a waste of time and resources and can lead to a bad or incoherent user experience.

Using API-Auth as Identity Provider

For the API channel, another alternative is connecting Kong with the OpenID provider currently implemented in auth-webapp service.

On top of the problems of the option before, this one has some extra risks:

BOS will not be decommissioned since the credentials are stored there. Modifications to that are risky and involve a fat process.

Future plans relating to authentication will be slower to implement, and we will need to take into account the regulatory requirements instead of relying on working solutions. Also, their implementations will get scattered or diverged between API and ONLINE.

We will not get some security benefits from the gateway. Since auth-webapp does not make full use of the OpenID standard, we lack the ability to use grants, scopes, groups, roles, etc. The authenticated groups feature is especially useful for the developer portal integration.

Users will face some inconsistencies like the ones we have seen in the past. For example, a user blocked after exhausting their MFA attempts in the ONLINE channel could still try to log in through API, since that information was not shared. [This problem has been solved; it is only an example of a problem caused by divergence.]

Caveats

All technologies/providers cited (kong, keycloak, and AWS) are already approved for other uses or scenarios.

Of all the requirements we have heard about, the one least likely to be met out of the box is the look and feel based on user brands. Since most IdPs allow this to be configured at pool level, not user level, we could end up having users in different realms, by brand.

Operation

The service will run alongside other API gateway services, initially in ECS.

We could think about moving the API gateway to its own k8s cluster, and also about having the IdP on k8s directly, but the most direct way to get it working is to have it in the same set of services already deployed.

Security Impact

This document is mainly about identities and the implementation of access control, so it is closely related to security.

The proposed solution increases the security of the system by relying on open standards, open implementations, and good configurations.

Performance Impact

None.

Developer Impact

None so far.

At most, some developers might need to take an online course on authentication standards and mechanisms, to better support this implementation.

Since we will rely only on open standards, this will be a good foundation for future changes so the developer experience will improve over time.

Data Contracts

None.

Data Sources

None.

Deployment

Deployment is covered by Phase 4 of the roadmap, described as part of the solution.

Dependencies

None.
