Ebury Transactional API Gateway: the requirements

This document is the technical appendix to the main RFC.

This is not an implementation guide, but a requirements list.

Each requirement will probably lead to an implementation, and each implementation proposal will need to gather approval in its own detailed RFC document.

What is not a gateway requirement

How to direct traffic to the API, and how to balance it between different instances, is a problem not to be solved here. Cloud load balancers and CDNs can probably help with these problems.

Once a user request has come in and has been checked and allowed to pass, it is no longer managed by the gateway. How to add logs, metrics, traces, health checks, circuit breakers, retries, timeouts, correlation IDs, etc. to the traffic between services are all problems not to be solved here. Service proxy sidecars can probably help with all of them.

Input transformation, validation, and output transformation are meant to be solved in the services themselves. However, later in this document we include some expectations of the gateway on this point. If a newly designed service needs a “public” facade or a transformation layer, it can be developed inside the service itself, where it will interact better with the business logic. For example, it will fit better with API versioning and with the ability to show different or extended payloads in some endpoints.

Aggregation. Merging several responses into one requires knowledge of other services' payloads, behaviours, etc. Building a response from several others is therefore business logic and must be solved by a specialized service. A gateway cannot easily do that because the logic involved in the workflow is complex: you need to manage retries, errors, timeouts, authentication, and so on, and then probably use the output of one service as input for another. A gateway is not for that.

Error mapping. From the point of view of a service, every error is external. Ebury does not have a standardized way to map or document errors in the API. Perhaps in the future we can have all the possible errors from all the services in a centralized directory. It could be useful for the frontend, to share knowledge, to help the support team, to internationalize messages, and so on. But this is outside the scope of this RFC.

Checking whether the connected user is authorized for the requested resources and operations is a problem not to be solved here. However, we need basic ACLs for the main product parts, so we will cover some access control as a requirement.

Functional requirements

Multi-service, multi-environment

Every "BOX" will be exposed using a different URL, but the configuration of paths and sub-domains will be shared.

For example: https://environment1.ebury.rocks and https://environment2.ebury.rocks could point to different environments (boxes) i.e. different installations of the software "set". And then https://environmentX.ebury.rocks/webhooks could point to the webhooks services in every environment X.

This is only an example: the gateway must support any combination of DNS regexps, paths and sub-domains (virtual hosts), so that the design can allow different environments to live behind the same gateway.
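
The host and path matching described above can be sketched as follows. This is a minimal illustration in Python, not a proposal for a concrete gateway: the Route type, the resolve function, the route table, and the upstream names are all hypothetical and exist only for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    host_pattern: str  # regular expression matched against the whole host name
    path_prefix: str   # literal path prefix, matched on whole path segments
    upstream: str      # hypothetical upstream name; {env} is filled from the host


# Hypothetical route table, shared by every environment behind the gateway.
ROUTES = [
    Route(r"(?P<env>[a-z0-9-]+)\.ebury\.rocks", "/webhooks", "webhooks.{env}.internal"),
    Route(r"(?P<env>[a-z0-9-]+)\.ebury\.rocks", "/", "api.{env}.internal"),
]


def resolve(host: str, path: str) -> Optional[str]:
    """Return the upstream of the first route that matches host and path."""
    for route in ROUTES:
        match = re.fullmatch(route.host_pattern, host)
        if match is None:
            continue
        prefix = route.path_prefix.rstrip("/")
        # Match whole segments so "/webhooksfoo" does not hit "/webhooks".
        if path == prefix or path.startswith(prefix + "/"):
            return route.upstream.format(**match.groupdict())
    return None
```

With this table, resolve("environment1.ebury.rocks", "/webhooks/abc") selects the webhooks upstream of environment1, while any other path on the same host falls through to the general API upstream. Routes are evaluated in order, so the more specific prefix is listed first.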

Security: HTTPS and certificates

The gateway will expose HTTPS and will be responsible for renewing the certificates for the configured domains. If the gateway has no direct support for this, the certificates it uses must at least be externally configurable.

Security: Web Application Firewall

The API gateway needs to be able to enable or use a WAF that protects internal services from threats: SQL injection, cross-site scripting, brute-force attacks, bots, etc.

Timeouts

Timeouts must be configurable on a per-service basis.

Rate limit

We need to be able to configure limits at several granularities: to prevent intentional flooding, but also to keep usage within legitimate expectations, encouraging users towards good practices. This could include general configurations, but we must then be able to override some limits by IP, by customer, or by service.
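
One way to picture a general limit with overrides is the lookup below. It is only a sketch under stated assumptions: the Limit type, the example values, and in particular the precedence order (IP first, then customer, then service) are hypothetical choices made for the illustration, not decisions taken by this document.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Limit:
    requests: int
    per_seconds: int


# Hypothetical configuration: one general default plus overrides.
DEFAULT_LIMIT = Limit(requests=100, per_seconds=60)
OVERRIDES: Dict[str, Dict[str, Limit]] = {
    "ip": {"203.0.113.7": Limit(10, 60)},
    "customer": {"customer-42": Limit(1000, 60)},
    "service": {"webhooks": Limit(300, 60)},
}
# Assumed precedence: the first kind with a matching override wins.
PRECEDENCE = ("ip", "customer", "service")


def effective_limit(ip: str, customer: Optional[str], service: str) -> Limit:
    """Pick the limit that applies to one request."""
    keys = {"ip": ip, "customer": customer, "service": service}
    for kind in PRECEDENCE:
        override = OVERRIDES[kind].get(keys[kind])
        if override is not None:
            return override
    return DEFAULT_LIMIT
```

A request from an unknown IP, made by customer-42 to the webhooks service, would get the customer override in this sketch. Whatever precedence is finally chosen, it should be explicit in the configuration so that teams can predict which limit applies.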

Plan usage control is not foreseen. If it becomes a requirement at any point, it will affect the application level anyway and will need a separate plan. However, it is considered a nice-to-have that the selected gateway can throttle on business-level counters, in addition to request rate limits. With this, we could throttle with expressions like "maximum 3 payments per second" or "maximum 10 logins a day" instead of the plain "maximum X requests per Y time".
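
The difference between request limits and business-level counters can be illustrated with a small fixed-window counter keyed by user and business event. This is a sketch only: the BusinessThrottle class, the event names, and the in-memory storage are assumptions for the example. A real gateway would need shared storage and expiry of old windows.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


class BusinessThrottle:
    """Fixed-window counter keyed by (user, business event, window)."""

    def __init__(self, limits: Dict[str, Tuple[int, int]],
                 clock: Callable[[], float] = time.time) -> None:
        # limits maps an event name to (maximum events, window in seconds).
        self._limits = limits
        self._clock = clock
        # Old windows are never evicted here; acceptable only for a sketch.
        self._counts: Dict[Tuple[str, str, int], int] = defaultdict(int)

    def allow(self, user: str, event: str) -> bool:
        """Count one event and report whether it fits in the current window."""
        maximum, window = self._limits[event]
        bucket = int(self._clock() // window)
        key = (user, event, bucket)
        if self._counts[key] >= maximum:
            return False
        self._counts[key] += 1
        return True


# "maximum 3 payments per second" and "maximum 10 logins a day"
throttle = BusinessThrottle({"payment": (3, 1), "login": (10, 86400)})
```

The gateway would call throttle.allow(user, "payment") only for requests it recognises as payments, which is what makes the counter business-level rather than a plain request count.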

Standard auth and 2FA

Eventually, the authentication mechanisms exposed to contacts using external integrations will be migrated to be part of the gateway. So the gateway must support a good set of standard authentication mechanisms.

Given that we cannot be sure exactly which mechanisms we will need in the future (API keys, JWT, OAuth, etc.), the gateway must allow custom ones to be added. E.g. if we need to validate users by their BOS email and password, we need to have the technical option to connect the dots.

Trusted authentication

For EBO and FI authentication we will use mTLS authentication. The gateway must support this feature.

Furthermore, the gateway must include the ability to map the certificate Subject Name to actual users. That way, we could have a process to validate the certificates and include them in our database.
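
As an illustration of the mapping, the sketch below looks up a user from the Common Name found in the certificate subject. It assumes the certificate has already been verified by the TLS layer and is available as a dictionary shaped like the one returned by Python's ssl.SSLSocket.getpeercert(). The KNOWN_SUBJECTS table, the host name, and the user identifier are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical table of validated certificates: subject Common Name -> user.
KNOWN_SUBJECTS: Dict[str, str] = {
    "fi-client-01.example.com": "fi-user-1001",
}


def common_name(peer_cert: dict) -> Optional[str]:
    """Extract the Common Name from a certificate dictionary shaped like
    the result of ssl.SSLSocket.getpeercert()."""
    for rdn in peer_cert.get("subject", ()):
        for name, value in rdn:
            if name == "commonName":
                return value
    return None


def user_for_certificate(peer_cert: dict) -> Optional[str]:
    """Return the mapped user, or None when the subject is unknown."""
    cn = common_name(peer_cert)
    return KNOWN_SUBJECTS.get(cn) if cn is not None else None
```

An unknown subject returns None, which the gateway would treat as an unauthenticated request even though the TLS handshake itself succeeded.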

Access control

Some routes, hosts, and upstream services will be restricted to certain groups. This is done at request level: "is this user able to see this endpoint/service?". So a basic ACL that allows for some basic groups is required.

Anonymous usage must be covered by this ACL mechanism. If some endpoints are open to the public, this must be part of the configuration. These endpoints could include public web sites, documentation, metadata endpoints, login endpoints, and more.

Please note that resource authorization is an application-level problem. "Is this user able to see this payment?" is a totally different problem, not solved here.

The access control we need will be used to select which groups (not users) are able to see which endpoints.
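
A group-based check of this kind, including anonymous access, could look like the sketch below. The group names, the paths, and the deny-by-default rule are assumptions made for the illustration. Only groups are evaluated, never individual users or resources.

```python
from typing import Dict, FrozenSet, Iterable

ANONYMOUS = "anonymous"  # hypothetical group that every caller belongs to

# Hypothetical ACL: path prefix -> groups allowed to call it.
ACL: Dict[str, FrozenSet[str]] = {
    "/docs": frozenset({ANONYMOUS}),
    "/login": frozenset({ANONYMOUS}),
    "/payments": frozenset({"clients", "operations"}),
    "/admin": frozenset({"operations"}),
}


def is_allowed(path: str, groups: Iterable[str]) -> bool:
    """Deny by default; allow when the longest matching prefix lists one
    of the caller's groups."""
    caller_groups = set(groups) | {ANONYMOUS}
    matches = [p for p in ACL if path == p or path.startswith(p + "/")]
    if not matches:
        return False
    allowed = ACL[max(matches, key=len)]
    return bool(allowed & caller_groups)
```

Here is_allowed("/payments/123", ["clients"]) only answers whether the clients group may reach the payments endpoints. Whether this particular caller may see payment 123 remains a question for the service.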

Correlation IDs

The gateway must generate correlation (request) IDs and send them to the upstream services. The format and the exact header name are not important, but the ability to configure them would be great.
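
A minimal sketch of this behaviour follows. The header name X-Request-ID and the UUID format are assumptions, since the format and header name are left open here. Discarding any value sent by the client is also a choice made for the example.

```python
import uuid
from typing import Dict, Mapping

# Assumed default; the header name is meant to be configurable.
CORRELATION_HEADER = "X-Request-ID"


def with_correlation_id(headers: Mapping[str, str],
                        header_name: str = CORRELATION_HEADER) -> Dict[str, str]:
    """Return a copy of the headers carrying a freshly generated ID.

    Any client-supplied value for the same header is discarded.
    """
    upstream = {k: v for k, v in headers.items()
                if k.lower() != header_name.lower()}
    upstream[header_name] = str(uuid.uuid4())
    return upstream
```

The returned headers are what the gateway would forward upstream, so every service receives the same ID for the same user request.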

I/O audit

For security reasons, we would like to send every user input (and/or every service response) to a secondary audit service.

The gateway must include a mechanism to implement such a feature.

I/O scripting

Transforming the input (and/or the output) at the gateway level is a discouraged pattern, especially when it modifies business-level information. That can break the compatibility checks we gain with shared libraries, versioned services, etc. We must avoid placing business-level features into a shared service, instead of having them in the right domain and service.

That said, technical reasons can arise that make it necessary to modify an input or an output. Duplicating one field as a header to comply with some external requirement, discarding a deprecated field, putting an envelope around a response, changing the name of a field to avoid leaking internal information, or blocking all responses that match a pattern are only some examples.

So, even if all of them can have better alternative implementations, and will arrive at the gateway as "temporary" solutions, it is mandatory that the gateway includes mechanisms for the most common transformations and data movements.

The ability to deploy some complete transformation scripts is also a requirement. It does not matter which language, but we must have the ability to eventually put transformation code in the gateway.
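
To give an idea of the size of such scripts, the sketch below applies four of the transformations listed above to a JSON response. Every field and header name (payment_id, legacy_status, bos_account_ref, X-Payment-ID) is hypothetical, and Python is used only for illustration, since the language does not matter.

```python
import copy
from typing import Any, Dict, Tuple


def transform_response(body: Dict[str, Any],
                       headers: Dict[str, str]
                       ) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Apply a few technical transformations without touching the inputs."""
    body = copy.deepcopy(body)
    headers = dict(headers)

    # Duplicate one field as a header.
    if "payment_id" in body:
        headers["X-Payment-ID"] = str(body["payment_id"])

    # Discard a deprecated field.
    body.pop("legacy_status", None)

    # Rename a field to avoid leaking an internal name.
    if "bos_account_ref" in body:
        body["account_ref"] = body.pop("bos_account_ref")

    # Put an envelope around the response.
    return {"data": body}, headers
```

None of these steps needs to understand the business meaning of the payload, which is the line that gateway-level transformations should not cross.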

Non-functional requirements

Internal observability

To comply with existing or incoming RFCs, it is expected that the gateway is able to: write JSON logs, expose Prometheus metrics over HTTP, and send distributed traces.

For the latter, the gateway is also expected to be able to sample a configurable share of the requests to be traced, since it is the entry point for most of them.

Luckily, all three are very common features.
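
Sampling at the entry point can be as simple as the sketch below, which decides from a hash of the request ID so that the same request always gets the same answer. The function name and the use of SHA-256 are assumptions for the example. Real tracing libraries ship their own samplers.

```python
import hashlib


def should_trace(request_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether a request is traced.

    sample_rate is the configured fraction of requests to trace,
    from 0.0 (none) to 1.0 (all).
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Compare the first 8 bytes of the hash, read as an integer below
    # 2**64, with the same fraction of that range.
    value = int.from_bytes(digest[:8], "big")
    return value < sample_rate * 2**64
```

Because the decision depends only on the request ID, it can be propagated or recomputed downstream, so a trace is either collected end to end or not at all.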

Configuration as code

It is mandatory that the configuration can be stored in a git repository and applied by a CI process.

At the bare minimum, this means that the gateway must have a programmatic admin interface to which we can send our configurations. Honestly, we expect more: the system should be configurable in a declarative way.
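
If the gateway only offers a programmatic admin interface, the CI process has to turn the declarative files into calls itself. The sketch below shows the core of that step: comparing the desired state from git with the current state and producing a list of operations. The Config shape, the plan function, and the service names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Config = Dict[str, Dict[str, Any]]  # resource name -> settings


def plan(current: Config, desired: Config) -> List[Tuple[str, str]]:
    """Compute the operations that turn the current state into the desired one."""
    operations: List[Tuple[str, str]] = []
    for name in sorted(desired.keys() - current.keys()):
        operations.append(("create", name))
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            operations.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        operations.append(("delete", name))
    return operations


current = {"webhooks": {"timeout_s": 5}, "legacy": {"timeout_s": 30}}
desired = {"webhooks": {"timeout_s": 10}, "fx": {"timeout_s": 3}}
operations = plan(current, desired)
```

A gateway with native declarative configuration does this reconciliation internally, which is why it is preferred over a purely imperative admin API.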

The gateway configuration is a software artifact and so it will be versioned in text files. A graphical user interface to read or modify values is however a nice to have, because it could help in some processes to develop, explore, or prototype some behaviours.

Another nice-to-have is the gateway being configurable through Terraform templates.

Infrastructure as code

We will have a gateway installation in every cloud account, and one more in the local boxes. So having the option to start a local gateway is necessary. It could also be a requirement that the gateway can be provisioned/upgraded in the cloud with good (automated) tools, but the focus should be on the configuration management.

That is: the desired minimum is that it can be deployed as a normal service (git->jenkins->ecs) and then use a persistence layer on RDS or another managed service. That said, we could even accept that the gateway needs to be provisioned "by hand" once in every AWS account, as long as configuration upgrades are automatic and git-based after that.

Accountability

Given that it is a shared resource, every record or artifact must be taggable. We could keep some of the existing tag policies used for cloud items, or add new ones.

If a team adds a service, an endpoint, a user, or a rate limit, then this resource must be tagged to keep track of the ticket ID, the team, etc.
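
A CI step could enforce this with a check like the one below. The tag names team and ticket are taken from the examples above, but treating them as the mandatory set, as well as the shape of a resource, is an assumption.

```python
from typing import Any, Dict, List

# Hypothetical mandatory tags.
REQUIRED_TAGS = ("team", "ticket")


def missing_tags(resource: Dict[str, Any]) -> List[str]:
    """Return the required tags that are absent or empty on a resource."""
    tags = resource.get("tags") or {}
    return [tag for tag in REQUIRED_TAGS if not tags.get(tag)]


rate_limit = {"kind": "rate_limit", "tags": {"team": "payments"}}
problems = missing_tags(rate_limit)
```

In this example the rate limit is missing its ticket tag, so the CI process would reject the change before it reaches the gateway.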

Backups

If the persistence is not fully external (e.g., RDS), it must allow backups that follow our policies.

The preferred option would be a gateway whose configuration persistence relies on technologies we already use widely: PostgreSQL 11 and the AWS RDS service.