API background tasks, to solve beneficiaries bulk import
This document proposes adding support for background tasks to the public API layer, so that the API can offer richer flows.
Problem Description
Currently, the public API is a transformation layer that mostly exposes synchronous HTTP endpoints. The little asynchronicity it offers (e.g. the mass payments endpoints) is possible only because it is implemented in upstream internal systems, usually BOS.
We want the API user experience to include some async operations that do not rely on the implementation details of upstream systems.
The first prototype of this async capability will be the API for beneficiaries bulk import. The scope of this document includes only the high-level proposal for this use case. A future low-level design document will cover the technology choices, data models, state machines, etc.
Background
BOS has REST endpoints to manage beneficiaries. An upcoming rearchitecture will probably move this logic to a potential beneficiary command service. Neither scenario should be a limiting factor for the public API, which offers this operation to the user as a proxy layer. The API must be able to expose an async flow regardless of the backend implementation.
Solution
This document explores a solution that brings background tasks to the API channel. Long-running tasks are usually avoided during HTTP requests, so many web frameworks and libraries are complemented with task processors. Django and Celery are a good and well-known example, even if they are not the technologies selected for this part of the stack.
This proposal treats "user experience" tasks as first-class citizens in the API. The user can trigger them, and return later to check the status and result.
Beneficiary creation is an operation already supported in the API, but for a single item and in a synchronous way. To achieve bulk import in an async way, we could reuse all this creation code and tests: input validation, communication with BOS, output transformation, etc. Extracting the logic from the HTTP request-response cycle, to be reused from a background task, seems natural and safe. The task code will contain a simple loop to perform the creation invoking the tested code.
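The loop described above could look roughly like the following sketch. The names `create_beneficiary`, `BeneficiaryError` and `ImportOutcome` are hypothetical stand-ins for the existing single-creation code and are not taken from the current codebase. Continuing after a failed item, instead of aborting the batch, is also an assumption.

```python
from dataclasses import dataclass, field


class BeneficiaryError(Exception):
    """Hypothetical error raised by the existing single-creation code."""


def create_beneficiary(payload):
    # Stand-in for the already tested creation logic (validation, BOS call,
    # output transformation). The real function would talk to BOS.
    if not payload.get("name"):
        raise BeneficiaryError("name is required")
    return {"id": f"ben-{payload['name']}", **payload}


@dataclass
class ImportOutcome:
    results: list = field(default_factory=list)  # created beneficiaries
    errors: list = field(default_factory=list)   # one entry per failed item


def import_beneficiaries(items):
    """Task body: a simple loop over the reused creation function."""
    outcome = ImportOutcome()
    for index, payload in enumerate(items):
        try:
            outcome.results.append(create_beneficiary(payload))
        except BeneficiaryError as exc:
            # Assumption: one bad item does not abort the rest of the batch.
            outcome.errors.append({"index": index, "message": str(exc)})
    return outcome
```

Because the task only wraps the existing function, the tests that already cover single creation keep their value, and the new tests only need to cover the loop itself.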
Implementation
To execute a background task, we can use existing libraries that allow (and require) the codebase of the task worker process to be shared with the web server process. The same Docker image can be started with a different entrypoint, similar to how Celery is run alongside Django.
Most of the alternatives need a persistence layer. Luckily, the API already has a Redis (ElastiCache) instance available, which is the usual choice for these libraries. The API channel also has some DynamoDB tables, and creating a new one to keep operation information will be straightforward. DynamoDB is the best option for that, since it is already in place and covers all our needs.
The selected library must be able to report the status of a task once it is scheduled, and must support callbacks on success or error, so we can persist the result.
This RFC does not need to contain implementation details, but it is worth noting that the rq library has already been evaluated and will probably be the one used.
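As an illustration of the "same image, different entrypoint" idea, a single command could select the role of the process at startup. This is only a sketch: `run_web`, `run_worker` and the `role` argument are hypothetical names, and the placeholders return a string instead of starting a real server or worker.

```python
import argparse


def run_web():
    # Placeholder: the real function would start the HTTP server.
    return "web"


def run_worker():
    # Placeholder: the real function would start the task worker loop.
    return "worker"


ROLES = {"web": run_web, "worker": run_worker}


def main(argv=None):
    """Parse the role from the command line and start the matching process."""
    parser = argparse.ArgumentParser(description="Start one role of the shared image")
    parser.add_argument("role", choices=sorted(ROLES))
    args = parser.parse_args(argv)
    return ROLES[args.role]()
```

With this shape, the container for the web server would be started with the `web` argument and the worker container with `worker`, both from the same image and the same code.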
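The two requirements on the library (status reporting and success/error callbacks) can be illustrated without committing to any library. The sketch below uses the standard library `ThreadPoolExecutor` as a stand-in for the real task worker and an in-memory dictionary as a stand-in for the DynamoDB table. The status names and every function name are assumptions.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

TASKS = {}  # in-memory stand-in for the table that keeps operation information


def on_success(task_id, result):
    # Callback after success: persist the result next to the final status.
    TASKS[task_id].update(status="finished", result=result)


def on_error(task_id, exc):
    # Callback after error: persist what went wrong.
    TASKS[task_id].update(status="failed", error=str(exc))


def run_task(task_id, func, *args):
    """Wrapper executed by the worker: runs func and fires the callbacks."""
    TASKS[task_id]["status"] = "started"
    try:
        result = func(*args)
    except Exception as exc:
        on_error(task_id, exc)
    else:
        on_success(task_id, result)


def schedule(executor, func, *args):
    """Register the task as queued and hand it over to the worker."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "queued"}
    future = executor.submit(run_task, task_id, func, *args)
    return task_id, future
```

At any point after scheduling, reading `TASKS[task_id]["status"]` answers the status question, which is the behavior the HTTP endpoints would rely on.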
URLs
The app-webapp HTTP service will have some endpoints to manage the tasks.
POST /tasks/$task_type
The operation to spawn a task.
We have the task type in the URL path to make it more visible in logs, etc.
The first operation will be POST /tasks/beneficiaries-import.
We expect a type beneficiaries-validation to be included in the near future.
The body input will depend on the type: for the first case it will be a list with beneficiaries as already defined in the endpoint for single creation.
Input validation will be performed synchronously. The user will receive a successful response indicating that the task has been created only after the submitted parameters have been validated.
The output of the endpoint will contain, at the very least, an object with the task_id.
With this UUID, the user can query for more information later.
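A possible shape for this endpoint is sketched below, independent of any web framework. The `validate_beneficiary` function stands in for the existing single-creation validation, `enqueue` is a hypothetical callable that hands the work to the task library, and the 400 and 202 status codes are assumptions rather than decisions of this RFC.

```python
import uuid


class ValidationError(Exception):
    """Hypothetical error raised by the existing input validation."""


def validate_beneficiary(payload):
    # Stand-in for the validation already used by the single-creation endpoint.
    if not isinstance(payload, dict) or not payload.get("name"):
        raise ValidationError("name is required")


def create_import_task(body, enqueue):
    """Handle POST /tasks/beneficiaries-import: validate first, then enqueue."""
    if not isinstance(body, list) or not body:
        return 400, {"errors": [{"message": "body must be a non-empty list"}]}
    errors = []
    for index, payload in enumerate(body):
        try:
            validate_beneficiary(payload)
        except ValidationError as exc:
            errors.append({"index": index, "message": str(exc)})
    if errors:
        # Nothing is scheduled when any item fails synchronous validation.
        return 400, {"errors": errors}
    task_id = str(uuid.uuid4())
    enqueue(task_id, body)
    return 202, {"task_id": task_id}
```

The task is created only on the success path, so a caller that receives a `task_id` knows the whole input passed validation.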
GET /tasks/$task_type/
This endpoint will list all the tasks of a specific type created by the user, with a summary of status, progress, and dates.
GET /tasks/$task_type/$task_id
This is the endpoint to get the information for a single task. Alongside the current status and progress, the output will contain some links (to errors and results) which could be empty if the task is not finished.
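The response for a single task could be assembled as in the sketch below. The field names (`status`, `progress`, `links`), the status values and the layout of the stored `record` are assumptions for illustration. The low-level design will define the real data model.

```python
def task_detail(task_type, task_id, record):
    """Build the body for GET /tasks/$task_type/$task_id from a stored record."""
    base = f"/tasks/{task_type}/{task_id}"
    finished = record["status"] in ("finished", "failed")
    return {
        "task_id": task_id,
        "status": record["status"],
        "progress": {"processed": record["processed"], "total": record["total"]},
        # Links stay empty until the task is finished.
        "links": {
            "errors": f"{base}/errors" if finished else None,
            "result": f"{base}/result" if finished else None,
        },
    }
```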
GET /tasks/$task_type/$task_id/errors
If there are any errors, this endpoint will list them. In our example, for every beneficiary in the batch creation attempt there will be a potential list of errors. These error codes and messages will match the usual error output of the API, because the same translation code will be used. The consumer will get the same granularity and detail about the source of every possible problem as with the synchronous operation.
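To show how one translation function could serve both the synchronous endpoint and this one, here is a small sketch. `translate_error` is a hypothetical stand-in for the existing translation code, and the output field names are assumptions.

```python
def translate_error(exc):
    """Stand-in for the translation already used by the synchronous endpoint."""
    return {"code": getattr(exc, "code", "unknown_error"), "message": str(exc)}


def task_errors(failures):
    """Build the errors listing from (item_index, [exceptions]) pairs."""
    return [
        {"index": index, "errors": [translate_error(exc) for exc in excs]}
        for index, excs in failures
        if excs  # items without errors are left out of the listing
    ]
```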
GET /tasks/$task_type/$task_id/result
This endpoint will list the result of a finished task.
We discarded using /tasks/$task_type/$task_id/beneficiaries because it is specific to the first prototype.
Having result is more generic and will be meaningful for every different operation in the future.
Alternatives
The two main discarded alternatives are:
Coding it in BOS
BOS is a deprecated service, and since other feasible alternatives exist, this option was not evaluated.
Coding it in a new service
That could involve sending a message to Kafka, and then writing a command service that does the job when it receives the message. This may look more aligned with the new architecture, but there are some concerns. Mostly, it can overlap with the logic that will live in a potential beneficiary command service, which will manage the data and the business rules.
We don't need a new service because we are not porting beneficiaries data or logic out of BOS. The proposal of this document is only to add a bulk async import operation to the API, using an existing single synchronous creation operation in BOS.
Caveats
This solution is intended to allow for background tasks that offload work from the main HTTP API flow and allow for fast async endpoints. However, only one task type was evaluated and included. Every new operation we want to add will need a new RFC.
Regarding the potential clash between this proposal and the beneficiaries rearchitecture, note that beneficiary creation will eventually be extracted from BOS. At that point, this implementation will be tweaked to invoke the new logic, as part of that beneficiaries logic migration.
Operation
No new infrastructure artifacts need to be created as part of this proposal; at most, one new DynamoDB table. Metrics about the status and lifetime of the tasks will be exposed to Prometheus, and alarms can be created on top of them. Every uncaught exception during the life of an operation will be sent to Sentry with its full context. Logs from the tasks will need to appear alongside the web server logs in Kibana. The monitoring plan for the API services is unlikely to need even an upgrade.
The only point to check before deployment is the memory usage of the containers for the different environments. We need to be sure this addition does not bloat them over the currently configured limit, or we will need to increase the limit.
Security Impact
The proposed scenario reuses the same code, built into the same Docker image, with the same network configuration.
No security impact is expected.
Performance Impact
The ability to run tasks in the background will have no impact on performance, since the code that creates beneficiaries stays the same. If any effect is noticed, it will be that the user can run bigger tasks with shorter HTTP response times, so the perceived performance will be better.
Developer Impact
The selected technology will need to work with the current app-webapp stack: the code and the external services available.
That means the developer experience will be comfortable, with no need for external tools or new languages.
Also, the key part of this proposal is the ability to run some existing parts of the logic in background, without modifying their implementation. That will allow us to trust the existing code and have a good test coverage.
Data Consumer Impact
None.
Deployment
Most of the libraries need the codebase to be exactly the same in the web server and the worker, so keeping the same Docker image seems the best option.
Because of that, including the workers in the same ECS task (app-webapp) is good and reliable enough.
After we move the API services to kubernetes, this ECS task will be a Pod.
Dependencies
The presented idea can be executed without dependencies or blockers from other projects.
References
None.