Document upload service
Proposal for a new service to handle file/document uploads to Google Drive.
Problem Description
We want to unify how different Ebury services upload files to third-party services, in this case Google Drive.
Requirements
- Ebury services need to upload files to, find files in, and delete files from Google Drive.
- Ebury services need to optionally scan files for threats before uploading.
Background
Currently no internal service uploads files directly to Google Drive using Google's provided libraries.
Solution
The proposed service will expose a REST API for uploading files to Google Drive.
The main design principles of the REST API are:
- No file uploads are handled directly by this service. Files are always referenced by a URL.
- Endpoints that perform long-running tasks (e.g. uploads to Google Drive) offer two modes: a sync mode that performs the task synchronously and returns to the client only when the task has succeeded or finished with an error, and an async mode where a response is returned to the client immediately with an ID from the target service while the actual task runs in the background. Providing both options doesn't complicate the design but gives clients a choice of operation (e.g. some scenarios will work with a "fire-and-forget" upload, while others might want to handle explicit retries).
Upload Service new endpoints
POST /api/drive
Upload a file to Google Drive. The source file is given as a URL of the file to store in Google Drive.
The ACL for the file will be automatically inherited from the parent folder. The file is optionally scanned by SCANII, and the request is rejected if the file contains malicious content.
Request payload:
file: {
  path: "/gdrive/my/path/to/a/folder/invoice.pdf",
  source_url: "http://example.com/invoice.pdf",
  user_id: "marcos.prieto@gmail.com", # Could be a header parameter, up to the implementation
  scanii: true/false,
  async: true/false
}
201 Response:
{
  status: {...},
  file: {path: "/gdrive/my/path/to/a/folder/invoice.pdf", file_id: <GDRIVE_FILE_ID>}
}
Errors:
- 400 Missing or malformed information on the request
- 403 Not allowed to create a file in the specified location as the specified user
- 40X Any Google Drive error
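As an illustration of the 400 case, the request payload above could be validated along these lines. This is only a sketch: the names `UploadRequest`, `BadRequest` and `parse_upload_request` are hypothetical, and treating `scanii` and `async` as optional flags that default to false is an assumption, not something this proposal specifies.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


class BadRequest(ValueError):
    """Would map to a 400 response (hypothetical exception name)."""


@dataclass(frozen=True)
class UploadRequest:
    path: str
    source_url: str
    user_id: str
    scanii: bool = False      # assumption: scanning is off unless requested
    run_async: bool = False   # "async" is a reserved word in Python


def parse_upload_request(body: dict) -> UploadRequest:
    """Validate a POST /api/drive payload, raising BadRequest on bad input."""
    file = body.get("file")
    if not isinstance(file, dict):
        raise BadRequest("missing 'file' object")
    # The three string fields are required and must be non-empty.
    for key in ("path", "source_url", "user_id"):
        if not isinstance(file.get(key), str) or not file[key]:
            raise BadRequest(f"missing or empty '{key}'")
    # Files are always referenced by URL, so reject anything else early.
    url = urlparse(file["source_url"])
    if url.scheme not in ("http", "https") or not url.netloc:
        raise BadRequest("'source_url' must be an http(s) URL")
    for flag in ("scanii", "async"):
        if not isinstance(file.get(flag, False), bool):
            raise BadRequest(f"'{flag}' must be a boolean")
    return UploadRequest(
        path=file["path"],
        source_url=file["source_url"],
        user_id=file["user_id"],
        scanii=file.get("scanii", False),
        run_async=file.get("async", False),
    )
```

Whether `user_id` arrives in the body or in a header is left to the implementation, as noted in the payload; the sketch assumes the body.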
Implementation details:
If the async flag is set to true, the service will call /generateIds on Google Drive so it can return the IDs to the client as soon as possible, and will queue the scan/upload of the files in Celery. In sync mode the client waits until the file has been successfully added to Google Drive or an error is raised.
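The difference between the two modes can be sketched as follows. Everything here is a stand-in: `InMemoryQueue` takes the place of a Celery queue, `generate_id` the call to Google Drive's ID generation, and `upload` the scan/upload work. The `state` field is a hypothetical placeholder for the unspecified `status` object, and reserving the ID up front in sync mode as well is a simplification of this sketch.

```python
from typing import Callable, Dict, List, Tuple

UploadFn = Callable[[str, str, str], None]  # (file_id, path, source_url)


class InMemoryQueue:
    """Stand-in for a task queue: holds jobs until run_all() is called."""

    def __init__(self) -> None:
        self.jobs: List[Tuple[UploadFn, Tuple[str, str, str]]] = []

    def enqueue(self, fn: UploadFn, args: Tuple[str, str, str]) -> None:
        self.jobs.append((fn, args))

    def run_all(self) -> None:
        # Simulates a background worker draining the queue in FIFO order.
        while self.jobs:
            fn, args = self.jobs.pop(0)
            fn(*args)


def handle_upload(
    path: str,
    source_url: str,
    run_async: bool,
    generate_id: Callable[[], str],
    upload: UploadFn,
    queue: InMemoryQueue,
) -> Dict[str, str]:
    """Reserve a file ID, then upload inline (sync) or queue the work (async)."""
    file_id = generate_id()
    if run_async:
        # Return immediately; the upload happens later in the background.
        queue.enqueue(upload, (file_id, path, source_url))
        return {"path": path, "file_id": file_id, "state": "queued"}
    # Sync mode: any exception from upload() propagates to the caller.
    upload(file_id, path, source_url)
    return {"path": path, "file_id": file_id, "state": "done"}
```

In async mode the client already holds the final `file_id` when the response arrives, even though the file may not exist in Google Drive yet.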
GET /api/drive/
Get information of a previously uploaded file
Request parameters:
user_id: 'marcos.prieto@gmail.com' # Could be header parameter, up to the implementation
200 Response
{
"id": string,
"name": string,
"size": long,
}
Errors:
400 - If file_id is missing
404 - If file doesn’t exist in Google Drive
Implementation details:
The service will get the details from Google Drive and forward the info to the user.
We decided against returning the response from Google Drive verbatim, as that would tie us to their v3 library implementation. Which fields are actually exposed can be decided during implementation.
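A minimal sketch of that decoupling, assuming the three fields shown in the 200 response are the ones exposed. The function name `to_public_file` is hypothetical. To the best of our understanding, Drive v3 serialises `size` as a string and leaves it out for items without content such as folders, so the sketch normalises it.

```python
from typing import Any, Dict


def to_public_file(drive_file: Dict[str, Any]) -> Dict[str, Any]:
    """Project a Drive file resource onto the fields this service exposes."""
    public: Dict[str, Any] = {
        "id": drive_file["id"],
        "name": drive_file["name"],
    }
    # Convert the string-encoded size to an integer; use None when absent.
    size = drive_file.get("size")
    public["size"] = int(size) if size is not None else None
    return public
```

Any other field returned by Google Drive is dropped, so a change in their response shape only affects this one function.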
GET /api/drive
Search within Google Drive for files and folders
Request parameters:
user_id: 'marcos.prieto@gmail.com' # Could be header parameter, up to the implementation
search: ""
pagination: {}
200 Response
files: [{
"id": string,
"name": string,
"size": long,
}]
pagination: {}
Errors:
40X - Errors in Google Drive
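Since the `search` and `pagination` parameters are left open above, the following is only one possible mapping onto Drive's file listing parameters (`q`, `pageSize`, `pageToken`, `fields`). Treating the search text as a `name contains` query, excluding trashed files, and the default page size are all assumptions of this sketch, as is the function name `build_list_params`.

```python
from typing import Dict, Optional


def build_list_params(
    search: str,
    page_token: Optional[str] = None,
    page_size: int = 50,
) -> Dict[str, str]:
    """Build query parameters for a Drive file listing request."""
    # Drive query values are wrapped in single quotes, so backslashes and
    # single quotes in user input are escaped before interpolation.
    escaped = search.replace("\\", "\\\\").replace("'", "\\'")
    params = {
        "q": f"name contains '{escaped}' and trashed = false",
        "pageSize": str(page_size),
        # Request only the fields the service exposes, plus the page token.
        "fields": "nextPageToken, files(id, name, size)",
    }
    if page_token:
        params["pageToken"] = page_token
    return params
```

The `nextPageToken` returned by Google Drive could then be passed back to the client inside the `pagination` object and supplied again on the next request.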
DELETE /api/drive/
Delete the previously uploaded file stored in the storage (Google Drive).
Request parameters:
user_id: 'marcos.prieto@gmail.com' # Could be header parameter, up to the implementation
204 Response
Errors:
400 - If file_id is missing
404 - If file doesn’t exist in Google Drive
40X - Any Google Drive error
Implementation details:
The service will try to delete the file from Google Drive and synchronously return the corresponding status code.
Alternatives
Some alternatives to the service design, and tradeoffs made in this proposal:
- A simpler but less future-proof alternative would be to use Google's Drive API client library directly in the projects that need to upload files to Drive and, if more and more projects need this functionality, then explore the need to wrap Google's library in our own ebury-drive library or create a service similar to the one proposed in this document.
- Instead of an API, clients of this service could publish events directly to SQS or similar, and we would just need to implement a worker to consume them. This would simplify some parts of the design, but we decided against it as it would make bi-directional communication more complicated (e.g. to retrieve a newly created file ID or to check file status), and in general clients would have less friction communicating with a REST API than with SQS.
- The service relies on files being uploaded somewhere else. This might complicate operations for clients, as in some cases they might have to do the upload somewhere and then talk to this service.
  An easy-to-implement improvement over this would be to offer an endpoint that returns signed S3 upload URLs for a bucket owned by this service, where clients can upload their files if needed and then use the uploaded file for the Google Drive upload. This doesn't remove the need for a separate upload step, but it gives clients a place to upload the file and leverages S3 functionality and reliability without clients having to deal with their own S3 bucket operation, permissions, etc.
- Celery was chosen instead of SQS or other alternatives to run tasks, despite some extra configuration/ops complexity, as it provides much more functionality out of the box to publish and consume events (e.g. chords, canvas, periodic tasks). The transport for Celery is not specified here; it could be Redis, SQS itself or any other.
- Adding state to this service in the form of some type of storage could potentially make it more flexible. For example, a UUID could be generated per file seen by the system to identify it and then be used across the system (e.g. a client could POST a file and then call /scan/<uuid>, /upload/<uuid>/drive and /upload/<uuid>/salesforce, and later GET / to get the status of that file across different services).
Caveats
- Error handling of failed uploads/scans in async mode stays inside the service infrastructure and is not currently exposed to clients in this design. Depending on the needs of the clients, solutions for this could be explored (e.g. webhooks, event publishing). As async mode can handle multiple retries seamlessly and sync mode is offered anyway, this shouldn't block usage of the service as of now.
- As mentioned above, the design is currently very focused on Google Drive.
Operation
The new Upload Service will run as an internal resource, as it is not intended to be used by the public.
We aim to operate it the same way we already operate the Account Details service, using the same technologies and the same development and deployment processes.
Security Impact
Authentication and authorization
As this service is not exposed to final clients, it won't use a traditional three-legged OAuth flow to authenticate with Google Drive but a two-legged flow using Domain-Wide delegation instead. Using this flow the service can operate on Google Drive on behalf of any other user.
With Domain-Wide delegation a client of the service could potentially impersonate and create documents on behalf of a user it shouldn't be allowed to. To prevent this, the service will keep a whitelist of accounts it is allowed to impersonate and handle it as a secret (stored in Vault etc). These would generally be generic accounts (e.g. sales@ebury.com, compliance@ebury.com) and not personal accounts. What access/restrictions those accounts have will be handled using the Google Drive console by the security and relevant teams per use case.
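The whitelist check could look roughly like this. The sketch assumes the secret is stored as a comma-separated string and that addresses are compared case-insensitively; `Forbidden`, `load_allowlist` and `check_impersonation` are hypothetical names.

```python
from typing import FrozenSet


class Forbidden(Exception):
    """Would map to a 403 response (hypothetical exception name)."""


def load_allowlist(raw: str) -> FrozenSet[str]:
    """Parse a comma-separated secret value into normalised addresses."""
    return frozenset(a.strip().lower() for a in raw.split(",") if a.strip())


def check_impersonation(user_id: str, allowed: FrozenSet[str]) -> str:
    """Return the normalised account to impersonate, or raise Forbidden."""
    subject = user_id.strip().lower()
    if subject not in allowed:
        raise Forbidden(f"impersonating {subject!r} is not allowed")
    return subject
```

Running this check before any delegated credentials are built keeps requests for non-whitelisted accounts from ever reaching Google Drive.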
Scanning content before it is stored
Files can be checked against threats by a 3rd party service and rejected if a scanning result raises suspicion.
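A sketch of how the optional scan could gate the upload. The `scan` and `upload` callables are placeholders for the real SCANII and Google Drive clients, whose APIs are not described here; the assumption is only that a scan yields a list of findings that is empty for clean files. Which status code a rejection maps to is not specified in this proposal.

```python
from typing import Callable, List


class MaliciousContent(Exception):
    """Raised when the scanner reports findings (hypothetical name)."""

    def __init__(self, findings: List[str]) -> None:
        super().__init__(", ".join(findings))
        self.findings = findings


def scan_then_upload(
    source_url: str,
    scan_enabled: bool,
    scan: Callable[[str], List[str]],
    upload: Callable[[str], str],
) -> str:
    """Upload the file, scanning it first when scanning is requested."""
    if scan_enabled:
        findings = scan(source_url)
        if findings:
            # Reject before anything is written to the storage.
            raise MaliciousContent(findings)
    return upload(source_url)
```

Because the scan runs before the upload, a rejected file never reaches the storage.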
Performance Impact
The proposed service doesn't handle uploads itself, and the most performance-sensitive parts can be handed off to a task queue, so the performance impact on other services is expected to be minimal.
Developer Impact
N/A
Deployment
The new Upload Service should be deployed and run in a Docker container.
Dependencies
The new Upload Service will require access to SCANII and the Google Drive API.