Document upload service

Proposal for a new service to handle file/document uploads to Google Drive.

Problem Description

We want to unify how different Ebury services upload files to third-party services, in this case Google Drive.

Requirements

  • Ebury services need to upload files to, find files in, and delete files from Google Drive.
  • Ebury services need to optionally scan files for threats before uploading.

Background

Currently no internal service uploads files directly to Google Drive using Google's provided libraries.

Solution

The proposed service will expose a REST API for uploading files to Google Drive.

The main design principles of the REST API are:

  • No file uploads are handled directly by this service. Files are always referenced by a URL.
  • Endpoints that perform long-running tasks (e.g. uploads to Google Drive) offer two modes. In sync mode the task runs synchronously and the response is returned to the client only when the task has succeeded or finished with an error. In async mode a response is returned to the client immediately with an ID from the target service, and the actual task runs in the background. Providing both options doesn't complicate the design, and it gives clients a choice of operation (e.g. some scenarios will work with a "fire-and-forget" upload, while others might want to handle explicit retries).

Upload Service new endpoints

POST /api/drive

Upload a file to Google Drive. The source file is given as a URL pointing to the file to store in Google Drive.

The ACL for the file is automatically inherited from the parent folder. The file is optionally scanned by SCANII, and the request is rejected if the file contains malicious content.

Request payload:
file: {
  path: "/gdrive/my/path/to/a/folder/invoice.pdf",
  source_url: "http://example.com/invoice.pdf",
  user_id: "marcos.prieto@gmail.com", # Could be header parameter, up to the implementation
  scanii: true/false,
  async: true/false
}
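
The payload above lends itself to a validation step that produces the 400 error listed for this endpoint. The following is a minimal sketch only: the function name `validate_upload_request`, the choice of required fields, and the rule that `source_url` must be an absolute http(s) URL are assumptions made for illustration, not decisions taken in this proposal.

```python
from urllib.parse import urlparse

# Assumed set of mandatory fields; the proposal does not say which are optional.
REQUIRED_FIELDS = ("path", "source_url", "user_id")


def validate_upload_request(payload):
    """Return a list of problems; an empty list means the payload is acceptable."""
    file_info = payload.get("file")
    if not isinstance(file_info, dict):
        return ["'file' object is missing"]

    errors = [f"'{name}' is required" for name in REQUIRED_FIELDS if not file_info.get(name)]

    # The service never receives file bytes, so source_url must be fetchable.
    source_url = file_info.get("source_url")
    if source_url:
        parsed = urlparse(source_url) if isinstance(source_url, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("'source_url' must be an absolute http(s) URL")

    # Optional flags must be real booleans when present.
    for flag in ("scanii", "async"):
        if flag in file_info and not isinstance(file_info[flag], bool):
            errors.append(f"'{flag}' must be a boolean")
    return errors
```

A non-empty result would translate into a 400 response carrying the list of problems.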

201 Response:
{
    status: {...},
    file: {path: "/gdrive/my/path/to/a/folder/invoice.pdf", file_id: <GDRIVE_FILE_ID>}
}

Errors:
- 400 Missing or malformed information on the request
- 403 Not allowed to create a file in the specified location as the specified user
- 40X Any Google Drive error

Implementation details:

If the async flag is set to true, the service will call Google Drive's generateIds endpoint to obtain a file ID, return it to the client as soon as possible, and queue the scan/upload of the file in Celery. In sync mode the client waits until the file has been successfully added to Google Drive or an error is raised.
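
The two modes can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: a standard-library `queue.Queue` stands in for Celery, `generate_file_id` and `scan_and_upload` are hypothetical placeholders for the Google Drive and SCANII calls, and the `"queued"`/`"done"` status values are invented because the proposal leaves the `status` object unspecified.

```python
import queue
import uuid

# Stand-in for the Celery queue; a real deployment would enqueue a Celery task instead.
task_queue = queue.Queue()


def generate_file_id():
    """Hypothetical placeholder for asking Google Drive to pre-generate a file ID."""
    return uuid.uuid4().hex


def scan_and_upload(file_id, path, source_url, scan):
    """Hypothetical placeholder for the real work: optional scan, then upload."""
    return {"path": path, "file_id": file_id}


def handle_upload(path, source_url, scan=False, run_async=False):
    file_id = generate_file_id()
    if run_async:
        # Async mode: answer immediately with the pre-generated ID; work runs later.
        task_queue.put((file_id, path, source_url, scan))
        return {"status": "queued", "file": {"path": path, "file_id": file_id}}
    # Sync mode: block until the upload finishes or raises.
    return {"status": "done", "file": scan_and_upload(file_id, path, source_url, scan)}


def run_worker_once():
    """Process one queued task, as a background worker would."""
    file_id, path, source_url, scan = task_queue.get_nowait()
    return scan_and_upload(file_id, path, source_url, scan)
```

Because the ID is generated before the work is queued, the client holds the same file ID in both modes.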

GET /api/drive/

Get information about a previously uploaded file.

Request parameters:

user_id: 'marcos.prieto@gmail.com' # Could be header parameter, up to the implementation

200 Response:
{
    "id": string,
    "name": string,
    "size": long
}

Errors:
400 - If file_id is missing
404 - If file doesn’t exist in Google Drive

Implementation details:

The service will get the details from Google Drive and forward them to the client.

We decided against returning the response from Google Drive verbatim, as that would tie us to their v3 library implementation. Which fields are actually exposed can be decided during implementation.
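
A mapping layer of this kind can be very small. The sketch below assumes the three fields shown in the response above and a hypothetical helper named `to_file_info`. It relies on the fact that Drive v3 returns `size` as a string in JSON, and it assumes `size` can be absent for items such as folders.

```python
def to_file_info(drive_file):
    """Map a Drive v3 file resource (a dict) to the service's own response shape."""
    size = drive_file.get("size")
    return {
        "id": drive_file["id"],
        "name": drive_file["name"],
        # Drive v3 serialises int64 values such as size as strings; convert to
        # an integer and tolerate resources that carry no size at all.
        "size": int(size) if size is not None else None,
    }
```

Any other field in the Drive response is dropped, so a change in Google's payload does not leak into the service's contract.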

GET /api/drive

Search within Google Drive for files and folders.

Request parameters:

user_id: 'marcos.prieto@gmail.com' # Could be header parameter, up to the implementation
search: ""
pagination: {}

200 Response:
files: [{
    "id": string,
    "name": string,
    "size": long
}]
pagination: {}

Errors:
40X - Errors in Google Drive
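
The proposal leaves the semantics of `search` and `pagination` open. As one possible reading, the sketch below translates a free-text search into parameters for a Drive v3 files.list call. Treating `search` as a "name contains" match, excluding trashed files, and the helper name `build_list_params` are all assumptions. `q`, `pageSize`, `pageToken`, and `fields` are existing files.list parameters.

```python
def build_list_params(search, page_size=50, page_token=None):
    """Build query parameters for a Drive v3 files.list call (illustrative only)."""
    # Escape backslashes first, then single quotes, so the value stays a valid
    # Drive query string literal.
    escaped = search.replace("\\", "\\\\").replace("'", "\\'")
    params = {
        "q": f"name contains '{escaped}' and trashed = false",
        "pageSize": page_size,
        # Request only what the service exposes, plus the token for the next page.
        "fields": "nextPageToken, files(id, name, size)",
    }
    if page_token:
        params["pageToken"] = page_token
    return params
```

The `nextPageToken` returned by Drive could then be surfaced to clients through the `pagination` object.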

DELETE /api/drive/

Delete a previously uploaded file from the storage (Google Drive).

Request parameters:

user_id: 'marcos.prieto@gmail.com' # Could be header parameter, up to the implementation

204 Response

Errors:
400 - If file_id is missing
404 - If file doesn’t exist in Google Drive
40X - Any Google Drive error

Implementation details:

The service will try to delete the file from Google Drive and synchronously return the corresponding status code.

Alternatives

Some alternatives to the service design, and the trade-offs made in this proposal:

  • A simpler but less future-proof alternative would be to use Google's Drive API client library directly in the projects that need to upload files to Drive. If more and more projects need this functionality, we could then explore wrapping Google's library in our own ebury-drive library, or creating a service similar to the one proposed in this document.

  • Instead of an API, clients of this service could publish events directly to SQS or similar, and we would only need to implement a worker to consume them. This would simplify some parts of the design, but we decided against it because it would make bi-directional communication more complicated (e.g. retrieving a newly created file ID or checking file status), and in general clients would have less friction communicating with a REST API than with SQS.

  • The service relies on files uploaded somewhere else. This might complicate operations for clients as in some cases they might have to do the upload somewhere and then talk to this service.

An easy-to-implement improvement would be to offer an endpoint that returns signed S3 upload URLs for a bucket owned by this service. Clients could upload their files there if needed and then use the uploaded file for the Google Drive upload. This doesn't remove the need for a separate upload step, but it gives clients a place to upload the file and leverages S3 functionality and reliability without them having to deal with bucket operation, permissions, etc. themselves.

  • Celery was chosen instead of SQS or other alternatives to run tasks, despite some extra configuration/ops complexity, as it provides much more functionality out of the box to publish and consume events (e.g. chords, canvas, periodic tasks). The transport for Celery is not specified here; it could be Redis, SQS itself, or any other.

  • Adding state to this service in the form of some type of storage could potentially make it more flexible. For example, a UUID could be generated for each file seen by the system and then used to identify it across the system (e.g. a client could POST a file, then call /scan/<uuid>, /upload/<uuid>/drive and /upload/<uuid>/salesforce, and later GET / to get the status of that file across the different services).
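
To make the stateful alternative in the last bullet more concrete, here is a minimal sketch of the bookkeeping it implies. Everything in it is hypothetical: the `FileRegistry` class, its method names, and the status strings are invented for illustration, and an in-memory dictionary stands in for whatever storage would actually be chosen.

```python
import uuid


class FileRegistry:
    """In-memory stand-in for the storage this alternative would add."""

    def __init__(self):
        self._files = {}

    def register(self, source_url):
        """Assign a UUID to a newly seen file and start tracking it."""
        file_uuid = str(uuid.uuid4())
        self._files[file_uuid] = {"source_url": source_url, "steps": {}}
        return file_uuid

    def record(self, file_uuid, step, status):
        """Store the outcome of one step, e.g. "scan", "drive" or "salesforce"."""
        self._files[file_uuid]["steps"][step] = status

    def status(self, file_uuid):
        """Return a copy of the per-step status for one file."""
        return dict(self._files[file_uuid]["steps"])
```

A status endpoint would then only need to look up the UUID and return the recorded steps.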

Caveats

  • Error handling of failed uploads/scans in async mode stays inside the service infrastructure and is not currently exposed to clients in this design. Depending on the needs of the clients, solutions for this could be explored (e.g. webhooks, event publishing). As async mode can handle multiple retries seamlessly and sync mode is offered anyway, this should not block usage of the service for now.

  • As mentioned above the design is currently very focused on Google Drive.

Operation

The new Upload Service will be running as an internal resource, as it is not intended to be used by the public.

We aim to operate it the same way we already operate the Account Details service, using the same technologies and the same development and deployment processes.

Security Impact

Authentication and authorization

As this service is not exposed to end clients, it won't use a traditional three-legged OAuth flow to authenticate with Google Drive. It will use a two-legged flow with domain-wide delegation instead. With this flow the service can operate on Google Drive on behalf of any other user.

With domain-wide delegation, a client of the service could potentially impersonate a user it shouldn't be allowed to act as and create documents on that user's behalf. To prevent this, the service will keep a whitelist of accounts it is allowed to impersonate and handle it as a secret (stored in Vault etc). These would generally be generic accounts (e.g. sales@ebury.com, compliance@ebury.com) and not personal accounts. What access/restrictions those accounts have will be handled per use case by the security and relevant teams using the Google Drive console.
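
The whitelist check described above could look roughly like this. It is a sketch under assumptions: the secret is taken to be a comma-separated string, matching is case-insensitive, and the names `load_allowed_accounts`, `check_impersonation`, and `ImpersonationNotAllowed` are hypothetical.

```python
class ImpersonationNotAllowed(Exception):
    """Raised when a client asks to act as an account outside the whitelist."""


def load_allowed_accounts(raw):
    """Parse a comma-separated secret value into a normalised set of accounts."""
    return {item.strip().lower() for item in raw.split(",") if item.strip()}


def check_impersonation(user_id, allowed_accounts):
    """Return the normalised account, or raise so the API can answer 403."""
    normalised = user_id.strip().lower()
    if normalised not in allowed_accounts:
        raise ImpersonationNotAllowed(user_id)
    return normalised
```

Running this check before any Google Drive call means a rejected `user_id` never reaches the delegated credentials.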

Scanning content before it is stored

Files can be checked against threats by a 3rd party service and rejected if a scanning result raises suspicion.

Performance Impact

The proposed service doesn't handle uploads itself, and the most performance-sensitive parts can be handed off to a task queue, so the performance impact on other services is expected to be minimal.

Developer Impact

N/A

Deployment

The new Upload Service should be deployed and run in a Docker container.

Dependencies

The new Upload Service will require access to the SCANII and Google Drive APIs.

References

Google API Auth Delegation

Google API Service Accounts

Google API developer guide

Google API error handling

SCANII API docs