Data obfuscation

Define a platform for running data obfuscation workloads, and an interface for services to interact with that platform.

Problem Description

Ebury is required by regulation to use an obfuscated version of the data in the test and development environments.

Data must be consistent between different data sets (for instance, the same IBAN account is present in data sets belonging to different services).
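One common way to meet this consistency requirement is deterministic pseudonymisation: every script derives the replacement value from the original with a keyed hash, so the same IBAN maps to the same token wherever it appears. The sketch below is only an illustration under that assumption; the function name, the key and the sample IBAN are hypothetical and are not part of the current process.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes, length: int = 16) -> str:
    """Map a sensitive value to a stable token that cannot be reversed without the key."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length].upper()


# Hypothetical key; every obfuscation script would need to share the same one.
SECRET = b"example-obfuscation-key"

# The same (sample) IBAN obfuscated by two different services yields the same token.
token_in_service_a = pseudonymize("ES9121000418450200051332", SECRET)
token_in_service_b = pseudonymize("ES9121000418450200051332", SECRET)
```

A keyed hash is used rather than a plain hash so that the mapping cannot be rebuilt by someone who only knows the list of candidate IBANs.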

Furthermore, while using the full extent of the data is great for performance testing, it is generally not needed for functional testing, so a smaller version of the data set is also desirable.

Background

Until now, only data for BOS has been obfuscated. The code for the obfuscation process is tightly coupled with the infrastructure, and it is maintained by the SRE Team, whose knowledge of the data schema is limited. In addition, changes in the data model are not necessarily followed by changes in the obfuscation process.

The process is executed in the management instance (management-dev in the development account), scheduled with crontab. There is no provisioning or infrastructure code for this. The process itself is a Shell script (database_obfuscation.sh) that calls a series of other Shell and Python scripts. Source code is stored in https://github.com/Ebury/ebury-infrastructure-scripts/tree/master/obfuscated/

The scripts cover both the creation and deletion of resources and the manipulation of data in order to achieve obfuscation.

Access and credentials

BOS accesses its database through pg_bouncer, which also manages credentials. However, credentials are also available in Credstash.

Other services in ECS access their databases directly, with credentials stored in Vault. Access to those credentials is granted by IAM Role.

When we create RDS databases from snapshots, the database is restored with the same credentials as the source.

BOS ElasticSearch

An additional problem specific to the BOS database is that the indices in ElasticSearch also need to be rebuilt. The current obfuscation process already addresses this by running a local ElasticSearch in the management instance and a rather obscure process involving the repository-s3 and river-jdbc plugins, which creates a dump in S3 that is restored in the staging environment together with the obfuscated database.

Recreating the indices in ElasticSearch through Django commands has been discarded in the past because it can easily take several days and is likely to run out of memory.

ElasticSearch, however, is scheduled for removal from BOS, and the removal is on the BOS team's roadmap.

Risks

Due to the consequences of accidentally writing content in the production database, we must be extremely cautious and ensure that the obfuscation process never has access to actual production database clusters.

Other data sets

There are other data sets susceptible to obfuscation, namely: DynamoDB and S3 Buckets. Currently, there is no obfuscation process in place.

Solution

Decouple the infrastructure from the obfuscation code, with the code being maintained and tested by the development teams without needing access to the production platform. A Jenkins pipeline will run the obfuscation tasks defined by each service in an isolated infrastructure based on the latest RDS snapshot.

Infrastructure

Infrastructure will be a new VPC and ECS cluster, with ECS Fargate tasks running the obfuscation scripts defined by each service. The RDS to be obfuscated will be created from scratch on each pipeline execution, from the latest available snapshot.

Roughly, the Pipeline would look like:

Infrastructure

The permanent infrastructure (a new VPC with its networking, and an ECS cluster) will be created with Terraform code in the terraform-global repository, and the ephemeral infrastructure is created and destroyed by the pipeline.

Pipeline details

The pipeline will discover which tasks are ready for obfuscation through the tags included in the task definition. These tags should be added to indicate a database to be obfuscated.

"ObfuscationTask" : "1",
"DatabaseName" : "dbname",
"DatabaseInstance" : "develenv0",

The following command filters task definitions by the ObfuscationTask tag:

aws resourcegroupstaggingapi get-resources --region eu-west-1 --resource-type-filters ecs:task-definition --tag-filters Key=ObfuscationTask,Values=1 --max-items 1 | jq --raw-output '.ResourceTagMappingList[].ResourceARN'
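The same lookup could be done from Python with boto3, which the pipeline may need anyway for the RDS steps. The sketch below is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and boto3 is imported lazily so that the parsing part can be exercised without AWS access.

```python
from typing import Any, Dict, Iterable, List


def task_definition_arns(pages: Iterable[Dict[str, Any]]) -> List[str]:
    """Collect the resource ARNs from get-resources response pages."""
    return [
        mapping["ResourceARN"]
        for page in pages
        for mapping in page.get("ResourceTagMappingList", [])
    ]


def fetch_pages(region: str = "eu-west-1") -> Iterable[Dict[str, Any]]:
    """Query the tagging API for task definitions tagged ObfuscationTask=1."""
    import boto3  # assumption: boto3 is installed where the pipeline runs

    client = boto3.client("resourcegroupstaggingapi", region_name=region)
    paginator = client.get_paginator("get_resources")
    return paginator.paginate(
        ResourceTypeFilters=["ecs:task-definition"],
        TagFilters=[{"Key": "ObfuscationTask", "Values": ["1"]}],
    )
```

Calling `task_definition_arns(fetch_pages())` would return every tagged task definition, since the paginator follows all result pages instead of stopping at the first item.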

With the help of either Terraform data sources or the AWS CLI, we will obtain references to the currently running services, their database connections, and their latest automated snapshots, producing a list of services like:

[
    {
        "cluster": "devel-backoffice-wlp",
        "service": "devel-sherlock",
        "task_definition": "devel-sherlock:93",
        "container": "sherlock",
        "db_instance": "develbackoffice1",
        "db_snapshot": "rds:develbackoffice1-2020-09-28",
        "name": "sherlock",
        "db_name_variable": "DB_NAME",
        "db_host_variable": "DB_HOST"
    }
]

The list of services can then be grouped by database instance, so that the pipeline can iterate over one entry per instance:

[
    {
        "snapshot": "rds:develbackoffice1-2020-09-23-23-23",
        "version": "postgre9.8",
        "services": [
            {
                "task_definition": "devel-sherlock:93",
                "environment": {
                    "DB_HOST": "the.temporal.rds"
                },
                "dbs": [
                    {
                        "environment": {
                            "DB_NAME": "sherlock"
                        },
                        "obfuscate": {
                            "command": "database/obfuscate",
                            "target": {
                                "dump": "s3://ebury-obfuscated-dumps/sherlock/today.sql.tar.gz",
                                "volume": "s3://ebury-obfuscated-dumps/sherlock/today.dockervolume.tar.gz"
                            }
                        },
                        "truncate": {
                            "command": "database/truncate",
                            "target": {
                                "dump": "s3://ebury-development-dumps/sherlock/today.sql.tar.gz",
                                "volume": "s3://ebury-obfuscated-dumps/sherlock/today.dockervolume.tar.gz"
                            }
                        }
                    }
                ]
            }
        ]
    }
]
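The grouping step can be sketched in a few lines of Python. This is a simplified illustration: the function name is hypothetical, the output keeps only a subset of the fields shown above, and it assumes that all services sharing an instance report the same snapshot.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_instance(services: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group discovered services by the RDS instance they connect to."""
    services_by_instance: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    snapshot_by_instance: Dict[str, str] = {}
    for service in services:
        instance = service["db_instance"]
        # Assumption: every service on one instance reports the same snapshot.
        snapshot_by_instance[instance] = service["db_snapshot"]
        services_by_instance[instance].append(
            {
                "task_definition": service["task_definition"],
                "container": service["container"],
                "db_name": service["name"],
            }
        )
    # One entry per instance, so each snapshot is restored only once.
    return [
        {
            "db_instance": instance,
            "snapshot": snapshot_by_instance[instance],
            "services": grouped,
        }
        for instance, grouped in services_by_instance.items()
    ]
```

Grouping by instance means one temporary RDS is restored per source instance, however many services keep a database on it.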

Additional services can be added to the previous list, for instance services like BOS that are not currently running on ECS.

A Jenkins pipeline will iterate (or run in parallel) over all snapshots obtained as outputs, then restore each of them in a newly created RDS (with no multi-AZ, possibly on a spot instance). The RDS will be created with the AWS CLI (or an equivalent programmatic library like boto3), but not with Terraform. Terraform is great for creating and maintaining persistent infrastructure, but not so good when dealing with temporary resources.
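As an illustration of the restore step with boto3, the sketch below builds the request separately from the API call so that the parameters can be reviewed and tested without touching AWS. The identifier prefix, instance class, tag and function names are assumptions made for the example, not decisions taken in this document.

```python
from typing import Any, Dict, Sequence


def restore_request(
    snapshot: str,
    source_instance: str,
    run_id: str,
    subnet_group: str,
    security_group_ids: Sequence[str],
    instance_class: str = "db.t3.medium",  # hypothetical size
) -> Dict[str, Any]:
    """Build the parameters for restoring a snapshot into a temporary instance."""
    return {
        # A distinct prefix makes temporary instances easy to tell from real ones.
        "DBInstanceIdentifier": f"obfuscation-{source_instance}-{run_id}",
        "DBSnapshotIdentifier": snapshot,
        "DBInstanceClass": instance_class,
        "DBSubnetGroupName": subnet_group,
        "VpcSecurityGroupIds": list(security_group_ids),
        "MultiAZ": False,
        "PubliclyAccessible": False,
        "Tags": [{"Key": "Purpose", "Value": "obfuscation"}],
    }


def restore(request: Dict[str, Any], region: str = "eu-west-1") -> None:
    """Restore the snapshot and block until the instance is available."""
    import boto3  # assumption: boto3 is installed where the pipeline runs

    rds = boto3.client("rds", region_name=region)
    rds.restore_db_instance_from_db_snapshot(**request)
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=request["DBInstanceIdentifier"])
```

Placing the instance in the subnet group of the isolated VPC, with public access disabled, keeps the restored copy unreachable from the rest of the platform.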

Exceptionally, in the case of BOS, we will also create a managed ElasticSearch with its index restored from a production snapshot. In fact, the case of BOS is so exceptional that it should not use service discovery, but an ad-hoc function for creating the resources and the task definition.

In the pipeline, run an ECS Fargate task with access to the new RDS, running an image for the service.

The command specified will be database/obfuscate. The task will provide environment variables with database credentials, with the same variable names that the service is already using.
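A minimal sketch of how such a task could be launched with boto3's `run_task`: the command and the database variables are passed as container overrides on top of the service's own task definition. The helper name and the example values are hypothetical.

```python
from typing import Any, Dict, Mapping, Sequence


def run_task_request(
    cluster: str,
    task_definition: str,
    container: str,
    command: str,
    environment: Mapping[str, str],
    subnets: Sequence[str],
    security_groups: Sequence[str],
) -> Dict[str, Any]:
    """Build the ECS RunTask parameters for one obfuscation or truncation run."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container,
                    # Replace the service's normal command with the script.
                    "command": [command],
                    # Same variable names the service already uses.
                    "environment": [
                        {"name": name, "value": value}
                        for name, value in sorted(environment.items())
                    ],
                }
            ]
        },
    }
```

The resulting dictionary can be passed as `boto3.client("ecs").run_task(**request)`. Reusing the same builder with `database/truncate` would cover the truncation run as well.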

After obfuscation is completed, the pipeline will use a different ECS Fargate task to dump the database to a SQL file and store it gzipped in S3. A Docker image for this purpose is already available (https://github.com/Ebury/docker-postgres-backup) and is already being used for non-obfuscated dumps.

Then, following the same ECS Fargate mechanism, run a new task with the database/truncate command. Once completed, dump again and store the result gzipped in S3.

Once all data has been stored, destroy the RDS (and any other temporal infrastructure).
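If the orchestration is written in Python, one way to make sure the destroy step runs even when an intermediate step fails is to wrap the temporary resource in a context manager. The sketch below is generic: `create` and `destroy` are hypothetical callables standing in for the real RDS operations.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def temporary_resource(
    create: Callable[[], T], destroy: Callable[[T], None]
) -> Iterator[T]:
    """Create a resource and destroy it on exit, even if the body raises."""
    resource = create()
    try:
        yield resource
    finally:
        # Runs on success and on error, so the resource is not left behind.
        destroy(resource)
```

For example, `with temporary_resource(restore_db, delete_db) as db:` would run obfuscation, truncation and dumps inside the block and delete the database afterwards. This does not protect against the process being killed outright, so a separate clean-up of leftovers is still advisable.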

After all the dumps have been completed, create gzipped Docker volumes for the databases. Gzipped volumes are faster to restore than SQL dumps, and are already being used in the ED2K and EBOX environments. To create them, for each database, run a local PostgreSQL Docker container (the PostgreSQL version can be obtained from the snapshot) and restore the SQL dump there. Once completed, stop the container and gzip the Docker volume for storing in S3.

Execution logs will be visible in the Jenkins pipeline log, but it is possible to stream them to ELK.

Metrics for the temporary RDS instances will be exported from CloudWatch to Prometheus, and presented in a Grafana dashboard.

Service

Services will include two scripts, database/obfuscate and database/truncate, in their Docker images.

For the specific case of BOS, which is not running in containers, there is already a container image tagged with the production label on each release, in order to be used by the Snippet pipeline, so the operation will not be significantly different.

The responsibility for creating and maintaining the scripts lies with the development team that owns the service.

It should be possible to run the script against old versions of the database, so the script must ensure possible pending migrations are executed as part of the script.

Those scripts, when running, will have the same environment variables as the service, including the database connection and access to credentials (through Vault or Credstash).

Scripts must take into account that only access to the database will be possible; access to internal services, external services or other AWS resources will not be possible. Additionally, no environment variables besides the ones for RDS will be available.

Exceptionally, ElasticSearch restored from production dump will be available for BOS script, at the following variables:
* ELASTICSEARCH_URL
* ELASTICSEARCH_INDEX_NAME

The CI pipeline for a service can run both scripts against a database restored from the latest dump in S3, if feasible within time constraints. In any case, scripts will be executed against the local development database. Two new make targets, test-obfuscate and test-truncate, will create a local test environment, with source and format arguments similar to the ones provided in the restore-db targets.

Alternatives

Aurora Serverless RDS. However, Serverless RDS can only be restored from Aurora RDS with the same DB engine, or from other Aurora Serverless RDS. Currently, only Eburyonline future infrastructure uses serverless RDS.

Use ECS backed by EC2 and an autoscaling group, adding and removing EC2 instances in the same way we would do with RDS. Compute for the same resources in EC2 is ~20% cheaper than in Fargate, but the operation would be more complex, and there is always a risk that a bug in our code leaves the instances running, incurring additional costs.

Do not centralize the obfuscation in a Pipeline, but have it running as scheduled tasks in ECS. However, creation and destruction of temporal RDS should be handled somehow.

After restoring the snapshot in RDS, replace user and password. The pipeline would still need to retrieve current credentials for each database from somewhere (probably Vault) and also make available the new credentials.

Instead of having metadata in the ECS services, the service could expose an endpoint with this metadata, which could be consumed through service discovery.

Caveats

It may happen that the service version (and schema) changes between the time the snapshot was taken and the time the obfuscation is performed. The obfuscation script must ensure that any pending migration is applied before starting the process.

If the pipeline is aborted during execution, there is a risk that the temporary RDS resource is left behind, incurring additional costs.
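One possible mitigation is a periodic sweep that looks for temporary instances older than a threshold. The sketch below only selects candidates from the output of the RDS `describe_db_instances` call. It assumes the temporary instances carry a `Purpose=obfuscation` tag, which is a hypothetical convention, and the 12-hour threshold is likewise only an example.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List


def leftover_instances(
    instances: Iterable[Dict[str, Any]],
    now: datetime,
    max_age: timedelta = timedelta(hours=12),  # hypothetical threshold
    tag_key: str = "Purpose",  # hypothetical tag convention
    tag_value: str = "obfuscation",
) -> List[str]:
    """Return identifiers of tagged temporary instances older than max_age."""
    leftovers = []
    for instance in instances:
        tags = {tag["Key"]: tag["Value"] for tag in instance.get("TagList", [])}
        created = instance.get("InstanceCreateTime")
        if tags.get(tag_key) != tag_value or created is None:
            continue  # never touch untagged (real) instances
        if now - created > max_age:
            leftovers.append(instance["DBInstanceIdentifier"])
    return leftovers
```

Selecting strictly by tag keeps the sweep from ever considering a production instance, in line with the caution described under Risks.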

Most database credentials are stored in Vault, and services access Vault through their IAM role. With the proposed implementation, the obfuscation task will have the same role as the service (because it will use the same task definition), and will therefore have access to production resources whose access is granted through the API (S3 buckets, SNS, SQS, etc.).

Operation

The pipeline would be configured to run automatically from Jenkins on a daily or weekly basis. Errors in pipeline execution will trigger notifications in Slack channels, different for each database.

The age of the latest obfuscated dump in each folder will be monitored, and alarms triggered if the dump is too old.
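The age check could be sketched as follows, working on the `Key` and `LastModified` fields that an S3 object listing returns. The two-day threshold and the function name are assumptions for illustration, and the folder is taken to be the first path component of the key.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List


def stale_folders(
    objects: Iterable[Dict[str, Any]],
    now: datetime,
    max_age: timedelta = timedelta(days=2),  # hypothetical threshold
) -> List[str]:
    """Return folders whose most recent dump is older than max_age."""
    newest: Dict[str, datetime] = {}
    for obj in objects:
        # The first path component is the folder, e.g. "sherlock".
        folder = obj["Key"].split("/", 1)[0]
        modified = obj["LastModified"]
        if folder not in newest or modified > newest[folder]:
            newest[folder] = modified
    return sorted(
        folder for folder, modified in newest.items() if now - modified > max_age
    )
```

A scheduled job could feed this with the bucket listing and raise an alarm for every folder returned.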

Security Impact

The pipeline will have read access to production data and it will manipulate it. This document and any further change in the process must have the approval from security team.

Any change in the pipeline will be done through Pull Request with the approval of Security Team, as well as any change in obfuscate and truncate scripts.

Tasks running in Fargate will have a network interface in our VPC and access to the databases, but we will have no control over the underlying host. The risk is mitigated because the workload will not expose any port. Aquasec can provide some security for this kind of workload through micro enforcers.

Performance Impact

The process will use its own resources, so it will have no impact on the performance of the platform.

Developer Impact

The development teams will have a new responsibility for defining obfuscate and truncate commands.

Data Consumer Impact

The process will only read the current data schema, not including changes.

Deployment

The pipeline can be tested in the development account, but will run from a Jenkins agent in the production account. The permanent infrastructure (VPC, ECS cluster) must be created by the DevOps Team from code in the terraform-global repository.

References

https://fxsolutions.atlassian.net/wiki/spaces/TEAM/pages/119736685/Database+Obfuscation
https://blog.aquasec.com/securing-aws-fargate-with-sidecars
https://blog.aquasec.com/revisiting-aws-fargate-with-aqua-3.0
