Automate snippet execution
Provide a mechanism for automating, auditing and testing the execution of code snippets in production environments.
Problem Description
With our current procedures we often need to run code in the Production Shell, that is, Python code executed in the production environment with access to the database through the Django ORM. The use cases are data migrations, fixing errors in data processing, and troubleshooting.
At the moment, development teams send the code through Bitbucket Snippets to Support team. Then, once approved, snippets are executed by Support team or SRE team. This approach presents several problems:
- It makes development teams dependent on Support or SRE availability.
- It does not guarantee that the approved code is the one that is actually executed.
- It implies that Support Team has unlimited write access to the platform.
- It means running code in production without passing the quality gates and security checks we enforce for other code.
Background
Support Team has provided several use cases for accessing Application Shell in production, which in turn can be grouped in three scenarios:
One-shot scripts due to:
- Data migrations in new developments
- Cannot be executed at release process due to execution time concerns
- It may make use of external sensitive data
- It is an activation script for a feature and we want to do canary deployment of the feature
- Fix database inconsistencies due to incidents
- Bulk changes requested by Operations
Common day-to-day operations:
- Change requests from Operations: editing broker deals or trades, deleting duplicates and undoing matches
- Manage program managers
- Manage users, partners and API keys
Read only operations:
- Troubleshooting
- Reports
This RFC will focus only on the first group of use cases. Different RFCs should address the other use cases.
In addition, this is not a substitute for all data migrations. Any data migration that is part of a schema migration (e.g. a migration that has been split into several steps in order to have zero-downtime releases) should still be executed as part of the usual migrations. Sometimes this is problematic, because populating a new field in a big table can take a long time. In those cases, migrations must still be responsible for data integrity, but it should be possible to run a snippet beforehand that populates the field (hence making the final migration faster).
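The pre-population idea can be sketched as a batched, re-runnable backfill. The example below uses the standard-library sqlite3 module only to stay self-contained; a real snippet would go through the Django ORM. The table `trades` and the columns `amount`, `rate` and `amount_eur` are hypothetical.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Populate `amount_eur` where it is still NULL, one batch per transaction.

    Returns the number of rows processed. Re-running is safe: rows that are
    already populated are skipped, so the final migration only has to cover
    whatever was inserted after the snippet ran.
    """
    total = 0
    last_id = 0
    while True:
        # Walk the table by primary key so every pass makes progress.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM trades WHERE id > ? AND amount_eur IS NULL "
                "ORDER BY id LIMIT ?",
                (last_id, batch_size),
            )
        ]
        if not ids:
            return total
        with conn:  # commit after every batch to keep transactions short
            conn.execute(
                "UPDATE trades SET amount_eur = amount * rate "
                "WHERE id BETWEEN ? AND ? AND amount_eur IS NULL",
                (ids[0], ids[-1]),
            )
        total += len(ids)
        last_id = ids[-1]
```

Short transactions keep lock times low on a busy table, at the cost of the backfill not being atomic as a whole, which is why the final migration still has to guarantee integrity.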
The proposal is to automate snippet execution with no human intervention other than approving the snippet. Since there will be no human intervention, we must also mitigate the risk of a script being executed more than once or, even worse, of the whole history of scripts being run by accident.
Solution
Run snippets as data migrations, but in a separate Django application that lives in a repository independent from project code, so it has a different lifecycle in terms of deployment.
In essence, running a snippet to change data is conceptually a data migration. The only reasons for not running snippets as regular data migrations are the time a migration could take to run, which would significantly lengthen the release, and the long time to production we have in some projects, especially BOS.
One of the major features of Django migrations is the ability to declare dependencies between them and run them in sequence. This is useful here, because we can specify that a script depends on a specific schema, and it remains possible with the proposed solution, but it is not the main feature we seek from Django migrations.
On the other hand, what really suits the use case is another feature of Django migrations: once executed, a migration is recorded in the database as applied, so subsequent executions will not run the script again.
In addition, executing snippets in this way gives us auditability about who executed what and when.
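The run-once behaviour can be illustrated with a small standalone sketch. It is not Django's implementation; it only mimics the bookkeeping idea (Django keeps its own record of applied migrations in the database) using the standard-library sqlite3 module. The table name `applied_snippets` and the helper `run_once` are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def run_once(conn, name, snippet):
    """Run `snippet` only if `name` has not been recorded as applied.

    Returns True if the snippet ran, False if it was skipped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_snippets ("
        "name TEXT PRIMARY KEY, applied_at TEXT NOT NULL)"
    )
    already_applied = conn.execute(
        "SELECT 1 FROM applied_snippets WHERE name = ?", (name,)
    ).fetchone()
    if already_applied:
        return False

    # Run the snippet and record it in the same transaction, so a failure
    # leaves no "applied" record behind.
    with conn:
        snippet(conn)
        conn.execute(
            "INSERT INTO applied_snippets (name, applied_at) VALUES (?, ?)",
            (name, datetime.now(timezone.utc).isoformat()),
        )
    return True
```

Calling `run_once` twice with the same name executes the snippet the first time and skips it the second, which is the property the proposal relies on. The stored timestamp gives the "when"; the "who" would come from the repository history and pull request approvals.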
Code reusability
Sometimes we may need to execute a script more than once. A migration could be unapplied and then re-executed, but this should be forbidden or be a last resort. If a re-execution is needed, the best way is to add the script again in a new migration. If it is a recurrent script, the code should be included in the service code base, with tests. The data migration then only contains the parameterized call to the script.
For instance, in the case of activation scripts for the canary release of a feature, it makes total sense to develop the activation script during feature development and then add the call in the snippets code.
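A minimal sketch of that pattern follows. The function `activate_feature`, the feature name and the client ids are hypothetical; in practice the function would be imported from the service code base, where it has its own tests. The `(apps, schema_editor)` signature is the one Django passes to RunPython callables.

```python
# Defined inline only to keep the sketch self-contained. In the real setup
# this would be imported from the service code shipped in the image.
def activate_feature(feature, client_ids):
    """Enable `feature` for the given clients and return how many were enabled."""
    enabled = sorted(set(client_ids))
    # ... the real implementation would update feature flags through the ORM ...
    return len(enabled)


# Parameters for this particular canary step (illustrative values).
CANARY_CLIENTS = [101, 102, 103]


def forwards(apps, schema_editor):
    """Callable for migrations.RunPython: a thin, parameterized call."""
    count = activate_feature("new_pricing", CANARY_CLIENTS)
    # Post-execution check: raising here makes the migration fail, so it is
    # not recorded as applied.
    if count != len(CANARY_CLIENTS):
        raise RuntimeError(
            "expected {} activations, got {}".format(len(CANARY_CLIENTS), count)
        )


# In the migration file, assuming a reverse function `backwards` exists:
#     operations = [migrations.RunPython(forwards, backwards)]
```

Widening the canary later would then be a new migration with a different client list, not a re-run of this one.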
Over time, the number of scripts or migrations in the snippets repository is likely to grow. This can be handled in two ways:
- Having multiple independent repositories for snippets, split by context or scope
- Regularly running the squashmigrations command on the repository
Implementation details
The proposed solution needs to address two facets: how do we tell, from a repository, that a snippet is ready for execution, and how do we actually run it in production in an automated way, following all security procedures.
How to define the snippets
Create a new Django application in a separate repository, running in a Docker image that extends the software version that is running in production.
.
├── Dockerfile
├── settings.py
└── snippets
    ├── admin.py
    ├── __init__.py
    ├── migrations
    │   ├── 0001_initial.py
    │   └── another_migration.py
    ├── models.py
    ├── settings.py
    ├── tests.py
    └── views.py
Contents for settings.py (copied into the image as snippets.py by the Dockerfile):
from bos.settings.ebury.docker import *
INSTALLED_APPS += (
    'snippets',
)
Contents for Dockerfile:
ARG BOS_VERSION=local
FROM bos:${BOS_VERSION}
COPY snippets /code/snippets
COPY settings.py /code/bos/settings/ebury/snippets.py
ENV DJANGO_SETTINGS_MODULE=bos.settings.ebury.snippets
Developers will add new snippets as migrations under snippets/migrations, in the same way a normal migration would be added.
Each migration will follow Django migration syntax.
The image will contain all the service code, so any method from the application can be called within the migration. Furthermore, the usual restriction that migrations use only ORM methods and not application methods does not apply here, because these "migrations" are not needed to upgrade a database from an old schema.
Additional tooling, including Makefiles and docker-compose files, will also be present in the repository for local testing and as helpers when running in production.
Quality gates in the pull request phase can also be implemented in the Jenkinsfile:
- Run the included tests
- Check that the migration does not change the schema
- Run the migrations against different production dumps
- Run the undo for the migration
- Check for conflicting migrations
- ...
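One way the "does not change the schema" gate could be approximated is a static check that every entry in a migration's `operations` list is a RunPython or RunSQL call. The sketch below uses only the standard-library ast module; the function name and the allow-list are assumptions, and a static check like this cannot see schema changes performed inside SQL statements or Python callables.

```python
import ast

# Operations considered data-only for the purposes of this gate (assumption).
ALLOWED_OPERATIONS = {"RunPython", "RunSQL"}


def schema_changing_operations(source):
    """Return names of operations in `source` that are not in the allow-list.

    `source` is the text of a migration module. Only a literal list or tuple
    assigned to `operations` is inspected.
    """
    offending = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if "operations" not in targets:
            continue
        if not isinstance(node.value, (ast.List, ast.Tuple)):
            offending.append("<non-literal operations>")
            continue
        for element in node.value.elts:
            func = element.func if isinstance(element, ast.Call) else None
            # Accept both `migrations.RunPython(...)` and `RunPython(...)`.
            if isinstance(func, ast.Attribute):
                name = func.attr
            elif isinstance(func, ast.Name):
                name = func.id
            else:
                name = "<unknown>"
            if name not in ALLOWED_OPERATIONS:
                offending.append(name)
    return offending
```

A pipeline step could run this over each changed file under snippets/migrations and fail the build when the returned list is not empty.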
Snippets must include their own post-execution checks to report success or failure.
How to avoid removing objects in cascade through snippets
To remove objects safely, a prior analysis should be conducted to ensure that no other objects are removed in cascade. We can do so by using the collect method from Django's Collector.
The collect method adds the given objects, as well as their parent instances, to the collection of objects to be deleted. The objects passed in must be a homogeneous iterable collection of model instances (e.g. a QuerySet).
Keep in mind that collect should not be used here to perform the deletion, but rather to list all the objects related to the one we need to remove.
The Collector can be used as follows:
from django.db.models.deletion import Collector
collector = Collector(using='default')
collector.collect([object_to_delete])
collector.data.items()
As an example of how collect could be used, suppose we need to remove a BankAccountEntry that is linked to a ReconMatch. Deleting the BankAccountEntry would also delete the ReconMatch, which we may not want, or at least need to control.
from django.db.models.deletion import Collector
entry = BankAccountEntry.objects.get(pk=123456)
collector = Collector(using='default')
collector.collect([entry])
collector.data.items()
Output:
[(reconciliation.models.reconmatch.ReconMatch, {<reconciliation.ReconMatch: 7891011>}),
 (settlements.models.bank_account_entry.BankAccountEntry, {<settlements.BankAccountEntry: 123456>})]
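Building on this, a snippet could refuse to delete when the collector reports models it did not expect. The sketch below assumes that `collector.data` maps each model class to the instances collected for deletion, as the output above suggests. `unexpected_cascades` and `safe_delete` are hypothetical helper names. Note that related rows Django can delete without loading them are tracked as querysets in `collector.fast_deletes` rather than in `collector.data`, so the sketch inspects both.

```python
def unexpected_cascades(collected, allowed_models):
    """Return {model name: instance count} for models outside `allowed_models`.

    `collected` is expected to look like `collector.data`: a mapping from
    model class to the collection of instances that would be deleted.
    """
    return {
        model.__name__: len(instances)
        for model, instances in collected.items()
        if model not in allowed_models
    }


def safe_delete(instance, allowed_models, using="default"):
    """Delete `instance` only if the cascade stays within `allowed_models`."""
    # Imported here so the pure helper above can be used without Django.
    from django.db.models.deletion import Collector

    collector = Collector(using=using)
    collector.collect([instance])

    collected = {model: list(objs) for model, objs in collector.data.items()}
    # Include rows that would be removed through the fast-delete path.
    for queryset in collector.fast_deletes:
        collected.setdefault(queryset.model, []).extend(queryset)

    unexpected = unexpected_cascades(collected, allowed_models)
    if unexpected:
        raise RuntimeError(
            "refusing to delete, would cascade to: {}".format(unexpected)
        )
    instance.delete()
```

In the example above, `safe_delete(entry, allowed_models={BankAccountEntry})` would raise because a ReconMatch is also collected, forcing the author to either handle the match first or allow it explicitly.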
How to execute snippets
To get the snippets container into production outside the release lifecycle, we will need the same access levels in terms of IAM roles, networking, security groups, etc. as we have for the current shell and services.
Services running in ECS
On each commit to master branch, a pipeline in Jenkins will:
- Activate a lock preventing releases (or wait if release is active)
- Build and push a new Docker Image including the snippets
- Run the snippet against a production DB dump
- Run an ECS/Fargate scheduled task with the snippet in production environment
- Release lock
- Publish execution reports and send notifications
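The ordering of these steps matters more than the tooling: the lock must be released even when a step fails, and a failure against the dump must stop the run before production is touched. The sketch below expresses that control flow in Python with placeholder callables. It is not a Jenkins pipeline, and all function names are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def release_lock(acquire, release):
    """Hold the release lock for the duration of the block, always releasing it."""
    acquire()
    try:
        yield
    finally:
        release()


def run_pipeline(steps, acquire, release, publish):
    """Run the named steps in order under the lock and publish a report.

    `steps` is a list of (name, callable) pairs. The first failure stops the
    run, so a snippet failing against the dump never reaches production.
    """
    report = []
    try:
        with release_lock(acquire, release):
            for name, step in steps:
                try:
                    step()
                except Exception as exc:
                    report.append((name, "failed: {}".format(exc)))
                    raise
                report.append((name, "ok"))
    finally:
        # The report is published after the lock is released, success or not.
        publish(report)
    return report
```

The steps passed in would correspond to building and pushing the image, running against the dump, and launching the ECS/Fargate task, in that order.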
Alternatives
Include just the migration scripts in the repository
This alternative keeps only the migration scripts in the repository and has all the Django boilerplate generated in the Dockerfile.
It would make for a cleaner repository, but it might hide too much of the implementation, making it less intuitive to understand what is actually going on.
Include the new Django application in the same project repository
If the new Django dummy application is included in the same repository, the main problem will be the branching, release and deployment process. If we want the new migrations to be executed on pull request merge, the operation would feel awkward.
In addition, some projects (e.g. BOS) are already complex enough without adding even more complexity to the repository.
However, including it as a Django application in the same repository, but not including the application in the INSTALLED_APPS for default or release settings, is perfectly feasible.
Use a different data migration tool
Alternative migration tools, like sqitch, have been proposed, but Django has been chosen, at least for those projects already using it, because:
- It does not require additional configuration.
- Developers are already familiar with Django migrations.
Run scripts in a plain shell
Developing tooling or a procedure to ensure the pipeline does not execute a snippet twice would not be straightforward, and it would add complexity compared with a solution that seems to work out of the box.
Caveats
Common day-to-day operations, reports and troubleshooting are out of scope, but they are as important and as risky as the use cases covered here. Solutions might be along the lines of a read-only shell and administrative dashboards in the application, but new RFCs must be submitted with proposals for removing those shell accesses as well.
The solution is locked to Django. Services that include a database and are developed with other frameworks or languages will be fairly new, and are expected to provide an administration API.
Sandbox and Demo environments are not being taken into consideration.
Operation
Development teams will be responsible for adding snippets and testing them. Snippets will be executed automatically on merge to the master branch, so merges must be done through a pull request, with the approvals specified in each project's documentation.
Security Impact
The process and tools described in this proposal, once implemented, will have the ability to read and modify data from production systems.
Performance Impact
Automated tests will run under ideal conditions, without simulating operations in the platform, whereas the real execution will take place under different conditions. Additional tests are needed to check the actual performance impact of each snippet, and an assessment should also be done in the pull request review.
Developer Impact
The proposal will impact the way development pushes to production some of their changes, so proper understanding of the process in the development teams is needed.
Data Consumer Impact
N/A
Deployment
The proposal can be implemented independently for the different services in the Ebury Platform.
Dependencies
N/A
References
https://django.readthedocs.io/en/1.7.x/topics/migrations.html
https://docs.djangoproject.com/en/1.8/topics/migrations/#data-migrations
https://fxsolutions.atlassian.net/wiki/spaces/TEAM/pages/1425276985/Support+-+Platform+Accesses