Pytest Analysis

This RFC describes a lightweight approach to gathering analytic information about the execution of application code.

Because much of this analytical data is measurable, it is a good basis for metrics on pull requests, enabling automatic quality control on each change.

As all our code is supposed to be tested, this approach uses pytest to gather statistical information on application code execution.

Overview

Great tools exist for analytic and performance measurements at various levels of software development and related areas (databases, systems, network, etc.).

However, especially when dealing with highly complex software, it can be particularly useful to get a lightweight, "rule of thumb" analysis that comes cheap alongside the development process.

Furthermore, to prevent code changes from introducing performance regressions, we need measurements on the stable version of the application code to compare against.

Background

We have had a number of performance issues, which highlight the need for measurements of code execution covering all potential workflows.

Solution

Disclaimer: A number of excellent in-depth analysis tools are available on the market, allowing for detailed investigation of running systems and services. There was no aim to "re-invent the wheel".

Instead, we intend to demonstrate that, with simple tools and little development effort, we can tailor solutions to close the most critical gaps in the performance analysis of our systems.

What we want is quick, lightweight information that may be less precise but is easy to produce, easy to consume, and easy to turn into measurable metrics.

Purpose

There are two main usage scenarios where this tool should prove useful.

Jenkins pipelines
- metrics should be collected on the production version of the application code, and every change should be compared against them
- metrics allow us to track, and to prevent or encourage, behaviors such as:
  - unexpected performance regressions
  - performance improvements
  - performance metrics for all new code (endpoint, feature, etc.)
- any merge to master should be blocked if the metrics comparison shows a performance penalty
  - enforced either automatically or by policy (Jira checks)
- modules: django_queries
  - database interaction statistics are lightweight and measurable, perfect for the purpose

Development phase
- gathering information about the running code at development time
  - developers can follow the full application execution from a desired starting point
- modules: trace
  - this module is more CPU-intensive, but in return the output is much more detailed
  - as useful for the eye as for statistics

The idea

The pytest framework allows custom code to be injected on top of regular test execution. Thus we can gather information about application code runs from various angles.

These could be simple, lightweight measurements: generic ones (like an execution trace) or ones customized for a particular application.

Furthermore, libraries and other software components often offer analytical tools for gathering dynamic measurements at their own level (for example, Django's database query information).

We wanted to create a tool that primarily provides a small execution environment for these, together with a few modules that seemed reasonable to have first.
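As a minimal sketch of this kind of injection, a conftest.py can implement pytest's `pytest_runtest_logreport` hook, which receives a report for each phase of a test (setup, call, teardown), including its duration. The `PHASE_DURATIONS` store and the `durations_as_json` helper are illustrative names, not part of the tool described here.

```python
# conftest.py (illustrative sketch, not the tool's actual code)
import json
from collections import defaultdict

# Hypothetical in-memory store: test node id -> {phase: duration in seconds}
PHASE_DURATIONS = defaultdict(dict)


def pytest_runtest_logreport(report):
    """Hook called by pytest with a report for each setup, call and teardown phase."""
    PHASE_DURATIONS[report.nodeid][report.when] = report.duration


def durations_as_json():
    """Serialise the collected measurements, e.g. for writing to a text file."""
    return json.dumps(PHASE_DURATIONS, indent=2, sort_keys=True)
```

Because the hook is an ordinary function in conftest.py, the tests themselves do not need to change.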

The tool is required to be:

modular
- various checks based on generic or custom needs
configurable on all levels
1. Possibility to enable/disable each check (perhaps with more options)
2. Custom configuration on the level of each check/module
  - example: summary on database queries targeting particularly big tables of a particular application

Implementation

We provide an implementation that includes the basic functionality described above.

Two initial modules are included, which work at test level:

- collect database queries
- collect tracing data

Both modules collect data and can extract further statistics from it (number of repetitions, etc.). The type and level of detail of the data collected are configurable.

Having this data available, we can extract further (statistical) information out of it.

Number of queries executed

At the end of each test run, summary numbers are produced:
- All queries
- Broken down by configured characteristics
  - Default: INSERT, JOIN, etc.
- Formats:
  - Plain text table
  - JSON

Note: Running this on the BOS framework gives a surprisingly high number of queries. Since the full list of queries is available, it could be a good starting point for looking into ways to optimize or reduce these database interactions.
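The summary described above could be produced along these lines. This is only a sketch: the `CHARACTERISTICS` patterns, the `summarise` and `as_table` helpers and the "ALL" label are assumptions made for illustration, and the real module's configuration may look different.

```python
import json
import re

# Hypothetical default characteristics: label -> pattern matched against the SQL text
CHARACTERISTICS = {
    "SELECT": r"^\s*SELECT\b",
    "INSERT": r"^\s*INSERT\b",
    "UPDATE": r"^\s*UPDATE\b",
    "DELETE": r"^\s*DELETE\b",
    "JOIN": r"\bJOIN\b",
}


def summarise(queries):
    """Count all queries, plus one count per configured characteristic."""
    summary = {"ALL": len(queries)}
    summary.update({label: 0 for label in CHARACTERISTICS})
    for sql in queries:
        for label, pattern in CHARACTERISTICS.items():
            if re.search(pattern, sql, re.IGNORECASE):
                summary[label] += 1
    return summary


def as_table(summary):
    """Render the summary as a plain text table, one characteristic per row."""
    width = max(len(label) for label in summary)
    return "\n".join(f"{label:<{width}}  {count:>5}" for label, count in summary.items())


def as_json(summary):
    """Render the same summary as JSON."""
    return json.dumps(summary, indent=2)
```

A query can match more than one characteristic (a SELECT with a JOIN counts under both), so the per-characteristic numbers do not have to add up to the total.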

All queries executed per test

The full list of every individual query executed during a test run. An excellent place for quick analysis.
With each query, the following information is provided:
- SQL query
- Execution time
- Duration
- Trace: caller module, line, function, ("above" the Django DB layer) [optional]
- Number of repetitions (execute_many)
Format: JSON
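One possible way to collect such records is Django's database instrumentation, which accepts a callable with the signature `(execute, sql, params, many, context)`. The sketch below is an assumption about how a collector could look, not the module's actual code: the `QueryRecorder` class, its field names and the way repetitions are counted are all illustrative.

```python
import json
import time
import traceback
from datetime import datetime, timezone


class QueryRecorder:
    """Callable matching the (execute, sql, params, many, context) wrapper signature."""

    def __init__(self):
        self.records = []

    def __call__(self, execute, sql, params, many, context):
        executed_at = datetime.now(timezone.utc).isoformat()
        started = time.perf_counter()
        try:
            return execute(sql, params, many, context)
        finally:
            self.records.append({
                "sql": sql,
                "executed_at": executed_at,
                "duration_s": time.perf_counter() - started,
                # Assumption: for executemany, repetitions = number of parameter sets
                "repetitions": len(params) if many and hasattr(params, "__len__") else 1,
                "trace": self._caller(),
            })

    @staticmethod
    def _caller():
        """Innermost stack frame that does not live under a `django` directory."""
        stack = traceback.extract_stack()[:-2]  # drop _caller and __call__ themselves
        for frame in reversed(stack):
            if "django" not in frame.filename.replace("\\", "/").split("/"):
                return {"file": frame.filename, "line": frame.lineno, "function": frame.name}
        return None

    def as_json(self):
        return json.dumps(self.records, indent=2)
```

With Django installed, such a recorder could be activated around a test body using `with connection.execute_wrapper(recorder):`, where `connection` is imported from `django.db`.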

Full trace of test runs

Full trace of the code run by a test, excluding underlying libraries: nothing more than your own code.
This information is particularly important, as it allows you to get a FULL trace of your running application code.
- For example: what happens after a web API endpoint is invoked?
- Combined with the database interactions, this allows you to understand your code in detail
- Excellent place for a surprise ;-)
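A trace restricted to application code can be sketched with the standard library's `sys.settrace`. The helper names `make_tracer` and `run_traced`, the `app_root` parameter and the output line format are assumptions for illustration; the trace module in the tool may work differently.

```python
import os
import sys


def make_tracer(app_root, sink):
    """Build a trace function that records executed lines under app_root only."""
    root = os.path.join(os.path.abspath(app_root), "")  # ensure a trailing separator

    def tracer(frame, event, arg):
        filename = os.path.abspath(frame.f_code.co_filename)
        if not filename.startswith(root):
            return None  # library frame: do not trace its lines
        if event == "line":
            sink.append(f"{filename}:{frame.f_lineno} {frame.f_code.co_name}")
        return tracer

    return tracer


def run_traced(func, app_root, *args, **kwargs):
    """Call func and return (result, trace lines for code located under app_root)."""
    sink = []
    previous = sys.gettrace()
    sys.settrace(make_tracer(app_root, sink))
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, sink
```

Returning `None` for frames outside `app_root` skips line tracing inside libraries, while application functions called back from a library still get traced, because every new frame is offered to the tracer again.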

Extracted trace of test runs

Sometimes there is quite some "garbage" even in a trace restricted to application code. This can be filtered out easily, making it easier to follow what matters.

Code statistics of test runs

Simple data collection from the trace files.
- Digest of most repetitive lines
- Digest of most repetitive expressions
- Regexp-like search for particular expressions
  - Example: expensive operations (objects.filter)
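These digests need little more than a counter and a regular expression. The sketch below assumes the trace is available as a list of source lines; the helper names and the call-expression pattern are illustrative choices, not the tool's actual implementation.

```python
import re
from collections import Counter

# Rough pattern for a dotted name directly followed by "(", e.g. Foo.objects.filter(
CALL_PATTERN = re.compile(r"[A-Za-z_][\w.]*(?=\()")


def most_repeated_lines(trace_lines, top=3):
    """Digest of the most repetitive lines."""
    stripped = (line.strip() for line in trace_lines)
    return Counter(line for line in stripped if line).most_common(top)


def most_repeated_calls(trace_lines, top=3):
    """Digest of the most repetitive call expressions."""
    calls = Counter()
    for line in trace_lines:
        calls.update(CALL_PATTERN.findall(line))
    return calls.most_common(top)


def find_expression(trace_lines, pattern):
    """Return (line number, line) pairs for lines matching a regular expression."""
    regex = re.compile(pattern)
    return [(number, line) for number, line in enumerate(trace_lines, start=1)
            if regex.search(line)]
```

For example, `find_expression(lines, r"objects\.filter")` lists every place in the trace where a queryset filter was executed.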

Checks can either be applied to complete test suites, or be disabled by default and executed only on marked tests. More information on usage is available in the documentation provided with the code.
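Opting in per test could be built on pytest markers and a command-line option, roughly as follows. The marker name `analyse`, the `--analyse-all` option and the `should_analyse` helper are hypothetical; the documentation shipped with the code describes the real names.

```python
# conftest.py (illustrative sketch; the marker and option names are assumptions)


def pytest_addoption(parser):
    parser.addoption(
        "--analyse-all",
        action="store_true",
        default=False,
        help="run analysis checks on every test, not only on marked ones",
    )


def pytest_configure(config):
    # Register the marker so that pytest does not warn about an unknown mark
    config.addinivalue_line("markers", "analyse: collect analysis data for this test")


def should_analyse(item):
    """Return True if the checks apply to this test item."""
    if item.config.getoption("--analyse-all"):
        return True
    return item.get_closest_marker("analyse") is not None
```

A test would then opt in with `@pytest.mark.analyse`, while a pipeline could pass `--analyse-all` to cover the whole suite.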

Flows

Each of these measurements is taken for the whole test run, which means that it includes setting up test data structures and the comparisons made afterwards (assert).

The actual application run is therefore embedded within the full information retrieved from a test.

However, we apply further extraction to the data in an attempt to reduce noise. For example:
- In a trace file from an endpoint test, you can easily search for where the view module was first invoked. That is exactly where the application code was invoked.
- Database queries should clearly reflect the point up to which test data was set up; right after that, the application took over.

So even though some of the output collected here is not served on a silver platter, it is very easy to find the relevant information within it.
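Cutting the test setup off a trace can be as simple as the sketch below. It assumes the trace is a list of text lines and that the entry point (for example the view module's path) appears literally in them; `application_part` is a hypothetical helper name.

```python
def application_part(trace_lines, entry_marker):
    """Return the trace from the first line mentioning entry_marker onwards."""
    for index, line in enumerate(trace_lines):
        if entry_marker in line:
            return trace_lines[index:]
    return []  # the application entry point was never reached
```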

Future Improvements

Beyond the functionality already available, there are many further opportunities that could be added to the tool.

To mention just one nice idea, we could make measurements available so that assertions can be made on the analytical output (like len(analyse.database.queries) < 10).

Source

The codebase for a demo POC can be found in the Ebury Bitbucket ebury-pytest-analyse repository.

Alternatives

Clearly, there are excellent alternatives on the market (Silk, PG Analysis, etc.). However, they are significantly more elaborate and heavyweight, and require more effort to analyse the outcome.

Our aim here was to provide a lightweight tool for quick analysis. It is by no means a replacement for significantly better-established projects.

Caveats

The current code is there to demonstrate the possibilities and the data we could generate.

There is a lot more to improve, and we hope for a community effort to add more modules.

Operation

Some of the checks should be implemented as part of Jenkins pipelines, collecting measurements that can be compared against the same metrics taken on the application's production code.

If any code modification appears to decrease performance, this should be detected automatically and raise an error.
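A pipeline step for this comparison could look roughly like the sketch below. It assumes both runs write their metrics as flat JSON objects of numbers where lower is better (for example query counts); the function names, the file layout and the `tolerance` parameter are assumptions for illustration.

```python
import json


def regressions(baseline, current, tolerance=0.0):
    """Return {metric: (baseline value, current value)} for metrics that got worse.

    `tolerance` is a fraction: 0.1 accepts up to a 10% increase.
    Metrics present only in `current` (new code) are not compared.
    """
    found = {}
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is not None and value > base_value * (1 + tolerance):
            found[name] = (base_value, value)
    return found


def check(baseline_path, current_path, tolerance=0.0):
    """Compare two metrics files; return a process exit code (0 = ok, 1 = regression)."""
    with open(baseline_path) as fh:
        baseline = json.load(fh)
    with open(current_path) as fh:
        current = json.load(fh)
    found = regressions(baseline, current, tolerance)
    for name, (before, after) in sorted(found.items()):
        print(f"{name}: {before} -> {after}")
    return 1 if found else 0
```

Returning a non-zero exit code is enough for a Jenkins stage to fail and block the merge.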

Security Impact

Doesn't apply.

Performance Impact

Running extra data collection and checks clearly has an impact on the performance (speed) of test runs. Depending on the check, this may or may not be negligible.

This is also why it must be possible to disable checks and to fine-tune them through configuration or command-line parameters. Thus:

- Jenkins pipelines would only apply lightweight yet essential data collection.
- Detailed analysis would be available to developers for further investigation at development time.

Developer Impact

This tool should be available to all Ebury Developers working on Python codebases.

Different teams can implement different checks, or configure already existing ones to their needs.

The codebase should rely on community effort, counting on contributions from all teams.

Data Consumer Impact

All analytic data goes to text files; databases are not impacted.

Deployment

As described above, we want to integrate some of the checks with Jenkins pipelines.

Dependencies

The tool uses pytest and (on demand) Django.

References

The codebase for a demo POC can be found in the Ebury Bitbucket ebury-pytest-analyse repository.
