Merge queues

Problem Description

There are two major pain points in our current CI setup, especially for the BOS project:

  • Getting a green build that allows a Pull Request to be merged is hard and time-consuming.
  • Merges to the integration branch are not always done from an up-to-date branch, which breaks the CI from time to time.

The proposed solution aims to solve both problems.

Background

When a developer submits a Pull Request for review, we also launch Pull Request verification in the CI, and we do so again for each new commit or amendment that the developer adds afterwards. In this way, we guarantee that any Pull Request included in the integration branch has been verified.

With short verification cycles, this approach works well because developers get fast feedback about possible issues in their code. With larger projects such as BOS, it has become an unbearable burden for the CI infrastructure.

In addition, we want Pull Request verification to run against up-to-date code: either test the result of merging the code into the target branch, or test a branch that already includes the latest commit of the target branch. Quite often, developers wait for their verification to complete only to find that another developer has already merged and their branch is no longer up to date. They update the branch, the verification has to run again, and they have to wait again (a merge race). This is frustrating, and in some cases we simply merge the Pull Request even though verification did not run on an up-to-date branch.

It has been proposed to remove some verifications from the Pull Request phase and run them once the changes have been included in the integration branch (shift right). In our past experience, quality gates executed on the integration branch are significantly more expensive to maintain, and recovery is costly when errors slip into the integration branch. Shifting tests to the left, on the other hand, significantly overloads the CI infrastructure, producing higher infrastructure costs and slower executions.

Another approach proposed for reducing CI infrastructure overload is to run only a small verification on each commit to a development branch, and then run the full verification on demand before actually merging into the integration branch. The problem with this approach is that developers need to remember to run this verification, and then wait for it to complete on every merge request before proceeding with the integration. It also would not solve the merge race problem.

Solution

Implement the concept of a merge queue (or merge train): a new intermediate, automated phase in the integration cycle where ready-to-merge changes are sent to the integration queue and then merged in order if, and only if, verification passes.

Because the merge train is automated, developers only state their intention to merge. The system then notifies them when the code is actually merged after passing verification, or when it fails verification. In the event of a failed verification, developers can either send the same build to the train again or fix the error with a new commit.
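The queue behaviour described above can be sketched as follows. This is only an illustrative model in Python, not the proposed Jenkins implementation: `MergeRequest`, `process_queue` and the `verify`, `merge` and `notify` callables are hypothetical names standing in for the verification pipeline, the merge step and the notification back to the developer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple


@dataclass(frozen=True)
class MergeRequest:
    """A queued intention to merge (hypothetical shape)."""
    pr_id: int
    author: str


def process_queue(
    queue: Deque[MergeRequest],
    verify: Callable[[MergeRequest], bool],
    merge: Callable[[MergeRequest], None],
    notify: Callable[[MergeRequest, str], None],
) -> Tuple[List[int], List[int]]:
    """Drain the queue in FIFO order, merging only requests that pass verification."""
    merged: List[int] = []
    rejected: List[int] = []
    while queue:
        request = queue.popleft()
        # verify() is assumed to test the request on top of the current
        # integration branch, so it already sees every earlier merge.
        if verify(request):
            merge(request)
            merged.append(request.pr_id)
            notify(request, "merged")
        else:
            rejected.append(request.pr_id)
            notify(request, "verification failed")
    return merged, rejected
```

Because requests are handled one at a time, a request is never verified against a stale integration branch, which is what removes the merge race.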

Merge trains also make it possible to run different verification content in the CI pipeline. A smaller verification can run on each commit to a development branch, for instance a reduced number of tests, or only the test suites relevant to what is modified in the given Pull Request. A full verification of the whole codebase runs only when developers state their wish to merge the code into the integration branch.

Merge queue orchestration

In Jenkins, develop a parameterized pipeline in charge of triggering verification and, if verification passes, merging the Pull Request with the squash strategy through the Bitbucket API.

The parameterized pipeline will be defined in the Jenkins jobs repository.

In the Bitbucket repository settings, restrict merges to the integration branch so that only the pipeline user can perform them. If for some reason we need to bypass this mechanism, repository admins can temporarily grant themselves permission to do such merges.
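As an illustration of the merge step only, the sketch below builds (but does not send) a squash-merge request, assuming the Bitbucket Cloud REST API 2.0 pull request merge endpoint and bearer-token authentication. The function name is hypothetical, and the real pipeline would rely on the existing Jenkins library methods rather than on this code.

```python
import json
import urllib.request

# Assumption: Bitbucket Cloud REST API 2.0.
API_ROOT = "https://api.bitbucket.org/2.0"


def build_squash_merge_request(
    workspace: str, repo_slug: str, pr_id: int, token: str, message: str
) -> urllib.request.Request:
    """Build the HTTP request that merges a Pull Request with the squash strategy."""
    url = (
        f"{API_ROOT}/repositories/{workspace}/{repo_slug}"
        f"/pullrequests/{pr_id}/merge"
    )
    body = {"merge_strategy": "squash", "message": message}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: the pipeline user authenticates with an access token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request would be sent with `urllib.request.urlopen(request)` only after verification has passed for the same commit.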

Logic for the pipeline would be as follows:

Merge Train Pipeline

The reason for implementing it in Jenkins is that interacting with the verification pipeline, in terms of triggering it and retrieving its result, is easier. In addition, we already have methods in the Jenkins library for interacting with the Bitbucket API, and a proof of concept is already working. The alternative of using AWS Lambda is appealing, but at the moment we lack a proper standardized configuration for logging, debugging, and handling secrets and permissions in internal services.

The repository will be configured so that only the pipeline is able to actually hit the merge button on the Pull Request:

Branch permission configuration

The pipeline can also be extended to enforce that merge checks are met, thus removing the need for Bitbucket Premium features, or even to add features not currently supported in Bitbucket, such as code owners.

Expressing the wish to merge

It can be done in several ways, for instance:

  • Hooking into comments on the Bitbucket Pull Request: there is a Jenkins plugin that allows pipelines to be triggered by any webhook, with the ability to easily convert the payload into parameters.
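The conversion from a comment webhook into pipeline parameters could look roughly like the sketch below. It assumes the payload shape of a Bitbucket Cloud Pull Request comment event (`comment.content.raw`, `pullrequest`, `actor`). The `/merge` command and the parameter names are hypothetical, and in practice the Jenkins plugin would do this mapping through its own configuration.

```python
import re
from typing import Dict, Optional

# Hypothetical trigger: a comment line containing only "/merge".
TRIGGER = re.compile(r"^\s*/merge\s*$", re.IGNORECASE | re.MULTILINE)


def merge_params_from_comment(payload: dict) -> Optional[Dict[str, str]]:
    """Return pipeline parameters if the comment asks for a merge, else None."""
    text = payload.get("comment", {}).get("content", {}).get("raw", "")
    if not TRIGGER.search(text):
        return None
    pr = payload["pullrequest"]
    return {
        "PR_ID": str(pr["id"]),
        "SOURCE_BRANCH": pr["source"]["branch"]["name"],
        "TARGET_BRANCH": pr["destination"]["branch"]["name"],
        "REQUESTED_BY": payload["actor"]["display_name"],
    }
```

Keeping the trigger in the Pull Request itself means the request to merge and its outcome stay visible next to the code under review.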

Split pipelines in two

For heavier pipelines (e.g. BOS), develop different pipelines:

  • Full verification: the current pipeline, which today is applied to all commits, will now be executed only just before merge (or on demand).

  • Lean verification: smart verification that runs only tests and quality gates that are likely to be impacted by the changes being merged. In order to do that, define a matrix of directories in the repository (or globs) mapped to quality gates. This pipeline will be executed on each commit in the development branch.
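A minimal sketch of the glob-to-quality-gate matrix, assuming Python's `fnmatch` semantics (where `*` also matches `/`). The paths and gate names in `GATE_MATRIX` are invented for illustration, and falling back to the full verification for unmapped files is an assumed safety choice, not something decided in this proposal.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List, Sequence, Set

# Hypothetical matrix: glob pattern -> quality gates to run.
GATE_MATRIX: Dict[str, List[str]] = {
    "services/billing/*": ["unit-billing", "integration-billing"],
    "libs/common/*": ["unit-billing", "unit-orders", "lint"],
    "docs/*": ["docs-build"],
    "*.py": ["lint"],
}


def select_gates(
    changed_files: Iterable[str],
    matrix: Dict[str, List[str]] = GATE_MATRIX,
    fallback: Sequence[str] = ("full-verification",),
) -> Set[str]:
    """Return the quality gates likely to be impacted by the changed files."""
    gates: Set[str] = set()
    for path in changed_files:
        matched = False
        for pattern, pattern_gates in matrix.items():
            if fnmatchcase(path, pattern):
                gates.update(pattern_gates)
                matched = True
        if not matched:
            # Assumption: a file that no pattern covers triggers everything.
            return set(fallback)
    return gates
```

For example, `select_gates(["docs/index.md"])` selects only the `docs-build` gate, while a change to an unmapped file such as `Makefile` selects the fallback.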

The parameterized pipeline will be created in the Jenkins jobs repository, with its behavior specified in Jenkinsfiles in the code repository.

The lean verification shall be a subset of the full verification, and the core execution logic will be defined as common code or in the jenkins-devops shared library.

Manual operation

It shall be possible for a human operator to acquire a lock on the queue in order to control the release process, or to ensure that only specific changes enter the integration or release branches. The operator should also be able to remove requests from the queue.
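One way to model these operator controls is sketched below. The `MergeQueue` class and its method names are hypothetical, and the semantics of the lock (automated processing pauses while requests can still be enqueued) are an assumption made for the example.

```python
from collections import deque
from typing import Deque, Optional


class MergeQueue:
    """In-memory model of a merge queue with an operator lock."""

    def __init__(self) -> None:
        self._items: Deque[int] = deque()
        self._locked_by: Optional[str] = None

    def enqueue(self, pr_id: int) -> None:
        # Ignore duplicates so a Pull Request is queued at most once.
        if pr_id not in self._items:
            self._items.append(pr_id)

    def remove(self, pr_id: int) -> bool:
        """Operator action: drop a request; returns False if it was not queued."""
        try:
            self._items.remove(pr_id)
            return True
        except ValueError:
            return False

    def lock(self, operator: str) -> bool:
        """Acquire the lock; succeeds if it is free or already held by the caller."""
        if self._locked_by is None:
            self._locked_by = operator
        return self._locked_by == operator

    def unlock(self, operator: str) -> bool:
        """Release the lock; only the holder may release it."""
        if self._locked_by == operator:
            self._locked_by = None
            return True
        return False

    def next_request(self) -> Optional[int]:
        """Called by the automation: nothing is handed out while locked."""
        if self._locked_by is not None or not self._items:
            return None
        return self._items.popleft()
```

In a real deployment the lock and the queue would need to live in shared, persistent storage rather than in process memory.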

Alternatives

  • Optimistic merging, as mentioned by @NicolaHeald:

You take everything in the queue (assuming each thing in the queue is a squashed commit) and merge them all to a pre-dev (or pre-master) branch that is branched from the current dev (or master) branch.

Then you run the tests on that branch. If it passes, you merge all of them, and you’ve only done one test to merge multiple PRs.

If the tests fail, you git bisect until you find the last good commit, and merge. For the commits on the bad side of the bisect, you run the tests for each one in isolation and report the failures to the PRs.

For long queues where the tests are reliable, this can save a significant amount of time. Flaky tests can cause chaos with it though, so maybe it’s something we can revisit when we have minimised our flaky test problems.

That would be more complex to implement, but it can be considered for further iterations.

  • Instead of a pipeline in Jenkins, the queue orchestration can be done with a custom daemon or with an AWS Lambda function that handles the locks with AWS services (for instance, DynamoDB).

  • GitLab already offers the merge train feature, integrated with GitLab Pipelines.

  • There are several tools providing the same feature for GitHub:

  • Instead of creating a leaner verification, still run the full verification on each commit, while still relying on the merge queue to ensure we always run the verification with up to date code.

  • In order to reduce load on the CI while still running tests for each commit, make extensive use of caching for test results. This is easier to achieve in statically typed languages; in dynamically typed languages it is much more difficult and risky. Build orchestration tools like Bazel could help with caching, and for Python tests it is possible to record test executions and determine which tests are affected by code changes with tools like testmon. However, the effort required, especially in the BOS codebase, seems enormous. It also seems feasible only for some quality gates, like unit tests or integration tests.

  • Instead of using squash merge as the strategy, use a fast-forward-only strategy. The advantage is that the tested commit will be exactly the same as the merged commit, even at the SHA-1 level. The disadvantage is that the commit history will be less readable.

  • Use Slack comments for expressing the wish to merge. This has been discarded because:

    • It would decouple the current state indicator for the Pull Request from the actual pull request.
    • Slash command, the solution initially proposed and supported by Jenkins, is deprecated in Slack and we would need to create a Slack App.
    • Slack could still be used indirectly for commenting in the Pull Request, thus triggering a merge request.
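The optimistic merging alternative described at the top of this list can be sketched as a bisection over the queued commits. This is a simplified model under stated assumptions: `passes` is a hypothetical callable that runs the tests on the current integration branch plus the given commits, the integration branch itself is assumed green, tests are assumed reliable (not flaky), and "in isolation" is interpreted as testing each remaining commit alone on top of what was just merged.

```python
from typing import Callable, List, Sequence, Tuple


def optimistic_merge(
    batch: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> Tuple[List[str], List[str], List[str]]:
    """Return (merged, failed, retry) for a batch of queued squashed commits."""
    # Best case: one test run merges the whole batch.
    if passes(batch):
        return list(batch), [], []

    # Bisect for the longest passing prefix. Invariant: the prefix of
    # length `good` passes and the prefix of length `bad` fails.
    good, bad = 0, len(batch)
    while bad - good > 1:
        mid = (good + bad) // 2
        if passes(batch[:mid]):
            good = mid
        else:
            bad = mid

    merged = list(batch[:good])
    failed: List[str] = []
    retry: List[str] = []
    # Commits on the bad side are tested one by one; failures are reported
    # to their PRs, the rest can go back into the queue.
    for commit in batch[good:]:
        if passes(merged + [commit]):
            retry.append(commit)
        else:
            failed.append(commit)
    return merged, failed, retry
```

With four queued commits where only the third one breaks the tests, this merges the first two, reports the third as failed, and returns the fourth for a later attempt.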

Caveats

The required development is tightly coupled with Bitbucket, and there is serious debate about whether Bitbucket is the best tool for source code management. Most of the development would become useless if we migrated to a different provider.

Operation

The solution will run automatically, with the CI Squad monitoring and solving incidents.

Once fully deployed, development teams will interact as users through Slack or Bitbucket Web UI.

Repository code owners will have the possibility of skipping the merge queue.

Security Impact

The actual integration into the production codebase will be done by a service account, which will hold merge permissions in the repository. However, author and committer information is preserved in the git repository.

Performance Impact

CI load will be reduced, so the pipelines will perform better. As a tradeoff, when a quality gate fails because of a code change and the dependency between that gate and the change was not identified in advance, the failure will only surface at merge time, delaying delivery of the task.

Developer Impact

There will be a significant change in how developers interact with the SCM. Upon activation, the Merge button will no longer be available in Bitbucket, and developers will need to learn a new way of integrating their code into the common trunk.

Data Consumer Impact

N/A

Deployment

The mechanism can be developed while the current way of working is still in place, because it only triggers on demand for each merge request. It will be activated by restricting developers' permissions to merge Pull Requests.

A demo showing the new mechanism will be performed before activation.

Dependencies

N/A

References