Kubernetes cluster for CI/CD tools

Problem Description

Our current setup for running Continuous Integration (CI) tasks, building container images, applying Infrastructure as Code (IaC) and performing deployments to either staging or production is slow.

For instance, in one of our latest projects running in Kubernetes (gpi-tracker-interface), verifying a Pull Request can take around 10 minutes, building a deployable package once the change reaches the main branch takes around 15 more, and deploying takes another 4 minutes in staging and 4 in production. That is more than 30 minutes in the best case for an approved change to reach production. A considerable portion of this time is infrastructure overhead in the CI.
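As a quick check of the figures quoted for gpi-tracker-interface, the stage durations can be added up. This is only a sketch: the stage labels are chosen for illustration, the durations are the approximate ones given above, and the stages are assumed to run strictly one after another.

```python
# Approximate stage durations in minutes, as quoted for gpi-tracker-interface.
# The dictionary keys are illustrative labels, not real pipeline names.
stage_minutes = {
    "verify_pull_request": 10,
    "build_deployable_package": 15,
    "deploy_staging": 4,
    "deploy_production": 4,
}


def total_lead_time(stages: dict[str, int]) -> int:
    """Return the end-to-end time, assuming stages run strictly in sequence."""
    return sum(stages.values())


if __name__ == "__main__":
    print(f"Best-case lead time: {total_lead_time(stage_minutes)} minutes")
```

The sum comes to 33 minutes, which matches the "more than 30 minutes" figure.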

In addition, the adoption of the newest techniques (like GitOps) is problematic, sometimes leading us to "reinvent the wheel" and provide substandard solutions.

Background

Besides some legacy automation based on Ansible that runs on AWX, our only supported tool for verifying Pull Requests, performing deployments and driving IaC is Jenkins.

The Jenkins setup comprises a VPC with a single EC2 instance running the main Jenkins server (plus a test server for trying out Jenkins upgrades and features). The Jenkins EC2 plugin is installed and configured to start EC2 instances (Jenkins agents) on demand in a private subnet whenever a pipeline needs compute power. The server is configured to offer different agent flavors in terms of compute power and permissions in AWS accounts. Having to start a new EC2 instance is often the cause of the slowness.

Jenkins is a very flexible tool, often described as a "Swiss Army knife" for that reason, and it can express virtually any pipeline. However, it was first released in 2005, well before cloud native concepts were established, which makes it hard to configure and maintain. Since then, and especially in recent years, a whole new ecosystem of tools and concepts has emerged for cloud native projects, leaving Jenkins well behind in capabilities.

Over the years, we have developed a large number of primitives and fully fledged generic pipelines in a Jenkins library written in Groovy. Though powerful, the library is complex, hard for end users to understand, and hard to test.

Recently we have added support for GitHub as a source code management (SCM) tool. In addition to SCM, GitHub provides its own modern CI tool, which can be used either as a service in a pay-as-you-go model or within our own infrastructure.

Solution

Create a new Kubernetes cluster in the CI VPC.

By running workloads in a dedicated CI cluster that we host ourselves, in the same VPC and network as the current Jenkins infrastructure, we will get the same benefits and constraints we already have for Jenkins agents:

A known set of CIDRs and IPs that are already whitelisted in other parts of our infrastructure.
Ability to use already existing IAM roles in a native way, without sharing AWS credentials with third parties or even storing them.
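To illustrate the first point, the sketch below checks whether a source IP falls inside a set of already allowed CIDR blocks, using Python's standard ipaddress module. The CIDR ranges are hypothetical placeholders; the real ranges of the CI VPC are not given in this document.

```python
import ipaddress

# Hypothetical CIDR blocks standing in for the CI VPC subnets that other parts
# of the infrastructure already allow; the real ranges are not stated here.
ALLOWED_CIDRS = ["10.20.0.0/20", "10.20.16.0/20"]


def is_allowed(ip: str, cidrs: list[str] = ALLOWED_CIDRS) -> bool:
    """Return True if the IP address belongs to any of the allowed CIDR blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


if __name__ == "__main__":
    # A pod scheduled in the CI VPC gets an address inside the known ranges,
    # so existing firewall rules keep working without changes.
    print(is_allowed("10.20.5.17"))
    print(is_allowed("192.168.1.1"))
```

Because pods in the new cluster would draw their addresses from the same VPC subnets, no new firewall exceptions should be needed for them.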

The cluster is intended to:

Run Jenkins agents (through the Jenkins Kubernetes plugin). This will provide faster pipelines. It can be configured to mimic the EC2 plugin behavior, making it transparent for the end user, but it also allows pipelines to use Kubernetes pods natively.
Run ArgoCD (or an equivalent GitOps operator) for orchestrating GitOps deployments in a variety of environments.
Run Kaniko (or equivalent) for providing image builds as a service.
Run ephemeral ED8K environments for CI pipelines (currently we use the DXP cluster for that purpose).
Run self-hosted GitHub runners, enabling GitHub Actions as an opt-in for our projects and eventually phasing out most of our dependency on Jenkins for CI.
Eventually, run the main Jenkins server or even smaller ad-hoc Jenkins masters.
Provide flexibility for adding new tools to our ecosystem.
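As an illustration of the first item in the list above, this is roughly the shape of a Pod that could back one ephemeral Jenkins agent. It is a minimal sketch, not the actual configuration: in practice the Jenkins Kubernetes plugin generates the Pod from a pod template, and the pod name, label, service account and resource requests below are all hypothetical. The container name "jnlp" and the jenkins/inbound-agent image follow the plugin's usual conventions.

```python
import json


def jenkins_agent_pod(name: str, cpu: str = "1", memory: str = "2Gi") -> dict:
    """Build a Pod manifest for one ephemeral Jenkins agent (illustrative values)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "jenkins-agent"}},
        "spec": {
            # An agent pod runs one build and is then discarded.
            "restartPolicy": "Never",
            # Hypothetical service account that would map to an existing IAM role.
            "serviceAccountName": "ci-agent",
            "containers": [
                {
                    "name": "jnlp",
                    "image": "jenkins/inbound-agent",
                    # Different "flavors" would translate into different requests.
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(jenkins_agent_pod("agent-small"), indent=2))
```

The different compute flavors offered today through the EC2 plugin would map to different resource requests, and the AWS permission flavors to different service accounts.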

Details on each specific implementation will be provided either as amendments to this document or in separate blueprints.

Alternatives

Use any of the clusters we already have up and running, like DXP or Staging. The blast radius of a misconfiguration would be greater, and we would need to rethink all the current connections and firewalls between VPCs.
Do nothing and keep supporting CI/CD only through Jenkins agents on top of EC2 instances. We would keep suffering from unnecessary slowness in most CI pipelines, and our whole setup would become more and more obsolete over time, as adopting new tools and techniques would remain ad hoc and troublesome.
Create the cluster, but use it only for Jenkins, supporting neither GitHub Actions nor ArgoCD. Though we may consider Jenkins "good enough" for CI and discard GitHub Actions or any other CI tool for the time being, we consider it essential to have a modern GitOps tool for deployments in Kubernetes instead of reinventing the wheel with ad-hoc Jenkins pipelines.
Run GitHub Actions as a service hosted by GitHub. Though it would be an alternative for some simple CI, more complex scenarios would need AWS keys stored as secrets on GitHub, and that would go against IAM security best practices.
Run one ArgoCD instance on each cluster. Though possible, it would complicate promotion of artifacts between clusters and creation of ephemeral environments.

Caveats

It will be a new cluster to monitor and maintain.
It will add an extra layer of complexity for CI maintainers.
Fast and streamlined CI/CD pipelines are great for efficiency, but automation becomes dangerous when something goes wrong, because it can destroy content as fast as it deploys it. We will need to be extremely cautious with the implementation of new tools.

Operation

The cluster will be created as code, with the same Terraform modules we are currently using for DXP, Staging and Production clusters, thus getting the same support in terms of upgrades, logging, monitoring, secrets management, autoscaling, etc.

Security Impact

Access to the cluster API server shall be private, reachable from outside only through the VPN.

By adopting GitOps for our deployments and decoupling deployment from CI, no one will be able to deploy arbitrary workloads through malicious CI pipelines in Jenkins that try to bypass our security checks, as Jenkins as a whole will simply lack the permissions to do so.
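The decoupling described above can be pictured as follows: CI only records the desired state in Git, and a separate operator with cluster permissions compares that state with what is running and decides what to change. The function below is a simplified sketch of that comparison step, not how ArgoCD is implemented. Apart from gpi-tracker-interface, the application names and all image tags are hypothetical.

```python
def plan_sync(desired: dict[str, str], live: dict[str, str]) -> dict[str, list[str]]:
    """Compare desired state (from Git) with live state (from the cluster).

    Both arguments map an application name to an image tag. The result lists
    which applications the operator would create, update or delete.
    """
    return {
        "create": sorted(app for app in desired if app not in live),
        "update": sorted(
            app for app in desired if app in live and desired[app] != live[app]
        ),
        "delete": sorted(app for app in live if app not in desired),
    }


if __name__ == "__main__":
    # Hypothetical states: Git declares two apps, the cluster runs two others.
    desired_state = {"gpi-tracker-interface": "1.4.2", "example-new-app": "2.0.0"}
    live_state = {"gpi-tracker-interface": "1.4.1", "example-old-app": "0.9.0"}
    print(plan_sync(desired_state, live_state))
```

In this model a Jenkins pipeline can at most propose a change to the Git repository, where it is subject to review, while only the operator holds the credentials to apply it.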

Access to the Jenkins UI is currently protected by VPN and by SSO in the Jenkins server. If we run Jenkins and other tools inside a K8s cluster, we can provide ingress through an authenticated proxy with SSO for any tool installed, thus removing the need for VPN access.

Performance Impact

Overall, execution time for CI/CD pipelines will improve, as starting a Pod in Kubernetes is considerably faster than spinning up an EC2 instance.

Operation costs should decrease as well, as the pipelines will be more efficient. In the worst case, costs should not increase.

Developer Impact

Upon implementation, the change should be transparent for development teams. However, as it will allow us to implement new tools, development teams will need to keep up with new capabilities and deprecations over time.

Data Contracts

N/A

Deployment

In a first iteration, we will just create a Kubernetes cluster in the CI VPC. Further capabilities, like migrating workloads from EC2 to K8s, rolling out ArgoCD, providing builds as a service, etc., will come afterwards in separate work packages.

Dependencies

None
