Kubernetes Service Requirements

Problem description

Kubernetes is a flexible framework for delivering resources, but most workloads need only a small set of Kubernetes features to run a service successfully.

From the perspective of the SRE Team, supporting every Kubernetes feature would be expensive, so to avoid unnecessary spending the platform design may support only selected Kubernetes features.

Moreover, a large number of choices brings a steep learning curve.

This introduces uncertainty about what a development team needs to deliver before its workload is allowed to consume resources. This ambiguity translates directly into increased development time.

The intention of this document is to establish the minimal set of objects a service is expected to provide in order to be accepted on the Kubernetes platform.

Background

Before Kubernetes, the majority of workloads ran on Amazon ECS. A well-established pattern could be followed to add a new service. The interface to the infrastructure was provided by Terraform modules. Creating new resource definitions required a pull request and approvals from the SRE Team.

While this worked, it also created bottlenecks, because pull requests were approved by humans and the infrastructure module interfaces had to be learned.

Solution

The Kubernetes design allows both problems to be solved. First, its well-established interface definitions and the widespread knowledge of them make it easier for new developers to adopt. Secondly, once a resource definition is provided to Kubernetes, it can be picked up and created automatically. No human intervention is required.

Relying on the Kubernetes API definition and removing humans from the approval of new workloads is expected to reduce deployment time.

Scope

The Kubernetes platform is complicated. To make decisions easier, the scope was deliberately narrowed to:

the definition of the interface between the team that requests resources and the team that is responsible for providing them.

This means that, in particular, the following topics are considered out of scope:

validation and enforcement tools
configuration delivery methods such as Helm charts
methods of calculating service availability
platform features that are not essential to accept and run workloads

Dictionary

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Kubernetes platform

The Kubernetes platform is a set of interfaces for accessing resources such as CPU, memory, network, log collection, metrics collection, and alerting. The full Platform Capabilities Catalog would be provided in a separate RFC.

Helm charts

This RFC describes only the set of Kubernetes API objects that a service must deliver in order to be accepted on the platform. The means of crafting object definitions are out of scope of this document. Please refer to the [Helm] RFC on that matter.

Validation and enforcement

Tools for validating and enforcing requirements are considered an implementation detail and have been moved out of this document. The focus is placed on specifying the resource requirements that allow a service to run on the cluster. Tools that verify whether a service meets the requirements, and that enforce those requirements, are important but not vital for a development team to deliver a correct service configuration.

SRE Team support level

Depending on which service requirements are met, the SRE Team would provide the following support:

minimal service requirements: the SRE Team would make sure the service runs with the number of pods defined in its resource requirements.
regular service requirements: the SRE Team would act on alert notifications defined by the service, according to the Runbook provided by the development team.

Minimal service requirements

The set of requirements that a development team MUST deliver before any workload is accepted on the Kubernetes cluster.

The Code

The development team provides service code that SHOULD be available for download from the ECR image registry. This implies an OCI-compliant image format. No other format is supported.

Mandatory Kubernetes API Objects

The development team SHOULD provide its request for computational resources in the form of a Deployment, Job or CronJob object.

Namespace

Computational resource requests (e.g. Deployment, Job) MUST run in a namespace, which MUST NOT be the default namespace.

Resource requirements

Each pod container MUST define resources.requests for cpu and memory, and resources.limits for memory. resources.limits.cpu SHOULD NOT normally be set, so that a pod can use its node's currently spare CPU time.

Each Kubernetes object (Pod, Deployment, Ingress, etc.) defined by an application MUST define the label app with the application name as its value and the label team with the owning team's name as its value.
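
Taken together, the namespace, resource, and label requirements above could look like the following Deployment. This is a minimal sketch rather than a reference manifest: the names dummy-application, dummy-team and dummy-namespace, the image URI, the replica count, and all resource values are placeholders assumed for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-application
  namespace: dummy-namespace      # hypothetical; any namespace except "default"
  labels:
    app: dummy-application        # application name
    team: dummy-team              # hypothetical owning team
spec:
  replicas: 2                     # assumed value
  selector:
    matchLabels:
      app: dummy-application
  template:
    metadata:
      labels:
        app: dummy-application
        team: dummy-team
    spec:
      containers:
        - name: dummy-application
          # Placeholder ECR image URI; account ID, region and tag are assumptions.
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/dummy-application:1.0.0
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 128Mi       # resources.limits.cpu is intentionally not set
```

Setting the memory limit equal to the memory request is one common choice, not something this document mandates.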

Platform responsibility

As the platform only provides resources, it is responsible for making sure services are able to consume them, and in particular for keeping the number of running pods equal to the number declared in the Deployment object.

CPU and memory SLO

The platform would provision requested computational resources within 15 minutes. If this limit is not met, an alert would be raised with severity warning. If the resources still cannot be provisioned after another 15 minutes, a notification with severity critical would be sent to the on-call rota.
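
One way the platform could express this escalation is a pair of Prometheus alerting rules. The sketch below assumes that the Prometheus Operator (which defines the PrometheusRule resource) and kube-state-metrics are installed on the cluster. The rule name, the monitoring namespace, and the choice of expression are illustrative assumptions, not the platform's actual configuration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-provisioning-slo   # hypothetical name
  namespace: monitoring             # hypothetical namespace
spec:
  groups:
    - name: provisioning-slo
      rules:
        # Fewer pods available than declared for 15 minutes: warning.
        - alert: ResourcesNotProvisioned
          expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
          for: 15m
          labels:
            severity: warning
        # Still not provisioned after another 15 minutes (30 in total): critical.
        - alert: ResourcesNotProvisioned
          expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
          for: 30m
          labels:
            severity: critical
```

This covers Deployments only; Job and CronJob workloads would need their own expressions.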

Regular service requirements

The set of requirements that a development team MUST deliver in order for the SRE Team to respond to service-defined alert notifications.

Liveness Probe

The service workload definition MUST define a liveness probe path.
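
As an illustration, an HTTP liveness probe on a container might be declared as follows. This is a sketch only: the path /healthz, the port 8080, the timing values, and the image URI are assumptions, and the resource requests required elsewhere in this document are omitted for brevity.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dummy-application
  labels:
    app: dummy-application
    team: dummy-team              # hypothetical owning team
spec:
  containers:
    - name: dummy-application
      # Placeholder image URI.
      image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/dummy-application:1.0.0
      ports:
        - name: http
          containerPort: 8080     # assumed port
      livenessProbe:
        httpGet:
          path: /healthz          # assumed liveness path exposed by the service
          port: http
        initialDelaySeconds: 10   # assumed; wait before the first probe
        periodSeconds: 10         # assumed; probe every 10 seconds
        failureThreshold: 3       # restart the container after 3 consecutive failures
```

In a Deployment, the same livenessProbe block sits under spec.template.spec.containers.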

Business metrics

The service workload MUST expose metrics as defined in the [Monitoring Platform] RFC. The service MUST provide a ServiceMonitor object pointing to the metrics endpoint exposed by the service.
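
A ServiceMonitor selects a Service by label and scrapes one of its named ports. The sketch below assumes the ServiceMonitor resource from the Prometheus Operator (monitoring.coreos.com/v1). The port name metrics, the port number, the /metrics path, and the scrape interval are placeholders; the [Monitoring Platform] RFC may prescribe different values or additional labels.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: dummy-application
  namespace: dummy-namespace      # hypothetical namespace
  labels:
    app: dummy-application
    team: dummy-team              # hypothetical owning team
spec:
  selector:
    app: dummy-application
  ports:
    - name: metrics               # assumed port name, referenced by the ServiceMonitor
      port: 9090                  # assumed port number
      targetPort: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dummy-application
  namespace: dummy-namespace
  labels:
    app: dummy-application
    team: dummy-team
spec:
  selector:
    matchLabels:
      app: dummy-application      # matches the Service above
  endpoints:
    - port: metrics               # must match the Service port name
      path: /metrics              # assumed metrics path
      interval: 30s               # assumed scrape interval
```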

Metrics visualisation

The service MUST provide a ConfigMap object with the mandatory label grafana_dashboard: "1" and the annotation metadata.annotations.grafana-dashboard-target-directory.

apiVersion: v1
kind: ConfigMap
metadata:
 annotations:
   grafana-dashboard-target-directory: "folder_name_for_grouping_dashboards"
 name: "service-name"
 labels:
   grafana_dashboard: "1"
data:
 fasentenialproxy-main.json: |-
   {
     "annotations": {
       "list": [

Logs collection

The service workload definition MUST provide the annotation spec.template.metadata.annotations.elasticsearch-index with the name of the Elasticsearch index to create.

Logs SHOULD be provided in JSON format, and their schema MUST be consistent. Schema consistency is an Elasticsearch index requirement.
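
The annotation belongs on the pod template, not on the Deployment's own metadata. A minimal sketch follows, in which the index name, namespace, team, image URI, and resource values are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-application
  namespace: dummy-namespace      # hypothetical namespace
  labels:
    app: dummy-application
    team: dummy-team              # hypothetical owning team
spec:
  selector:
    matchLabels:
      app: dummy-application
  template:
    metadata:
      labels:
        app: dummy-application
        team: dummy-team
      annotations:
        # Assumed index name; this is spec.template.metadata.annotations.elasticsearch-index.
        elasticsearch-index: "dummy-application"
    spec:
      containers:
        - name: dummy-application
          # Placeholder image URI.
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/dummy-application:1.0.0
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 128Mi
```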

Alert definitions

The service MUST provide alerting rules, as described in the [Monitoring Platform] RFC, using the Kubernetes object PrometheusRule.
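
For illustration, a PrometheusRule with a single alert might look like the sketch below. It assumes the Prometheus Operator resource (monitoring.coreos.com/v1). The metric http_requests_total, its status label, the 5% threshold, and the runbook_url annotation are hypothetical; the [Monitoring Platform] RFC remains the authority on naming and required labels.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dummy-application
  namespace: dummy-namespace      # hypothetical namespace
  labels:
    app: dummy-application
    team: dummy-team              # hypothetical owning team
spec:
  groups:
    - name: dummy-application.rules
      rules:
        - alert: DummyApplicationHighErrorRate
          # Hypothetical metric and labels: share of HTTP 5xx responses over 5 minutes.
          expr: |
            sum(rate(http_requests_total{app="dummy-application", status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{app="dummy-application"}[5m])) > 0.05
          for: 10m                # condition must hold for 10 minutes before firing
          labels:
            severity: critical
          annotations:
            summary: More than 5% of requests are failing
            # Placeholder link to the runbook for this alert.
            runbook_url: https://example.com/runbooks/dummy-application/high-error-rate
```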

Runbooks

The service MUST provide a runbook (aka monitoring plan) for each defined alert type. It SHOULD contain actionable steps to follow once a notification is delivered.

High availability

A service processing requests from users (e.g. over HTTP) MUST run in at least N+1 pods, where N is the number of pods required to handle the load at peak time.

The additional pod is required because, by design, a single worker node can fail at any time. Two nodes failing at the same time is unlikely. Additionally, the pod needs to provide topologySpreadConstraints with topologyKey: "kubernetes.io/hostname" in the format shown:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: "kubernetes.io/hostname"
      whenUnsatisfiable: "ScheduleAnyway"
      labelSelector:
        matchLabels:
          app: "dummy-application"

Workers consuming events (Redis, Kafka, SQS, SNS) MAY run as a single instance.
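
For a user-facing service, the N+1 rule and the spread constraint would normally be combined in a Deployment, with the constraint placed in the pod template. The sketch below assumes N = 3 pods are enough for peak load, so it declares 4 replicas. The namespace, team, image URI, and resource values are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-application
  namespace: dummy-namespace      # hypothetical namespace
  labels:
    app: dummy-application
    team: dummy-team              # hypothetical owning team
spec:
  replicas: 4                     # N + 1, assuming N = 3 pods handle peak load
  selector:
    matchLabels:
      app: dummy-application
  template:
    metadata:
      labels:
        app: dummy-application
        team: dummy-team
    spec:
      # Spread pods across worker nodes so one node failure removes as few pods as possible.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: "ScheduleAnyway"
          labelSelector:
            matchLabels:
              app: "dummy-application"
      containers:
        - name: dummy-application
          # Placeholder image URI.
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/dummy-application:1.0.0
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 128Mi
```

With whenUnsatisfiable: "ScheduleAnyway" the scheduler treats the spread as a preference, so pods can still share a node when no better placement exists.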

Life cycle management

The service process MUST handle the SIGTERM signal by performing its shutdown procedure. By default, Kubernetes waits 30 seconds before sending SIGKILL. If the service needs more time to shut down gracefully, the field terminationGracePeriodSeconds MAY be set in the Pod definition.
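
The grace period is set at the pod level, not per container. A minimal sketch, assuming a service that needs up to 60 seconds to shut down (the value, team, and image URI are placeholders, and resource requests are omitted for brevity):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dummy-application
  labels:
    app: dummy-application
    team: dummy-team              # hypothetical owning team
spec:
  # Time between SIGTERM and SIGKILL; the Kubernetes default is 30 seconds.
  terminationGracePeriodSeconds: 60
  containers:
    - name: dummy-application
      # Placeholder image URI.
      image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/dummy-application:1.0.0
```

In a Deployment, the field sits under spec.template.spec.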

Idempotency

The process SHOULD be prepared for abrupt termination with SIGKILL and be able to resume service after termination.

Caveats

Accepting services that meet only the minimal set of requirements as production workloads would reduce the time from idea to running code, but at the cost of an added risk of unnoticed failures.

It is out of scope of this document to discuss how the SLA would be calculated depending on the set of requirements implemented by a service.

Operation

The requirements defined here are going to be validated and enforced automatically on the Kubernetes cluster.

Security Impact

Please see the Caveats section.

Performance Impact

Not applicable.

Developer Impact

Development teams should be aware of the difference between the minimal and regular requirements for accepting a workload on the Kubernetes platform.

The development team is now responsible for delivering the set of object definitions required by the Kubernetes platform. It is up to the development team to choose the support level for its service.

Data Contracts

None.

Deployment

For details on deployment of required objects please refer to [Helm] RFC.

Dependencies

In order to accept workloads based on this document, two additional topics need to be addressed:

access separation SRE-2209
multitenancy SRE-1456
