Balancing Karpenter Consolidation & Cluster Efficiency with Critical Workloads, using Kyverno Policies

Introduction

Consider this scenario: You operate an Amazon EKS cluster hosting hundreds of microservices that make up a product suite. You have chosen Karpenter as your cluster autoscaler & enabled consolidation for maximum efficiency. But there’s a catch: a number of these microservices have pods that must not be interrupted, either because they aren’t designed to handle interruptions, or because their interruption has a user-visible impact.

You can configure Karpenter to never disrupt these pods, but if they’re distributed across several nodes of the cluster, you’ll end up blocking consolidation on all of those nodes, which defeats the purpose of consolidation! This article explores both how to prevent Karpenter from disrupting such critical pods (identified dynamically at runtime) & how to achieve maximum consolidation efficiency without disrupting critical workloads.
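For reference, consolidation is enabled on the Karpenter NodePool itself. Here’s a minimal sketch using the Karpenter v1 API (on older v1beta1 versions, the policy value is WhenUnderutilized instead):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  # template (node requirements, nodeClassRef, etc.) omitted for brevity
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m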

Why Kyverno?

Kyverno is a Kubernetes-native policy engine. Kyverno policies can validate, mutate, generate & clean up Kubernetes resources, & even verify image signatures & artifacts. Kyverno policies are written as K8s-native YAML manifests & can be managed by standard K8s tooling like kubectl & kustomize.
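If Kyverno isn’t already running in your cluster, it can be installed with its official Helm chart (repo URL & chart name as per the Kyverno docs):

helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno --namespace kyverno --create-namespace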

Preventing Karpenter from disrupting a pod is as simple as annotating it with karpenter.sh/do-not-disrupt: "true". However, incorporating this into your product’s Helm chart may not be the way to go if the product is designed to be deployable on any Kubernetes platform, which may or may not be Karpenter-managed. It’s much better to use a mutating Kyverno policy to add this annotation to critical pods when they’re created in your EKS cluster.
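For comparison, this is what the annotation would look like if it were hard-coded on a pod; the pod name & image here are purely illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: critical-app-a-0
  annotations:
    # Tells Karpenter to never voluntarily disrupt this pod
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
  - name: app
    image: example.com/critical-app-a:1.0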

Kyverno can also help us avoid spreading critical pods across the cluster, which would block consolidation on several nodes. If all critical pods were instead colocated on one or two nodes, only those nodes would be blocked from consolidation, allowing the rest of the cluster to operate efficiently. This can be achieved by adding inter-pod affinities to the critical pods using a mutating Kyverno policy. Once again, it doesn’t make sense to add these affinities to the application’s source code, since we’re only colocating these pods to work around Karpenter consolidation.

Annotate Critical Pods

First, to annotate pods as do-not-disrupt, create a Kyverno policy like this:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: do-not-disrupt-critical-pods
spec:
  rules:
  - name: do-not-disrupt-critical-pods
    match:
      any:
      - resources:
          kinds: [Pod]
          names:
          - critical-app-a-*
          - critical-app-b-*
          - critical-app-c-*
    mutate:
      patchStrategicMerge:
        metadata:
          annotations:
            +(karpenter.sh/do-not-disrupt): "true"

Note that:

  • This is a ClusterPolicy, not a namespaced resource, so it works across all namespaces
  • It has one rule that:
    • Looks for pods with names starting with critical-app-a/b/c- &
    • Annotates them with karpenter.sh/do-not-disrupt: "true"

Henceforth, whenever a critical app pod is created in the cluster, it will be annotated automatically.
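To verify, inspect a freshly created critical pod (the pod name below is hypothetical):

kubectl get pod critical-app-a-6d9f7 -o jsonpath='{.metadata.annotations.karpenter\.sh/do-not-disrupt}'
# Expected output: true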

Colocate Critical Pods

Now, for the next part: Colocating critical pods on the same node. Outside of Kyverno, you would do this by defining pod affinity on the pods/deployments:

spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - critical-app-a
              - critical-app-b
              - critical-app-c

This works well if your critical apps come with well-defined labels like critical-app-a/b/c. If, however, there’s ever a bit of unpredictability in the labels, such as when you allow multiple installs of app A’s Helm chart with different release names & make the release name part of the label, the above approach won’t work, because wildcards cannot be used in label selectors.
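For example, with the release name baked into the label (release names here are hypothetical), every install of the same chart carries a different label value, so no single matchExpressions entry can select them all:

metadata:
  labels:
    # Release name is part of the label value, so it differs per install:
    app: prod-eu-critical-app-a   # another install might carry app: prod-us-critical-app-a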

In such a scenario, we can again rely on Kyverno to label our apps consistently; that label can then be used to define their pod affinities. That’s what we’ll do here. We already have a policy that annotates the desired pods, & we can use the same policy to apply the label as well. Let’s reuse karpenter.sh/do-not-disrupt as the label key, although any other label will work too. The updated policy is as follows:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: do-not-disrupt-critical-pods
spec:
  rules:
  - name: do-not-disrupt-critical-pods
    match:
      any:
      - resources:
          kinds: [Pod]
          names:
          - critical-app-a-*
          - critical-app-b-*
          - critical-app-c-*
    mutate:
      patchStrategicMerge:
        metadata:
          annotations:
            +(karpenter.sh/do-not-disrupt): "true"
          labels:
            +(karpenter.sh/do-not-disrupt): "true"

This policy adds both the karpenter.sh/do-not-disrupt: "true" annotation & label to all critical pods when they’re created. Now, the pod affinity can be much simpler:

spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
            - key: karpenter.sh/do-not-disrupt
              operator: In
              values: ["true"]

Now it’s just a matter of creating the Kyverno policy that will “inject” this pod affinity definition into the desired pods. There is a catch though! Since both the label & the affinity are added by Kyverno, we must employ Kyverno’s “cascading” mutating rules: a single Kyverno policy declares 2 rules, the first applying the label & the second using that label as the selector to apply the affinity. Kyverno applies a policy’s mutate rules in the order they’re declared, so rule 2 sees the label added by rule 1. The resulting policy looks like this:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: colocate-and-do-not-disrupt-critical-pods
spec:
  rules:
  - name: do-not-disrupt-critical-pods
    match:
      any:
      - resources:
          kinds: [Pod]
          names:
          - critical-app-a-*
          - critical-app-b-*
          - critical-app-c-*
    mutate:
      patchStrategicMerge:
        metadata:
          annotations:
            +(karpenter.sh/do-not-disrupt): "true"
          labels:
            +(karpenter.sh/do-not-disrupt): "true"
  - name: colocate-critical-pods
    match:
      any:
      - resources:
          kinds: [Pod]
          selector:
            matchLabels:
              karpenter.sh/do-not-disrupt: "true"
    mutate:
      patchStrategicMerge:
        spec:
          affinity:
            podAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  topologyKey: kubernetes.io/hostname
                  labelSelector:
                    matchExpressions:
                    - key: karpenter.sh/do-not-disrupt
                      operator: In
                      values: ["true"]

In this policy:

  • Rule 1 is the same as before
  • Rule 2:
    • First looks for pods with the do-not-disrupt label (applied by rule 1)
    • Then adds pod affinities to these pods

A topologyKey of kubernetes.io/hostname ensures that we’re colocating pods on the same host/node, as opposed to merely the same cloud zone, region, etc.
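Once the policy is in place, you can confirm the colocation by listing the critical pods along with the nodes they landed on:

kubectl get pods -A -l karpenter.sh/do-not-disrupt=true -o wide
# The NODE column should show the pods packed onto just one or two nodes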

Conclusion

This article demonstrated a way to make full use of Karpenter’s consolidation abilities to maximize your cluster’s efficiency, while still ensuring that critical workloads are not interrupted. And we set all of this up dynamically, using just a couple of Kyverno policies!

About the Author ✍🏻

Harish KM is a Principal DevOps Engineer at QloudX & a top-ranked AWS Ambassador since 2020. 👨🏻‍💻

With over a decade of industry experience as everything from a full-stack engineer to a cloud architect, Harish has built many world-class solutions for clients around the world! 👷🏻‍♂️

With over 20 certifications in cloud (AWS, Azure, GCP), containers (Kubernetes, Docker) & DevOps (Terraform, Ansible, Jenkins), Harish is an expert in a multitude of technologies. 📚

These days, his focus is on the fascinating world of DevOps & how it can transform the way we do things! 🚀
