Tech

Workshop Ingeniería del caos sobre Kubernetes con Litmus

July 7, 2021 by Bluetab

Workshop Ingeniería del caos sobre Kubernetes con Litmus

LitmusChaos nace con el objetivo de ayudar a desarrolladores y SREs (Site Reliability Engineering ) de Kubernetes a identificar puntos débiles y mejorar la resiliencia de sus aplicaciones/plataformas proporcionando un marco de trabajo completo.

Sus principales ventajas respecto a otras herramientas son:

Experimentos declarativos mediante K8S CRDs (Custom Resource Definition): todos los componentes (planificación, ejecución, parametrización, etc.) de un experimento se definen dentro del ámbito de kubernetes haciendo uso de YAML.
Múltiples experimentos predefinidos: dispone de un conjunto de experimentos suficientemente amplio para dar cobertura a los principales recursos de K8s.
SDK en Go/Python/Ansible para desarrollar tus propios experimentos: dispone de un metodología de desarrollo bien definida para construir experimentos que se adapten a tus necesidades particulares.
Creación de workflows a través de GUI: con Litmus UI Portal puedes crear workflows complejos utilizando todos los experimentos predefinidos mediante interfaz web.
Fácil integración en pipelines CI/CD: invocar y obtener el resultado de un experimento es extremadamente fácil.
Exportación de métricas: puedes exportar distintas métricas de tus experimentos directamente a Prometheus.

El producto está liberado bajo licencia Apache-2.0, dispone de una amplia comunidad de desarrolladores y desde 2020 pertenece a Cloud Native Computing Foundation.

Objetivos del workshop

Conocer los principales componentes de un experimento y realizar su despliegue
Analizar detalladamente la ejecución de tres experimentos (criterios de entrada, hipótesis, observaciones y resultados)
Ver las múltiples opciones referentes a planificación de experimentos.
Visualizar los resultados mediante Prometheus/Grafana.
Analizar un caso de pruebas de resiliencia + test de rendimiento con JMeter.
Principales funcionalidades de Litmus UI Portal

Preparación de consola

Recomendamos abrir una consola y crear 4 paneles:

Panel principal (ejecutaremos todo el contenido del workshop)
Monitorización de la aplicación de test
Monitorización de pods
Monitorización de eventos

Clonación de repositorio

git clone https://github.com/angelmaroco/litmus-chaos-engineering-workshop.git
cd litmus-chaos-engineering-workshop

Creación de entorno de pruebas K8s con minikube

Para este workshop vamos a utilizar minikube pero Litmus puede ser desplegado en cualquier servicio gestionado tipo EKS/AKS/GKE.

Minikube requiere de un gestor de contenedores o máquinas virtuales (Docker, Hyperkit, Hyper-V, KVM, Parallels, Podman, VirtualBox, or VMWare).

Recomendamos hacer uso de docker. En el caso de no estar disponible en el sistema, puedes realizar la instalación con los siguientes comandos:

if ! [ -x "$(command -v docker)" ]; then
    curl -fsSL https://get.docker.com -o /tmp/get-docker.sh
    sh /tmp/get-docker.sh
fi

# install kubectl
curl -Ls "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" --output /tmp/kubectl
sudo install /tmp/kubectl /usr/local/bin/kubectl
kubectl version --client

# install minikube
curl -Ls https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 --output /tmp/minikube-linux-amd64
sudo install /tmp/minikube-linux-amd64 /usr/local/bin/minikube
minikube version

# starting minikube
minikube start --cpus 2 --memory 4096

# enabled ingress & metrics servers
minikube addons enable ingress
minikube addons enable metrics-server

# enabled tunnel & dashboard
minikube tunnel > /dev/null &
minikube dashboard > /dev/null &

Creación de namespaces K8s

# create namespace testing
kubectl apply -f src/base/testing-ns.yaml

# create namespace litmus
kubectl apply -f src/base/litmus-ns.yaml

# create namespace monitoring (prometheus + grafana)
kubectl apply -f src/base/monitoring-ns.yaml

TESTING_NAMESPACE="testing"
LITMUS_NAMESPACE="litmus"
MONITORING_NAMESPACE="monitoring"

Despliegue de aplicación de test

Desplegamos una aplicación de test para poder ejecutar los experimentos de litmus.

nginx-deployment.yaml: creación de despliegue “app-sample”, con recursos de cpu/memoria “limits”/”request” y configuración de “readinessProbe”. Exponemos el servicio en el puerto 80 a través de un balanceador.
nginx-hpa.yaml: creación de Horizontal Pod Autoscaler (min 2 réplicas / max 10 réplicas)

# deployment
kubectl apply -f src/nginx/nginx-deployment.yaml --namespace="${TESTING_NAMESPACE}"

# enable hpa
kubectl apply -f src/nginx/nginx-hpa.yaml --namespace="${TESTING_NAMESPACE}"

# expose service 
kubectl expose deployment app-sample --type=LoadBalancer --port=80  -n "${TESTING_NAMESPACE}"

# wait deployment
kubectl wait --for=condition=available --timeout=60s deployment/app-sample -n "${TESTING_NAMESPACE}"

# get pods
kubectl get pods -n "${TESTING_NAMESPACE}"

#-----------------------------------------

NAME                          READY   STATUS    RESTARTS   AGE
app-sample-7ff489dbd5-82ppw   1/1     Running   0          45m
app-sample-7ff489dbd5-jg9vh   1/1     Running   0          45m

# get service
kubectl get services -n "${TESTING_NAMESPACE}"

# -----------------------------------------

NAME         TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)        AGE
app-sample   LoadBalancer   10.109.196.239   10.109.196.239   80:30020/TCP   3m54s

En PANEL 2 ejecutar:

TESTING_NAMESPACE='testing'
URL_SERVICE=$(minikube service app-sample --url -n "${TESTING_NAMESPACE}")
while true; do sleep 5; curl --connect-timeout 2 -s -o /dev/null -w "Response code %{http_code}"  ${URL_SERVICE}; echo -e ' - '$(date);done

En PANEL 3 ejecutar:

TESTING_NAMESPACE='testing'
watch -n 1 kubectl get pods -n "${TESTING_NAMESPACE}"

En PANEL 4 ejecutar:

kubectl get events -A -w

Despliegue Chaos Experiments

# litmus operator & experiments
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.0.yaml -n "${LITMUS_NAMESPACE}"

kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.13.0\?file\=charts/generic/experiments.yaml -n "${TESTING_NAMESPACE}"

kubectl get chaosexperiments -n "${TESTING_NAMESPACE}"

# ----------------------------------------------------

NAME                      AGE
container-kill            6s
disk-fill                 6s
disk-loss                 6s
docker-service-kill       6s
k8-pod-delete             6s
k8-service-kill           6s
kubelet-service-kill      6s
node-cpu-hog              6s
node-drain                6s
node-io-stress            6s
node-memory-hog           6s
node-poweroff             6s
node-restart              6s
node-taint                6s
pod-autoscaler            6s
pod-cpu-hog               6s
pod-delete                6s
pod-io-stress             6s
pod-memory-hog            6s
pod-network-corruption    6s
pod-network-duplication   6s
pod-network-latency       6s
pod-network-loss          6s

Despliegue servicios monitorización: Prometheus + Grafana

Litmus permite exportar las métricas de los experimentos a Prometheus a través de chaos-exporter.

kubectl -n ${MONITORING_NAMESPACE} apply -f src/litmus/monitoring/utils/prometheus/prometheus-operator/

kubectl -n ${MONITORING_NAMESPACE} apply -f src/litmus/monitoring/utils/metrics-exporters-with-service-monitors/kube-state-metrics/

kubectl -n ${MONITORING_NAMESPACE} apply -f src/litmus/monitoring/utils/alert-manager-with-service-monitor/

kubectl -n ${LITMUS_NAMESPACE} apply -f src/litmus/monitoring/utils/metrics-exporters-with-service-monitors/litmus-metrics/chaos-exporter/

kubectl -n ${MONITORING_NAMESPACE} apply -f src/litmus/monitoring/utils/prometheus/prometheus-configuration/

kubectl -n ${MONITORING_NAMESPACE} apply -f src/litmus/monitoring/utils/grafana/

kubectl -n ${MONITORING_NAMESPACE} apply -f src/litmus/monitoring/utils/metrics-exporters-with-service-monitors/node-exporter/

# wait deployment
kubectl wait --for=condition=available --timeout=60s deployment/grafana -n ${MONITORING_NAMESPACE}
kubectl wait --for=condition=available --timeout=60s deployment/prometheus-operator -n ${MONITORING_NAMESPACE}

echo "Acceso dashboard --> $(minikube service grafana -n ${MONITORING_NAMESPACE} --url)/d/nodepodmetrics/node-and-pod-chaos-metrics?orgId=1&refresh=5s"

Para este workshop hemos personalizado un dashboard de grafana donde visualizaremos:

Timelime de experimentos ejecutados
4 gráficas tipo “Gauge” con número de total de experimentos, estado Pass, estado Fail y estado Awaited.
Consumo de CPU nivel nodo
Consumo de CPU a nivel POD (app-sample)
Consumo de memoria nivel nodo
Consumo de memoria a nivel POD (app-sample)
Tráfico red (IN/OUT) nivel nodo
Tráfico red (IN/OUT) nivel POD (app-sample)

Datos acceso grafana:

usuario: admin
password: admin

Creación de anotación "litmuschaos"

Para habilitar la ejecución de experimentos contra nuestro deployment, necesitamos añadir la anotación litmuschaos.io/chaos=“true“. Como veremos más adelante, todos los experimentos tienen la propiedad annotationCheck: “true”.

# add annotate (enable chaos)
kubectl annotate deploy/app-sample litmuschaos.io/chaos="true" -n "${TESTING_NAMESPACE}"

kubectl describe deploy/app-sample -n "${TESTING_NAMESPACE}"

# -----------------------------------------------------------

Name:                   app-sample
Namespace:              testing
CreationTimestamp:      Mon, 29 Mar 2021 09:35:53 +0200
Labels:                 app=app-sample
                        app.kubernetes.io/name=app-sample
Annotations:            deployment.kubernetes.io/revision: 1
                        litmuschaos.io/chaos: true # <-- HABILITAMOS EXPERIMENTOS
Selector:               app.kubernetes.io/name=app-sample
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate

Detalle componentes de un experimento

Service Account, Role y RoleBinding

Cada experimento debe tener asociado un ServiceAccount, un Role para definir permisos y un RoleBinding para relacionar el ServiceAccount/Role.

Podéis encontrar todas las definiciones dentro de src/litmus/nombre-experimento/nombre-experimento-sa.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: container-kill-sa
  namespace: testing
  labels:
    name: container-kill-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: container-kill-sa
  namespace: testing
  labels:
    name: container-kill-sa
    app.kubernetes.io/part-of: litmus
rules:
  - apiGroups: [""]
    resources:
      ["pods", "pods/exec", "pods/log", "events", "replicationcontrollers"]
    verbs:
      ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "get", "delete", "deletecollection"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets", "replicasets"]
    verbs: ["list", "get"]
  - apiGroups: ["apps.openshift.io"]
    resources: ["deploymentconfigs"]
    verbs: ["list", "get"]
  - apiGroups: ["argoproj.io"]
    resources: ["rollouts"]
    verbs: ["list", "get"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: container-kill-sa
  namespace: testing
  labels:
    name: container-kill-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: container-kill-sa
subjects:
  - kind: ServiceAccount
    name: container-kill-sa
    namespace: testing

Definición ChaosEngine

Para facilitar la comprensión, hemos dividido en 3 secciones el contenido de un experimento. Podéis encontrar todas las definiciones dentro de src/litmus/nombre-experimento/chaos-engine-.yaml

Especificaciones generales

En esta sección especificaremos atributos comunes a todos los experimentos. Para este workshop, debido a que estamos realizando los experimentos contra un único deployment, el único atributo que cambiará entre experimentos es “chaosServiceAccount”.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-sample-chaos # Nombre del chaos-engine
  namespace: testing     # Namespace de testing
spec:
  annotationCheck: "true" # Hemos creado una anotación en nuestro deployment app-sample. Con la propiedad marcada a "true" indicamos que aplicarmeos el experimento a este despliegue.

  engineState: "active"   # Activación/desactivación de experimento

  appinfo:                # En esta sección proporcionamos la información de nuestro deployment.
    appns: "testing"      # Namespace donde se localiza
    applabel: "app.kubernetes.io/name=app-sample" # Etiqueta asociada a nuestro deployment
    appkind: "deployment" # Tipo de recurso (sólo admite deployment, lo que afectará a todos los pods)

  chaosServiceAccount: container-kill-sa # Nombre del service account (creado en el paso anterior)
  monitoring: true       # si queremos activar la monitorización (prometheus o similares)
  jobCleanUpPolicy: "delete" # Permite controlar la limpieza de recursos tras la ejecución. Especificar "retain" para debug.

Especificaciones de componentes

En esta sección definiremos las variables de entorno propias de cada experimento. Las variables “CHAOS_INTERVAL” y “TOTAL_CHAOS_DURATION” son comunes a todos los experimentos.

  experiments:
    - name: container-kill # Nombre del experimento
      spec:
        components:
          env:
            # Intervalo (segundos) por cada iteración
            - name: CHAOS_INTERVAL
              value: "10"

            # Tiempo total (segundos) que durará el experimento
            - name: TOTAL_CHAOS_DURATION
              value: "60"

Especificaciones de pruebas

En esta sección se informan los atributos para las pruebas de validación. El resultado del experimento dependerá del cumplimiento de la validación especificada.

En el siguiente enlace podeis consultar los tipos de pruebas disponibles.

        probe:
          - name: "check-frontend-access-url" # Nombre de prueba
            type: "httpProbe"                 # Petición de tipo HTTP(S). Alternativas: cmdProbe, k8sProbe, promProbe.
            httpProbe/inputs:                  
              url: "http://app-sample.testing.svc.cluster.local" # URL a validar
              insecureSkipVerify: false                               # Permitir HTTP sin TLS
              method:
                get:                          # Petición tipo GET
                  criteria: ==                # Criterio a evaluar
                  responseCode: "200"         # Respuesta a evaluar
            mode: "Continuous"                # La prueba se ejecuta de forma continua (alternativas: SoT, EoT, Edge, OnChaos)
            runProperties:
              probeTimeout: 5                 # Número de segundos para timeout en la petición
              interval: 5                     # Intervalo (segundos) entre re-intentos
              retry: 1                        # Número de re-intento antes de dar por fallida la validación   
              probePollingInterval: 2         # Intervalo (segundos) entre peticiones

Gestión de experimentos

Una de las principales ventajas de litmus es poder definir los experimentos de forma declarativa, lo que nos permite incluir fácilmente nuestros gestores de plantillas. Recomendamos el uso de kustomize.

Ejecución de experimentos

Container Kill

Descripción: Aborta la ejecución del servicio docker dentro de un pod. La selección del pod es aleatoria.
Información oficial del experimento: enlace
Criterio de entrada: 2 pods de app-sample en estado “Running”
```
  kubectl get pods -n "${TESTING_NAMESPACE}"
```

  kubectl get pods -n "${TESTING_NAMESPACE}"

  # -----------------------------------------

  NAME                          READY   STATUS    RESTARTS   AGE
  app-sample-7ff489dbd5-82ppw   1/1     Running   0          9h
  app-sample-7ff489dbd5-jg9vh   1/1     Running   0          9h

Parámetros de entrada experimento:

experiments:
    - name: container-kill
    spec:
        components:
        env:
            # provide the chaos interval
            - name: CHAOS_INTERVAL
            value: "10"

            # provide the total chaos duration
            - name: TOTAL_CHAOS_DURATION
            value: "20"

            - name: CONTAINER_RUNTIME
            value: "docker"

            - name: SOCKET_PATH
            value: "/var/run/docker.sock"

Hipótesis: Tenemos dos pods escuchando por el 80 tras un balanceador. Nuestro deployment tiene readinessProbe con periodSeconds=1 y failureThreshold=1. Si uno de los pods deja de responder, el balanceador deja de enviar tráfico a ese pod y debe responder el otro. Hemos establecido el healthcheck del experimento cada 5s (tiempo máximo de respuesta aceptable) atacando directamente contra el balanceador, por lo que no deberíamos de tener pérdida de servicio en ningún momento.
Creación de SA, Role y RoleBinding

kubectl apply -f src/litmus/kill-container/kill-container-sa.yaml -n "${TESTING_NAMESPACE}"

Ejecución de experimento

kubectl apply -f src/litmus/kill-container/chaos-engine-kill-container.yaml  -n "${TESTING_NAMESPACE}"

# Awaited -> Pass/Fail
watch -n 1 kubectl get chaosresult app-sample-chaos-container-kill -n "${TESTING_NAMESPACE}" -o jsonpath="{.status.experimentstatus.verdict}"

Observaciones: durante el experimento observamos 2 reinicios de pod con transición “Running” -> “Error” -> “Running”.
Validación: Peticiones get al balanceador con respuesta 200.

probe:
- name: "check-frontend-access-url"
    type: "httpProbe"
    httpProbe/inputs:
    url: "http://app-sample.testing.svc.cluster.local"
    insecureSkipVerify: false
    method:
        get:
        criteria: ==
        responseCode: "200"
    mode: "Continuous"
    runProperties:
    probeTimeout: 5
    interval: 5
    retry: 1
    probePollingInterval: 2

Resultado: resultado “Pass” (dos pods en estado “Running”, sin pérdida de servicio durante la duración del experimento)

$ kubectl describe chaosresult app-sample-chaos-container-kill -n "${TESTING_NAMESPACE}" 

# --------------------------------------------------------------------------------------

Spec:
    Engine:      app-sample-chaos
    Experiment:  container-kill
Status:
    Experimentstatus:
        Fail Step:                 N/A
        Phase:                     Completed
        Probe Success Percentage:  100
        Verdict:                   Pass
History:
    Failed Runs:   0
    Passed Runs:   6
    Stopped Runs:  0
Probe Status:
    Name:  check-frontend-access-url
    Status:
        Continuous:  Passed 👍
    Type:            httpProbe
Events:
    Type    Reason   Age    From                         Message
    ----    ------   ----   ----                         -------
    Normal  Awaited  4m48s  container-kill-5i56m6-4pkxg  experiment: container-kill, Result: Awaited
    Normal  Pass     4m4s   container-kill-5i56m6-4pkxg  experiment: container-kill, Result: Pass


$ kubectl get pods -n testing

NAME                          READY   STATUS    RESTARTS   AGE
app-sample-6c48f8c4cc-74lvl   1/1     Running   2          25m
app-sample-6c48f8c4cc-msdmj   1/1     Running   0          25m

Pod autoscaler
- Descripción: permite escalar las réplicas para testear el autoescalado en el nodo.
- Información oficial del experimento: enlace
- Criterio de entrada: 2 pods de app-sample en estado “Running”

  $ kubectl get pods -n "${TESTING_NAMESPACE}"

  # ------------------------------------------

  NAME                          READY   STATUS    RESTARTS   AGE
  app-sample-6c48f8c4cc-74lvl   1/1     Running   2          29m
  app-sample-6c48f8c4cc-msdmj   1/1     Running   0          28m

Parámetros de entrada experimento:

experiments:
  - name: pod-autoscaler
    spec:
      components:
        env:
          # set chaos duration (in sec) as desired
          - name: TOTAL_CHAOS_DURATION
            value: "60"

          # number of replicas to scale
          - name: REPLICA_COUNT
            value: "10"

Hipótesis: Disponemos de un HPA con min = 2 y max = 10. Con la ejecución de este experimento queremos validar que nuestro nodo es capaz de escalar a 10 réplicas (el max. establecido en el HPA). Cuando ejecutemos el experimento, se crearán 10 réplicas y en ningún momento tendremos pérdida de servicio. Nuestro HPA tiene establecido el parámetro “–horizontal-pod-autoscaler-downscale-stabilization” a 300s, por lo que durante ese intervalo tendremos 10 réplicas en estado “Running” y transcurrido ese intervalo, volveremos a tener 2 réplicas.
Creación de SA, Role y RoleBinding

$ kubectl apply -f src/litmus/pod-autoscaler/pod-autoscaler-sa.yaml -n "${TESTING_NAMESPACE}"

Ejecución de experimento

$ kubectl apply -f src/litmus/pod-autoscaler/chaos-engine-pod-autoscaler.yaml  -n "${TESTING_NAMESPACE}"

Observaciones:
Validación: Peticiones get al balanceador con respuesta 200.

probe:
- name: "check-frontend-access-url"
    type: "httpProbe"
    httpProbe/inputs:
    url: "http://app-sample.testing.svc.cluster.local"
    insecureSkipVerify: false
    method:
        get:
        criteria: ==
        responseCode: "200"
    mode: "Continuous"
    runProperties:
    probeTimeout: 5
    interval: 5
    retry: 1
    probePollingInterval: 2

Resultado:

$ kubectl describe chaosresult app-sample-chaos-pod-autoscaler  -n "${TESTING_NAMESPACE}"

# ----------------------------------------------------------------------------------------

Spec:
    Engine:      app-sample-chaos
    Experiment:  pod-autoscaler
Status:
    Experimentstatus:
        Fail Step:                 N/A
        Phase:                     Completed
        Probe Success Percentage:  100
        Verdict:                   Pass
History:
    Failed Runs:   0
    Passed Runs:   6
    Stopped Runs:  0
Probe Status:
    Name:  check-frontend-access-url
    Status:
        Continuous:  Passed 👍
    Type:            httpProbe
Events:
    Type    Reason   Age    From                         Message
    ----    ------   ----   ----                         -------
    Normal  Awaited  4m46s  pod-autoscaler-95wa6x-858jv  experiment: pod-autoscaler, Result: Awaited
    Normal  Pass     3m32s  pod-autoscaler-95wa6x-858jv  experiment: pod-autoscaler, Result: Pass

$ kubectl get pods -n testing

# ---------------------------

NAME                          READY   STATUS        RESTARTS   AGE
app-sample-6c48f8c4cc-5kzpg   0/1     Completed     0          39s
app-sample-6c48f8c4cc-74lvl   0/1     Running       2          32m
app-sample-6c48f8c4cc-bflws   0/1     Completed     0          39s
app-sample-6c48f8c4cc-c5ls8   0/1     Completed     0          39s
app-sample-6c48f8c4cc-d9zj4   0/1     Completed     0          39s
app-sample-6c48f8c4cc-f2xnt   0/1     Completed     0          39s
app-sample-6c48f8c4cc-f7qdl   0/1     Completed     0          39s
app-sample-6c48f8c4cc-ff84v   0/1     Completed     0          39s
app-sample-6c48f8c4cc-k29rr   0/1     Completed     0          39s
app-sample-6c48f8c4cc-l5fqp   0/1     Completed     0          39s
app-sample-6c48f8c4cc-m587t   0/1     Completed     0          39s
app-sample-6c48f8c4cc-msdmj   1/1     Running       0          32m
app-sample-6c48f8c4cc-n5h6l   0/1     Completed     0          39s
app-sample-6c48f8c4cc-qr5nd   0/1     Completed     0          39s
app-sample-chaos-runner       0/1     Completed     0          47s
pod-autoscaler-95wa6x-858jv   0/1     Completed     0          45s

Pod CPU Hog

Descripción: permite consumir recursos de CPU dentro de POD
Información oficial del experimento: enlace
Criterio de entrada: 2 pods de app-sample en estado “Running”

  kubectl get pods -n "${TESTING_NAMESPACE}"

  # ---------------------------------------

  NAME                          READY   STATUS    RESTARTS   AGE
  app-sample-6c48f8c4cc-74lvl   1/1     Running   2          52m
  app-sample-6c48f8c4cc-msdmj   1/1     Running   0          52m

Parámetros de entrada experimento:

experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
          #number of cpu cores to be consumed
          #verify the resources the app has been launched with
          - name: CPU_CORES
            value: "1"

          - name: TOTAL_CHAOS_DURATION
            value: "60" # in seconds

          - name: PODS_AFFECTED_PERC
            value: "0"

Hipótesis: Disponemos de un HPA con min = 2 y max = 10. Con la ejecución de este experimento queremos validar que nuestro HPA funciona correctamente. Tenemos establecido un targetCPUUtilizationPercentage=50%, lo que quiere decir que si inyectamos consumo de CPU en un pod, el HPA debe establecer el número de réplicas a 3 (2 min + 1 autoscaler). En ningún momento debemos tener pérdida de servicio. Nuestro HPA tiene establecido el parámetro “–horizontal-pod-autoscaler-downscale-stabilization” a 300s, por lo que durante ese intervalo tendremos 10 réplicas en estado “Running” y transcurrido ese intervalo, volveremos a tener 2 réplicas.
Creación de SA, Role y RoleBinding

kubectl apply -f src/litmus/pod-cpu-hog/pod-cpu-hog-sa.yaml -n "${TESTING_NAMESPACE}"

Ejecución de experimento

kubectl apply -f src/litmus/pod-cpu-hog/chaos-engine-pod-cpu-hog.yaml -n "${TESTING_NAMESPACE}"

Observaciones: durante el experimento vemos 2 pods en estado “Runnning”. Se comienza a inyectar consumo en uno de los POD y se autoescala a 3 réplicas. A los 300s se vuelve a tener 2 réplicas.
Validación: Peticiones get al balanceador con respuesta 200.

probe:
- name: "check-frontend-access-url"
    type: "httpProbe"
    httpProbe/inputs:
    url: "http://app-sample.testing.svc.cluster.local"
    insecureSkipVerify: false
    method:
        get:
        criteria: ==
        responseCode: "200"
    mode: "Continuous"
    runProperties:
    probeTimeout: 5
    interval: 5
    retry: 1
    probePollingInterval: 2

Resultado: resultado “Pass” (tres pods en estado “Running”, sin pérdida de servicio durante la duración del experimento)

$ kubectl describe chaosresult app-sample-chaos-pod-cpu-hog -n "${TESTING_NAMESPACE}" 

# -----------------------------------------------------------------------------------

Spec:
    Engine:      app-sample-chaos
    Experiment:  pod-cpu-hog
Status:
    Experimentstatus:
        Fail Step:                 N/A
        Phase:                     Completed
        Probe Success Percentage:  100
        Verdict:                   Pass
History:
    Failed Runs:   0
    Passed Runs:   6
    Stopped Runs:  0
Probe Status:
    Name:  check-frontend-access-url
    Status:
        Continuous:  Passed 👍
    Type:            httpProbe
Events:
    Type    Reason   Age    From                         Message
    ----    ------   ----   ----                         -------
    Normal  Awaited  2m23s  pod-cpu-hog-mpen59-zcpr6  experiment: pod-cpu-hog, Result: Awaited
    Normal  Pass     74s    pod-cpu-hog-mpen59-zcpr6  experiment: pod-cpu-hog, Result: Pass

$ kubectl get pods -n testing

  NAME                          READY   STATUS      RESTARTS   AGE
  app-sample-6c48f8c4cc-74lvl   1/1     Running     6          46m
  app-sample-6c48f8c4cc-msdmj   1/1     Running     0          46m
  app-sample-5c5575cdb7-hq5gs   1/1     Running     0          49s
  app-sample-chaos-runner       0/1     Completed   0          104s
  pod-cpu-hog-mpen59-zcpr6      0/1     Completed   0          103s

Extra – Otros experimentos

pod-network-loss

kubectl apply -f src/litmus/pod-network-loss/pod-network-loss-sa.yaml -n "${TESTING_NAMESPACE}"

kubectl apply -f src/litmus/pod-network-loss/chaos-engine-pod-network-loss.yaml  -n "${TESTING_NAMESPACE}"

kubectl describe chaosresult app-sample-chaos-pod-network-loss -n "${TESTING_NAMESPACE}"

pod-memory-hog

kubectl apply -f src/litmus/pod-memory/pod-memory-hog-sa.yaml -n "${TESTING_NAMESPACE}"

kubectl apply -f src/litmus/pod-memory/chaos-engine-pod-memory-hog.yaml  -n "${TESTING_NAMESPACE}"

kubectl describe chaosresult app-sample-chaos-pod-memory-hog -n "${TESTING_NAMESPACE}"

pod-delete

kubectl apply -f src/litmus/pod-delete/pod-delete-sa.yaml -n "${TESTING_NAMESPACE}"

kubectl apply -f src/litmus/pod-delete/chaos-engine-pod-delete.yaml -n "${TESTING_NAMESPACE}"

kubectl describe chaosresult app-sample-chaos-pod-delete -n "${TESTING_NAMESPACE}"

Planificación de experimentos

Litmus soporta el uso de planificaciones de experimentos. Dispone de las siguientes opciones:

Inmediato

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    now: true
  engineTemplateSpec:
    appinfo:
      appns: testing
      applabel: app.kubernetes.io/name=app-sample
      appkind: deployment
    annotationCheck: 'true'

Timestamp específico

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    once:
      #should be modified according to current UTC Time
      executionTime: "2020-05-12T05:47:00Z" 
  engineTemplateSpec:
    appinfo:
      appns: testing
      applabel: app.kubernetes.io/name=app-sample
      appkind: deployment
    annotationCheck: 'true'

Repeticiones

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      properties:
         #format should be like "10m" or "2h" accordingly for minutes or hours
        minChaosInterval: "2m"  
  engineTemplateSpec:
    appinfo:
      appns: testing
      applabel: app.kubernetes.io/name=app-sample
      appkind: deployment
    annotationCheck: 'true'

Repeticiones entre un rango de fechas

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      timeRange:
        #should be modified according to current UTC Time
        startTime: "2020-05-12T05:47:00Z"   
        endTime: "2020-09-13T02:58:00Z"   
      properties:
        #format should be like "10m" or "2h" accordingly for minutes and hours
        minChaosInterval: "2m"  
  engineTemplateSpec:
    appinfo:
      appns: testing
      applabel: app.kubernetes.io/name=app-sample
      appkind: deployment
    annotationCheck: 'true'

Repeticiones con una fecha de finalización

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      timeRange:
        #should be modified according to current UTC Time
        endTime: "2020-09-13T02:58:00Z"   
      properties:
        #format should be like "10m" or "2h" accordingly for minutes and hours
        minChaosInterval: "2m"   
  engineTemplateSpec:
    appinfo:
      appns: testing
      applabel: app.kubernetes.io/name=app-sample
      appkind: deployment
    annotationCheck: 'true'

Repeticiones desde una fecha de inicio (ejecuciones indefinidas)

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      timeRange:
        #should be modified according to current UTC Time
        startTime: "2020-05-12T05:47:00Z"   
      properties:
         #format should be like "10m" or "2h" accordingly for minutes and hours
        minChaosInterval: "2m" 
  engineTemplateSpec:
    appinfo:
      appns: testing
      applabel: app.kubernetes.io/name=app-sample
      appkind: deployment
    annotationCheck: 'true'

Ejecución entre horas con frecuencia

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      properties:
        #format should be like "10m" or "2h" accordingly for minutes and hours
        minChaosInterval: "2m"   
      workHours:
        # format should be <starting-hour-number>-<ending-hour-number>(inclusive)
        includedHours: 0-12
  engineTemplateSpec:
    appinfo:
      appns: testing
      applabel: app.kubernetes.io/name=app-sample
      appkind: deployment
    annotationCheck: 'true'

Ejecuciones periódicas en días específicos

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      properties:
        #format should be like "10m" or "2h" accordingly for minutes and hours
        minChaosInterval: "2m"   
      workDays:
        includedDays: "Mon,Tue,Wed,Sat,Sun"
  engineTemplateSpec:
    appinfo:
      appns: testing
      applabel: app.kubernetes.io/name=app-sample
      appkind: deployment
    annotationCheck: 'true'

LitmusChaos + Load Test Performance con Apache Jmeter

Hasta el momento hemos realizado pruebas para validar cómo se comporta nuestro nodo de k8s bajo escenarios ideales, sin carga en el sistema por parte de los usuarios finales de la aplicación.

Por lo general, tendremos definidos SLIs/SLOs/SLAs los cuales hay que garantizar que cumplimos bajo cualquier eventualidad y para ello debemos de disponer de las herramientas adecuadas. En este caso, Litmus + Apache Jmeter nos facilitarán la tarea de simular múltiples escenarios de concurrencia con inyección de anomalías en el sistema. Durante esta fase de pruebas es posible que tengamos que realizar ajustes de dimensionamiento, modificar alguna política de escalado o incluso que identifiquemos cuellos de botella y los equipos de desarrollo tengan que ajustar algún componente.

Para no desvirtuar el objetivo del workshop con la definición de SLIs/SLOs/SLAs (más info aquí), únicamente vamos a utilizar la métrica “Ratio de error”, la cual vamos a establecer en < 2,00%.

Planteamos un escenario ficticio donde nuestra aplicación tiene 200 usuarios concurrentes durante la mayor parte del tiempo de servicio.

Procedemos a descargar el binario de JMeter y unos complementos para la visualización de gráficas:

JMeter requiere Java JRE. En el caso de no estar disponible en el sistema, puedes realizar la instalación de [OpenJDK](https://adoptopenjdk.net/index.html). En caso contrario, omite este paso.

curl -L https://ftp.cixug.es/apache//jmeter/binaries/apache-jmeter-5.4.1.tgz --output /tmp/apache-jmeter.tgz
tar zxvf /tmp/apache-jmeter.tgz && mv apache-jmeter-5.4.1 apache-jmeter

# install plugins-manager
curl -L https://jmeter-plugins.org/get/ --output apache-jmeter/lib/ext/jmeter-plugins-manager-1.6.jar

# install bzm - Concurrency Thread Group
curl -L https://repo1.maven.org/maven2/kg/apc/jmeter-plugins-casutg/2.9/jmeter-plugins-casutg-2.9.jar --output apache-jmeter/lib/ext/jmeter-plugins-casutg-2.9.jar
curl -L https://repo1.maven.org/maven2/kg/apc/jmeter-plugins-cmn-jmeter/0.6/jmeter-plugins-cmn-jmeter-0.6.jar --output apache-jmeter/lib/jmeter-plugins-cmn-jmeter-0.6.jar
curl -L https://repo1.maven.org/maven2/kg/apc/cmdrunner/2.2/cmdrunner-2.2.jar --output apache-jmeter/lib/cmdrunner-2.2.jar
curl -L https://repo1.maven.org/maven2/net/sf/json-lib/json-lib/2.4/json-lib-2.4.jar --output apache-jmeter/lib/json-lib-2.4-jdk15.jar


curl -L https://repo1.maven.org/maven2/kg/apc/jmeter-plugins-graphs-basic/2.0/jmeter-plugins-graphs-basic-2.0.jar --output apache-jmeter/lib/ext/jmeter-plugins-graphs-basic-2.0.jar
curl -L https://repo1.maven.org/maven2/kg/apc/jmeter-plugins-graphs-additional/2.0/jmeter-plugins-graphs-additional-2.0.jar --output apache-jmeter/lib/ext/jmeter-plugins-graphs-additional-2.0.jar

# Get url service
url=$(minikube service app-sample --url -n "${TESTING_NAMESPACE}")

HOST_APP_SAMPLE=$(echo ${url} | cut -d/ -f3 | cut -d: -f1)
PORT_APP_SAMPLE=$(echo ${url} | cut -d: -f3)

Vamos a validar que con el dimensionamiento actual cumplimos con los requisitos. Durante 60 segundos, ejecutamos 200 peticiones concurrentes, lo que se traduce en 12.000 peticiones. La petición será de tipo “GET” por el puerto 80 del balanceador.

Este es el aspecto que tiene la GUI de JMeter con el plan de pruebas.

TARGET_RATE=200
RAMP_UP_TIME=60
RAMP_UP_STEPS=1

# GUI mode
bash apache-jmeter/bin/jmeter.sh -t src/jmeter/litmus-k8s-workshop.jmx -f -l apache-jmeter/logs/result.jtl -j apache-jmeter/logs/jmeter.log -Jhost=${HOST_APP_SAMPLE} -Jport=${PORT_APP_SAMPLE} -Jtarget_rate=${TARGET_RATE} -Jramp_up_time=${RAMP_UP_TIME} -Jramp_up_steps=${RAMP_UP_STEPS}

Nuestro dimensionamiento base son dos réplicas de nuestro servicio app-sample:

kubectl get pods -n "${TESTING_NAMESPACE}"

# ----------------------------------------

NAME                         READY   STATUS    RESTARTS   AGE
app-sample-d9d578cfb-55flr   1/1     Running   8          3h1m
app-sample-d9d578cfb-klmxn   1/1     Running   0          3h2m

Ejecutamos el plan de pruebas sin GUI:

TARGET_RATE=200
RAMP_UP_TIME=60
RAMP_UP_STEPS=1

bash apache-jmeter/bin/jmeter.sh -n -t src/jmeter/litmus-k8s-workshop.jmx -f -l apache-jmeter/logs/result.jtl -j apache-jmeter/logs/jmeter.log -Jhost=${HOST_APP_SAMPLE} -Jport=${PORT_APP_SAMPLE} -Jtarget_rate=${TARGET_RATE} -Jramp_up_time=${RAMP_UP_TIME} -Jramp_up_steps=${RAMP_UP_STEPS}

rm -rf apache-jmeter/logs/report && bash apache-jmeter/bin/jmeter.sh -g apache-jmeter/logs/result.jtl -o apache-jmeter/logs/report

En la ruta “./apache-jmeter/logs/report/index.html” podéis ver un dashboard con los resultados.

Hemos realizado 12000 peticiones con 200 usuarios concurrentes durante 60s. Estos son los resultados:

Ratio de error: 0.00%

Vamos a realizar la misma prueba pero inyectando disrupcción de red en uno de los pods, lo que provocará que deje de responder (estado CrashLoopBackOff) y sólo tengamos disponible una réplica.

kubectl apply -f src/litmus/pod-network-loss/pod-network-loss-sa.yaml -n "${TESTING_NAMESPACE}"
kubectl apply -f src/litmus/pod-network-loss/chaos-engine-pod-network-loss.yaml  -n "${TESTING_NAMESPACE}"

TARGET_RATE=200
RAMP_UP_TIME=60
RAMP_UP_STEPS=1

bash apache-jmeter/bin/jmeter.sh -n -t src/jmeter/litmus-k8s-workshop.jmx -f -l apache-jmeter/logs/result.jtl -j apache-jmeter/logs/jmeter.log -Jhost=${HOST_APP_SAMPLE} -Jport=${PORT_APP_SAMPLE} -Jtarget_rate=${TARGET_RATE} -Jramp_up_time=${RAMP_UP_TIME} -Jramp_up_steps=${RAMP_UP_STEPS}

rm -rf apache-jmeter/logs/report && bash apache-jmeter/bin/jmeter.sh -g apache-jmeter/logs/result.jtl -o apache-jmeter/logs/report

¿Qué ha sucedido?

Al inyectar el experimento, uno de los pods ha dejado de responder. Si nos fijamos en la definición del deployment app-sample, tenemos un livenessProbe cuya propiedad periodSeconds está establecida a 5 segundos y failureThreshold a 1 intento. Según nuestra configuración, el balanceador envía el 50% aprox. del tráfico a cada uno de los pods. Durante 5 segundos tenemos que el pod al que hemos inyectado una disrupción de red mediante el experimento no responde, lo que se traduce en error en la petición. Transcurridos los 5 segundos, el balanceador deja de enviar tráfico a ese pod y sólo tendremos un pod recibiendo peticiones.

Teníamos establecido un requisito que nuestro servicio no puede superar el 2% de errores bajo ningún escenario y hemos obtenido un 5,03% (603 peticiones erróneas), por lo que debemos realizar algún ajuste para cumplir el objetivo.

¿Cuál es el resultado del experimento?

kubectl describe chaosresult app-sample-chaos-pod-network-loss  -n "${TESTING_NAMESPACE}"

#-------------------------

Events:
Type    Reason   Age    From                           Message
----    ------   ----   ----                           -------
Normal  Awaited  4m16s  pod-network-loss-uf6hms-sk47z  experiment: pod-network-loss, Result: Awaited
Normal  Pass     2m23s  pod-network-loss-uf6hms-sk47z  experiment: pod-network-loss, Result: Pass

Aunque nuestro requisito de ratio de error < 2,00% no se cumple, el experimento termina con resultado “Pass”. Esto es debido a que Litmus tiene como criterio de salida “Pass” si el pod vuelve a estar disponible, lo cual se cumple. Aquí estamos haciendo uso de litmus para inyectar errores en el sistema.

¿Cómo podemos conseguir reducir el ratio de error?

Únicamente con fines ilustrativos, para resolver el problema que nos ocupa, vamos a incrementar el número de réplicas a 4 en el HorizontalPodAutoscaler y en el deployment disminuir el valor de la propiedad periodSeconds de 5s a 2s. Con esto pasamos a distribuir el 25% del tráfico a cada pod y además, el tiempo que el pod afectado por la disrupción de tráfico pasa de 5s a 2s, lo que debe traducirse en una reducción del ratio de error.

ℹ️ Nuestro sistema debe estar diseñado para adaptarse a la demanda en base a métricas (CPU, memoria, peticiones por segundo, latencia, I/O, etc.) siempre manteniendo los mínimos recursos activos. Con la expansión de servicios gestionados de kubernetes en los principales proveedores cloud (EKS/GKE/AKS), disponemos de múltiples estrategias para conseguir dicho objetivo.

kubectl edit deployment app-sample -n "${TESTING_NAMESPACE}" 

kubectl edit HorizontalPodAutoscaler app-sample-ha -n "${TESTING_NAMESPACE}"

Volvemos a ejecutar nuestro test:

kubectl apply -f src/litmus/pod-network-loss/pod-network-loss-sa.yaml -n "${TESTING_NAMESPACE}"
kubectl apply -f src/litmus/pod-network-loss/chaos-engine-pod-network-loss.yaml  -n "${TESTING_NAMESPACE}"

TARGET_RATE=200
RAMP_UP_TIME=60
RAMP_UP_STEPS=1

bash apache-jmeter/bin/jmeter.sh -n -t src/jmeter/litmus-k8s-workshop.jmx -f -l apache-jmeter/logs/result.jtl -j apache-jmeter/logs/jmeter.log -Jhost=${HOST_APP_SAMPLE} -Jport=${PORT_APP_SAMPLE} -Jtarget_rate=${TARGET_RATE} -Jramp_up_time=${RAMP_UP_TIME} -Jramp_up_steps=${RAMP_UP_STEPS}

rm -rf apache-jmeter/logs/report && bash apache-jmeter/bin/jmeter.sh -g apache-jmeter/logs/result.jtl -o apache-jmeter/logs/report

Como podemos observar, nuestros cambios han provocado disminuir nuestro ratio de error a 1,60%, por lo que conseguimos cumplir nuestro objetivo de < 2,00%.

Litmus UI Portal

Litmus dispone de un portal para poder realizar experimentos sin necesidad de utilizar la consola. Dispone de las siguientes funcionalidades:

Gestión de workflows: dispone de todos los experimentos pre-cargados listos para ejecutar en tu k8s.
MyHubs: permite conectar a repositorios públicos/privados para hacer uso de tus propios experimentos.
Analytics: permite visualizar las ejecuciones de tus experimentos, así como estadísticas sobre los mismos. Además, permite conectar a otros DataSources como Prometheus.
Gestión de equipos y usuarios.

# install litmus portal
kubectl apply -f src/litmus/portal/portal.yaml

minikube service litmusportal-frontend-service -n  ${LITMUS_NAMESPACE} > /dev/null &

Guía Litmus para desarrolladores

En la actualidad, litmus dispone de 53 experimentos a través de Litmus ChaosHub. Están desarrollados principalmente en Go, aunque disponen de una SDK para python y ansible.

Los experimentos tienen una estructura bien definida (pre-checks, chaos-injection, litmus-probes, post-checks y result-updates) y es viable desarrollar experimentos que se ajusten a tus necesidades.

En este enlace encontraréis toda la información para desarrolladores.

Consideraciones finales

Debemos asumir que nuestro sistema no va a ser 100% tolerante a fallos pero ello no implica que pongamos todos los medios para minimizar los riesgos y en caso de producirse el desastre, lo hagamos de una forma relativamente controlada. La clave del éxito pasa por aplicar las prácticas de ingeniería del caos en fases tempranas del desarrollo, conocer las particularidades de la infraestructura donde ejecuta y disponer de herramientas adecuadas para automatizar las pruebas.

Un factor importante es dimensionar los esfuerzos en base a la criticidad del servicio que presta nuestro sistema: el esfuerzo en validar la resiliencia de un portal con información para empleados con 100 usuarios potenciales cuyo SLA es del 98% difiere mucho de una aplicación bancaria que realiza operaciones financieras a miles de usuarios concurrentes cuyo SLA es del 99.9XX%. En ambos casos el único método para verificar el cumplimiento del SLA es mediante test de resiliencia pero existe una notable diferencia respecto al esfuerzo que deberíamos dedicar.

En este workshop nos hemos centrado en Litmus y Kubernetes pero cabe recordar que dependiendo del sistema que estemos desarrollando, tengamos que complementar nuestras pruebas con otras herramientas, principalmente las enfocadas a la inyección de fallos sobre infraestructura (+ info).

Referencias

Licencia

Este workshop está licenciado bajo MIT (ver LICENSE para más detalle).

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

Ángel Maroco

AWS Cloud Architect

Ángel Maroco llevo en el sector IT más de una década, iniciando mi carrera profesional con el desarrollo web, pasando una buena etapa en distintas plataformas informacionales en entornos bancarios y los últimos 5 años dedicado al diseño de soluciones en entornos AWS.

En la actualidad, compagino mi papel de arquitecto junto al de responsable de la Pŕactica Cloud /bluetab, cuya misión es impulsar la cultura Cloud dentro de la compañía.

SOLUCIONES, SOMOS EXPERTOS

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Te puede interesar

De documentos en papel a datos digitales con Fastcapture y Generative AI

June 7, 2023

Databricks on AWS – An Architectural Perspective (part 2)

March 5, 2024

Bank Fraud detection with automatic learning II

September 17, 2020

IBM to acquire Bluetab

July 9, 2021

Bank Fraud detection with automatic learning

September 17, 2020

CLOUD SERVICE DELIVERY MODELS

June 27, 2022

5 common errors in Redshift

December 15, 2020 by Bluetab

5 common errors in Redshift

Amazon Redshift can be considered to be one of the most important data warehouses currently and AWS offers it in its cloud. Working at Bluetab, we have had the pleasure of using it many times during our good/bad times as well as this year 2020. So we have created a list with the most common errors you will need to avoid and we hope this will be a great aid for you.

At Bluetab we have been working around data for over 10 years. In many of them, we have helped in the technological evolution of numerous companies by migrating from their traditional Data Warehouse analytics and BI environments to Big Data environments.

Additionally, at Cloud Practice we have been involved in cloud migrations and new developments of Big Data projects with Amazon Web Services and Google Cloud. All this experience has enabled us to create a group of highly qualified people who think/work in/for the cloud

To help you with your work in the cloud, we want to present the most common mistakes we have found when working with Redshift, the most important DW tool offered by AWS.

Here is the list:

Working as if it were a PostgreSQL.
Load data wrongly.
Dimensioning the cluster poorly.
Not making use of workload management (WLM).
Neglecting maintenance.

What is Redshift?

Amazon Redshift is a very fast, cloud-based analytical (OLAP) database, fully managed by AWS. It simplifies and enhances data analysis using standard SQL compatible with most existing BI tools.

The most important features of Amazon Redshift are:

Data storage in columns: instead of storing data as a series of rows, Amazon Redshift organises the data by column. Because only the columns involved in queries are processed and the data in columns are stored sequentially on storage media, column-based systems require much less I/O, which greatly improves query performance.
Advanced compression: column-based databases can be compressed much more than row-based databases because similar data is stored sequentially on disk.
Massively Parallel Processing (MPP): Amazon Redshift automatically distributes the data and query load across all nodes.
Redshift Spectrum: lets you run queries against exabytes of data stored in Amazon S3.
Materialized views: subsequent queries that refer to the materialized views use the pre-calculated results to run much faster. Materialized views can be created based on one or more source tables using filters, projections, inner joins, aggregations, groupings, functions and other SQL constructs.
Scalability: Redshift has the ability to scale its processing and storage by increasing the cluster size to hundreds of nodes.

Amazon Redshift is not the same as other SQL database systems. Good practices are required to take advantage of all its benefits, so that the cluster will perform optimally.

1. Working as if it were a PostgreSQL

A very common mistake made when starting to use Redshift is to assume that it is simply a vitamin-enriched PostgreSQL and that you can start working with Redshift based on a schema compatible with that. However, you could not be more wrong.

Although it is true that Redshift was based on an older version of PostgreSQL 8.0.2, its architecture has changed radically and has been optimised over the years to improve performance for its strictly analytical use. So you need to:

Design the tables appropriately.
Launch queries optimised for MPP environments.

Design the tables appropriately

When designing the database, bear in mind that some key table design decisions have a considerable influence on overall query performance. Some good practices are:

Select the optimum data distribution type:
- For fact tables choose the DISTKEY type. This will distribute the data to the various nodes grouped by the chosen key values. This will enable you to perform JOIN type queries on that column very efficiently.
- For dimension tables with a few million entries, choose the ALL type. It is advisable to copy those tables commonly used in joins of dictionary type to all the nodes. In that way the JOIN statement with much bigger fact tables will execute much faster.
- When you are not clear on how you are going to query a very large table or it simply has no relation to the rest, choose the EVEN type. The data will be distributed randomly in this way.
It uses automatic compression, allowing Redshift to select the optimal type for each column. It accomplishes this by scanning a limited number of items.

Use queries optimised for MPP environments

As Redshift is a distributed MPP environment, query performance needs to be maximised by following some basic recommendations. Some good practices are:

The tables need to be designed considering the queries that will be made. Therefore, if a query does not match, you need to review the design of the participating tables.
Avoid using SELECT *. and include only the columns you need.
Do not use cross-joins unless absolutely necessary.
Whenever you can, use the WHERE statement to restrict the amount of data to be read.
Use sort keys in GROUP BY and SORT BY clauses so that the query planner can use more efficient aggregation.

2. Loading data in that way

Loading very large datasets can take a long time and consume a lot of cluster resources. Moreover, if this loading is performed inappropriately, it can also affect query performance.

This makes it advisable to follow these guidelines:

Always use the COPY command to load data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB or from different data sources on remote hosts.

 copy customer from 's3://mybucket/mydata' iam_role 'arn:aws:iam::12345678901:role/MyRedshiftRole';

If possible, launch a single command instead of several. You can use a manifest file or patterns to upload multiple files at once.
Split the load data files so that they are:
- Of equal size, between 1 MB and 1 GB after compression.
- A multiple of the number of slices in your cluster.
To update data and insert new data efficiently when loading it, use a staging table.

  -- Create an staging table and load it with the data to be updated
  create temp table stage (like target); 

  insert into stage 
  select * from source 
  where source.filter = 'filter_expression';

  -- Use an inner join with the  staging table to remove the rows of the target table to be updated

  begin transaction;

  delete from target 
  using stage 
  where target.primarykey = stage.primarykey; 

  -- Insert all rows from the of the staging table.
  insert into target 
  select * from stage;

  end transaction;

  -- Drop the staging table.
  drop table stage;

3. Dimensioning the cluster poorly

Over the years we have seen many customers who had serious performance issues with Redshift due to design failures in their databases. Many of them had tried to resolve these issues by adding more resources to the cluster rather than trying to fix the root problem.

Due to this, I suggest you follow the flow below to dimension your cluster:

Collect information on the type of queries to be performed, data set size, expected concurrency, etc.
Design your tables based on the queries that will be made.
Select the type of Redshift instance (DC2, DS2 or RA3) depending on the type of queries (simple, long, complex…).
Taking the data set size into account, calculate the number of nodes in your cluster.

# of  Redshift nodes = (uncompressed data size) * 1.25 / (storage capacity of selected Redshift node type)

« For storage size calculation, having a larger margin for performing maintenance tasks is also recommended »

Perform load tests to check performance.
If it does not work adequately, optimise the queries, even modifying the design of the tables if necessary.
Finally, if this is not sufficient, iterate until you find the appropriate node and size dimensioning.

4. Not making use of workload management (WLM)

It is quite likely that your use case will require multiple sessions or users running queries at the same time. In these cases, some queries can consume cluster resources for extended periods of time and affect the performance of the other queries. In this situation, simple queries may have to wait until longer queries are complete.

By using WLM, you will be able to manage the priority and capacity of the different types of executions by creating different execution queues.

You can configure the Amazon Redshift WLM to run in two different ways:

Automatic WLM: the most advisable manner is to enable Amazon Redshift so that it manages how resources are split to run concurrent queries with automatic WLM. The user manages queue priority and Amazon Redshift determines how many queries run simultaneously and how much memory is allocated to each query submitted.
Manual WLM: alternatively, you can configure resource use for different queues manually. At run time, queries can be sent to different queues with different user-managed concurrency and memory parameters.

How WLM works

When a user runs a query, WLM assigns the query to the first matching queue, based on the WLM queue assignment rules.

If a user is logged in as a superuser and runs a query in the query group labelled superuser, the query is assigned to the superuser queue.
If a user belongs to a listed user group or runs a query within a listed query group, the query is assigned to the first matching queue.
If a query does not meet any criterion, the query is assigned to the default queue, which is the last queue defined in the WLM configuration.

5. Neglecting maintenance.

Database maintenance is a term we use to describe a set of tasks executed with the intention of improving the database. There are routines to help performance, free up disk space, check data errors, check hardware faults, update internal statistics and many other obscure (but important) things.

In the case of Redshift, there is a mistaken feeling that as it is a service fully managed by Amazon, there is no need for any. So you create the cluster and forget about it. While AWS makes it easy for you to manage numerous tasks (create, stop, start, destroy or perform back-ups), this could not be further from the truth.

The most important maintenance tasks you need to perform in Redshift are:

System monitoring: the cluster needs monitoring 24/7 and you need to perform periodic checks to confirm that the system is functioning properly (no bad queries or blocking, free space, adequate response times, etc.). You also need to create alarms to be able to anticipate any future service downtimes.
Compacting the DB: Amazon Redshift does not perform all compaction tasks automatically in all situations and you will sometimes need to run them manually. This process is called VACUUM and it needs to be run manually to be able to use SORT KEYS of the INTERLEAVED type. This is quite a long and expensive process that will need to be performed, if possible, during maintenance windows.
Data integrity: as with any data loading, you need to check that the ETL processes have worked properly. Redshift has system tables such as STV_LOAD_STATE where you can find information on the current status of the COPY instructions in progress. You should check them often to confirm that there are no data integrity errors.
Detection of heavy queries: Redshift continuously monitors all queries that are taking longer than expected and that could be negatively impacting service performance. So that you can analyse and investigate those queries, you can find them in system tables as STL_ALERT_EVENT_LOG or through the AWS web console itself.

Do you want to know more about what we offer and to see other success stories?

Álvaro Santos

Senior Cloud Solution Architect

My name is Álvaro Santos and I have been working as Solution Architect for over 5 years. I am certified in AWS, GCP, Apache Spark and a few others. I joined Bluetab in October 2018, and since then I have been involved in cloud Banking and Energy projects and I am also involved as a Cloud Master Partitioner. I am passionate about new distributed patterns, Big Data, open-source software and anything else cool in the IT world.

SOLUTIONS, WE ARE EXPERTS

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Container vulnerability scanning with Trivy

March 22, 2024

Hashicorp Boundary

December 3, 2020

5 common errors in Redshift

December 15, 2020

Desplegando una plataforma CI/CD escalable con Jenkins y Kubernetes

September 22, 2021

El futuro del Cloud y GenIA en el Next ’23

September 19, 2023

Cómo preparar la certificación AWS Data Analytics – Specialty

November 17, 2021

Hashicorp Boundary

December 3, 2020 by Bluetab

Hashicorp Series Boundary

After the last HashiConf Digital, the Cloud Practice wants to present you one of the main innovations that were presented: Boundary. In this post we are going to discuss what offers this new tool, why it is interesting, what we have found and how we have tested it.

What is Hashicorp Boundary?

Hashicorp Boundary is, as themselves claim, a tool that allows access any system using identity as a fundamental piece. What does this really mean?
Traditionally, when a user acquires the permission to access a remote service, he or she also gets explicit permission to the network where the service resides. However, Boundary, following the minimum privilege principle, provides us with an identity-based system for users who need access to applications or machines. For example, it is an easy way of access to a server via SSH using ephemeral keys as authentication method.

This means that Boundary limits what resources you can connect to and also manages the different permissions and accesses to resources with an authentication.

It is especially interesting because in the future it will be marked by the strong integration that it will have with other Hashicorp tools, especially Vault for credentials management and audit capabilities.

In case you are curious, Hashicorp has released the source code of Boundary which you have available at Github and the official documentation can be read on their website:
boundaryproject.

How have we tested Boundary?

BBased on an example project from Hashicorp, we have developed a small proof of concept that deploys Boundary in a hybrid-cloud scenario in AWS and GCP. Although the reference architecture does not said nothing about this design, we wanted to complete the picture and
set up a small multi-cloud stage to see how this new product.

The final architecture in broad terms is:

Once the infrastructure has been deployed and the application configured, we have tested connecting to the instances through SSH. All the source code is based on terraform 0.13 and you can find it in Bluetab-boundary-hybrid-architecture, where you will also find a detailed README that specifies the actions you have to follow to reproduce the environment, in particular:

Authentication with your user (previously configured) in Boundary. To accomplish this, you have to set the Boundary controllers endpoint and execute the following command: boundary authenticate.
Execute: boundary connect ssh with the required parameters to point to the selected target (this target represents one or more instances or endpoints)

In this particular scenario, the target is composed by two different machines:
one in AWS and one in GCP. If Boundary is not told which particular instance you want to access from that target, it will provide access to one of them randomly. Automatically, once you have selected the machine you want to access, Boundary will route the request to the appropriate worker, who has access to that machine.

What did we like?

The ease of configuration. Boundary knows exactly which worker has to address the request taking into account which service or machine is being requested.
As the entire deployment (both infrastructure and application) has been done using terraform, the output of one deployment serves as the input of the other and everything is perfectly integrated.
It offers both graphic interface and CLI access. Despite being in a very early stage of development, the same binary includes (when configured as controller) a very clean graphical interface, in the same style as the rest of the Hashicorp tools. However, as not all functionality is currently implemented through the interface it is necessary to make configuration using the CLI.

What would we have liked to see?

Integration with Vault and indentity providers (IdPs) is still in the roadmap and until next versions it is not sure that it will be included.
The current management of the JWT token from the Boundary client to the control plane which involves installing a secret management tool.

What do we still need to test?

Considering the level of progress of the current product development, we would be missing for understanding and trying to:

Access management by modifying policies for different users.
Perform a deepest research on the components that manage resources (scopes, organizations, host sets, etc.)

Why do we think this product has great future?

Once the product has completed several phases in the roadmap that Hashicorp has established, it will greatly simplify resources access management through bastions in organizations. Access to instances can be managed simply by adding or modifying the permissions that a user has, without having to distribute ssh keys, perform manual operations on the machines, etc.

In summary, this product gives us a new way to manage access to different resources. Not only through SSH, but it will be a way to manage access through roles to machines, databases, portals, etc. minimizing the possible attack vector when permissions are given to contractors. In addition, it is presented as a free and open source tool, which will not only integrate very effectively if you have the Hashicorp ecosystem deployed, but will also work seamlessly without the rest of Hashicorp’s tools.

One More Thing…

We encountered a problem caused by the way in which the information about the network addresses of controllers and workers for subsequent communication was stored. After running our scenario with a workaround based on iptables we decided to open a issue on Github. In only one day, they solved the problem by updating their code. We downloaded the new version of the code, tested it and it worked perfectly. Point in favour for Hashicorp for the speed of response and the efficiency they demonstrated. In addition, recently it has been published a new release of Boundary, including this fix along with many other features Boundary v0.1.2.

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

SOLUCIONES, SOMOS EXPERTOS

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Te puede interesar

CDKTF: Otro paso en el viaje del DevOps, introducción y beneficios.

May 9, 2023

¿Existe el Azar?

November 10, 2021

Databricks on Azure – An architecture perspective (part 2)

March 24, 2022

Snowflake Advanced Storage Guide

October 3, 2022

Some of the capabilities of Matillion ETL on Google Cloud

July 11, 2022

Databricks on Azure – An Architecture Perspective (part 1)

February 15, 2022

Cómo depurar una Lambda de AWS en local

October 8, 2020 by Bluetab

Cómo depurar una Lambda de AWS en local

AWS Lambda es un servicio serverless mediante el que se puede ejecutar código sin necesidad de levantar ni administrar máquinas. Se paga solamente por el tiempo consumido en la ejecución (15 minutos como máximo).

El servicio dispone de un IDE simple, pero por su propia naturaleza no permite añadir puntos de ruptura para depurar el código. Seguro que algunos de vosotros os habéis visto en esta situación y habéis tenido que hacer uso de métodos poco ortodoxos como prints o ejecutar el código directamente en vuestra máquina, pero esto último no reproduce las condiciones reales de ejecución del servicio.

Para permitir depurar con fiabilidad desde nuestro propio PC, AWS pone a disposición SAM (Serverless Application Model).

Instalación

Los requisitos necesarios son (se ha usado Ubuntu 18.04 LTS):

Python (2.7 ó >= 3.6)
Docker
IDE que se pueda enlazar a un puerto de debug (en nuestro caso usamos VS Code)
awscli

Para instalar la CLI de AWS SAM desde AWS recomiendan brew tanto para Linux como macOS, pero en este caso se ha optado por hacerlo con pip por homogeneidad:

python3 -m pip install aws-sam-cli

Configuración y ejecución

1. Iniciamos un proyecto SAM

sam init

Por simplicidad se selecciona “AWS Quick Start Templates” para crear un proyecto a través de plantillas predefinidas
Se elige la opción 9 – python3.6 como el lenguaje del código que contendrá nuestra lambda
Se selecciona la plantilla de “Hello World Example»

En este momento ya tenemos nuestro proyecto creado en la ruta especificada:

/helloworld: app.py con el código Python a ejecutar y requirements.txt con sus dependencias
/events: events.json con ejemplo de evento a enviar a la lambda para su ejecución. En nuestro caso el trigger será un GET a la API a http://localhost:3000/hello
/tests : test unitario
template.yaml: plantilla con los recursos de AWS a desplegar en formato YAML de CloudFormation. En esta aplicación de ejemplo sería un API gateway + lamba y se emulará ese despliegue localmente

2. Se levanta la API en local y se hace un GET al endpoint

sam local start-api

Concretamente el endpoint de nuestro HelloWorld
será http://localhost:3000/hello Hacemos un GET

Y obtenemos la respuesta de la API

3. Añadimos la librería ptvsd (Python Tools for Visual Studio) para debugging a requirements.txt quedando como:

requests
ptvsd

4. Habilitamos el modo debug en el puerto 5890 haciendo uso del siguiente código en helloworld/app.py

import ptvsd

ptvsd.enable_attach(address=('0.0.0.0', 5890), redirect_output=True)
ptvsd.wait_for_attach()

Añadimos también en app.py dentro de la función lambda_handler varios prints para usar en la depuración

print('punto de ruptura')

print('siguiente línea')

print('continúa la ejecución')

return {
    "statusCode": 200,
    "body": json.dumps({
        "message": "hello world",
        # "location": ip.text.replace("\n", "")
    }),
}

5. Aplicamos los cambios realizados y construimos el contenedor

sam build --use-container

6. Configuramos el debugger de nuestro IDE

En VSCode se utiliza el fichero launch.json. Creamos en la ruta principal de nuestro proyecto la carpeta .vscode y dentro el fichero

{
  "version": "0.2.0",
  "configurations": [
      {
          "name": "SAM CLI Python Hello World",
          "type": "python",
          "request": "attach",
          "port": 5890,
          "host": "127.0.0.1",
          "pathMappings": [
              {
                  "localRoot": "${workspaceFolder}/hello_world",
                  "remoteRoot": "/var/task"
              }
          ]
      }
  ]
}

7. Establecemos un punto de ruptura en el código en nuestro IDE

8. Levantamos nuestra aplicación con la API en el puerto de debug

sam local start-api --debug-port 5890

9. Hacemos de nuevo un GET a la URL del endpoint http://localhost:3000/hello

10. Lanzamos la aplicación desde VSCode en modo debug, seleccionando la configuración creada en launch.json

Y ya estamos en modo debug, pudiendo avanzar desde nuestro punto de ruptura

Alternativa: Se puede hacer uso de events/event.json para lanzar la lambda a través de un evento definido por nosotros

En este caso lo modificamos incluyendo un solo parámetro de entrada:

{
   "numero": "1"
}

Y el código de nuestra función para hacer uso del evento:

print('punto de ruptura número: ' + event["numero"])

De esta manera, invocamos a través del evento en modo debug:

sam local invoke HelloWorldFunction -d 5890 -e events/event.json

Podemos ir depurando paso a paso, viendo como en este caso se hace uso del evento creado:

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

SOLUCIONES, SOMOS EXPERTOS

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Te puede interesar

Boost Your Business with GenAI and GCP: Simple and for Everyone

March 27, 2024

Azure Data Studio y Copilot

October 11, 2023

Cómo depurar una Lambda de AWS en local

October 8, 2020

$ docker run 2021

February 2, 2021

Oscar Hernández, new CEO of Bluetab LATAM.

May 16, 2024

Databricks on AWS – An Architectural Perspective (part 1)

March 5, 2024

Spying on your Kubernetes with Kubewatch

September 14, 2020 by Bluetab

Spying on your Kubernetes with Kubewatch

At Cloud Practice we aim to encourage adoption of the cloud as a way of working in the IT world. To help with this task, we are going to publish numerous good practice articles and use cases and others will talk about those key services within the cloud.

This time we will talk about Kubewatch.

What is Kubewatch?

Kubewatch is a utility developed by Bitnami Labs that enables notifications to be sent to various communication systems.

Supported webhooks are:

Slack
Hipchat
Mattermost
Flock
Webhook
Smtp

Kubewatch integration with Slack

The available images are published in the bitnami/kubewatch GitHub

You can download the latest version to test it in your local environment:

$ docker pull bitnami/kubewatch

Once inside the container, you can play with the options:

$ kubewatch -h

Kubewatch: A watcher for Kubernetes

kubewatch is a Kubernetes watcher that publishes notifications
to Slack/hipchat/mattermost/flock channels. It watches the cluster
for resource changes and notifies them through webhooks.

supported webhooks:
 - slack
 - hipchat
 - mattermost
 - flock
 - webhook
 - smtp

Usage:
  kubewatch [flags]
  kubewatch [command]

Available Commands:
  config      modify kubewatch configuration
  resource    manage resources to be watched
  version     print version

Flags:
  -h, --help   help for kubewatch

Use "kubewatch [command] --help" for more information about a command.

For what types of resources can you get notifications?

When will you receive a notification?

As soon as there is an action on a Kubernetes object, as well as creation, destruction or updating.

Configuration

Firstly, create a Slack channel and associate a webhook with it. To do this, go to the Apps section of Slack, search for “Incoming WebHooks” and press “Add to Slack”:

If there is no channel created for this purpose, register a new one:

In this example, the channel to be created will be called “k8s-notifications”. Then you have to configure the webhook at the “Incoming WebHooks” panel and adding a new configuration where you will need to select the name of the channel to which you want to send notifications. Once selected, the configuration will return a ”Webhook URL” that will be used to configure Kubewatch. Optionally, you can select the icon (“Customize Icon” option) that will be shown on the events and the name with which they will arrive (“Customize Name” option).

You are now ready to configure the Kubernetes resources. There are some example manifests and also the option of installing by Helm on the Kubewatch GitHub However, here we will build our own.

First, create a file “kubewatch-configmap.yml” with the ConfigMap that will be used to configure the Kubewatch container:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubewatch
data:
  .kubewatch.yaml: |
    handler:
      webhook:
        url: https://hooks.slack.com/services/<your_webhook>
    resource:
      deployment: true
      replicationcontroller: true
      replicaset: false
      daemonset: true
      services: true
      pod: false
      job: false
      secret: true
      configmap: true
      persistentvolume: true
      namespace: false

You simply need to enable the types of resources on which you wish to receive notifications with “true” or disable them with “false”. Also set the url of the Incoming WebHook registered previously.

Now, for your container to have access the Kubernetes resources through its api, register the “kubewatch-service-account.yml” file with a Service Account, a Cluster Role and a Cluster Role Binding:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: kubewatch
rules:
- apiGroups: ["*"]
  resources: ["pods", "pods/exec", "replicationcontrollers", "namespaces", "deployments", "deployments/scale", "services", "daemonsets", "secrets", "replicasets", "persistentvolumes"]
  verbs: ["get", "watch", "list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubewatch
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: kubewatch
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubewatch
subjects:
  - kind: ServiceAccount
    name: kubewatch
    namespace: default

Finally, create a “kubewatch.yml” file to deploy the application:

apiVersion: v1
kind: Pod
metadata:
  name: kubewatch
  namespace: default
spec:
  serviceAccountName: kubewatch
  containers:
  - image: bitnami/kubewatch:0.0.4
    imagePullPolicy: Always
    name: kubewatch
    envFrom:
      - configMapRef:
          name: kubewatch
    volumeMounts:
    - name: config-volume
      mountPath: /opt/bitnami/kubewatch/.kubewatch.yaml
      subPath: .kubewatch.yaml
  - image: bitnami/kubectl:1.16.3
    args:
      - proxy
      - "-p"
      - "8080"
    name: proxy
    imagePullPolicy: Always
  restartPolicy: Always
  volumes:
  - name: config-volume
    configMap:
      name: kubewatch
      defaultMode: 0755

You will see that the value of the “mountPath” key will be the file path where the configuration of your ConfigMap will be written within the container (/opt/bitnami/kubewatch/.kubewatch.yaml). You can expand the information on how to mount configurations in Kubernetes here. In this example, you can see that our application deployment will be through a single pod. Obviously, in a production system you would need to define a Deployment with the number of replicas considered appropriate to keep it active, even in case of loss of the pod.

Once the manifests are ready apply them to your cluster:

$ kubectl apply  -f kubewatch-configmap.yml -f kubewatch-service-account.yml -f kubewatch.yml

The service will be ready in a few seconds:

$ kubectl get pods |grep -w kubewatch

kubewatch                                  2/2     Running     0          1m

The Kubewatch pod has two containers associated: Kubewatch and kube-proxy, the latter to connect to the API.

$   kubectl get pod kubewatch  -o jsonpath='{.spec.containers[*].name}'

kubewatch proxy

Check through the logs that the two containers have started up correctly and without error messages:

$ kubectl logs kubewatch kubewatch

==> Config file exists...
level=info msg="Starting kubewatch controller" pkg=kubewatch-daemonset
level=info msg="Starting kubewatch controller" pkg=kubewatch-service
level=info msg="Starting kubewatch controller" pkg="kubewatch-replication controller"
level=info msg="Starting kubewatch controller" pkg="kubewatch-persistent volume"
level=info msg="Starting kubewatch controller" pkg=kubewatch-secret
level=info msg="Starting kubewatch controller" pkg=kubewatch-deployment
level=info msg="Starting kubewatch controller" pkg=kubewatch-namespace
...

$ kubectl logs kubewatch proxy

Starting to serve on 127.0.0.1:8080

You could also access the Kubewatch container to test the cli, view the configuration, etc.:

$  kubectl exec -it kubewatch -c kubewatch /bin/bash

Your event notifier is now ready!

Now you need to test it. Let’s use the creation of a deployment as an example to test proper operation:

$ kubectl create deployment nginx-testing --image=nginx
$ kubectl logs -f  kubewatch kubewatch

level=info msg="Processing update to deployment: default/nginx-testing" pkg=kubewatch-deployment

The logs now alert you that the new event has been detected, so go to your Slack channel to confirm it:

The event has been successfully reported!

Now you can eliminate the test deployment:

$ kubectl delete deploy nginx-testing

Conclusions

Obviously, Kubewatch does not replace the basic warning and monitoring systems that all production orchestrators need to maintain, but it does provide an easy and effective way to extend control over the creation and modification of resources in Kubernetes. In this example case we performed a Kubewatch configuration across the whole cluster, “spying” on all kinds of events, some of which are perhaps useless if the platform is maintained as a service, as we would be aware of each of the pods created, removed or updated by each development team in its own namespace, which is common, legitimate and does not add value. It may be more appropriate to filter by the namespaces for which you wish to receive notifications, such as kube-system, which is where we generally host administrative services and where only administrators should have access. In that case, you would simply need to specify the namespace in your ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubewatch
data:
  .kubewatch.yaml: |
    namespace: "kube-system"
    handler:
      webhook:
        url: https://hooks.slack.com/services/<your_webhook>
    resource:
      deployment: true
      replicationcontroller: true
      replicaset: false

Another interesting utility may be to “listen” to our cluster after a significant configuration adjustment, such as our self-scaling strategy, integration tools and so on, as it will always notify us of the scale ups and scale downs, which could be especially useful initially. In short, Kubewatch extends control over clusters, and we decide the scope we give it. In later articles we will look at how to manage logs and metrics productively.

Do you want to know more about what we offer and to see other success stories?

SOLUTIONS, WE ARE EXPERTS

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Spying on your Kubernetes with Kubewatch

September 14, 2020

MICROSOFT FABRIC: Una nueva solución de análisis de datos, todo en uno

October 16, 2023

Data Mesh

July 27, 2022

Basic AWS Glue concepts

July 22, 2020

Bluetab is certified under the AWS Well-Architected Partner Program

October 19, 2020

Using Large Language Models on Private Information

March 11, 2024

Introduction to HashiCorp products

August 25, 2020 by Bluetab

Introduction to HashiCorp products

At Cloud Practice we look to encourage the use of HashiCorp products and we are going to publish thematic articles on each of them to do this.

Due to the numerous possibilities their products offer, this article will provide an overview, and we will go into detail on each of them in later publications, showing unconventional use cases that show the potential of HashiCorp products.

Why HashiCorp?

HashiCorp has been developing a variety of Open Source products over recent years that offer cross-cutting infrastructure management in cloud and on-premises environments. These products have set standards in infrastructure automation.

Its products now provide robust solutions in provisioning, security, interconnection and workload coordination fields.

The source code for its products is released under MIT licence, which has been well received within the Open Source community (over 100 developers are continually providing improvements)

In addition to the products we present in this article, they also have attractive Enterprise solutions.

As regards the impact of its solutions, Terraform has become a standard on the market. This means HashiCorp is doing things right, so understanding and learning about its solutions is appropriate.

Although we are focusing on their use in the Cloud environment, their solutions are widely used in on-premises environments, but they reach their full potential when working in the cloud.

What are their products?

HashiCorp has the following products:

Terraform: infrastructure as code.
Vagrant: virtual machines for test environments.
Packer: automated building of images.
Nomad: workload “orchestrator”.
Vault: management of secrets and data protection.
Consul: management and discovery of services in distributed environments.

We summarise the main product features below. We will go into detail on each of them in later publications.

Terraform has positioned itself as the most widespread product in the field of provisioning of Infrastructure as Code.

It uses a specific language (HCL) to deploy infrastructure through the use of various Cloud providers. Terraform also lets you manage resources from other technologies or services, such as GitLab or Kubernetes.

Terraform makes use of a declarative language and is based on having deployed exactly what is present in the code.

The typical example shown to see the difference with the paradigm followed, for example, by Ansible (procedural) is:

Initially we want to deploy 5 EC2 instances on AWS:

Ansible:

- ec2:
    count: 5
    image: ami-id
    instance_type: t2.micro

Terraform:

resource "aws_instance" "ejemplo" {
  count         = 5
  ami           = "ami-id"
  instance_type = "t2.micro"
}

As you can see, there are virtually no differences between the coding. Now we need to deploy two more instances to cope with a heavier workload:

Ansible:

- ec2:
    count: 2
    image: ami-id
    instance_type: t2.micro

Terraform:

resource "aws_instance" "ejemplo" {
  count         = 7
  ami           = "ami-id"
  instance_type = "t2.micro"
}

At this point, we can see that while in Ansible we would set the number of instances to be deployed at 2 so that a total of 7 are deployed, in Terraform we would set the number directly to 7 and Terraform knows that it needs to deploy 2 more because there are already 5 deployed.

Another important aspect is that Terraform does not need a Master server such as Chef or Puppet. It is a distributed tool whose common, centralised element is the tfstate file (explained in the next paragraph). It is launched from any site with access to the tfstate (which can be remote, stored in common storage such as AWS S3) and with Terraform installed, which is distributed as a binary downloadable from the HashiCorp website.

The last point to note about Terraform is that it is based on a file called tfstate, in which the information relating to infrastructure status is progressively stored and updated, which it consults to see whether changes need to be made to the infrastructure. It is very important to note that this state is what Terraform knows. Terraform will not connect to AWS to see what is deployed. This means it is necessary to avoid making changes manually to the infrastructure deployed by Terraform (and never Manual changes, ever), as the tfstate file is not updated and inconsistencies will therefore be created.

Vagrant enables you to deploy test environments locally, quickly and very simply, based on code. By default it is used with VirtualBox, but it is compatible with other providers such as VMware or Hyper-V. Machines can be deployed on Cloud providers such as AWS by installing plug-ins. I personally cannot find advantages in using Vagrant for this function.

The machines it deploys are based on boxes, and you indicate which box you want to deploy from the code. Its mode of operating can be compared with Docker in this aspect. However, the basis is completely different. Vagrant creates virtual machines with a virtualisation tool below it (VirtualBox), while Docker deploys containers and its support is a containerisation technology, Vagrant versus Docker.

Typical use cases can be to test Ansible playbooks, recreate a laboratory environment locally in just a few minutes, making a demo, etc.

Vagrant is based on a Vagrantfile. Once located in the directory where this Vagrantfile is located, by running vagrant up, the tool begins to deploy what is indicated within that file.

The steps to launch a virtual machine with Vagrant are:

1. Install Vagrant. It is distributed as a binary file, like all other HashiCorp products.

2. With an example Vagrantfile:

Vagrant.configure("2") do |config|
  config.vm.box = "gbailey/amzn2"
end

PS: (the parameter 2 within configure refers to the version of Vagrant, which I had to check for the post)

3. Run within the directory where the Vagrantfile is located

vagrant up

4. Enter the machine using ssh

vagrant ssh

5. Destroy the machine

vagrant destroy

Vagrant manages ssh access by creating a private key and storing it in the .vagrant directory, apart from other metadata. The machine can be accessed by running vagrant ssh. The machines deployed can also be displayed in the VirtualBox application.

With Packer you can build automated machine images. It can be used, for example, to build an image on a Cloud provider, such as AWS, already with an initial configuration, and to be able to deploy it an indefinite number of times. This means that you only need to provision it once, and when you deploy the instance with that image you already have the desired configuration without having to spend more time provisioning it.

A simple example would be:

1. Install Packer. Similarly, it is a binary that will need to be placed in a folder located on our path.

2. Create a file, e.g. builder.json. A small script is also created in bash (the link shown in builder.json is dummy):

{
  "variables": {
    "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
    "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
    "region": "eu-west-1"
  },
  "builders": [
    {
      "access_key": "{{user `aws_access_key`}}",
      "ami_name": "my-custom-ami-{{timestamp}}",
      "instance_type": "t2.micro",
      "region": "{{user `region`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "source_ami_filter": {
        "filters": {
            "virtualization-type": "hvm",
            "name": "amzn2-ami-hvm-2*",
            "root-device-type": "ebs"
        },
        "owners": ["amazon"],
        "most_recent": true
      },
      "ssh_username": "ec2-user",
      "type": "amazon-ebs"
    }
  ],
  "provisioners": [
    {
        "type": "shell",
        "inline": [
            "curl -sL https://raw.githubusercontent.com/example/master/userdata/ec2-userdata.sh | sudo bash -xe"
        ]
    }
  ]
}

Provisioners use builtin and third-party software to install and configure the machine image after booting. Our example ec2-userdata.sh

yum install -y \
  python3 \
  python3-pip \

pip3 install cowsay

3. Run:

packer build builder.json

And we would already have an AMI provisioned with cowsay installed. Now it will never be necessary for you, as a first step after launching an instance, to install cowsay because we will already have it as a base, as it should be.

As you would expect, Packer not only works with AWS, but also with any Cloud provider such as Azure or GCP. It also works with VirtualBox and VMware and a long list of builders. You can also create an image by nesting builders in the Packer configuration file that is the same for various Cloud providers. This is very interesting for multi-cloud environments, where different configurations need to be replicated.

Nomad is a workload orchestrator. Through its Task Drivers they execute a task in an isolated environment. The most common use case is to orchestrate Docker containers. It has two basic actors: Nomad Server and Nomad Client. The two actors are run with the same binary, but different configurations. Servers organise the cluster, while Agents run the tasks.

A little “Hello world” in Nomad could be:

1. Download and install Nomad and run it (in development mode; this mode must never be used in production):

nomad agent -dev

Once this has been done, you can access the Nomad UI at localhost:4646

2. Run an example job that will launch a Grafana image

nomad job run grafana.nomad

grafana.nomad:

job "grafana" {
    datacenters = [
        "dc1"
    ]

    group "grafana" {
        task "grafana"{
            driver = "docker"

            config {
                image = "grafana/grafana"
            }

            resources {
                cpu = 500
                memory = 256

            network {
                mbits = 10

                port "http" {
                    static = "3000"
                }
            }
            }
        }
    }
}

3. Access localhost:3000 to check that you can access grafana and access the Nomad UI (localhost:4646) to view the job

4. Destroy the job

nomad stop grafana

5. Stop the Nomad agent. If it has been run as indicated here, pressing Control-C on the terminal where it is running will suffice

In short, Nomad is a very light orchestrator, unlike Kubernetes. Logically, it does not have the same features as Kubernetes, but it offers the ability to run workloads at high availability easily and without the need for a large amount of resources. We will deal with examples in greater detail in the post on Nomad to be published in the future.

Vault manages secrets. Users or applications can request secrets or identities through its API. These users are authenticated in Vault, which has a connection to a trusted identity provider such as an Active Directory, for example. Vault works with two types of actors, like the other tools mentioned, Server and Agent.

When Vault starts up, it starts in a sealed state (Seal) and various tokens are generated for unsealing (Unseal). An actual use case would be too long for this article and we will look at this in detail in the article on Vault alone. However, you can download and test it in dev mode as noted previously for Nomad. In this mode, Vault starts up in Unsealed mode, storing everything in memory without need for authentication, without using TLS, and with a single sealing key. You can play with its features and explore it quickly and easily in this way.

See this link for further information.

1. Install Vault. Like the rest of HashiCorp’s products, it is distributed as a binary through its website. Once done, launch (this mode must never be used in production):

vault server -dev

2. Vault will display the root token used to authenticate against Vault. It will also display the single sealing key it creates. Vault is now accessible via localhost:8200

3. In another terminal, export the following variables to be able to conduct tests:

export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_DEV_ROOT_TOKEN_ID="token-showed-when-running-vault-dev"

4. Create a secret (in the terminal where the above variables were exported)

vault kv put secret/test supersecreto=iewdubwef293bd8dn0as90jdasd

5. Recover that secret

vault kv get secret/test

These steps are the Hello world of Vault, to check its operation. We will look in detail at the operation and installation of Vault in the article on it, as well as at further features such as the policies, exploring the UI, etc.

Consul is used to create a services network. It operates in a distributed manner and a Consul client will run in a location where there are services that you want to register. Consul includes Health Checking of services, a key value database and features to secure network traffic. Like Nomad and Vault, Consul has two primary actors, Server and Agent.

Just as we did with Nomad and Vault, we are going to run Consul in dev mode to show a small example defining a service:

1. Install Consul. Create a working folder and within it create a json file with a service to be defined (example folder name: consul.d).

web.json file (HashiCorp website example):

{
  "service": {
    "name": "web",
    "tags": [
      "rails"
    ],
    "port": 80
  }
}

2. Run the Consul agent by pointing it to the configuration directory created before where web.json is located (this mode must never be used in production):

consul agent -dev -enable-script-checks -config-dir=./consul.d

3. From now, although there is nothing actually running on the indicated port (80), a service has been created with a DNS name that can be requested to access it through that DNS name. Its name will be NAME.service.consul. In our case NAME is web. Run to query:

dig @127.0.0.1 -p 8600 web.service.consul

Note: Consul runs by default on port 8600.

Similarly, you can dig for a record of SRV type:

dig @127.0.0.1 -p 8600 web.service.consul SRV

You can also query through the service tags (TAG.NAME.service.consul):

dig @127.0.0.1 -p 8600 rails.web.service.consul

And you can query through the API:

curl http://localhost:8500/v1/catalog/service/web

Final note: Consul is used as a complement to other products, as it is used to create services on other tools already deployed. That is why this example is not too illustrative. Apart from the specific article on Consul, it will be found as an example in other articles, such as the one on Nomad.

Future articles will also explain Consul Service Mesh for interconnection of services via Sidecar proxies, federation of Nomad Datacentres in Consul to perform deployments or intentions, which are used to define rules under which services can communicate with another service.

Conclusions

All these products have great potential in their respective areas and offer cross-cutting management that can help to avoid the famous vendor lock-in by Cloud providers. But the most important thing is their interoperability and compatibility.

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

Jorge de Diego

Cloud DevOps Engineer

My name is Jorge de Diego and I specialise in Cloud environments. I usually work with AWS, although I have knowledge of GCP and Azure. I joined Bluetab in September 2019 and have been working on Cloud DevOps and security tasks since then. I am interested in everything related to technology, especially security models and Infrastructure fields. You can spot me in the office by my shorts.

SOLUTIONS, WE ARE EXPERTS

Tech

Workshop Ingeniería del caos sobre Kubernetes con Litmus

Objetivos del workshop

Preparación de consola

Clonación de repositorio

Creación de entorno de pruebas K8s con minikube

Creación de namespaces K8s

Despliegue de aplicación de test

En PANEL 2 ejecutar:

En PANEL 3 ejecutar:

En PANEL 4 ejecutar:

Despliegue Chaos Experiments

Despliegue servicios monitorización: Prometheus + Grafana

Creación de anotación "litmuschaos"

Detalle componentes de un experimento

Service Account, Role y RoleBinding

Definición ChaosEngine

Especificaciones generales

Especificaciones de componentes

Especificaciones de pruebas

Gestión de experimentos

Ejecución de experimentos

Container Kill

Pod autoscaler

Pod CPU Hog

Extra – Otros experimentos

Planificación de experimentos

LitmusChaos + Load Test Performance con Apache Jmeter

Litmus UI Portal

Guía Litmus para desarrolladores

Consideraciones finales

Referencias

Licencia

Navegación

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

5 common errors in Redshift

What is Redshift?

1. Working as if it were a PostgreSQL

Design the tables appropriately

Use queries optimised for MPP environments

2. Loading data in that way

3. Dimensioning the cluster poorly

4. Not making use of workload management (WLM)

How WLM works

5. Neglecting maintenance.

Do you want to know more about what we offer and to see other success stories?

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Hashicorp Series Boundary

What is Hashicorp Boundary?

How have we tested Boundary?

What did we like?

What would we have liked to see?

What do we still need to test?

Why do we think this product has great future?

One More Thing…

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Cómo depurar una Lambda de AWS en local

Configuración y ejecución

Alternativa: Se puede hacer uso de events/event.json para lanzar la lambda a través de un evento definido por nosotros

Y el código de nuestra función para hacer uso del evento:

De esta manera, invocamos a través del evento en modo debug:

Podemos ir depurando paso a paso, viendo como en este caso se hace uso del evento creado:

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Spying on your Kubernetes with Kubewatch

What is Kubewatch?

Kubewatch integration with Slack

For what types of resources can you get notifications?

When will you receive a notification?

Configuration