Definitive DevOps Engineer Interview Questions & Answers (2026)

Here is a comprehensive set of interview questions for a DevOps Engineer position in early 2026, encompassing cloud-native practices, GitOps, observability, DevSecOps, FinOps, and SRE principles.

1. DevOps Fundamentals & Culture

What does DevOps mean to you in 2026? How has it evolved since ~2020?

In 2026, DevOps is less about just CI/CD and "breaking silos" (which is now baseline) and more about Platform Engineering, DevSecOps, and SRE. It's about providing self-service paved roads for developers, treating infrastructure as software, and driving observable, reliable, and secure product delivery at scale without cognitive overload.

Explain the CALMS framework. How do you apply it in practice?

CALMS stands for Culture, Automation, Lean, Measurement, and Sharing. I apply it by fostering a blameless post-mortem culture, automating toil, optimizing flow (reducing batch sizes), tightly measuring DORA metrics, and sharing knowledge through internal tech talks and runbooks.

What is the difference between DevOps, SRE, and Platform Engineering?

DevOps is a philosophy and set of practices bridging development and operations. SRE (Site Reliability Engineering) is a specific implementation of DevOps focused on reliability, built around SLIs, SLOs, SLAs, and error budgets. Platform Engineering builds internal developer platforms (IDPs) that abstract infrastructure and offer self-service capabilities to developers.

How do you measure DevOps success / DORA metrics in 2026?

Through the four key DORA metrics: Deployment Frequency (multiple deploys per day), Lead Time for Changes (under an hour), Mean Time to Recovery (MTTR, which recent DORA reports call failed deployment recovery time; target under 15 minutes), and Change Failure Rate (under 5%). These are aggressive team targets rather than DORA's published thresholds. High performers automate the collection of these metrics with platforms like LinearB or Datadog.
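To make the four definitions concrete, here is a minimal sketch that computes them from a list of deployment records. The Deployment record and its field names are assumptions made for illustration, not the schema of LinearB, Datadog, or any other tool, and the function assumes at least one deployment in the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; real tools expose their own schemas.
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # did it cause an incident or rollback?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics over a fixed observation window."""
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }
```

The median is used for lead time so that one unusually slow change does not dominate the figure; a mean would also be a defensible choice.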

2. Version Control & Collaboration (Git)

Explain Git rebase vs merge. When would you force-push in a team setting?

git merge joins two histories, creating a merge commit unless it can fast-forward, and preserves branch topology. git rebase rewrites history by replaying commits onto the tip of another branch, resulting in a clean, linear history. Force-pushing (with git push --force-with-lease, which refuses to overwrite remote commits you have not yet fetched) is only acceptable on personal feature branches after a rebase, never on shared branches like main.

What is GitOps? Compare ArgoCD vs Flux.

GitOps uses a Git repository as the single source of truth for declarative infrastructure and applications. ArgoCD provides a robust UI, multi-cluster management, and SSO, making it great for enterprise teams. Flux is more tightly integrated into the Kubernetes API, highly composable, lightweight, and excels in programmatic automation via Helm/Kustomize controllers.

How do you handle secrets in Git repositories?

Never commit secrets in plain text. Use External Secrets Operator to sync them from AWS Secrets Manager or Vault, or encrypt them in Git with SOPS (originally a Mozilla project) or Bitnami Sealed Secrets, so that they are decrypted only inside the cluster by a controller.

3. CI/CD & Automation

Design a modern CI/CD pipeline for a cloud-native microservices app.

CI (GitHub Actions): Linting, Unit Tests, SAST (e.g., SonarQube), Build Docker Image, Container Scan (Trivy), Push to OCI Registry, Sign Image (Cosign).
CD (ArgoCD): Update the image tag in the Git Manifests repo. ArgoCD detects drift and syncs the new deployment using a progressive delivery controller (Argo Rollouts) for an automated canary release.
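The hand-off between CI and CD is the image tag update in the manifests repo. The following is a rough sketch of that step as a text rewrite; the function name, registry host, and image names are hypothetical, and in practice teams more often use a dedicated tool such as kustomize edit set image rather than a regular expression.

```python
import re


def set_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Return the manifest text with references to `image` pinned to new_tag."""
    # Match "image: <image>" with an optional ":tag", but not longer image
    # names that merely share the same prefix (e.g. "api" vs "api-gateway").
    pattern = re.compile(rf"(image:[ \t]*{re.escape(image)})(?::\S+)?(?=\s|$)")
    return pattern.sub(lambda m: f"{m.group(1)}:{new_tag}", manifest)


if __name__ == "__main__":
    before = "    image: registry.example.com/shop/api:v1.2.3\n"
    print(set_image_tag(before, "registry.example.com/shop/api", "v1.2.4"), end="")
```

CI would commit the rewritten file to the manifests repo, and the GitOps controller picks up the change from there.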

How do you implement blue-green or canary delivery? Which signals trigger rollback?

Use Argo Rollouts or Flagger. Blue-green runs the new version alongside the old one and switches all traffic at once; canary shifts traffic in steps (e.g., 10% -> 50% -> 100%). Rollbacks are triggered automatically by evaluating Prometheus queries against the Golden Signals (e.g., a spike in the 5xx error rate, or p99 latency exceeding 200ms).
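The rollback decision at each canary step can be pictured as a small function over the metrics for that interval. This is only an illustrative sketch, not how Argo Rollouts or Flagger are configured: the CanaryMetrics shape, the 1% error threshold, and the minimum request count are assumptions, while the 200ms p99 limit comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    # Hypothetical summary of one analysis interval for the canary pods.
    requests: int
    errors_5xx: int
    p99_latency_ms: float


def canary_verdict(m: CanaryMetrics, max_error_rate: float = 0.01,
                   max_p99_ms: float = 200.0, min_requests: int = 100) -> str:
    """Return 'promote', 'rollback', or 'wait' for one analysis interval."""
    if m.requests < min_requests:
        return "wait"  # too little traffic to judge either way
    error_rate = m.errors_5xx / m.requests
    if error_rate > max_error_rate or m.p99_latency_ms > max_p99_ms:
        return "rollback"
    return "promote"
```

The "wait" outcome matters in practice: judging a canary on a handful of requests produces noisy promotions and rollbacks.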

Explain artifact promotion vs immutable artifacts.

Immutable artifacts mean that a Docker image built once (e.g., v1.2.3) is never overwritten. Artifact promotion moves that exact image digest through environments (Dev -> Staging -> Prod) rather than rebuilding it, guaranteeing that what you tested is exactly what runs in production.
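A toy model may help separate the two ideas. The ToyRegistry class below is an in-memory stand-in invented for this sketch, not a registry client, and it hashes raw bytes where a real OCI registry addresses an image by the digest of its manifest.

```python
import hashlib


def content_digest(artifact: bytes) -> str:
    """Content address in the sha256:<hex> form that OCI registries use."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


class ToyRegistry:
    """In-memory stand-in for a registry (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.tags: dict[str, str] = {}          # version tag -> digest
        self.environments: dict[str, str] = {}  # environment -> digest

    def push(self, tag: str, artifact: bytes) -> str:
        # Immutability: a published version tag can never be re-pointed.
        if tag in self.tags:
            raise ValueError(f"tag {tag!r} already exists and is immutable")
        self.tags[tag] = content_digest(artifact)
        return self.tags[tag]

    def promote(self, digest: str, environment: str) -> None:
        # Promotion: point the environment at an existing digest, never rebuild.
        if digest not in self.tags.values():
            raise KeyError(f"unknown digest {digest}")
        self.environments[environment] = digest
```

Promoting by digest rather than by tag is the design choice that carries the guarantee: a tag is a mutable pointer in most registries, while a digest changes whenever the content does.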

4. Containers & Orchestration

Docker best practices in production.

Use multi-stage builds to minimize image size and attack surface. Run containers as non-root users. Leave shells and OS package managers out of final images where possible (e.g., distroless bases; Alpine is small but still ships the apk package manager). Scan images for CVEs in CI before pushing them to the registry.

Explain Kubernetes pod lifecycle and probes (liveness, readiness, startup).

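Startup probe: Checks whether the app has finished initializing, and holds off the other probes until it succeeds (protects slow-starting legacy apps). Readiness probe: Checks whether the pod can accept traffic (adds/removes it from Service endpoints). Liveness probe: Checks whether the container is deadlocked; the kubelet restarts the container, not the whole pod, once the probe has failed failureThreshold times in a row.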
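Kubernetes probes act on consecutive failures rather than on a single failed check: failureThreshold defaults to 3. The sketch below simulates that counting for a liveness probe. The function name is made up for illustration, and it ignores periodSeconds, timeouts, and restart backoff.

```python
def restarts_after(probe_results: list[bool], failure_threshold: int = 3) -> int:
    """Count the container restarts a liveness probe would trigger.

    probe_results holds one entry per probe period: True for a passing
    check, False for a failing one.
    """
    restarts = 0
    consecutive_failures = 0
    for ok in probe_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # counting starts over after a restart
    return restarts
```

A single success resets the counter, which is why an app that fails intermittently may never be restarted by its liveness probe.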

Describe HPA, Cluster Autoscaler, and VPA.

HPA scales Pod replicas based on CPU/Memory or custom metrics (via Prometheus adapter). VPA resizes Pod requests/limits (CPU/Memory) based on historical usage observations. Cluster Autoscaler / Karpenter rapidly provisions or removes actual Worker Nodes when Pods are unschedulable or nodes are underutilized.
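The core of the HPA's scaling decision is a ratio documented by Kubernetes: desired replicas = ceil(current replicas * current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). A minimal sketch of that arithmetic follows; it leaves out stabilization windows, scaling policies, and the handling of not-yet-ready pods.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int = 1,
                         max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 60% target scale to 6, while the same 4 replicas at 63% stay put because the ratio of 1.05 is inside the tolerance.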

5. Infrastructure as Code (IaC) & Configuration Management

Terraform vs Pulumi vs Crossplane in 2026.

Terraform remains the industry-standard declarative IaC tool, written in HCL. Pulumi lets you write IaC in general-purpose languages (TypeScript, Python), bringing standard software engineering practices such as unit testing to IaC. Crossplane provisions infrastructure as a Kubernetes controller, enabling GitOps for cloud resources via Custom Resource Definitions (CRDs).

How do you manage Terraform state securely at scale?

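Store state remotely in an S3 bucket (or Azure Blob Storage), encrypted at rest with KMS. Enable state locking so that concurrent runs cannot corrupt the state: traditionally a DynamoDB table, or the S3 backend's native lock file (use_lockfile) in recent Terraform releases. Restrict access with narrowly scoped IAM roles.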

6. Cloud & Multi-Cloud

Explain FinOps practices you’ve implemented.

Establishing a strict tagging taxonomy for granular cost allocation. Deploying Kubecost to attribute Kubernetes spend per namespace. Running stateless workloads on Spot instances via Karpenter. Automating start/stop schedules for Dev environments, and setting up cost anomaly alerts in billing.
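Tag-based cost allocation comes down to grouping billing line items by an owner tag and making the untagged remainder visible. The sketch below assumes a simplified line-item shape (a dict with "cost_usd" and "tags" keys) and a "team" tag key; real billing exports have richer and different schemas.

```python
from collections import defaultdict


def allocate_costs(line_items: list[dict], tag_key: str = "team") -> dict[str, float]:
    """Sum cost per value of tag_key; items without the tag go to 'untagged'."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    items = [
        {"cost_usd": 10.0, "tags": {"team": "checkout"}},
        {"cost_usd": 2.5, "tags": {"team": "checkout"}},
        {"cost_usd": 5.5, "tags": {}},
    ]
    print(allocate_costs(items))
```

Tracking the size of the "untagged" bucket over time is a simple way to measure how well the tagging taxonomy is being enforced.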

Describe IAM least-privilege strategy in a large organization.

Ban permanent static credentials. Use tightly scoped OIDC federation (e.g., GitHub Actions to AWS IAM via an OIDC provider) for CI/CD. Use IRSA (IAM Roles for Service Accounts) for pods on EKS. Grant elevated access only through role assumption (STS), bounded by strict condition keys and short session durations.
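For the GitHub Actions case, the scoping lives in the IAM role's trust policy, which pins the audience and the repository and branch allowed to assume the role. The sketch below builds such a policy document as a Python dict. The function name, account ID, and repository used with it are placeholders, and a real setup also needs the OIDC provider registered in the account plus a permissions policy on the role.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build a trust policy letting one repo and branch assume a role via OIDC."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{provider}:aud": "sts.amazonaws.com",
                    # Exact match on repo and branch; wildcards here widen access.
                    f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


if __name__ == "__main__":
    # Placeholder account ID and repository, for illustration only.
    policy = github_oidc_trust_policy("123456789012", "example-org/example-repo")
    print(json.dumps(policy, indent=2))
```

Matching the sub claim exactly, rather than with a wildcard, is what keeps a fork or an unrelated branch from assuming the role.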

7. Observability, Monitoring & Reliability

What is the modern observability stack in 2026?

OpenTelemetry (OTel) as the vendor-neutral instrumentation and collector layer for traces, metrics, and logs, feeding backends such as Prometheus (metrics), Tempo or Honeycomb (traces), and Loki (logs), with Grafana for dashboards.

Explain the three pillars of observability + RED/USE signals.

Pillars: Logs (events), Metrics (aggregations), Traces (request flow).
RED (Services): Rate, Errors, Duration.
USE (Resources): Utilization, Saturation, Errors.
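As a rough illustration of the RED signals, the sketch below derives them from a batch of request records for one service. The record shape (a dict with "status" and "duration_ms") is an assumption, and production systems compute these from counters and histograms rather than raw per-request lists.

```python
def red_signals(requests: list[dict], window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration for one service over a window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": None}
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    # Nearest-rank p99 with integer math: rank = ceil(0.99 * n).
    rank = -(-99 * len(durations) // 100)
    return {
        "rate_per_s": len(requests) / window_seconds,  # Rate
        "error_ratio": errors / len(requests),         # Errors
        "p99_ms": durations[rank - 1],                 # Duration
    }
```

Duration is reported as a high percentile rather than a mean because tail latency is what users on the slow requests actually experience.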

How do you implement SLOs, SLIs, error budgets?

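SLI (Indicator): A quantitative measure of service behavior, usually the ratio of good events to total events (e.g., the share of requests that return a non-5xx response in under 200ms).
SLO (Objective): The target for that SLI over a window (e.g., 99.9% of requests are good over 30 days).
Error Budget: The remaining 0.1% of events that are allowed to fail. Once it is depleted, a feature freeze (automated or policy-driven) shifts the team's focus to reliability work.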
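The arithmetic behind an error budget is short enough to sketch. The function below assumes an event-based SLI (good events over total events) and treats a fully spent budget as the trigger for a freeze; the function and field names are illustrative.

```python
def error_budget(slo: float, good_events: int, total_events: int) -> dict:
    """Report SLI, allowed failures, and remaining budget for one SLO window."""
    sli = good_events / total_events
    allowed_bad = (1.0 - slo) * total_events  # the budget, in events
    actual_bad = total_events - good_events
    remaining = 1.0 - actual_bad / allowed_bad  # 1.0 = untouched, <= 0 = spent
    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad,
        "budget_remaining": remaining,
        "freeze_features": remaining <= 0,
    }
```

With a 99.9% SLO over 1,000,000 requests, about 1,000 may fail; 500 failures leaves roughly half the budget, and 1,100 failures overspends it. For a time-based SLO the same 0.1% over 30 days is about 43 minutes of downtime.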

8. Security (DevSecOps)

How do you shift security left in the pipeline?

Integrate automated checks at the pull request level: SAST (SonarQube) for static code analysis, secret scanning (TruffleHog), IaC linting (Checkov), and container CVE scanning (Trivy), blocking merges automatically on critical vulnerabilities.

Explain OPA/Gatekeeper vs Kyverno for K8s policy.

Both run as validating and mutating admission webhooks. OPA/Gatekeeper uses Rego, a domain-specific policy language that is powerful but has a steep learning curve. Kyverno expresses policies as native Kubernetes YAML resources, making it far more accessible for typical cluster admins.

What is supply chain security (SLSA framework, Sigstore)?

Securing how software is built and delivered so that it cannot be tampered with along the way. Use Sigstore (Cosign) to cryptographically sign container images in CI, and verify that signature with Kyverno before the cluster admits the image. SLSA defines levels of increasingly strict build provenance guarantees.

9. Advanced / System Design

Design a zero-downtime deployment system for a global e-commerce platform.

Use an active-active multi-region architecture with Route53 latency-based routing. Deploy the microservices to EKS. Use Argo Rollouts for automated canary deployments, continuously validated against Datadog metrics, with automated per-region rollback. Databases use multi-region read replicas, and schema migrations are backward-compatible and applied ahead of the code deploys that depend on them.

You get paged at 3 AM — API latency spiked 10x. Walk through your troubleshooting process.

1. Acknowledge the page.
2. Open the primary Grafana dashboard to isolate the affected microservice and endpoint, checking the RED signals.
3. Check OTel distributed traces to determine whether the latency is compute-bound, network-bound, or database-bound (e.g., a missing index behind an N+1 query).
4. If an immediate mitigation exists (roll back the recent deploy, scale up replicas), apply it first to stop the bleeding.
5. Perform root cause analysis and write a blameless post-mortem with the team the next day.