Role Overview
We’re seeking a seasoned Senior DevOps Engineer to lead the design, automation, and governance of cloud-native platforms and CI/CD across product teams. You will own enterprise-grade GitHub Actions pipelines, EKS cluster operations, Kafka observability and inspection, Infrastructure as Code (Terraform), and GitOps with Argo CD. You’ll set standards for security posture (including Amazon Inspector), drive application configuration strategies via AWS AppConfig, and build automation with AWS Lambda. The ideal candidate is hands-on, thrives in high-scale environments, and mentors teams toward operational excellence.
Key Responsibilities
Platform Engineering & Reliability
- Own and evolve managed EKS clusters (multi-tenant, multi-AZ): cluster lifecycle, upgrades, node groups, autoscaling, network policies, ingress, and cost governance.
- Implement Kafka inspection & observability: monitor consumer lag, in-sync replicas (ISR), partition health, throughput, and end-to-end pipeline reliability (producers/consumers).
CI/CD & Automation
- Architect GitHub Actions workflows: reusable workflows/actions, matrix builds, environment protections, OIDC to AWS, secret governance, artifact management and versioning.
- Standardize GitOps with Argo CD: app-of-apps pattern, progressive rollouts (e.g., via Argo Rollouts), automated sync, drift detection, rollback playbooks.
- Build event-driven automation with AWS Lambda (e.g., compliance checks, drift remediation, pipeline notifications, auto-triage of Amazon Inspector findings).
Infrastructure as Code & Config
- Lead Terraform module architecture, state strategy (remote backends, workspaces), testing/linting (e.g., tflint/terraform validate), and policy-as-code (OPA/Sentinel).
- Define application configuration via AWS AppConfig: configuration profiles, feature flags, canary/linear deployment strategies, governance and rollback.
Security, Compliance & Monitoring
- Implement Amazon Inspector for EC2 instances and container images (including images deployed to EKS); establish triage workflows, remediation SLAs, and reporting dashboards.
- Enforce IAM least privilege, secure CI/CD via OIDC, container/image signing & scanning, SBOM generation, secrets management.
- Drive SLOs/SLIs with robust observability (CloudWatch/Prometheus/Grafana/OpenTelemetry); optimize alerting to reduce MTTR.
Leadership & Collaboration
- Mentor engineers; review designs/pipelines/IaC; lead blameless postmortems and operational readiness reviews.
- Partner with architecture, platform, and security to deliver scalable, compliant, and cost-effective solutions across environments.
Required Skills & Expertise
- CI/CD & GitOps: GitHub Actions (reusable workflows, environments, OIDC), Argo CD (applications, sync policies, progressive delivery).
- Cloud & Containers: Amazon EKS operations, Helm/Kustomize, container lifecycle, autoscaling, network policies, ingress controllers.
- Streaming & Messaging: Kafka inspection/monitoring (consumer lag, ISR, partitioning, throughput), tooling for observability.
- IaC: Terraform (modules, state backends, workspaces, testing), policy-as-code (OPA/Sentinel), drift detection/remediation.
- Serverless & Config: AWS Lambda (Python/Node.js), AWS AppConfig (feature flags, config governance, safe deployments).
- Security & Monitoring: Amazon Inspector, IAM design, vulnerability management, image scanning/signing, SBOMs, observability stack.
- Scripting: Bash/Python for automation, tooling, and operational workflows.
- Soft Skills: Technical leadership, stakeholder communication, cross-team collaboration, documentation excellence.
Qualifications
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 12+ years in DevOps/SRE/Platform Engineering; 5+ years on AWS; 3+ years on Kubernetes/EKS.
- Proven experience operating mission-critical systems with strong reliability, security, and compliance posture.
Nice-to-Have
- Experience with Crossplane, service mesh (Istio/Linkerd), and multi-account governance with AWS Organizations (Control Tower/SCPs).
- FinOps practices; multi-region DR; blue/green & canary deployments; supply chain security (SLSA); OpenTelemetry.
- Kafka capacity planning, schema governance (e.g., Schema Registry), and disaster recovery strategies.