The Role:
We are seeking a Site Reliability Engineer (SRE) to ensure our multi-cloud networking platform meets and exceeds the stringent reliability, performance, and availability targets our enterprise customers demand. This is not a traditional operations role: you will apply a software engineering mindset to solve complex infrastructure challenges and automate solutions at scale. You will be the guardian of our production environment, responsible for the uptime of our services and the architect of the systems that allow us to scale with confidence. Your work is critical to building and maintaining the trust of our customers.
Responsibilities:
Define and Manage Reliability: Establish and own the Service Level Objectives (SLOs) and Service Level Indicators (SLIs) that define the reliability of our platform. Participate in an on-call rotation to manage and resolve production incidents, and contribute to a culture of blameless post-incident analysis.
Build and Own the Observability Stack: Design, implement, and manage our complete observability stack, leveraging tools like Prometheus for metrics, Grafana for visualization, Elasticsearch for logging, and Jaeger/OpenTelemetry for distributed tracing to provide end-to-end visibility into our distributed system.
Automate Everything: Write robust automation and tooling in Python or Go to eliminate manual operational tasks, from incident response to infrastructure provisioning.
Infrastructure as Code (IaC): Use Terraform and Ansible to manage our multi-cloud infrastructure as code, ensuring our environments are consistent, repeatable, and auditable.
Kubernetes and Cloud Operations: Manage, troubleshoot, and scale our Kubernetes clusters across our multi-cloud footprint (AWS, Azure, GCP). You will be the expert on running our application reliably in a containerized environment.
CI/CD and Release Engineering: Collaborate with development teams to enhance our CI/CD pipelines so that every release is safe and reliable and can be deployed with high velocity.
Required Qualifications:
3-5+ years of experience in a Site Reliability Engineering (SRE), DevOps, or similar infrastructure-focused software engineering role.
Strong programming and automation skills in Python or Go.
Deep, hands-on expertise with a modern observability stack, including Prometheus, Grafana, and the ELK or EFK Stack (Elasticsearch, Logstash or Fluentd, Kibana).
Proven experience with Infrastructure as Code (Terraform) and configuration management (Ansible).
In-depth knowledge of running, managing, and troubleshooting applications on Kubernetes in a production, multi-cloud environment.
A rigorous, data-driven approach to reliability and a deep understanding of distributed systems, their failure modes, and how to make them resilient.
Preferred Qualifications:
Experience with distributed tracing using Jaeger or OpenTelemetry.
A strong understanding of cloud networking concepts (VPCs, subnets, routing, security groups).
Experience defining and tracking SLOs and error budgets.
Experience in a fast-paced startup environment.
Relevant certifications such as Certified Kubernetes Administrator (CKA) or cloud provider certifications (AWS, Azure, GCP).