Role Overview
We are looking for a Site Reliability Engineer with a strong developer mindset and 4–5 years of experience to help us build and operate scalable, reliable systems. This is not a "just keep the lights on" SRE role: you'll be writing production-grade code, building tools, automating platforms, and contributing directly to both infrastructure and developer productivity.
The ideal candidate brings together SRE discipline, deep networking knowledge, Kubernetes expertise, and solid software engineering skills, with an understanding of GPU-based workloads and AI/ML inference/training pipelines.
About the Platform
You'll be working on a cutting-edge platform designed to train and serve large-scale machine learning models. The platform supports everything from small-scale experimentation to massive, distributed training jobs running on GPU clusters. It provides developers and ML engineers with the tools to quickly onboard, monitor, and scale their workloads, whether it's a lightweight prototype or a production-grade deep learning model powering real-world applications.
Key platform features include:
- Dynamic GPU orchestration using Kubernetes (see the sketch after this list)
- Built-in support for training and inference workflows
- End-to-end observability and cost tracking
- High developer velocity via self-service tooling
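To make the dynamic GPU orchestration feature above concrete, here is a minimal sketch of scheduling a single-GPU workload with the official Kubernetes Python client. The image, namespace, and GPU count are illustrative assumptions, not details of this platform.

```python
# Hypothetical sketch: request an NVIDIA GPU for a pod via the device-plugin
# resource "nvidia.com/gpu". Names and the container image are illustrative.
from kubernetes import client, config


def launch_gpu_pod(name: str, image: str, gpus: int = 1) -> None:
    """Create a pod that requests `gpus` NVIDIA GPUs in the default namespace."""
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    container = client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": str(gpus)},  # GPUs are requested via limits
        ),
    )
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)


if __name__ == "__main__":
    # Illustrative image; any CUDA-capable image would do.
    launch_gpu_pod("gpu-smoke-test", "nvcr.io/nvidia/pytorch:24.01-py3")
```

A real platform layers scheduling policy and the cost tracking mentioned above on top of this primitive; the sketch only shows the basic resource request.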
Your contributions will directly impact the reliability, scalability, and efficiency of this platform and enable AI teams to innovate faster.
What You'll Do
- Build tools, APIs, and platforms to improve reliability, deployment, and performance.
- Architect and scale Kubernetes-based infrastructure for high-performance workloads (including GPU).
- Write clean, efficient, and maintainable code in Python, Go, or similar languages.
- Own and evolve CI/CD pipelines, infrastructure-as-code systems (Terraform, Helm), and service observability (see the sketch after this list).
- Troubleshoot and resolve complex system, network, and application-level issues.
- Participate in blameless incident response, root cause analysis, and reliability reviews.
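As one concrete flavor of the observability work in the list above, here is a minimal sketch of instrumenting a Python service with the prometheus_client library. The metric names and endpoint are hypothetical and only illustrate the pattern.

```python
# Hypothetical sketch: expose request count and latency metrics for scraping
# by Prometheus. Metric and endpoint names are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS_TOTAL = Counter(
    "app_requests_total", "Total requests handled", ["endpoint"]
)
REQUEST_LATENCY = Histogram(
    "app_request_latency_seconds", "Request latency in seconds", ["endpoint"]
)


def handle_request(endpoint: str) -> None:
    """Simulate handling a request and record metrics for it."""
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS_TOTAL.labels(endpoint=endpoint).inc()


if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at :8000/metrics
    while True:
        handle_request("/predict")
```

Dashboards (Grafana) and traces (OpenTelemetry) build on the same idea: the service emits structured signals, and the platform aggregates them.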
What You Bring
Core Requirements:
- 4–5 years of experience in SRE, DevOps, or platform engineering roles.
- Strong development background with fluency in Python, Go, or similar languages.
- Solid understanding of Kubernetes internals, workload orchestration, and Helm.
- Deep knowledge of networking fundamentals: DNS, TCP/IP, routing, VPNs, firewalls.
- Experience with infrastructure automation and configuration management.
- Understanding of GPU scheduling, resource allocation, and NVIDIA ecosystem tools.
- Familiarity with service mesh, observability stacks (Prometheus, Grafana, OpenTelemetry), and cloud-native patterns.
Bonus Points:
- Experience supporting AI/ML pipelines, especially GPU-based training/inference.
- Contributions to open-source projects or internal developer platforms.
Why This Role
- You'll write software, not just YAML.
- You'll get to work on real AI infrastructure challenges (not just buzzwords).
- You'll have impact across developer productivity, platform scalability, and service reliability.
- You'll join a team that values code quality, systems thinking, and ownership.