[Remote] GPU Infrastructure Engineer
India | 4 years | Not disclosed | Remote | Full Time

We are seeking a skilled Site Reliability Engineer (SRE) with a passion for managing large-scale, GPU-accelerated infrastructure. In this role, you’ll help architect and maintain robust, cloud-native platforms that power cutting-edge AI and ML workloads across multi-cloud and on-premise environments.

Responsibilities:
Kubernetes Management: Design, optimize, and maintain scalable multi-cluster Kubernetes deployments across AWS, Google Cloud, and on-prem infrastructure, with potential expansion into Azure or Oracle Cloud environments.
Infrastructure Automation: Use Terraform, Pulumi, and GitOps methodologies (e.g., Argo CD, Flux) to provision and manage cloud-native resources.
CI/CD Pipeline Reliability: Maintain high-availability build and deployment pipelines, ensuring rapid, safe delivery with strong rollback strategies.
GPU Infrastructure Operations: Operate and scale GPU fleets, including NVIDIA driver management, MIG partitioning, auto-scaling, and firmware lifecycle handling. Familiarity with AMD/ROCm and upcoming GPU platforms is a plus.
Monitoring & Observability: Scale and tune observability systems including Prometheus and Grafana. Define and track SLIs/SLOs and enable proactive capacity monitoring.
On-Call & Incident Response: Participate in a rotating on-call schedule, lead incident resolution efforts, and contribute to incident retrospectives and runbook documentation.
Process Development & Mentorship: Help shape and evolve SRE processes, and mentor engineers across the organization on reliability practices and tooling.

Requirements:
4+ years of experience as a Site Reliability Engineer.
Deep understanding of Kubernetes architecture and experience managing large-scale clusters in production.
Strong hands-on skills with AWS and Google Cloud Platform; any exposure to Azure or Oracle Cloud is beneficial.
Proven experience with Infrastructure as Code (Terraform, Pulumi) and GitOps practices.
Solid knowledge of Linux internals, system-level debugging, and networking fundamentals.
Direct experience operating GPU clusters (preferably NVIDIA, including MIG usage); bonus points for experience with ROCm or GPU-focused providers like Lambda or Nebius.
Proficiency with Prometheus and Grafana, and with maintaining observability at scale.
Comfortable working in English (written and spoken).

Benefits and terms:
Paid vacations.
Annual bonus: 1-month salary.
Contract type: This is a full-time position requiring 40 hours per week, but it will be structured as contractor work.
Devices: You will be expected to use your own computer to perform the work.
Sole Employment: No second job is permitted.


Canaan Advisors
