Canaan Advisors

1 job opening at Canaan Advisors
[Remote] GPU Infrastructure Engineer
India | 4 years | Not disclosed | Remote | Full Time

We are seeking a skilled Site Reliability Engineer (SRE) with a passion for managing large-scale, GPU-accelerated infrastructure. In this role, you’ll help architect and maintain robust, cloud-native platforms that power cutting-edge AI and ML workloads across multi-cloud and on-premise environments.

Kubernetes Management: Design, optimize, and maintain scalable multi-cluster Kubernetes deployments across AWS, Google Cloud, and on-prem infrastructure, with potential expansion into Azure or Oracle Cloud environments.
Infrastructure Automation: Use Terraform, Pulumi, and GitOps methodologies (e.g., Argo CD, Flux) to provision and manage cloud-native resources.
CI/CD Pipeline Reliability: Maintain high-availability build and deployment pipelines, ensuring rapid, safe delivery with strong rollback strategies.
GPU Infrastructure Operations: Operate and scale GPU fleets with NVIDIA driver management, MIG partitioning, auto-scaling, and firmware lifecycle handling. Familiarity with AMD/ROCm and upcoming GPU platforms is a plus.
Monitoring & Observability: Scale and tune observability systems including Prometheus and Grafana. Define and track SLIs/SLOs and enable proactive capacity monitoring.
On-Call & Incident Response: Participate in a rotating on-call schedule, lead incident resolution efforts, and contribute to incident retrospectives and runbook documentation.
Process Development & Mentorship: Help shape and evolve SRE processes, and mentor engineers across the organization on reliability practices and tooling.

4+ years of experience as a Site Reliability Engineer.
Deep understanding of Kubernetes architecture and experience managing large-scale clusters in production.
Strong hands-on skills with AWS and Google Cloud Platform; any exposure to Azure or Oracle Cloud is beneficial.
Proven experience with Infrastructure as Code (Terraform, Pulumi) and GitOps practices.
Solid knowledge of Linux internals, system-level debugging, and networking fundamentals.
Direct experience operating GPU clusters (preferably NVIDIA, including MIG usage); bonus points for experience with ROCm or GPU-focused providers like Lambda or Nebius.
Proficiency with Prometheus and Grafana, and with maintaining observability at scale.
Comfortable working in English (written and spoken).

Paid Vacations
Annual Bonus: 1-month salary

This is a full-time position requiring 40 hours per week, but it will be structured as contractor work.
Devices: You will be expected to use your own computer to perform the work.
Sole Employment: No second job is permitted.