Kubernetes Management:
Design, optimize, and maintain scalable multi-cluster Kubernetes deployments across AWS, Google Cloud, and on-prem infrastructure. Potential expansion into Azure or Oracle Cloud environments.
Infrastructure Automation:
Use Terraform, Pulumi, and GitOps methodologies (e.g., Argo CD, Flux) to provision and manage cloud-native resources.
CI/CD Pipeline Reliability:
Maintain high-availability build and deployment pipelines, ensuring rapid, safe delivery with strong rollback strategies.
GPU Infrastructure Operations:
Operate and scale GPU fleets with NVIDIA driver management, MIG partitioning, auto-scaling, and firmware lifecycle handling. Familiarity with AMD/ROCm and upcoming GPU platforms is a plus.
Monitoring & Observability:
Scale and tune observability systems including Prometheus and Grafana. Define and track SLIs/SLOs and enable proactive capacity monitoring.
On-Call & Incident Response:
Participate in a rotating on-call schedule, lead incident resolution efforts, and contribute to incident retrospectives and runbook documentation.
Process Development & Mentorship:
Help shape and evolve SRE processes, and mentor engineers across the organization on reliability practices and tooling.

<Your Background>

4+ years in Site Reliability Engineering.
Deep understanding of Kubernetes architecture and experience managing large-scale clusters in production.
Strong hands-on skills with AWS and Google Cloud Platform; any exposure to Azure or Oracle Cloud is beneficial.
Proven experience with Infrastructure as Code (Terraform, Pulumi) and GitOps practices.
Solid knowledge of Linux internals, system-level debugging, and networking fundamentals.
Direct experience operating GPU clusters (preferably NVIDIA, including MIG usage); bonus points for experience with ROCm or GPU-focused providers like Lambda or Nebius.
Proficiency with Prometheus, Grafana, and maintaining observability at scale.
Comfortable working in English (written and spoken).

<Benefits>

Paid vacation
Annual Bonus: 1-month salary

<Note>

This is a full-time position requiring 40 hours per week, but it will be structured as contractor work.
Devices: You will be expected to use your own computer to perform the work.
Sole Employment: No second job is permitted.
