[Remote] GPU Infrastructure Engineer

4 years

0 Lacs

Posted: 5 days ago | Platform: LinkedIn

Work Mode: Remote

Job Type: Full Time

Job Description

Site Reliability Engineer (SRE)

<What you'll do>

  • Kubernetes Management:

    Design, optimize, and maintain scalable multi-cluster Kubernetes deployments across AWS, Google Cloud, and on-prem infrastructure, with potential expansion into Azure or Oracle Cloud environments.
  • Infrastructure Automation:

    Use Terraform, Pulumi, and GitOps methodologies (e.g., Argo CD, Flux) to provision and manage cloud-native resources.
  • CI/CD Pipeline Reliability:

    Maintain high-availability build and deployment pipelines, ensuring rapid, safe delivery with strong rollback strategies.
  • GPU Infrastructure Operations:

    Operate and scale GPU fleets with NVIDIA driver management, MIG partitioning, auto-scaling, and firmware lifecycle handling. Familiarity with AMD/ROCm and upcoming GPU platforms is a plus.
  • Monitoring & Observability:

    Scale and tune observability systems including Prometheus and Grafana. Define and track SLIs/SLOs and enable proactive capacity monitoring.
  • On-Call & Incident Response:

    Participate in a rotating on-call schedule, lead incident resolution efforts, and contribute to incident retrospectives and runbook documentation.
  • Process Development & Mentorship:

    Help shape and evolve SRE processes, and mentor engineers across the organization on reliability practices and tooling.


<Your Background>

  • 4+ years as a Site Reliability Engineer

  • Deep understanding of Kubernetes architecture and experience managing large-scale clusters in production.
  • Strong hands-on skills with AWS and Google Cloud Platform; any exposure to Azure or Oracle Cloud is beneficial.
  • Proven experience with Infrastructure as Code (Terraform, Pulumi) and GitOps practices.
  • Solid knowledge of Linux internals, system-level debugging, and networking fundamentals.
  • Direct experience operating GPU clusters (preferably NVIDIA, including MIG usage); bonus points for experience with ROCm or GPU-focused providers like Lambda or Nebius.
  • Proficiency with Prometheus, Grafana, and maintaining observability at scale.
  • Comfortable working in English (written and spoken).


<Benefits>

  • Paid vacation
  • Annual Bonus: 1-month salary


<Note>

  • This is a full-time position requiring 40 hours per week, but it will be structured as contractor work.
  • Devices: You will be expected to use your own computer to perform the work.
  • Sole Employment: No second job is permitted.


