SDE III - GPU Engineer

4 - 9 years

37 - 45 Lacs

Posted: 4 weeks ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

We are looking for a Senior Software Engineer (SDE III) who will build, profile, and optimize GPU workloads powering next-generation generative AI experiences, from Stable Diffusion image generation to transformer-based multimodal models. You'll work closely with research and infrastructure teams to make model inference faster, more cost-efficient, and production-ready.

This role is ideal for engineers passionate about pushing GPUs to their limits, writing high-performance kernels, and turning cutting-edge research into scalable systems.

Key Responsibilities

  • Develop, optimize, and maintain GPU kernels (CUDA, Triton, ROCm) for diffusion, attention, and convolution operators.
  • Profile end-to-end inference pipelines (data movement, kernel scheduling, memory transfers) to identify and resolve bottlenecks.
  • Apply techniques like operator fusion, tiling, caching, and mixed-precision compute to maximize GPU throughput.
  • Collaborate with researchers to productionize experimental layers or model architectures.
  • Build benchmarking tools and micro-tests for latency, memory, and throughput regressions.
  • Integrate kernel improvements into serving stacks, ensuring reliability and repeatable performance.
  • Work with platform teams to tune runtime configurations and job scheduling for GPU utilization.

Required Qualifications

  • 4+ years of experience in systems or ML engineering, with 2+ years working on GPU or accelerator optimization.
  • Strong hands-on skills with CUDA programming, memory hierarchies, warps, threads, and shared memory.
  • Familiarity with profiling tools (Nsight, nvprof, CUPTI) and performance analysis.
  • Working knowledge of PyTorch, JAX, or TensorFlow internals.
  • Proficiency in C++ and Python.
  • Experience with mixed precision, FP16/BF16, or quantization.
  • Deep curiosity about system bottlenecks and numerical correctness.

Preferred Qualifications

  • Experience building fused operators or integrating custom kernels with PyTorch extensions.
  • Understanding of NCCL / distributed inference frameworks.
  • Contributions to open-source GPU or compiler projects (Triton, TVM, XLA, TensorRT).
  • Familiarity with multi-GPU / multi-node training and inference setups.

Glance

Technology, Mobile Advertising
