AI Runtime Lead (LLM DevOps, PyTorch)

Experience: 5 - 7 years

Salary: 0 Lacs

Posted: 2 days ago | Platform: Foundit


Work Mode: On-site

Job Type: Full Time

Job Description

Role & Responsibilities

As Lead/Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond). This is a hands-on leadership role, ideal for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You'll own critical components of our PyTorch-based stack, set technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.

What You'll Do

Lead Runtime Design & Development

  • Own the core runtime architecture supporting AI training and inference at scale.
  • Design resilient and elastic runtime features (e.g. dynamic node scaling, job recovery) within our custom PyTorch stack (an illustrative sketch follows this list).
  • Optimize distributed training reliability, orchestration, and job-level fault tolerance.
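
For flavor, here is a minimal sketch of the kind of resume-capable distributed training entry point this work involves, assuming a torchrun-launched job; the model, data, and checkpoint path are hypothetical placeholders, not our actual stack:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    CKPT = "/tmp/runtime_demo_ckpt.pt"  # hypothetical checkpoint path

    def main():
        # torchrun supplies RANK, WORLD_SIZE, LOCAL_RANK and rendezvous env vars.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)       # stand-in model
        model = DDP(model, device_ids=[local_rank])
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        # Job recovery: resume from the last published checkpoint after a restart.
        start_step = 0
        if os.path.exists(CKPT):
            state = torch.load(CKPT, map_location=f"cuda:{local_rank}")
            model.module.load_state_dict(state["model"])
            opt.load_state_dict(state["opt"])
            start_step = state["step"] + 1

        for step in range(start_step, 1000):
            x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # stand-in batch
            loss = model(x).square().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

            # Periodic fault-tolerance checkpoint, published atomically by rank 0.
            if step % 100 == 0:
                if dist.get_rank() == 0:
                    tmp = CKPT + ".tmp"
                    torch.save({"model": model.module.state_dict(),
                                "opt": opt.state_dict(),
                                "step": step}, tmp)
                    os.replace(tmp, CKPT)
                dist.barrier()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

A job like this would be launched with something like torchrun --nproc_per_node=8 train.py; elastic node scaling and automatic restarts would typically be layered on via torchrun's elastic options rather than hand-rolled in the loop.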

Drive Performance At Scale

  • Profile and enhance low-level system performance across training and inference pipelines (see the profiling sketch after this list).
  • Improve packaging, deployment, and integration of customer models in production environments.
  • Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU setups.
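
As one concrete example of the profiling work, a minimal torch.profiler sketch; the model and batch are placeholders for the real pipelines:

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Placeholder model and batch; real training pipelines are far larger.
    model = torch.nn.Sequential(torch.nn.Linear(2048, 2048), torch.nn.ReLU()).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(64, 2048, device="cuda")

    def train_step(batch):
        loss = model(batch).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Record CPU and CUDA activity for a handful of steps.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        for _ in range(5):
            train_step(x)

    # Surface the most expensive GPU ops to guide optimization work.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

In practice this would usually be wired up with a profiler schedule and trace export (e.g. to a Chrome/TensorBoard trace) so hotspots can be inspected across multi-node runs.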

Build Internal Tooling & Frameworks

  • Design and maintain libraries and services that support the model lifecycle: training, checkpointing, fault recovery, packaging, and deployment.
  • Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads (a hook sketch follows this list).
  • Champion best practices in CI/CD, testing, and software quality across the AI Runtime stack.
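
As a small taste of the observability side, a hypothetical latency-logging helper built on PyTorch's forward hooks; the helper name and the print-based sink are illustrative, not an existing internal API:

    import time
    import torch

    def attach_latency_hooks(model, log=print):
        """Log per-module forward latency; container modules include their children."""
        starts = {}

        def pre_hook(module, inputs):
            if torch.cuda.is_available():
                torch.cuda.synchronize()   # flush pending kernels so timings are honest
            starts[id(module)] = time.perf_counter()

        def post_hook(module, inputs, output):
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            ms = (time.perf_counter() - starts.pop(id(module))) * 1e3
            log(f"{module.__class__.__name__}: {ms:.2f} ms")

        handles = []
        for m in model.modules():
            handles.append(m.register_forward_pre_hook(pre_hook))
            handles.append(m.register_forward_hook(post_hook))
        return handles   # call .remove() on each handle to detach

    # Toy usage: time one forward pass of a stand-in model.
    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
    handles = attach_latency_hooks(model)
    model(torch.randn(8, 512))
    for h in handles:
        h.remove()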

Collaborate & Mentor

  • Work cross-functionally with Research, Infrastructure, and Product teams to align runtime development with customer and platform needs.
  • Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team's capabilities.

Ideal Candidate

  • 5+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.
  • Experience in delivering PaaS services.
  • Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
  • Strong programming skills in Python and C++ (Go or Rust is a plus).
  • Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
  • Experience working with multi-GPU, multi-node, or cloud-native AI workloads.
  • Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.

Skills: AI runtime, software, LLM, deep learning, infrastructure, PyTorch, stack
