Principal Site Reliability Engineer, AI Infrastructure

NVIDIA

15 - 19 years

Salary: Not disclosed

Karnataka

Posted: 3 months ago | Platform: Shine

Skills Required

production engineering, Python, C++, Go, Rust, Kubernetes, reliability engineering, incident management, root cause analysis, SRE, cloud infrastructure, Linux/Unix systems engineering, CPU/GPU scheduling, microservice orchestration, container lifecycle management, observability frameworks, Infrastructure as Code, Site Reliability Engineering concepts, distributed tracing, architectural fault tolerance, written and verbal communication, deep learning frameworks, orchestration frameworks, hardware fleet observability, predictive failure analysis, postmortem processes

Work Mode

On-site

Job Type

Full Time

Job Description

Role Overview:

NVIDIA, a leading technology company, is seeking a talented and creative individual to join its team. As an NVIDIAN, you will be part of a dynamic and innovative environment where you can contribute to groundbreaking advancements in AI and computing. Your role will involve architecting, leading, and scaling globally distributed production systems while collaborating with cross-functional teams to drive technical decisions and innovations.

Key Responsibilities:

- Architect, lead, and scale globally distributed production systems supporting AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments.
- Design and lead implementation of automation frameworks to reduce manual tasks, promote resilience, and uphold standard methodologies for system health, change safety, and release velocity.
- Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing approaches for sophisticated distributed systems.
- Lead cross-organizational efforts to assess operational maturity, address systemic risks, and establish long-term reliability strategies in collaboration with engineering, infrastructure, and product teams.
- Pioneer initiatives that influence NVIDIA's AI platform roadmap, participate in co-development efforts with internal partners and external vendors, and stay ahead of academic and industry advances.
- Publish technical insights (papers, patents, whitepapers) and drive innovation in production engineering and system design.
- Lead and mentor global teams in a technical capacity, participate in recruitment and design reviews, and develop standard methodologies in incident response, observability, and system architecture.

Qualifications Required:

- 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure, with a strong track record of leading platform-scale efforts and high-impact programs.
- Deep expertise in Linux/Unix systems engineering and public/private cloud platforms such as AWS, GCP, Azure, and OCI.
- Expert-level programming in Python and one or more languages like C++, Go, or Rust.
- Demonstrated experience with Kubernetes at scale, CPU/GPU scheduling, microservice orchestration, and container lifecycle management in production.
- Hands-on expertise in observability frameworks (Prometheus, Grafana, ELK, Loki, etc.) and Infrastructure as Code (Terraform, CDK, Pulumi).
- Proficiency in Site Reliability Engineering concepts like error budgets, SLOs, distributed tracing, and architectural fault tolerance.
- Ability to influence multi-functional collaborators and drive technical decisions through effective written and verbal communication.
- Proven track record of completing long-term, forward-looking platform strategies.
- Degree in Computer Science or related field, or equivalent experience.

Additional Company Details:

NVIDIA is widely recognized as one of the most desirable employers in the technology industry. It offers highly competitive salaries and a comprehensive benefits package. Join NVIDIA and be a part of building the infrastructure that powers the world's most advanced AI technologies. Apply now to make your mark at NVIDIA!

More Jobs at NVIDIA

Senior System Software Engineer – Simulation and Virtualization

Mumbai Metropolitan Region

5 - 5 yrs

Salary: Not disclosed

Senior System Software Engineer – Simulation and Virtualization

Gurugram, Haryana, India

5 - 5 yrs

Salary: Not disclosed

Senior System Software Engineer

Pune, Maharashtra, India

Experience: Not specified

Salary: Not disclosed

Senior Site Reliability Engineer

Pune, Maharashtra, India

Experience: Not specified

Salary: Not disclosed

Senior System Software Engineer, GPU Firmware

Pune, Maharashtra, India

Experience: Not specified

Salary: Not disclosed
