Principal Site Reliability Engineer, AI Infrastructure

15 - 19 years

0 Lacs

Posted: 1 day ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

Role Overview:

NVIDIA, a leading technology company, is seeking a talented and creative individual to join its team. As an NVIDIAN, you will be part of a dynamic and innovative environment where you can contribute to groundbreaking advancements in AI and computing. Your role will involve architecting, leading, and scaling globally distributed production systems while collaborating with cross-functional teams to drive technical decisions and innovations.

Key Responsibilities:

- Architect, lead, and scale globally distributed production systems supporting AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments.
- Design and lead implementation of automation frameworks to reduce manual tasks, promote resilience, and uphold standard methodologies for system health, change safety, and release velocity.
- Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing approaches for sophisticated distributed systems.
- Lead cross-organizational efforts to assess operational maturity, address systemic risks, and establish long-term reliability strategies in collaboration with engineering, infrastructure, and product teams.
- Pioneer initiatives that influence NVIDIA's AI platform roadmap, participate in co-development efforts with internal partners and external vendors, and stay ahead of academic and industry advances.
- Publish technical insights (papers, patents, whitepapers) and drive innovation in production engineering and system design.
- Lead and mentor global teams in a technical capacity, participate in recruitment and design reviews, and develop standard methodologies in incident response, observability, and system architecture.

Qualifications Required:

- 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure, with a strong track record of leading platform-scale efforts and high-impact programs.
- Deep expertise in Linux/Unix systems engineering and public/private cloud platforms such as AWS, GCP, Azure, and OCI.
- Expert-level programming in Python and one or more languages like C++, Go, or Rust.
- Demonstrated experience with Kubernetes at scale, CPU/GPU scheduling, microservice orchestration, and container lifecycle management in production.
- Hands-on expertise in observability frameworks (Prometheus, Grafana, ELK, Loki, etc.) and Infrastructure as Code (Terraform, CDK, Pulumi).
- Proficiency in Site Reliability Engineering concepts like error budgets, SLOs, distributed tracing, and architectural fault tolerance.
- Ability to influence multi-functional collaborators and drive technical decisions through effective written and verbal communication.
- Proven track record of completing long-term, forward-looking platform strategies.
- Degree in Computer Science or related field, or equivalent experience.

Additional Company Details:

NVIDIA is widely recognized as one of the most desirable employers in the technology industry. It offers highly competitive salaries and a comprehensive benefits package. Join NVIDIA and be a part of building the infrastructure that powers the world's most advanced AI technologies. Apply now to make your mark at NVIDIA!
