Principal Site Reliability Engineer, AI Infrastructure

15 - 19 years

0 Lacs

Posted: 2 days ago | Platform: Shine

Work Mode: On-site

Job Type: Full Time

Job Description

Role Overview:

NVIDIA is seeking a Production Systems Architect. In this role, you will architect, lead, and scale globally distributed production systems that support AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments. Your primary focus will be designing and implementing automation frameworks, defining reliability metrics, and leading cross-organizational efforts to ensure the long-term reliability of systems. You will also drive innovation in production engineering and system design while mentoring global teams in a technical capacity.

Key Responsibilities:

- Architect, lead, and scale globally distributed production systems supporting AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments.
- Design and lead the implementation of automation frameworks that reduce manual tasks, promote resilience, and uphold standard methodologies for system health, change safety, and release velocity.
- Define platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing approaches for sophisticated distributed systems.
- Lead cross-organizational efforts to assess operational maturity, address systemic risks, and establish long-term reliability strategies in collaboration with engineering, infrastructure, and product teams.
- Pioneer initiatives that influence NVIDIA's AI platform roadmap, participate in co-development efforts with internal partners and external vendors, and stay ahead of academic and industry advances.
- Publish technical insights (papers, patents, whitepapers) and drive innovation in production engineering and system design.
- Lead and mentor global teams in a technical capacity, participating in recruitment and design reviews and developing standard methodologies for incident response, observability, and system architecture.

Qualifications Required:

- 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure, with a strong track record of leading platform-scale efforts and high-impact programs.
- Deep expertise in Linux/Unix systems engineering and public/private cloud platforms (AWS, GCP, Azure, OCI).
- Expert-level programming in Python and one or more languages such as C++, Go, or Rust.
- Demonstrated experience with Kubernetes at scale, CPU/GPU scheduling, microservice orchestration, and container lifecycle management in production.
- Hands-on expertise in observability frameworks (Prometheus, Grafana, ELK, Loki, etc.) and Infrastructure as Code (Terraform, CDK, Pulumi).
- Proficiency in Site Reliability Engineering concepts such as error budgets, SLOs, distributed tracing, and architectural fault tolerance.
- Ability to influence multi-functional collaborators and drive technical decisions through effective written and verbal communication.
- Proven track record of completing long-term, forward-looking platform strategies.
- Degree in Computer Science or a related field, or equivalent experience.

Additional Company Details:

NVIDIA, widely considered to be one of the technology world's most desirable employers, offers highly competitive salaries and a comprehensive benefits package to its employees. If you're looking to make a lasting impact in the world of advanced AI infrastructure, seize this opportunity and apply to join NVIDIA's innovative team today!
