As a Senior Site Reliability Engineer (SRE) on the Data Science & ML Platform(s) team at NVIDIA, you will play a crucial role in designing, building, and maintaining services that support advanced data science and machine learning applications. Your responsibilities will involve implementing software and systems engineering practices to ensure high efficiency and availability of the platform, applying SRE principles to improve production systems, and collaborating with customers to implement changes while monitoring capacity, latency, and performance.

Key Responsibilities:
- Develop software solutions to ensure reliability and operability of large-scale systems supporting mission-critical use cases.
- Gain a deep understanding of system operations, scalability, interactions, and failures to identify improvement opportunities and risks.
- Create tools and automation to reduce operational overhead and eliminate manual tasks.
- Establish frameworks, processes, and standard methodologies to enhance operational maturity, improve team efficiency, and accelerate innovation.
- Define meaningful and actionable reliability metrics to track and improve system and service reliability.
- Oversee capacity and performance management to facilitate infrastructure scaling across public and private clouds globally.
- Build tools to improve service observability for faster issue resolution.
- Practice sustainable incident response and blameless postmortems.

Qualifications Required:
- 6+ years of experience in SRE, Cloud platforms, or DevOps with large-scale microservices in production environments.
- Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent experience.
- Strong understanding of SRE principles, including error budgets, SLOs, and SLAs.
- Proficiency in incident, change, and problem management processes.
- Skilled in problem-solving, root cause analysis, and optimization.
- Experience with streaming data infrastructure services, such as Kafka and Spark.
- Expertise in building and operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus).
- Proficiency in programming languages such as Python, Go, Perl, or Ruby.
- Hands-on experience with scaling distributed systems in public, private, or hybrid cloud environments.
- Experience in deploying, supporting, and supervising services, platforms, and application stacks.

Additional Company Details:
NVIDIA leads the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, its invention, serves as the visual cortex of modern computers and is at the heart of its products and services. NVIDIA's work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for exceptional individuals to help accelerate the next wave of artificial intelligence.

Senior Site Reliability Engineer, ML Platforms

gurugram, all india

6.0 - 10.0 yrs

Salary: Not disclosed
