Senior Site Reliability Engineer, AI Infrastructure

NVIDIA

12 - 16 years

0 Lacs

Hyderabad, Telangana

Posted: 3 months ago | Platform: Shine

Skills Required

Python, Go, Perl, Ruby, Linux, Windows, AWS, OCI, Azure, GCP, GitLab, JAX, C/C++, Terraform, ELK, Prometheus, Loki, PyTorch, TensorFlow, Ray

Work Mode

On-site

Job Type

Full Time

Job Description

As a member of the NVIDIA team, you will join a group of innovative and hardworking people at the forefront of technology. NVIDIA has spent the past 30 years transforming computer graphics, PC gaming, and accelerated computing, and is now applying AI to shape the future of computing. If you are creative, autonomous, and driven to make a difference, we invite you to join our AI Infrastructure Production Engineering team and contribute to advancements in the field.

Role Overview:

- Develop and maintain large-scale systems that support critical AI Infrastructure use cases, ensuring reliability, operability, and scalability in global public and private clouds.
- Implement Site Reliability Engineering (SRE) fundamentals, such as incident management, monitoring, and performance optimization, while creating automation tools to streamline manual processes.
- Create tools and frameworks to enhance observability, define reliability metrics, and facilitate quick problem resolution to continuously improve system performance.
- Establish operational frameworks, lead incident response procedures, and conduct postmortems to enhance team efficiency and system resilience.
- Collaborate with engineering teams to deliver innovative solutions, mentor peers, maintain high coding and infrastructure standards, and participate in team-building activities.

Qualifications Required:

- Bachelor's degree in Computer Science or a related field, or equivalent experience, with a minimum of 12 years in Software Development, SRE, or Production Engineering.
- Proficiency in Python and at least one other language (e.g., C/C++, Go, Perl, Ruby).
- Expertise in systems engineering on Linux or Windows environments and on various cloud platforms (AWS, OCI, Azure, GCP).
- Strong grasp of SRE principles, including error budgets, SLOs, and SLAs, and of Infrastructure as Code tools (e.g., Terraform CDK).
- Hands-on experience with observability platforms (ELK, Prometheus, Loki) and CI/CD systems (GitLab).
- Effective communication skills to convey technical concepts to diverse audiences.
- Dedication to promoting diversity, curiosity, and continuous improvement within the team.

Preferred Qualifications:

- Experience in AI training, inferencing, and data infrastructure services.
- Proficiency in deep learning frameworks such as PyTorch, TensorFlow, JAX, and Ray.
- Solid background in hardware health monitoring and system reliability.
- Hands-on expertise in operating and scaling distributed systems with stringent SLAs for high availability and performance.
- Proven track record in incident, change, and problem management processes within sophisticated environments.

If you have the required qualifications and are eager to contribute to the work at NVIDIA, we look forward to welcoming you to our team.
