Site Reliability Engineer

3 - 5 years

7 - 11 Lacs

Posted: 1 day ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

On-prem infrastructure management

Manage Nvidia's on-prem infrastructure. Maintain uptime, reliability, and readiness of the on-prem engineering cloud spread across multiple data centers.

Guard SLAs

Observability

Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance. Maintain KPI pipelines using Jenkins, Python and ELK.

Improve monitoring systems by adding custom alerts based on business needs.

Automation & Optimization

Assist with capacity planning, optimization, and efforts to improve resource utilization.

Day-to-Day Support

Support user-reported issues. Monitor alerts and take the necessary action.

Actively participate in war rooms for critical issues.

Collaboration & Documentation

Create and maintain documentation for operational procedures, configurations, and troubleshooting guides.

Tech stack

Bare-metal data center machine management tools such as IPMI, Redfish, and KVM.

Automation using Jenkins, Python, Go, Bash.

Infrastructure tools like Kubernetes, MySQL, Prometheus, Grafana and ELK.

Any familiarity with Nvidia hardware such as GPUs and Tegra devices is a plus.

Natobotics Technologies

Robotics and Automation

Innovate City
