Location: Pan India
Experience: 5–10 Years
Role: On-Prem Infrastructure Engineer / Site Reliability Engineer (SRE)
Job Summary
We are seeking a skilled On-Prem Infrastructure Engineer / SRE to manage and support NVIDIA’s on-prem engineering cloud infrastructure across multiple data centers. The ideal candidate will have strong experience in bare-metal infrastructure management, observability tools, automation, and production support. This role is critical in ensuring uptime, reliability, and operational excellence for engineering services.
Key Responsibilities
- On-Prem Infrastructure Management
Manage and operate NVIDIA’s on-prem infrastructure across distributed data centers.
Maintain high availability, reliability, and readiness of on-prem engineering cloud environments.
Perform lifecycle management of bare-metal servers and underlying hardware.
Uphold Service Level Agreements (SLAs) for mission-critical engineering services.
Implement and maintain monitoring, alerting, and incident response workflows.
Drive root cause analysis (RCA), conduct post-mortems, and ensure corrective and preventive actions.
- Observability & Monitoring
Deploy, configure, and manage observability tools such as Prometheus, Grafana, and the ELK Stack.
Maintain KPI monitoring pipelines using Jenkins, Python, and ELK.
Develop and enhance custom monitoring dashboards and business-specific alerting rules.
- Automation & Optimization
Contribute to capacity planning, resource optimization, and performance tuning initiatives.
Develop automation scripts/tools using Python, Go, Bash, or Jenkins pipelines.
Improve operational efficiency through continuous automation.
- Day-to-Day Operations & Support
Monitor system alerts, troubleshoot incidents, and resolve user-reported issues.
Participate in war rooms during major or high-impact incidents.
Ensure timely escalation and resolution of production issues.
- Collaboration & Documentation
Create and maintain technical documentation for operational procedures, architectures, and troubleshooting steps.
Work closely with engineering, DevOps, hardware, and data center teams to improve overall infrastructure reliability.
Required Skills & Experience
Strong hands-on experience in bare-metal server management using tools such as IPMI, Redfish, KVM, or similar technologies.
Experience with automation and scripting using Python, Go, Bash, and Jenkins (CI/CD pipelines).
Practical experience with infrastructure tools: Kubernetes, MySQL, Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana).
Solid understanding of system performance, capacity planning, and data center operations.
Strong troubleshooting, incident-response, and operational debugging skills.
Ability to work in fast-paced environments and handle production-critical scenarios.
Nice-to-Have Skills
Familiarity with NVIDIA hardware: GPUs, Tegra systems, DGX platforms, etc.
Experience in large-scale distributed systems or high-performance computing environments.
Soft Skills
Strong communication and collaboration abilities.
Analytical mindset with a focus on problem-solving.
Ability to maintain composure under pressure in incident environments.
Detail-oriented with strong documentation habits.