Posted: 1 month ago
Job Description:
Site Reliability Engineering (SRE) Manager, Observability & ITOM
Indicative years of total experience: 14 – 16 years
Location:
Pune
Department:
Engineering / IT Operations
This role reports to the Program Manager.
Job Type:
Full-Time (Hybrid)
Job Summary:
We are seeking a seasoned SRE Manager to lead our Observability & Reliability Engineering team, with a strong focus on IT Operations Management (ITOM) practices. This role will be responsible for driving end-to-end reliability, performance, and operational excellence across our infrastructure and applications. The ideal candidate will also oversee the ServiceNow ITOM module, ensuring seamless integration and automation of IT operations workflows.
Key Responsibilities:
Leadership & Strategy
- Lead and mentor a team of SREs and Observability Engineers.
- Define and drive the strategic roadmap for reliability, observability, and ITOM practices.
- Collaborate with cross-functional teams (DevOps, Platform Engineering, Application Development, and ITSM) to align reliability goals with business objectives.
Observability & Monitoring
- Own the observability stack including metrics, logs, traces, and dashboards.
- Implement and manage tools like Prometheus, Grafana, ELK, Splunk, Datadog, or similar.
- Drive proactive monitoring, alerting, and anomaly detection to reduce MTTR and improve system health.
Reliability Engineering
- Champion SRE principles such as SLIs, SLOs, and error budgets.
- Lead incident response and postmortem processes to ensure continuous improvement.
- Automate operational tasks and improve system resilience through chaos engineering and fault injection.
ITOM Practice Management
- Oversee the implementation and optimization of ServiceNow ITOM modules (Discovery, Event Management, Orchestration, CMDB).
- Ensure accurate and up-to-date CMDB data to support incident, problem, and change management processes.
- Drive automation of IT operations workflows using ServiceNow and other orchestration tools.
Process & Governance
- Establish and enforce best practices for change management, incident management, and problem resolution.
- Ensure compliance with internal and external audit requirements related to IT operations.
Stakeholder Engagement
- Act as a key liaison between engineering, operations, and business stakeholders.
- Provide regular updates and reports on system reliability, performance, and operational KPIs.
Principal Global Services