Responsibilities
Job Description: Site Reliability Engineering (SRE) Manager - Observability & ITOM
Indicative years of total experience: 14 - 16 years
Location: Pune/Hyderabad
Department
Engineering / IT Operations
Reporting Relationship
This role reports to the Program Manager.
Job Type
Full-Time (Hybrid)
Job Summary
We are seeking a seasoned SRE Manager to lead our Observability & Reliability Engineering team, with a strong focus on IT Operations Management (ITOM) practices. This role will be responsible for driving end-to-end reliability, performance, and operational excellence across our infrastructure and applications. The ideal candidate will also oversee the ServiceNow ITOM module, ensuring seamless integration and automation of IT operations workflows.
Key Responsibilities
Leadership & Strategy
- Lead and mentor a team of SREs and Observability Engineers.
- Define and drive the strategic roadmap for reliability, observability, and ITOM practices.
- Collaborate with cross-functional teams (DevOps, Platform Engineering, Application Development, and ITSM) to align reliability goals with business objectives.
Observability & Monitoring
- Own the observability stack including metrics, logs, traces, and dashboards.
- Implement and manage tools like Prometheus, Grafana, ELK, Splunk, Datadog, or similar.
- Drive proactive monitoring, alerting, and anomaly detection to reduce MTTR and improve system health.
Reliability Engineering
- Champion SRE principles such as SLIs, SLOs, and error budgets.
- Lead incident response and postmortem processes to ensure continuous improvement.
- Automate operational tasks and improve system resilience through chaos engineering and fault injection.
ITOM Practice Management
- Oversee the implementation and optimization of ServiceNow ITOM modules (Discovery, Event Management, Orchestration, CMDB).
- Ensure accurate and up-to-date CMDB data to support incident, problem, and change management processes.
- Drive automation of IT operations workflows using ServiceNow and other orchestration tools.
Process & Governance
- Establish and enforce best practices for change management, incident management, and problem resolution.
- Ensure compliance with internal and external audit requirements related to IT operations.
Stakeholder Engagement
- Act as a key liaison between engineering, operations, and business stakeholders.
- Provide regular updates and reports on system reliability, performance, and operational KPIs.
Qualifications
Required Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 10+ years of experience in IT operations, DevOps, or SRE roles.
- 3+ years in a leadership or managerial role.
- Hands-on experience with observability tools and practices.
- Strong expertise in ServiceNow ITOM modules and CMDB management.
- Excellent communication, leadership, and stakeholder management skills.
Preferred Skills
- Certifications in SRE, ServiceNow ITOM, and cloud platforms (AWS, Azure, GCP).
- Experience with infrastructure as code (Terraform, Ansible).
- Familiarity with containers and container orchestration (Docker, Kubernetes).
- Knowledge of ITIL processes and frameworks.
Additional Information
Required Behavioral Competencies:
- Make sound business decisions
- Embrace change
- Build strong partnerships
- Get results
- Act strategically
- Lead and cultivate talent