Overview
As a Site Reliability Engineer 2 (SRE 2), you serve as both a senior technical contributor and a team leader within the SRE team. In addition to ensuring system reliability, scalability, and performance, you will manage shift schedules, guide SRE 1 engineers, and ensure compliance with ITSM processes. Your focus spans technical execution and operational excellence, so that the team delivers high-quality, consistent support and reliability across environments.
Key Responsibilities
- Infrastructure Reliability and Technical Leadership
  - Ensure high availability, scalability, and performance of systems through proactive monitoring, automation, and continuous improvement.
  - Lead efforts to improve infrastructure observability using tools such as Prometheus, Alertmanager, Grafana, and other telemetry systems.
  - Serve as the escalation point for complex technical incidents and outages, providing guidance to SRE 1 engineers.
- Team Oversight and Performance Management
  - Provide technical and operational leadership to SRE 1 engineers, ensuring daily tasks are executed to the team's standards.
  - Review SRE 1 work regularly to ensure adherence to best practices, SOPs, and incident response protocols.
  - Mentor and train junior team members to strengthen their technical skills and operational understanding.
  - Conduct regular feedback sessions and contribute to performance evaluations.
- Shift Management and 24/7 Coverage
  - Design, implement, and manage rotating shift schedules to ensure optimal 24/7 support coverage.
  - Monitor shift adherence, workload distribution, and overall team health.
  - Ensure proper handovers between shifts, with complete documentation and context sharing.
- ITSM Process and Compliance
  - Own and enforce ITSM processes, including Incident Management, Change Management, Problem Management, and Service Request Fulfillment.
  - Ensure that all incidents, changes, and problems are logged, categorized, and resolved or escalated within SLA.
  - Continuously assess and improve ITSM processes in collaboration with internal stakeholders and audit teams.
- Incident and Problem Management
  - Lead major incident investigations and coordinate response efforts across teams.
  - Oversee root cause analysis and the implementation of long-term fixes for recurring issues.
  - Maintain detailed incident logs and postmortem reports for high-priority incidents.
- Change and Maintenance Oversight
  - Review and approve change requests initiated by SRE 1 engineers or other team members.
  - Ensure maintenance tasks follow predefined SOPs and do not affect system stability.
  - Track and analyze the impact of changes to continuously improve reliability metrics.
- Reporting and Stakeholder Communication
  - Create and present weekly and monthly reports on SRE metrics, team performance, incident trends, and capacity planning.
  - Collaborate with cross-functional teams, including engineering, QA, support, and product, to align operational goals.
  - Provide updates to leadership on key incidents, system health, and team productivity.
Skills: Linux/Unix, Docker, Jenkins, Grafana, Terraform, CI/CD, Python, Git, and GitHub