Site Reliability Engineer 2 (SRE 2)

0 years

5 - 7 Lacs

Posted: 4 days ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Overview

As a Site Reliability Engineer 2 (SRE 2), you play a dual role as a senior technical contributor and a team leader within the SRE team. In addition to ensuring system reliability, scalability, and performance, you will manage shift schedules, guide SRE 1 engineers, and ensure compliance with ITSM processes. Your focus spans both technical execution and operational excellence, so that the team delivers high-quality, consistent support and reliability across environments.

Key Responsibilities

  • Infrastructure Reliability and Technical Leadership
      • Ensure high availability, scalability, and performance of systems through proactive monitoring, automation, and continuous improvement.
      • Lead efforts in improving infrastructure observability using tools like Prometheus, Alertmanager, Grafana, and other telemetry systems.
      • Serve as an escalation point for complex technical incidents and outages, providing guidance to SRE 1 engineers.
  • Team Oversight and Performance Management
      • Provide technical and operational leadership to SRE 1 engineers, ensuring daily tasks are executed as per standards.
      • Review SRE 1 work regularly to ensure adherence to best practices, SOPs, and incident response protocols.
      • Mentor and train junior team members to enhance their technical skills and operational understanding.
      • Conduct regular feedback sessions and contribute to performance evaluations.
  • Shift Management and 24/7 Coverage
      • Design, implement, and manage rotating shift schedules to ensure optimal 24/7 support coverage.
      • Monitor shift adherence, workload distribution, and overall team health.
      • Ensure proper handovers between shifts with complete documentation and context sharing.
  • ITSM Process and Compliance
      • Own and enforce ITSM processes, including Incident Management, Change Management, Problem Management, and Service Request Fulfillment.
      • Ensure that all incidents, changes, and problems are logged, categorized, and resolved or escalated as per SLA.
      • Continuously assess and improve ITSM processes in collaboration with internal stakeholders and audit teams.
  • Incident and Problem Management
      • Lead major incident investigations and coordinate response efforts across teams.
      • Oversee root cause analysis and implementation of long-term fixes for recurring issues.
      • Maintain detailed incident logs and postmortem reports for high-priority incidents.
  • Change and Maintenance Oversight
      • Review and approve change requests initiated by SRE 1 or other team members.
      • Ensure execution of maintenance tasks adheres to predefined SOPs and does not impact system stability.
      • Track and analyze impact of changes to continuously improve reliability metrics.
  • Reporting and Stakeholder Communication
      • Create and present weekly/monthly reports on SRE metrics, team performance, incident trends, and capacity planning.
      • Collaborate with cross-functional teams, including engineering, QA, support, and product, to align operational goals.
      • Provide updates to leadership on key incidents, system health, and team productivity.

Skills: Linux/Unix, Docker, Jenkins, Grafana, Terraform, CI/CD, Python, Git, and GitHub
