10 - 15 years

10 - 20 Lacs

Posted: 1 hour ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

The Manager - SRE plays a pivotal role in driving the success of multiple teams within the organization. This position involves leading, guiding, and empowering the teams responsible for Major Incident Management, Event Monitoring, Release Engineering, and Automation Testing.

The role is critical to ensuring the reliability, performance, and availability of the organization's software systems. It involves overseeing a team of Site Reliability Engineers (SREs) and collaborating with cross-functional teams to maintain robust and efficient systems.

Responsibilities

1. Incident Management:

  • Lead the Major Incident Management team in handling critical incidents.
  • Establish and maintain incident response procedures.
  • Ensure timely communication, tracking, and resolution of major incidents.
  • Coordinate cross-functional efforts during major incidents.
  • Analyze equipment failure data, performance reports, and incidents to identify trends and areas for improvement.
  • Focus on root cause analysis and implement long-term solutions to prevent recurring issues.
  • Influence and improve the incident management lifecycle to identify, mitigate, and learn from reliability risks.
  • Embed into SRE projects and on-call rotations to stay close to operational workflows and address issues promptly.
  • Lead and manage a team of SREs responsible for monitoring, automating, and improving system reliability.
  • Foster a healthy work environment, promote collaboration, and ensure the team's professional development.
  • Compile key performance indicators (KPIs) and advocate for best practices related to performance and reliability.

2. Event Monitoring:

  • Oversee the Event Monitoring Team’s activities.
  • Proactively identify and address potential incidents.
  • Ensure effective detection and response to critical events.
  • Automate processes to enhance system reliability and performance.
  • Build effective monitoring systems to proactively detect and address anomalies.

3. Release Engineering:

  • Provide leadership to the Release Engineering team.
  • Manage incidents related to software deployments, updates, and releases.
  • Collaborate with other teams to resolve deployment-related issues.

4. Automation Testing:

  • Lead the Automation Testing team.
  • Address incidents related to automated testing processes and tools.
  • Optimize testing workflows and ensure efficient resolution of issues.

Qualifications

  • Minimum 8 years of experience in team leadership or management roles.
  • Proficiency in incident management and crisis resolution.
  • Familiarity with ITIL and ITSM practices.
  • Technical knowledge in areas such as cloud platforms (AWS, Azure), networking, and infrastructure support.
  • Strong situational awareness and decisive decision-making skills.


Digit Insurance

Insurance

Bengaluru, Karnataka
