Senior Incident Management Reliability Engineer

15 - 20 years

8 - 12 Lacs

Posted: 1 day ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Our Team:
Service Quality cultivates a culture of service excellence where quality is more than a benchmark; it's a shared purpose. Through synergistic collaboration, advanced monitoring, and empathetic customer advocacy, we strive to elevate every interaction and transform challenges into opportunities for growth.
 
Main responsibilities:
The Incident Management Reliability Engineer is responsible for ensuring the stability, resilience, and reliability of critical IT services. This role combines strong incident management expertise with reliability engineering principles to minimize disruptions, drive rapid recovery from major incidents, and continuously improve system performance and availability.

Incident Management

  • Lead the end-to-end management of Major Incidents (P1/P2), ensuring timely resolution and effective stakeholder communication.
  • Act as command centre lead during critical outages, coordinating across technical and business teams.
  • Ensure accurate and detailed incident documentation, including root cause, timeline and resolution steps.
  • Drive post-incident reviews and ensure action items are implemented to prevent recurrence.
  • Maintain consistent communication and escalation processes aligned with ITSM best practices (e.g. ITIL).

Reliability Engineering

  • Collaborate with service owners and platform teams to enhance service reliability, observability, and fault tolerance.
  • Implement proactive monitoring, alerting, and automated recovery mechanisms.
  • Analyse incident trends and develop reliability improvement plans.
  • Participate in capacity planning, change reviews, and failure mode analysis to anticipate and mitigate risks.
  • Develop and track SLOs/SLIs/SLAs to measure service health and performance.

Continuous Improvement

  • Partner with problem management to identify recurring issues and lead root cause elimination initiatives.
  • Automate operational tasks and enhance service recovery using scripts, runbooks, and AIOps tools.
  • Contribute to the evolution of the Major Incident Process, ensuring best practices are embedded across the organization.

Key Performance Indicators

  • Mean Time to Resolve (MTTR) and Mean Time to Detect (MTTD).
  • Reduction in number and impact of recurring incidents.
  • Adherence to SLA/SLO targets.
  • Completion rate of post-incident actions.
  • Stakeholder satisfaction and transparency during incidents.

About you

Experience:

  • 15+ years' experience.

Preferred Certifications:

  • ITIL v4 or Service Operations certification.
  • SRE Foundation / Practitioner certification.
  • Cloud certifications (AWS, Azure, or GCP).
  • Incident Command System (ICS) or equivalent leadership training in crisis response.

Soft skills:

  • Communication (verbal and written).

Technical skills:

  • Virtualization
  • Cloud Technologies
  • Database
  • Networking
  • Containerization
  • Automation
  • Middleware/Scheduling
  • Infrastructure as code

Languages:

  • English

Sanofi

Pharmaceutical Manufacturing

Paris, France
