Incident Manager

14 years

0 Lacs

Posted: 23 hours ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Roles and Responsibilities:

  • Act as the primary point of contact for major incidents and escalations, ensuring rapid response and communication across technical and business teams.
  • Lead and coordinate incident resolution efforts involving multiple support teams and stakeholders to restore service as quickly as possible.
  • Manage the end-to-end incident lifecycle – detection, logging, categorization, prioritization, resolution, and closure.
  • Conduct detailed Root Cause Analysis (RCA) for high-severity incidents and drive implementation of permanent fixes.
  • Work closely with AWS cloud infrastructure teams to identify and resolve platform-level or configuration issues.
  • Collaborate with architecture and development teams to identify patterns, improve system reliability, and strengthen incident prevention strategies.
  • Develop and maintain incident management processes, playbooks, and metrics to improve response efficiency and reduce recurrence.
  • Manage communications and stakeholder expectations during critical incidents and post-incident reviews.
  • Participate in on-call rotations and ensure 24x7 support coverage as required.
  • Continuously drive improvements in monitoring, alerting, and automation to minimize incident impact and MTTR (Mean Time to Recovery).

Required Skills & Qualifications:

  • 8–14 years of experience in Incident Management / Production Support / Site Reliability / IT Operations roles.
  • Strong experience in managing incidents within complex distributed architectures and cloud-based environments (AWS preferred).
  • Expertise in AWS services such as EC2, S3, Lambda, CloudWatch, RDS, and related monitoring and logging tools.
  • Exposure to Redis and Elasticsearch for cache management, data indexing, and performance optimization.
  • Excellent communication and coordination skills to handle high-pressure situations and interact with senior stakeholders.
  • Proven ability to perform Root Cause Analysis (RCA) and implement corrective and preventive measures.
  • Experience with ITIL processes (Incident, Problem, Change Management).
  • Familiarity with tools such as ServiceNow, Jira, CloudWatch, PagerDuty, etc.
  • Strong analytical and problem-solving skills with a proactive approach to issue resolution.
  • Ability to work in 24x7 production support environments and handle critical incident escalations effectively.

Talentoj

Human Resources

Talent City
