Incident & Availability Manager

5 years

Posted: 3 days ago | Platform: Glassdoor

Work Mode

On-site

Job Type

Part Time

Job Description

    5 - 7 Years
    1 Opening
    Trivandrum


Role description

1. Role Purpose

The Incident & Availability Manager is responsible for managing the complete lifecycle of incidents, including Major Incident Management (MIM), to restore normal service as quickly as possible and minimize business impact. The role also governs service availability and reliability, ensuring that agreed SLAs, OLAs, and uptime targets are consistently met.

2. Key Responsibilities

Incident Management

  • Manage end-to-end handling of high-priority (P1/P2) incidents across infrastructure, applications, and business services.

  • Oversee triage, impact assessment, and stakeholder communication throughout the incident lifecycle.

  • Ensure incidents are logged, prioritized, and resolved per ITIL standards.

  • Lead technical bridge calls with resolver groups and vendors for quick restoration.

  • Conduct post-incident reviews and track corrective/preventive actions.

  • Analyze incident trends and recommend improvement measures.

  • Provide timely updates to users, management, and stakeholders.

Major Incident Management (MIM)

  • Lead all Major Incidents (P1) to ensure fast recovery and effective communication.

  • Act as the single point of accountability during critical outages.

  • Manage Major Incident bridges, coordinate technical teams, and update leadership in real time.

  • Prepare and share MIM communications — initial notifications, progress updates, and closure summaries.

  • Produce post-MIM reports including business impact, root cause analysis (RCA) summary, and recovery actions.

  • Ensure RCA and preventive actions are completed in coordination with Problem Management.

Availability Management

  • Monitor and report on availability of critical IT systems and services.

  • Define, measure, and track SLAs, OLAs, and uptime metrics.

  • Identify and address recurring availability issues with Problem and Capacity teams.

  • Support proactive monitoring, redundancy, and resilience improvements.

  • Participate in DR testing, failover validation, and service continuity initiatives.

Governance & Reporting

  • Maintain dashboards and reports for incident and availability KPIs.

  • Present weekly/monthly operations reviews to leadership and stakeholders.

  • Work with Change and Problem Management to reduce incidents and operational risks.

  • Contribute to ITSM process improvement and service maturity initiatives.

Stakeholder Communication

  • Act as the main point of contact for stakeholders during major incidents.

  • Provide timely and clear updates to leadership, clients, and users.

  • Deliver executive summaries and post-incident reports.

  • Manage escalation paths and vendor coordination effectively.

3. Required Skills & Experience

Technical & Process Skills

  • Strong experience in Incident and Major Incident Management in a 24x7 enterprise environment.

  • Hands-on experience with ITSM tools (ManageEngine, ServiceNow, Jira Service Management).

  • Sound understanding of ITIL processes (Incident, Problem, Change, Availability, Capacity).

  • Familiarity with key infrastructure areas (Cloud, Network, Server, End User).

  • Proven ability to coordinate multiple technical teams during high-severity incidents.

  • Knowledge of monitoring tools (SolarWinds, Dynatrace, CloudWatch, Splunk, etc.).

Soft Skills

  • Excellent communication and stakeholder management skills.

  • Calm, decisive, and effective under pressure.

  • Strong analytical and problem-solving abilities.

  • Proven leadership and team coordination skills.

  • Highly organized and process-driven.

4. Qualifications

  • Bachelor’s Degree in Information Technology or equivalent.

  • 8–12 years of IT Operations experience, including 3+ years in Major Incident or Availability Management.

  • Certifications:

    • ITIL v4 Intermediate or Expert (mandatory)

    • Major Incident / Problem Management certification (preferred)

    • AWS or Azure Foundations certification (desirable)

5. Tools & Platforms

  • ITSM: ManageEngine, ServiceNow, Jira Service Management

  • Monitoring: SolarWinds, Dynatrace, CloudWatch, PRTG, Splunk

  • Collaboration: Microsoft Teams, Outlook, SharePoint



Support Coverage: 24x7 (On-call rotation for Major Incident & Problem Management support)

Skills

ServiceNow, Incident Management, ManageEngine, Jira Service Management

About UST

UST is a global digital transformation solutions provider. For more than 20 years, UST has worked side by side with the world’s best companies to make a real impact through transformation. Powered by technology, inspired by people and led by purpose, UST partners with their clients from design to operation. With deep domain expertise and a future-proof philosophy, UST embeds innovation and agility into their clients’ organizations. With over 30,000 employees in 30 countries, UST builds for boundless impact—touching billions of lives in the process.
