Posted: 3 days ago
On-site
Part Time
Role Purpose
The Incident & Availability Manager is responsible for managing the complete lifecycle of incidents, including Major Incident Management (MIM), to restore normal service as quickly as possible and minimize business impact. The role also governs service availability and reliability, ensuring that agreed SLAs, OLAs, and uptime targets are consistently met.
Incident Management
Manage end-to-end handling of high-priority (P1/P2) incidents across infrastructure, applications, and business services.
Oversee triage, impact assessment, and stakeholder communication throughout the incident lifecycle.
Ensure incidents are logged, prioritized, and resolved per ITIL standards.
Lead technical bridge calls with resolver groups and vendors for quick restoration.
Conduct post-incident reviews and track corrective/preventive actions.
Analyze incident trends and recommend improvement measures.
Provide timely updates to users, management, and stakeholders.
Major Incident Management (MIM)
Lead all Major Incidents (P1) to ensure fast recovery and effective communication.
Act as the single point of accountability during critical outages.
Manage Major Incident bridges, coordinate technical teams, and update leadership in real time.
Prepare and share MIM communications: initial notifications, progress updates, and closure summaries.
Produce post-MIM reports including business impact, RCA summary, and recovery actions.
Ensure RCA and preventive actions are completed in coordination with Problem Management.
Availability Management
Monitor and report on availability of critical IT systems and services.
Define, measure, and track SLAs, OLAs, and uptime metrics.
Identify and address recurring availability issues with Problem and Capacity teams.
Support proactive monitoring, redundancy, and resilience improvements.
Participate in DR testing, failover validation, and service continuity initiatives.
Governance & Reporting
Maintain dashboards and reports for incident and availability KPIs.
Present weekly/monthly operations reviews to leadership and stakeholders.
Work with Change and Problem Management to reduce incidents and operational risks.
Contribute to ITSM process improvement and service maturity initiatives.
Stakeholder Communication
Act as the main point of contact for stakeholders during major incidents.
Provide timely and clear updates to leadership, clients, and users.
Deliver executive summaries and post-incident reports.
Manage escalation paths and vendor coordination effectively.
Technical & Process Skills
Strong experience in Incident and Major Incident Management in a 24x7 enterprise environment.
Hands-on experience with ITSM tools (ManageEngine, ServiceNow, Jira Service Management).
Sound understanding of ITIL processes (Incident, Problem, Change, Availability, Capacity).
Familiarity with key infrastructure areas (Cloud, Network, Server, End User).
Proven ability to coordinate multiple technical teams during high-severity incidents.
Knowledge of monitoring tools (SolarWinds, Dynatrace, CloudWatch, Splunk, etc.).
Soft Skills
Excellent communication and stakeholder management skills.
Calm, decisive, and effective under pressure.
Strong analytical and problem-solving abilities.
Proven leadership and team coordination skills.
Highly organized and process-driven.
Qualifications
Bachelor’s Degree in Information Technology or equivalent.
8–12 years of IT Operations experience, including 3+ years in Major Incident or Availability Management.
Certifications:
ITIL v4 Intermediate or Expert (mandatory)
Major Incident / Problem Management certification (preferred)
AWS or Azure Foundations certification (desirable)
Tools
ITSM: ManageEngine, ServiceNow, Jira Service Management
Monitoring: SolarWinds, Dynatrace, CloudWatch, PRTG, Splunk
Collaboration: Microsoft Teams, Outlook, SharePoint
Support Coverage: 24x7 (On-call rotation for Major Incident & Problem Management support)
Skills: ServiceNow, Incident Management, ManageEngine, Jira Service Management
UST Global
Thiruvananthapuram (Trivandrum), Kerala, India
Salary: Not disclosed