
Site Reliability Engineer - Incident Management

3 - 8 years

4 - 7 Lacs

Hyderabad, Bengaluru

Posted: 2 weeks ago


Skills Required

Incident Management, Python, Site Reliability, IaC, Site Reliability Engineering, Cybersecurity, Ansible, Grafana, DevOps, Splunk, Terraform, AWS, Infrastructure as Code, CloudFormation, Azure

Work Mode

Work from Office

Job Type

Full Time

Job Description

Position Summary

The Reliability Engineer will be a critical contributor within the Site Reliability Engineering (SRE) and Incident Management team, focusing on ensuring the availability, reliability, and performance of critical systems and services. This role is responsible for managing and facilitating major incident response efforts, ensuring that service disruptions are quickly identified, triaged, and resolved. As an incident facilitator, the Reliability Engineer will take the lead during high-pressure situations, collaborating with cross-functional teams to restore service and drive root cause analysis to prevent future issues. Clear and consistent communication will be critical to the success of the incident management team and processes.

In addition to incident management, the Reliability Engineer will apply technical expertise to design, deploy, and manage modern observability tools, including synthetic monitoring and infrastructure monitoring solutions. The ideal candidate will demonstrate a mix of strong technical skills, effective communication, and the ability to remain composed and solutions-oriented under pressure.

Key Responsibilities

Incident Response and Management

Lead the resolution of major incidents by managing the end-to-end incident lifecycle, including detection, escalation, troubleshooting, and resolution.

Serve as the incident facilitator during escalations, ensuring effective, clear, and timely communication between all stakeholders to drive collaborative problem-solving.

Ensure appropriate handoffs and escalations between global engineering and incident management teams.

Coordinate root cause analysis (RCA) efforts, facilitating discussions to identify contributing factors, lessons learned, and long-term corrective actions to reduce the likelihood of recurrence.

Create, document, and improve incident response and management processes, defining clear roles and responsibilities for all participants during incidents.

Ensure stakeholders and leadership across business and technical teams are kept informed with clear, concise updates during incidents, minimizing customer and business impact.

Maintain open lines of communication by making sure engineering teams engage in communication processes during incidents and have a clear understanding of their responsibilities.

Observability Tools Design and Implementation

Design, implement, and manage end-to-end observability solutions, including synthetic monitoring, infrastructure monitoring, tracing, and metrics monitoring systems.

Evaluate, deploy, and maintain observability and monitoring tools such as DataDog, Grafana, LogicMonitor, Splunk, New Relic, or similar platforms.

Maintain and manage escalation tooling such as VictorOps or PagerDuty to ensure teams across the organization have up-to-date schedules and escalation processes.

Build and maintain monitoring and alerting for critical systems, ensuring that warnings and issues are quickly identified and actionable in real time.

Drive the standardization of monitoring practices across teams, ensuring critical applications, systems, and infrastructure components are well-instrumented and monitored.

Develop infrastructure monitoring pipelines leveraging telemetry, logging, tracing, metrics, and visualization tools to provide accurate insights into production system health.

Process Development and Automation

Support efforts to define and document standard operating procedures for managing incidents, alerts, system failures, and post-incident reviews across global teams.

Collaborate with development, infrastructure, and security teams to improve system reliability through efficient processes and workflows.

Advocate for the development and implementation of SLAs, SLOs, and error budgets to support decision-making and prioritization in reliability efforts.

Identify and implement opportunities to automate manual operational tasks to further reduce incident response and resolution times.

Work closely with the service desk to ensure consistent incident management practices and appropriate escalations to the major incident management team.

Collaboration and Communication

Partner with engineering, operations, and security teams to confirm observability tools and monitoring approaches meet their needs and align with organizational standards.

Actively engage during incident scenarios to ensure identification and mobilization of the appropriate resources, facilitating collaboration across teams and ensuring best practices are followed.

Contribute to a culture of shared responsibility and blameless postmortems by documenting and communicating findings from incident responses.

Proactively provide input to the SRE Manager to recommend improvements in processes, tools, and systems to enhance team capabilities and outcomes.

Qualifications

Education: Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent professional experience).

3+ years of professional experience in Site Reliability Engineering (SRE), System Engineering, DevOps, or IT Operations roles.

Highly experienced as a major incident manager, incident commander, or similar role, with a proven ability to facilitate, communicate, and drive resolution of technical incidents.

Strong understanding of ITIL principles and their application in incident management.

Experience with observability tools such as DataDog, Grafana, LogicMonitor, Splunk, New Relic, or similar technologies.

Experience with synthetic monitoring, infrastructure monitoring, and metrics and tracing monitoring tools.

Experience with hybrid infrastructure environments and an understanding of monitoring signals from static on-premises infrastructure, cloud-based ephemeral infrastructure, and SaaS applications.

Strong understanding of telemetry, logging, tracing, and their roles in system monitoring and observability pipelines.

Experience with Python, Go, Bash, or a similar language to develop and maintain monitoring and automation scripts.

Proven ability to remain calm and effective during high-pressure situations, facilitating resolution in a methodical, professional manner.

Preferred Qualifications

Certifications: AWS Certified Solutions Architect (Associate or higher) or Microsoft Certified: Azure Administrator/Architect.

ITIL Foundation Certification.

Experience with Infrastructure-as-Code (IaC) tools such as Terraform, CloudFormation, or Ansible as part of observability and monitoring pipelines.

Experience building tooling using modern infrastructure patterns such as containerization and serverless.

Experience implementing SLAs, SLOs, and error budgets in environments operating under Site Reliability Engineering or ITIL frameworks.

Knowledge of network and system security, including secure configurations, traffic monitoring, and network observability.
