Site Reliability Lead

5 - 10 years

5 - 12 Lacs

Posted: 4 days ago | Platform: Naukri

Work Mode

Hybrid

Job Type

Full Time

Job Description

Role & responsibilities:

This role consults on the maintenance of a reliable site environment to ensure the stability and security of multiple systems/platforms, developing and implementing improvements for all aspects of software reliability. This role also includes ensuring system reliability by meeting service-level objectives (SLOs), driving automation of operational tasks, defining and tracking key performance indicators (KPIs), designing scalable systems, managing incident responses, and collaborating with development teams to ensure software reliability and scalability.

Preferred candidate profile

Collaborates with internal teams to evaluate the health, stability and reliability of systems/platforms. Consults on architecture and programming design decisions related to availability and resilience.

Conducts localized failure mode analysis when new features and architecture patterns are introduced.
Facilitates post-incident reviews for any client-impacting events local to the product family.
Manages the planning and execution of chaos experiments to meet the development and maintenance requirements of systems/platforms for the product family.
Coordinates performance tests for the product family.
Assists product teams with triage and troubleshooting during client-impacting incidents.
Ensures alignment between service level indicators and objectives within the product family.
Maintains product-level runbooks for incident response, in collaboration with the SRE practitioners on each product team, documenting the step-by-step process to recover from failures of specific components within a system.
Makes final decisions on the tools, libraries, and standards used for SRE when multiple options have been proposed by SRE.
Participates in special projects and performs other duties as assigned.
Drives automation of key operational tasks such as monitoring, deployments, and system scaling to improve speed and reduce manual errors.
Designs resilient systems that scale effectively during peak loads or sudden traffic spikes to ensure uninterrupted service.
Defines and tracks KPIs such as availability, latency, and error rates to monitor system performance and guide optimization efforts.
Collaborates closely with software development teams to ensure new code and changes align with performance and reliability objectives.

EY

Professional Services

London
