Site Reliability Lead

8 - 13 years

20 - 25 Lacs

Posted: 4 days ago | Platform: Naukri

Work Mode

Hybrid

Job Type

Full Time

Job Description

Role & responsibilities:

The Site Reliability Lead, Specialist role is responsible for ensuring the stability, availability and performance of systems and applications by implementing reliability engineering best practices, automation and incident management strategies. The role involves monitoring and optimizing mobile platforms (Android, iOS), leveraging observability and resiliency tools, and collaborating with development teams to enhance system reliability. It is central to troubleshooting incidents, automating operational tasks, improving CI/CD pipelines and driving continuous improvements in site reliability and performance.

Preferred candidate profile:

Ensure system reliability, stability and performance by maintaining service-level objectives (SLOs) and minimizing downtime and incidents.
Collaborate with internal teams to assess system health, stability and resilience, providing architectural and design recommendations for reliability.
Lead incident management and post-incident reviews, diagnosing issues, deploying fixes and implementing preventive measures.
Drive automation of operational tasks, including deployments, monitoring, scaling and system recovery, to improve efficiency and reduce manual intervention.
Define and track key performance indicators (KPIs) such as availability, latency and error rates to optimize system performance and inform decision-making.
Plan and execute chaos engineering experiments to test system resilience and coordinate performance testing for reliability improvements.
Ensure alignment between service-level indicators (SLIs) and service-level objectives (SLOs) across the product family.
Develop and maintain product-level runbooks for incident response, collaborating with SRE teams to ensure effective recovery processes.
Provide leadership in tool selection and best practices for site reliability engineering (SRE), making final decisions on tools, libraries and standards.
Work closely with development teams to improve software reliability, scalability and resilience by offering feedback on design and architecture.
Lead troubleshooting and triage efforts during user-impacting incidents, ensuring swift resolution and minimal disruption.
Participate in special projects and continuous improvement initiatives, supporting long-term reliability and scalability goals.

EY

Professional Services

London
