Hybrid
Full Time
The Site Reliability Lead, Specialist is responsible for ensuring the stability, availability and performance of systems and applications by implementing reliability engineering best practices, automation and incident management strategies. The role involves monitoring and optimizing mobile platforms (Android, iOS), leveraging observability and resiliency tools and collaborating with development teams to enhance system reliability. It is central to troubleshooting incidents, automating operational tasks, improving CI/CD pipelines and driving continuous improvements in site reliability and performance.
Ensure system reliability, stability and performance by maintaining service-level objectives (SLOs) and minimizing downtime and incidents.
Collaborate with internal teams to assess system health, stability and resilience, providing architectural and design recommendations for reliability.
Lead incident management and post-incident reviews, diagnosing issues, deploying fixes and implementing preventive measures.
Drive automation of operational tasks, including deployments, monitoring, scaling and system recovery, to improve efficiency and reduce manual intervention.
Define and track key performance indicators (KPIs) such as availability, latency and error rates to optimize system performance and inform decision-making.
Plan and execute chaos engineering experiments to test system resilience and coordinate performance testing for reliability improvements.
Ensure alignment between service-level indicators (SLIs) and service-level objectives (SLOs) across the product family.
Develop and maintain product-level runbooks for incident response, collaborating with SRE teams to ensure effective recovery processes.
Provide leadership in tool selection and best practices for site reliability engineering (SRE), making final decisions on tools, libraries and standards.
Work closely with development teams to improve software reliability, scalability and resilience by offering feedback on design and architecture.
Lead troubleshooting and triage efforts during user-impacting incidents, ensuring swift resolution and minimal disruption.
Participate in special projects and continuous improvement initiatives, supporting long-term reliability and scalability goals.
EY