Hybrid
Full Time
This role consults on the maintenance of a reliable site environment to ensure the stability and security of multiple systems/platforms, developing and implementing improvements for all aspects of software reliability. This role also includes ensuring system reliability by meeting service-level objectives (SLOs), driving automation of operational tasks, defining and tracking key performance indicators (KPIs), designing scalable systems, managing incident responses, and collaborating with development teams to ensure software reliability and scalability.
Collaborates with internal teams to evaluate the health, stability and reliability of systems/platforms. Consults on architecture and programming design decisions related to availability and resilience.
Conducts localized failure mode analysis when new features and architecture patterns are introduced. Facilitates post-incident reviews for any client-impacting events local to the product family.
Manages the planning and execution of chaos experiments to meet the development and maintenance requirements of systems/platforms for the product family. Coordinates performance tests for the product family.
Assists product teams with triage and troubleshooting during client-impacting incidents. Ensures alignment between service-level indicators and objectives within the product family.
Maintains product-level runbooks for incident response, in collaboration with SRE Practitioners on each product team, to document the step-by-step process to recover from failures of specific components within a system.
Makes final decisions regarding usage of tools, libraries, and standards for SRE in situations where multiple options have been provided by SRE. Participates in special projects and performs other duties as assigned.
Drives automation for key operational tasks such as monitoring, deployments, and system scaling to improve speed and reduce manual errors. Designs resilient systems that can scale effectively during peak loads or sudden traffic spikes to ensure uninterrupted service.
Defines and tracks KPIs such as availability, latency, and error rates to monitor system performance and guide optimization efforts. Collaborates closely with software development teams to ensure new code and changes align with performance and reliability objectives.
EY