Job Description
Your key responsibilities
Ensure system reliability, stability and performance by maintaining service-level objectives (SLOs) and minimizing downtime and incidents.
Collaborate with internal teams to assess system health, stability and resilience, providing architectural and design recommendations for reliability.
Lead incident management and post-incident reviews, diagnosing issues, deploying fixes and implementing preventive measures.
Drive automation of operational tasks, including deployments, monitoring, scaling and system recovery, to improve efficiency and reduce manual intervention.
Define and track key performance indicators (KPIs) such as availability, latency and error rates to optimize system performance and inform decision-making.
Plan and execute chaos engineering experiments to test system resilience and coordinate performance testing for reliability improvements.
Ensure alignment between service-level indicators (SLIs) and service-level objectives (SLOs) across the product family.
Develop and maintain product-level runbooks for incident response, collaborating with SRE teams to ensure effective recovery processes.
Provide leadership in tool selection and best practices for site reliability engineering (SRE), making final decisions on tools, libraries and standards.
Work closely with development teams to improve software reliability, scalability and resilience by offering feedback on design and architecture.
Lead troubleshooting and triage efforts during user-impacting incidents, ensuring swift resolution and minimal disruption.
Participate in special projects and continuous improvement initiatives, supporting long-term reliability and scalability goals.
Skills and attributes for success
A team player with strong analytical, communication and interpersonal skills.
A habit of keeping up to date with new technologies in the market.
A winning personality and the ability to become a trusted advisor to stakeholders.
To qualify for the role, you must have
Minimum 8 years of related experience, with at least 5 years in software development.
Bachelor's degree (B.E./B.Tech) in Computer Science or IT, or Bachelor's in Computer Applications (BCA) from a recognized institution.
Expertise in Site Reliability Engineering (SRE), DevOps, and system reliability, ensuring high availability and performance.
Strong experience in mobile platform reliability (Android, iOS), including performance monitoring and optimization.
Proficiency in observability and resiliency tools such as Splunk, Honeycomb, Datadog, Prometheus, or Grafana.
Hands-on experience with cloud platforms (AWS, Azure, GCP) and containerization/orchestration tools like Kubernetes, Docker, ECS, or Fargate.
Solid understanding of automation, Infrastructure-as-Code (IaC), and configuration management using Terraform, Ansible, or CloudFormation.
Strong programming and scripting skills in Python, Go, Bash, or Java, with experience in automating operational tasks.
Experience with CI/CD pipelines, deployment automation, and version control, using tools like GitHub, Bitbucket, Jenkins, or Bamboo.
Deep knowledge of incident management, root cause analysis, and post-incident reviews, focusing on continuous improvement.
Ideally, you'll also have
Strong verbal and written communication, facilitation, relationship-building, presentation and negotiation skills.
A highly flexible, adaptable, and creative approach.
Comfort interacting with senior executives (within the firm and at the client).