Description:
Senior Site Reliability Engineer
Experience:
5 to 10 Years
Location:
Hyderabad/Ahmedabad (Hybrid work mode)
Shift Timing:
12:30 PM to 9:30 PM, and 4:30 AM to 1:30 PM (1 week in every 2 months; this shift can be worked from home)
About TechBlocks
TechBlocks is a global digital product engineering company with 16+ years of experience helping Fortune 500 enterprises and high-growth brands accelerate innovation, modernize technology, and drive digital transformation. From cloud solutions and data engineering to experience design and platform modernization, we help businesses solve complex challenges and unlock new growth opportunities.
At TechBlocks, we believe technology is only as powerful as the people behind it. We foster a culture of collaboration, creativity, and continuous learning, where big ideas turn into real impact. Whether you're building seamless digital experiences, optimizing enterprise platforms, or tackling complex integrations, you'll be part of a dynamic, fast-moving team that values innovation and ownership.
Join us and shape the future of digital transformation.
Summary
As an SRE, you will own platform reliability, incident management, and performance optimization. You'll define SLIs/SLOs, contribute to robust observability practices, and drive proactive reliability engineering across services.
Experience Required
- 5 to 10 years of SRE or infrastructure engineering experience in cloud-native environments.
Mandatory
Technical Knowledge & Skills:
- Cloud: GCP (GKE, Load Balancing, VPN, IAM)
- Observability: Prometheus, Grafana, ELK, Datadog
- Containers & Orchestration: Kubernetes, Docker
- Incident Management: On-call, RCA, SLIs/SLOs
- IaC: Terraform, Helm
- Incident Tools: PagerDuty, OpsGenie
Nice To Have
- GCP Monitoring, SkyWalking
- Service Mesh, API Gateway
- GCP Spanner, MongoDB (basic)
Scope
- Drive operational excellence and platform resilience
- Reduce MTTR, increase service availability
- Own incident and RCA processes
Roles And Responsibilities
- Define and measure Service Level Indicators (SLIs), Service Level Objectives (SLOs), and manage error budgets across services.
- Lead incident management for critical production issues; drive root cause analysis (RCA) and postmortems.
- Create and maintain runbooks and standard operating procedures for high-availability services.
- Design and implement observability frameworks using ELK, Prometheus, and Grafana; drive telemetry adoption.
- Coordinate cross-functional war-room sessions during major incidents and maintain response logs.
- Develop and improve automated system recovery, alert suppression, and escalation logic.
- Use GCP tools like GKE, Cloud Monitoring, and Cloud Armor to improve performance and security posture.
- Collaborate with DevOps and Infrastructure teams to build highly available and scalable systems.
- Analyze performance metrics and conduct regular reliability reviews with engineering leads.
- Participate in capacity planning, failover testing, and resilience architecture reviews.
(ref:hirist.tech)