Senior Site Reliability Engineer (SRE) – HealthTech SaaS

8 years

9 Lacs

Posted: 19 hours ago | Platform: Glassdoor


Work Mode

On-site

Job Type

Full Time

Job Description

About the Role

We are looking for a Senior Site Reliability Engineer (SRE) to lead the reliability strategy of our mission-critical HealthTech SaaS platform. This role is designed for a hands-on engineer who can architect and operate large-scale, high-availability systems, establish a 24×7 SRE practice, and enforce reliability standards through SLAs, SLOs, and error budgets. You will be responsible for ensuring uptime, performance, observability, and seamless deployments for a system serving hospitals, clinicians, and critical healthcare operations.

Key Responsibilities

1. Build & Lead the SRE Practice (24×7 Model)

● Establish a round-the-clock SRE operation with robust on-call processes.

● Define escalation paths, runbooks, SOPs, and reliability governance.

● Mentor and onboard SRE team members to build a high-performing reliability culture.

2. Reliability & Performance Engineering

● Own service uptime, latency, and error rate metrics; ensure adherence to defined SLAs/SLOs.

● Create and manage error budgets, and drive conversations with engineering to maintain reliability.

● Conduct capacity planning, load forecasting, and performance tuning.
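
As a rough illustration of the error-budget arithmetic behind these responsibilities, the sketch below converts an availability SLO into allowed downtime and reports how much of the budget is left. The 30-day window, the function names, and the sample downtime are assumptions made for illustration; they are not taken from the employer's tooling.

```python
# Minimal error-budget sketch. All names and the 30-day window are
# illustrative assumptions, not part of the role's actual tooling.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows per window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After a hypothetical 10.8 minutes of downtime, about 75% of the budget remains.
    print(round(budget_remaining(0.999, 10.8), 2))
```

When the remaining fraction approaches zero, a team would typically slow down risky releases until the window rolls over; the exact policy is something each organization defines for itself.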

3. Observability & Monitoring (Hands-on with RUM/APM)

● Implement and manage tools such as:

○ Real User Monitoring (RUM)

○ APM tools (New Relic, Grafana Tempo, Dynatrace, Datadog, AppDynamics, etc.)

○ Infrastructure monitoring (Prometheus, Grafana, ELK/EFK, CloudWatch/Stackdriver)

● Build dashboards, alerts, tracing flows, synthetic monitoring, and anomaly detection systems.

4. Incident Management & Root Cause Analysis

● Lead major incidents and outages with calm, structured execution.

● Drive after-action reviews using 5-Whys and fishbone analysis, and capture the findings in RCA documents.

● Collaborate with engineering and DevOps teams to implement preventive fixes.

5. Deployment, Automation & Reliability Tooling

● Improve CI/CD pipelines to ensure safe, predictable deployments.

● Implement:

○ Canary deployments

○ Blue/green deployments

○ Auto-remediation scripting

○ Chaos engineering practice (preferred)

● Automate repeatable operational tasks to reduce toil.

6. Infrastructure & System Architecture

● Work with cloud platforms (AWS/GCP/Azure) to optimize performance and cost.

● Manage:

○ Kubernetes clusters

○ Service meshes

○ Distributed systems

○ Database reliability

● Ensure zero-downtime releases and robust failover strategies.

Required Skills & Experience

Technical Skills

● 8–12 years of SRE/DevOps/Production Engineering experience.

● Strong hands-on experience with RUM & APM tools.

● Deep understanding of:

○ Distributed systems

○ Microservices

○ Containers & Kubernetes

○ Networking fundamentals

○ Load balancers, CDNs, caching layers

● Strong scripting skills (Python, Bash, Go preferred).

● Experience with SQL/NoSQL databases and performance tuning.

● Expertise in observability stacks (Prometheus, Grafana, Loki, Jaeger, Kibana).

SRE Practice Skills

● Proven ability to define and enforce SLA, SLO, and SLI frameworks.

● Experience building or scaling 24×7 support models.

● Strong grounding in incident management, change management, and release processes.

● Understanding of security, compliance, and audit readiness, all of which are important for healthcare (HIPAA/NDHM awareness is a plus).

Soft Skills

● Excellent communication skills; ability to simplify technical issues for leadership.

● Strong ownership, accountability, and customer-centric thinking.

● Ability to coordinate across engineering, DevOps, product, and infrastructure teams.

Nice-to-Have Skills

● Experience with healthcare SaaS or critical systems.

● Knowledge of OTEL (OpenTelemetry) instrumentation.

● Chaos engineering tools (LitmusChaos, Gremlin).

● Experience with automation frameworks for alert triage.

What Success Looks Like

● 99.9%+ uptime with measurable SLO tracking.

● Full 24×7 SRE team established with rotation and playbooks.

● Reduction in Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

● Predictable and low-risk production deployments.

● Highly observable system with actionable monitoring and automated alerts.

Job Type: Full-time

Pay: From ₹900,000.00 per year

Benefits:

  • Paid time off

Education:

  • Bachelor's (Preferred)

Experience:

  • Production Engineering: 8 years (Preferred)
  • RUM & APM tools: 8 years (Preferred)
  • Python: 8 years (Preferred)
  • SQL/NoSQL databases: 8 years (Preferred)
  • Performance tuning: 8 years (Preferred)
  • Observability stacks: 8 years (Preferred)

Work Location: In person
