We are looking for a Senior Site Reliability Engineer (SRE) to lead the reliability strategy of our mission-critical HealthTech SaaS platform. This role is designed for a hands-on engineer who can architect and operate large-scale, high-availability systems, establish a 24×7 SRE practice, and enforce reliability standards through SLAs, SLOs, and error budgets. You will be responsible for ensuring uptime, performance, observability, and seamless deployments for a system serving hospitals, clinicians, and critical healthcare operations.

Key Responsibilities

1. Build & Lead the SRE Practice (24×7 Model)

● Establish a round-the-clock SRE operation with robust on-call processes.

● Define escalation paths, runbooks, SOPs, and reliability governance.

● Mentor and onboard SRE team members to build a high-performing reliability culture.

2. Reliability & Performance Engineering

● Own service uptime, latency, and error rate metrics; ensure adherence to defined SLAs/SLOs.

● Create and manage error budgets, and drive conversations with engineering to maintain reliability.

● Conduct capacity planning, load forecasting, and performance tuning.

3. Observability & Monitoring (Hands-on with RUM/APM)

● Implement and manage tools such as:

○ Real User Monitoring (RUM)

○ APM tools (New Relic, Grafana Tempo, Dynatrace, Datadog, AppDynamics, etc.)

○ Infrastructure monitoring (Prometheus, Grafana, ELK/EFK, CloudWatch/Stackdriver)

● Build dashboards, alerts, tracing flows, synthetic monitoring, and anomaly detection systems.

4. Incident Management & Root Cause Analysis

● Lead major incidents and outages with calm, structured execution.

● Drive after-action reviews using 5 Whys analysis, fishbone diagrams, and RCA documents.

● Collaborate with engineering and DevOps teams to implement preventive fixes.

5. Deployment, Automation & Reliability Tooling

● Improve CI/CD pipelines to ensure safe, predictable deployments.

● Implement:

○ Canary deployments

○ Blue/green deployments

○ Auto-remediation scripting

○ Chaos engineering practice (preferred)

● Automate repeatable operational tasks to reduce toil.

6. Infrastructure & System Architecture

● Work with cloud platforms (AWS/GCP/Azure) to optimize performance and cost.

● Manage:

○ Kubernetes clusters

○ Service meshes

○ Distributed systems

○ Database reliability

● Ensure zero-downtime releases and robust failover strategies.

Required Skills & Experience

Technical Skills

● 8–12 years of SRE/DevOps/Production Engineering experience.

● Strong hands-on experience with RUM & APM tools.

● Deep understanding of:

○ Distributed systems

○ Microservices

○ Containers & Kubernetes

○ Networking fundamentals

○ Load balancers, CDNs, caching layers

● Strong scripting skills (Python, Bash, Go preferred).

● Experience with SQL/NoSQL databases and performance tuning.

● Expertise in observability stacks (Prometheus, Grafana, Loki, Jaeger, Kibana).

SRE Practice Skills

● Proven ability to define and enforce SLA, SLO, and SLI frameworks.

● Experience building or scaling 24×7 support models.

● Strong grounding in incident management, change management, and release processes.

● Understanding of security, compliance, and audit readiness, which is important for healthcare (HIPAA/NDHM awareness is a plus).

Soft Skills

● Excellent communication skills; ability to simplify technical issues for leadership.

● Strong ownership, accountability, and customer-centric thinking.

● Ability to coordinate across engineering, DevOps, product, and infrastructure teams.

Nice-to-Have Skills

● Experience with healthcare SaaS or critical systems.

● Knowledge of OTEL (OpenTelemetry) instrumentation.

● Chaos engineering tools (LitmusChaos, Gremlin).

● Experience with automation frameworks for alert triage.

What Success Looks Like

● 99.9%+ uptime with measurable SLO tracking.

● Full 24×7 SRE team established with rotation and playbooks.

● Reduction in Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

● Predictable and low-risk production deployments.

● Highly observable system with actionable monitoring and automated alerts.

Job Type: Full-time

Pay: From ₹900,000.00 per year

Benefits:

Paid time off

Education:

Bachelor's (Preferred)

Experience:

Production Engineering: 8 years (Preferred)
RUM & APM tools: 8 years (Preferred)
Python: 8 years (Preferred)
SQL/NoSQL databases: 8 years (Preferred)
Performance tuning: 8 years (Preferred)
Observability stacks: 8 years (Preferred)

Work Location: In person

Company: Artem Healthtech Private Limited