8 years
9 Lacs
On-site
Full Time
About the Role
We are looking for a Senior Site Reliability Engineer (SRE) to lead the reliability strategy of our mission-critical HealthTech SaaS platform. This role is designed for a hands-on engineer who can architect and operate large-scale, high-availability systems, establish a 24×7 SRE practice, and enforce reliability standards through SLAs, SLOs, and error budgets. You will be responsible for ensuring uptime, performance, observability, and seamless deployments for a system serving hospitals, clinicians, and critical healthcare operations.
Key Responsibilities
1. Build & Lead the SRE Practice (24×7 Model)
● Establish a round-the-clock SRE operation with robust on-call processes.
● Define escalation paths, runbooks, SOPs, and reliability governance.
● Mentor and onboard SRE team members to build a high-performing reliability culture.
2. Reliability & Performance Engineering
● Own service uptime, latency, and error rate metrics; ensure adherence to defined SLAs/SLOs.
● Create and manage error budgets, and use them to drive conversations with engineering about maintaining reliability.
● Conduct capacity planning, load forecasting, and performance tuning.
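For illustration, an error budget follows directly from an SLO target. The sketch below is a minimal example in Python, assuming an availability SLO expressed as a fraction and a 30-day rolling window. The function names and the window length are hypothetical and are not taken from any tooling named in this posting.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction such as 0.999 (99.9%). The 30-day default
    window is an assumption made for this example.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the error budget.

    The result is negative once downtime exceeds the budget.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of the budget remains.
    print(round(budget_remaining(0.999, 10.8), 2))
```

A team would typically slow or pause risky releases as the remaining fraction approaches zero, which is the conversation with engineering that the error budget is meant to support.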
3. Observability & Monitoring (Hands-on with RUM/APM)
● Implement and manage tools such as:
○ Real User Monitoring (RUM)
○ APM tools (New Relic, Grafana Tempo, Dynatrace, Datadog, AppDynamics, etc.)
○ Infrastructure monitoring (Prometheus, Grafana, ELK/EFK, CloudWatch/Stackdriver)
● Build dashboards, alerts, tracing flows, synthetic monitoring, and anomaly detection systems.
4. Incident Management & Root Cause Analysis
● Lead major incidents and outages with calm, structured execution.
● Drive after-action reviews using 5-Why and fishbone analysis, and document the findings in RCA reports.
● Collaborate with engineering and DevOps teams to implement preventive fixes.
5. Deployment, Automation & Reliability Tooling
● Improve CI/CD pipelines to ensure safe, predictable deployments.
● Implement:
○ Canary deployments
○ Blue/green deployments
○ Auto-remediation scripting
○ Chaos engineering practice (preferred)
● Automate repeatable operational tasks to reduce toil.
6. Infrastructure & System Architecture
● Work with cloud platforms (AWS/GCP/Azure) to optimize performance and cost.
● Manage:
○ Kubernetes clusters
○ Service meshes
○ Distributed systems
○ Database reliability
● Ensure zero-downtime releases and robust failover strategies.
Required Skills & Experience
Technical Skills
● 8–12 years of SRE/DevOps/Production Engineering experience.
● Strong hands-on experience with RUM & APM tools.
● Deep understanding of:
○ Distributed systems
○ Microservices
○ Containers & Kubernetes
○ Networking fundamentals
○ Load balancers, CDNs, caching layers
● Strong scripting skills (Python, Bash, Go preferred).
● Experience with SQL/NoSQL databases and performance tuning.
● Expertise in observability stacks (Prometheus, Grafana, Loki, Jaeger, Kibana).
SRE Practice Skills
● Proven ability to define and enforce SLA, SLO, SLI frameworks.
● Experience building or scaling 24×7 support models.
● Strong grounding in incident management, change management, and release processes.
● Understanding of security, compliance, and audit readiness—important for healthcare (HIPAA/NDHM awareness is a plus).
Soft Skills
● Excellent communication skills; ability to simplify technical issues for leadership.
● Strong ownership, accountability, and customer-centric thinking.
● Ability to coordinate across engineering, DevOps, product, and infrastructure teams.
Nice-to-Have Skills
● Experience with healthcare SaaS or critical systems.
● Knowledge of OTEL (OpenTelemetry) instrumentation.
● Chaos engineering tools (LitmusChaos, Gremlin).
● Experience with automation frameworks for alert triage.
What Success Looks Like
● 99.9%+ uptime with measurable SLO tracking.
● Full 24×7 SRE team established with rotation and playbooks.
● Reduction in Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
● Predictable and low-risk production deployments.
● Highly observable system with actionable monitoring and automated alerts.
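As a rough illustration of how the MTTD and MTTR targets above might be tracked, the sketch below computes both from a list of incident records. The `Incident` structure and its field names are hypothetical, and the example assumes both metrics are measured from the moment the incident started. Some teams measure MTTR from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record used only for this example."""
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect, in minutes, measured from incident start."""
    return mean((i.detected - i.started).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve, in minutes, measured from incident start."""
    return mean((i.resolved - i.started).total_seconds() for i in incidents) / 60


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    sample = [
        Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
        Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=65)),
    ]
    print(mttd_minutes(sample))  # 10.0
    print(mttr_minutes(sample))  # 50.0
```

Comparing these averages across quarters gives a simple way to show whether detection and resolution times are actually falling.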
Job Type: Full-time
Pay: From ₹900,000.00 per year
Work Location: In person
Artem Healthtech Private Limited