Senior Site Reliability Engineer

4 - 9 years

8 - 12 Lacs

Posted: 5 hours ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

  • Be the founding SRE for India within the DevOps Platform Team, establishing operating rhythms, guardrails, and best practices that raise reliability across hundreds of services and 30+ Kubernetes clusters.
  • Lead global incident management from India time zones: triage and drive resolution as Incident Commander, coordinate war rooms, manage stakeholder communications, and publish timely status page updates.
  • Maintain automations to enable on-call rotations, escalation policies, and incident workflows in PagerDuty, Datadog and Slack.
  • Create actionable runbooks to reduce MTTA/MTTR.
  • Define and operationalize SLIs/SLOs and error budgets with product and engineering teams; coach teams on using error budgets for release decisions and reliability trade-offs.
  • Create high-signal observability: instrument services, tune alerts to reduce noise, and build reliability dashboards in Datadog.
  • Own planned maintenance: plan and schedule maintenance windows, coordinate execution across teams and environments (AWS, Azure, on-prem), communicate broadly, and verify recovery with clear rollback plans.
  • Eliminate toil through automation: build ChatOps, status page automation, auto-remediation workflows, and runbooks-as-code; integrate incident and maintenance workflows into CI/CD (Jenkins, Argo).
  • Drive production readiness: define PRR checklists, bake reliability gates into pipelines, and improve deployment strategies (blue/green, progressive delivery).
  • Partner with DevOps Platform Engineers to harden the Internal Developer Platform and improve developer experience while maintaining compliance requirements (e.g., ISO 27001, SOC 2, PCI).
  • Lead blameless postmortems, track corrective actions, and maintain a reliability backlog that measurably improves availability, latency, and change success rate.
  • Mentor engineers and evangelize SRE principles through documentation, training, and a reliability guild/community of practice.

What we're looking for

  • 4+ years in SRE/Production Engineering/DevOps operating distributed systems and microservices at scale, including Kubernetes and containerized workloads.
  • Proven incident response leadership: incident triage and coordination, clear stakeholder/customer communications, status page management, and creation of robust runbooks.
  • Strong observability skills: ideally in Datadog (metrics, logs, traces, dashboards, monitors) or familiarity with Prometheus/Grafana, New Relic, Dynatrace, or similar tools.
  • Expertise designing actionable alerts tied to SLIs/SLOs and managing error budgets.
  • Hands-on with CI/CD and release engineering: GitHub Actions, Argo (or similar), progressive delivery, feature flags, and safe rollout/rollback patterns.
  • Proficiency in at least one programming language (Golang preferred) plus Bash.
  • Ability to automate incident workflows, status page updates, and remediation tasks via APIs and ChatOps.
  • Solid foundations in Linux, networking, web protocols, DNS/TLS, load balancers/CDNs, and performance/capacity analysis.
  • Experience with databases and messaging systems is a plus.
  • Cloud fluency in Kubernetes, AWS, and/or Azure; understanding of multi-tenant, multi-region, and hybrid/on-prem environments.
  • Security-minded and comfortable working within compliance frameworks.
  • Infrastructure as Code experience (Terraform, Ansible, Kubernetes or similar) and Git-centric workflows.
  • Excellent written and verbal communication skills, including the ability to translate technical detail into concise business updates under pressure.
  • Self-starter comfortable with ambiguity and a founding-role mindset: high ownership, bias for action, data-driven decision making, and a passion for eliminating toil.
  • Willingness to participate in on-call during India hours and collaborate with global teams for follow-the-sun coverage.