Site Reliability Engineer

5 - 8 years

10 - 20 Lacs

Posted: -1 days ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Role and Responsibilities

  • Build, deploy, and manage reliability, monitoring, and observability solutions across application, infrastructure, cloud, and on-prem environments.
  • Design, configure, and maintain dashboards and alerts using tools such as AppDynamics, Grafana, Sumo Logic, Datadog, Splunk, Dynatrace, or equivalent platforms.
  • Implement automation for deployments, monitoring, remediation, and operational workflows using Python, Bash, PowerShell, Terraform, or Ansible.
  • Develop self-healing solutions to reduce operational toil and improve system resilience.
  • Troubleshoot complex production issues, lead incident response during P1/P2 events, perform root cause analysis (RCA), and drive preventive actions.
  • Create and maintain incident playbooks and escalation workflows.
  • Support and manage containerized workloads, Kubernetes clusters, and orchestration platforms following SRE and IaC best practices.
  • Define, measure, monitor, and report SLIs, SLOs, and error budgets for critical applications and infrastructure components.
  • Integrate observability platforms with ITSM tools to automate ticket creation, alert enrichment, correlation, and resolution workflows.
  • Analyze logs, metrics, and traces for anomaly detection, trend analysis, capacity planning, and performance optimization.
  • Partner with Application, DevOps, Infrastructure, Network, Database, and Security teams to improve reliability and reduce operational toil.
  • Support CI/CD pipelines and release processes by enforcing reliability, scalability, and resiliency standards.
  • Contribute to DR/BCP planning, resilience testing, and platform hardening initiatives.
  • Maintain operational documentation, architecture diagrams, SOPs, and runbooks.
  • Identify opportunities to improve system stability, observability coverage, and operational efficiency.

Required Skills

  • 5 to 8 years of experience in Site Reliability Engineering, Application Support, DevOps, or Infrastructure Engineering roles.
  • Build, deploy, and manage reliability, monitoring, and observability solutions across application, infrastructure, cloud, and on-prem environments.
  • Design, configure, and maintain dashboards and alerts using tools such as AppDynamics, Grafana, Sumo Logic, Datadog, Splunk, Dynatrace, or equivalent platforms.
  • Implement automation and self-healing solutions for deployments, monitoring, remediation, and operational workflows using Python, Bash, PowerShell, Terraform, or Ansible.
  • Troubleshoot complex production issues, lead incident response during P1/P2 events, perform RCA, and drive preventive actions to avoid recurrence.
  • Support and manage containerized workloads, Kubernetes clusters, and container orchestration platforms following SRE and IaC best practices.
  • Define, measure, monitor, and report SLIs, SLOs, and error budgets for critical applications and infrastructure components.
  • Integrate observability platforms with ITSM tools to automate ticket creation, alert enrichment, correlation, and resolution workflows.
  • Analyze logs, metrics, and traces to perform anomaly detection, trend analysis, capacity planning, and performance optimization.
  • Partner closely with Application, DevOps, Infrastructure, Network, Database, and Security teams to improve reliability and reduce operational toil.
  • Support CI/CD pipelines and release processes by enforcing reliability, scalability, and resiliency standards.
  • Contribute to DR/BCP planning, resilience testing, and platform hardening initiatives.
  • Create and maintain operational documentation, architecture diagrams, runbooks, SOPs, and incident playbooks.
  • Continuously identify opportunities to improve system stability, observability coverage, and operational efficiency.

GSPANN

Information Technology and Services

Plymouth
