3 years

20 - 28 Lacs

Posted: 10 hours ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Role & Responsibilities
  • Operate and improve platform reliability for cloud-native services: set SLIs/SLOs, define error budgets, and drive uptime and performance improvements.
  • Design and maintain Infrastructure-as-Code and automated CI/CD pipelines (Terraform/CloudFormation, GitHub Actions/Jenkins) to ship safely and quickly.
  • Build observability and alerting: instrument services with metrics, logs, and traces (Prometheus, Grafana, ELK/EFK, Jaeger) and manage alerting runbooks.
  • Lead incident response and postmortems: triage, mitigate, automate remediation, and implement long-term fixes to reduce repeat incidents.
  • Automate operational tasks and scaling (autoscaling policies, capacity planning, cost optimizations) to keep systems efficient and resilient.
  • Collaborate with product and engineering teams to design reliable architectures, provide operational guidance, and embed reliability early in the delivery lifecycle.

Skills & Qualifications

Must-Have

  • 3+ years of experience in SRE/DevOps/Platform engineering or an equivalent hands-on systems engineering role.
  • Strong Linux administration skills and production troubleshooting experience.
  • Proven experience with containerization and orchestration (Docker & Kubernetes).
  • Hands-on experience with at least one major cloud provider (AWS, GCP, or Azure) and IaC tools (Terraform or CloudFormation).
  • Practical scripting or programming skills (Python, Go, or Bash) to automate operations and build reliability tooling.
  • Experience implementing monitoring, alerting and distributed tracing (Prometheus/Grafana, ELK/EFK, Jaeger) and designing SLIs/SLOs.

Preferred

  • Experience with service meshes (Istio/Linkerd), Helm/Kustomize and chaos engineering tools.
  • Familiarity with security hardening, cost-optimization practices and multi-cloud deployments.
  • Knowledge of platform observability automation, canary releases, and progressive delivery patterns.

Benefits & Culture Highlights
  • Hybrid working model with flexible hours and a focus on work-life balance (India).
  • Competitive compensation, health benefits and learning & development allowance to upskill in cloud and SRE practices.
  • Collaborative, blameless postmortem culture that rewards ownership, experimentation, and continuous improvement.
Keywords: Site Reliability Engineer, SRE, Kubernetes, AWS, Terraform, CI/CD, Prometheus, Grafana, Observability, Incident Response, SLIs/SLOs, Linux, Cloud Infrastructure.
Skills: AWS, EKS, Kubernetes, Python
