Site Reliability Engineer

Viraaj HR Solutions Private Limited

3 years

20 - 28 Lacs

Hyderabad, Telangana, India

Posted: 4 months ago

Skills Required

reliability, drive, design, code, terraform, github, jenkins, metrics, jaeger, triage, automate, remediation, scaling, autoscaling, planning, engineering, devops, linux, troubleshooting, containerization, orchestration, docker, aws, gcp, azure, scripting, programming, python, monitoring, service, linkerd, helm, kustomize, security, optimization, automation, model, compensation, learning, development, collaborative, experimentation, kubernetes

Work Mode

On-site

Job Type

Full Time

Job Description

Role & Responsibilities

Operate and improve platform reliability for cloud-native services: set SLIs/SLOs, define error budgets, and drive uptime and performance improvements.
Design and maintain Infrastructure-as-Code and automated CI/CD pipelines (Terraform/CloudFormation, GitHub Actions/Jenkins) to ship safely and quickly.
Build observability and alerting: instrument services with metrics, logs, and traces (Prometheus, Grafana, ELK/EFK, Jaeger) and manage alerting runbooks.
Lead incident response and postmortems: triage, mitigate, automate remediation, and implement long-term fixes to reduce repeat incidents.
Automate operational tasks and scaling (autoscaling policies, capacity planning, cost optimizations) to keep systems efficient and resilient.
Collaborate with product and engineering teams to design reliable architectures, provide operational guidance, and embed reliability early in the delivery lifecycle.

Skills & Qualifications

Must-Have

3+ years of experience in SRE/DevOps/Platform engineering or an equivalent hands-on systems engineering role.
Strong Linux administration skills and production troubleshooting experience.
Proven experience with containerization and orchestration (Docker & Kubernetes).
Hands-on experience with at least one major cloud provider (AWS, GCP, or Azure) and IaC tools (Terraform or CloudFormation).
Practical scripting or programming skills (Python, Go, or Bash) to automate operations and build reliability tooling.
Experience implementing monitoring, alerting and distributed tracing (Prometheus/Grafana, ELK/EFK, Jaeger) and designing SLIs/SLOs.

Preferred

Experience with service meshes (Istio/Linkerd), Helm/Kustomize and chaos engineering tools.
Familiarity with security hardening, cost-optimization practices and multi-cloud deployments.
Knowledge of platform observability automation, canary releases, and progressive delivery patterns.

Benefits & Culture Highlights

Hybrid working model with flexible hours and a focus on work-life balance (India).
Competitive compensation, health benefits, and a learning & development allowance to upskill in cloud and SRE practices.
Collaborative, blameless postmortem culture that rewards ownership, experimentation, and continuous improvement.

Keywords: Site Reliability Engineer, SRE, Kubernetes, AWS, Terraform, CI/CD, Prometheus, Grafana, Observability, Incident Response, SLIs/SLOs, Linux, Cloud Infrastructure.
Skills: AWS, EKS, Kubernetes, Python
