Site Reliability Engineer

9 years

13 - 40 Lacs

Posted: 3 hours ago | Platform: Glassdoor

Work Mode

On-site

Job Type

Full Time

Job Description

We are looking for an accomplished Site Reliability Engineer (SRE) for one of our clients to lead the observability and monitoring strategy for our AI-integrated ASOC platform and its associated products. This role requires a strong foundation in SDLC, agile practices, and automated testing, along with deep expertise in building reliable, scalable, and data-intensive systems.

What You’ll Be Doing

  • Design and implement observability and monitoring systems for data analytics, SIEM, and AI platforms using Prometheus, Grafana, and related tools.
  • Automate operational tasks with Kubernetes, Python, Terraform, and ArgoCD Workflows to enhance deployment speed and reliability.
  • Define and prioritize SLOs, SLIs, and SLAs in collaboration with cross-functional teams.
  • Champion proactive incident management with automated alerts, reducing MTTD and MTTR for Sev1/Sev2 incidents.
  • Conduct postmortems and root cause analyses (RCA) to drive continuous improvement.
  • Ensure secure and gradual deployment practices with strong testing and fail-fast validation.
  • Perform capacity planning and performance tuning to support scalable infrastructure.
  • Foster collaboration across Engineering, Observability, MonOps, and CloudOps teams.
  • Maintain transparent communication about service status, incidents, and resolutions.
  • Advocate for and implement cutting-edge automation, observability, and monitoring technologies.

What We Need To See

  • 9+ years of experience in Site Reliability Engineering / DevOps with large-scale, data-intensive systems.
  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
  • Expertise in observability & monitoring tools (Prometheus, Grafana, Kubernetes).
  • Strong knowledge of cloud platforms (AWS preferred).
  • Proven experience with Docker, Kubernetes, Python, Terraform, Ansible, and CI/CD pipelines (GitLab CI, ArgoCD).
  • Solid grasp of SDLC and Agile methodologies, plus experience defining SLOs, SLIs, and SLAs.
  • Strong automation and scripting skills for incident resolution.
  • Exceptional analytical and problem-solving skills with hands-on debugging of performance bottlenecks and dependency issues in production.
  • Experience with capacity planning and performance monitoring tools (e.g., Locust, Prometheus).
  • Familiarity with secure deployment practices and automated testing frameworks.
  • Self-starter with a “get things done” attitude, able to work independently.
  • (Nice to Have) Background in cybersecurity or data lake platforms.

Job Type: Full-time

Pay: ₹1,350,000.00 - ₹4,050,000.00 per year

Benefits:

  • Food provided
  • Health insurance
  • Life insurance
  • Provident Fund

Work Location: In person
