We are looking for an accomplished Site Reliability Engineer (SRE) for one of our clients to lead the observability and monitoring strategy for their AI-integrated ASOC platform and its associated products. This role requires a strong foundation in SDLC, agile practices, and automated testing, along with deep expertise in building reliable, scalable, and data-intensive systems.

What You’ll Be Doing

Design and implement observability and monitoring systems for data analytics, SIEM, and AI platforms using Prometheus, Grafana, and related tools.
Automate operational tasks with Kubernetes, Python, Terraform, and ArgoCD Workflows to enhance deployment speed and reliability.
Define and prioritize SLOs, SLIs, and SLAs in collaboration with cross-functional teams.
Champion proactive incident management with automated alerts, reducing MTTD and MTTR for Sev1/Sev2 incidents.
Conduct postmortems and root cause analyses (RCA) to drive continuous improvement.
Ensure secure and gradual deployment practices with strong testing and fail-fast validation.
Perform capacity planning and performance tuning to support scalable infrastructure.
Foster collaboration across Engineering, Observability, MonOps, and CloudOps teams.
Maintain transparent communication about service status, incidents, and resolutions.
Advocate for and implement cutting-edge automation, observability, and monitoring technologies.

What We Need To See

9+ years of experience in Site Reliability Engineering / DevOps with large-scale, data-intensive systems.
Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
Expertise in observability and monitoring tools (Prometheus, Grafana) in Kubernetes environments.
Strong knowledge of cloud platforms (AWS preferred).
Proven experience with Docker, Kubernetes, Python, Terraform, Ansible, and CI/CD pipelines (GitLab CI, ArgoCD).
Solid grasp of SDLC and Agile methodologies with experience defining SLOs, SLIs, and SLAs.
Strong automation and scripting skills for incident resolution.
Exceptional analytical and problem-solving skills with hands-on debugging of performance bottlenecks and dependency issues in production.
Experience with capacity planning, load testing, and performance monitoring tools (e.g., Locust, Prometheus).
Familiarity with secure deployment practices and automated testing frameworks.
Self-starter with a “get things done” attitude, able to work independently.
(Nice to Have) Background in cybersecurity or data lake platforms.

Job Type: Full-time

Pay: ₹1,350,000.00 - ₹4,050,000.00 per year

Benefits:

Food provided
Health insurance
Life insurance
Provident Fund

Work Location: In person
