Site Reliability Engineer

Intellect Design Arena

10 - 15 years

15 - 27 Lacs

Chennai

Posted: 14 hours ago

Skills Required

Site Reliability Engineering, SRE, EKS, AWS, Kubernetes

Work Mode

Work from Office

Job Type

Full Time

Job Description

Senior Site Reliability Engineer / Principal Observability Platform Architect

Job Summary

We are seeking a highly experienced, hands-on Site Reliability Engineer (SRE) and Observability Subject Matter Expert to architect, engineer, and own an enterprise-grade, end-to-end observability platform. This role is accountable for unifying monitoring, logging, tracing, and user experience telemetry across a complex, multi-cloud (AWS & Azure), cloud-native ecosystem that includes Kubernetes platforms, microservices, APIs, and mobile applications.

This is a senior technical leadership role for an SME who goes beyond operating tools to design a cohesive observability architecture that converts fragmented telemetry into actionable insight. You will work closely with platform engineering, application development, DevOps, security, and compliance teams to ensure reliability, performance, scalability, and operational excellence for business-critical systems.

Key Responsibilities

1. Observability Architecture & Platform Ownership

Architect, design, and own a unified, end-to-end observability platform spanning infrastructure, Kubernetes, container workloads, microservices, APIs, and mobile applications.
Define enterprise observability standards and “golden paths” covering metrics, logs, traces, events, SLOs, SLIs, and error budgets.
Lead the consolidation and integration of a multi-tool ecosystem into a cohesive observability strategy, including:

Datadog (APM, Synthetic Monitoring, RUM)
Prometheus and Grafana (Kubernetes-native and infrastructure metrics)
ELK Stack (centralized logging, analytics, and forensics)
AWS CloudWatch and Azure Monitor

Design scalable telemetry ingestion, enrichment, correlation, and retention pipelines that support high data volumes and long-term analytics.

2. Kubernetes, Microservices & Instrumentation Engineering

Engineer deep observability integrations with Kubernetes platforms, including control plane, nodes, pods, services, ingress, and autoscaling components.
Design and maintain monitoring for service meshes and distributed systems (e.g., Istio, Linkerd).
Act as the technical authority for Application Performance Monitoring (APM) and distributed tracing across microservices.
Build and maintain standardized instrumentation using OpenTelemetry or equivalent frameworks, including auto-instrumentation across multiple languages.
Ensure observability is embedded by default into CI/CD pipelines and application deployment workflows.

3. Monitoring, Alerting & Analytics

Design and operate advanced monitoring and alerting frameworks focused on signal quality, noise reduction, and business impact.
Implement SLO-based alerting and proactive reliability indicators aligned with service and customer outcomes.
Oversee centralized logging, distributed tracing, and cross-domain correlation to reduce Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR).
Develop operational, engineering, and executive dashboards for availability, performance, reliability, and capacity.

4. Reliability, Performance & Capacity Engineering

Lead performance monitoring, capacity planning, and performance tuning for large-scale, distributed microservices.
Ensure high availability (HA), fault tolerance, and graceful degradation through observability-driven system design.
Define and validate resilience patterns including health checks, retries, circuit breakers, rate limiting, and autoscaling.
Use observability data to drive rightsizing, capacity forecasting, and cloud cost optimization.

5. Deployment, Operations & Incident Management

Manage deployment and lifecycle of microservices across multiple environments (development, testing, staging, production).
Partner with development teams during release cycles to ensure observability readiness and release health validation.
Lead and participate in production incident response, deep root cause analysis (RCA), and post-incident reviews.
Establish operational runbooks, on-call readiness standards, and continuous improvement processes.

6. Security, Compliance & Data Governance

Work closely with security and compliance teams to ensure observability data adheres to security, privacy, and regulatory requirements.
Define data retention, access control, and compliance policies for telemetry platforms.
Manage secure application configuration, secrets, keys, and credentials using auditable, automated mechanisms.

7. Business Continuity & Resilience

Own backup and restore strategies for both application and observability platforms.
Define, test, and execute application disaster recovery (DR) plans, using observability data to validate recovery objectives.

8. Mobile Application Observability & Releases

Oversee observability and reliability for mobile applications (iOS and Android), including client-side telemetry and user experience monitoring.
Collaborate with the CTO CM team on mobile build and release cycles.
Support mobile application releases via TestFlight, Google Play, and Apple Developer Console from a reliability and operational perspective.

Required Expertise & Technical Profile

SME-level experience architecting and operating large-scale observability platforms in cloud-native environments.
Deep, hands-on expertise with Kubernetes, containerized workloads, and microservices architectures.
Expert-level knowledge of observability tooling, including Datadog, Prometheus, Grafana, ELK, CloudWatch, and Azure Monitor.
Strong experience with OpenTelemetry, distributed tracing, APM SDKs, and auto-instrumentation.
Solid grounding in SRE principles, SLO/SLA design, reliability engineering, and incident management.
Experience managing monitoring assets as code using Terraform or equivalent automation frameworks.
Proven ability to influence architecture decisions, define standards, and mentor engineering teams on observability best practices.

Intellect Design Arena

Financial Technology

Chennai
