Site Reliability Engineer

10 - 15 years

15 - 27 Lacs

Posted: 14 hours ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Senior Site Reliability Engineer / Principal Observability Platform Architect

Job Summary

We are seeking a highly experienced, hands-on Site Reliability Engineer (SRE) and Observability Subject Matter Expert to architect, engineer, and own an enterprise-grade, end-to-end observability platform. This role is accountable for unifying monitoring, logging, tracing, and user experience telemetry across a complex, multi-cloud (AWS & Azure), cloud-native ecosystem that includes Kubernetes platforms, microservices, APIs, and mobile applications.

This is a senior technical leadership role for an SME who goes beyond operating tools and instead designs a cohesive observability architecture that converts fragmented telemetry into actionable insight. You will work closely with platform engineering, application development, DevOps, security, and compliance teams to ensure reliability, performance, scalability, and operational excellence for business-critical systems.

Key Responsibilities

1. Observability Architecture & Platform Ownership

  • Architect, design, and own a unified, end-to-end observability platform spanning infrastructure, Kubernetes, container workloads, microservices, APIs, and mobile applications.
  • Define enterprise observability standards and “golden paths” covering metrics, logs, traces, events, SLOs, SLIs, and error budgets.
  • Lead the consolidation and integration of a multi-tool ecosystem into a cohesive observability strategy, including:
    • Datadog (APM, Synthetic Monitoring, RUM)
    • Prometheus and Grafana (Kubernetes-native and infrastructure metrics)
    • ELK Stack (centralized logging, analytics, and forensics)
    • AWS CloudWatch and Azure Monitor
  • Design scalable telemetry ingestion, enrichment, correlation, and retention pipelines that support high data volumes and long-term analytics.

2. Kubernetes, Microservices & Instrumentation Engineering

  • Engineer deep observability integrations with Kubernetes platforms, including control plane, nodes, pods, services, ingress, and autoscaling components.
  • Design and maintain monitoring for service meshes and distributed systems (e.g., Istio, Linkerd).
  • Act as the technical authority for Application Performance Monitoring (APM) and distributed tracing across microservices.
  • Build and maintain standardized instrumentation using OpenTelemetry or equivalent frameworks, including auto-instrumentation across multiple languages.
  • Ensure observability is embedded by default into CI/CD pipelines and application deployment workflows.

3. Monitoring, Alerting & Analytics

  • Design and operate advanced monitoring and alerting frameworks focused on signal quality, noise reduction, and business impact.
  • Implement SLO-based alerting and proactive reliability indicators aligned with service and customer outcomes.
  • Oversee centralized logging, distributed tracing, and cross-domain correlation to reduce Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR).
  • Develop operational, engineering, and executive dashboards for availability, performance, reliability, and capacity.

4. Reliability, Performance & Capacity Engineering

  • Lead performance monitoring, capacity planning, and performance tuning for large-scale, distributed microservices.
  • Ensure high availability (HA), fault tolerance, and graceful degradation through observability-driven system design.
  • Define and validate resilience patterns including health checks, retries, circuit breakers, rate limiting, and autoscaling.
  • Use observability data to drive rightsizing, capacity forecasting, and cloud cost optimization.

5. Deployment, Operations & Incident Management

  • Manage deployment and lifecycle of microservices across multiple environments (development, testing, staging, production).
  • Partner with development teams during release cycles to ensure observability readiness and release health validation.
  • Lead and participate in production incident response, deep root cause analysis (RCA), and post-incident reviews.
  • Establish operational runbooks, on-call readiness standards, and continuous improvement processes.

6. Security, Compliance & Data Governance

  • Work closely with security and compliance teams to ensure observability data adheres to security, privacy, and regulatory requirements.
  • Define data retention, access control, and compliance policies for telemetry platforms.
  • Manage secure application configuration, secrets, keys, and credentials using auditable, automated mechanisms.

7. Business Continuity & Resilience

  • Own backup and restore strategies for both application and observability platforms.
  • Define, test, and execute application disaster recovery (DR) plans, using observability data to validate recovery objectives.

8. Mobile Application Observability & Releases

  • Oversee observability and reliability for mobile applications (iOS and Android), including client-side telemetry and user experience monitoring.
  • Collaborate with the CTO CM team on mobile build and release cycles.
  • Support mobile application releases via TestFlight, Google Play, and Apple Developer Console from a reliability and operational perspective.

Required Expertise & Technical Profile

  • SME-level experience architecting and operating large-scale observability platforms in cloud-native environments.
  • Deep, hands-on expertise with Kubernetes, containerized workloads, and microservices architectures.
  • Expert-level knowledge of observability tooling, including Datadog, Prometheus, Grafana, ELK, CloudWatch, and Azure Monitor.
  • Strong experience with OpenTelemetry, distributed tracing, APM SDKs, and auto-instrumentation.
  • Solid grounding in SRE principles, SLO/SLA design, reliability engineering, and incident management.
  • Experience managing monitoring assets as code using Terraform or equivalent automation frameworks.
  • Proven ability to influence architecture decisions, define standards, and mentor engineering teams on observability best practices.

Intellect Design Arena

Financial Technology

Chennai
