Observability Engineer

6 - 9 years

12 - 22 Lacs

Chennai, Bengaluru, Mumbai (all areas)

Posted: -1 days ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Location: Chennai, Bengaluru, Mumbai (all areas)

Experience: 6 - 9 years

CTC: 12 - 22 Lacs

Notice Period:

Role Overview

Observability Engineer […] “nervous system” […]

Key Roles & Responsibilities

Observability Strategy & Architecture

  • Define and drive the organization’s observability strategy, standards, and roadmap.
  • Design comprehensive telemetry architectures for distributed and microservices-based systems.
  • Establish best practices and guidelines for metrics, logging, and tracing.
  • Evaluate, select, and standardize observability tools and platforms.
  • Create reference architectures for instrumentation across multiple technology stacks.
  • Partner with engineering teams to define SLIs, SLOs, and error budgets.

Instrumentation & Telemetry Engineering

  • Instrument applications and services with metrics, logs, and distributed traces.
  • Implement end-to-end distributed tracing across microservices architectures.
  • Deploy and configure telemetry agents, sidecars, and collectors.
  • Implement OpenTelemetry standards, SDKs, and Collector pipelines.
  • Build custom instrumentation libraries and SDKs across multiple languages.
  • Create auto-instrumentation frameworks to reduce developer effort.
  • Ensure semantic consistency and data quality across all telemetry signals.

Observability Platforms & Tooling

  • Deploy, manage, and optimize metrics platforms such as:
    • Prometheus, Grafana, Datadog, New Relic, Dynatrace, AppDynamics
    • Cloud-native platforms (AWS CloudWatch, Azure Monitor, GCP Monitoring)
    • Long-term storage solutions (Thanos, Mimir, VictoriaMetrics)
  • Deploy and manage logging platforms including:
    • ELK Stack, Splunk, Loki, Fluentd, Sumo Logic
    • Cloud-native logging (CloudWatch Logs, Azure Log Analytics, GCP Logging)
  • Deploy and manage distributed tracing tools such as:
    • Jaeger, Zipkin, Datadog APM, New Relic APM, Dynatrace, Lightstep
  • Optimize observability platforms for performance, scalability, and cost.

Dashboards, Alerting & Incident Enablement

  • Design and build comprehensive dashboards:
    • Service-level dashboards with Golden Signals (latency, traffic, errors, saturation)
    • Executive dashboards for SLO compliance and business KPIs
    • Real-time operational and on-call dashboards
  • Design intelligent alerting strategies to reduce alert fatigue.
  • Implement multi-signal alert correlation, anomaly detection, and adaptive thresholds.
  • Integrate with incident management tools (PagerDuty, Opsgenie, VictorOps).
  • Configure alert routing, escalation policies, suppression, and maintenance windows.
  • Enable self-healing automation triggered by alerts.

Logging & Trace Engineering

  • Design and implement centralized logging architectures.
  • Build log ingestion, parsing, enrichment, and normalization pipelines.
  • Define structured logging standards (JSON, key-value).
  • Implement log sampling and retention strategies for high-volume systems.
  • Create log-based metrics and alerts.
  • Ensure data privacy, compliance, and retention policies are enforced.
  • Implement trace sampling strategies to balance cost and visibility.

Performance Analysis & Optimization

  • Conduct deep-dive performance investigations using telemetry data.
  • Identify bottlenecks, latency contributors, and error propagation paths.
  • Build capacity planning models using observability data.
  • Analyze resource utilization (CPU, memory, disk, network).
  • Create cost attribution and optimization insights from telemetry.
  • Map service dependencies and request flows across distributed systems.

Telemetry Pipelines & Cost Optimization

  • Build and optimize telemetry data pipelines (filtering, routing, transformation).
  • Manage cardinality, storage costs, and data volumes effectively.
  • Implement sampling, aggregation, and retention strategies.
  • Ensure high data quality and completeness.
  • Build export pipelines for analytics, compliance, and archival use cases.

Enablement, Automation & DevEx

  • Build self-service observability frameworks and tooling.
  • Integrate observability into CI/CD pipelines (Observability-as-Code).
  • Automate dashboard and alert provisioning.
  • Develop APIs, plugins, and extensions for observability platforms.
  • Create documentation, tutorials, templates, and best-practice guides.
  • Conduct training sessions and provide observability consulting to teams.
  • Participate in code reviews to validate instrumentation quality.

Required Skills & Experience

Core Observability Expertise

  • Strong understanding of metrics types (counters, gauges, histograms, summaries).
  • Deep expertise in PromQL and time-series data modeling.
  • Strong knowledge of logging pipelines, parsing (Grok/Regex/JSON), and SPL.
  • Deep understanding of distributed tracing concepts, context propagation, and sampling.
  • Hands-on experience with OpenTelemetry specifications and implementations.

Programming & Platforms

  • Strong proficiency in Python, Go, Java, or Node.js.
  • Ability to instrument and read code across multiple languages.
  • Experience building custom instrumentation libraries and APIs.
  • Familiarity with Kafka, Fluentd, Logstash, or similar data pipelines.
  • Experience with AWS, Azure, or GCP environments.
  • Strong understanding of Kubernetes and container observability.

Professional Experience

  • 6–9 years of experience in observability, SRE, platform engineering, or performance engineering.
  • Proven experience building observability platforms at scale.
  • Experience managing high-cardinality data and observability cost optimization.
  • Strong troubleshooting background in complex distributed systems.

Soft Skills & Mindset

  • Strong analytical and problem-solving skills.
  • Ability to explain complex observability concepts to engineers and leadership.
  • Empathy for developer experience and operational pain points.
  • Strong documentation, training, and enablement capabilities.
  • High attention to detail and data quality.
  • Curiosity-driven mindset with passion for system internals and reliability.

Certifications (Preferred)

  • Prometheus Certified Associate (PCA)
  • Datadog / New Relic Observability Certifications
  • AWS / Azure / GCP Observability Certifications
  • Certified Kubernetes Administrator (CKA)
  • OpenTelemetry certifications (when available)

Nice-to-Have Experience

  • Real User Monitoring (RUM) and frontend observability.
  • Continuous profiling (Pyroscope, Google Cloud Profiler).
  • Chaos engineering and observability correlation.
  • ML-driven anomaly detection and predictive analytics.
  • FinOps and observability cost optimization.
  • eBPF-based observability tools (Pixie, Cilium).
  • Contributions to open-source observability projects.

Education

  • Bachelor’s degree in Computer Science, Engineering, or a related field.

Net Connect

Software Development

Schinnen Amsterdam
