Site Reliability Engineer - Datadog Observability

8 - 13 years

12 - 20 Lacs

Posted: 4 days ago | Platform: Naukri

Work Mode

Remote

Job Type

Full Time

Job Description

Key Responsibilities

SRE Implementation & Reliability Engineering

  • Drive end-to-end SRE strategy and implementation, ensuring systems meet reliability, scalability, and performance objectives.
  • Establish and enforce SRE best practices including SLIs, SLOs, SLAs, error budgets, incident response processes, and postmortems.
  • Lead efforts to automate repetitive operational tasks and improve engineering efficiency.

Datadog Observability Engineering

  • Architect, configure, and manage Datadog Observability components including:
    • Dashboards & visualizations
    • APM (Application Performance Monitoring)
    • Log Management & Log Pipelines
    • Synthetic Monitoring
    • Metrics & Traces
    • Alerts & Monitors
  • Configure and optimize Datadog for proactive outage prevention and early detection of infrastructure or application issues.
  • Use Datadog Roles API to create and manage:
    • User roles
    • Team-based access permissions
    • Security and governance controls
  • Provide leadership in designing observability standards, governance, and best practices across teams.

Operational Excellence & Incident Management

  • Collaborate closely with engineering, product, and business teams to:
    • Identify observability gaps
    • Design and implement monitoring solutions
    • Improve system reliability and operational readiness
  • Lead incident triage and root-cause analysis, and ensure timely recovery of services.
  • Support critical Financial Month-End, Quarter-End, and Year-End closure operations.

Automation & Tooling

  • Build automation for:
    • Alerting workflows
    • Ticket creation
    • Incident response orchestration
    • Monitoring configuration at scale
  • Integrate observability tooling with CI/CD pipelines and DevOps processes.
  • Leverage Datadog AI and anomaly detection features to improve proactive monitoring.

Collaboration & Leadership

  • Act as a technical leader for observability, reliability engineering, and monitoring best practices.
  • Partner with cross-functional teams across engineering, cloud, QA, and business functions.
  • Enable teams with training and documentation to adopt observability solutions effectively.
  • Communicate insights, risks, and recommendations to senior leadership in a clear and actionable manner.

Required Skills & Experience

  • 8+ years of experience in Site Reliability Engineering, Observability, or Infrastructure Operations.
  • Minimum 3 years of hands-on experience with Datadog, including:
    • Dashboards
    • APM
    • Alerts/Monitors
    • Log Management
    • Roles API
    • Metric/tracing instrumentation
  • Strong background in SRE fundamentals:
    • Incident management & runbooks
    • On-call operations
    • Automation & reliability metrics
    • Postmortem documentation
  • Ability to work effectively in fast-paced, large-scale production support environments.
  • Strong analytical, problem-solving, and troubleshooting skills.
  • Excellent communication and stakeholder management, with proven ability to work with business, IT, and engineering teams.

Preferred Qualifications

  • Datadog certification or certification in similar observability platforms (New Relic, Dynatrace, Splunk, etc.).
  • Experience with AWS, Azure, or OCI cloud environments.
  • Familiarity with CI/CD pipelines and automation frameworks (Jenkins, GitHub Actions, Azure DevOps, etc.).
  • Exposure to ITIL processes, production support frameworks, and enterprise monitoring standards.

CIEL HR

Human Resources

Noida
