Site Reliability Engineer

5 - 10 years

12 - 22 Lacs

Posted: 6 days ago | Platform: Naukri

Work Mode

Hybrid

Job Type

Full Time

Job Description

Key Responsibilities

  • Build and scale observability systems:

    Design and maintain infrastructure for collecting, aggregating, and analyzing telemetry data (metrics, logs, and traces).
  • Enable actionable insights:

    Develop dashboards, alerts, and visualizations that turn raw data into clear, meaningful information for engineers, SREs, and business stakeholders.
  • Collaborate across teams:

    Partner with engineering, operations, and SRE teams to define SLIs/SLOs and improve visibility into system performance and health.
  • Drive best practices:

    Advocate for and support consistent instrumentation, effective alerting, and strong observability practices across engineering teams.
  • Optimize systems and tools:

    Continuously assess performance, usage, and cost of observability tools, identifying opportunities for improvement and efficiency.
  • Automate:

    Build tooling and automation that drive the adoption of SRE principles and best practices in everything deployed within the Nexxen environment.
  • Improve:

    In collaboration with engineering teams, develop plans to improve the reliability of applications and infrastructure, and help those teams engineer the improvements.
  • Support incident response:

    Participate in and help improve the incident response process, reducing MTTR and contributing to post-incident reviews and root cause analysis.

What We're Looking For

Technical Skills

  • Programming experience:

    Proficiency in languages such as Go, Python, Java, or Node.js, with the ability to contribute tools and advise on application-level instrumentation improvements.
  • Observability tooling expertise:

    Hands-on experience with the following tools:
      • LGTM (Loki, Grafana, Tempo, Mimir)
      • Datadog
      • CloudWatch
      • Prometheus
      • PagerDuty
      • ClickStack
      • VictoriaMetrics
      • Groundcover
      • Libre
      • Zabbix
  • Cloud experience:

    Experience with AWS and services such as EC2, EKS, ECS, and VPC networking.
  • Containers & orchestration:

    Familiarity with Docker and Kubernetes.
  • Infrastructure as Code & automation:

    Experience with tools like Terraform, Ansible, Chef, or SCCM to manage observability infrastructure at scale.
  • Linux systems knowledge:

    Strong understanding of Linux, shell scripting, and the storage/networking stack.
  • Tracing:

    Deep understanding of tracing technology and OpenTelemetry.
  • SRE practices:

    SLIs, SLOs, error budgets, and failure domains.

Cybage

Information Technology & Services

Pune
