Site Reliability Engineer

5 years

15 - 18 Lacs

Posted: 1 day ago | Platform: Glassdoor

Work Mode: On-site

Job Type: Full Time

Job Description

Site Reliability Engineer

We are building scalable, reliable, and high-performance cloud-native applications on Microsoft Azure. We are seeking a talented and passionate Site Reliability Engineer (SRE) to join our team, focusing on establishing robust observability with OpenTelemetry and driving operational excellence across our Azure infrastructure.

Role Overview:
As an SRE with OpenTelemetry and Azure expertise, you will play a critical role in ensuring the availability, performance, and scalability of our production systems. You will be responsible for designing, implementing, and maintaining our observability stack using OpenTelemetry standards, integrating it seamlessly with Azure services, and applying SRE principles to build resilient and efficient systems. You will work closely with development teams to embed reliability from the ground up, automate operational tasks, and respond to incidents with speed and precision.

Requirements

Key Responsibilities:

OTEL Monitoring Setup & Observability:

  • Design, implement, and manage a comprehensive observability platform using OpenTelemetry for distributed tracing, metrics, and logs across our microservices and applications.
  • Ensure full instrumentation of applications (e.g., Java, Python, Node.js) to capture end-to-end telemetry data.
  • Configure and optimize OpenTelemetry Collectors to receive, process, and export telemetry data to various backends (e.g., Prometheus, Grafana, Application Insights, Jaeger, Loki, Tempo, and Azure Monitor).
  • Develop custom instrumentation and semantic conventions to enhance monitoring capabilities and provide deeper insights into application behavior.
  • Establish robust alerting and anomaly detection based on OpenTelemetry signals, using tools such as Azure Monitor, Prometheus Alertmanager, or similar.
  • Create informative and actionable dashboards (e.g., Grafana, Azure Dashboards) for real-time system insights, performance monitoring, and incident response.
  • Continuously evaluate and integrate new OpenTelemetry features and best practices to improve our observability posture.

Azure SRE Capabilities:

  • Reliability & Performance Engineering: Monitor system performance, reliability, and availability metrics across Azure services. Identify bottlenecks, anticipate scaling needs, and implement strategies to reduce downtime and improve performance.
  • Incident Management & Response: Participate in on-call rotations, lead incident response efforts, conduct thorough root cause analysis (RCA), and implement preventative measures to minimize recurrence. Develop and maintain runbooks and playbooks for effective incident resolution.
  • Automation & Infrastructure as Code (IaC): Automate repetitive operational tasks, deployments, and infrastructure provisioning using Azure DevOps, Terraform, Azure Bicep, PowerShell, or Bash scripting.
  • CI/CD Integration: Integrate observability checks and validation steps into CI/CD pipelines to ensure the reliability and performance of new releases.
  • Capacity Planning & Cost Optimization: Conduct capacity planning, analyze usage patterns, and optimize Azure resources for cost efficiency, performance, and scalability.
  • Security & Compliance: Implement and enforce security best practices within Azure environments, collaborate with security teams, and ensure adherence to relevant compliance standards.
  • Collaboration & Mentorship: Work closely with development teams to foster a culture of reliability, provide guidance on observability best practices, and share knowledge across the organization.

Required Skills and Experience:

  • 5+ years of experience in Site Reliability Engineering (SRE), DevOps, or a similar infrastructure-focused role.
  • Deep practical experience with OpenTelemetry (OTEL) for instrumenting, collecting, processing, and exporting traces, metrics, and logs.
  • Strong proficiency in Azure cloud services and their monitoring capabilities (Azure Monitor, Log Analytics, Application Insights).
  • Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform, Azure Bicep, or ARM templates.
  • Solid scripting and automation skills (e.g., Python, PowerShell, Bash).
  • Experience with containerization technologies (Docker) and orchestration platforms (Kubernetes/AKS).
  • Expertise with observability tools and backends such as Grafana, Alloy, Loki, Tempo, Prometheus, and Jaeger.
  • Strong understanding of distributed systems, microservices architectures, and cloud-native principles.
  • Excellent problem-solving, analytical, and troubleshooting skills.
  • Strong communication and collaboration abilities.

Preferred Qualifications:

  • Azure certifications (e.g., AZ-104 Azure Administrator, AZ-400 Azure DevOps Engineer Expert).
  • Experience with chaos engineering practices.
  • Understanding of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
  • Familiarity with database monitoring (e.g., PostgreSQL, Azure SQL).
  • Experience in a high-availability, regulated, or customer-facing environment.

Education:

  • Bachelor's degree in Computer Science, Information Technology, or a related technical field, or equivalent practical experience.

Job Type: Full-time

Pay: ₹130,000.00 - ₹150,000.00 per month

Experience:

  • Site Reliability Engineering: 7 years (Required)
  • DevOps: 6 years (Required)
  • OpenTelemetry: 5 years (Required)
  • Azure cloud services: 6 years (Required)
  • Orchestration platforms (Kubernetes/AKS): 5 years (Required)

Work Location: In person
