Senior Site Reliability Engineer – Grafana & Observability

20 years

0 Lacs

Posted: 2 days ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Job Description – Senior Site Reliability Engineer (SRE) – Grafana & Observability

Position: Senior Site Reliability Engineer – Grafana & Observability

Location: Hyderabad / Hybrid

Experience: 10–20+ years


Aptimized is a premium ERP, HCM, and Technology Optimization consulting agency that operates globally. Our team helps customers become intelligent enterprises by leveraging creative technology solutions. We prioritize our clients’ needs and create tailor-made solutions to deliver success. We understand that success is not achieved by chance. We listen to your concerns. We consult with your organization. We accelerate your business. Visit our website to learn more about what we can do for you!


We are looking for a highly skilled Senior Site Reliability Engineer (SRE) with deep hands-on experience in the Grafana ecosystem, observability engineering, and large-scale monitoring platforms.

The ideal candidate will be an expert in building and managing Grafana dashboards, Managed Grafana, Prometheus monitoring, OpenTelemetry pipelines, and integrating multiple data sources across cloud and on-prem infrastructures.


This role focuses heavily on building real-time observability, improving system reliability, and enabling data-driven operational insights.


Key Responsibilities

Grafana Engineering & Dashboard Development

Build advanced Grafana dashboards with alerts, custom panels, JSON models, and data visualizations.

Work with Managed Grafana (Azure Managed Grafana / Amazon Managed Grafana) for enterprise-grade observability.

Integrate Grafana with multiple data sources such as:

Prometheus

ELK / Elasticsearch

Dynatrace

CloudWatch

Azure Monitor

InfluxDB / Telegraf

ServiceNow (incident integrations)

Develop role-based access control (RBAC) and multi-tenant dashboard architectures.

Prometheus, Metrics & Alerting

Architect and manage Prometheus metrics pipelines, exporters, and recording/alerting rules.

Optimize PromQL queries for high-performance dashboards.

Reduce alert noise through intelligent rule tuning and SLO-driven alerts.

Observability Platform Ownership

Build and maintain an end-to-end observability stack:

Grafana + Prometheus + ELK + OpenTelemetry + Cloud-native monitoring tools.

Integrate logs, metrics, and traces into unified dashboards.

Establish SLIs, SLOs, error budgets, and real-time reliability insights.

Kubernetes & Cloud Monitoring

Deploy and monitor Kubernetes clusters (AKS, EKS, Rancher).

Configure Grafana Alloy / Prometheus Operator / kube-state-metrics for cluster-level insights.

Implement Infrastructure-as-Code for observability stack deployments.

Automation & Infrastructure as Code

Automate monitoring agent deployments using:

Terraform

Azure DevOps / GitHub / GitLab

FluxCD, Kustomize, Helm

Develop monitoring-as-code for repeatable environment provisioning.

Incident Response & Performance Troubleshooting

Provide deep troubleshooting across infrastructure, network, applications, and microservices.

Build automated dashboards for war rooms and cross-team collaboration.

Leverage Grafana annotations, synthetic monitoring, and event correlation.

Security, Compliance & Governance

Implement secure access to metric and log dashboards using IAM, RBAC, and ABAC.

Configure audit logs, long-term retention, and secure storage pipelines.

(Optional: FedRAMP/NIST experience beneficial for regulated workloads.)


Required Skills & Expertise

Grafana & Observability (Primary)

Expert in Grafana dashboard engineering

Prometheus + Alertmanager

Managed Grafana (Azure/AWS)

ELK Stack (Elasticsearch, Logstash, Kibana)

OpenTelemetry (OTEL) metrics & traces

Grafana Alloy, Loki (Bonus)

Cloud Platforms

Azure, AWS, IBM Cloud (Nice-to-have)

CloudWatch, Azure Monitor, App Insights

Containers & Infrastructure

Kubernetes (AKS, EKS)

Docker, Rancher, OpenShift

Linux (RHEL/CentOS)

DevOps & Automation

Terraform, Helm, Kustomize

Git, CI/CD pipelines

Scripting (Python, Bash, PowerShell)

Monitoring Ecosystem

Experience with additional tools is a plus:

Dynatrace

Splunk

Sysdig

AppDynamics

SolarWinds

Moogsoft AIOps

Preferred Qualifications

Strong background in SRE, Observability Engineering, DevOps, or Platform Engineering.

Experience with microservices, distributed systems, and cloud-native architectures.


ITIL v3 or industry certifications in AWS/Azure/Kubernetes are a plus.


Education


Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

Certifications in cloud, observability, Grafana, or Kubernetes are an advantage.
