On-site
Full Time
Job Description – Senior Site Reliability Engineer (SRE) – Grafana & Observability
Position: Senior Site Reliability Engineer – Grafana & Observability
Location: Hyderabad / Hybrid
Experience: 10–20+ years
Operating globally, Aptimized is a premium ERP, HCM, and Technology Optimization Consulting agency. Our team focuses on helping customers become intelligent enterprises by applying creative technology solutions. We prioritize our clients’ needs and create tailor-made solutions to deliver success. We understand that success is not achieved by chance. We listen to your concerns. We consult with your organization. We accelerate your business. Visit us at our website to learn more about what we can do for you!
We are looking for a highly skilled Senior Site Reliability Engineer (SRE) with deep hands-on experience in the Grafana ecosystem, observability engineering, and large-scale monitoring platforms.
The ideal candidate will be an expert in building and managing Grafana dashboards, Managed Grafana, Prometheus monitoring, OpenTelemetry pipelines, and integrating multiple data sources across cloud and on-prem infrastructures.
This role focuses heavily on building real-time observability, improving system reliability, and enabling data-driven operational insights.
Key Responsibilities
Grafana Engineering & Dashboard Development
Build advanced Grafana dashboards with alerts, custom panels, JSON models, and data visualizations.
Work with Managed Grafana (Azure Managed Grafana / Amazon Managed Grafana) for enterprise-grade observability.
Integrate Grafana with multiple data sources such as:
Prometheus
ELK / Elasticsearch
Dynatrace
CloudWatch
Azure Monitor
InfluxDB / Telegraf
ServiceNow (incident integrations)
Develop role-based access control (RBAC) and multi-tenant dashboard architectures.
Prometheus, Metrics & Alerting
Architect and manage Prometheus metrics pipelines, exporters, and recording/alerting rules.
Optimize PromQL queries for high-performance dashboards.
Reduce alert noise through intelligent rule tuning and SLO-driven alerts.
Observability Platform Ownership
Build and maintain an end-to-end observability stack:
Grafana + Prometheus + ELK + OpenTelemetry + Cloud-native monitoring tools.
Integrate logs, metrics, and traces into unified dashboards.
Establish SLIs, SLOs, error budgets, and real-time reliability insights.
Kubernetes & Cloud Monitoring
Deploy and monitor Kubernetes clusters (AKS, EKS, Rancher).
Configure Grafana Alloy / Prometheus Operator / kube-state-metrics for cluster-level insights.
Implement Infrastructure-as-Code for observability stack deployments.
Automation & Infrastructure as Code
Automate monitoring agent deployments using:
Terraform
Azure DevOps / GitHub / GitLab
FluxCD, Kustomize, Helm
Develop monitoring-as-code for repeatable environment provisioning.
Incident Response & Performance Troubleshooting
Provide deep troubleshooting across infrastructure, network, applications, and microservices.
Build automated dashboards for war rooms and cross-team collaboration.
Leverage Grafana annotations, synthetic monitoring, and event correlation.
Security, Compliance & Governance
Implement secure access to metric and log dashboards using IAM, RBAC, and ABAC.
Configure audit logs, long-term retention, and secure storage pipelines.
(Optional: FedRAMP/NIST experience is beneficial for regulated workloads.)
Required Skills & Expertise
Grafana & Observability (Primary)
Expert in Grafana dashboard engineering
Prometheus + Alertmanager
Managed Grafana (Azure/AWS)
ELK Stack (Elasticsearch, Logstash, Kibana)
OpenTelemetry (OTEL) metrics & traces
Grafana Alloy, Loki (Bonus)
Cloud Platforms
Azure, AWS, IBM Cloud (Nice-to-have)
CloudWatch, Azure Monitor, App Insights
Containers & Infrastructure
Kubernetes (AKS, EKS)
Docker, Rancher, OpenShift
Linux (RHEL/CentOS)
DevOps & Automation
Terraform, Helm, Kustomize
Git, CI/CD pipelines
Scripting (Python, Bash, PowerShell)
Monitoring Ecosystem
Experience with additional tools is a plus:
Dynatrace
Splunk
Sysdig
AppDynamics
SolarWinds
Moogsoft AIOps
Preferred Qualifications
Strong background in SRE, Observability Engineering, DevOps, or Platform Engineering.
Experience with microservices, distributed systems, and cloud-native architectures.
ITIL v3 or industry certifications in AWS/Azure/Kubernetes are a plus.
Education
Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
Certifications in cloud, observability, Grafana, or Kubernetes are an advantage.