As a Senior Software DevOps Engineer, you will lead the design, implementation, and evolution of telemetry pipelines and DevOps automation that enable next-generation observability for distributed systems. You will blend a deep understanding of OpenTelemetry architecture with strong DevOps practices to build a reliable, high-performance, self-service observability platform across hybrid cloud environments (AWS and Azure). Your mission: empower engineering teams with actionable insights through rich metrics, logs, and traces, while championing automation and innovation at every layer.
WHAT YOU WILL BE DOING
Observability Strategy & Implementation
- Architect and manage scalable observability solutions using OpenTelemetry (OTel), encompassing:
- Collectors: Design and deploy OTel Collectors (agent/gateway modes) for ingesting and exporting telemetry across services.
- Instrumentation: Guide teams on auto/manual instrumentation for services (metrics, traces, and logs).
- Export Pipelines: Build telemetry pipelines to route data to backends like Grafana, Prometheus, Loki, New Relic, and Azure Monitor.
- Processors & Extensions: Leverage OTel processors (batching, filtering, resource detection) and extensions for advanced enrichment and routing.
DevOps Automation & Platform Reliability
- Own the CI/CD experience using GitLab Pipelines, integrating infrastructure automation with Terraform, Docker, and scripting in Bash and Python.
- Build resilient and reusable infrastructure-as-code modules across AWS and Azure ecosystems.
- Manage containerized workloads, registries, secrets, and secure cloud-native deployments following best practices.
Cloud-Native Enablement
- Develop observability blueprints for cloud-native apps across AWS (ECS, EC2, VPC, IAM, CloudWatch) and Azure (AKS, App Services, Monitor).
- Optimize cost and performance of telemetry pipelines while ensuring SLA/SLO adherence for observability services.
Monitoring, Dashboards, and Alerting
- Build and maintain intuitive, role-based dashboards in tools such as Grafana and New Relic, enabling real-time visibility into service health, business KPIs, and SLOs.
- Implement alerting best practices (noise reduction, deduplication, alert grouping) integrated with incident management systems.
Innovation & Technical Leadership
- Drive cross-team observability initiatives that reduce MTTR and elevate engineering velocity.
- Champion innovation projects including self-service observability onboarding, log/metric reduction strategies, AI-assisted root cause detection, and more.
- Mentor engineering teams on instrumentation, telemetry standards, and operational excellence.
WHAT YOU BRING
- 10+ years of experience in DevOps, Site Reliability Engineering, or Observability roles.
- Deep expertise with OpenTelemetry, including Collector configurations, receivers/exporters (OTLP, HTTP, Prometheus, Loki), and semantic conventions.
- Proficient in GitLab CI/CD, Terraform, Docker, and scripting (Python, Bash, Go).
- Strong hands-on experience with AWS and Azure services, cloud automation, and cost optimization.
- Proficiency with observability backends: Grafana, New Relic, Prometheus, Loki, or equivalent APM/log platforms.
- Passion for building automated, resilient, and scalable telemetry pipelines.
- Excellent documentation and communication skills to drive adoption and influence engineering culture.
NICE TO HAVE
- Certifications in AWS, Azure, or Terraform.
- Experience with OpenTelemetry SDKs in Go, Java, or Node.js.
- Familiarity with SLO management, error budgets, and observability-as-code approaches.
- Exposure to event streaming (Kafka, RabbitMQ), Elasticsearch, Vault, and Consul.