Role: Observability Engineer
Job Description:
Senior Platform Engineer
We are seeking a highly experienced and driven Senior Observability Engineer to lead the design, development, and maintenance of observability solutions across our infrastructure, applications, and services. As a Senior Observability Engineer, you will be at the forefront of implementing cutting-edge monitoring, logging, and tracing solutions that ensure the reliability, performance, and availability of our complex, distributed systems. You will collaborate with cross-functional teams, including developers, infrastructure engineers, DevOps engineers, and SREs, to optimize system observability and improve our incident response capabilities.
Key Responsibilities
- Lead the Design & Implementation of observability solutions, including monitoring, logging, and tracing for both cloud and on-premises environments.
- Drive the Development and maintenance of advanced monitoring tools such as Prometheus, Grafana, Datadog, New Relic, and AppDynamics.
- Implement Distributed Tracing frameworks like OpenTelemetry, Jaeger, or Zipkin, and enhance application performance diagnostics and troubleshooting.
- Optimize Log Management and analysis strategies using tools like Elasticsearch, Splunk, Loki, and Fluentd, ensuring efficient log processing and insights.
- Develop Advanced Alerting and anomaly detection strategies to proactively identify system issues, minimizing downtime and reducing Mean Time to Recovery (MTTR).
- Collaborate with Development & SRE Teams to enhance observability in CI/CD pipelines, microservices architectures, and across various platform environments.
- Automate Observability Tasks by leveraging scripting and programming languages such as Python, Bash, or Go to increase efficiency and scale observability operations.
- Ensure Scalability & Efficiency of monitoring solutions to manage large-scale distributed systems and handle evolving business requirements.
- Lead Incident Response by providing actionable insights through observability data for effective troubleshooting and root cause analysis.
- Stay Abreast of Industry Trends in observability, Site Reliability Engineering (SRE), and monitoring practices, continuously improving processes.
Required Qualifications
- 5+ years of hands-on experience in observability, SRE, DevOps, or a related field, with a proven track record of successfully managing complex, large-scale distributed systems.
- Expert-level proficiency in observability tools such as Prometheus, Grafana, Datadog, New Relic, and AppDynamics, with the ability to lead the design and implementation of these solutions at scale.
- Advanced experience with log management platforms like Elasticsearch, Splunk, Loki, and Fluentd, and the ability to optimize log aggregation and analysis for better performance insights.
- Deep expertise in distributed tracing tools such as OpenTelemetry, Jaeger, or Zipkin, with a focus on performance optimization and root cause analysis.
- Extensive experience with cloud environments (preferably Azure, AWS, or GCP) and Kubernetes for deploying and managing observability solutions across modern, cloud-native infrastructures.
- Advanced proficiency in scripting and programming languages such as Python, Bash, or Go, and strong experience with Infrastructure as Code (IaC) tools like Terraform and Ansible.
- Strong understanding of system architecture, performance tuning, and troubleshooting complex production environments, with an emphasis on scalability and high availability.
- Proven experience in leading and mentoring teams, providing technical direction, and driving the adoption of best practices for observability and monitoring.
- Exceptional problem-solving skills, with a focus on providing actionable insights and data-driven decision-making.
- Ability to lead high-impact projects, effectively communicate with stakeholders, and influence cross-functional teams.
- Strong communication and collaboration skills; demonstrated ability to work closely with engineering teams, leadership, and external partners to meet observability and system reliability goals.
Preferred Qualifications
- Experience with AI-driven observability tools and anomaly detection techniques.
- Familiarity with microservices, serverless architectures, and event-driven systems.
- Proven track record of handling on-call rotations and incident management workflows in high-availability environments.
- Relevant certifications in observability tools, cloud platforms, or SRE best practices are a plus.