Job Description
As a Senior Observability Engineer, you will play a crucial role in leading the design, development, and maintenance of observability solutions across our infrastructure, applications, and services. Your primary responsibility will be to implement cutting-edge monitoring, logging, and tracing solutions to ensure the reliability, performance, and availability of our complex, distributed systems. Collaboration with cross-functional teams, including Development, Infrastructure Engineers, DevOps, and SREs, will be essential to optimize system observability and enhance our incident response capabilities.

Key Responsibilities:
- Lead the Design & Implementation of observability solutions for cloud and on-premises environments, encompassing monitoring, logging, and tracing.
- Drive the Development and maintenance of advanced monitoring tools such as Prometheus, Grafana, Datadog, New Relic, and AppDynamics.
- Implement Distributed Tracing frameworks like OpenTelemetry, Jaeger, or Zipkin to enhance application performance diagnostics and troubleshooting.
- Optimize Log Management and analysis strategies using tools like Elasticsearch, Splunk, Loki, and Fluentd for efficient log processing and insights.
- Develop Advanced Alerting and anomaly detection strategies to proactively identify system issues and improve Mean Time to Recovery (MTTR).
- Collaborate with Development & SRE Teams to enhance observability in CI/CD pipelines, microservices architectures, and various platform environments.
- Automate Observability Tasks by leveraging scripting languages such as Python, Bash, or Golang to increase efficiency and scale observability operations.
- Ensure Scalability & Efficiency of monitoring solutions to manage large-scale distributed systems and meet evolving business requirements.
- Lead Incident Response by providing actionable insights through observability data for effective troubleshooting and root cause analysis.
- Stay Abreast of Industry Trends in observability, Site Reliability Engineering (SRE), and monitoring practices to continuously improve processes.

Required Qualifications:
- 5+ years of hands-on experience in observability, SRE, DevOps, or related fields, with a proven track record in managing complex, large-scale distributed systems.
- Expert-level proficiency in observability tools such as Prometheus, Grafana, Datadog, New Relic, AppDynamics, and the ability to design and implement these solutions at scale.
- Advanced experience with log management platforms like Elasticsearch, Splunk, Loki, and Fluentd, optimizing log aggregation and analysis for performance insights.
- Deep expertise in distributed tracing tools like OpenTelemetry, Jaeger, or Zipkin, focusing on performance optimization and root cause analysis.
- Extensive experience with cloud environments (Azure, AWS, GCP) and Kubernetes for deploying and managing observability solutions in cloud-native infrastructures.
- Advanced proficiency in scripting languages like Python, Bash, or Golang, and experience with Infrastructure as Code (IaC) tools such as Terraform and Ansible.
- Strong understanding of system architecture, performance tuning, and troubleshooting production environments with scalability and high availability in mind.
- Proven leadership experience and the ability to mentor teams, provide technical direction, and drive best practices for observability and monitoring.
- Excellent problem-solving skills, emphasizing actionable insights and data-driven decision-making.
- Ability to lead high-impact projects, communicate effectively with stakeholders, and influence cross-functional teams.
- Strong communication and collaboration skills, working closely with engineering teams, leadership, and external partners to achieve observability and system reliability goals.

Preferred Qualifications:
- Experience with AI-driven observability tools and anomaly detection techniques.
- Familiarity with microservices, serverless architectures, and event-driven systems.
- Proven track record of handling on-call rotations and incident management workflows in high-availability environments.
- Relevant certifications in observability tools, cloud platforms, or SRE best practices are advantageous.