Job Details
As a senior SRE / Observability Engineer, you will be part of the Atlas Platform Engineering team and will:
Create and maintain observability standards and best practices
Review the current observability platform, identify areas for improvement, and guide the team in enhancing monitoring, logging, tracing, and alerting capabilities.
Expand the observability stack across multiple clouds, regions, and clusters, managing all observability data.
Design and implement monitoring solutions for complex distributed systems that provide deep insight into systems and services, aiming at complete visibility of digital operations.
Support the ongoing evaluation of new capabilities in the observability stack, conducting proofs of concept, pilots, and tests to validate their suitability.
Assist teams in creating clear, informative, and actionable dashboards to improve system visibility.
Automate monitoring and alerting processes, including enrichment strategies and ML-driven anomaly detection where applicable.
Provide technical leadership to the observability team, setting clear priorities and ensuring agreed outcomes are achieved in a timely manner.
Work closely with R&D and product development teams to understand their requirements and challenges and to ensure seamless visibility into system and service performance.
Work closely with the Traffic Management team to identify and standardise on existing and new observability tools as part of a holistic solution
Conduct training sessions and create documentation for internal teams
Support the definition of SLIs (service level indicators) and SLOs (service level objectives) for the Atlas services.
Keep track of the error budget of each service
Participate in the emergency response process
Conduct RCAs (root cause analyses)
Help to automate repetitive tasks and reduce toil.
Qualifications:
People and communication qualifications
Be a strong team player
Have good collaboration and communication skills
Ability to translate technical concepts for non-technical audiences
Problem-solving and analytical thinking
Technical qualifications - general:
Familiarity with cloud platforms (ideally Azure)
Familiarity with Kubernetes and Istio as the architecture on which the observability and Atlas services run, and how they integrate and scale.
Experience with infrastructure as code and automation
Knowledge of common programming languages and debugging techniques
Have a strong technical background and be hands-on.
Proficiency with Linux and with scripting and programming languages (Bash, Python, Go).
Solid understanding of DevOps principles.
Technical qualifications - observability
Strong understanding of observability principles (metrics, logs, traces)
Experience with APM tools and distributed tracing
Proficiency in log aggregation and analysis
Knowledge of and hands-on experience with monitoring, logging, and tracing tools such as Prometheus, Grafana, Datadog, New Relic, Sumo Logic, ELK Stack, or others
Knowledge of OpenTelemetry, including the OpenTelemetry Collector and code instrumentation
Experience designing and building unified observability platforms that enable teams to use data (metrics, logs, and traces) to determine quickly whether their application or service is operating as desired.
Technical qualifications - SRE
Understanding of the Google SRE principles
Experience in defining SLIs and SLOs
Experience in performing RCAs (root cause analyses)
Experience in analysing system performance
Experience in incident response
Knowledge of status page tools, such as Atlassian Statuspage or similar
Knowledge of incident management and paging tools, such as PagerDuty or similar
Knowledge of ITIL (Information Technology Infrastructure Library) processes