The team comes from diverse technical backgrounds, and the role's responsibilities offer a variety of challenges. Ideal candidates will have a background in either software engineering or systems engineering with a desire to learn the other, or previous experience building and managing monitoring and alerting systems. We are looking for a systems-thinking Principal Engineer who has helped teams scale through production insights, operational automation, observability programs, developer guidance, real-time metrics, and automation, automation, automation!
WHAT YOU WILL BE DOING
- Implement monitoring and alerting systems to guarantee high availability and performance, with a dedicated focus on SLA and availability metrics.
- Collaborate with engineering and operations teams to identify critical components and systems requiring enhanced availability measures.
- Design and implement strategies, tooling, and processes to enhance system uptime and reliability.
- Continuously evaluate and recommend improvements to platform infrastructure and processes, enhancing efficiency and reliability.
- Align the platform with customer needs and business goals by working closely with cross-functional teams.
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Build software and systems to monitor platform infrastructure and applications.
- Monitor and improve reliability, quality, and time-to-market of our suite of software solutions.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
- Provide primary operational support and engineering for multiple large-scale distributed software applications.
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
WHAT YOU BRING
- Bachelor's degree or higher in a technology-related field (e.g., Engineering, Computer Science) required; Master's degree a plus
- 6+ years of professional experience in monitoring and alerting roles on major cloud platforms (AWS, Azure), preferably including project leadership roles.
- 4+ years of experience in cloud development (AWS, Azure) and observability; experience building and operating highly resilient platforms in AWS cloud environments.
- 3+ years of experience in software development with Python, Node.js, or Java, with a focus on SDLC and automation
- Hands-on experience with container orchestration, preferably with Kubernetes
- Hands-on experience building observability, monitoring, and alerting on large-scale distributed systems.
- Leadership/design of application and/or infrastructure migration projects from on-prem to cloud
- Cloud architecture design and implementation to solve key business needs and meet team goals.
- Familiarity with current AWS solutions; Azure experience also considered.
- Containerized workloads (Prefer Helm; Related: AKS & EKS, other K8s distributions, Docker, JFrog)
- Logging and monitoring tools (Prefer: Prometheus, Grafana, Datadog, AWS CloudWatch; Related: Azure Monitor, Log Analytics, Fluentd)
- Network Security (e.g., AWS Policy, Azure Policy, VPN, Active Directory/RBAC, ACLs, NSG rules, private endpoints)
- Proven experience in implementing advanced observability practices and techniques at scale.
- Hands-on experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.)
- Experienced in instrumentation, with systems skills in building and operating monitoring, logging, and alerting services for distributed systems at scale.
- Demonstrated ability to use modern monitoring tools (Datadog, Prometheus, etc.)
- Ability to build a monitoring ecosystem with high-fidelity alerting.
- Ability to automate resolution of alerts.
- Ability to automate with various scripting languages (Python, Golang, Shell scripting, etc.)
- Knowledge of managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
- Solid understanding of Cloud Computing and DevOps concepts.
- Hands-on Kubernetes skills and knowledge.
- Proven experience in maintaining the scalability and resiliency of complex environments.
- Ability to triage, execute root cause analysis, and be decisive under pressure
- Experience managing and interpreting large datasets using query languages and visualization tools
- Proficient communication skills with an ability to reach both technical and non-technical audiences
- Ability to learn new software, methods, and practices and bring them to our developers
- Ability to work with a variety of individuals and groups, both in person and virtually, in a constructive and collaborative manner, and to build and maintain effective relationships