As a Senior Engineer, you will be a key individual contributor within our Production Operations team. You will be instrumental in designing, building, and maintaining highly reliable, scalable, and performant cloud infrastructure and systems that support Greenlight's mission-critical services. This role is for a seasoned engineer who thrives on solving complex operational challenges, enhancing system stability, and improving efficiency through automation and best practices.
What you will be doing:
- Contribute to the design, implementation, and maintenance of Greenlight's core cloud infrastructure and Site Reliability Engineering (SRE) practices to ensure high availability, scalability, and performance.
- Develop, maintain, and optimize our cloud infrastructure using Infrastructure as Code (primarily Terraform) and other automation tools.
- Collaborate closely with development and security teams to embed SRE principles into the software development lifecycle, promoting secure and reliable coding practices.
- Design and implement robust monitoring, logging, and alerting solutions to provide comprehensive visibility into system health.
- Actively participate in and support incident response, perform deep-dive root cause analysis, and contribute to actionable, blameless postmortems to prevent recurrence.
- Identify and implement architectural improvements to enhance system reliability, resilience, and operational efficiency.
- Automate operational tasks and processes to reduce toil and improve efficiency.
- Research, evaluate, and advocate for new technologies and tools that can improve our operational posture and efficiency.
- Enhance existing services and applications to increase availability, reliability, and scalability in a microservices environment.
- Build and improve engineering tooling, processes, and standards to enable faster, more consistent, more reliable, and highly repeatable application delivery.
What you'll bring to the team:
- 5+ years of experience in a Site Reliability Engineering, Production Operations, or similar role, with a strong focus on cloud infrastructure and distributed systems.
- Proven experience architecting, building, and maintaining highly available, secure, and scalable systems in a public cloud environment (AWS strongly preferred).
- Strong proficiency with IaC tools, particularly Terraform.
- Demonstrated experience in automating operational tasks using scripting languages (e.g., Python, Go, Bash) and automation platforms.
- Expertise in designing and implementing comprehensive monitoring, logging, and alerting solutions (e.g., Datadog, Prometheus, Grafana, ELK stack).
- Solid understanding of incident response best practices, with experience in troubleshooting and resolving complex production issues.
- Strong understanding of distributed systems, microservices architectures, and containerization technologies (Docker, Kubernetes/EKS).
- Exceptional analytical and problem-solving skills, with a track record of debugging complex issues in production environments.
- Excellent communication, collaboration, and interpersonal skills, including the ability to clearly articulate technical concepts to both technical and non-technical audiences.
- A passion for identifying and implementing improvements in system reliability, performance, and operational efficiency.
Technologies we use:
- AWS
- MySQL, DynamoDB, Redis
- GitHub Actions for CI pipelines
- Kubernetes (specifically EKS)
- Ambassador, Helm, Argo CD, Linkerd
- REST, gRPC, GraphQL
- React, Redux, Swift, Node.js, Kotlin, Java, Go, Python
- Datadog, Prometheus