Some of the things you'll be doing:
- System Reliability & Automation:
- Design, build, and maintain efficient and scalable systems through automation, reducing manual work and "toil."
- Develop and maintain CI/CD pipelines to ensure consistent, reliable, and fast software delivery.
- Plan, design, and execute configuration changes and rollouts at both the application and infrastructure levels.
- Monitoring & Observability:
- Define, measure, and report on key Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
- Implement comprehensive monitoring, logging, and alerting solutions that focus on symptoms rather than causes.
- Utilize error budgets to balance the pace of feature development with system stability.
- Incident Response & Management:
- Participate in an on-call rotation to respond to, troubleshoot, and mitigate production incidents and alerts.
- Conduct blameless post-mortem/Root Cause Analysis (RCA) reviews to identify the causes of incidents and implement preventive measures.
- Capacity Planning & Performance:
- Proactively monitor system performance, identify bottlenecks, and drive optimization efforts.
- Perform capacity planning to ensure the platform can scale to meet future user and traffic demands.
- Collaboration & Mentorship:
- Collaborate closely with development (Dev) teams to integrate operational and reliability best practices into the entire software development lifecycle (SDLC).
- Document systems, processes, and "runbooks" to share knowledge and facilitate smooth operations.
What technical skills, experience, and qualifications do you need?
- Bachelor's degree in Computer Science, Engineering, or a related technical field, with 5+ years of experience.
- Proficiency in at least one scripting or programming language (e.g., Python, Java, Bash).
- Experience with configuration management and infrastructure-as-code tools (e.g., Terraform, Ansible, Chef, Puppet).
- Solid understanding of and experience with cloud computing platforms (e.g., AWS, Azure, OCI).
- Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Familiarity with monitoring and alerting tools (e.g., Elastic, Grafana, Splunk, Nagios).
- Strong knowledge of Linux operating systems, networking, and distributed systems.
- Previous experience in an SRE, DevOps, or highly automated Systems Engineering role.
- Experience with large-scale data systems or database administration.
- Demonstrated ability to debug and optimize code and automate infrastructure.
- Excellent written and verbal communication skills, including the ability to explain complex technical concepts to non-technical audiences.