System Reliability & Automation:
- Design, build, and maintain efficient and scalable systems through automation, reducing manual work and "toil."
- Develop and maintain CI/CD pipelines to ensure consistent, reliable, and fast software delivery.
- Plan, design, and execute configuration changes and rollouts at both the application and infrastructure levels.
Monitoring & Observability:
- Define, measure, and report on key Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
- Implement comprehensive monitoring, logging, and alerting solutions, with alerts that fire on user-facing symptoms rather than underlying causes.
- Use error budgets to balance the pace of feature development with system stability.
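For candidates unfamiliar with the term, the arithmetic behind an error budget can be sketched as follows. This is a minimal illustration and not a description of our tooling: the function names, the request-based SLI, and the "pause releases when the budget is spent" policy are all assumptions made for the example.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    Assumes a request-based SLI: the SLO allows a fraction (1 - slo_target)
    of all requests in the window to fail.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def releases_allowed(budget_remaining: float) -> bool:
    """Hypothetical policy: keep shipping features only while budget remains."""
    return budget_remaining > 0.0


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests allows about 1,000 failures;
    # 250 failures leaves roughly three quarters of the budget unspent.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"budget remaining: {remaining:.2%}, releases allowed: {releases_allowed(remaining)}")
```

In practice the threshold and the response to an exhausted budget are team decisions; the point is only that the budget turns an SLO into a number that can gate release pace.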
Incident Response & Management:
- Participate in an on-call rotation to respond to, troubleshoot, and mitigate production incidents and alerts.
- Conduct blameless postmortem and Root Cause Analysis (RCA) reviews to identify the causes of incidents and implement preventive measures.
Capacity Planning & Performance:
- Proactively monitor system performance, identify bottlenecks, and drive optimization efforts.
- Perform capacity planning to ensure the platform can scale to meet future user and traffic demands.
Collaboration & Mentorship:
- Collaborate closely with development (Dev) teams to integrate operational and reliability best practices into the entire software development lifecycle (SDLC).
- Document systems, processes, and "runbooks" to share knowledge and facilitate smooth operations.
Required Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- Proficiency in at least one scripting or programming language (e.g., Python, Java, Bash).
- Experience with configuration management and infrastructure-as-code tools (e.g., Terraform, Ansible, Chef, Puppet).
- Solid understanding of and experience with cloud computing platforms (e.g., AWS, Azure, OCI).
- Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Familiarity with monitoring and alerting tools (e.g., Elastic, Grafana, Splunk, Nagios).
- Strong knowledge of Linux operating systems, networking, and distributed systems.