entomo is an Equal Opportunity Employer. The Company promotes and supports a diverse workforce at all levels. The Company ensures that its associates, potential hires, third-party support staff, and suppliers are not discriminated against, directly or indirectly, on the basis of colour, creed, caste, race, nationality, ethnicity or national origin, marital status, pregnancy, age, disability, religion or similar philosophical belief, sexual orientation, gender or gender reassignment, or any similar characteristic.
Summary:
We are seeking a skilled Site Reliability Engineer (SRE) to join our team. In this role, you will be responsible for bridging the gap between development and operations by applying software engineering principles to infrastructure and operations tasks. Your primary focus will be ensuring the reliability, availability, performance, and scalability of our production systems while minimizing manual operational work through automation and enhancing system resilience.
Position Overview:
The Site Reliability Engineer will work closely with development and operations teams to design, implement, and maintain highly reliable systems. You will be instrumental in establishing best practices for observability, incident response, and infrastructure management. Your expertise will help reduce operational overhead, improve system performance, and ensure seamless deployments through CI/CD pipelines.
Qualifications:
Required:
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- 3+ years of experience in SRE, DevOps, or similar roles
- Strong proficiency with Kubernetes (K8s) and Docker containerization
- Experience with the ELK stack (Elasticsearch, Logstash, Kibana) for logging and monitoring
- Good to have: understanding of Java programming and experience troubleshooting Java applications
- Working knowledge of SQL and MongoDB databases
- Familiarity with Angular for frontend monitoring and diagnostic tooling
- Strong understanding of system architecture, cloud infrastructure, and networking
- Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible)
- Demonstrated experience with monitoring and observability platforms
- Excellent problem-solving skills and ability to troubleshoot complex systems
- Outstanding verbal and written communication skills
Preferred Skills:
- Must have: experience with the AWS public cloud platform. Good to have: knowledge of and experience with Azure or GCP.
- Knowledge of CI/CD tools (Jenkins, GitLab CI, GitHub Actions)
- Familiarity with service mesh technologies (e.g., Istio)
- Experience with scripting languages (Python, Bash)
- Understanding of distributed systems and microservices architecture
- Experience implementing SLOs, SLIs, and SLAs
- Knowledge of security best practices
- Certification in relevant technologies (CKA, AWS, etc.)
Roles and Responsibilities:
System
- Design, implement, and maintain highly available and scalable infrastructure
- Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets
- Conduct capacity planning and performance optimization for critical systems
- Implement strategies to improve system resilience and fault tolerance
- Perform regular system health checks and proactive maintenance
Monitoring and Observability
- Deploy and maintain comprehensive monitoring solutions using the ELK stack and other tools
- Create and refine dashboards for system metrics, logs, and application performance
- Set up effective alerting systems with appropriate thresholds to minimize alert fatigue
- Implement distributed tracing to understand system behavior and identify bottlenecks
- Ensure proper logging and telemetry across all services
Incident Management and Response
- Lead incident response efforts, including troubleshooting, mitigation, and resolution
- Conduct thorough post-incident reviews to identify root causes and preventive measures
- Document incidents, resolutions, and knowledge for future reference
- Develop and maintain runbooks for common operational procedures
- Participate in the on-call rotation to provide 24/7 coverage for critical systems
Automation and Toil Reduction
- Identify and eliminate toil through systematic automation
- Develop automated solutions for recurring operational tasks
- Implement Infrastructure as Code (IaC) practices for consistent environment provisioning
- Create self-service tools for developers to reduce operational dependencies
- Automate testing and deployment processes for improved efficiency
CI/CD Pipeline Management
- Design and maintain reliable CI/CD pipelines for continuous deployment
- Implement automated testing within deployment workflows
- Ensure smooth and reliable deployment processes with minimal disruption
- Develop strategies for canary deployments and feature flagging
- Create rollback mechanisms for quick recovery from failed deployments
Infrastructure Management
- Manage Kubernetes clusters and containerized applications
- Oversee configuration management and version control for infrastructure
- Implement security best practices and compliance requirements
- Optimize resource utilization and cost efficiency
- Ensure proper backup and disaster recovery procedures
Collaboration and Knowledge Sharing
- Work closely with development teams to improve application reliability
- Provide guidance on architectural decisions from a reliability perspective
- Conduct regular knowledge sharing sessions and documentation updates
- Train team members on SRE practices and tools
- Contribute to the development of SRE culture across the organization
Working Environment
- Collaborative team environment focused on continuous improvement
- Opportunity to work with cutting-edge technologies and solve complex problems
- Balance of project work and operational responsibilities
- Culture that values automation, innovation, and reliability
- Emphasis on learning and professional development
Success Metrics
- Improvement in system availability and reliability metrics
- Reduction in mean time to detect (MTTD) and mean time to resolve (MTTR) incidents
- Decreased frequency of production incidents and outages
- Increased automation coverage and reduced manual operational work
- Successful implementation of SLOs and monitoring systems
- Positive feedback from development teams on collaboration and support