Company Overview
Logile is the leading retail labor planning, workforce management, inventory management and store execution provider deployed in thousands of retail locations across North America, Europe, Australia, and Oceania. Our proven AI, machine-learning technology and industrial engineering accelerate ROI and enable operational excellence with improved performance and empowered employees. Retailers worldwide rely on Logile solutions to boost profitability and competitive advantage by delivering the best service and products at optimal cost.
From labor standards development and modeling to unified forecasting, storewide scheduling, and time and attendance, to inventory management, task management, food safety, and employee self-service, we transform retail operations with a unified store-level solution. Gain the Advantage with The Logic of Retail. One Platform for store planning, scheduling and execution.
For more information, visit www.logile.com.
Job Summary
We are seeking a motivated and experienced Site Reliability Engineer (SRE) to join our dynamic engineering team. The ideal candidate will have a strong background in ensuring the reliability, scalability, and performance of infrastructure and applications. The SRE will focus on building robust monitoring systems, automating operations, and bridging the gap between development and operations to achieve high service availability.
Key Responsibilities
- Design, implement, and manage observability systems (Prometheus, Grafana, ELK/EFK, Jaeger, OpenTelemetry).
- Define and maintain SLAs, SLOs, and SLIs for services, ensuring reliability goals are met.
- Build automation for infrastructure, monitoring, scaling, and incident response using Terraform, Ansible, and scripting (Python/Bash).
- Collaborate with developers to design resilient and scalable systems following SRE best practices.
- Lead incident management: monitoring alerts, root cause analysis, postmortems, and continuous improvement.
- Implement chaos engineering and fault-tolerance testing to validate system resilience.
- Drive capacity planning, performance tuning, and cost optimization across environments.
- Ensure security, compliance, and governance in infrastructure monitoring.
Job Location & Schedule
- This is an onsite role at the Logile Bhubaneswar office.
- The selected candidate is expected to be available for some hours of overlap with US working hours.
Required Skills & Experience
- 2-5 years of strong experience with monitoring, logging, and tracing tools (Prometheus, Grafana, ELK, EFK, Jaeger, OpenTelemetry, Loki).
- Cloud expertise: AWS, Azure, or GCP monitoring and reliability practices (CloudWatch, Azure Monitor).
- Proficiency in Linux system administration and networking fundamentals.
- Solid skills in infrastructure automation (Terraform, Ansible, Helm).
- Programming/scripting skills: Python, Go, Bash.
- Experience with Kubernetes and containerized workloads.
- Proven track record in CI/CD and DevOps practices.
Preferred Skills
- Experience with chaos engineering tools (Gremlin, Litmus).
- Strong collaboration skills to drive SRE culture across Dev & Ops teams.
- Experience with Agile/Scrum environments.
- Knowledge of security best practices (DevSecOps).
Compensation and Benefits
- The compensation and benefits for this role are benchmarked against the best in the industry and the job location.
- Applicable shift allowances and home pickup and drop-off will be provided by Logile.