Company Description
Founded in 2018, Leena AI is an Enterprise Agentic AI platform that is powerful, flexible, and able to meet the needs of any enterprise. Trusted by 20M+ employees across companies like Nestle, Puma, AirAsia, Coca-Cola, Abbott and HDFC Bank, we have transformed 30M conversations and 1B employee interactions. Leena AI integrates with 100+ platforms, including SAP SuccessFactors, ADP, Oracle, Workday, Microsoft Office 365, and Slack. Supporting 100+ languages globally, Leena AI has raised $40M in investment from Greycroft and Bessemer Venture Partners.
Job Overview
Leena AI is seeking an experienced and motivated Site Reliability Engineer (SRE) to ensure the stability, scalability, and performance of our Enterprise Agentic AI platform. As a B2B SaaS company handling tens of millions of interactions, our systems must be highly resilient.
The ideal candidate will bridge the gap between development and operations, applying a software engineering mindset to system administration. You will be responsible for monitoring system health, managing alerts, and resolving routine service issues to maintain our commitment to enterprise-grade reliability.
Key Responsibilities
System Reliability & Performance
- Monitor system health, performance, and capacity across our cloud-based platforms and modern tech stacks.
- Manage and respond to alerts and logs to ensure high availability of services.
- Troubleshoot and resolve routine technical issues and performance bottlenecks in production environments.
- Define and track key reliability metrics, including Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Infrastructure & Automation
- Develop and maintain tools for automated deployments, scaling, and self-healing systems to reduce manual intervention.
- Collaborate with engineering teams to ensure new features meet reliability and production standards.
- Identify and automate repetitive "toil" tasks to improve operational efficiency.
Incident Management & Support
- Participate in on-call rotations to respond to and mitigate production incidents promptly.
- Conduct post-incident reviews (post-mortems) to identify root causes and implement preventative measures.
- Act as a technical point of contact for complex infrastructure-related issues, ensuring timely resolution.
Stakeholder & Process Alignment
- Act as a liaison between engineering and product teams to ensure infrastructure scalability aligns with the product roadmap.
- Maintain clear documentation of infrastructure, incident responses, and operational procedures using tools like Confluence.
Required Skills & Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 3+ years of experience in Site Reliability Engineering, DevOps, or a similar production operations role, preferably in B2B SaaS.
- Strong understanding of cloud-based platforms (e.g., AWS, Azure, or GCP) and modern containerization (e.g., Kubernetes, Docker).
- Proficiency in scripting or programming (e.g., Python, Go, or Bash) to automate operational tasks.
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack, Datadog).
- Familiarity with tools like JIRA, Confluence, and CI/CD pipelines.
Preferred Qualifications
- Relevant certifications such as AWS Certified DevOps Engineer or Google Professional Cloud DevOps Engineer.
- Proven ability to manage and prioritize high-pressure engineering support tasks in a B2B environment.
- Experience coordinating with geographically distributed engineering and product teams.
- Hands-on experience with Infrastructure as Code (IaC) tools like Ansible.
Skills: Python, DevOps, infrastructure, enterprise, Kubernetes, Docker, Confluence, reliability, site reliability engineering, B2B, JIRA, CI/CD