We are seeking a dedicated Cloud Reliability Engineer to champion the reliability, availability, and security of our production SaaS platform. In this role, you will act as the first line of defense for cloud infrastructure, balancing your time between core day-to-day production operations (incident management, change management, monitoring, and triage) and automation that reduces operational toil. You will play a pivotal role in maintaining customer trust by strictly adhering to SLAs and compliance processes while driving continuous improvement through code.
What You'll Do:
Operational Excellence & Incident Management
- Monitoring & Triage: Proactively monitor cloud infrastructure health to ensure high availability and performance. Act as the primary owner for production alert monitoring, triage, and swift resolution.
- Incident Response: Manage critical incidents and escalations from identification to resolution. Lead root cause analysis (RCA) and post-incident reviews to minimize Mean Time To Recovery (MTTR) and prevent recurrence.
- Change & Release Management: Execute and track production upgrades, multi-tenant deployments, and change requests within defined SLAs, ensuring zero-downtime maintenance where possible.
- Escalation Support: Handle escalated Support cases and provide infrastructure support for field teams and other environments.
- 24/7 Availability: Participate in a shift-based schedule and on-call rotation to provide round-the-clock support for critical production systems.
Automation & Continuous Improvement
- Task Automation: Utilize Python and Jenkins to script and automate repetitive operational tasks, reducing manual intervention and increasing efficiency.
- Tooling Optimization: Assist in maintaining and optimizing monitoring, alerting, and CI/CD tools to streamline workflows.
- Process Evolution: Identify opportunities to shift left on operations, transforming manual runbooks into automated self-healing mechanisms over time.
What You Bring:
- 2-5 years of professional experience in Cloud Operations, Site Reliability Engineering (SRE), or Kubernetes administration.
- Hands-on experience with public cloud platforms (AWS, GCP, or Azure) in a production environment.
- Operational knowledge of Kubernetes (EKS, GKE, or AKS), including troubleshooting and cluster management.
- Moderate proficiency in scripting and automation, specifically using Python and Jenkins.
- Strong understanding of ITIL processes (Incident, Change, Problem Management).
- Demonstrated ability to prioritize tasks under pressure while maintaining strict SLAs.
- Excellent collaboration skills to work effectively with Engineering, Product, and Support teams.
- Bachelor's degree in Computer Science, Information Technology, or equivalent work experience.
Preferred Skills:
- Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation.
- Familiarity with cloud-native observability tools (e.g., CloudWatch, Stackdriver, Prometheus, Grafana).
- Strong Linux system administration and networking troubleshooting skills.
- Background in supporting enterprise-grade SaaS platforms with strict compliance and security requirements.
Working Conditions:
- Shift-Based Role: This position requires working in defined shifts to ensure global coverage.
- On-Call: Regular participation in an on-call rotation is required.
- Environment: Fast-paced, collaborative, and process-oriented, with a strong focus on production stability.