Cloud Reliability Engineer II

2 - 5 years

11 - 15 Lacs

Posted: 6 days ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

We are seeking a dedicated Cloud Reliability Engineer to champion the reliability, availability, and security of our production SaaS platform. In this role, you will act as the first line of defense for cloud infrastructure, dividing your time between day-to-day production operations (incident management, change management, monitoring, and triage) and automation that reduces operational toil. You will play a pivotal role in maintaining customer trust by strictly adhering to SLAs and compliance processes while driving continuous improvement through code.
 
What You'll Do:
 
Operational Excellence & Incident Management
  • Monitoring & Triage: Proactively monitor cloud infrastructure health to ensure high availability and performance. Act as the primary owner for production alert monitoring, triage, and swift resolution.
  • Incident Response: Manage critical incidents and escalations from identification to resolution. Lead root cause analysis (RCA) and post-incident reviews to minimize Mean Time To Recovery (MTTR) and prevent recurrence.
  • Change & Release Management: Execute and track production upgrades, multi-tenant deployments, and change requests within defined SLAs, ensuring zero-downtime maintenance where possible.
  • Escalation Support: Handle escalated Support cases and provide infrastructure support for field teams and other environments.
  • 24/7 Availability: Participate in a shift-based schedule and on-call rotation to provide round-the-clock support for critical production systems.

Automation & Continuous Improvement

  • Task Automation: Utilize Python and Jenkins to script and automate repetitive operational tasks, reducing manual intervention and increasing efficiency.
  • Tooling Optimization: Assist in maintaining and optimizing monitoring, alerting, and CI/CD tools to streamline workflows.
  • Process Evolution: Identify opportunities to shift left on operations, transforming manual runbooks into automated self-healing mechanisms over time.

What You Bring:

  • 2-5 years of professional experience in Cloud Operations, Site Reliability Engineering (SRE), or Kubernetes administration.
  • Hands-on experience with public cloud platforms (AWS, GCP, or Azure) in a production environment.
  • Operational knowledge of Kubernetes (EKS, GKE, or AKS), including troubleshooting and cluster management.
  • Moderate proficiency in scripting and automation, specifically using Python and Jenkins.
  • Strong understanding of ITIL processes (Incident, Change, Problem Management).
  • Demonstrated ability to prioritize tasks under pressure while maintaining strict SLAs.
  • Excellent collaboration skills to work effectively with Engineering, Product, and Support teams.
  • Bachelor's degree in Computer Science, Information Technology, or equivalent work experience.

Preferred Skills:

  • Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation.
  • Familiarity with cloud-native observability tools (e.g., CloudWatch, Stackdriver, Prometheus, Grafana).
  • Strong Linux system administration and networking troubleshooting skills.
  • Background in supporting enterprise-grade SaaS platforms with strict compliance and security requirements.

Working Conditions:

  • Shift-Based Role: This position requires working in defined shifts to ensure global coverage.
  • On-Call: Regular participation in an on-call rotation is required.
  • Environment: Fast-paced, collaborative, and process-oriented environment with a strong focus on production stability.

Thoughtspot

Software Development

Mountain View, California
