Posted: 3 days ago
Work from Office
Full Time
As a Principal Site Reliability Engineer (SRE Level V/VI), you will play a central role in ensuring the performance, availability, and resilience of our platforms. In this position, you will go beyond maintaining systems by leading initiatives that redefine operational excellence. You will collaborate with diverse teams to implement cutting-edge technologies and best practices, foster a culture of reliability, and mentor others in their growth as engineers. This is an exceptional opportunity for someone passionate about solving complex challenges and shaping the future of platform reliability in a high-impact role.

You'll spend time on the following:
- Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher.
- Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools.
- Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery.
- Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack.
- Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs.
- Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues.
- Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads.
- Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency.
- Mentor junior engineers, fostering a collaborative and growth-oriented team environment.
- Guide architectural decisions that drive innovation and enhance system reliability.

We're excited about you if you have:
- 10+ years in systems engineering, including at least 5 years in SRE or DevOps roles.
- Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker).
- Proficiency in programming and scripting languages like Python, Go, and Bash.
- Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible.
- Deep understanding of networking, DNS, load balancing, and security principles.
- Proven track record of managing high-availability systems in demanding environments.
- Exceptional analytical and problem-solving skills.

Preferred Qualifications:
- Certifications in cloud or container technologies (e.g., AWS/GCP/Azure, Kubernetes CKA).
- Experience in industries like eCommerce, FinTech, or SaaS.
- Familiarity with Agile development processes and frameworks.

What We Offer:
- The opportunity to work with cutting-edge technologies in a transformative environment.
- A collaborative and innovative work culture that values your expertise and contributions.
- Professional growth and leadership development pathways tailored to your aspirations.
- A chance to leave a lasting impact by shaping the future of reliable and scalable systems.
Hyderabad
Gurugram, Haryana, India
Salary: Not disclosed