Posted: 4 days ago
Platform:
On-site
Full Time
Cvent is looking for a Principal Site Reliability Engineer to help us scale our systems and ensure the stability, reliability, and performance of our platform, along with rapid deployments. We build teams that are inclusive, collaborative, and have a strong sense of ownership for the things they build. If you have a passion and a track record for solving problems, along with strong leadership skills, this is a great fit for you.
As a Principal Engineer, you will demonstrate expertise in both emerging and current technologies, methods, and processes, contributing to the evolution of software deployment processes, enhancing security, reducing risk, and improving the overall end-user experience. As part of the Technology R&D Team, you will play an integral part in advancing DevOps maturity and be part of a new culture of quality and site reliability. You will continually improve the reliability, resiliency, and scalability of our products, processes, and procedures. In this position, you will also be expected to grow into managing and mentoring engineers and supporting their technical growth.
• Set long-term technical direction for complex problems; communicate timeline, scope, risks, and the technical roadmap to leadership and stakeholders.
• Continuously evaluate emerging cloud and AI/automation technologies; run POCs to assess fit and pioneer intelligent copilots for support, incident response, and developer workflows.
• Architect, standardize, and scale SRE frameworks and best practices; drive adoption and continual improvement of SLIs/SLOs/SLAs across business-critical platforms.
• Lead design and integration of CI/CD, containerization (Docker, Kubernetes), and IaC (Terraform, AWS CDK) for large-scale environments; ensure security and regulatory compliance.
• Define and implement observability, monitoring, and alerting strategies; conduct deep-dive RCAs using Datadog, Prometheus, Grafana, and ELK; lead blameless postmortems.
• Lead capacity planning, cost optimization, and disaster recovery to ensure scalability, reliability, and system resilience.
• Translate business risk and product goals into actionable reliability and observability strategies; partner closely with SRE, Product, and Engineering teams.
• Mentor and upskill SRE/DevOps engineers; foster a culture of ownership, continuous learning, and operational excellence.
• Pioneer the use of AI-powered automation and intelligent copilots for alert triage, event grouping, and developer/operations workflow efficiencies.
• Serve as a mentor and organizational leader, influencing technical direction, upskilling teams, and fostering a culture of shared reliability ownership and blameless postmortems.
• Lead capacity planning, cost optimization, and disaster recovery initiatives to ensure seamless scalability and system resilience.
• Bridge business and technology stakeholders, translating business risk and product goals into actionable reliability and observability strategies.
• Represent the technology perspective and priorities to leadership and other stakeholders by continuously communicating timeline, scope, risks, and the technical roadmap.
• 10-13 years in SRE, cloud engineering, or DevOps with significant time in an architect, staff, or principal role.
• Deep fluency in AWS across multi-account, multi-region, and high-traffic environments; strong foundation in distributed systems architecture and infrastructure as code.
• Demonstrable leadership scaling organizational SRE practices: CI/CD, observability, incident management, RCAs, and blameless postmortems.
• Proven track record driving adoption of AI, automation, and ML to improve reliability, operational efficiency, and developer productivity.
• Expert programming/scripting (Python, Go, or similar) with Linux internals depth and advanced troubleshooting of distributed systems.
• Validated breadth across networking, cloud, databases, and scripting; experience with multi-tier architectures.
• Exceptional ability to influence, coach, and communicate across engineering and product; acts as a pragmatic technical conscience with a strong bias for execution.
• Mastery of incident management, postmortem culture, and root cause analysis for distributed systems.
• Experience with Unix/Linux environments and a deep grasp of system internals.
• Experience working on large-scale distributed systems, including multi-tiered architectures.
• Validated breadth of understanding and development of solutions based on multiple technologies, including networking, cloud, database, and scripting languages.
• Strong leadership, communication, and interpersonal skills geared toward getting things done.