Posted: 11 hours ago
Work from Office
Full Time
Sonyliv is a leading OTT platform revolutionizing the way audiences consume entertainment. With millions of users across the globe, our mission is to deliver seamless, high-quality, and reliable streaming experiences. We are looking for a Principal Site Reliability Engineer (SRE) to join our team and take ownership of the availability, scalability, and performance of our critical systems.

Job Summary

As a Principal SRE, you will design, build, and maintain reliable and scalable infrastructure to support our OTT platform. You bring a developer's mindset, extensive SRE experience, and a passion for reliability and performance. You'll ensure smooth system operations, take ownership of application and infrastructure reliability, and bring a strong support mindset to critical incidents, even during off-hours. We're seeking a candidate with 8+ years of experience, a deep understanding of observability, and the ability to lead reliability initiatives across systems and teams.

Key Responsibilities

Full System Ownership: Take complete responsibility for the availability, reliability, and performance of systems, including both application and infrastructure layers.
Development & SRE Mindset: Leverage your experience as a developer and SRE to build tools, automation, and systems that improve reliability and operational efficiency.
Incident Management: Respond to and resolve critical system issues promptly. This includes on-call support and handling emergencies outside business hours, including late nights.
Infrastructure Management: Design, deploy, and manage infrastructure solutions using containers (Docker/Kubernetes), networks, and CDNs to ensure scalability and performance.
Observability: Drive best practices in observability, including metrics, logging, and tracing, to enhance system monitoring and proactive issue resolution. Implement and maintain observability tools like Prometheus, Grafana, ELK stack, or DataDog.
Reliability and Performance: Proactively identify areas for improvement in system reliability, performance, and scalability, and define strategies and best practices to address them.
Collaboration and Communication: Work closely with cross-functional teams, including development, QA, and support, to align goals and improve operational excellence. Communicate effectively across teams and stakeholders.
CI/CD and Automation: Build and enhance CI/CD pipelines to improve deployment reliability and efficiency. Automate repetitive tasks and processes wherever possible.
Continuous Improvement: Stay up to date with the latest technologies and best practices in DevOps, SRE, and cloud computing, and apply them to improve existing systems and processes.

Required Skills and Experience

Experience: 10+ years of experience in software development, DevOps, and SRE roles.
Development Experience: Strong experience as a software developer with expertise in building scalable, distributed systems.
SRE/DevOps Experience: Hands-on experience managing production systems, ensuring uptime, and improving system reliability.
Technical Proficiency: Strong experience with containers (Docker, Kubernetes). In-depth understanding of networking concepts and CDNs (e.g., Akamai, CloudFront). Proficiency in infrastructure-as-code (IaC) tools like Terraform or CloudFormation. Expertise in cloud platforms such as AWS, GCP, or Azure.
Observability Expertise: Proven experience in implementing and maintaining robust observability solutions, including monitoring, alerting, metrics, and tracing.
Incident Handling: Proven ability to handle critical incidents, perform root cause analysis, and implement permanent fixes.
Automation: Strong scripting/programming skills in Python, Go, or similar languages.
Reliability Focus: Demonstrated passion for system reliability, scalability, and performance optimization.
Soft Skills: Excellent communication, collaboration, and leadership skills. Ability to explain technical details to non-technical stakeholders.
On-Call Readiness: Willingness to participate in a 24x7 on-call rotation and support critical systems during off-hours.

Preferred Qualifications

Experience in OTT or video streaming platforms.
Understanding of video delivery workflows, encoding, and adaptive bitrate streaming technologies.
Experience working with hybrid or multi-cloud environments (on-premises and multi-cloud).
Certifications in cloud platforms (AWS Certified Solutions Architect, Google Professional Cloud Architect, etc.).
Sony Pictures Networks