Lead Site Reliability Engineer, Network Assurance Data Platform

5 - 9 years

0 Lacs

Posted: 2 days ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

Role Overview:

Cisco ThousandEyes is a Digital Assurance platform that enables organizations to ensure flawless digital experiences across all networks, including those not owned by the organization. As a Site Reliability Engineering (SRE) Technical Leader on the Network Assurance Data Platform (NADP) team, you will be responsible for ensuring the reliability, scalability, and security of cloud and big data platforms. Your role will involve collaborating with cross-functional teams to design, influence, build, and maintain SaaS systems at multi-region scale.

Key Responsibilities:

- Design, build, and optimize cloud and data infrastructure to ensure high availability, reliability, and scalability of big-data and ML/AI systems while implementing SRE principles.
- Collaborate closely with cross-functional teams to create secure, scalable solutions supporting ML/AI workloads and enhancing operational efficiency through automation.
- Troubleshoot complex technical problems in production environments, perform root cause analyses, and contribute to continuous improvement efforts.
- Lead the architectural vision, shape the team's technical strategy, drive innovation, and influence the technical direction.
- Serve as a mentor and technical leader, fostering a culture of engineering and operational excellence.
- Engage with customers and stakeholders to understand use cases and feedback, translating them into actionable insights.
- Utilize strong programming skills to integrate software and systems engineering, building core data platform capabilities and automation.
- Develop strategic roadmaps, processes, plans, and infrastructure to deploy new software components at enterprise scale while enforcing engineering best practices.

Qualifications Required:

- Ability to design and implement scalable and well-tested solutions with a focus on operational efficiency.
- Strong hands-on cloud experience, preferably AWS.
- Infrastructure as Code expertise, especially Terraform and Kubernetes/EKS.
- Experience building and managing Cloud, Big Data, and ML/AI infrastructure, including hands-on expertise with various technologies such as Hadoop ecosystem components, Spark, PySpark, AWS SageMaker, and Kafka.
- Proficiency in writing high-quality code in Python, Go, or equivalent programming languages.

Additional Details:

Cisco is revolutionizing data and infrastructure connectivity and protection for organizations in the AI era and beyond. With a history of fearless innovation spanning 40 years, Cisco creates solutions that empower humans and technology to work seamlessly across physical and digital worlds, providing unparalleled security, visibility, and insights. The company fosters a collaborative environment where individuals can grow and build meaningful solutions on a global scale, driven by a worldwide network of experts. Cisco's impact is ubiquitous, reflecting the company's commitment to making big things happen everywhere.

Cisco

Software Development

San Jose, CA
