Senior Site Reliability Engineer, Network Assurance Data Platform

5 - 9 years

0 Lacs

Posted: 14 hours ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

As a Senior Site Reliability Engineer (SRE) on the Network Assurance Data Platform team at Cisco ThousandEyes, you will be responsible for the reliability, scalability, and security of our cloud and big data platforms. Collaborating with cross-functional teams, including software development, product management, and security, you will design, build, and maintain systems operating at multi-region scale. Your work will directly affect the success of our machine learning (ML) and AI initiatives by keeping the underlying infrastructure robust, efficient, and aligned with operational excellence.

Your main responsibilities will include designing, building, and optimizing cloud and data infrastructure to ensure high availability, reliability, and scalability of big-data and ML/AI systems. You will implement Site Reliability Engineering principles such as monitoring, alerting, error budgets, and fault analysis. Working closely with development, product management, and security teams, you will develop secure, scalable solutions that support ML/AI workloads and improve operational efficiency through automation. You will also troubleshoot complex technical issues in production environments, perform root cause analyses, and contribute to continuous improvement efforts. You will help shape the team's technical strategy and roadmap, balancing immediate needs with long-term goals, while mentoring peers and fostering a culture of learning and technical excellence.

Qualifications for this role include the ability to design and implement scalable, well-tested solutions with a focus on streamlining operations. Strong hands-on experience in cloud services, preferably AWS, and Infrastructure as Code skills, ideally with Terraform and Kubernetes, are required. Previous experience in AWS cost management, an understanding of Prometheus and its ecosystem, and the ability to write high-quality code in Python, Go, or equivalent languages are essential. A good understanding of Unix/Linux systems, the kernel, system libraries, file systems, and client-server protocols is expected. Experience in building cloud, big data, and/or ML/AI infrastructure (e.g., EMR, Airflow, Comet ML, AWS SageMaker, Spark) would be a bonus.

Cisco values diversity in its employees and believes that diverse teams are better equipped to solve problems, innovate, and create a positive impact. The company encourages candidates from all backgrounds to apply, even if they do not meet every single qualification listed. Research shows that individuals from underrepresented groups may experience imposter syndrome and doubt the strength of their candidacy. Cisco aims to unlock the potential in all candidates and emphasizes that everyone has something valuable to offer.
