Job Description
As a Site Reliability Engineer (SRE) on the Network Assurance Data Platform team at Cisco ThousandEyes, you will play a crucial role in ensuring the reliability, scalability, and security of our cloud and big data platforms. You will design, build, and maintain systems operating at multi-region scale, and collaborate with cross-functional teams to support machine learning (ML) and AI initiatives.

Your primary responsibilities will include designing, building, and optimizing cloud and data infrastructure to guarantee high availability, reliability, and scalability of big-data and ML/AI systems. You will implement SRE principles such as monitoring, alerting, error budgets, and fault analysis. You will collaborate closely with development, product management, and security teams to create secure, scalable solutions that enhance operational efficiency through automation. Troubleshooting complex technical issues in production environments, conducting root cause analyses, and contributing to continuous improvement efforts will be part of your daily work. You will also have the opportunity to shape the team's technical strategy and roadmap, balancing immediate needs with long-term goals. Mentoring peers and fostering a culture of learning and technical excellence will be a key aspect of your role.

To excel in this position, you should demonstrate the ability to design and implement scalable solutions with a focus on streamlining operations. Strong hands-on experience in cloud technologies, preferably AWS, is required, along with skills in Infrastructure as Code, specifically with Terraform and Kubernetes. Previous experience in AWS cost management and an understanding of Prometheus and its ecosystem, including Alertmanager, are preferred. Proficiency in writing high-quality code in Python, Go, or an equivalent language is essential.
Additionally, a good understanding of Unix/Linux systems, the kernel, system libraries, file systems, and client-server protocols is expected. Experience building cloud, big data, or ML/AI infrastructure with tools such as EMR, Airflow, Comet ML, AWS SageMaker, and Spark would be a bonus.

At Cisco, we value diversity and believe that diverse teams are better equipped to solve problems, innovate, and create a positive impact. We welcome candidates from all backgrounds and encourage you to apply even if you do not meet every qualification listed. Research shows that individuals from underrepresented groups may doubt the strength of their candidacy, but we believe that everyone has something valuable to offer. Join us in our mission to unlock potential and drive innovation in the digital assurance space.