Job Description
As a Senior Site Reliability Engineer for the Operational Readiness team at HashiCorp, you play a crucial role in enhancing the scalability, performance, and reliability of our cloud products. With over 5 years of experience in site reliability engineering or a related field, you lead efforts to identify performance bottlenecks, address operational challenges proactively, and ensure our services meet the highest standards of operational excellence. Your expertise in load testing, performance analysis, and system hardening is instrumental in maintaining the operational resilience of our enterprise and cloud-based products. You focus on ensuring high availability and performance across all of HashiCorp's offerings, with a holistic view of enterprise and cloud systems.

In this role, you define and execute test plans, develop system-wide strategies for product load and performance testing, and explore new avenues to meet essential operational readiness criteria. You utilize troubleshooting techniques like chaos engineering to identify and provide novel solutions for complex system issues that may impact customers.

Key Responsibilities:
- Implement best practices for system reliability, including proactive identification of potential failure points and automated mitigations.
- Design and execute comprehensive load testing strategies to identify performance bottlenecks and scalability limits.
- Improve system resilience by implementing best practices and technologies for high availability and fault tolerance.
- Collaborate with engineering and product teams to integrate operational readiness into the development lifecycle.
- Build tools and frameworks for automated testing, environment simulation, and incident reproduction to increase test coverage.
- Analyze testing results, document findings, and make actionable recommendations for system enhancements.
- Drive systemic improvements through chaos testing and work closely with product development teams.
- Share knowledge and expertise with team members, promoting a culture of learning and continuous improvement.
- Develop and implement disaster recovery and backup strategies to ensure data integrity and system resilience.

Ideal Candidate:
- 5+ years of experience in SRE, systems engineering, or non-functional testing roles with a focus on operational readiness and performance testing.
- Proficiency in high-level programming languages or scripting.
- Track record of leading successful load testing and performance optimization initiatives in cloud and on-prem environments.
- Experience in creating and managing test environments for automated testing.
- Strong understanding of CI/CD processes and maintaining quality pipelines.
- Familiarity with version control systems (e.g., Git) and agile project management methodologies.
- Knowledge of monitoring and alerting systems, with the ability to develop metrics and alarms reflecting system health and operational risks.
- Technical foundation in cloud technologies (AWS, Azure, or GCP) and container technologies like Nomad or Kubernetes.
- Experience with performance testing tools like k6, Artillery, Vegeta, Locust, etc.
- Effective communication and collaboration skills with cross-functional teams and diverse audiences.
- Familiarity with HashiCorp products and tools is a plus.
- Exposure to the disaster recovery domain is also a plus.