Job Description
About the Job
The Red Hat Chaos Engineering team, part of the Performance and Scale department, is looking for a Senior Software Engineer to join us in Bangalore, India. You will work on chaos testing Red Hat OpenShift Container Platform, Red Hat OpenShift Virtualization, and the related product portfolio to identify bottlenecks and tunings and to provide capacity planning guidance under failure conditions. Our goal is to make these products the platform of choice for Red Hat's enterprise customers! As a senior member of the team, you will be responsible for providing comprehensive resilience, reliability, performance, and scalability assessments of the products and for improving them. You will collaborate with various Engineering teams to drive features, bug fixes, and tunings, and provide guidance to ensure stable releases. You will also engage with customers to help them establish chaos and performance test pipelines, best practices, and strategies that ensure a scalable environment. This role needs an engineer who thinks creatively, adapts to rapid change, and is willing to learn and apply new technologies. You will be joining a vibrant open source culture and helping promote performance and innovation in this Red Hat engineering team.
What will you do?
Formulate test plans and carry out chaos testing and performance and scalability benchmarks against various components and features of the OCPv platform under failure conditions such as network faults, infrastructure failures, and storage faults, in order to characterize reliability and resilience, drive product performance improvements, and detect regressions through data analysis and visualization
Work on capacity planning guidance for the product to handle failures while still being performant
Develop tools and automation related to fault injection, load generation and release CI
Work on AI integration to improve test coverage
Assist customers
Collaborate with other engineering teams to resolve resilience and performance issues
Triage, debug, and solve customer/partner cases related to virtualization reliability, performance and scale
Publish results, conclusions, recommendations and best practices via internal test reports, presentations, external blogs and official documentation to support our partners and customers
Participate in internal and external conferences about your work and results
What will you bring?
Bachelor's or Master's degree in Computer Science or related field, or equivalent experience
5+ years of overall experience in software development
5+ years of programming experience in Python, Golang, or related programming languages
Experience with site reliability, chaos testing, performance benchmarking, data capture, analysis and debugging
Very strong Linux system administration and system engineering skills
Experience with container ecosystems like Docker, Podman and Kubernetes
Ability to quickly learn technologies with guidance and maintain high attention to detail
Experience with metrics collection and analysis tools such as iostat, vmstat, sar, perf, PCP, Prometheus, Grafana, and Elasticsearch
Familiarity with continuous integration and automation frameworks such as Jenkins, Airflow, and Ansible, and with version control tools such as Git
Experience working with public clouds like AWS, Azure, GCP, or IBM Cloud, as well as bare metal environments
Excellent written and verbal language skills in English
The following are considered a plus:
Experience with chaos testing and maintaining reliability of infrastructure at large scale
Experience working with virtualization technologies such as KubeVirt and VMware
Knowledge of performance observability and profiling tools such as eBPF and flame graphs