- As a member of the System Admin team, you'll be responsible for managing the team that monitors the system Lab for HPC software release. This process is at the core of engineering development for components of the High Performance Compute (HPC) system stack, including compute, storage, network, hardware, and software, in varied configurations.
- You will be on the front lines of ensuring that the development environment meets and exceeds the requirements of those engineering teams while remaining stable, consistent and repeatable.
How you'll make your mark:
- Recruit, manage and retain teams of talented engineers
- Mentor engineers and lead by example
- Embrace a culture of innovation, collaboration and inclusion
- Partner with engineering teams to design and implement scalable, high-performance automation solutions for build, deployment, testing and management tools
- Apply best practices for Continuous Integration and Continuous Test in support of agile development teams
- Navigate the organization as necessary to unblock the team and ensure timely delivery
- Drive the adoption of new best practices into activities in the engineering organization
- Engage and collaborate with technology partners and customers
What you need to bring:
Knowledge and Skills:
- Bachelor's or Master's degree in computer science or equivalent
- 7+ years of experience working with engineering teams (automation / tools / infrastructure experience a plus)
- Strong written and verbal communication skills; ability to work successfully in a team environment
- Experience in overall architecture of software systems for products and solutions.
- Strong background in Linux/Unix system administration, performance tuning, and troubleshooting.
- Proven experience managing HPC cluster environments, job schedulers (e.g., Slurm, PBS, LSF), and workload managers.
- In-depth knowledge of networking concepts (DNS, DHCP, IP management, switches, etc.).
- Hands-on experience with monitoring and logging tools such as Prometheus, Grafana, or Elastic Stack.
- Expertise in hardware troubleshooting of compute nodes, storage systems, and interconnect technologies (e.g., InfiniBand).
- Familiarity with security hardening, compliance standards, and patch management.
- Strong scripting and automation skills using Python, Bash, or Ansible.
- Excellent leadership, team management, and communication skills.
- Ability to manage multiple priorities and deliver results in a dynamic environment.
- Understanding of software quality assurance tools and processes.
- Strong analytical and problem-solving skills.
- Ability to effectively communicate product architectures, design proposals, and negotiate options at management levels.
- Experience testing large systems.
- Good understanding of containerization (Docker / Podman) and container orchestration (Kubernetes).
Nice to have skills:
- Jenkins, Git, Jira, Ansible
- Automated test frameworks (e.g., Avocado, PyTest)
Personal Attributes:
- Team-oriented; understands the value of collaborative expertise
- Takes ownership for delivering & innovating in defined & ambiguous situations
- Flexible; works with the requisite urgency
- Hard-working, reliable, results-oriented
- Eager to learn; curious about business context, emerging technologies, and research techniques
- Stakeholder/customer orientation