We’re Hiring: HPC Infrastructure Engineer
📍 Location: India (candidates must be willing to relocate to the UAE)
🕒 Experience: 5+ Years
💼 Employment Type: Full-Time
⸻
🔧 Job Summary:
We are seeking a highly skilled High-Performance Computing (HPC) Infrastructure Engineer to join our IT infrastructure team. This role focuses on designing, deploying, and maintaining robust HPC systems that support advanced computing and data-intensive applications. You will play a key role in ensuring the performance, reliability, and scalability of compute and storage infrastructure. The role also includes managing incident response, service requests, and changes across HPC environments in managed service settings.
⸻
🛠️ Roles and Responsibilities:
• Design, implement, and manage high-performance network architectures for HPC clusters.
• Configure and optimize InfiniBand and Ethernet switches, routers, and interconnects.
• Ensure high availability, redundancy, and fault tolerance in HPC systems.
• Deploy and maintain HPC clusters, monitor job scheduling, and ensure optimal system health.
• Troubleshoot compute node hardware/software issues and implement performance improvements.
• Maintain storage systems (Ceph, VAST Data, Lustre, GPFS, NFS, GlusterFS) with fast, reliable access from clusters.
• Configure and manage InfiniBand fabrics; upgrade firmware and monitor performance.
• Use tools like Grafana, Prometheus, Ganglia, and UFM for cluster and network monitoring.
• Work closely with researchers and data scientists to support HPC/AI workloads.
• Assist in debugging, tuning, and optimizing distributed applications.
• Create and maintain high-level design (HLD) and low-level design (LLD) documentation.
⸻
📚 Required Experience:
• 5+ years managing infrastructure in HPC environments.
• Strong background in data center operations: servers, switches, routers, storage.
• Proficient in NVIDIA/Mellanox (Cumulus Linux) switch configuration and troubleshooting.
• Hands-on with monitoring tools: Prometheus, Grafana, Elastic Observability.
• Experience with HPC schedulers: SLURM, PBS, or Torque.
• Kubernetes environment setup and maintenance experience.
• Familiar with ML and data science workflows in HPC/AI environments.
• Strong Linux administration experience.
⸻
💡 Skills & Knowledge:
• Deep understanding of Ethernet and InfiniBand networks.
• Proficiency in distributed storage and file systems.
• Expertise in diagnosing and resolving complex infrastructure issues.
• Collaborative team player with strong communication skills.
• Capable of documenting and designing complex systems architecture.
⸻
🎓 Qualifications:
• Bachelor’s or Master’s degree in Computer Science, IT, or equivalent experience.
⸻
📜 Certifications (Preferred):
• Red Hat Certified Engineer (RHCE)
• Cisco Certified Network Associate (CCNA)
• AWS Certified Solutions Architect