Hi,
We have an immediate requirement for an HPC Team Lead position in Hyderabad with our organization, SHI Locuz Enterprise Solutions Pvt Ltd.
Please find the job description below:
Experience - 6+ years
Work location - Hyderabad
ROLE SUMMARY
The Technology Lead HPC ensures that critical IT services and high-performance computing (HPC) infrastructure are available, efficient, and secure. The person in this role manages daily operations of mission-critical systems in multiple clients' data centres, working closely with both facilities engineering teams (power, cooling, physical infrastructure) and IT infrastructure / operations teams to support service clients around the clock. This role combines technical leadership, operations oversight, incident / problem management, and strategic planning.
PRIMARY ROLES & RESPONSIBILITIES
- Experience architecting and maintaining HPC/AI systems:
  - Linux system administration
  - Cluster management
  - System and software configuration management
  - High speed networking
  - Resource managers and schedulers
  - High speed parallel storage
  - Monitoring and alerting
- Strong understanding of HPC/AI architectures and concepts.
- Experience supporting and managing a group of HPC/AI Clusters.
- Excellent knowledge in prototyping and deploying HPC/AI clusters.
- Extensive experience in troubleshooting Linux OS, filesystems and cluster hardware.
- Good command of various Linux scripting tools, such as Bash, Perl, and Python.
- Experience implementing, maintaining, and verifying defined security policies.
- Willingness to maintain a flexible work schedule.
- A positive attitude and willingness to help lab users succeed.
- Excellent guidance and teamwork skills.
TECHNICAL SKILLS
- Red Hat, Ubuntu, and SUSE operating systems
- Cluster tools (Bright, xCAT, Warewulf, OpenHPC, Rocks, etc.)
- InfiniBand
- Lustre, BeeGFS and GPFS architecture and maintenance
- Configuration management software (Ansible, Puppet)
- Slurm/PBS/LSF/Grid Engine schedulers
- Spack package manager
- Experience in AI Servers & Software stack Deployment.
- Experience with container technologies and orchestration tools: Docker, Singularity, Apptainer, Kubernetes.
- Hands-on with AI/ML tools: TensorFlow, PyTorch, Keras, ONNX, JAX.
- Experience in benchmarking and performance optimization of large-scale HPC/AI systems.
- Experience with Linux and/or Windows operating systems, including file management, scripting, editing, and security.
- Log consolidation and monitoring (Ganglia, Grafana, etc.)
- Lifecycle and patch management experience.
SOFT SKILLS
- Good logical reasoning and analytical skills
- Good communication skills
OTHER SKILLS
- Collaborative, cooperative, and committed mindset.
- Teamwork
- Excellent analytical and problem-solving skills.
- Ability to work independently and within cross-functional teams.
- Detail-oriented with good documentation practices.
- Excellent interpersonal, communication, customer interaction, documentation, and decision-making skills.