HPC Engineer - AI Workloads & Infrastructure

3 years

Posted: 3 days ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Job Title: HPC Engineer – AI Workloads & Infrastructure

Department: Operations – High Performance Computing (HPC)

About iVedha:

iVedha


Role Overview:

We are seeking an HPC Engineer to join our operations team supporting AI workloads in a high-performance computing environment. This role focuses on building and managing HPC compute nodes, deploying Kubernetes clusters, and orchestrating bare-metal and virtualized environments. You will also work with advanced storage technologies such as VAST Data and MooseFS, ensuring seamless integration with GPU-accelerated infrastructure.


Key Responsibilities:

  • Design, deploy, and maintain HPC clusters for AI/ML workloads, including GPU-accelerated compute nodes (NVIDIA DGX/HGX platforms).
  • Implement and manage Kubernetes for containerized AI workloads, ensuring scalability and high availability.
  • Configure and optimize bare-metal servers, VMs, and virtualized environments for HPC applications.
  • Integrate and manage high-performance storage systems (VAST, MooseFS, Lustre, or similar parallel file systems).
  • Implement job scheduling and orchestration using Slurm or equivalent tools for AI and HPC workloads.
  • Monitor and tune system performance for GPU utilization, network throughput, and storage I/O.
  • Automate deployment and configuration using Foreman, Ansible, Terraform, or similar tools.
  • Collaborate with AI engineers, DevOps, and data teams to optimize infrastructure for LLM training, fine-tuning, and inference pipelines.
  • Ensure security, compliance, and data integrity across HPC environments.


Required Skills & Experience:

  • 3+ years in HPC engineering, systems administration, or AI infrastructure roles.
  • Strong experience with Linux (RHEL/CentOS/Ubuntu) in HPC environments.
  • Hands-on experience with Kubernetes, Docker, and container orchestration for AI workloads.
  • Familiarity with GPU clusters, CUDA, NCCL, and NVIDIA ecosystem tools.
  • Knowledge of high-speed interconnects (InfiniBand, RoCE) and networking for HPC.
  • Experience with parallel/distributed file systems (VAST, MooseFS, Lustre, GPFS).
  • Proficiency in automation and scripting (Python, Bash, Ansible).
  • Understanding of job schedulers (Slurm, PBS, Torque) and workload optimization.


Nice-to-Have:

  • Experience with cloud HPC platforms (Azure HPC, AWS ParallelCluster, or similar).
  • Familiarity with AI/ML frameworks (PyTorch, TensorFlow) and MLOps pipelines.
  • Exposure to observability tools (Prometheus, Grafana) for HPC environments.


Why Join iVedha?

  • Work on cutting-edge AI infrastructure projects powering Canada’s sovereign AI ecosystem.
  • Collaborate with a world-class team of engineers and AI specialists.
  • Competitive compensation, benefits, and opportunities for career growth in HPC and AI.
