On-site
Full Time
We are seeking an experienced HPC (High-Performance Computing) System Engineer to design,
implement, and manage cutting-edge HPC infrastructure using Dell servers, AMD GPUs (MI210),
and Pure Storage systems. The ideal candidate will have expertise in Commvault backup
systems, Kubernetes container orchestration, and multitenancy configurations, ensuring
scalable, GPU-accelerated, and high-performance solutions tailored to enterprise and HPC
workloads.
Key Responsibilities:
Dell Servers:
Architect and deploy HPC systems using Dell PowerEdge servers, ensuring high availability
and optimized performance for compute-intensive applications.
Manage server hardware lifecycle, including deployment, upgrades, and diagnostics.
Configure HPC cluster nodes for seamless integration with Kubernetes and GPU workloads.
AMD GPUs (MI210):
Deploy and optimize AMD GPU-based servers to accelerate AI/ML, HPC, and data-intensive
applications.
Monitor GPU utilization, troubleshoot performance bottlenecks, and optimize workloads for
GPU acceleration.
Integrate GPUs into Kubernetes environments for containerized GPU-based applications.
Pure Storage:
Design and manage Pure Storage solutions, including FlashBlade, to support HPC and
data-intensive workloads.
Implement multitenancy configurations for isolated, secure, and efficient resource
utilization.
Monitor storage health and ensure performance optimization for high-speed data access.
Commvault Backup:
Architect and manage enterprise-wide Commvault backup solutions, ensuring data integrity
and readiness for disaster recovery.
Implement backup and retention policies for HPC environments, including containerized and
GPU-accelerated workloads.
Kubernetes Container Management:
Deploy and manage Kubernetes clusters for HPC applications, ensuring scalability and fault
tolerance.
Configure persistent storage for containerized workloads and integrate storage with GPUs for
high-performance data processing.
Monitor cluster performance and troubleshoot HPC-specific Kubernetes challenges.
System Optimization and Monitoring:
Implement advanced monitoring solutions for servers, GPUs, storage, and Kubernetes
clusters to ensure peak performance.
Develop and enforce policies for system security, resource allocation, and compliance with
industry standards.
Lead capacity planning and scaling initiatives for HPC infrastructure.
Team Leadership and Collaboration:
Mentor and guide junior engineers on HPC best practices, system design, and
troubleshooting techniques.
Collaborate with cross-functional teams, including data scientists and DevOps, to align
infrastructure capabilities with organizational goals.
Qualifications:
Technical Skills:
Extensive experience with Dell PowerEdge servers in HPC or enterprise environments.
Proven expertise in AMD GPUs (MI210), including their integration and optimization for AI/ML
and HPC workloads.
Advanced knowledge of Pure Storage systems, including multitenancy and high-
performance configurations.
Expertise in Commvault backup systems, including design, deployment, and disaster
recovery.
Strong proficiency in Kubernetes container orchestration, particularly for GPU-accelerated
applications.
Knowledge of high-performance interconnects (e.g., RDMA, InfiniBand) and networking for
HPC.
HireKul
Upload Resume
Drag or click to upload
Your data is secure with us, protected by advanced encryption.
Browse through a variety of job opportunities tailored to your skills and preferences. Filter by location, experience, salary, and more to find your perfect fit.
We have sent an OTP to your contact. Please enter it below to verify.
gandhinagar
15.0 - 25.0 Lacs P.A.
india
Experience: Not specified
Salary: Not disclosed
india
Experience: Not specified
Salary: Not disclosed