Job Description
We are looking for a highly skilled HPC/GPU Operations Engineer with 6-10 years of experience to manage, optimize, and maintain high-performance computing (HPC) infrastructure, with a focus on GPU-accelerated workloads. You will be responsible for the reliability, efficiency, and scalability of HPC systems used for scientific computing, AI/ML, and data-intensive applications.

As an IC3-level professional, your responsibilities will include:
- Managing and administering HPC clusters, GPU nodes, and high-speed interconnects.
- Configuring GPU-accelerated workloads for AI/ML, scientific computing, and simulations.
- Monitoring system performance, troubleshooting issues, and optimizing resource utilization.
- Installing, configuring, and maintaining HPC-related software, libraries, and tools such as CUDA, OpenMP, and MPI.
- Supporting containerized workflows using technologies such as Docker and Singularity.
- Ensuring software compatibility with GPU architectures from NVIDIA, AMD, and Intel.
- Tuning GPU and CPU performance for specific workloads, including benchmarking and profiling.
- Using tools such as Prometheus, Grafana, and Slurm to track system health and efficiency.
- Optimizing scheduling and resource allocation in workload managers such as Slurm, PBS, and LSF.
- Maintaining system security and access control for HPC resources.
- Applying software patches, firmware updates, and security best practices.
- Assisting with regulatory compliance for HPC environments.
- Providing support to researchers, data scientists, and engineers using HPC resources.
- Developing and maintaining documentation on best practices, troubleshooting, and system usage.
- Conducting training sessions or workshops on HPC/GPU computing.

To qualify for this role, you should have the following technical skills:
- Experience in managing HPC clusters and GPU-based computing environments.
- Proficiency in Linux system administration, scripting (Bash, Python), and automation (Ansible, Terraform).
- Knowledge of parallel computing, GPU programming (CUDA, OpenCL), and HPC frameworks.
- Familiarity with networking technologies such as InfiniBand and RDMA, storage solutions such as Lustre, GPFS, and NFS, and virtualization.

At Oracle, we are dedicated to using cutting-edge technology to address current challenges while fostering an inclusive work environment that encourages innovation and collaboration. We offer competitive benefits, including flexible medical, life insurance, and retirement options, and we support employee participation in volunteer programs. We are committed to providing accessibility assistance or accommodation for individuals with disabilities throughout the employment process. If you require such assistance, please contact us by email at accommodation-request_mb@oracle.com or by phone at +1 888 404 2494 in the United States.