Key Responsibilities
- Design, implement, and support HPC clusters (CPU/GPU-based) with scalable storage and high-bandwidth interconnects.
- Generate hardware BOMs, manage vendors, and oversee hardware release and integration.
- Apply expert-level Linux system administration skills to configure and tune HPC environments (Red Hat, SUSE, Ubuntu, Rocky Linux, etc.).
- Assemble project specifications and performance requirements at both system and subsystem levels.
- Drive timely execution of project deliverables across cross-functional teams.
- Develop and maintain shell/Python scripts, golden images, procedures, and automation for deployment and monitoring.
- Support release of new hardware/software products into manufacturing with proper documentation and knowledge transfer.
- Configure and maintain robust storage solutions, netboot/PXE environments, and Linux HA clusters.
Required Qualifications
- Bachelor's or Master's degree (BE/BTech/MS/MCA/MSc) in Computer Engineering or Electrical Engineering.
- Minimum 7 years of experience in:
  - High Performance Computing (HPC) environments
  - Cluster management, deployment, and optimization
  - Linux systems (SUSE, Red Hat, Rocky Linux, Ubuntu)
  - Server, GPU, BIOS, BMC, networking, and storage hardware
  - TCP/IP fundamentals, DNS, DHCP, HTTP, LDAP, SMTP
  - Shell and Python scripting
- Strong experience with systemd, PXE boot, and high-availability clusters.
- Familiarity with configuration management tools: Salt, Chef, Puppet, etc.
Preferred Qualifications
- DevOps mindset with experience in CI/CD pipelines (Jenkins) and Git-based repository systems.
- Exposure to containerization tools (Singularity, Docker).
- Working knowledge of Kubernetes, Prometheus, Grafana, and observability tools.
- Understanding of web server and proxy technologies such as Apache and Nginx, including reverse proxies and HAProxy for load balancing.
- Experience with cloud-based compute architectures and hybrid models (on-prem + cloud).
Skills & Abilities
- Strong problem-solving skills and troubleshooting abilities.
- Exceptional team collaboration and communication skills.
- Ability to manage multiple tasks, prioritize efficiently, and meet project deadlines.
- Adaptable in fast-paced, evolving technology environments.
- Strong documentation skills and a process-oriented mindset.