Senior System Engineer II (AI Infrastructure)

0 years

0 Lacs

Posted: 4 days ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Role: Senior System Engineer II (AI Infrastructure)
Stack: Linux, LXC, Python, libvirt, KVM, QEMU, Ceph, VyOS, GPU network fabric
Tools: Netplan, Ansible, Prometheus, Grafana, Bash shell scripting

What You’ll Be Doing:
- Provision, deploy, and maintain GPU and compute infrastructure in high-performance environments.
- Work with your fellow sharks to design, develop, and optimize the next generation of GPU infrastructure.
- Manage and configure Linux networking using Netplan.
- Develop and maintain infrastructure automation scripts using Python, Bash, or other scripting languages.
- Collaborate with cross-functional teams to meet AI/ML infrastructure needs.
- Work with customers and stakeholders to define and refine the infrastructure requirements needed to support their AI/ML workloads.
- Work with infrastructure technical leaders to define infrastructure requirements to store, move, and manipulate large datasets.
- Guide performance teams on industry-standard testing methodologies and help optimize for GPU fabric throughput.
- Identify security improvements and drive review discussions with internal teams.
- Work directly with individual engineering teams to deliver new infrastructure functions and technologies in support of AI/ML products.

What We’ll Expect From You:
- Experience delivering bare-metal GPU infrastructure.
- Provision, deploy, and maintain GPU and compute infrastructure in high-performance environments.
- Manage and configure Linux networking using Netplan.
- Understanding of AI/ML workloads and overall industry trends.
- Strong collaborator and consensus builder.
- Author and review design documentation.
- Experience troubleshooting, analyzing, and debugging relevant virtualization stacks (kernel, KVM, QEMU).
- Experience as a software engineer / developer in a large-scale, distributed environment.
- Experience writing secure, testable, and robust low-level code.
- Deep understanding of operating systems, virtualization, and Linux internals.
- Familiarity with related virtualization fundamentals, including networking datapath, containers, and data persistence layers.
- A critical thinker dedicated to solving problems and delivering solutions.

Required Skills & Qualifications:
- Strong systems administration experience in Linux (Ubuntu or Debian-based systems preferred).
- Scripting expertise (Python, Bash, etc.) for automation and tooling.
- Experience in infrastructure provisioning and deployment, both bare-metal and containerized.
- Proficiency with Netplan and Linux network stack configuration (routes, interfaces, DNS).
- Familiarity with GPU technologies and cloud platforms (AWS, Azure, GCP) is a plus.

Day-to-day tasks as seen on the job:
- Provision, deploy, and maintain GPU and compute infrastructure in high-performance environments.
- Manage and configure Linux networking using Netplan.
- MAAS
- Runbooks (documents)
  - Author and edit runbooks (procedures) for provisioning (deployment), decommissioning (reclaim/removal), repave (LXC container cleanup), etc.
  - Author and edit runbooks (playbooks) for new issues encountered and the troubleshooting performed to fix them.
- Provisioning procedure
  - New customer deployments
  - Existing customer expansions

Minimum Skillset:
- MAAS – structure, cloud-init (initial configuration scripts)
- LXC
  - Basic start, stop, destroy, and list of container(s)
  - Maneuvering LXC access from the bare-metal host
  - Troubleshooting services created within LXC at deploy time
- Local storage
  - Partitions (on physical disk devices)
  - Software RAID with mdadm (md0/1/2, on partitions)
  - Filesystem (ext4, on software RAID block devices)
- NCCL – single-node and multi-node distributed tests
- Database – PostgreSQL, psql: basic SELECT and UPDATE queries
- Installing a specific NVIDIA driver and CUDA version on a specific Ubuntu release and HWE kernel
- SCM – Git, GitHub: branch, commit, PR, Markdown
- File – YAML (JSON), sh (Bash shell), py (Python) formatting knowledge
- DCIM – NetBox
- ITSM – Jira
