India
On-site
Role: Senior System Engineer II (AI Infrastructure)
Stack: Linux, LXC, Python, libvirt, KVM, QEMU, Ceph, VyOS, GPU network fabric
Tools: Netplan, Ansible, Prometheus, Grafana, Bash shell scripting

What You’ll Be Doing:
- Provision, deploy, and maintain GPU and compute infrastructure in high-performance environments.
- Work with your fellow sharks to design, develop, and optimize the next generation of GPU infrastructure.
- Manage and configure Linux networking using Netplan.
- Develop and maintain infrastructure automation scripts using Python, Bash, or other scripting languages.
- Collaborate with cross-functional teams to meet AI/ML infrastructure needs.
- Work with customers and stakeholders to define and refine the infrastructure requirements needed to support their AI/ML workloads.
- Work with infrastructure technical leaders to define the infrastructure requirements to store, move, and manipulate large datasets.
- Guide performance teams on industry-standard testing methodologies and help optimize for GPU fabric throughput.
- Identify security improvements and drive review discussions with internal teams.
- Work directly with individual engineering teams to deliver new infrastructure functions and technologies in support of AI/ML products.

What We’ll Expect From You:
- Experience delivering bare-metal GPU infrastructure.
- Provision, deploy, and maintain GPU and compute infrastructure in high-performance environments.
- Manage and configure Linux networking using Netplan.
- Understanding of AI/ML workloads and overall industry trends.
- Strong collaborator and consensus builder.
- Author and review design documentation.
- Experience troubleshooting, analyzing, and debugging relevant virtualization stacks (kernel, KVM, QEMU).
- Experience as a software engineer / developer in a large-scale, distributed environment.
- Experience writing secure, testable, and robust low-level code.
- Deep understanding of operating systems, virtualization, and Linux internals.
- Familiarity with related virtualization fundamentals, including networking datapath, containers, and data persistence layers.
- A critical thinker dedicated to solving problems and delivering solutions.

Required Skills & Qualifications:
- Strong systems administration experience in Linux (Ubuntu or Debian-based systems preferred).
- Scripting expertise (Python, Bash, etc.) for automation and tooling.
- Experience in infrastructure provisioning and deployment, both bare-metal and containerized.
- Proficiency with Netplan and Linux network stack configuration (routes, interfaces, DNS).
- Familiarity with GPU technologies and cloud platforms (AWS, Azure, GCP) is a plus.

Day-to-day tasks as seen on the job:
- Provision, deploy, and maintain GPU and compute infrastructure in high-performance environments.
- Manage and configure Linux networking using Netplan.
- MAAS
- Runbooks (documents):
  - Author and edit runbooks (procedures) for provisioning (deployment), decommissioning (reclaim/removal), repave (LXC container cleanup), etc.
  - Author and edit runbooks (playbooks) for new issues encountered and the troubleshooting performed to fix them.
- Provisioning procedure:
  - New customer deployments
  - Existing customer expansions

Minimum Skillset:
- MAAS: structure, cloud-init (initial configuration scripts)
- LXC:
  - Basic container maneuvering: start, stop, destroy, list
  - Access from the bare-metal host
  - Troubleshooting services created within LXC at deploy time
- Local storage:
  - Partitions (on physical disk devices)
  - Software RAID: mdadm (md0/1/2, on partitions)
  - Filesystem (ext4, on software RAID block devices)
- NCCL: single-node and multi-node distributed tests
- Database: PostgreSQL, psql (basic SELECT and UPDATE queries)
- Specific NVIDIA driver and CUDA version install on specific Ubuntu and HWE kernel
- SCM: Git, GitHub (branch, commit, PR, markdown)
- File: YAML (JSON), sh (Bash shell), py (Python) formatting knowledge
- DCIM: NetBox
- ITSM: Jira
Posted 4 days ago