AI Infrastructure Engineers – Network Design, Deployment & Operations (NCP-AIN/AII/AIO Certified)

0 years

0 Lacs

Posted: 3 days ago | Platform: LinkedIn

Work Mode: Remote

Job Type: Contractual

Job Description

NCP-Certified Engineers


Apply now to be part of cutting-edge AI deployments and scalable data center innovation!


1. Network Design & Installation Engineer (NCP-AIN Certified)

Location: India (Remote)

Duration: Long Term Contract


Overview:

Network Design & Installation Engineer

Key Responsibilities:

  • Design and implement scalable InfiniBand/Ethernet networks to support large-scale H100 GPU clusters.
  • Configure Spectrum-X switches, BlueField DPUs, and Cumulus Linux-based environments.
  • Integrate networking architecture with existing data center infrastructure.
  • Perform on-site installations, including racking, cable management, and connectivity validation.
  • Use tools such as UFM and ibdiagnet to run diagnostics and optimize network performance.
  • Collaborate with infrastructure and operations teams to ensure seamless deployment and expansion.

Qualifications:

  • NCP-AIN certification (required), or strong equivalent hands-on experience.
  • In-depth knowledge of InfiniBand, RoCE v2, Spectrum switches, BlueField DPUs, and Cumulus Linux.
  • Proven experience in designing and deploying high-performance or HPC network environments.
  • Willingness to travel for on-site deployments and hands-on hardware installation.
  • Experience with telemetry, diagnostics, and fabric tuning tools.


2. AI Infrastructure Deployment Engineer (NCP-AII Certified)

Location: India (Remote)

Duration: Long Term Contract


Overview:

AI Infrastructure Deployment Engineer

Key Responsibilities:

  • Lead end-to-end deployment of AI racks, including servers, GPUs, switches, and interconnects.
  • Validate bare-metal hardware, Spectrum-X switches, routers, and storage systems.
  • Configure multi-tenant GPU environments using MIG, MPS, and virtualization tools.
  • Deploy NVIDIA Base Command, DGX OS, and associated AI/ML software stacks.
  • Integrate systems with Kubernetes, Helm, and other orchestration platforms.
  • Implement monitoring and telemetry using DCGM, UFM, and performance benchmarking tools.

Qualifications:

  • NCP-AII certification (required), or equivalent hands-on infrastructure experience.
  • Expertise in GPU server configurations, MIG/MPS, Base Command, and virtualization (K8s, vSphere).
  • Experience with BIOS/firmware updates, system burn-in, and power/cooling validation.
  • Strong understanding of data center infrastructure and AI workload requirements.
  • Experience integrating AI infrastructure with cloud-native tools and container environments.


3. AI Infrastructure Operations Engineer (NCP-AIO Certified)

Location: India (Remote)

Duration: Long Term Contract


Overview:

AI Infrastructure Operations Engineer

Key Responsibilities:

  • Manage day-to-day operations of GPU clusters, networking fabric, and server infrastructure.
  • Monitor and maintain the health of InfiniBand/Ethernet networks and DGX/H100 nodes.
  • Apply firmware upgrades and OS patches, and handle infrastructure lifecycle management.
  • Troubleshoot hardware, network, and container-level failures using telemetry tools like UFM and DCGM.
  • Create and maintain operational runbooks, automate workflows, and improve incident response.
  • Support infrastructure scaling and upgrades, and collaborate with deployment teams.

Qualifications:

  • NCP-AIO certification (required), or comparable operational experience in large-scale AI environments.
  • Strong troubleshooting skills across compute, network, and storage domains.
  • Experience with monitoring and telemetry tools (Prometheus, Grafana, DCGM, UFM).
  • Familiarity with log aggregation and alerting systems.
  • Background in data center operations, capacity planning, and support automation.


How These Roles Collaborate

  • NCP-AIN (Design & Install): Builds and installs the high-speed network fabric that powers AI workloads.
  • NCP-AII (Deploy): Deploys and validates the full AI infrastructure stack, including hardware and software integration.
  • NCP-AIO (Operate): Ensures continuous, reliable, and optimized operations of deployed AI environments.
