Sr Systems Engineer Linux – AI Infrastructure

5 years

Posted: 5 days ago | Platform: LinkedIn

Work Mode

Remote

Job Type

Full Time

Job Description

Position: Senior Linux Administrator – AI/ML Infrastructure


Location: Remote

Experience: 5+ Years

Type: Full-time


Role Overview


We are seeking a highly skilled Senior Linux Administrator to join our team, focusing on the implementation and management of on-premises Linux servers optimized for AI/ML workloads.


The ideal candidate will have deep expertise in Linux system administration, Kubernetes cluster management, and a strong understanding of data center infrastructure components including servers, networking, storage, and virtualization technologies.


This role requires hands-on experience in automating infrastructure, optimizing performance, and ensuring reliability for high-performance computing (HPC) and AI/ML pipelines.


Key Responsibilities


Deploy, configure, and manage on-premises Linux servers supporting AI/ML workloads.


Set up, manage, and troubleshoot Kubernetes clusters for containerized workloads.


Optimize system and network performance for compute-intensive applications.


Automate provisioning and configuration using Ansible, Terraform, and scripting (Bash/Python).


Administer and monitor data center components such as servers, storage arrays, switches, and power systems.


Ensure system security, patch management, and compliance across environments.


Collaborate with DevOps, Data Science, and AI engineering teams to enable seamless integration with ML pipelines.


Plan and implement scalability strategies, maintaining uptime and redundancy.


Maintain comprehensive documentation of configurations, policies, and network diagrams.


Required Skills & Qualifications


7+ years of experience in Linux system administration (RHEL, Ubuntu, CentOS).


Proven hands-on experience with Kubernetes cluster management (setup, scaling, troubleshooting).


CKA (Certified Kubernetes Administrator) certification is mandatory.


Strong knowledge of data center components – servers, racks, networking switches, storage systems, and virtualization layers.


Experience with Ansible, Terraform, CI/CD pipelines, and infrastructure automation.


Proficiency in scripting languages (Bash, Python).


Understanding of performance tuning, system optimization, and fault diagnosis.


Excellent problem-solving, communication, and collaboration skills.


Preferred / Good to Have


Exposure to NVIDIA GPU management, CUDA environments, and AI/ML compute nodes.


Familiarity with HPC environments and distributed computing frameworks.


Experience managing monitoring systems (Prometheus, Grafana) and backup solutions.


Knowledge of DevOps practices, containerization, and hybrid cloud environments.
