AI Infrastructure Systems/Solutions Architect

10 years

4 - 7 Lacs

Posted: 1 week ago | Platform: GlassDoor


Work Mode

On-site

Job Type

Part Time

Job Description

About the Role
We are looking for a Systems or Solutions Architect with deep expertise in networking, infrastructure-as-a-service (IaaS), and cloud-scale system design to help architect and optimize AI/ML infrastructure. The ideal candidate combines strong fundamentals in cloud architecture (AWS or equivalent), networking, compute, and storage with hands-on experience in Kubernetes, observability, and automation. You'll design scalable systems that support large AI workloads, enabling efficient training, inference, and data pipelines across distributed environments.

Key Responsibilities
Architect and scale AI/ML infrastructure across public cloud (AWS / Azure / GCP) and hybrid environments.
Design and optimize compute, storage, and network topologies for distributed training and inference clusters.
Build and manage containerized environments using Kubernetes, Docker, and Helm.
Develop automation frameworks for provisioning, scaling, and monitoring infrastructure using Python, Go, and IaC (Terraform / CloudFormation).
Partner with data science and MLOps teams to align AI infrastructure requirements (GPU/CPU scaling, caching, throughput, latency).
Implement observability, logging, and tracing using Prometheus, Grafana, CloudWatch, or OpenTelemetry.
Drive networking automation (BGP, routing, load balancing, VPNs, service meshes) using software-defined networking (SDN) and modern APIs.
Lead performance, reliability, and cost-optimization efforts for AI training and inference pipelines.
Collaborate cross-functionally with product, platform, and operations teams to ensure secure, performant, and resilient infrastructure.

Required Qualifications
Knowledge of AI/ML infrastructure patterns, including distributed training, inference pipelines, and GPU orchestration.
Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
10+ years of experience in systems, infrastructure, or solutions architecture roles.
Deep understanding of:
  Cloud architecture: AWS (preferred), Azure, or GCP
  Networking: VPC, Transit Gateway, DNS, routing, peering, load balancing, VPN
  Compute and storage: EC2, ECS/EKS, S3, EBS, EFS, FSx, caching systems
  Core infrastructure: virtualization, containers, distributed systems, and OS-level tuning
Proficiency in Linux systems engineering and scripting with Python and Bash.
Experience with Kubernetes (EKS/GKE/AKS) for large-scale workload orchestration.
Experience with Go (Golang) for infrastructure or network automation.
Familiarity with Infrastructure-as-Code (IaC) tools like Terraform, Ansible, or CloudFormation.
Experience implementing monitoring and observability systems (Prometheus, Grafana, ELK, Datadog, CloudWatch).

Preferred Qualifications
Experience with DevOps and MLOps ecosystems (SageMaker, Kubeflow, MLflow, Airflow).
AWS or cloud certifications such as Solutions Architect Professional or Advanced Networking Specialty.
Experience in performance benchmarking, security hardening, and cost optimization for compute-intensive workloads.
Strong collaboration skills and ability to communicate complex infrastructure concepts clearly.
