Position Overview
We are seeking an experienced DevOps Engineer to architect and manage the infrastructure backbone of our revolutionary AI startup. This role offers an exceptional opportunity to build scalable, secure, and efficient systems that power next-generation AI applications. You'll work directly with our founding team to establish DevOps practices that will scale from MVP to enterprise-level solutions.
Key Responsibilities
Infrastructure & Cloud Management
- Design and implement scalable cloud infrastructure on AWS, Azure, or GCP for AI/ML workloads
- Architect and manage Kubernetes clusters optimized for ML training and inference
- Build and maintain infrastructure as code using Terraform, CloudFormation, or Pulumi
- Implement auto-scaling solutions for variable AI compute demands
- Manage GPU clusters and specialized hardware for deep learning workloads
MLOps & AI Pipeline Management
- Design and implement CI/CD pipelines specifically for machine learning model deployment
- Build automated model training, validation, and deployment workflows
- Implement model versioning, experiment tracking, and artifact management systems
- Set up monitoring and alerting for ML model performance and data drift detection
- Create disaster recovery and rollback strategies for AI model deployments
Platform Engineering
- Develop internal developer platforms and self-service tools for the engineering team
- Implement secure API gateways and microservices architecture for AI applications
- Build and maintain data pipelines for real-time and batch processing
- Design secrets management and security policies for sensitive AI data and models
- Establish logging, monitoring, and observability across all systems
Security & Compliance
- Implement security best practices for AI systems and sensitive data handling
- Design and maintain network security, firewalls, and VPN configurations
- Establish backup and disaster recovery procedures for critical AI infrastructure
- Ensure compliance with data protection regulations and industry standards
- Conduct regular security audits and vulnerability assessments
Performance Optimization
- Monitor and optimize infrastructure costs, especially for expensive GPU resources
- Implement caching strategies for AI inference and data processing
- Optimize container orchestration for maximum resource utilization
- Performance-tune databases and storage systems for AI workloads
- Establish SLA monitoring and capacity planning procedures
Required Qualifications
Technical Expertise
- Cloud Platforms: 4+ years of hands-on experience with AWS, Azure, or GCP
- Containerization: Expert-level Docker and Kubernetes skills with production experience
- Infrastructure as Code: Proficiency with Terraform, Ansible, or similar tools
- CI/CD: Experience building robust pipelines using Jenkins, GitLab CI, GitHub Actions, or Azure DevOps
- Programming: Strong scripting skills in Python and Bash, plus familiarity with Go or Java
AI/ML Infrastructure Knowledge
- Experience deploying and managing ML models in production environments
- Understanding of GPU computing, CUDA, and specialized AI hardware
- Familiarity with ML frameworks (TensorFlow, PyTorch, Scikit-learn) and their deployment requirements
- Knowledge of data engineering tools and big data processing (Spark, Kafka, Airflow)
- Experience with ML model serving platforms (MLflow, Kubeflow, Seldon, or TensorFlow Serving)
DevOps Fundamentals
- 5-8 years of DevOps/SRE experience with demonstrated expertise in production systems
- Strong Linux administration skills and system performance optimization
- Experience with monitoring tools (Prometheus, Grafana, ELK/EFK stack, Datadog)
- Database management experience (PostgreSQL, MongoDB, Redis) with backup/recovery
- Network engineering knowledge including load balancers, CDNs, and service meshes
Preferred Qualifications
- Previous experience in AI/ML startups or high-growth technology companies
- Certifications in cloud platforms (AWS Solutions Architect, Azure DevOps Engineer, etc.)
- Experience with edge computing and distributed AI inference systems
- Knowledge of data privacy frameworks and federated learning infrastructure
- Familiarity with FinOps practices for cloud cost optimization
- Experience with service mesh technologies (Istio, Linkerd, Consul Connect)
What We Offer
Compensation & Benefits
- Competitive salary up to ₹20,00,000 per annum
- Comprehensive health insurance with family coverage and wellness benefits
Technical Growth
- Access to cutting-edge AI infrastructure and the latest cloud technologies
- Opportunity to shape the technical architecture of a groundbreaking AI product
- Direct collaboration with world-class AI researchers and engineers
- Mentorship from experienced startup founders and tech leaders
Work Environment
- Flexible working arrangements with hybrid and remote options
- Modern office in Bengaluru with high-end development workstations
- Unlimited learning resources and access to cloud credits for experimentation
- Fast-paced, innovation-driven culture with direct impact on product success
- Regular tech talks, hackathons, and team building activities
Career Impact
- Ground-floor opportunity in a stealth-mode AI company with massive potential
- Chance to build infrastructure that will serve millions of users
- Direct reporting to the CTO/founders with significant decision-making authority
- Opportunity to lead and build the DevOps team as the company scales
- Potential for international expansion and technology leadership roles
About This Opportunity
Join us at the most exciting phase of our journey. As one of our first DevOps hires, you'll have unprecedented influence over our technical infrastructure and engineering culture. This role is perfect for someone who wants to combine deep technical expertise with entrepreneurial impact in the rapidly evolving AI landscape.
You'll work on challenging problems like:
- Scaling AI training from single GPUs to multi-node clusters
- Implementing real-time AI inference at global scale
- Building secure, compliant infrastructure for sensitive AI applications
- Optimizing costs while maintaining high performance for variable AI workloads
Required Mindset
- Strong problem-solving skills with the ability to debug complex distributed systems
- Excellent communication skills for cross-functional collaboration
- Passion for automation, efficiency, and engineering excellence
- Interest in AI/ML technology and its infrastructure challenges
Note: Due to our stealth-mode status, specific product and technology details will be shared during the interview process with qualified candidates who have signed the appropriate NDAs.
Application Requirements
Please submit:
- A detailed resume highlighting relevant DevOps and AI infrastructure experience
- A GitHub/GitLab profile showcasing infrastructure code and automation projects
- A brief cover letter explaining your interest in AI DevOps and startup environments
- Any relevant cloud certifications, case studies, or technical blog posts
We are committed to building a diverse and inclusive team. All qualified applicants will receive equal consideration regardless of race, gender, age, religion, sexual orientation, disability status, or veteran status.