Posted: 6 days ago
AIOps Lead
Location: Chandigarh (On-site)
Experience: 3 to 5 years (AI/ML + DevOps + Observability)
Employment Type: Full-time
About the Role
We are looking for a next-generation AIOps Engineer to design and operate AI-driven, self-healing, and intelligent infrastructure systems.
In this role, you’ll fuse MLOps, DevOps, and agentic AI systems — leveraging technologies like Ray, vLLM, SGLang, and PyTorch Lightning to build predictive, autonomous, and scalable operational pipelines.
You will develop intelligent observability systems capable of detecting, diagnosing, and resolving issues in real time — powered by distributed AI and LLM-based automation.
Key Responsibilities
• Design, implement, and scale AIOps pipelines that collect, analyze, and act on telemetry data across infrastructure and applications (see the first sketch after this list).
• Build and deploy distributed ML/LLM workflows using Ray, PyTorch Lightning, vLLM, or SGLang for anomaly detection, event correlation, and predictive maintenance.
• Orchestrate LLM-based operations agents using LangChain, LangGraph, or SGLang to power AI-assisted diagnostics and root-cause analysis (see the second sketch after this list).
• Implement intelligent observability layers over systems like Prometheus, Grafana, ELK, OpenTelemetry, or Datadog to enable AI-driven insights and alerting.
• Develop self-healing systems leveraging AI and automation frameworks to auto-remediate incidents.
• Optimize inference serving and distributed compute with vLLM, Ray Serve, and Triton Inference Server for low-latency response times.
• Build real-time data ingestion pipelines using Kafka, Spark, or Flink for operational and telemetry data.
• Collaborate with SRE, MLOps, and AI engineering teams to create autonomous, adaptive infrastructure systems.
• Integrate CI/CD pipelines for AI workflows using MLflow, Kubeflow, or Airflow, with model monitoring and drift detection.
• Evaluate and integrate AIOps platforms (Moogsoft, BigPanda, Datadog AIOps, Dynatrace, etc.) and agentic frameworks for proactive automation.
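For illustration only, here is a minimal Python sketch of one pipeline stage of the kind described in the first bullet above: it pulls a metric from a Prometheus endpoint over the HTTP API, compares the latest sample against a rolling baseline, and flags an anomaly. The endpoint URL, PromQL query, and z-score threshold are assumed placeholders, not details taken from this role.

# Minimal sketch of one AIOps pipeline stage (all values illustrative).
import statistics
import time
import requests

PROM_URL = "http://prometheus:9090"   # hypothetical Prometheus endpoint
QUERY = 'avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))'
Z_THRESHOLD = 3.0                     # illustrative anomaly threshold

def fetch_series(window_s=3600, step_s=60):
    """Fetch the last hour of the metric via the Prometheus HTTP API."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": QUERY, "start": end - window_s, "end": end, "step": step_s},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [float(value) for _, value in result[0]["values"]] if result else []

def is_anomalous(series):
    """Flag the newest sample if it sits far outside the recent baseline."""
    if len(series) < 10:
        return False
    baseline, latest = series[:-1], series[-1]
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline) or 1e-9
    return abs(latest - mean) / stdev > Z_THRESHOLD

if __name__ == "__main__":
    if is_anomalous(fetch_series()):
        # A production pipeline would publish this event to Kafka or hand it
        # to a remediation agent rather than printing it.
        print("anomaly detected in cluster CPU utilization")

In practice this stage would run as a scheduled job or Ray task, with the resulting event published to Kafka for downstream correlation and auto-remediation.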
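Similarly, the AI-assisted diagnostics in the second referenced bullet could start from something as small as the sketch below, which feeds a few correlated alert lines to a model loaded through vLLM's offline inference API and asks for a probable root cause. The model name, prompt, and alert lines are hypothetical; a real deployment would serve the model behind Ray Serve or vLLM's OpenAI-compatible server and pull alerts from the observability stack.

# Rough sketch: summarize a probable root cause from recent alerts with vLLM.
from vllm import LLM, SamplingParams

recent_alerts = [
    "10:02Z checkout-api p99 latency 4.8s (SLO 300ms)",
    "10:03Z postgres-primary connection pool exhausted",
    "10:04Z checkout-api 5xx rate 12%",
]  # placeholder telemetry; a real agent would query the observability stack

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # placeholder model name
params = SamplingParams(temperature=0.2, max_tokens=256)

prompt = (
    "You are an SRE assistant. Given these correlated alerts, state the most "
    "likely root cause and one remediation step:\n" + "\n".join(recent_alerts)
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)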
Required Skills & Qualifications
• Bachelor’s or Master’s in Computer Science, Engineering, or related field.
• 4+ years of experience in DevOps, SRE, or AI infrastructure engineering.
• Strong programming experience in Python (preferred), Go, or Bash scripting.
• Deep understanding of cloud platforms (AWS, GCP, Azure) and Kubernetes/Docker orchestration.
• Expertise in infrastructure as code (Terraform, Helm, Pulumi).
• Experience with distributed compute frameworks — Ray, PyTorch Lightning, vLLM, SGLang.
• Proficiency with observability and monitoring stacks (Prometheus, Grafana, ELK, OpenTelemetry, Splunk).
• Familiarity with MLOps and LLMOps tools (MLflow, Kubeflow, Airflow, ArgoCD).
• Experience with event-driven systems and message queues (Kafka, RabbitMQ, AWS SQS).
• Understanding of AI-powered automation, root cause analysis, and predictive operational analytics.
Preferred / Nice-to-Have
• Hands-on experience with vLLM for optimized LLM inference and observability agents.
• Experience deploying and optimizing Ray Serve, vLLM, or Triton in production.
• Exposure to SGLang for LLM-based orchestration, workflow automation, and diagnostics reasoning.
• Familiarity with vector databases (Milvus, Weaviate, Pinecone) and RAG-based observability.
• Experience with agentic AIOps frameworks and LLM-driven operational reasoning (LangGraph, AutoGen, CrewAI).
• Understanding of AI observability, drift detection, cost-aware scaling, and fault-tolerant AI systems.
• Contributions to open-source AIOps, observability, or distributed AI infrastructure projects.
What We Offer
• Opportunity to build the foundation for autonomous, intelligent operations.
• Hands-on exposure to SGLang, vLLM, Ray, PyTorch Lightning, and LangGraph ecosystems.
• Collaborative, cross-functional environment spanning AI, cloud, and systems engineering.
• Competitive compensation, flexible work setup, and professional development opportunities.
 
Company: PaladinAi