Overview
We are looking for a hands-on Agentic AI Infrastructure & Orchestration Engineer to design, develop, and optimize intelligent, self-operating multi-agent frameworks. The role combines deep system-level thinking, applied machine learning, and infrastructure engineering to build scalable agentic AI ecosystems that integrate LLMs, vector databases, retrieval pipelines, and orchestration layers across cloud environments.
Key Responsibilities
- Architect and deploy agentic multi-agent AI frameworks.
- Develop scalable pipelines integrating LLMs, RAG, vector databases, and agents.
- Build and deploy MCP (Model Context Protocol) servers for agentic AI agents and integrations.
- Build observability, latency optimization, and performance monitoring systems.
- Implement self-refine / feedback loop learning architectures.
1. Multi-Agent System Architecture & Deployment
- Architect, design, and deploy agentic multi-agent frameworks where multiple AI agents collaborate autonomously.
- Design and implement inter-agent communication protocols, coordination strategies, and workflow orchestration layers.
- Integrate with frameworks such as LangGraph, CrewAI, AutoGen, or Swarm to develop distributed, event-driven agentic ecosystems.
- Develop containerized deployments (Docker / Kubernetes) for multi-agent clusters running in hybrid or multi-cloud environments.
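As a rough illustration of the coordination patterns this section describes, the sketch below shows two agents exchanging messages over a toy in-process bus. All names here (MessageBus, planner, worker) are hypothetical; a production system would use Kafka, Redis Streams, or a framework such as LangGraph or AutoGen rather than this simplified stand-in.

```python
import asyncio

class MessageBus:
    """Toy in-process pub/sub bus standing in for a real event backbone."""
    def __init__(self):
        self.queues: dict[str, asyncio.Queue] = {}

    def register(self, agent: str) -> asyncio.Queue:
        # Each agent gets its own inbox queue.
        self.queues[agent] = asyncio.Queue()
        return self.queues[agent]

    async def send(self, to: str, msg: dict):
        await self.queues[to].put(msg)

async def planner(bus: MessageBus, inbox: asyncio.Queue) -> str:
    # The planner delegates a subtask to the worker, then awaits the result.
    await bus.send("worker", {"from": "planner", "task": "summarize"})
    reply = await inbox.get()
    return reply["result"]

async def worker(bus: MessageBus, inbox: asyncio.Queue) -> None:
    # The worker consumes a task message and reports back to the planner.
    msg = await inbox.get()
    await bus.send("planner", {"from": "worker", "result": f"done:{msg['task']}"})

async def main() -> str:
    bus = MessageBus()
    planner_inbox = bus.register("planner")
    worker_inbox = bus.register("worker")
    result, _ = await asyncio.gather(
        planner(bus, planner_inbox),
        worker(bus, worker_inbox),
    )
    return result

result = asyncio.run(main())
```

The same request/reply pattern scales to many agents once the queues are backed by a durable broker and each agent runs in its own container.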
2. Intelligent Pipeline Development
- Build end-to-end scalable pipelines integrating LLMs, RAG, vector databases, and agents, ensuring optimal latency and retrieval quality.
- Implement retrieval-augmented generation (RAG) architectures using FAISS, Chroma, Weaviate, Milvus, or Pinecone.
- Develop embedding generation, storage, and query pipelines using OpenAI, Hugging Face, or local LLMs.
- Orchestrate data movement, context caching, and memory persistence for agentic reasoning loops.
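The retrieval flow above can be sketched end to end with a toy in-memory store. The embedding function and vector store here are deliberately simplistic stand-ins (a real pipeline would call an embedding model and one of FAISS, Chroma, Weaviate, Milvus, or Pinecone); the point is the embed → store → query → prompt-assembly shape.

```python
import math

def embed(text: str) -> list[float]:
    # Stub embedding: normalized letter-frequency vector. A real pipeline
    # would call an embedding model (OpenAI, Hugging Face, or a local LLM).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

class InMemoryVectorStore:
    """Illustrative stand-in for a real vector database."""
    def __init__(self):
        self.docs: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        self.docs.append((embed(text), text))

    def query(self, question: str, k: int = 2) -> list[str]:
        qv = embed(question)
        ranked = sorted(self.docs, key=lambda d: cosine(qv, d[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def build_prompt(question: str, store: InMemoryVectorStore) -> str:
    # A real pipeline would send this assembled prompt to an LLM; here we
    # just return it to show the RAG flow.
    context = store.query(question)
    return f"Context: {' | '.join(context)}\nQuestion: {question}"

store = InMemoryVectorStore()
store.add("Kubernetes schedules containers across a cluster.")
store.add("Vector databases store embeddings for similarity search.")
prompt = build_prompt("How are embeddings stored?", store)
```

Context caching and memory persistence would sit between `query` and prompt assembly, so repeated agent reasoning loops avoid re-retrieving the same context.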
3. Agentic Infrastructure & Orchestration
- Build and maintain MCP (Model Context Protocol) servers for Agentic AI agents and integrations.
- Develop APIs, microservices, and serverless components for flexible integration with third-party systems.
- Implement distributed task scheduling and event orchestration using Celery, Airflow, Temporal, or Prefect.
4. Observability, Performance, and Optimization
- Build observability stacks for multi-agent systems with centralized logging, distributed tracing, and metrics visualization.
- Optimize latency, throughput, and inference cost across LLM and RAG layers.
- Implement performance benchmarking and automated regression testing for large-scale agent orchestration.
- Monitor LLM response quality, drift, and fine-tuning performance through continuous feedback loops.
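A minimal version of the latency-tracking piece above is sketched below. The `LatencyMonitor` class is illustrative only; in production these samples would be exported as metrics to Prometheus/Grafana or emitted via OpenTelemetry rather than held in process.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

class LatencyMonitor:
    """Toy in-process latency tracker with percentile readout."""
    def __init__(self):
        self.samples: dict[str, list[float]] = defaultdict(list)

    def record(self, name: str, seconds: float) -> None:
        self.samples[name].append(seconds)

    @contextmanager
    def timed(self, name: str):
        # Context manager that records wall-clock duration of a block,
        # e.g. a single LLM or retrieval call.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def p95(self, name: str) -> float:
        # Nearest-rank 95th percentile over recorded samples.
        data = sorted(self.samples[name])
        idx = max(0, math.ceil(0.95 * len(data)) - 1)
        return data[idx]

monitor = LatencyMonitor()
for _ in range(5):
    with monitor.timed("llm_call"):
        time.sleep(0.001)  # stand-in for an LLM/RAG call
```

Tail percentiles (p95/p99) rather than averages are what matter for agent orchestration, since one slow agent call stalls every downstream step in the workflow.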
5. Self-Refining & Feedback Loop Architectures
- Implement self-refining / reinforcement learning feedback mechanisms for agents to iteratively improve their performance.
- Integrate auto-evaluation agents to assess output correctness and reduce hallucination.
- Design memory systems (episodic, semantic, long-term) for adaptive agent learning and contextual persistence.
- Experiment with tool-use capabilities, chaining, and adaptive reasoning strategies to enhance autonomous capabilities.
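The generate → evaluate → refine loop described above can be expressed generically as below. The three callables are stand-ins for LLM calls (drafting, auto-evaluation, revision); the toy task here just grows a string to a target length so the control flow is checkable.

```python
def self_refine(task, generate, evaluate, refine, max_iters: int = 3):
    """Generic self-refine loop: draft an output, score it, and revise
    until the evaluator accepts it or the iteration budget is spent.
    generate/evaluate/refine stand in for LLM calls in a real system."""
    output = generate(task)
    history = [output]
    for _ in range(max_iters):
        score, feedback = evaluate(task, output)
        if score >= 1.0:  # evaluator accepts the current output
            break
        output = refine(task, output, feedback)
        history.append(output)
    return output, history

# Toy instantiation: the "task" is to produce a string of a target length.
target = 10
generate = lambda task: "x" * 4
def evaluate(task, out):
    if len(out) >= task:
        return 1.0, ""
    return len(out) / task, "too short"
refine = lambda task, out, feedback: out + "xxx"

final, history = self_refine(target, generate, evaluate, refine, max_iters=5)
```

Swapping the evaluator for an auto-evaluation agent (and persisting `history` to episodic memory) turns this skeleton into the feedback-loop architecture the bullets describe.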
Technical Skills Required
- Programming: Expert-level Python (async, multiprocessing, API design, performance tuning).
- LLM Ecosystem: Familiarity with OpenAI, Anthropic, Hugging Face, Ollama, LangChain, LangGraph, CrewAI, or AutoGen.
- Databases: VectorDBs (FAISS, Weaviate, Milvus, Pinecone), NoSQL (MongoDB, Redis), SQL (PostgreSQL, MySQL).
- Cloud Platforms: AWS / Azure / GCP; experience with Kubernetes, Docker, Terraform, and serverless architecture.
- Observability: Prometheus, Grafana, OpenTelemetry, ELK Stack, Datadog, or New Relic.
- CI/CD & DevOps: GitHub Actions, Jenkins, ArgoCD, Cloud Build, and testing frameworks (PyTest, Locust, etc.).
- Other Tools: FastAPI, gRPC, REST, Kafka, Redis Streams, or event-driven frameworks.
Preferred Experience
- Experience designing agentic workflows or AI orchestration systems in production environments.
- Background in applied AI infrastructure, ML Ops, or distributed system design.
- Exposure to RAG-based conversational AI or autonomous task delegation frameworks.
- Strong understanding of context management, caching, and inference optimization for large models.
- Experience with multi-agent benchmarking or simulation environments.
Soft Skills
- Ability to translate conceptual AI architectures into production-grade systems.
- Strong problem-solving and debugging capabilities in distributed environments.
- Collaboration mindset – working closely with AI researchers, data scientists, and backend teams.
- Passion for innovation in agentic intelligence, orchestration systems, and AI autonomy.
Education & Experience
- Bachelor’s or Master’s in Computer Science, AI/ML, or related technical field.
- 5+ years of experience in backend, cloud, or AI infrastructure engineering.
- 2+ years in applied AI or LLM-based system development preferred.
Nice-to-Haves
- Knowledge of Reinforcement Learning from Human Feedback (RLHF) or self-improving AI systems.
- Experience deploying on-premise or private LLMs or integrating custom fine-tuned models.
- Familiarity with graph-based reasoning or knowledge representation systems.
- Understanding of AI safety, alignment, and autonomous agent governance.