This isn't your average DevOps role. This isn't just about pipelines or cloud provisioning. This is about engineering the backbone of Agentic AI systems that drive the next generation of enterprise SaaS, where conversational interfaces, dynamic UIs, and intelligent agents operate seamlessly on AWS Serverless infrastructure, with deep integration into Salesforce and cross-agent protocols. This is for builders with something to prove. For engineers who've gone beyond cloud fluency to orchestrate complex, multi-agent ecosystems, and who want to shape how enterprise applications are deployed, debugged, scaled, and observed in real time. If you're driven by deep automation, passionate about creating fault-tolerant agentic systems, and thrive where innovation is the expectation, not the exception, you're in the right place. Join us to redefine SaaS infrastructure and champion a new era of AI-powered, product-led enterprise experiences.
The Role
We are seeking a hands-on Agentic AI Ops Engineer who thrives at the intersection of cloud infrastructure, AI agent systems, and DevOps automation. In this role, you will build and maintain the CI/CD infrastructure for Agentic AI solutions using Terraform on AWS, while also developing, deploying, and debugging intelligent agents and their associated tools. This position is critical to ensuring scalable, traceable, and cost-effective delivery of agentic systems in production environments.
The Responsibilities
CI/CD Infrastructure for Agentic AI
- Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools.
- Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and consistent environment parity across dev/test/prod.
Agent Development & Debugging
- Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production.
- Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops.
- Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors.
Monitoring, Tracing & Reliability
- Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking.
- Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time.
- Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis.
Cost Optimization & Usage Analysis
- Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs.
- Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance.
Collaboration & Continuous Improvement
- Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows.
- Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments.
- Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability.
- 2+ years of experience in DevOps, MLOps, or Cloud Infrastructure with exposure to AI/ML systems.
- Deep expertise in AWS serverless architecture, including hands-on experienc