Posted: 1 week ago | Remote | Full Time
We're a no-code AI platform helping mid-market and enterprise teams ship real value from Generative AI using their own data, fast. Customers love us for quality of answers, ease of use, scalability, and a powerful API surface.
We are shipping features at insane velocity, so we need a software engineer / site reliability engineer who will use new agentic features to ensure quality and reliability without drastically slowing releases.
This role therefore requires using new agentic technology, such as the Claude Agent SDK, to build automations and other quality-enforcing systems.
- High-leverage ownership: Be the go-to engineer for reliability automation across CI/CD, directly moving SLOs for enterprise customers.
- Ship fast, safely: Pair rapid delivery with guardrails you design (feature flags, canaries, shadow traffic) so your work lands in prod quickly and safely.
- Work on the frontier: Build agentic quality systems (e.g., Claude Agent SDK) and LLM-driven test harnesses.
- Broad product surface: Touch API, retrieval/RAG, data syncs, and agent behavior while partnering with Product, GTM, and Customer Success on real deployments.
- Clear career runway: Grow toward Platform, Tech Lead, or Staff paths in the AI industry.
- Flexibility that respects IST: Remote-first India role with limited US overlap.
- Reliability automation: Use agentic tooling (e.g., Claude Agent SDK) to auto-detect regressions, validate releases, and enforce quality gates in CI/CD without slowing velocity.
- Performance & resilience: Design and run load/soak/chaos tests for API, retrieval, and agentic workflows; turn findings into hardening fixes and guardrails.
- Release safety: Build pre-prod validation (synthetic user journeys, canaries, shadow traffic) and post-deploy verifiers; champion rollout strategies (feature flags, blue/green).
- Incident engineering: Improve MTTR with crisp runbooks, auto-remediation playbooks, and blameless postmortems; participate in on-call rotation.
- Experience: 4 to 7 years total (target 5) in software/SRE/production-facing engineering with hands-on ownership of reliability, quality, and release safety.
- Coding: Strong in at least one of Python/TypeScript/Go; comfortable writing test harnesses, CLI tools, and lightweight services.
- Ops mindset: You turn ambiguous reliability risks into concrete milestones, ship guardrails, and iterate quickly.
- Communication: Clear written narratives, crisp status updates, and the ability to influence cross-functionally.
- Location & hours: Based in India (remote), with at least 6 hours of overlap with US Eastern time between 9 AM and 5 PM.
- Agentic quality systems (e.g., Claude Agent SDK) or LLM-based test generation/verification.
- Prior quality/SRE experience.
- Proven track record moving reliability metrics in production systems.
- Growth: Big surface area, real ownership, and a path to shape reliability engineering for our platform.
- Benefits: 2 weeks of paid time off, WFH, and a team that cares about winning together.
- Gear & setup: We'll equip you with what you need to move fast (hardware, tools, and services).
- Use the **Apply** button (it opens the assessment link, which is required to apply).
CustomGPT.ai
Salary: Not disclosed