Product Support Engineer (L3)

8 years

Posted: 2 days ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

About Calfus


At Calfus, we are known for delivering cutting-edge AI agents and products that transform businesses in ways previously unimaginable. We empower companies to harness the full potential of AI, unlocking opportunities they never imagined possible before the AI era. Our software engineering teams are highly valued by customers, whether start-ups or established enterprises, because we consistently deliver solutions that drive revenue growth. Our ERP solution teams have successfully implemented cloud solutions and developed tools that seamlessly integrate with ERP systems, reducing manual work so teams can focus on high-impact tasks.


None of this would be possible without talent like you! Our global teams thrive on collaboration, and we’re actively looking for skilled professionals to strengthen our in-house expertise and help us deliver exceptional AI, software engineering, and enterprise application solutions.


As one of the fastest-growing companies in our industry, we take pride in fostering a culture of innovation where new ideas are always welcome. We are driven and expect the same commitment from our team members. Our speed, agility, and dedication set us apart, and we perform best when surrounded by high-energy, driven individuals.

To continue our rapid growth and deliver an even greater impact, we invite you to apply for our open positions and become part of our journey!


About the role:


The L3 Production Support Engineer is a backend-focused full-stack incident SME responsible for owning complex production incidents, driving root cause analysis, and implementing systemic improvements for the agentic on-call management platform. This role bridges incident command, deep backend engineering, and targeted frontend troubleshooting to ensure platform reliability at scale.


What You’ll Do:


Incident Management & Leadership


  • Own Sev-1/Sev-2 incident response as incident commander or lead resolver, driving swift diagnosis and resolution
  • Lead post-incident RCAs, identifying systemic issues and driving long-term fixes across backend, infrastructure, and UI
  • Establish and refine incident response playbooks, runbooks, and escalation procedures
  • Participate in on-call rotation as primary/secondary responder with accountability for critical systems


Backend & Infrastructure Expertise


  • Perform deep production troubleshooting: log analysis, distributed tracing, metric correlation, and profiling under pressure
  • Diagnose and fix complex issues across microservices: scheduling engine, LLM orchestration, notification pipeline, and integrations
  • Optimize database queries, identify locking issues, and manage migrations in PostgreSQL under production constraints
  • Architect and implement Redis caching, rate limiting, and queue-based patterns for reliability and scale
  • Work with Kubernetes, container orchestration, and deployment pipelines; manage rollbacks and feature toggles during incidents


Full-Stack Incident Resolution


  • Resolve end-to-end incidents regardless of origin (backend API, database, LLM vendor, or React frontend)
  • Debug and ship targeted React fixes when UI is the fastest path to incident resolution
  • Drive code-level improvements in backend services (Python/FastAPI) to harden agent flows, retry logic, and error handling
  • Collaborate closely with dev teams on defects, performance bottlenecks, and architecture-level changes


Observability & Continuous Improvement


  • Design and tune monitoring, alerting, and SLO/SLI frameworks for the platform
  • Maintain and evolve critical runbooks, playbooks, and knowledge base entries as patterns emerge
  • Mentor L2 engineers on deep troubleshooting, escalation discipline, and incident best practices
  • Drive blameless post-mortems and systemic risk reduction across the platform


On your first day, we'll expect you to have:


Backend (Primary Focus)


  • 5–8+ years in backend engineering with strong hands-on experience in Python/FastAPI or equivalent
  • Deep knowledge of async APIs, background jobs, message queues (Celery, RabbitMQ, or similar), and distributed scheduling
  • Production-grade database skills: PostgreSQL query optimization, locking, migrations, and performance tuning
  • Redis expertise: caching patterns, rate limiting, streams, and pub/sub for real-time systems
  • Strong observability and on-call mindset: designing alerts, understanding SLOs/SLIs, error budgets, and Sev definitions
  • Proficiency with Kubernetes, Docker, container orchestration, and CI/CD pipelines (Jenkins, Bitbucket, GitHub Actions)
  • Understanding of cloud infrastructure (Azure preferred) and networking fundamentals


LLM & Agentic Systems


  • Solid grasp of LLM orchestration concepts: prompt engineering, tool-calling, context windows, rate limits, and vendor-specific behavior
  • Experience with LLM failure modes: hallucinations, token limits, timeout patterns, and cost/latency tradeoffs
  • Knowledge of agent frameworks (LangGraph or similar) and how they compose across microservices
  • Ability to debug LLM-driven flows: tracing prompts, understanding retry/backoff behavior, and validating tool outputs


Full-Stack (Secondary but Required)


  • 2–3+ years hands-on with React and TypeScript in production environments
  • Competency reading and modifying existing React code: components, hooks, routing, state management (Redux/Context)
  • Browser debugging skills: DevTools, React DevTools, network throttling, and performance profiling
  • Ability to implement targeted UI fixes: form validation, error handling, API error display, and minor UX hardening
  • Familiarity with frontend build pipelines: Webpack/Vite, environment configs, feature flags, and deployment strategies


Logging, Metrics & Troubleshooting


  • Expert-level log parsing and correlation across services using structured logging (JSON, correlation IDs)
  • Proficiency with observability platforms (Prometheus, Grafana, Datadog, New Relic, or similar)
  • Ability to construct and execute production queries under incident time pressure
  • Strong scripting skills (Bash/Python) for diagnostics, automation, and custom monitoring


Required Soft Skills


  • Incident command maturity: composure under pressure, clear communication, and decisive action during critical outages
  • Technical depth with breadth: deep backend knowledge + sufficient full-stack awareness to own end-to-end incidents
  • Mentorship mindset: capable of raising L2 engineers through code review, pairing, and RCA participation
  • Documentation discipline: ability to capture runbooks, architecture decisions, and lessons learned clearly
  • Cross-functional collaboration: working effectively with dev, SRE, platform, and business teams during incidents


Experience Requirements


  • Minimum 6–10 years in backend/platform/SRE roles, with at least 3 years in production support, incident response, or on-call engineering
  • Proven track record leading Sev-1/Sev-2 incidents in distributed, multi-service systems
  • Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools)
  • Comfortable with continuous on-call rotation and on-demand availability for critical incidents


Nice-to-Have


  • Experience with on-call/incident management platforms (PagerDuty, Squadcast, Opsgenie, or custom solutions)
  • Familiarity with RBAC, SSO, and authentication/authorization patterns
  • Knowledge of RAG (Retrieval Augmented Generation) systems


Success Metrics


  • Incident resolution: Mean Time to Resolution (MTTR) for Sev-2/3 incidents and escalation quality for Sev-1 incidents
  • Runbook effectiveness: % of L2 team successfully using documented runbooks without L3 escalation
  • RCA quality: systemic issues identified and fixed; Sev-1 recurrence rate < 1% within 30 days
  • Mentorship impact: L2 engineers able to independently handle higher-complexity issues over 6–12 months
  • On-call reliability: response times, ticket accuracy, and team feedback on L3 support quality


Benefits:

At Calfus, we value our employees and offer a strong benefits package. This includes medical, group, and parental insurance, coupled with gratuity and provident fund options. Further, we support employee wellness and provide birthday leave as a valued benefit.


Calfus Inc. is an Equal Opportunity Employer.

We believe diversity drives innovation. We’re committed to creating an inclusive workplace where everyone—regardless of background, identity, or experience—has the opportunity to thrive. We welcome all applicants!
