L2 Production Support Engineer

4 years

2 - 6 Lacs

Posted: 1 day ago | Platform: Glassdoor

Work Mode

On-site

Job Type

Part Time

Job Description

Pune
About Calfus
At Calfus, we are known for delivering cutting-edge AI agents and products that transform businesses in ways previously unimaginable. We empower companies to harness the full potential of AI, unlocking opportunities they never imagined possible before the AI era. Our software engineering teams are highly valued by customers, whether start-ups or established enterprises, because we consistently deliver solutions that drive revenue growth. Our ERP solution teams have successfully implemented cloud solutions and developed tools that seamlessly integrate with ERP systems, reducing manual work so teams can focus on high-impact tasks.

None of this would be possible without talent like you! Our global teams thrive on collaboration, and we’re actively looking for skilled professionals to strengthen our in-house expertise and help us deliver exceptional AI, software engineering, and enterprise application solutions.

As one of the fastest-growing companies in our industry, we take pride in fostering a culture of innovation where new ideas are always welcomed—without hesitation. We are driven and expect the same dedication from our team members. Our speed, agility, and dedication set us apart, and we perform best when surrounded by high-energy, driven individuals.
To continue our rapid growth and deliver an even greater impact, we invite you to apply for our open positions and become part of our journey!

About the role:

The L2 Production Support Engineer is the frontline operational backbone of the agentic on-call platform, responsible for incident triage, runbook-based remediation, and clean escalation to L3. This role owns steady-state operations, maintains operational documentation (runbooks and guides), and ensures smooth incident workflows for the on-call team.

What You’ll Do:

Incident Triage & Operational Response

  • Own initial triage for Sev-2/3/4 incidents and user-reported issues, including ticket classification and reproduction
  • Follow established runbooks to remediate common issues (service restarts, config toggles, data corrections, cache clears)
  • Monitor dashboards, alert streams, and on-call channels; acknowledge alerts and coordinate initial response
  • Participate in on-call rotation for non-Sev-1 issues and serve as secondary responder during major incidents
  • Provide clear, timely communication to users and stakeholders during incident resolution

Escalation & Handoff Excellence

  • Escalate complex or novel issues to L3 with complete context: timeline, hypothesis, attempted steps, relevant logs, and metrics
  • Document escalations clearly for incident tracking and post-incident review
  • Ensure escalations include sufficient detail that L3 can pick up work without requiring clarification
  • Learn from L3's solutions and incorporate new findings into runbooks and knowledge base
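
As an illustration of the escalation context listed above, here is a minimal sketch of a handoff record with a completeness check. The class name, field names, and the check itself are assumptions made for this example, not part of any Calfus tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Escalation:
    """Hypothetical L2-to-L3 handoff record; all field names are illustrative."""

    summary: str
    timeline: list[str] = field(default_factory=list)
    hypothesis: str = ""
    attempted_steps: list[str] = field(default_factory=list)
    log_links: list[str] = field(default_factory=list)
    metric_links: list[str] = field(default_factory=list)

    def missing_context(self) -> list[str]:
        """Return the names of context fields that are still empty."""
        required = {
            "timeline": self.timeline,
            "hypothesis": self.hypothesis,
            "attempted_steps": self.attempted_steps,
            "log_links": self.log_links,
            "metric_links": self.metric_links,
        }
        return [name for name, value in required.items() if not value]


# Example: a draft escalation that is not yet ready to hand off.
draft = Escalation(
    summary="Notifications delayed for one tenant",
    timeline=["10:02 alert fired", "10:05 acknowledged"],
    hypothesis="Queue backlog after scheduler restart",
)
print(draft.missing_context())
```

A check like this could run before a ticket is reassigned, so that L3 does not need to ask for missing details.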

Operational Guide & Runbook Maintenance

  • Own and maintain the Operational Guide for the agentic on-call platform: standard procedures, troubleshooting flows, and decision trees
  • Create and update runbooks for recurring issues, preventive maintenance, and escalation patterns discovered through incidents
  • Regularly review and refine existing runbooks based on L2/L3 feedback and incident retrospectives
  • Test runbook accuracy quarterly and flag ambiguities or outdated instructions to the L2 team lead
  • Collaborate with L3 engineers to capture complex fixes as simplified runbooks for future L2 use
  • Maintain a knowledge base of common user issues and L2-resolvable solutions

Monitoring & Alerting Support

  • Monitor key dashboards during shifts and validate alert accuracy (reduce false positives, tune thresholds)
  • Report missing or broken alerts to L3 for engineering fixes
  • Provide operational feedback on alerting gaps discovered during incidents
  • Assist in testing new alerts or monitoring rules before production deployment

Technical Troubleshooting (Within Runbook Scope)

  • Read and interpret logs, metrics, and dashboards to correlate incident signals and narrow down root-cause hypotheses
  • Execute safe runbook-based fixes: service restarts, configuration toggles, safe data queries, and cache clears
  • Apply L3-provided remediation steps for known failure patterns
  • Document troubleshooting steps taken to build context for escalations

Team Collaboration & Knowledge Sharing

  • Participate in incident post-mortems and RCA discussions, contribute observations from initial triage
  • Share learnings with the L2 team through knowledge base updates and team sync meetings
  • Mentor and support newer L2 engineers through pairing and code review of runbook contributions
  • Provide constructive feedback on operational processes and suggest improvements

On your first day, we'll expect you to have:

Operational & Monitoring

  • 4–8+ years in application support, operational support, or platform operations roles
  • Strong dashboard reading and alert interpretation skills; ability to spot anomalies and correlate signals
  • Proficiency with on-call and ticketing tools: PagerDuty, Jira, ServiceNow, or similar
  • Familiarity with observability platforms: Prometheus, Grafana, Datadog, New Relic, or equivalent
  • Comfortable reading structured logs (JSON format) and using log aggregation platforms (ELK, Datadog, etc.)
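
To give a sense of the structured-log reading mentioned above, here is a minimal sketch that counts error entries per service in JSON-lines output. The `level` and `service` keys and the sample entries are assumptions for illustration; real log schemas vary by platform.

```python
import json
from collections import Counter


def count_errors_by_service(lines):
    """Count ERROR-level entries per service in JSON-lines log output.

    Assumes each entry has "level" and "service" keys (hypothetical names).
    Lines that are not valid JSON objects are skipped rather than raising.
    """
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if str(entry.get("level", "")).upper() == "ERROR":
            counts[entry.get("service", "unknown")] += 1
    return counts


# Made-up sample lines, including one that is not JSON.
sample = [
    '{"level": "error", "service": "scheduler", "msg": "job timed out"}',
    '{"level": "info", "service": "notifier", "msg": "sent"}',
    "not json at all",
    '{"level": "ERROR", "service": "scheduler", "msg": "job timed out"}',
]
print(count_errors_by_service(sample).most_common())
```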

Platform & Backend Understanding

  • Solid working knowledge of the agentic on-call platform architecture: core services, job scheduler, LLM orchestration, notification pipeline
  • Basic understanding of microservices: how they communicate, common failure modes, and escalation paths
  • Comfortable with Linux command line basics: SSH, file navigation, process inspection, basic grep/awk for log parsing
  • Familiarity with containerization and orchestration: Docker and Kubernetes at an operational level (restart pods, check logs, review resource usage)
  • Basic SQL read-only skills: able to run safe SELECT queries to validate data, check state, and gather troubleshooting context under runbook guidance
  • Understanding of CI/CD basics: awareness of deployment pipelines, rollback procedures, and config toggle mechanics

LLM & Agentic Systems (Operational Level)

  • Exposure to LLM/agent usage patterns: understanding of tool-calling, context limits, rate limits, and vendor API quirks
  • Familiarity with common LLM failure modes: hallucinations, token exhaustion, timeouts, and vendor-specific rate-limiting
  • Ability to follow troubleshooting flows for agent-driven incidents (prompt tracing, tool execution validation, fallback behavior)

Incident & On-Call Workflows

  • Understanding of incident classification (Sev-1/2/3/4) and appropriate escalation criteria
  • Knowledge of on-call best practices: communication protocols, incident documentation, and post-mortem participation
  • Comfortable with asynchronous and shift-based work; reliable responder with good alert acknowledgment habits

Communication & Documentation

Required Soft Skills

  • Customer-focused mindset: empathy for users and urgency in resolving their issues
  • Detail-oriented: accurate note-taking during incidents and meticulous runbook following
  • Proactive learner: ability to absorb new technologies, platforms, and troubleshooting patterns quickly
  • Collaborative: works well with L3 engineers, dev teams, and other operational teams
  • Shift-friendly: reliable availability during on-call rotations, including nights/weekends as scheduled
  • Humble & curious: asks clarifying questions, escalates appropriately, and doesn't hesitate to ask for help

Experience Requirements

  • Minimum 2–4 years in application/production support, technical support, operations, or platform engineering roles
  • Proven experience with incident triage, ticket management, and on-call workflows
  • Prior exposure to on-call systems or incident management platforms (PagerDuty, Squadcast, or custom)
  • Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools) is a plus
  • Comfortable working shift-based on-call rotation (evenings, nights, weekends, as scheduled)

Nice-to-Have

  • Prior experience in a Global Capability Center or consulting firm environment
  • Familiarity with incident severity frameworks and SLO/SLI concepts
  • Exposure to multiple monitoring and observability tools
  • Basic scripting (Python or bash) for custom diagnostics and automation
  • Experience writing operational procedures or internal documentation

Operational Guide & Runbook Responsibilities (Detailed)
Creating New Runbooks

  • When a recurring issue is identified (by L2 or L3), collaborate to create a step-by-step runbook
  • Ensure runbooks are clear, actionable, and safe for L2 execution without requiring L3 escalation
  • Include decision trees: "if X, do Y; if Z, escalate to L3"
  • Test runbook accuracy by walking through it with a peer before publishing
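
The "if X, do Y; if Z, escalate to L3" pattern can be sketched as a small function. The symptoms and actions below belong to a hypothetical "notifications delayed" runbook and are assumptions for illustration only.

```python
def next_action(symptom, restart_already_tried=False):
    """Toy decision tree for a hypothetical 'notifications delayed' runbook."""
    if symptom == "queue_backlog":
        # A single restart is the safe L2 step; a repeat failure goes to L3.
        if not restart_already_tried:
            return "restart the notification worker"
        return "escalate to L3"
    if symptom == "stale_cache":
        return "clear the cache"
    # Anything the runbook does not cover is escalated.
    return "escalate to L3"


print(next_action("queue_backlog"))
print(next_action("queue_backlog", restart_already_tried=True))
print(next_action("unrecognized symptom"))
```

Writing the tree so that every unlisted case falls through to escalation keeps the runbook safe when an engineer meets a symptom it does not describe.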

Maintaining Existing Runbooks

  • Review runbooks quarterly for accuracy and relevance; update if processes or tool names have changed
  • Flag outdated runbooks during team syncs (e.g., "This runbook references an old dashboard URL")
  • Incorporate feedback from L3 when they fix complex issues: simplify complex fixes into runbook steps for future L2 use

Updating the Operational Guide

  • Maintain a single, authoritative Operational Guide covering:
    • Platform architecture overview (high-level, non-code)
    • Alert guide: what each alert means, typical causes, and first-response steps
    • Runbook index: list of all runbooks with quick-reference links
    • Troubleshooting decision tree: common symptoms mapped to the runbook to follow
    • Escalation criteria and process
    • On-call procedures and communication protocols
    • Known issues and workarounds
  • Update the guide when new features deploy, alerts change, or new runbooks are created
  • Conduct semi-annual reviews of the guide to ensure accuracy and completeness

Knowledge Base Management

  • Maintain a searchable knowledge base (wiki, Notion, Confluence, or similar) with:
    • Common user issues and L2-resolvable solutions
    • Frequently asked questions with step-by-step answers
    • Post-incident summaries (non-sensitive) to share learnings
    • Troubleshooting checklists organized by symptom
  • Encourage L2 team members to contribute findings and suggest improvements
  • Archive or deprecate outdated entries quarterly

Success Metrics

  • Incident response: Mean Time to Acknowledgment (MTTA) and Mean Time to Escalation (MTTE) for triage decisions
  • Runbook effectiveness: % of L2 team able to resolve tickets using runbooks without escalation; reduction in "unknown" escalations
  • Documentation quality: User and L3 feedback on runbook clarity and accuracy; reduced escalations due to missed troubleshooting steps
  • Operational guide updates: Guide reviewed and refreshed quarterly; 0 outdated procedures in active rotation
  • On-call reliability: response times, ticket accuracy, and team feedback on L2 availability and professionalism
  • Knowledge base engagement: number of contributions per quarter, search usage, and user satisfaction with knowledge base accuracy
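
As a rough illustration of the incident-response metrics above, the sketch below computes MTTA and MTTE from timestamps. It assumes both are measured from the moment the alert fired, and the record keys and sample data are made up for the example; the posting does not define either.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Mean elapsed minutes between two ISO timestamps.

    Incidents missing either timestamp are left out; returns None if none qualify.
    """
    deltas = [
        (
            datetime.fromisoformat(incident[end_key])
            - datetime.fromisoformat(incident[start_key])
        ).total_seconds() / 60
        for incident in incidents
        if incident.get(start_key) and incident.get(end_key)
    ]
    return mean(deltas) if deltas else None


# Hypothetical incident records; the second one was resolved without escalation.
incidents = [
    {"alerted": "2024-01-01T10:00:00", "acked": "2024-01-01T10:04:00",
     "escalated": "2024-01-01T10:30:00"},
    {"alerted": "2024-01-01T12:00:00", "acked": "2024-01-01T12:02:00",
     "escalated": None},
]
mtta = mean_minutes(incidents, "alerted", "acked")
mtte = mean_minutes(incidents, "alerted", "escalated")
print(mtta, mtte)
```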

Sample interview questions:

1. Dashboard Reading: Given a dashboard screenshot, identify what metrics matter most and what they tell you about platform health.
2. Escalation Quality: Describe an escalation you wrote to L3. What context did you include? What would you improve?
3. LLM Failure Mode: An agent is timing out. Walk through your initial troubleshooting using logs and dashboards.
4. On-Call Shift: Describe your ideal on-call shift. What tools do you use? How do you stay alert and responsive?
5. Knowledge Sharing: How would you mentor a new L2 engineer on your team?

Benefits:
At Calfus, we value our employees and offer a strong benefits package. This includes medical, group, and parental insurance, coupled with gratuity and provident fund options. Further, we support employee wellness and provide birthday leave as a valued benefit.

Calfus Inc. is an Equal Opportunity Employer.
We believe diversity drives innovation. We’re committed to creating an inclusive workplace where everyone—regardless of background, identity, or experience—has the opportunity to thrive. We welcome all applicants!
