Senior DevOps - Site Reliability Engineer

0 years

0 Lacs

Posted: 2 weeks ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

We’re looking for a hands-on, self-directed Senior DevOps Engineer to join our fast-paced startup. You’ll be the first line of defense for production issues, architect robust observability systems, and improve deployment and testing practices. If you thrive in startup environments, enjoy taking ownership, and are comfortable in modern JS/TS stacks, we’d love to meet you.

Top Outcomes – First 3 Months

- Implement a reliable observability stack: Leverage Grafana, CloudWatch, and OpenTelemetry within our Node.js and TypeScript codebase.
- Be on top of alerts and issues: Monitor, triage, fix or escalate production issues with traceability and follow-up.
- Reduce system noise: Begin reducing the frequency and volume of unexpected errors.

Top Outcomes – First 12 Months

- Improve test coverage: Ensure better code quality and proactively catch regressions.
- Own DevOps workflows: Deploy, debug, and maintain infrastructure health autonomously.
- Become a core team member: Handle incidents independently and support the evolution of our infra/dev culture.
Key Performance Indicators (KPIs)

Leading Indicators:

- Number of alerts and incidents triaged
- Trace IDs investigated and logged
- Bugs found early and resolved
- Tickets opened/closed efficiently
- Reduced volume of unhandled or duplicate errors

Lagging Indicators:

- Production uptime and stability
- % fixes resolved without handoff
- Number of tests added
- Reduction in recurring or duplicate issues

Core Responsibilities

Observability & Alerting

- Maintain and enhance Grafana dashboards
- Integrate and manage CloudWatch alarms and OpenTelemetry traces
- Ensure traceability across all systems (CRM, APIs, webhooks, workflows)

Issue Response & Triage

- Act as first responder for production issues during working hours
- Troubleshoot, escalate with full context, and coordinate incident response

Infrastructure Maintenance

- Improve deployment workflows and monitor resource usage
- Maintain the health of critical subsystems (queues, sync jobs, memory/CPU)

Testing & QA

- Add and improve test coverage once baseline reliability is achieved
- Build confidence in deployments through automated testing and regression checks

Candidate Profile

- Strong experience with Node.js, TypeScript, and React
- Deep knowledge of AWS, Grafana, OpenTelemetry, and CloudWatch
- Prior startup experience preferred
- Clear, proactive communicator with a bias toward ownership
- Available 1:30 AM to 10:30 PM IST 5 days/week for on-call responsibilities
- Bonus: Experience reviewing pull requests and deploying code regularly

Immediate Tasks

- Review and phase-implement an internal RFC for observability
- Refine and own Grafana dashboards; implement meaningful alerts
- Ensure consistent trace ID usage throughout the codebase
- Improve logging and tracing to increase debuggability
- Monitor and respond to production errors daily
- Investigate, fix, or escalate recurring system issues
