Principal Service Reliability Engineer

10 years

5 - 7 Lacs

Posted: 5 days ago | Platform: Glassdoor

Work Mode

On-site

Job Type

Part Time

Job Description

Key Responsibilities

  • End-to-end service ownership: design for telemetry, security, resiliency, scalability, and performance; lead sizing/architecture; drive service health reviews and process simplification.
  • Incident management and prevention: lead postmortems/RCAs, coordinate fixes, define repair items, and implement data-driven prevention and continuous improvement.
  • AI/ML and GenAI delivery: design and integrate solutions with LLMs, RAG, agentic workflows, and conversational AI; build low-latency model serving and retraining pipelines.
  • Application engineering: develop performant microservices for distributed, containerized, cloud-native systems.
  • Automation: eliminate toil by automating operational workflows, recovery procedures, code delivery, and configuration management; build internal tools and reusable scripts/services to accelerate delivery and reduce errors.
  • Observability: define and implement monitoring, logging, alerting, and tracing strategies; establish SLOs/SLIs/error budgets; improve diagnostics and performance visibility for rapid triage.
  • Cross-functional collaboration: partner with product, operations, and data teams to translate requirements into secure, scalable solutions; communicate effectively with technical and non-technical stakeholders.

Minimum Qualifications

  • BS/MS in Computer Science or related field; 10+ years of software engineering in cloud environments.
  • Strong skills in distributed systems and microservices using Java or Python; SQL and data modeling; Python for AI and automation.
  • SRE/DevOps expertise: systems and networking fundamentals, application security, observability, performance analysis, and incident response.
  • Proven SDLC excellence: code quality, reviews, version control, CI/CD, testing, and release engineering.
  • Excellent written and verbal communication; English fluency.

Preferred/Technical Skills

  • AI/ML/GenAI: experience with foundation models, RAG, and agentic architectures; model deployment, optimization, monitoring, and retraining.
  • Cloud and containers: experience with containerization, orchestration, and resilient, fault-tolerant microservices.
  • Observability: hands-on experience designing dashboards, alerts, traces, logs, and metrics; defining SLOs/SLIs and error budgets; on-call readiness and runbook quality.
  • Operations: performance tuning across Java, Python, and SQL for large-scale enterprise applications; strong Linux/Unix expertise; capacity planning and reliability reviews.
  • Automation and scripting: proficiency in scripting to automate operational workflows, build tooling, and run CI/CD tasks (e.g., shell scripting, Python, configuration-as-code, task runners).
  • Familiarity with enterprise ERP applications and standard DevOps tooling and practices.

Oracle

Information Technology

Redwood City
