Technical Lead - DevOps Engineer

Servicemax

6 - 10 years

13 - 18 Lacs

Pune

Posted: 4 months ago

Skills Required

Automation, Linux, Analytical, Debugging, Incident management, Instrumentation, Information technology, Distribution system, Monitoring, Python

Work Mode

Work from Office

Job Type

Full Time

Job Description

Job Details

As a senior SRE / Observability Engineer, you will be part of the Atlas Platform Engineering team and will:

Create and maintain observability standards and best practices

Review the current observability platform, identify areas for improvement, and guide the team in enhancing monitoring, logging, tracing, and alerting capabilities.

Expand the observability stack across multiple clouds, regions, and clusters, managing all observability data.

Design and implement monitoring solutions for complex distributed systems that provide deep insight into systems and services, with the goal of complete visibility of digital operations.

Support the ongoing evaluation of new capabilities in the observability stack by conducting proofs of concept, pilots, and tests to validate their suitability.

Assist teams in creating clear, informative, and actionable dashboards to improve system visibility.

Automate monitoring and alerting processes, including enrichment strategies and ML-driven anomaly detection where applicable.

Provide technical leadership to the observability team, setting clear priorities and ensuring agreed outcomes are achieved in a timely manner.

Work closely with R&D and product development teams to understand their requirements and challenges and to ensure seamless visibility into system and service performance.

Work closely with the Traffic Management team to identify and standardise on existing and new observability tools as part of a holistic solution

Conduct training sessions and create documentation for internal teams

Support the definition of SLIs (service level indicators) and SLOs (service level objectives) for the Atlas services.

Keep track of the error budget of each service

Participate in the emergency response process

Conduct RCAs (root cause analyses)

Help to automate repetitive tasks and reduce toil.

Qualifications:

People and communication qualifications

Be a strong team player

Have good collaboration and communication skills

Ability to translate technical concepts for non-technical audiences

Problem-solving and analytical thinking

Technical qualifications - general:

Familiarity with cloud platforms (ideally Azure)

Familiarity with Kubernetes and Istio as the architecture on which the observability and Atlas services run, and how they integrate and scale.

Experience with infrastructure as code and automation

Knowledge of common programming languages and debugging techniques

Have a strong technical background and be hands-on.

Experience with Linux and with scripting and programming languages (Bash, Python, Golang).

Strong understanding of DevOps principles.

Technical qualifications - observability

Strong understanding of observability principles (metrics, logs, traces)

Experience with APM tools and distributed tracing

Proficiency in log aggregation and analysis

Knowledge and hands-on experience with monitoring, logging, and tracing tools such as Prometheus, Grafana, Datadog, New Relic, Sumo Logic, ELK Stack, or others

Knowledge of OpenTelemetry, including the OpenTelemetry (OTel) Collector and code instrumentation

Experience designing and building unified observability platforms that let teams use data (metrics, logs, and traces) to determine quickly whether their application or service is operating as desired.

Technical qualifications - SRE

Understanding of the Google SRE principles

Experience in defining SLIs and SLOs

Experience in performing RCAs (root cause analyses)

Experience in system performance

Experience in incident response

Knowledge of status page tools, such as Atlassian Statuspage or similar

Knowledge of incident management and paging tools, such as PagerDuty or similar

Knowledge of ITIL (Information Technology Infrastructure Library) processes

Servicemax

IT Services and IT Consulting

Pleasanton, CA
