Senior SRE Engineer – Observability

HariNex Solutions

3 years

9 - 15 Lacs

Chennai

Posted: 1 day ago | Platform: Glassdoor

Skills Required

developer, splunk, engineering, service, reliability, monitoring, development, devops, agile, software, cutting, support, design, analysis, scaling, technology, schedule, resolve, debugging, code, api, deployment, synthesize, tooling, itil, management, documentation, electrical, mathematics, communication, architecture, checks, integration, jenkins, git, ansible, artifactory, jira, aws, kubernetes, automation, chef, metrics, storage, query, python, security

Work Mode

On-site

Job Type

Part Time

Job Description

Job description: Engineer/Senior Engineer – Observability

Location: Chennai (preferred) / Mumbai

Role Type: Contract

Grafana developer expertise (Grafana, Prometheus, Splunk) with 2-3 years of experience

The Engineer/Senior Engineer – Observability Engineering is a key member of the Service Reliability Engineering team. He/she will be ultimately responsible for system observability, reliability monitoring, and reducing time to detect by continuously fine-tuning the monitoring infrastructure of the services our SRE team supports.

As a member of the Reliability Engineering team, you will use proactive and predictive monitoring so that our production and development teams can continue to innovate, spotting small bugs and big disasters before they happen. That’s your main mission as a Monitoring & Observability Engineer. Next to our Elastic community, you’ll be part of our multidisciplinary Innovative Tech team, where DevOps, Agile, Cloud & Software Engineering experts all work together to create remarkable solutions based on cutting-edge technology.

What will you be doing?

Implement, maintain, and consult on the observability and monitoring framework that supports the needs of multiple internal stakeholders.
Manage Opera/Prometheus/Grafana/Splunk to support custom metric delivery dashboards.
Design and build an observability infrastructure for all engineering teams to consume.
Design and develop tools for metric collection, analysis, and reporting.
Educate and lead efforts to improve observability among all engineering teams.
Take responsibility for the availability, performance, scaling, monitoring, and incident response of the FSS technology platform and services.
Ensure the site and services are up 24x7 with no unplanned downtime. Participate in a rotating on-call schedule to troubleshoot and resolve production escalations from our 24x7x365 NOC & Customer Success teams.
Debug code issues based on web service and API responses, errors, events, logs, etc. Monitor and optimize application performance within the deployment architecture.
Identify and collect the appropriate measurements, and synthesize the correct queries, to show intuitive and insightful visualizations which characterize the behaviour of complex systems
Continue evolving monitoring tooling toward a standards-based self-service automated platform and come up with creative solutions to solve problems
Ensure proper reviews are in place to minimise the Mean Time to Recover (MTTR) and maximise the Mean Time to Failure (MTTF).
Implement ITIL processes such as incident management, problem management, and change management.
Add, tune, and maintain alert configurations and documentation as needed.

What you will bring along

BS/MS/MCA degree in Computer Science, Electrical & Computer Engineering, or Mathematics, or equivalent experience.
3-8 years of relevant reliability engineering work experience at an online technology company.
Ability to understand the business services and map them to reliability engineering design and review
Excellent analytical, problem-solving and communication skills
Driven and self-motivated, able to work creatively to solve challenging problems.
Experience with design and implementation of Continuous Delivery and/or DevOps solutions or architecture patterns.
Experience with code repository management, code merge and quality checks, continuous integration, and automated deployment & management using tools like Jenkins, Git, Ansible, Artifactory, Jira, Sonar
Abreast of industry standards and trends related to telemetry and software pipelines
Experience rationalizing and implementing monitoring and observability toolchain at enterprise scale
Previous experience with public cloud and infrastructure-as-code tooling (AWS, Terraform)
Knowledge and experience of containers and Kubernetes clusters
Hands-on experience consolidating application and system logs at enterprise scale
Experience with automation tools (Chef, Ansible)
Experience with metrics exporters and integrations
Experience with metrics collection and storage (Prometheus, InfluxDB)
Experience with log collection and storage (ELK, Splunk, Logstash)
Experience with metric and log query languages (PromQL, LogQL, Sumo Logic)
Experience with alert and notification management (Alertmanager, PagerDuty, Teams integrations)

Experience with building dashboards (Grafana, Loki, Sumo Logic, Tenable)
Proven development background with Go, Python, Shell or Java
Security awareness, with an emphasis on designing for security best practices

Job Type: Contractual / Temporary
Contract length: 12 months

Pay: ₹80,000.00 - ₹130,000.00 per month

Work Location: In person
