Job Description
Prometheus & Grafana Engineer

Job Purpose Description: The Engineer develops, supports, and maintains the Grafana, Prometheus, and Thanos metrics, reporting, and monitoring systems to create a single-pane-of-glass dashboarding and alerting environment.
Work as part of a cross-functional team and participate in the design and development of the Grafana dashboard UI. Integrate logs and events into the Grafana single-pane-of-glass monitoring environment. The Engineer plays a critical role in monitoring Unix, Linux, and Windows operating systems and virtual machine subsystem software. Automation is an essential principle the Engineer should apply so the team can continue engineering the high-performance, scalable solutions that power the business.

As an SME/L3 Engineer with a deep understanding of Grafana and Prometheus, you will be responsible for maintaining, optimizing, and advancing our monitoring and observability systems. The role involves monitoring enterprise servers using Prometheus and creating operational dashboards using Grafana. Your expertise will be critical in ensuring the reliability, performance, and scalability of our infrastructure. You will own the overall health, availability, and configuration of the Grafana and Prometheus solutions.

Knowledge, Skills & Abilities
o Skill in crafting graphics using Grafana dashboards.
o Experience and proven ability in crafting professional dashboards that give insight into sophisticated data sets.
o Understanding of multiple approaches to data storage, such as Prometheus, InfluxDB, or NoSQL databases, and experience analyzing sophisticated time series data sets.
o Hands-on experience with monitoring tools for application, service, infrastructure, and data quality monitoring.
o Good understanding of long-term storage and reporting dashboards for metrics in Grafana.
o Development of real-time monitoring solutions for various environments.
o Experience using Prometheus, Grafana, Loki, and InfluxDB APIs, exporters, and libraries.
o Crafting automation to capture data from multiple sources.
o Good understanding of monitoring.
o Good attitude and troubleshooting skills that span systems, networks, and applications.
o Experience integrating tools into pipelines used for Continuous Integration (CI).
o Scripting ability in Perl, PowerShell, Unix shell scripting, etc.
o Experience with centralized logging systems and with Elasticsearch would be a strong advantage.
o Familiarity with programming languages such as Python, Go, etc. is a plus.
o Experience with web and application server technology.
o Good knowledge of Linux; AIX operating systems a plus.
o Good knowledge of Kubernetes, Docker, and containers.
o Good knowledge of DevOps tools (Ansible, Jenkins, GitHub, Bitbucket, Artifactory, and monitoring with Prometheus and Grafana).
o Good knowledge of open source tools.

Education and Experience
o Bachelor's degree or equivalent experience and/or education.
o 7-10 years of experience required.

Key responsibilities:

1. Grafana and Prometheus Administration:
o Configure, maintain, and scale Grafana and Prometheus instances.
o Develop and implement custom dashboards for monitoring key metrics.
o Troubleshoot issues, ensure data accuracy, and optimize query performance.

2. Monitoring and Alerting:
o Design and manage alerting rules for proactive issue identification and resolution.
o Continuously improve and expand monitoring coverage to meet evolving needs.
o Collaborate with teams to define alert thresholds and escalation procedures.

3. Data Analysis and Visualization:
o Analyze metrics data to identify performance bottlenecks and areas for improvement.
o Create meaningful visualizations and reports to provide insights for stakeholders.
o Contribute to the enhancement of data retention and archiving strategies.

4. Scaling and Optimization:
o Collaborate with the infrastruc