Job Description

Key Responsibilities:

Observability Systems Management
Design, deploy, and maintain observability tools and platforms, including monitoring, logging, and tracing systems.
Ensure optimal configuration and performance of observability tools such as Prometheus, Loki, Grafana, the ELK stack (Elasticsearch, Logstash, Kibana), Jaeger, and cloud provider (AWS/GCP/Azure) observability tools.

Monitoring And Alerting
Develop and manage dashboards using Kibana/Grafana, and set up alerts with ElastAlert and Prometheus Alertmanager to monitor the health and performance of applications and infrastructure.
Implement robust alerting mechanisms to detect and report anomalies, outages, and system performance issues in real time.

Logging And Tracing
Implement centralized logging solutions to aggregate logs from various systems and applications.
Develop and maintain distributed tracing solutions to provide end-to-end visibility into system transactions.

Performance Analysis And Optimization
Analyze system performance metrics to identify bottlenecks and performance degradation.
Demonstrate an understanding of SLOs and SLIs.
Work with development and operations teams to remediate performance issues and optimize system performance.

Automation And Scripting
Create automation scripts to streamline observability tasks and processes.
Develop self-healing mechanisms through automated incident response.

Collaboration And Communication
Work closely with development, operations, and SRE teams to align observability solutions with business and technical requirements.
Provide guidance and training on observability tools and best practices to other team members.

Documentation And Reporting
Create and maintain detailed documentation for observability systems, processes, and procedures.
Generate periodic reports and dashboards to provide insights into system performance and reliability.

Qualifications And Experience:

Education
Bachelor's degree in Computer Science, Information Technology, or a related field. Advanced degree preferred.

Experience
Minimum of 4 years of experience in IT infrastructure, with at least 3 years in an observability or monitoring role.
Proven experience in observability engineering, including deploying and managing observability solutions.
Experience with monitoring tools (e.g., Prometheus, Grafana), logging tools (e.g., ELK stack), and tracing tools (e.g., Jaeger, OpenTelemetry).
Experience with cloud platforms such as AWS, Azure, or Google Cloud, and with databases such as MySQL.

Technical Skills
Strong understanding of observability concepts, including metrics, logging, and tracing.
Proficiency in scripting languages such as Bash, Python, Perl, or Go.
Familiarity with containerization (e.g., Docker), orchestration tools (e.g., Kubernetes), and CI/CD pipelines.
Understanding of IP networking and of monitoring network devices (e.g., routers, firewalls).
Experience with infrastructure as code tools (e.g., Terraform, Ansible).

Soft Skills
Excellent problem-solving and analytical skills.
Strong communication and collaboration skills.
Ability to work independently and in a team-oriented environment.

Preferred Qualifications
Experience with APM tools such as New Relic, Datadog, or Dynatrace.
Knowledge of service mesh technologies (e.g., Istio).
Open-source contributions or relevant certifications in observability tools and methodologies.

(ref:hirist.tech)