Observability Engineer

0 years

0 Lacs

Posted: 1 day ago | Platform: Foundit

Work Mode

On-site

Job Type

Full Time

Job Description

We are hiring an Observability Engineer (Dashboarding & Analytics Developer) with experience in Splunk, technical KPIs (application or API metrics), and logs.

WORK LOCATION :

Budget:

Notice period: Immediate to 20 days

MUST BE INCLUDED WITH SUBMITTAL :

  • Full Legal Name
  • Phone
  • Email
  • Current Location
  • Rate
  • Work Authorization
  • Willing to relocate
  • Which tool the candidate used to create dashboards/visualizations

Observability Engineer (Dashboarding & Analytics Developer)

The JedAI team is at the forefront of developing cutting-edge generative AI platforms that connect to Large Language Models (LLMs), agents, knowledge bases, and Multi-Channel Processing (MCP) servers. Our mission is to harness the power of generative AI to deliver innovative solutions that drive efficiency, safety, and intelligence across various applications.

  • Job Description
  • We are seeking a highly skilled Dashboarding and Analytics Developer to join our JedAI team. In this role, you will be responsible for the visualization and development of Key Performance Indicators (KPIs) that are critical to monitoring and enhancing the performance of our generative AI systems. You will develop and maintain comprehensive dashboards that provide real-time insights into the performance of LLMs, Retrieval Augmented Generation (RAG) systems, safety mechanisms, other generative AI features, billing, token consumption, and more.
  • Dashboard Development: Design, develop, and maintain interactive and user-friendly dashboards for monitoring AI system performance.
  • KPI Identification: Collaborate with cross-functional teams to define and implement KPIs related to LLMs, RAG systems, safety protocols, and other AI features.
  • Data Visualization: Create clear and insightful visualizations that communicate complex data trends and patterns effectively to stakeholders.
  • Performance Monitoring: Continuously monitor AI system metrics to identify anomalies, performance issues, and areas for improvement.
  • Data Analysis: Analyze large and complex datasets to extract meaningful insights that support decision-making processes.
  • Collaboration: Work closely with AI engineers, data scientists, and product managers to align dashboard functionalities with project goals.
  • Innovation: Stay updated with the latest trends and technologies in data visualization and analytics to introduce innovative solutions.
  • Documentation: Maintain thorough documentation of dashboard configurations, data sources, and visualization methodologies.
  • Details of work
  • 1. Performance Metrics:
  • o Latency and Throughput: Monitor response times and the number of requests processed per unit time to ensure the system meets performance expectations.
  • o Resource Utilization: Track CPU, memory, disk I/O, and network bandwidth usage to identify bottlenecks or inefficiencies.
  • 2. Model Performance and Drift Monitoring:
  • o Accuracy Metrics: Track model accuracy, precision, recall, F1 score, etc., to ensure the models are performing as expected.
  • o Data and Concept Drift Detection: Monitor for changes in data distribution that could affect model performance over time.
  • o Feature Importance Tracking: Observe changes in feature importance to understand and explain model predictions.
  • 3. Anomaly Detection:
  • o Implement systems to detect unusual patterns or outliers in data inputs, user behavior, or system performance, which could indicate errors or security issues.
  • 4. Security Monitoring:
  • o Access Logs: Maintain detailed logs of user access and actions for security auditing.
  • o Threat Detection: Use intrusion detection systems (IDS) to identify potential security threats.
  • o Compliance Monitoring: Ensure adherence to regulations like GDPR, HIPAA, or other industry-specific compliance requirements.
  • 5. User Engagement and Feedback:
  • o Usage Analytics: Analyze how users interact with the system to improve user experience.
  • o Feedback Collection: Provide mechanisms for users to report issues or suggest improvements.
  • o Session Tracking: Monitor user sessions to understand behavior patterns and enhance personalization.
  • 6. Error Handling and Logging:
  • o Detailed Error Logs: Capture and categorize errors to facilitate quicker debugging and resolution.
  • o Automated Alerting: Set up alerts for critical failures or error rate thresholds being exceeded.
  • 7. Audit Trails and Traceability:
  • o Transaction Logging: Keep records of all transactions and changes in the system for accountability.
  • o Version Control Tracking: Monitor changes in models, code, or configurations to track the evolution of the system.
  • 8. Data Quality Monitoring:
  • o Validation Checks: Ensure incoming data meets quality standards before processing.
  • o Missing or Corrupted Data Detection: Identify and handle incomplete or corrupted data inputs.
  • 9. Scalability Metrics:
  • o Load Testing Metrics: Assess how the system performs under various load conditions to plan for scaling.
  • o Auto-Scaling Monitoring: Monitor the effectiveness of auto-scaling policies in cloud environments.
  • 10. Cost Management:
  • o Resource Cost Analysis: Monitor the costs associated with compute, storage, and network resources to optimize spending.
  • o Budget Alerts: Set up alerts when spending exceeds predefined budgets.
  • 11. Deployment and CI/CD Pipeline Monitoring:
  • o Deployment Success Rates: Track the success or failure of deployments.
  • o Pipeline Performance: Monitor the CI/CD pipeline for bottlenecks or failures.
  • 12. Compliance and Governance:
  • o Policy Enforcement: Ensure data usage and model deployment adhere to organizational policies.
  • o Role-Based Access Control (RBAC): Implement and monitor access controls for different system components.
  • 13. Disaster Recovery and Backup Monitoring:
  • o Backup Integrity Checks: Regularly verify backups to ensure data can be recovered when needed.
  • o Recovery Time Objectives (RTO) Monitoring: Ensure systems can be restored within acceptable time frames after outages.
  • 14. Customer Support Integration:
  • o Ticketing System Integration: Monitor support tickets related to the system to identify common issues.
  • o Service Level Agreement (SLA) Compliance: Track metrics to ensure SLAs are being met.
  • 15. Visualization and Reporting:
  • o Custom Dashboards: Create dashboards tailored to different stakeholders: executives, developers, support teams.
  • o Scheduled Reports: Automate reporting on key metrics for regular review.
  • Some preferred tools and skills (candidates do not need to check all the boxes):
  • Technical Domain: Experience with AI LLMs, Retrieval Augmented Generation (RAG) systems, safety mechanisms, other generative AI features, billing, token consumption, and more.
  • Data Visualization Tools: Tableau, Power BI, Grafana, Splunk
  • Programming Languages: Python, JQL, SPL
  • Data Query Languages: SQL
  • Cloud Platforms: AWS, Azure, GCP (Likely if Auto-Scaling is a key responsibility)
  • Monitoring Tools: Prometheus, Datadog, New Relic, CloudWatch (AWS), Azure Monitor
  • Version Control Systems: Git
  • Ticketing Systems: Jira, Zendesk, ServiceNow
  • Logging Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk
  • CI/CD Tools: Jenkins, GitLab CI, CircleCI, GitHub Actions

Shift hours - 2 PM to 11 PM IST Mon - Fri
