We are hiring an Observability Engineer (Dashboarding & Analytics Developer) with experience in Splunk and in building technical KPIs (application or API metrics) from logs.
WORK LOCATION:
Budget:
Notice period: Immediate to 20 days
MUST BE INCLUDED WITH SUBMITTAL:
- Full Legal Name
- Phone
- Email
- Current Location
- Rate
- Work Authorization
- Willing to relocate
- Which tool the candidate used to create dashboards/visualizations
Observability Engineer (Dashboarding & Analytics Developer)
The JedAI team is at the forefront of developing cutting-edge generative AI platforms that connect to Large Language Models (LLMs), agents, knowledge bases, and Model Context Protocol (MCP) servers. Our mission is to harness the power of generative AI to deliver innovative solutions that drive efficiency, safety, and intelligence across various applications.
Job Description
We are seeking a highly skilled Dashboarding and Analytics Developer to join our JedAI team. In this role, you will be responsible for the visualization and development of Key Performance Indicators (KPIs) that are critical to monitoring and enhancing the performance of our generative AI systems. You will develop and maintain comprehensive dashboards that provide real-time insights into the performance of LLMs, Retrieval-Augmented Generation (RAG) systems, safety mechanisms, other generative AI features, billing, token consumption, and more.
- Dashboard Development: Design, develop, and maintain interactive and user-friendly dashboards for monitoring AI system performance (a minimal sketch of this kind of KPI rollup follows this list).
- KPI Identification: Collaborate with cross-functional teams to define and implement KPIs related to LLMs, RAG systems, safety protocols, and other AI features.
- Data Visualization: Create clear and insightful visualizations that communicate complex data trends and patterns effectively to stakeholders.
- Performance Monitoring: Continuously monitor AI system metrics to identify anomalies, performance issues, and areas for improvement.
- Data Analysis: Analyze large and complex datasets to extract meaningful insights that support decision-making processes.
- Collaboration: Work closely with AI engineers, data scientists, and product managers to align dashboard functionalities with project goals.
- Innovation: Stay updated with the latest trends and technologies in data visualization and analytics to introduce innovative solutions.
- Documentation: Maintain thorough documentation of dashboard configurations, data sources, and visualization methodologies.
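To make the dashboarding work above concrete, here is a minimal Python sketch of the kind of KPI rollup such dashboards sit on top of: it aggregates hypothetical per-request LLM logs into latency percentiles, an error rate, and token consumption. The log schema and every field name (latency_ms, status, prompt_tokens, completion_tokens) are assumptions for illustration, not part of this posting.

```python
import json
from statistics import quantiles

def kpi_summary(log_lines):
    """Roll raw per-request logs up into dashboard-ready KPI values."""
    records = [json.loads(line) for line in log_lines]
    latencies = sorted(r["latency_ms"] for r in records)
    cuts = quantiles(latencies, n=100)  # 99 percentile cut points
    errors = sum(1 for r in records if r["status"] >= 500)
    return {
        "requests": len(records),
        "p50_latency_ms": cuts[49],
        "p95_latency_ms": cuts[94],
        "error_rate": errors / len(records),
        "total_tokens": sum(r["prompt_tokens"] + r["completion_tokens"]
                            for r in records),
    }

# Hypothetical log lines; the schema is assumed for illustration only.
sample = [
    '{"latency_ms": 120, "status": 200, "prompt_tokens": 35, "completion_tokens": 180}',
    '{"latency_ms": 480, "status": 200, "prompt_tokens": 60, "completion_tokens": 320}',
    '{"latency_ms": 95, "status": 503, "prompt_tokens": 20, "completion_tokens": 0}',
]
print(kpi_summary(sample))
```

In practice the same aggregation would typically be expressed directly in SPL or SQL and fed to Splunk, Grafana, or Tableau; the Python version only shows the shape of the computation.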
Details of work
1. Performance Metrics:
o Latency and Throughput: Monitor the response times and the number of requests processed per unit time to ensure the system meets performance expectations.
o Resource Utilization: Track CPU, memory, disk I/O, and network bandwidth usage to identify bottlenecks or inefficiencies.
2. Model Performance and Drift Monitoring:
o Accuracy Metrics: Keep track of model accuracy, precision, recall, F1 score, etc., to ensure the models are performing as expected.
o Data and Concept Drift Detection: Monitor for changes in data distribution that could affect model performance over time (see the first sketch after this list).
o Feature Importance Tracking: Observe changes in feature importance to understand and explain model predictions.
3. Anomaly Detection:
o Implement systems to detect unusual patterns or outliers in data inputs, user behavior, or system performance, which could indicate errors or security issues.
4. Security Monitoring:
o Access Logs: Maintain detailed logs of user access and actions for security auditing.
o Threat Detection: Use intrusion detection systems (IDS) to identify potential security threats.
o Compliance Monitoring: Ensure adherence to regulations like GDPR, HIPAA, or other industry-specific compliance requirements.
5. User Engagement and Feedback:
o Usage Analytics: Analyze how users interact with the system to improve user experience.
o Feedback Collection: Provide mechanisms for users to report issues or suggest improvements.
o Session Tracking: Monitor user sessions to understand behavior patterns and enhance personalization.
6. Error Handling and Logging:
o Detailed Error Logs: Capture and categorize errors to facilitate quicker debugging and resolution.
o Automated Alerting: Set up alerts for critical failures or error rate thresholds being exceeded (see the second sketch after this list).
7. Audit Trails and Traceability:
o Transaction Logging: Keep records of all transactions and changes in the system for accountability.
o Version Control Tracking: Monitor changes in models, code, or configurations to track the evolution of the system.
8. Data Quality Monitoring:
o Validation Checks: Ensure incoming data meets quality standards before processing.
o Missing or Corrupted Data Detection: Identify and handle incomplete or corrupted data inputs.
9. Scalability Metrics:
o Load Testing Metrics: Assess how the system performs under various load conditions to plan for scaling.
o Auto-Scaling Monitoring: Monitor the effectiveness of auto-scaling policies in cloud environments.
10. Cost Management:
o Resource Cost Analysis: Monitor the costs associated with compute, storage, and network resources to optimize spending.
o Budget Alerts: Set up alerts when spending exceeds predefined budgets.
11. Deployment and CI/CD Pipeline Monitoring:
o Deployment Success Rates: Track the success or failure of deployments.
o Pipeline Performance: Monitor the CI/CD pipeline for bottlenecks or failures.
12. Compliance and Governance:
o Policy Enforcement: Ensure data usage and model deployment adhere to organizational policies.
o Role-Based Access Control (RBAC): Implement and monitor access controls for different system components.
13. Disaster Recovery and Backup Monitoring:
o Backup Integrity Checks: Regularly verify backups to ensure data can be recovered when needed.
o Recovery Time Objectives (RTO) Monitoring: Ensure systems can be restored within acceptable time frames after outages.
14. Customer Support Integration:
o Ticketing System Integration: Monitor support tickets related to the system to identify common issues.
o Service Level Agreement (SLA) Compliance: Track metrics to ensure SLAs are being met.
15. Visualization and Reporting:
o Custom Dashboards: Create dashboards tailored to different stakeholders: executives, developers, and support teams.
o Scheduled Reports: Automate reporting on key metrics for regular review.
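The posting does not prescribe a drift-detection method for item 2, so the following is only one common approach, assumed here: a two-sample Kolmogorov-Smirnov test comparing a feature's live distribution against its training-time baseline. The alpha level and sample values are illustrative.

```python
from scipy.stats import ks_2samp

def drifted(baseline, live, alpha=0.01):
    """Flag drift when live data no longer matches the baseline distribution.

    A small p-value from the two-sample Kolmogorov-Smirnov test means the
    two samples are unlikely to come from the same distribution.
    """
    result = ks_2samp(baseline, live)
    return result.pvalue < alpha

# Illustrative data: live inputs have shifted well above the baseline.
baseline = [0.10, 0.20, 0.15, 0.30, 0.25, 0.20, 0.18, 0.22]
live = [0.60, 0.70, 0.65, 0.80, 0.75, 0.70, 0.68, 0.72]
print(drifted(baseline, live))  # True: the distributions clearly differ
```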
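For item 6's automated alerting, here is a minimal threshold-based sketch: it keeps a sliding window of recent request outcomes and signals when the windowed error rate crosses a limit. The window size, the 5% threshold, and returning a boolean instead of paging a real on-call system are all simplifying assumptions.

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = request failed
        self.threshold = threshold

    def record(self, is_error):
        """Record one request outcome; return True if the alert should fire."""
        self.outcomes.append(is_error)
        error_rate = sum(self.outcomes) / len(self.outcomes)
        return error_rate > self.threshold

alert = ErrorRateAlert(window=20, threshold=0.05)
for status in [200] * 18 + [500, 503]:  # two failures out of twenty requests
    firing = alert.record(status >= 500)
print(firing)  # True: 2/20 = 10%, which exceeds the 5% threshold
```

In a production setup the same rule would usually live in the monitoring tool itself (a Splunk alert, Prometheus alerting rule, or Datadog monitor) rather than in application code.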
Preferred tools & skills (candidates do not need to check all the boxes):
- Technical Domain: experience with AI/LLMs, Retrieval-Augmented Generation (RAG) systems, safety mechanisms, other generative AI features, billing, token consumption, and more.
- Data Visualization Tools: Tableau, Power BI, Grafana, Splunk
- Programming Languages: Python
- Query Languages: SQL, SPL (Splunk Search Processing Language), JQL (Jira Query Language)
- Cloud Platforms: AWS, Azure, GCP (likely needed, given the auto-scaling monitoring responsibilities)
- Monitoring Tools: Prometheus, Datadog, New Relic, CloudWatch (AWS), Azure Monitor
- Version Control Systems: Git
- Ticketing Systems: Jira, Zendesk, ServiceNow
- Logging Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk
- CI/CD Tools: Jenkins, GitLab CI, CircleCI, GitHub Actions
Shift hours: 2 PM to 11 PM IST, Mon-Fri