Work Location: All Pan India locations except Mumbai

Reliability Architect with over 10 years of experience in proactive monitoring, automation, and observability. Skilled in AIOps/MLOps, infrastructure management, and performance optimization using modern tools and practices. Adept at leading incident response, mentoring support teams, and driving cross-functional collaboration to ensure system reliability and scalability.

Key Responsibilities
Monitoring and Automation: Proactively monitor software systems to prevent incidents and automate routine operational tasks.
Effective Monitoring: Design monitoring systems that trigger alerts based on symptoms rather than outages, ensuring early detection and resolution.
Application Performance Monitoring (APM): Implement and manage APM tools like New Relic or Dynatrace to track application performance, identify bottlenecks, and optimize resource usage.
Log Analysis with Splunk: Use Splunk to analyze logs for troubleshooting, anomaly detection, and improving system reliability.
Dashboards Preparation: Build intuitive dashboards to visualize system health, performance metrics, and operational KPIs.
Alerts Setup: Configure intelligent alerts based on thresholds and anomalies to ensure timely incident response.
Reports Scheduling: Automate regular reporting to provide insights into system performance, reliability, and trends.
Reliability Metrics: Define and track metrics such as SLOs, SLIs, and error budgets to measure and maintain system reliability.
Observability Skills: Apply observability practices including distributed tracing, logging, and metrics collection to gain deep insights into system behavior.
AI-Driven Monitoring & Automation: Utilize AIOps techniques to proactively detect anomalies, automate incident response, and enable self-healing systems through intelligent alerting and predictive analytics.
Observability & ML Integration: Integrate machine learning models with observability tools to enhance system insights, optimize performance, and ensure reliability of AI-powered services in production.
Cross-Team Collaboration: Work closely with development and support teams to enhance service reliability through rigorous testing and release procedures.
Capacity Planning: Participate in system design reviews and capacity planning to ensure scalability and performance.
Debugging and Incident Response: Lead incident response efforts, analyze debugging information, and manage rollbacks of faulty software deployments.
Mentoring Support Teams: Guide and mentor L1/L2 support teams to establish best practices in monitoring and observability.
Infrastructure Management: Manage infrastructure using tools like Chef, Ansible, Terraform, GitLab CI/CD, and Kubernetes.
Documentation: Maintain comprehensive documentation of processes and procedures to ensure operational consistency and reduce redundancy.
Proactive Mindset: Approach challenges with enthusiasm, ownership, and a continuous improvement mindset.