6 Observability Stacks Jobs

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

12.0 - 14.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

We are hiring for a world-class payments network with human-centric customer service, trusted by 15,000+ partners across 235 territories through our flagship brands. With a diverse team of 800+ professionals across 10 global offices, we are reshaping the international property market by connecting buyers, sellers, legal firms, banks, and real estate agents through innovative payments and embedded software solutions. Our mission is clear: to make buying property abroad faster, simpler, and safer.

About the Role

We are looking for a dynamic and experienced Senior Engineering Manager with a strong architectural mindset to lead and evolve our Platform & DevOps function. This hybrid technical-leadership role demands both strategic vision and hands-on depth to build scalable, cost-efficient, AI-enabled, and reliable infrastructure and development workflows. You will lead cross-functional initiatives across cloud architecture, CI/CD automation, DevSecOps and FinOps, while mentoring engineers and influencing architectural decisions across the organization.

Key Responsibilities

Strategic & Architecture Leadership
- Architect and oversee multi-cloud and hybrid infrastructure strategies.
- Provide architectural guidance on scalable infrastructure, deployment patterns, and resilient system design.
- Identify and drive opportunities to embed AI and automation into platform capabilities for operational efficiency and developer enablement.
- Partner with product, engineering, and operations teams to align on technical direction and platform strategy.

Cloud Architecture, Optimization & FinOps
- Lead modernization initiatives with cloud-native, containerized, and serverless architecture design.
- Embed automation across infrastructure provisioning, deployment, and operations.
- Drive cloud cost optimization through automation, governance, and resource efficiency.
- Champion FinOps practices, partnering with Finance to ensure usage aligns with cost accountability.

DevSecOps & Security Automation
- Integrate security practices into the CI/CD pipeline, enabling shift-left security.
- Drive adoption of secure defaults, vulnerability scanning, and compliance automation across cloud and app infrastructure.
- Partner with InfoSec and engineering teams to ensure governance, auditability, and policy enforcement across environments.
- Promote a security-first culture in developer workflows without compromising delivery speed.

CI/CD Automation & Developer Productivity
- Design and evolve secure, scalable CI/CD pipelines to streamline code delivery.
- Promote deployment patterns like blue/green, canary, and automated rollbacks.
- Improve developer experience through build tooling, workflow automation, and policy-based governance.

People & Technical Leadership
- Act as a technical mentor and trusted advisor, guiding engineers on cloud and platform decisions.
- Foster a culture of ownership, technical excellence, and continuous improvement.
- Enable team autonomy by building reusable frameworks, automation templates, and governance standards.

Qualifications
- 12+ years in DevOps, Security, infrastructure, or platform engineering roles.
- 5+ years in engineering leadership or architectural roles.
- Proven experience with AWS (preferred), GCP or Azure, and Infrastructure-as-Code (e.g., Terraform, Pulumi).
- Strong expertise in cloud-native design, automation, and platform scalability.
- Working knowledge of, or practical exposure to, AI-powered DevOps tools, AI observability, or intelligent automation.
- Proficiency in CI/CD tools (GitLab, GitHub Actions, Jenkins, ArgoCD) and observability stacks (OTEL, Prometheus, Grafana, ELK).
- Excellent leadership, communication, administration and cross-functional collaboration skills.

Preferred Qualifications
- Experience with FinOps frameworks and large-scale cloud spend optimization.
- Background in hybrid or multi-cloud architecture design and governance.
- A balance of deep technical expertise and people management experience.
- CKA, AWS Cloud certifications.

Posted 6 days ago

15.0 - 19.0 years

0 Lacs

Hyderabad, Telangana

On-site

About Mobius: Mobius is an AI-native platform that surpasses current AI products by blending neural networks, symbolic reasoning, graph intelligence, and autonomous agent coordination into a cohesive digital ecosystem. It represents the next phase of cloud-native AI platforms, specifically engineered to construct, oversee, and enhance intelligent software automatically. Mobius is the convergence point of data and reasoning, automation and intelligence, where software learns to self-construct.

The Role: As a key figure, you will steer the architectural strategy of Mobius's core orchestration and infrastructure layer. This layer is crucial for driving all automation, workflow execution, and backend intelligence, shaping how digital systems are established, tested, deployed, and refined over time. Your primary responsibility is to design an AI-centric orchestration layer that is modular, scalable, and autonomous.

What You'll Own:
- Design and oversee orchestration layers for various components including business process automation (BPMN, API orchestration), machine learning pipelines (MLflow, Kubeflow), large language model workflows (LLMOps), DevSecOps automation (CI/CD, ArgoCD), data workflows (ETL pipelines using Airflow/SeaTunnel), distributed databases (NoSQL, Graph, Vector, RDBMS), and governance systems (identity, access, compliance workflows).
- Establish a unified abstraction layer for AI-driven workflow composition.
- Ensure runtime safety, observability, and dynamic scalability.
- Enable real-time graph-based reconfiguration of infrastructure.
- Collaborate with AI, data, and product teams to bolster intelligent automation.

What We're Looking For:
- A minimum of 15 years of experience in cloud-native architecture, infrastructure automation, or workflow orchestration.
- Proficiency in orchestrating large-scale systems involving ML, APIs, data, and software delivery.
- In-depth knowledge of Kubernetes, container systems, service mesh, and CI/CD frameworks.
- Familiarity with BPMN, workflow engines such as Camunda and Argo, and process modeling tools.
- Strong grasp of distributed systems, observability stacks, and runtime graph engines.
- Ability to model and execute dynamic workflows based on declarative specifications.

Bonus Points:
- Prior involvement in designing orchestration platforms for AI/ML agents.
- Publication of work or patents related to system design, process automation, or software-defined infrastructure.
- Understanding of decision modeling, agent behavior, or adaptive workflows.

Mobius is in search of architects for the future, not mere builders. If you have envisioned software that can think, adapt, and evolve, then you have imagined Mobius.

Posted 1 week ago

10.0 - 12.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site

About Wekan Enterprise Solutions

Wekan Enterprise Solutions is a leading Technology Consulting company and a strategic investment partner of MongoDB. We help companies drive innovation in the cloud by adopting modern technology solutions that help them achieve their performance and availability requirements. With strong capabilities around Mobile, IoT, and Cloud environments, we have an extensive track record of helping Fortune 500 companies modernize their most critical legacy and on-premise applications, migrating them to the cloud, and leveraging the most cutting-edge technologies.

Senior Manager, Engineering Operations
Location: Bangalore
Reports to: Chief Operating Officer

Job Summary

The Senior Manager, Engineering Operations owns the operational execution that turns Director-level engineering quality, governance and reliability goals into measurable, repeatable outcomes across product teams and client projects. You will lead centralized QA & test-automation operations, CI/CD & observability practices, release-readiness and SLA enforcement, engineering metrics/reporting, and resource-roster governance, and act as the primary operational partner to Delivery and Product leadership.

Core responsibilities

Engineering Quality & Governance (Enforcement)
- Operationalize and enforce Quality KPIs defined by the Director of Engineering.
- Run weekly operational checks, monthly quality audits, and quarterly governance reviews; own remediation tracking.
- Integrate secure-coding checks (OWASP, SAST) and PR gates into developer workflows.
- Ensure SLA adherence and maintain runbooks for remediation.

Centralized QA & Test-Automation Operations
- Run, scale and operate a centralized QA/Test-Automation function (hiring, tooling, playbooks, SLAs).
- Enforce automation targets and reduce prioritized test-suite execution time.
- Integrate automation quality gates into CI/CD and release pipelines.

DevOps, CI/CD & Observability
- Own CI/CD pipeline operation with integrated security, quality and performance gates.
- Implement and operationalize performance/load testing for applicable applications; feed results into release decisions.
- Maintain performance and stability dashboards, deployment success metrics and build-health monitoring.
- Standardize observability (metrics, logs, traces) for production applications.

Release Readiness & SLA Enforcement
- Lead release-readiness processes: defect triage, UAT cycles, approvals, rollback plans, smoke checks and post-release validation.
- Monitor runtime health (crash-free sessions, uptime SLAs) and enforce time-to-resolution SLAs.
- Produce release-level engineering scorecards for Delivery and Product leadership with clear go/no-go signals.

Engineering Metrics Reporting & Visibility
- Implement centralized dashboards covering velocity, quality, automation, incidents and release health (Q3 2025 target).
- Establish and run weekly and monthly reporting cadences.
- Achieve 100% coverage of active engineering teams and initiatives in reporting scope.

Day-to-Day Operational Oversight & Risk Mitigation
- Maintain daily/weekly project tracking to surface velocity, blockers, dependencies and risks.
- Drive proactive identification and remediation of engineering risks with adherence to mitigation plans.
- Reduce delivery-impacting escalations via early-warning signals and cross-functional coordination with Delivery Leads.

Resource Planning, Allocation & Roster Management
- Jointly own the centralized Engineering Resource Roster with the Delivery Leads, segmented by Pipeline (Signed / Upcoming / Internal), Project-wise allocation (Projected vs Actual), and Resource-wise allocation.
- Produce monthly resourcing-gap reports and coordinate hiring with Talent Acquisition.
- Ensure forward-looking allocation plans.

Experience & skills

Required
- 10+ years in software engineering/platform/SRE/DevOps with 4+ years leading cross-functional operational teams.
- Proven experience operationalizing quality, governance and centralized QA/testing at org scale.
- Deep hands-on experience with CI/CD tooling (Jenkins/GitLab CI/GitHub Actions), IaC (Terraform/CloudFormation), containers & orchestration (Docker/Kubernetes).
- Strong background in test automation (API/backend frameworks), performance testing (JMeter), observability stacks (Prometheus/Grafana, ELK/EFK) and incident management.
- Track record of building dashboards and automating reporting pipelines.
- Excellent stakeholder management and executive communication skills.
- Degree in Computer Science/Engineering or equivalent experience.

Preferred
- Cloud certifications (AWS/GCP/Azure), SRE/DevOps certifications, ISTQB Test-Automation.
- Experience with mobile/web performance testing and device-farm integrations.
- Familiarity with relational and NoSQL databases.

Posted 1 week ago

3.0 - 7.0 years

0 Lacs

Punjab

On-site

ABOUT XENONSTACK

XenonStack is the fastest-growing data and AI foundry for agentic systems, enabling people and organizations to gain real-time and intelligent business insights.
- Agentic Systems for AI Agents: akira.ai
- Vision AI Platform: xenonstack.ai
- Inference AI Infrastructure for Agentic Systems: nexastack.ai

THE OPPORTUNITY

We are seeking an Agentic Infrastructure Observability Engineer to design, implement, and maintain visibility, monitoring, and assurance systems for large-scale AI agent deployments. This role focuses on observability, telemetry, and evaluation pipelines across multi-agent and multi-context workflows, ensuring AI systems are measurable, trustworthy, and compliant in enterprise and regulated environments. If you're passionate about SRE principles for AI, LLM evaluation, and agentic system transparency, this role offers the chance to shape observability for the next generation of intelligent automation.

RESPONSIBILITIES
- Design and Implement Telemetry Pipelines: Build observability infrastructure to capture logs, metrics, traces, and behavioral data from AI agents, orchestration layers, and integrated tools.
- Develop Evaluation Dashboards & KPIs: Track accuracy, latency, reliability, cost, token usage, and success rates for agentic workflows.
- Enable Full-Stack Tracing: Build execution flow tracing for multi-agent, multi-tool pipelines, with attribution for each decision, prompt, and retrieval step.
- Monitor Behavioral Reliability: Detect and flag hallucinations, decision drift, prompt degradation, or tool misuse in real time.
- Integrate with Evaluation Frameworks: Work with LLM eval tools like TruLens, Ragas, Arize AI, and custom scoring systems for continuous quality monitoring.
- Ensure Compliance & Auditability: Implement observability features for regulatory audits (e.g., PCI-DSS, GDPR), including secure logging of prompts, retrieved context, and decisions.
- Cost & Resource Observability: Track model/API usage, compute cost, and token consumption to enable optimization decisions.
- Collaborate Across Teams: Partner with AgentOps Engineers, AI Interaction Engineers, and Model Reliability teams to turn observability insights into operational improvements.

SKILLS & QUALIFICATIONS

Must-Have:
- 3-5 years in SRE, DevOps, AI infrastructure, or ML systems engineering.
- Proficiency in Python and observability stacks (Prometheus, OpenTelemetry, Grafana, ELK, etc.).
- Familiarity with LLM architectures, multi-agent orchestration frameworks (LangGraph, LangChain, AgentBridge), and context pipelines.
- Experience with logging, tracing, and performance profiling for distributed systems.
- Understanding of LLM evaluation metrics (factuality, coherence, toxicity, cost efficiency).
- Knowledge of privacy and compliance standards for AI systems.

Good-to-Have:
- Hands-on experience with LLM eval tools (TruLens, Ragas, Arize AI, Weights & Biases).
- Familiarity with RAG, vector databases, and knowledge graph-based retrieval.
- Experience in regulated industries (BFSI, healthcare, GRC).
- Background in anomaly detection or behavioral monitoring for ML systems.

CAREER GROWTH & BENEFITS

Continuous Learning & Growth
- Training and certifications in AI observability, LLM evaluation, and Responsible AI.
- Hands-on exposure to enterprise-scale agentic infrastructure.

Recognition & Rewards
- Incentives for innovations in AI observability and monitoring.
- Fast-track opportunities into AI Reliability Architecture or Model Ops Leadership roles.

Work Benefits & Well-Being
- Comprehensive medical insurance and project-based allowances.
- Cab facilities for women employees and special project perks.

XENONSTACK CULTURE: JOIN US & MAKE AN IMPACT!

We foster a culture of cultivation with bold, human-centric leadership principles. We value deep work, experimentation, and ownership in every initiative, and we are on a mission to reshape how enterprises adopt AI + Human Intelligence systems.

Product Values:
- Obsessed with Adoption: Making AI accessible and enterprise-ready.
- Obsessed with Simplicity: Turning complexity into seamless, intuitive AI experiences.

Be a part of our vision to accelerate the world's transition to AI + Human Intelligence.

Posted 3 weeks ago

12.0 - 16.0 years

0 Lacs

Hyderabad, Telangana

On-site

The Senior Technical Architect, Generative AI and Agent Factory, plays a key role in leading the architecture, design, and strategic enablement of PepsiCo's enterprise-grade GenAI platforms including PepGenX, Agent Factory, and PepVigil. You will define scalable, event-driven agent orchestration frameworks, modular agent templates, and integration patterns to enable intelligent, governed, and reusable agent ecosystems. This position is vital in accelerating the deployment of AI-driven solutions across commercial, reporting, and enterprise automation use cases. Your responsibilities will include architecting and governing the design of scalable, modular AI agent frameworks for enterprise-wide reuse. You will define event-driven orchestration and agentic execution patterns to enable intelligent, context-aware workflows. Additionally, you will drive platform integration across PepGenX, Agent Factory, and PepVigil to ensure consistency in observability, security, and orchestration patterns. Developing reusable agent templates, blueprints, and context frameworks to accelerate use case onboarding across domains will also be a key aspect of your role. You will establish architecture standards to embed Responsible AI (RAI), data privacy, and policy enforcement within GenAI agents, as well as lead technical architecture planning across delivery sprints, vendor integration tracks, and GenAI product releases. 
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- 12+ years of experience in enterprise software or AI architecture roles, with a recent focus on LLMs, Generative AI, or intelligent agents
- Proven experience in designing and implementing agent-based frameworks or AI orchestration platforms
- Strong understanding of technologies such as LangGraph, Temporal, vector databases, multimodal RAGs, event-driven systems, and memory management
- Hands-on experience with Kubernetes, Azure AI/AKS, REST APIs, and observability stacks
- Demonstrated ability to influence enterprise architecture and guide cross-functional engineering teams
- Excellent communication skills, with the ability to articulate complex technical concepts to diverse stakeholders

Posted 1 month ago

12.0 - 16.0 years

0 Lacs

Hyderabad, Telangana

On-site

The Senior Technical Architect - Generative AI and Agent Factory is accountable for overseeing the overall architecture, design, and strategic advancement of PepsiCo's enterprise-level GenAI platforms, namely PepGenX, Agent Factory, and PepVigil. Your primary objective will be to establish scalable, event-driven agent orchestration frameworks, modular agent templates, and integration strategies that empower intelligent, governed, and reusable agent ecosystems, thus expediting the implementation of AI-driven solutions across various commercial, reporting, and enterprise automation scenarios. You will be responsible for architecting and managing the development of scalable, modular AI agent frameworks (such as Agent Mesh, Orchestrator, Memory, Canvas) for broad organizational utility. Your role will also involve defining event-driven orchestration and agentic execution patterns (e.g., Temporal, LangGraph, AST-RAG, reflection) to facilitate intelligent, context-aware workflows. Additionally, you will drive platform integration across PepGenX, Agent Factory, and PepVigil to ensure uniformity in observability, security, and orchestration approaches. Developing reusable agent templates, blueprints, and context frameworks (e.g., MCP, semantic caching) to expedite use case onboarding across various domains will also fall under your purview. Moreover, you will be instrumental in setting up architecture standards to incorporate Responsible AI (RAI), data privacy, and policy enforcement within GenAI agents, and lead technical architecture planning for delivery sprints, vendor integration tracks, and GenAI product releases. To be considered for this role, you should possess a Bachelor's or Master's degree in Computer Science, Engineering, or a related field, along with a minimum of 12 years of experience in enterprise software or AI architecture roles, with recent specialization in LLMs, Generative AI, or intelligent agents. 
You must have a proven track record in designing and implementing agent-based frameworks or AI orchestration platforms, and a solid understanding of technologies like LangGraph, Temporal, vector databases, multimodal RAGs, event-driven systems, and memory management. Hands-on experience with Kubernetes, Azure AI/AKS, REST APIs, and observability stacks is essential, as is the ability to influence enterprise architecture and provide guidance to cross-functional engineering teams. Excellent communication skills are a must, allowing you to effectively convey intricate technical concepts to a diverse set of stakeholders.

Posted 1 month ago
