Jobs
Interviews

479 OpenTelemetry Jobs - Page 5

Set up a job alert
JobPe aggregates listings for easy access, but you apply directly on the employer's job portal.

7.0 - 12.0 years

10 - 15 Lacs

Pune

Work from Office

Sarvaha would like to welcome a skilled Observability Engineer with a minimum of 7 years of experience to contribute to designing, deploying, and scaling our monitoring and logging infrastructure on Kubernetes. In this role, you will play a key part in enabling end-to-end visibility across cloud environments by processing petabyte-scale data, helping teams enhance reliability, detect anomalies early, and drive operational excellence. Sarvaha is a niche software development company that works with some of the best-funded startups and established companies across the globe.

What You'll Do:
- Configure and manage observability agents across AWS, Azure & GCP.
- Use IaC techniques and tools such as Terraform, Helm & GitOps to automate deployment of the observability stack.
- Work with different language stacks such as Java, Ruby, Python, and Go.
- Instrument services using OpenTelemetry and integrate telemetry pipelines.
- Optimize telemetry metrics storage using time-series databases such as Mimir & NoSQL DBs.
- Create dashboards, set up alerts, and track SLIs/SLOs.
- Enable RCA and incident response using observability data.
- Secure the observability pipeline.

You Bring:
- BE/BTech/MTech (CS/IT or MCA), with an emphasis in Software Engineering.
- Strong skills in reading and interpreting logs, metrics, and traces.
- Proficiency with the LGTM stack (Loki, Grafana, Tempo, Mimir) or similar tools: Jaeger, Datadog, Zipkin, InfluxDB, etc.
- Familiarity with log frameworks such as log4j, lograge, Zerolog, loguru, etc.
- Knowledge of OpenTelemetry, IaC, and security best practices.
- Clear documentation of observability processes, logging standards & instrumentation guidelines.
- Ability to proactively identify, debug, and resolve issues using observability data.
- Focus on maintaining data quality and integrity across the observability pipeline.
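Instrumenting a service, as in the "Instrument services using OpenTelemetry" bullet above, amounts to wrapping units of work in spans that record timing and attributes. This stdlib-only toy recorder only mimics the shape of the OpenTelemetry Python SDK's `tracer.start_as_current_span` without depending on it; the span names and attributes are made up for illustration.

```python
import time
import uuid
from contextlib import contextmanager

# Toy span recorder illustrating what tracing instrumentation captures;
# the real OpenTelemetry SDK exposes the same context-manager shape.
SPANS = []

@contextmanager
def span(name, **attributes):
    record = {"name": name, "trace_id": uuid.uuid4().hex,
              "attributes": attributes, "start": time.time()}
    try:
        yield record
    finally:
        record["duration_s"] = time.time() - record["start"]
        SPANS.append(record)

# Hypothetical checkout flow: nested spans capture parent/child timing.
with span("process_order", order_id="ord-42"):
    with span("charge_card", provider="stripe"):
        time.sleep(0.01)  # stand-in for real work

print([s["name"] for s in SPANS])  # inner span finishes (and is recorded) first
```

The inner span closes before the outer one, so a backend can reconstruct the call tree from timestamps, which is exactly the data dashboards and RCA workflows consume.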

Posted 2 weeks ago

Apply

3.0 - 7.0 years

0 Lacs

Mumbai, Maharashtra, India

On-site

About Last9
Last9 is a unified observability platform for AI-native teams dealing with complex, high-cardinality data that needs to be retained long-term. We unify Metrics, Events, Logs, and Traces in a single solution, enabling comprehensive observability across infrastructure, application, business, and product domains with enterprise-grade APM capabilities. We work with top organizations in India such as Jio, Games24x7, Probo, Zoomcar, and CleverTap, among others.

Role Overview
As an Account Executive, you'll be at the forefront of our revenue growth, working directly with CTOs, VPs of Engineering, and DevOps leaders to help them adopt Last9's observability solutions. You'll own the full sales cycle from prospecting to close, focusing on mid-market accounts. This role is ideal for a smart, quick learner with technical sales experience who wants to make a significant impact at a fast-growing startup.

What You'll Do
- Drive new business acquisition across mid-market and enterprise accounts in your territory
- Build and maintain relationships with technical decision-makers including CTOs, VPs of Engineering, SRE leaders, and DevOps teams
- Navigate complex technical sales cycles with multiple stakeholders
- Articulate technical value propositions
- Own the full sales cycle from initial contact through contract negotiation and close
- Develop and execute account strategies to meet and exceed quarterly/annual quotas
- Partner with SDRs to follow up on qualified leads and develop pipeline
- Provide accurate sales forecasts and maintain detailed records in CRM
- Contribute to go-to-market strategy and help refine our sales playbook
- Represent Last9 at industry events and conferences

What We're Looking For
- Bachelor's degree in any discipline
- 3-7 years of B2B SaaS sales experience with a track record of meeting/exceeding quotas
- Experience selling observability, monitoring, database, or DevOps solutions (strong preference)
- Strong understanding of technical concepts and the ability to translate them to business value
- Excellent presentation and communication skills
- Self-motivated, with the ability to work independently in a fast-paced environment
- Experience with CRM systems (HubSpot/Salesforce/Attio preferred)
- Willingness to travel as needed for customer meetings and events

Preferred Skills
- Experience selling to engineering and technical buyers
- Understanding of cloud-native architectures, Kubernetes, and modern development practices
- Familiarity with OpenTelemetry, Prometheus, or other open-source observability tools
- Experience at an early-stage startup or high-growth company

Why Join Last9
- Direct exposure to leadership and strategic decision-making
- Clear career progression path with the opportunity to build and lead a team
- Tons of learning, fast career growth, and leadership opportunities as we scale

Last9 is an equal-opportunity employer. We embrace and celebrate diversity. All employment decisions are based on qualifications, merit, and business needs.

Posted 2 weeks ago

Apply

0 years

0 Lacs

India

On-site

Job Description

Company Description
Evallo is a leading provider of a comprehensive SaaS platform for tutors and tutoring businesses, revolutionizing education management. With features like advanced CRM, profile management, standardized test prep, automatic grading, and insightful dashboards, we empower educators to focus on teaching. We're dedicated to pushing the boundaries of ed-tech and redefining efficient education management.

Why This Role Matters
Evallo is scaling from a focused tutoring platform to a modular operating system for all service businesses that bill by the hour. As we add payroll, proposals, whiteboarding, and AI tooling, we need a Solution Architect who can translate product vision into a robust, extensible technical blueprint. You'll be the critical bridge between product, engineering, and customers, owning architecture decisions that keep us reliable at 5k+ concurrent users and cost-efficient at 100k+ total users.

Outcomes We Expect
- Map the current backend + frontend, flag structural debt, and publish an Architecture Gap Report
- Define naming & layering conventions, linter/formatter rules, and a lightweight ADR process
- Ship reference architectures for new modules
- Lead cross-team design reviews; no major feature ships without architecture sign-off
- Eventual goal: Evallo runs in a fully observable, autoscaling environment with < 10% infra cost waste, and monitoring dashboards trigger < 5 false positives per month

Day-to-Day
- Solution Design: Break down product epics into service contracts, data flows, and sequence diagrams. Choose the right patterns (monolith vs. microservice, event vs. REST, cache vs. DB index) based on cost, team maturity, and scale targets.
- Platform-wide Standards: Codify review checklists (security, performance, observability) and enforce them via GitHub templates and CI gates. Champion a shift-left testing mindset; critical paths reach 80% automated coverage before QA touches them.
- Scalability & Cost Optimization: Design load-testing scenarios that mimic 5k concurrent tutoring sessions; guide DevOps on autoscaling policies and CDN strategy. Audit infra spend monthly; recommend serverless, queuing, or data-tier changes to cut waste.
- Release & Environment Strategy: Maintain clear promotion paths (local → sandbox → staging → prod) with one-click rollback. Own schema-migration playbooks; zero-downtime releases are the default, not the exception.
- Technical Mentorship: Run fortnightly architecture clinics; level up engineers on domain-driven design and performance profiling. Act as tie-breaker on competing technical proposals, keeping debates respectful and evidence-based.

Qualifications
- 5+ years of engineering experience, including 2+ years in a dedicated architecture or staff-level role on a high-traffic SaaS product
- Proven track record designing multi-tenant systems that scaled beyond 50k users or 1k RPM
- Deep knowledge of Node.js/TypeScript (our core stack) and MongoDB or similar NoSQL, plus comfort with event brokers (Kafka, NATS, or RabbitMQ)
- Fluency in AWS (preferred) or GCP primitives: EKS, Lambda, RDS, CloudFront, IAM
- Hands-on with observability stacks (Datadog, New Relic, Sentry, or OpenTelemetry)
- Excellent written communication; you can distill technical trade-offs into one page for execs and one diagram for engineers
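A load-testing scenario like the 5k-concurrent-sessions target above can be sketched with stdlib asyncio: spawn N concurrent simulated sessions, collect latencies, and report the p95 that an autoscaling policy review would gate on. The session coroutine and latency numbers here are illustrative stand-ins, not Evallo's actual stack or figures.

```python
import asyncio
import random
import time

async def tutoring_session(results):
    # Stand-in for one simulated client session hitting the API.
    start = time.monotonic()
    await asyncio.sleep(random.uniform(0.001, 0.005))  # fake request latency
    results.append(time.monotonic() - start)

async def load_test(concurrency):
    """Run `concurrency` sessions at once and return the p95 latency."""
    results = []
    await asyncio.gather(*(tutoring_session(results) for _ in range(concurrency)))
    results.sort()
    return results[int(0.95 * len(results)) - 1]

# Scale concurrency toward 5k against real infrastructure; kept small here.
p95 = asyncio.run(load_test(500))
print(f"p95 latency: {p95:.4f}s")
```

Real load tests would drive actual HTTP traffic (e.g. with a tool like k6 or Locust); the point of the sketch is the shape: concurrency fan-out, latency collection, percentile gate.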

Posted 2 weeks ago

Apply

10.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

Experience: 10+ years
Location: Hyderabad
Work Mode: Hybrid (3 days in office)
Notice Period: Immediate to 15 days only
Work Timings: 1:30 PM to 10:30 PM IST
Mandatory Skills: Python, NodeJS, any cloud, microservices

- Architectural Mastery: Deep expertise in microservices, event-driven architecture (EDA) (EventBridge, SQS/SNS), distributed system design, scalability, and resiliency patterns.
- Cloud & Infra Leadership: Expert in AWS, Terraform, and Kubernetes/Argo CD.
- Observability & Reliability: OpenTelemetry, Sentry.
- Polyglot Proficiency: Command of Python and FastAPI, strong understanding of TypeScript/Node.js, and experience implementing both GraphQL and RESTful API interfaces.
- Leadership: Mentors senior/junior engineers, drives technical consensus, and influences the strategic roadmap.
- Previous financial experience (writing software for accounting: closing books, quarterly/annual reporting, bookkeeping, etc.) is a must.

Posted 2 weeks ago

Apply

10.0 - 15.0 years

1 - 3 Lacs

Bengaluru, Karnataka, India

On-site

- Lead and mentor a high-performing team of platform engineers, fostering a culture of innovation, ownership, and continuous improvement.
- Define and execute the platform engineering strategy aligned with organizational goals.
- Own the design, implementation, and maintenance of core platform services, including CI/CD, container orchestration, secrets management, and infrastructure as code.
- Collaborate with engineering teams to enhance developer productivity and reliability through self-service tools and platforms.
- Drive platform adoption through well-defined APIs, documentation, and enablement programs.
- Ensure platform services meet SLOs around availability, performance, and security.
- Manage team capacity, hiring, career development, and performance management.
- Stay current with emerging technologies and industry best practices, evaluating and integrating tools and frameworks where appropriate.
- Own and manage operational excellence, including incident response, root cause analysis, and cost optimization.

Bring Your Passion, Do What You Love. Here's What We're Looking For:
- Typically requires a Bachelor's degree in Computer Science and a minimum of 10 years of related experience; or an advanced degree with 6+ years of experience; or equivalent relevant work experience.
- Typically requires 2-5 years managing and developing employees.
- Strong experience with cloud platforms (AWS, Azure, or GCP), Kubernetes, and infrastructure as code (e.g., Terraform, Pulumi).
- Deep understanding of CI/CD pipelines, containerization, and modern DevOps practices.
- Proven ability to build platforms that empower software teams to deliver features quickly and safely.
- Excellent leadership, communication, and cross-functional collaboration skills.
- Experience managing on-call rotations and operating production-grade infrastructure.

Preferred:
- Prior experience leading platform modernization or cloud migration initiatives.
- Familiarity with networking (e.g., Istio/Traefik), GitOps, and event-driven architectures.
- Experience with observability tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog).
- Background in financial services, fintech, or other regulated environments is a plus.

This position requires fluent written and oral communication in English.

- Hybrid work opportunities
- Flexible time off
- Career development & mentoring programs
- Health & wellness benefits, including competitive health insurance offerings and generous paid parental leave for eligible new parents
- Community volunteering & company philanthropy programs
- Employee peer recognition programs ("You Earned It")

Posted 2 weeks ago

Apply

8.0 - 12.0 years

0 Lacs

Chennai, Tamil Nadu, India

On-site

TCS Hiring for Observability Tools Tech Lead - PAN India
Experience: 8 to 12 Years Only
Job Location: PAN India

Core Responsibilities:
- Designing and Implementing Observability Solutions: Selecting, configuring, and deploying tools and platforms for collecting, processing, and analyzing telemetry data (logs, metrics, traces).
- Developing and Maintaining Monitoring and Alerting Systems: Creating dashboards, setting up alerts based on key performance indicators (KPIs), and ensuring timely notification of issues.
- Instrumenting Applications and Infrastructure: Working with development teams to add instrumentation code to applications to generate meaningful telemetry data, often using open standards like OpenTelemetry.
- Analyzing and Troubleshooting System Performance: Investigating performance bottlenecks, identifying root causes of issues, and collaborating with development teams to resolve them.
- Defining and Tracking Service Level Objectives (SLOs) and Service Level Indicators (SLIs): Working with stakeholders to define acceptable levels of performance and reliability, and tracking these metrics.
- Improving Incident Response and Post-Mortem Processes: Using observability data to understand incidents, identify contributing factors, and implement preventative measures.
- Collaborating with Development, Operations, and SRE Teams: Working closely with other teams to ensure observability practices are integrated throughout the software development lifecycle.
- Educating and Mentoring Teams on Observability Best Practices: Promoting a culture of observability within the organization.
- Managing and Optimizing Observability Infrastructure Costs: Ensuring the cost-effectiveness of observability tools and platforms.
- Staying Up to Date with Observability Trends and Technologies: Continuously learning about new tools, techniques, and best practices.

Key Skills:
- Strong Understanding of Observability Principles: Deep knowledge of logs, metrics, and traces and how they contribute to understanding system behavior.
- Proficiency with Observability Tools and Platforms: Logging (Elasticsearch, Splunk, Fluentd, Logstash, etc.), Metrics (Prometheus, Grafana, InfluxDB, Graphite, etc.), Tracing (OpenTelemetry, Datadog APM, etc.), APM (Datadog, New Relic, AppDynamics, etc.).
- Programming and Scripting Skills: Proficiency in languages like Python, Go, or Java, or scripting languages like Bash, for automation and tool integration.
- Experience with Cloud Platforms: Familiarity with cloud providers like AWS, Azure, or GCP and their monitoring and logging services.
- Understanding of Distributed Systems: Knowledge of how distributed systems work and the challenges of monitoring and troubleshooting them.
- Troubleshooting and Problem-Solving Skills: Strong analytical skills to identify and resolve complex issues.
- Communication and Collaboration Skills: Ability to effectively communicate technical concepts to different audiences and work collaboratively with other teams.
- Knowledge of DevOps and SRE Practices: Understanding of continuous integration/continuous delivery (CI/CD), infrastructure as code, and site reliability engineering principles.
- Data Analysis and Visualization Skills: Ability to analyze telemetry data and create meaningful dashboards and reports.
- Experience with Containerization and Orchestration: Familiarity with Docker, Kubernetes, and related technologies.

Kind Regards,
Priyankha M

Posted 2 weeks ago

Apply

8.0 - 12.0 years

0 Lacs

Kolkata, West Bengal, India

On-site

TCS Hiring for Observability Tools Tech Lead - PAN India
Experience: 8 to 12 Years Only
Job Location: PAN India
(Role description and key skills identical to the TCS Observability Tools Tech Lead listing above.)

Posted 2 weeks ago

Apply

8.0 - 12.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

TCS Hiring for Observability Tools Tech Lead - PAN India
Experience: 8 to 12 Years Only
Job Location: PAN India
(Role description and key skills identical to the TCS Observability Tools Tech Lead listing above.)

Posted 2 weeks ago

Apply

8.0 - 12.0 years

0 Lacs

Kochi, Kerala, India

On-site

TCS Hiring for Observability Tools Tech Lead - PAN India
Experience: 8 to 12 Years Only
Job Location: PAN India
(Role description and key skills identical to the TCS Observability Tools Tech Lead listing above.)

Posted 2 weeks ago

Apply

8.0 - 12.0 years

0 Lacs

Pune, Maharashtra, India

On-site

TCS Hiring for Observability Tools Tech Lead - PAN India
Experience: 8 to 12 Years Only
Job Location: PAN India
(Role description and key skills identical to the TCS Observability Tools Tech Lead listing above.)

Posted 2 weeks ago

Apply

8.0 - 12.0 years

0 Lacs

Noida, Uttar Pradesh, India

On-site

TCS Hiring for Observability Tools Tech Lead_PAN India Experience: 8 to 12 Years Only Job Location: PAN India TCS Hiring for Observability Tools Tech Lead_PAN India Required Technical Skill Set: Core Responsibilities: Designing and Implementing Observability Solutions: This involves selecting, configuring, and deploying tools and platforms for collecting, processing, and analyzing telemetry data (logs, metrics, traces). Developing and Maintaining Monitoring and Alerting Systems: Creating dashboards, setting up alerts based on key performance indicators (KPIs), and ensuring timely notification of issues. Instrumenting Applications and Infrastructure: Working with development teams to add instrumentation code to applications to generate meaningful telemetry data. This often involves using open standards like Open Telemetry. Analyzing and Troubleshooting System Performance: Investigating performance bottlenecks, identifying root causes of issues, and collaborating with development teams to resolve them. Defining and Tracking Service Level Objectives (SLOs) and Service Level Indicators (SLIs): Working with stakeholders to define acceptable levels of performance and reliability and tracking these metrics. Improving Incident Response and Post-Mortem Processes: Using observability data to understand incidents, identify contributing factors, and implement preventative measures. Collaborating with Development, Operations, and SRE Teams: Working closely with other teams to ensure observability practices are integrated throughout the software development lifecycle. Educating and Mentoring Teams on Observability Best Practices: Promoting a culture of observability within the organization. Managing and Optimizing Observability Infrastructure Costs: Ensuring the cost-effectiveness of observability tools and platforms. Staying Up to Date with Observability Trends and Technologies: Continuously learning about new tools, techniques, and best practices. 
Key Skills:
- Strong understanding of observability principles: deep knowledge of logs, metrics, and traces and how they contribute to understanding system behavior.
- Proficiency with observability tools and platforms, such as: logging (Elasticsearch, Splunk, Fluentd, Logstash), metrics (Prometheus, Grafana, InfluxDB, Graphite), tracing (OpenTelemetry, Datadog APM), and APM (Datadog, New Relic, AppDynamics).
- Programming and scripting skills: proficiency in languages like Python, Go, or Java, and scripting languages like Bash, for automation and tool integration.
- Experience with cloud platforms: familiarity with AWS, Azure, or GCP and their monitoring and logging services.
- Understanding of distributed systems: knowledge of how distributed systems work and the challenges of monitoring and troubleshooting them.
- Troubleshooting and problem-solving skills: strong analytical skills to identify and resolve complex issues.
- Communication and collaboration skills: ability to effectively communicate technical concepts to different audiences and work collaboratively with other teams.
- Knowledge of DevOps and SRE practices: understanding of continuous integration/continuous delivery (CI/CD), infrastructure as code, and site reliability engineering principles.
- Data analysis and visualization skills: ability to analyze telemetry data and create meaningful dashboards and reports.
- Experience with containerization and orchestration: familiarity with Docker, Kubernetes, and related technologies.
Kind Regards, Priyankha M
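The telemetry concepts this role centers on (logs, metrics, and especially traces) can be illustrated in a few lines of Python. The snippet below is a toy, in-memory tracer for illustration only, not the OpenTelemetry SDK; the span names and record shape are invented.

```python
import time
from contextlib import contextmanager

# Toy in-memory "tracer": each span records its name, parent, and duration.
# Illustration of the trace concept only -- not the OpenTelemetry SDK.
spans = []
_stack = []

@contextmanager
def span(name):
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        spans.append({
            "name": name,
            "parent": parent,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

with span("handle_request"):
    with span("db_query"):
        time.sleep(0.01)  # simulate work

# Child spans close first, so "db_query" is recorded before "handle_request".
```

A real SDK adds trace/span IDs, context propagation across services, and export to a backend; the parent/child nesting shown here is the core idea behind distributed tracing.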

Posted 2 weeks ago

Apply

10.0 years

0 Lacs

Noida, Uttar Pradesh, India

On-site

Job Title: Generative AI Architect
Experience: 10+ Years
Location: Noida, Mumbai, Pune, Chennai, Gurgaon (Hybrid)
Contract Duration: Short Term
Work Time: IST Shift
Job Purpose: We are seeking a highly skilled Generative AI Architect to lead the design, development, and deployment of cutting-edge GenAI solutions across enterprise-grade applications. This role demands deep expertise in large language models (LLMs), prompt engineering, and scalable AI system architecture, along with hands-on experience in MLOps, cloud, and data engineering.
Key Responsibilities:
- Design and implement scalable, secure GenAI solutions using LLMs such as GPT, Claude, LLaMA, or Mistral
- Architect Retrieval-Augmented Generation (RAG) pipelines using LangChain, LlamaIndex, Weaviate, FAISS, or Elasticsearch
- Lead prompt engineering and evaluation frameworks for accuracy, safety, and contextual relevance
- Collaborate with product, engineering, and data teams to integrate GenAI into existing applications and workflows
- Build reusable GenAI modules such as function calling, summarization engines, Q&A bots, and document chat solutions
- Deploy and optimize GenAI workloads on AWS Bedrock, Azure OpenAI, and Vertex AI
- Ensure robust monitoring, logging, and observability using Grafana, OpenTelemetry, and Prometheus
- Apply MLOps practices including CI/CD of AI pipelines, model versioning, validation, and rollback
- Research and prototype innovations such as multi-agent systems, autonomous agents, and fine-tuning methods
- Implement security best practices, data governance, and compliance protocols such as PII masking, encryption, and audit logs
Required Skills & Experience:
- 8+ years in AI/ML with at least 2–3 years in LLMs or Generative AI
- Proficient in Python with experience in Transformers (Hugging Face), LangChain, and OpenAI SDKs
- Strong knowledge of vector databases like Pinecone, Weaviate, FAISS, Qdrant
- Experience working with AWS (SageMaker, Bedrock), Azure (OpenAI), and GCP (Vertex AI)
- Hands-on expertise in RAG pipelines, summarization, and chat-based applications
- Familiarity with LLM orchestration frameworks like LangGraph, AutoGen, CrewAI
- Understanding of MLOps tools: MLflow, Airflow, Docker, Kubernetes, FastAPI
- Exposure to prompt-injection mitigation, hallucination control, and LLMOps practices
- Ability to evaluate GenAI solutions using BERTScore, BLEU, GPTScore
- Strong communication skills with experience in architecture leadership and mentoring
Preferred (Nice to Have):
- Experience fine-tuning open-source LLMs (LLaMA, Mistral, Falcon) using LoRA or QLoRA
- Knowledge of multi-modal AI systems (text-image, voice assistants)
- Domain-specific LLM knowledge in Healthcare, BFSI, Legal, or EdTech
- Contributions to published work, patents, or open-source GenAI projects
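The RAG pipelines mentioned in this listing center on a retrieval step: score stored documents against a query and pass the top hits to the LLM as context. A minimal sketch, using naive keyword overlap in place of real embeddings and a vector store (FAISS, Weaviate, etc.); the corpus and document IDs are invented.

```python
# Toy sketch of the retrieval step in a RAG pipeline: rank documents by
# keyword overlap with the query. Production systems use embeddings and a
# vector store instead; this corpus is illustrative only.
corpus = {
    "doc1": "opentelemetry collectors export metrics logs and traces",
    "doc2": "llama and mistral are open source large language models",
    "doc3": "terraform provisions cloud infrastructure as code",
}

def retrieve(query, k=2):
    q = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q & set(kv[1].split())),  # overlap score
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

top = retrieve("which open source language models exist", k=1)
```

The retrieved text would then be injected into the prompt ("answer using the following context: ..."), which is the "augmented generation" half of RAG.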

Posted 2 weeks ago

Apply

10.0 - 14.0 years

0 Lacs

karnataka

On-site

As a Senior Software DevOps Engineer, you will be responsible for leading the design, implementation, and evolution of telemetry pipelines and DevOps automation to enable next-generation observability for distributed systems. Your main focus will be on leveraging a deep understanding of OpenTelemetry architecture along with strong DevOps practices to construct a reliable, high-performance, and self-service observability platform that spans hybrid cloud environments such as AWS and Azure. Your primary goal will be to provide engineering teams with actionable insights through rich metrics, logs, and traces while promoting automation and innovation at all levels. In your role, you will be involved in the following key activities:
Observability Strategy & Implementation:
- Design and manage scalable observability solutions using OpenTelemetry (OTel), including deploying OTel Collectors for ingesting and exporting telemetry data, guiding teams on instrumentation best practices, building telemetry pipelines for data routing, and utilizing processors and extensions for advanced enrichment and routing.
DevOps Automation & Platform Reliability:
- Take ownership of the CI/CD experience using GitLab Pipelines, integrate infrastructure automation with Terraform, Docker, and scripting in Bash and Python, and develop resilient and reusable infrastructure-as-code modules across AWS and Azure ecosystems.
Cloud-Native Enablement:
- Create observability blueprints for cloud-native applications on AWS and Azure, optimize cost and performance of telemetry pipelines, and ensure SLA/SLO adherence for observability services.
Monitoring, Dashboards, and Alerting:
- Build and maintain role-based dashboards in tools like Grafana and New Relic for real-time visibility into service health and business KPIs, implement alerting best practices, and integrate with incident management systems.
Innovation & Technical Leadership:
- Drive cross-team observability initiatives to reduce MTTR and enhance engineering velocity, lead innovation projects such as self-service observability onboarding and AI-assisted root cause detection, and mentor engineering teams on telemetry standards and operational excellence.
Qualifications and Skills:
- 10+ years of experience in DevOps, Site Reliability Engineering, or Observability roles
- Deep expertise with OpenTelemetry, GitLab CI/CD, Terraform, Docker, and scripting languages (Python, Bash, Go)
- Hands-on experience with AWS and Azure services, cloud automation, and cost optimization
- Proficiency with observability backends such as Grafana, New Relic, Prometheus, and Loki
- Strong passion for building automated, resilient, and scalable telemetry pipelines
- Excellent documentation and communication skills to drive adoption and influence engineering culture
Nice to Have:
- Certifications in AWS, Azure, or Terraform
- Experience with OpenTelemetry SDKs in Go, Java, or Node.js
- Familiarity with SLO management, error budgets, and observability-as-code approaches
- Exposure to event streaming technologies (Kafka, RabbitMQ), Elasticsearch, Vault, and Consul
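The SLA/SLO adherence and error-budget items in this listing rest on simple arithmetic: an availability SLO leaves a fixed fraction of the measurement window as the error budget. A sketch with illustrative numbers (the 12.5 minutes of consumed downtime is invented):

```python
# Error-budget arithmetic behind SLO management (standard SRE practice;
# all numbers here are illustrative). A 99.9% availability SLO over a
# 30-day window leaves 0.1% of the window as the error budget.
def error_budget_minutes(slo, window_days=30):
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

budget = error_budget_minutes(0.999)   # ~43.2 minutes of allowed downtime
consumed = 12.5                        # minutes of downtime so far (example)
remaining_pct = (budget - consumed) / budget * 100
```

Teams typically alert when the remaining budget burns faster than the window elapses, which is what ties SLO tracking back to the alerting practices mentioned above.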

Posted 2 weeks ago

Apply

10.0 - 14.0 years

0 Lacs

andhra pradesh

On-site

We are seeking a highly skilled Technical Architect with expertise in Java Spring Boot, React.js, IoT system architecture, and a strong foundation in DevOps practices. As the ideal candidate, you will play a pivotal role in designing scalable, secure, and high-performance IoT solutions, leading full-stack teams, and collaborating across product, infrastructure, and data teams. Your key responsibilities will include designing and implementing scalable and secure IoT platform architecture, defining data flow and event processing pipelines, architecting microservices-based solutions, and integrating them with React-based front-ends. You will also be responsible for defining CI/CD pipelines, managing containerization & orchestration, driving infrastructure automation, ensuring platform monitoring and observability, and enabling auto-scaling and zero-downtime deployments. In addition, you will need to collaborate with product managers and business stakeholders to translate requirements into technical specs, mentor and lead a team of developers and engineers, conduct code and architecture reviews, set goals and targets, and provide coaching and professional development to team members. Your role will also involve conducting unit testing, identifying risks, using coding standards and best practices to ensure quality, and maintaining a long-term outlook on the product roadmap and its enabling technologies. To be successful in this role, you must have hands-on IoT project experience, experience in designing and deploying multi-tenant SaaS platforms, strong knowledge of security best practices in IoT and cloud, and excellent problem-solving, communication, and team leadership skills. It would be beneficial if you have experience with Edge Computing frameworks, AI/ML model integration into IoT pipelines, exposure to industrial protocols, experience with digital twin concepts, and certifications in relevant technologies.
Ideally, you should have a Bachelor's or Master's degree in Computer Science, Engineering, or a related field. By joining us, you will have the opportunity to lead architecture for cutting-edge industrial IoT platforms, work with a passionate team in a fast-paced and innovative environment, and gain exposure to cross-disciplinary challenges in IoT, AI, and cloud-native technologies.

Posted 2 weeks ago

Apply

3.0 - 7.0 years

0 Lacs

chennai, tamil nadu

On-site

We are seeking a hands-on backend expert to take our FastAPI-based platform to the next level by developing production-grade model-inference services, agentic AI workflows, and seamless integration with third-party LLMs and NLP tooling. In this role, you will be responsible for the following key areas:
1. Core Backend Enhancements:
- Building APIs
- Strengthening security with OAuth2/JWT, rate-limiting, SecretManager, and enhancing observability through structured logging and tracing
- Adding CI/CD, test automation, health checks, and SLO dashboards
2. Awesome UI Interfaces:
- Developing UI interfaces using React.js/Next.js, Redux/Context, and CSS frameworks such as Tailwind, MUI, custom CSS, and Shadcn
3. LLM & Agentic Services:
- Designing micro/mini-services to host and route to platforms such as OpenAI, Anthropic, local HF models, embeddings & RAG pipelines
- Implementing autonomous/recursive agents that orchestrate multi-step chains for tools, memory, and planning
4. Model-Inference Infrastructure:
- Setting up GPU/CPU inference servers behind an API gateway
- Optimizing throughput with techniques like batching, streaming, quantization, and caching using tools like Redis and pgvector
5. NLP & Data Services:
- Managing the NLP stack with Transformers for classification, extraction, and embedding generation
- Building data pipelines that combine aggregated business metrics with model telemetry for analytics
You will be working with a tech stack that includes Python, FastAPI, Starlette, Pydantic, async SQLAlchemy, Postgres, Docker, Kubernetes, AWS/GCP, Redis, RabbitMQ, Celery, Prometheus, Grafana, OpenTelemetry, and more. Experience in building production Python REST APIs, SQL schema design in Postgres, async patterns & concurrency, UI application development, RAG, LLM/embedding workflows, cloud container orchestration, and CI/CD pipelines is essential for this role.
Additionally, experience with streaming protocols, NGINX Ingress, SaaS security hardening, data privacy, event-sourced data models, and related technologies would be advantageous. This role offers the opportunity to work on evolving products, tackle real challenges, and lead the scaling of AI services while working closely with the founder to shape the future of the platform. If you are looking for meaningful ownership and the chance to solve forward-looking problems, this role could be the right fit for you.
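One of the security items in this listing is rate-limiting; a common building block for it is the token bucket. A generic sketch, not tied to any particular FastAPI middleware, with invented capacity and refill parameters:

```python
# Minimal token-bucket rate limiter: a bucket holds up to `capacity` tokens
# and refills at `refill_per_sec`; each request spends one token. Time is
# passed in explicitly so behavior is deterministic. Parameters illustrative.
class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_sec=1)
# Burst of three requests, then one after the bucket has refilled.
decisions = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)]
```

In an API server, `allow()` would be called per client key (e.g., per token or IP) before the handler runs, returning HTTP 429 when it yields False.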

Posted 2 weeks ago

Apply

2.0 - 6.0 years

0 Lacs

karnataka

On-site

As a Senior Information Security Engineer at NTT DATA in Bangalore, Karnataka (IN-KA), India, you will be part of a dynamic team that values exceptional, innovative, and passionate individuals who are eager to grow with us. If you are seeking to join an inclusive, adaptable, and forward-thinking organization, this opportunity is for you. You should have a minimum of 5 years of experience in IT technology, with at least 2 years of hands-on experience in AI/ML, particularly with a strong working knowledge of neural networks. Additionally, you should possess 2+ years of data engineering experience, preferably using tools such as AWS Glue, Cribl, SignalFx, OpenTelemetry, or AWS Lambda. Proficiency in Python coding, including NumPy, vectorization, and TensorFlow, is essential. Moreover, you must have 2+ years of experience leading complex enterprise-wide integration programs as an individual contributor. Preferred qualifications for this role include a background in Mathematics or Physics and technical knowledge of cloud technologies like AWS, Azure, or GCP. Excellent verbal, written, and interpersonal communication skills are highly valued, as well as the ability to deliver strong customer service. NTT DATA is a $30 billion global innovator that serves 75% of the Fortune Global 100. As a Global Top Employer, we have a diverse team of experts in over 50 countries and a robust partner ecosystem. Our services encompass business and technology consulting, data and artificial intelligence, industry solutions, and the development, implementation, and management of applications, infrastructure, and connectivity. Join us as we continue to lead in digital and AI infrastructure globally and help organizations navigate confidently into the digital future. If you are ready to contribute your skills and expertise to a leading technology services provider, apply now and be a part of our journey towards innovation, optimization, and transformation for long-term success.
Visit us at us.nttdata.com to learn more about our organization and the exciting opportunities we offer.

Posted 2 weeks ago

Apply

10.0 years

3 - 8 Lacs

Bengaluru

On-site

As a Senior Software DevOps Engineer, you will lead the design, implementation, and evolution of telemetry pipelines and DevOps automation that enable next-generation observability for distributed systems. You will blend a deep understanding of OpenTelemetry architecture with strong DevOps practices to build a reliable, high-performance, and self-service observability platform across hybrid cloud environments (AWS & Azure). Your mission: empower engineering teams with actionable insights through rich metrics, logs, and traces, while championing automation and innovation at every layer.
WHAT YOU WILL BE DOING
Observability Strategy & Implementation
Architect and manage scalable observability solutions using OpenTelemetry (OTel), encompassing:
- Collectors: Design and deploy OTel Collectors (agent/gateway modes) for ingesting and exporting telemetry across services.
- Instrumentation: Guide teams on auto/manual instrumentation of services (metrics, traces, and logs).
- Export Pipelines: Build telemetry pipelines to route data to backends like Grafana, Prometheus, Loki, New Relic, and Azure Monitor.
- Processors & Extensions: Leverage OTel processors (batching, filtering, resource detection) and extensions for advanced enrichment and routing.
DevOps Automation & Platform Reliability
- Own the CI/CD experience using GitLab Pipelines, integrating infrastructure automation with Terraform, Docker, and scripting in Bash and Python.
- Build resilient and reusable infrastructure-as-code modules across AWS and Azure ecosystems.
- Manage containerized workloads, registries, secrets, and secure cloud-native deployments with best practices.
Cloud-Native Enablement
- Develop observability blueprints for cloud-native apps across AWS (ECS, EC2, VPC, IAM, CloudWatch) and Azure (AKS, App Services, Monitor).
- Optimize cost and performance of telemetry pipelines while ensuring SLA/SLO adherence for observability services.
Monitoring, Dashboards, and Alerting
- Build and maintain intuitive, role-based dashboards in Grafana, New Relic, and similar tools, enabling real-time visibility into service health, business KPIs, and SLOs.
- Implement alerting best practices (noise reduction, deduplication, alert grouping) integrated with incident management systems.
Innovation & Technical Leadership
- Drive cross-team observability initiatives that reduce MTTR and elevate engineering velocity.
- Champion innovation projects, including self-service observability onboarding, log/metric reduction strategies, AI-assisted root cause detection, and more.
- Mentor engineering teams on instrumentation, telemetry standards, and operational excellence.
WHAT YOU BRING
- 10+ years of experience in DevOps, Site Reliability Engineering, or Observability roles.
- Deep expertise with OpenTelemetry, including Collector configurations, receivers/exporters (OTLP, HTTP, Prometheus, Loki), and semantic conventions.
- Proficiency in GitLab CI/CD, Terraform, Docker, and scripting (Python, Bash, Go).
- Strong hands-on experience with AWS and Azure services, cloud automation, and cost optimization.
- Proficiency with observability backends: Grafana, New Relic, Prometheus, Loki, or equivalent APM/log platforms.
- Passion for building automated, resilient, and scalable telemetry pipelines.
- Excellent documentation and communication skills to drive adoption and influence engineering culture.
Nice to Have
- Certifications in AWS, Azure, or Terraform.
- Experience with OpenTelemetry SDKs in Go, Java, or Node.js.
- Familiarity with SLO management, error budgets, and observability-as-code approaches.
- Exposure to event streaming (Kafka, RabbitMQ), Elasticsearch, Vault, and Consul.
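The batching processor this listing mentions reduces to a simple idea: buffer telemetry records and flush them in fixed-size batches to cut the number of export calls. A toy sketch (the real OTel batch processor also flushes on a timer and size-in-bytes limits; this version batches by count only):

```python
# Sketch of count-based batching as done by a telemetry batch processor:
# group records into fixed-size batches before export.
def batch(records, size):
    batches = []
    for i in range(0, len(records), size):
        batches.append(records[i:i + size])
    return batches

# Seven spans with a batch size of three yield batches of 3, 3, and 1.
exported = batch([f"span-{i}" for i in range(7)], size=3)
```

Batching trades a little latency for far fewer network round-trips to the backend, which is why it sits near the front of most collector pipelines.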

Posted 2 weeks ago

Apply

5.0 years

0 Lacs

Bengaluru

On-site

Senior Software Engineer-I (Open Source Collection)
Location: Bengaluru/Noida
Our team: At Sumo Logic, we ingest petabytes of data every day and empower our customers by providing them with extremely reliable and fast tools to derive meaningful insights from their data. The Open Source/OpenTelemetry Collector team provides the next-generation data collector built on OpenTelemetry to simplify and streamline the performance and behavior monitoring of complex distributed systems.
Responsibilities:
- Design and implement features for an extremely robust and lean OpenTelemetry collection engine.
- Quickly diagnose and resolve complex issues in a production environment, drawing on a good hands-on understanding of Kubernetes.
- Write robust code with unit and integration tests.
- Contribute to the upstream OpenTelemetry project.
- Analyze and improve the efficiency, scalability, and reliability of our backend systems.
- Work as a team member, helping the team respond quickly and effectively to business needs.
- Provide CI and on-call production support.
Requirements:
- B.Tech, M.Tech, or Ph.D. in Computer Science or related discipline.
- 5-8 years of industry experience with a proven track record of ownership.
- Experience with Go or another statically typed language (Java, Scala, C++); willingness to learn Go if you don't have the experience.
- Strong communication skills and the ability to work in a team environment.
- Understanding of the performance characteristics of commonly used data structures (maps, lists, trees, etc.).
- Demonstrated ability to learn quickly, solve problems, and adapt to new technologies.
Nice to have:
- Contributions to an open-source project, preferably open-source telemetry collection.
- Familiarity with the monitoring/observability space.
- Working experience with GitHub Actions or other CI pipelines.
- A GitHub account with recent activity and contributions to open-source projects.
- Experience in multi-threaded programming and distributed systems is highly desirable.
- Comfort working with Unix-type operating systems (Linux, OS X).
- Familiarity with Docker, Kubernetes, Helm, Terraform, etc.
- Agile software development experience (test-driven development, iterative and incremental development).
About Us
Sumo Logic, Inc. empowers the people who power modern, digital business. Sumo Logic enables customers to deliver reliable and secure cloud-native applications through its Sumo Logic SaaS Analytics Log Platform, which helps practitioners and developers ensure application reliability, secure and protect against modern security threats, and gain insights into their cloud infrastructures. Customers worldwide rely on Sumo Logic to get powerful real-time analytics and insights across observability and security solutions for their cloud-native applications. For more information, visit www.sumologic.com. Sumo Logic Privacy Policy. Employees will be responsible for complying with applicable federal privacy laws and regulations, as well as organizational policies related to data protection.

Posted 2 weeks ago

Apply

10.0 years

0 Lacs

Pune, Maharashtra, India

On-site

At NiCE, we don’t limit our challenges. We challenge our limits. Always. We’re ambitious. We’re game changers. And we play to win. We set the highest standards and execute beyond them. And if you’re like us, we can offer you the ultimate career opportunity that will light a fire within you. So, what’s the role all about? We are looking for a highly skilled and motivated Site Reliability Engineering (SRE) Manager to lead a team of SREs in designing, building, and maintaining scalable, reliable, and secure infrastructure and services. You will work closely with engineering, product, and security teams to improve system performance, availability, and developer productivity through automation and best practices. How will you make an impact? Build server-side software using Java Lead and mentor a team of SREs; support their career growth and ensure strong team performance. Drive initiatives to improve availability, reliability, observability, and performance of applications and infrastructure. Establish SLOs/SLAs and implement monitoring systems, dashboards, and alerting to measure and uphold system health. Develop strategies for incident management, root cause analysis, and postmortem reporting. Build scalable automation solutions for infrastructure provisioning, deployments, and system maintenance. Collaborate with cross-functional teams to design fault-tolerant and cost-effective architectures. Promote a culture of continuous improvement and reliability-first engineering. Participate in capacity planning and infrastructure scaling. Manage on-call rotations and ensure incident response processes are effective and well-documented. Work in a fast-paced, fluid landscape while managing and prioritizing multiple responsibilities Have you got what it takes? Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field. 10+ years of overall experience in SRE/DevOps roles, with at least 2 years managing technical teams. 
Proficiency in at least one programming language (e.g., Python, Go, Java, C#) and experience with scripting languages (e.g., Bash, PowerShell). Deep understanding of cloud computing platforms (e.g., AWS) and the workings and reliability constraints of prominent services (e.g., EC2, ECS, Lambda, DynamoDB). Experience with infrastructure-as-code tools such as CloudFormation and Terraform. Deep understanding of CI/CD concepts and experience with CI/CD tools such as Jenkins, GitLab CI/CD, or CircleCI. Strong knowledge of containerization technologies (e.g., Docker, Kubernetes) and microservices architecture. Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK). Working experience with the Grafana observability suite (Loki, Mimir, Tempo). Experience implementing the OpenTelemetry protocol in a microservices environment. Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems. Experience with incident management and blameless postmortems, including driving incident response efforts during outages and other critical incidents, resolution, and communication in a cross-functional team setup. Good-to-have skills: Hands-on experience working with large Kubernetes clusters; certification will be an added plus. Administration and/or development experience with standard monitoring and automation tools such as Splunk, Datadog, PagerDuty, and Rundeck. Familiarity with configuration management tools like Ansible, Puppet, or Chef. Certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or equivalent. About NiCE NICE Ltd. (NASDAQ: NICE) software products are used by 25,000+ global businesses, including 85 of the Fortune 100 corporations, to deliver extraordinary customer experiences, fight financial crime and ensure public safety. Every day, NiCE software manages more than 120 million customer interactions and monitors 3+ billion financial transactions.
Known as an innovation powerhouse that excels in AI, cloud and digital, NiCE is consistently recognized as the market leader in its domains, with over 8,500 employees across 30+ countries. NiCE is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, age, sex, marital status, ancestry, neurotype, physical or mental disability, veteran status, gender identity, sexual orientation or any other category protected by law.

Posted 2 weeks ago

Apply

4.0 - 5.0 years

0 Lacs

Mumbai, Maharashtra, India

On-site

Kubernetes Administrator/DevOps - Sr Consultant
Qualification: B.Tech/BE/M.Tech from a reputed university
Location: Mumbai
Responsibilities:
- Design, provision, and manage Kubernetes clusters for applications based on microservices and event-driven architectures
- Ensure seamless integration of applications with Kubernetes-orchestrated environments
- Configure and manage Kubernetes resources including pods, services, deployments, and namespaces
- Monitor and troubleshoot Kubernetes clusters to identify and resolve performance issues, system errors, and other operational challenges
- Implement infrastructure as code (IaC) using tools like Ansible and Terraform for configuration management
- Design and implement cluster and application monitoring using one or a combination of Prometheus, Grafana, OpenTelemetry, and Datadog
- Manage and optimize AWS cloud resources and infrastructure for managed containerized environments (ECR, EKS, Fargate, EC2)
- Ensure high availability, scalability, and security of all infrastructure components
- Monitor system performance, identify bottlenecks, and implement necessary optimizations
- Troubleshoot and resolve complex issues related to the DevOps stack
- Develop and maintain documentation for DevOps processes and best practices
- Stay current with industry trends and emerging technologies to drive continuous improvement
- Create and manage DevOps pipelines, IaC, CI/CD, and cloud platforms
Required Skills (must have):
- 4-5 years of extensive hands-on experience in Kubernetes administration, Docker, Ansible/Terraform, AWS, EKS, and corresponding cloud environments
- Hands-on experience designing and implementing service discovery, service meshes, and load balancers
- Extensive experience defining and creating declarative YAML files for provisioning
- Experience troubleshooting containerized environments using a combination of monitoring tools and logs
- Scripting and automation skills (e.g., Bash, Python) for managing Kubernetes configurations and deployments
- Hands-on experience with Helm charts, API gateways, ingress/egress gateways, and service meshes (Istio, etc.)
- Hands-on experience managing Kubernetes networking (Services, Endpoints, DNS, load balancers) and storage (PVs, PVCs, StorageClasses, provisioners)
- Design, enhance, and implement additional services for centralized observability platforms, ensuring efficient log management based on the Elastic Stack and effective monitoring and alerting powered by Prometheus
- Design and implement CI/CD pipelines; hands-on experience with IaC, Git, and monitoring tools like Prometheus, Grafana, and Kibana
Good to Have Skills:
- Relevant certifications (e.g., Certified Kubernetes Administrator – CKA/CKAD) are a plus
- Experience with cloud platforms (e.g., AWS, Azure, GCP) and their managed Kubernetes services
- Capacity planning for Kubernetes clusters and cost optimization in on-prem and cloud environments
Preferred Experience: 4-5 years of experience in Kubernetes and Docker/containerization
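The declarative YAML provisioning this listing calls out follows a fixed resource shape. The sketch below builds an illustrative Kubernetes Deployment as a plain Python dict (serialized with json to stay dependency-free; in practice this would be a YAML file applied with kubectl, and all names here are made up), and checks the invariant that the selector must match the pod-template labels.

```python
import json

# Illustrative shape of a declarative Kubernetes apps/v1 Deployment.
# All names and the image tag are invented for the example.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web", "namespace": "demo"},
    "spec": {
        "replicas": 3,
        # The selector tells the Deployment which pods it owns; it must
        # match the labels stamped onto the pod template below.
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {
                "containers": [{
                    "name": "web",
                    "image": "nginx:1.27",
                    "ports": [{"containerPort": 80}],
                }],
            },
        },
    },
}

manifest = json.dumps(deployment, indent=2)  # JSON is valid YAML input
```

A mismatch between `spec.selector.matchLabels` and `spec.template.metadata.labels` is one of the most common reasons the API server rejects a Deployment, which is why it is worth checking before applying.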

Posted 2 weeks ago

Apply

0 years

12 - 20 Lacs

Mumbai Metropolitan Region

On-site

Role Overview: As a Backend Developer at LearnTube.ai, you will ship the backbone that powers 2.3 million learners in 64 countries, owning APIs that crunch 1 billion learning events and the AI that supports it with <200 ms latency.
What You'll Do: At LearnTube, we're pushing the boundaries of Generative AI to revolutionise how the world learns. As a Backend Engineer, you will be building the backend for an AI system and working directly on AI. Your roles and responsibilities will include:
- Ship Micro-services: Build FastAPI services that handle ≈800 req/s today and will triple within a year (sub-200 ms p95).
- Power Real-Time Learning: Drive the quiz-scoring & AI-tutor engines that crunch millions of events daily.
- Design for Scale & Safety: Model data (Postgres, Mongo, Redis, SQS) and craft modular, secure back-end components from scratch.
- Deploy Globally: Roll out Dockerised services behind NGINX on AWS (EC2, S3, SQS) and GCP (GKE) via Kubernetes.
- Automate Releases: GitLab CI/CD + blue-green/canary = multiple safe prod deploys each week.
- Own Reliability: Instrument with Prometheus/Grafana, chase 99.9% uptime, trim infra spend.
- Expose Gen-AI at Scale: Publish LLM inference & vector-search endpoints in partnership with the AI team.
- Ship Fast, Learn Fast: Work with founders, PMs, and designers in weekly ship rooms; take a feature from Figma to prod.
What makes you a great fit?
Must-Haves:
- 2+ yrs Python back-end experience (FastAPI)
- Strong with Docker & container orchestration
- Hands-on with GitLab CI/CD, AWS (EC2, S3, SQS) or GCP (GKE/Compute) in production
- SQL/NoSQL (Postgres, MongoDB)
- You've built systems from scratch & have solid system-design fundamentals
Nice-to-Haves:
- k8s at scale, Terraform
- Experience with AI/ML inference services (LLMs, vector DBs)
- Go/Rust for high-perf services
- Observability: Prometheus, Grafana, OpenTelemetry
About Us: At LearnTube, we're on a mission to make learning accessible, affordable, and engaging for millions of learners globally. Using Generative AI, we transform scattered internet content into dynamic, goal-driven courses with: AI-powered tutors that teach live, solve doubts in real time, and provide instant feedback. Seamless delivery through WhatsApp, mobile apps, and the web, with over 1.4 million learners across 64 countries.
Meet The Founders: LearnTube was founded by Shronit Ladhani and Gargi Ruparelia, who bring deep expertise in product development and ed-tech innovation. Shronit, a TEDx speaker, is an advocate for disrupting traditional learning, while Gargi's focus on scalable AI solutions drives our mission to build an AI-first company that empowers learners to achieve career outcomes. We're proud to be recognised by Google as a Top 20 AI Startup and are part of their 2024 Startups Accelerator: AI First Program, giving us access to cutting-edge technology, credits, and mentorship from industry leaders.
Why Work With Us? At LearnTube, we believe in creating a work environment that's as transformative as the products we build. Here's why this role is an incredible opportunity: Cutting-Edge Technology: You'll work on state-of-the-art generative AI applications, leveraging the latest advancements in LLMs, multimodal AI, and real-time systems.
- Autonomy and Ownership: Experience unparalleled flexibility and independence in a role where you'll own high-impact projects from ideation to deployment.
- Rapid Growth: Accelerate your career by working on impactful projects that pack three years of learning and growth into one.
- Founder and Advisor Access: Collaborate directly with founders and industry experts, including the CTO of Inflection AI, to build transformative solutions.
- Team Culture: Join a close-knit team of high-performing engineers and innovators, where every voice matters, and Monday morning meetings are something to look forward to.
- Mission-Driven Impact: Be part of a company that's redefining education for millions of learners and making AI accessible to everyone.

Skills: Python, FastAPI, Amazon Web Services (AWS), MongoDB, CI/CD, Docker, Kubernetes
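The sub-200 ms p95 target mentioned above comes down to computing a latency percentile over request timings. A minimal sketch in Python, assuming latencies have already been collected as a list of per-request timings in milliseconds (the function name and sample data are illustrative, not from the posting; a real service would pull these numbers from its metrics backend):

```python
import statistics

def p95_latency_ms(samples):
    """Return the 95th-percentile latency (ms) from per-request samples."""
    if not samples:
        raise ValueError("no latency samples")
    # statistics.quantiles(n=100) returns the 1st..99th percentile cut
    # points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100)[94]

# Synthetic data: 90 fast requests, 5 slower, 5 outliers
latencies = [120.0] * 90 + [180.0] * 5 + [450.0] * 5
p95 = p95_latency_ms(latencies)
print(f"p95 = {p95:.1f} ms, SLO met: {p95 < 200}")
```

In production this calculation usually lives in the metrics backend (e.g. a Prometheus `histogram_quantile` query) rather than application code, so that alerting and dashboards share one source of truth.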

Posted 2 weeks ago

Apply

10.0 years

0 Lacs

Bengaluru, Karnataka

On-site

Location: Bengaluru, Karnataka, India
Job ID: R-231362
Date posted: 13/07/2025
Job Title: Senior DevOps Engineer (AWS) – Evinova
Global Career Level: E

Only applications based in India will be considered.

Introduction to role:

Are you ready to be part of the future of healthcare? Are you able to think big, be bold, and harness the power of digital and AI to tackle longstanding life sciences challenges? Then Evinova, a new health tech business that is part of the AstraZeneca Group, might be for you! Transform billions of patients' lives through technology, data, and cutting-edge ways of working. You're disruptive, decisive, and transformative. Someone who's excited to use technology to improve patients' health.

We're building a new healthtech business – Evinova, a fully-owned subsidiary of AstraZeneca Group. Evinova delivers market-leading digital health solutions that are science-based, evidence-led, and human experience-driven. Thoughtful risks and quick decisions come together to accelerate innovation across the life sciences sector. Be part of a diverse team that pushes the boundaries of science by digitally empowering a deeper understanding of the patients we're helping. Launch pioneering digital solutions that improve the patients' experience and deliver better health outcomes. Together, we have the opportunity to combine deep scientific expertise with digital and artificial intelligence to serve the wider healthcare community and create new standards across the sector.

Accountabilities:

We are seeking a passionate and experienced Senior DevOps Engineer to lead the transformation of our SaaS platform infrastructure and operations. Join us in leveraging cutting-edge technology, data, and AI to revolutionize life sciences and improve billions of lives globally. In this pivotal role, you will design, implement, and optimize robust cloud-based infrastructure and operational frameworks that enable rapid innovation and deliver exceptional system reliability.
You will also guide and mentor team members, sharing your expertise in AWS CDK automation, Kubernetes, networking, and DevOps best practices.

Key Responsibilities:

- Infrastructure Design & Management: Architect and manage scalable, multi-tenant AWS-based infrastructure using AWS CDK, ensuring modular and maintainable codebases.
- Kubernetes & EKS: Lead the deployment and management of Kubernetes clusters using Amazon EKS, implementing best practices for scalability and security.
- CI/CD Pipelines: Build, manage, and enhance automated CI/CD pipelines to ensure efficient, reliable deployments using tools like ArgoCD and GitHub Actions.
- IAM Role Management: Design, maintain, and optimize IAM roles, policies, and guardrails to ensure least-privilege access across AWS resources.
- Networking: Architect and maintain AWS networking components such as VPCs, Transit Gateway, ALB, and Security Groups, ensuring robust security and performance.
- Security & Compliance: Implement DevSecOps best practices, including IAM security, encryption standards, and compliance with industry regulations (GXP, GDPR, HIPAA, NIST).
- AWS WAF & Firewall Policies: Design and implement firewall policies and AWS WAF configurations to protect applications from web threats.
- Automation: Lead efforts to automate infrastructure provisioning, application releases, and ETL workflows, reducing manual intervention and improving efficiency.
- Monitoring & Incident Response: Develop and implement comprehensive monitoring, logging, and alerting systems using OpenTelemetry, Prometheus, Grafana, AWS CloudWatch, and AWS CloudTrail.
- AWS EventBridge & CloudTrail: Utilize AWS EventBridge for event-driven automation and troubleshoot security and operational issues using AWS CloudTrail.
- Governance & Strategic Input: Drive governance processes, including security reviews, cost optimization, and operational consistency across the platform.
- AWS Control Tower & Multi-Account Management: Manage multiple AWS accounts using AWS Control Tower and best practices for account isolation.
- AI & Machine Learning: Exposure to AI tools and frameworks is a plus.
- Mentorship & Leadership: Mentor and guide junior and mid-level engineers, fostering a culture of learning and collaboration. Provide technical leadership in the adoption of AWS CDK and best practices for cloud automation.
- Collaboration: Partner with cross-functional teams, including product management and security, to align DevOps strategies with business goals and ensure cohesive development and operational workflows.

Essential Skills/Experience:

- Experience: 10+ years in DevOps or cloud infrastructure roles, with significant experience in SaaS and multi-tenant platforms. Proven track record of mentoring team members.
- Cloud Expertise: Expert knowledge of AWS services, including VPC, IAM, EC2, S3, RDS, Lambda, EKS, AWS WAF, AWS EventBridge, and AWS CloudTrail.
- Containerization & Orchestration: Deep proficiency in Docker, Kubernetes, Helm, and associated ecosystem tools.
- CI/CD Proficiency: Expertise in CI/CD tools such as ArgoCD and GitHub Actions.
- Infrastructure as Code (IaC): Advanced experience with AWS CDK (TypeScript preferred) and CloudFormation.
- Networking: Strong understanding of AWS networking services such as VPCs, Transit Gateway, ALB, and Security Groups.
- Security: In-depth knowledge of IAM, AWS KMS, encryption standards, AWS WAF, and security compliance frameworks including NIST.
- Monitoring & Alerting: Extensive experience with OpenTelemetry, Prometheus, Grafana, AWS CloudWatch, and AWS CloudTrail for monitoring and incident response.
- Data & ETL Pipelines: Familiarity with AWS Glue and Managed Kafka for real-time and batch data processing.
- Programming & Automation: Strong scripting and automation skills using TypeScript and Bash.
- Multi-Account AWS Management: Experience managing multiple AWS accounts with AWS Control Tower.
- Communication & Collaboration: Exceptional verbal and written communication skills, with the ability to explain complex technical concepts to diverse stakeholders.

Desired Skills:

- Advanced expertise in AWS CDK, including building complex, reusable constructs and pipelines.
- Familiarity with Projen for automating CDK project configuration and management.
- Hands-on experience with Helm charts and Kubernetes manifests.
- Experience with monitoring and logging tools such as Prometheus, Grafana, and AWS CloudWatch.
- Exposure to multi-tenant SaaS platforms and best practices.
- Experience working with AI tools and frameworks.

Personal Attributes:

- Mentor & Leader: Enjoys mentoring team members and fostering a collaborative, innovation-driven team culture.
- Organized & Adaptable: Able to manage multiple priorities and thrive in a fast-paced environment.
- Innovative: Passionate about leveraging technology to solve complex problems and drive efficiency.
- Customer-Focused: Dedicated to building infrastructure that delivers measurable business and customer value.

When we put unexpected teams in the same room, we unleash bold thinking with the power to inspire life-changing medicines. In-person working gives us the platform we need to connect, work at pace and challenge perceptions. That's why we work, on average, a minimum of three days per week from the office. But that doesn't mean we're not flexible. We balance the expectation of being in the office while respecting individual flexibility. Join us in our unique and ambitious world.

Why Evinova (AstraZeneca)?

Evinova draws on AstraZeneca's deep experience developing novel therapeutics, informed by insights from thousands of patients and clinical researchers. Together, we can accelerate the delivery of life-changing medicines, improve the design and delivery of clinical trials for better patient experiences and outcomes, and think more holistically about patient care before, during and after treatment.
We know that regulators, healthcare professionals and care teams at clinical trial sites do not want a fragmented approach. They do not want a future where every pharmaceutical company provides their own, different digital solutions. They want solutions that work across the sector, simplify their workload and benefit patients broadly. By bringing our solutions to the wider healthcare community, we can help build more unified approaches to how we all develop and deploy digital technologies, better serving our teams, physicians and ultimately patients. Evinova represents a unique opportunity to deliver meaningful outcomes with digital and AI to serve the wider healthcare community and create new standards for the sector.

Join us on our journey of building a new kind of health tech business to reset expectations of what a bio-pharmaceutical company can be. This means we're opening new ways to work, pioneering cutting edge methods and bringing unexpected teams together. Interested? Come and join our journey. Ready to embark on this exciting journey with us? Apply now and be part of a team that is redefining the future of healthcare!

Date Posted: 14-Jul-2025
Closing Date: 31-Jul-2025

AstraZeneca embraces diversity and equality of opportunity. We are committed to building an inclusive and diverse team representing all backgrounds, with as wide a range of perspectives as possible, and harnessing industry-leading skills. We believe that the more inclusive we are, the better our work will be. We welcome and consider applications to join our team from all qualified candidates, regardless of their characteristics. We comply with all applicable laws and regulations on non-discrimination in employment (and recruitment), as well as work authorization and employment eligibility verification requirements.
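The least-privilege IAM role management described in this posting can be made concrete with a policy example. A minimal sketch of an IAM policy granting read-only access to a single S3 bucket; the bucket name and statement ID are placeholders, not values from the posting:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyExampleBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}
```

The pattern scales down from broad managed policies to exactly the actions and resources a workload needs, which is what guardrail reviews in multi-account setups typically check for.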

Posted 2 weeks ago

Apply

8.0 - 12.0 years

0 Lacs

karnataka

On-site

As a Site Reliability Engineering (SRE) Technical Leader on the Network Assurance Data Platform (NADP) team at ThousandEyes, you will be responsible for ensuring the reliability, scalability, and security of cloud and big data platforms. Your role will involve representing the NADP SRE team, working in a dynamic environment, and providing technical leadership in defining and executing the team's technical roadmap. Collaborating with cross-functional teams, including software development, product management, customers, and security teams, is essential. Your contributions will directly impact the success of machine learning (ML) and AI initiatives by ensuring a robust and efficient platform infrastructure aligned with operational excellence.

In this role, you will design, build, and optimize cloud and data infrastructure to ensure high availability, reliability, and scalability of big-data and ML/AI systems. Collaboration with cross-functional teams will be crucial in creating secure, scalable solutions that support ML/AI workloads and enhance operational efficiency through automation. Troubleshooting complex technical problems, conducting root cause analyses, and contributing to continuous improvement efforts are key responsibilities.

You will lead the architectural vision, shape the team's technical strategy and roadmap, and act as a mentor and technical leader to foster a culture of engineering and operational excellence. Engaging with customers and stakeholders to understand use cases and feedback, translating them into actionable insights, and effectively influencing stakeholders at all levels are essential aspects of the role. Utilizing strong programming skills to integrate software and systems engineering, building core data platform capabilities and automation to meet enterprise customer needs, is a crucial requirement.
Developing strategic roadmaps, processes, plans, and infrastructure to efficiently deploy new software components at an enterprise scale while enforcing engineering best practices is also part of the role.

Qualifications for this position include 8-12 years of relevant experience and a bachelor's engineering degree in computer science or its equivalent. Candidates should have the ability to design and implement scalable solutions with a focus on streamlining operations. Strong hands-on experience in Cloud, preferably AWS, is required, along with Infrastructure-as-Code skills, ideally with Terraform and EKS or Kubernetes. Proficiency in observability tools like Prometheus, Grafana, Thanos, CloudWatch, OpenTelemetry, and the ELK stack is necessary. Writing high-quality code in Python, Go, or equivalent programming languages is essential, as well as a good understanding of Unix/Linux systems, system libraries, file systems, and client-server protocols. Experience in building Cloud, Big Data, and/or ML/AI infrastructure, architecting software and infrastructure at scale, and certifications in cloud and security domains are beneficial qualifications for this role.

Cisco emphasizes diversity and encourages candidates to apply even if they do not meet every single qualification. Diverse perspectives and skills are valued, and Cisco believes that diverse teams are better equipped to solve problems, innovate, and create a positive impact.
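Observability requirements like the Prometheus proficiency above usually translate into alert rules over SLIs. A minimal sketch of a Prometheus alerting rule for an elevated 5xx error ratio; the metric name, threshold, and labels are illustrative, not taken from the posting:

```yaml
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 10 minutes"
```

The `for: 10m` clause keeps transient spikes from paging anyone, a common practice when alerting on ratios rather than raw counts.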

Posted 2 weeks ago

Apply

0 years

7 - 10 Lacs

Bengaluru

On-site

We help the world run better

At SAP, we enable you to bring out your best. Our company culture is focused on collaboration and a shared passion to help the world run better. How? We focus every day on building the foundation for tomorrow and creating a workplace that embraces differences, values flexibility, and is aligned to our purpose-driven and future-focused work. We offer a highly collaborative, caring team environment with a strong focus on learning and development, recognition for your individual contributions, and a variety of benefit options for you to choose from.

What You Bring

DevOps / SRE / Cloud Platform & Observability
Core Technical Expertise (Must-Have):

- Automation, Virtualization, Containers: Expert in Linux, Ansible, Terraform, Python, and Bash. Strong proficiency with Docker, Kubernetes (including Helm), and VMware vCenter.
- Cloud Platforms: Proficient in AWS, GCP, Azure, and OpenStack.
- Monitoring & Observability: Expert in Prometheus and Grafana (across all layers: hardware, hypervisor, OS, containers, applications). Hands-on experience with Promtail, Loki, OpenTelemetry, ELK stack, and Jaeger.
- DevOps Tooling: Experienced with ArgoCD and Workflows, GitHub Actions, Jenkins, YAML and JSON, REST APIs, forward and reverse proxies, load balancers, Kubernetes Ingress, and MongoDB.
- ITSM & Alerting Integration: Integration experience with ServiceNow, JIRA, Microsoft Teams, and PagerDuty.
- Network Troubleshooting: Skilled across infrastructure and cloud layers using tools such as OpenSSL, nmap, Wireshark, curl, SSH, SNMP, TLS/SSL, HTTPS, and common Linux and Windows networking commands.

Certifications (Must-Have):

- AWS Certified (Cloud)
- Red Hat Ansible Certified (Automation)
- CNCF Certified Kubernetes Administrator (CKA)

Meet Your Team

SAP Enterprise Cloud Services (ECS) Delivery is a private cloud managed services provider offering SAP applications through the HANA Enterprise Cloud. Our team, ECS Delivery XDU, is responsible for the 24x7 operation of these business-critical SAP systems in the cloud. We are looking for a Senior Linux & Cloud Administrator to support our cloud platform operations (across Azure, AWS, Google Cloud, and SAP data centers) and help drive automation and continuous improvement. You will be primarily responsible for ensuring seamless 24/7 operations across technologies such as Prometheus, Grafana, Kubernetes, Ansible, ArgoCD, AWS, GitHub Actions, and more. Your responsibilities will include network troubleshooting, architecture design, cluster setup and configuration, and development of related automation to deliver world-class cloud services for SAP applications to enterprise customers across the globe.

Bring out your best

SAP innovations help more than four hundred thousand customers worldwide work together more efficiently and use business insight more effectively. Originally known for leadership in enterprise resource planning (ERP) software, SAP has evolved to become a market leader in end-to-end business application software and related services for database, analytics, intelligent technologies, and experience management. As a cloud company with two hundred million users and more than one hundred thousand employees worldwide, we are purpose-driven and future-focused, with a highly collaborative team ethic and commitment to personal development.
Whether connecting global industries, people, or platforms, we help ensure every challenge gets the solution it deserves. At SAP, you can bring out your best.

We win with inclusion

SAP's culture of inclusion, focus on health and well-being, and flexible working models help ensure that everyone – regardless of background – feels included and can run at their best. At SAP, we believe we are made stronger by the unique capabilities and qualities that each person brings to our company, and we invest in our employees to inspire confidence and help everyone realize their full potential. We ultimately believe in unleashing all talent and creating a better and more equitable world. SAP is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to the values of Equal Employment Opportunity and provide accessibility accommodations to applicants with physical and/or mental disabilities. If you are interested in applying for employment with SAP and are in need of accommodation or special assistance to navigate our website or to complete your application, please send an e-mail with your request to the Recruiting Operations Team: Careers@sap.com

For SAP employees: Only permanent roles are eligible for the SAP Employee Referral Program, according to the eligibility rules set in the SAP Referral Policy. Specific conditions may apply for roles in Vocational Training.

EOE AA M/F/Vet/Disability: Qualified applicants will receive consideration for employment without regard to their race, religion, national origin, ethnicity, age, gender (including pregnancy, childbirth, etc.), sexual orientation, gender identity or expression, protected veteran status, or disability. Successful candidates might be required to undergo a background verification with an external vendor.
Requisition ID: 429041 | Work Area: Information Technology | Expected Travel: 0 - 10% | Career Status: Professional | Employment Type: Regular Full Time | Additional Locations: #LI-Hybrid.
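Routine network troubleshooting of the kind listed above (OpenSSL, TLS/SSL, curl) often starts with checking certificate expiry. A small Python sketch using only the standard library; the date-parsing helper is split out so it can be exercised without a live connection, and the host name is illustrative:

```python
import socket
import ssl
import time

def days_until(not_after, now=None):
    """Days until a certificate's notAfter timestamp,
    e.g. 'Jan 10 00:00:00 2030 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400.0

def cert_days_remaining(host, port=443):
    """Fetch the peer certificate for host and return days until expiry."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until(cert["notAfter"])

if __name__ == "__main__":
    # Live check would be: cert_days_remaining("example.com")
    print(days_until("Jan 10 00:00:00 2030 GMT"))
```

In an alerting pipeline, a wrapper like this would typically feed a Prometheus gauge or a PagerDuty integration so certificates are renewed well before the remaining days hit zero.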

Posted 3 weeks ago

Apply

3.0 years

4 - 8 Lacs

Bengaluru

On-site

As a member of our SRE / Platform Engineering team, you will design and maintain observability solutions focused on Splunk, ensuring critical systems are monitored and issues detected early. You will onboard data sources, build searches and dashboards, and help automate observability workflows across cloud-native and hybrid environments.

Responsibilities

- Install and configure Splunk Universal Forwarders and manage basic configuration files (inputs.conf, props.conf, transforms.conf) to onboard logs, metrics, and traces.
- Develop and maintain SPL searches, dashboards, and alerts that provide actionable insights to engineering and operations teams.
- Monitor Splunk platform health, index growth, and license usage; assist with routine upgrades and patching.
- Write simple automation scripts (Python, Bash, or PowerShell) or CI/CD jobs to streamline data onboarding and alert verification.
- Collaborate with DevOps, SRE, and application teams to understand monitoring requirements and continuously improve observability coverage.
- Stay current with emerging observability tools and practices; contribute to evaluations of technologies such as Dynatrace, Datadog, OpenTelemetry, and Grafana.

Qualifications

- 1-3 years of hands-on experience with Splunk Enterprise or Splunk Cloud in production or lab environments.
- Proficiency in crafting basic SPL queries, dashboards, and alerts.
- Familiarity with Linux command-line, networking fundamentals, and at least one public cloud (AWS, Azure, or GCP) or container runtime (Docker/Kubernetes).
- Scripting knowledge in Python, Bash, or PowerShell.
- Strong analytical and troubleshooting skills, plus a desire to learn and grow.
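To make the onboarding and search tasks above concrete, here is a minimal sketch. The file path, index, and sourcetype names are placeholders, not values from this posting. First, an inputs.conf monitor stanza for a Universal Forwarder:

```
[monitor:///var/log/myapp/app.log]
index = myapp_logs
sourcetype = myapp:json
disabled = false
```

And a corresponding SPL search that charts server errors from the onboarded data in 5-minute windows, assuming a numeric `status` field has been extracted:

```
index=myapp_logs sourcetype=myapp:json status>=500
| timechart span=5m count AS errors
```

A search like this is usually saved as a scheduled alert with a threshold on `errors`, which covers the "alert verification" automation mentioned in the responsibilities.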

Posted 3 weeks ago

Apply

Start Your Job Search Today

Browse through a variety of job opportunities tailored to your skills and preferences. Filter by location, experience, salary, and more to find your perfect fit.

Job Application AI Bot


Apply to 20+ Portals in one click

Download Now

Download the Mobile App

Instantly access job listings, apply easily, and track applications.

Featured Companies