Jobs
Interviews

476 OpenTelemetry Jobs

Set up a job alert
JobPe aggregates listings for easy access, but you apply directly on the original job portal.

5.0 years

0 Lacs

Noida, Uttar Pradesh, India

On-site

The Senior Product Architecture Engineer is a lead individual contributor responsible for steering the architectural roadmap of our platform. In this role, you will take ownership of designing and evolving our platform's microservices architecture, integrating advanced features, and guiding cross-team efforts to enhance scalability, observability, and reliability. You'll work hands-on with modern cloud technologies and act as a technical leader, ensuring that our engineering teams follow best practices and that our platform architecture aligns with business goals. Job Responsibilities: • Drive microservices architecture design and evolution, owning the roadmap (service boundaries, integration, tech choices) for scalability, and defining Kubernetes container sizing and resource allocation best practices. • Apply deep expertise in microservices architecture: designing RESTful/event-driven services, defining boundaries, and optimizing communication, with experience in refactoring/greenfield work and cloud resilience patterns (Saga, Circuit Breaker). • Lead platform improvements, overseeing technical enhancements for AI-driven features such as our AI Mapping Tool. • Architect comprehensive observability, deploying metrics, tracing, and logging tools (OpenTelemetry, Prometheus, Grafana, Loki, Tempo) for real-time monitoring and high uptime. • Define container sizing and lead Kubernetes performance benchmarking, analyzing bottlenecks to guide resource tuning and scaling for platform growth. • Provide deployment/infrastructure expertise, guiding Helm for Kubernetes and collaborating on infrastructure needs (Terraform a plus). • Lead tooling/automation enhancements, streamlining deployment via Helm improvements, simpler YAML, and pre-deployment validation to reduce errors. • Lead the evolution to event-driven, distributed workflows, decoupling orchestrators with RabbitMQ and patterns like Saga/pub-sub, and integrating Redis for state/caching to improve fault tolerance and scalability.
Collaborate across teams and stakeholders for architectural alignment, translating requirements into design and partnering for seamless implementation. Mentor engineers on coding, design, and architecture best practices, leading reviews and fostering engineering excellence. Document architecture decisions (diagrams, ADRs), clearly communicating complex technical concepts for roadmap transparency. Required Skills • 5+ years in software engineering, significant experience in designing distributed systems, and a proven track record of improving scalability/maintainability. • Extensive production experience with Kubernetes and Docker, proficient in deploying, scaling, and managing apps on clusters, including cluster management on major cloud platforms. • Proficient in deployment automation/config management, with required Helm charts experience, familiarity with CI/CD/GitOps, and Terraform/IaC exposure as a plus. • Strong experience implementing observability via monitoring/logging frameworks (Prometheus, Grafana, ELK/Loki, tracing), able to instrument applications, and proven in optimizing distributed system performance. • Hands-on with message brokers (RabbitMQ/Kafka) and distributed data stores like Redis, skilled in asynchronous system design and solution selection. • Excellent technical communication and leadership, proven ability to lead architectural discussions and build consensus, comfortable driving projects and collaborating with Agile, cross-functional teams. • Adept at technical documentation/diagrams, with an analytical mindset for evaluating new technologies and foreseeing design impacts on scalability, security, and maintainability.
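This listing names cloud resilience patterns such as Circuit Breaker. As a rough sketch of the idea (class name, thresholds, and defaults are illustrative, not from the listing), a breaker stops calling a failing dependency after repeated errors and probes it again after a cool-down:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `max_failures` consecutive
    failures, then permit one trial call after `reset_timeout` seconds."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Production implementations (e.g. in a service mesh or resilience library) add half-open call limits and per-error-type policies; this only shows the state machine.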

Posted 22 hours ago

Apply

5.0 years

0 Lacs

Chennai, Tamil Nadu, India

On-site

We are seeking a Senior Observability Engineer with strong expertise in Grafana and Python to lead telemetry, monitoring, and automation efforts across our cloud-native infrastructure. This role is critical in shaping our observability strategy, building real-time dashboards, and automating alerting pipelines to ensure high system availability and performance. Key Responsibilities Design, develop, and maintain Grafana dashboards for real-time infrastructure and application monitoring. Build and enhance Python-based automation tools for telemetry data processing, health checks, and alerts. Integrate observability solutions with Azure Monitor, Log Analytics, Prometheus, and OpenTelemetry. Define and implement SLIs, SLOs, and proactive alerting mechanisms. Collaborate with SREs, DevOps, and developers to improve monitoring coverage and incident response. Contribute to infrastructure automation and CI/CD workflows using Python, Git, and DevOps tools. Lead tool selection, observability best practices, and adoption across engineering teams.
Requirements 5+ years of experience in observability, DevOps, or SRE roles Strong hands-on experience with Grafana, including templating, alerting, and data source integration Proficient in Python scripting for automation and data processing Experience with Prometheus, Azure Monitor, Log Analytics, and Kubernetes Familiarity with distributed systems, tracing, and telemetry pipelines Exposure to tools like Loki, OpenTelemetry, ArgoCD, or Terraform is a plus Nice to Have Experience with CI/CD pipelines (Jenkins, Azure DevOps, GitHub Actions) Knowledge of containerized environments (Docker, Kubernetes, AKS) Ability to design cost-efficient monitoring solutions and dashboards Benefits Fun, happy and politics-free work culture built on the principles of lean and self-organisation; Work with large scale systems powering global businesses; Competitive salary and benefits About Mindera At Mindera we use technology to build products we are proud of, with people we love. Software Engineering Applications, including Web and Mobile, are at the core of what we do at Mindera. We partner with our clients to understand their product and deliver high performance, resilient and scalable software systems that create impact for their users and businesses across the world. You get to work with a bunch of great people, where the whole team owns the project together. Our culture reflects our lean and self-organisation attitude. We encourage our colleagues to take risks, make decisions, work in a collaborative way and talk to everyone to enhance communication. We are proud of our work and we love to learn anything and everything while navigating through an Agile, Lean and collaborative environment. Check out our Blog: http://mindera.com/ and our Handbook: http://tinyurl.com/zc599tr Our offices are located: Aveiro, Portugal | Porto, Portugal | Leicester, UK | San Diego, USA | San Francisco, USA | Chennai, India | Bengaluru, India
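The role involves defining SLIs, SLOs, and proactive alerting. The underlying arithmetic can be sketched briefly: an availability SLO implies an error budget, and alerts often fire on how much of it has been spent (the function name and inputs here are hypothetical, not from the listing):

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent for an availability
    SLO, e.g. slo_target=0.999 allows 0.1% of requests to fail."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    spent = failed_requests / allowed_failures
    return max(0.0, 1.0 - spent)
```

For example, with a 99% SLO over 10,000 requests, 25 failures spend a quarter of the 100-failure budget, leaving 75% remaining.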

Posted 1 day ago

Apply

3.0 - 7.0 years

0 Lacs

Maharashtra

On-site

You will be joining StellarTech, an international, rapidly growing product IT company, as a Technical Project Manager (DevOps), where your strong leadership skills will play a key role. Your responsibilities will include managing and prioritizing the backlog of DevOps tasks to ensure efficient execution aligned with company objectives. You will oversee DevOps project planning, tracking, and delivery using Agile methodologies such as Scrum and Kanban. Working closely with engineering leadership, you will define and track Service Level Agreements (SLAs) for the centralized DevOps function. Collaboration across cross-functional teams including Product, Platform, Data, and R&D will be essential for smooth communication and effective coordination. In the realm of DevOps operations and planning, you will coordinate the execution of the DevOps roadmap, which encompasses infrastructure automation, CI/CD improvements, and cloud cost optimizations. You will establish incident management processes, including on-call rotations and tooling setup (such as PagerDuty), to ensure efficient workflows. Improving the DevOps team's efficiency in managing infrastructure as code and facilitating the integration and adoption of DevOps tools and best practices across the organization will be part of your duties. Technical incident management will be a crucial aspect of your role, where you will organize, implement, and manage the technical incident response process to minimize downtime and conduct effective root cause analysis. Owning incident response tooling setup and automation, along with defining and documenting post-mortem processes and best practices for incident handling, will be essential tasks. Monitoring and observability will also fall under your purview, where you will ensure the effective implementation of monitoring tools like Prometheus, Grafana, OpenTelemetry, Datadog, and Sentry or similar solutions.
Driving standardization of monitoring metrics and logging across teams to enhance system reliability will be a key focus area. To excel in this role, you should have proven experience in Project Management using Agile/Scrum methodologies and tools like Jira and Confluence. Strong organizational skills with attention to detail, effective cross-functional communication abilities, stakeholder management, and high stress resilience are essential. Technical expertise in AWS services, DevOps tooling, monitoring, and observability tools is also required. Preferred qualifications include previous experience as a DevOps Engineer or System Administrator, hands-on experience with incident response frameworks and automation, knowledge of cloud security best practices, scripting abilities in Bash, Python, or Go, experience in Linux system administration, familiarity with databases, and DBA and SQL experience. Working with StellarTech will offer you impactful work shaping the company's future, an innovative environment encouraging experimentation, flexibility in a remote or hybrid role, health benefits, AI solutions, competitive salary, work-life balance with flexible paid time off, and a collaborative culture where you will work alongside driven professionals.
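Since this role covers incident management, SLAs, and post-mortems, one metric such a process typically tracks is mean time to restore (MTTR). A minimal stdlib sketch (the input format of paired ISO-8601 timestamps is an assumption for illustration):

```python
from datetime import datetime

def mean_time_to_restore(incidents):
    """Average restore time in minutes, given (opened, resolved)
    ISO-8601 timestamp pairs for each incident."""
    durations = [
        (datetime.fromisoformat(resolved) - datetime.fromisoformat(opened)).total_seconds()
        for opened, resolved in incidents
    ]
    return sum(durations) / len(durations) / 60.0
```

In practice this data would come from the incident tooling's API (e.g. PagerDuty) rather than hand-entered pairs.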

Posted 1 day ago

Apply

10.0 years

0 Lacs

Bangalore Urban, Karnataka, India

On-site

About the Role: As the SRE Architect for Flipkart’s Reliability & Productivity Charter, you will own the vision and strategic roadmap for our Reliability charter—defining what “resilient at scale” means for Flipkart and how we measure success. You will architect and drive key platform initiatives including: ● Centralized Observability Stack: End-to-end design of metrics, tracing, logging, and alerting pipelines to give every engineering team a single pane of glass into system health. ● Public Cloud Management: Define best practices, guardrails, and automation for Flipkart’s multi-region GCP footprint to ensure cost-effective, secure, and compliant operations. ● SRE Platform Innovations: Lead the architecture of chaos engineering (Chaos Platform), mass code migration (CodeLift with OpenRewrite), golden-image enforcement and artifact scanning (ImageScanning), and other next-generation reliability tools. In this role, you will collaborate closely with engineering, product, and operations stakeholders to translate high-level reliability objectives into concrete, scalable systems and processes that empower thousands of engineers to build, deploy, and operate Flipkart’s services with confidence. About Flipkart’s Reliability & Productivity Charter Join a dynamic SRE team focused on elevating Flipkart’s platform resilience, developer productivity, and operational excellence. We build and own the platforms and tooling that enable thousands of engineers to deliver high-quality features at scale and with confidence. 
Key Responsibilities ● Architect & Design ○ Define the end-to-end architecture for centralized observability (metrics, tracing, logs, alerting) and ensure scalability, security, and cost-efficiency ○ Drive the technical roadmap for platforms such as Chaos Platform, CodeLift, and Image Scanning ○ Establish best-practice patterns (golden paths) for multi-region, multi-cloud deployments aligned with BCP/DR requirements ● Platform Delivery & Governance ○ Lead cross-functional design reviews, proof-of-concepts, and production rollouts for new platform components ○ Ensure robust standards for API design, data modeling, and service-level objectives (SLOs) ○ Define and enforce policy as code (e.g., quota management, image enforcement, CI/CD pipelines) ● Technology Leadership & Mentorship ○ Coach and guide SRE Engineers and Platform Engineers on system design, reliability patterns, and performance optimizations ○ Evangelize “shift-left” practices: resilience testing, security scanning (Snyk, Artifactory integration), and automated feedback loops ○ Stay abreast of industry trends (service meshes, event stores, distributed tracing backends) and evaluate their applicability ● Performance & Capacity Planning ○ Collaborate with FinanceOps and CloudOps to optimize public cloud cost, capacity, and resource utilization ○ Define monitoring, alerting, and auto-remediation strategies to maintain healthy error budgets What We’re Looking For ● Experience & Expertise ○ 10+ years in large-scale distributed systems architecture, with at least 3 years in an SRE or platform engineering context ○ Hands-on mastery of observability stacks (Prometheus, OpenTelemetry, Jaeger/Zipkin, ELK/EFK, Grafana, Alertmanager) ○ Proven track record of designing chaos engineering frameworks and non-functional testing workflows ● Technical Skills ○ Deep knowledge of public cloud platforms (GCP preferred), container orchestration (Kubernetes), and IaC (Terraform, Helm) ○ Strong background in language-agnostic 
tooling (Go, Java, Python) and API-driven microservices architectures ○ Familiarity with OpenRewrite for mass code migration and vulnerability management tools (Snyk, Trivy) ● Leadership & Collaboration ○ Demonstrated ability to influence stakeholders across engineering, product, and operations teams ○ Excellent written and verbal communication—able to translate complex architectures into clear, actionable plans ○ Passion for mentoring and growing engineering talent in reliability and productivity best practices
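The charter above includes a Chaos Platform for chaos engineering. At its core, fault injection wraps a dependency so that a fraction of calls fail or stall on purpose, letting teams verify resilience before real outages do. A toy sketch, with all names, rates, and error types hypothetical:

```python
import random
import time

def chaotic(func, failure_rate=0.1, latency_s=0.0, rng=None):
    """Wrap `func` so a fraction of calls raise an injected fault
    (and optionally stall), mimicking chaos-platform experiments."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s:
            time.sleep(latency_s)  # injected latency
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    return wrapper
```

A real chaos platform injects faults at the infrastructure layer (pods, network, disks) under blast-radius controls; the wrapper only illustrates the principle at the function level.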

Posted 1 day ago

Apply

5.0 years

0 Lacs

India

On-site

Job Description Do you enjoy constantly learning and using new technologies? Would you like to impact the digital experiences of millions? Come join our Cloud Technology Group! We're looking for a passionate and strategic Senior Product Manager to lead the Metrics pillar within our Cloud Observability platform. This is a critical leadership role driving the product vision, strategy, and roadmap for how customers observe, understand, and act on their infrastructure and application telemetry data. Partner with the best As the product manager for metrics, you will be responsible for building a scalable and self-service platform that integrates seamlessly across all cloud services. You will define pricing and packaging models, partner with architects and engineering leaders to execute the roadmap, and deliver an exceptional developer and operator experience. As a Senior Product Manager, you will be responsible for: Defining and evolving the long-term vision and product strategy for metrics as part of our observability suite Prioritizing, planning, and delivering features, working closely with engineering, UX, and partner teams across cloud services Ensuring seamless, self-service integration with all core and emerging cloud services, exposing metrics with minimal friction. Driving the creation of a reusable, scalable metrics ingestion and query platform that serves internal and external customers. Developing and iterating on pricing models that are competitive, scalable, and aligned with customer value. Do What You Love To be successful in this role you will: Have 5+ years of product management experience, with at least 3 years in observability or related functions Have technical understanding of metrics, time-series databases, telemetry pipelines, and cloud-native observability practices (Prometheus, OpenTelemetry, etc.) Have excellent written and verbal communication skills. Able to clearly articulate product decisions and influence stakeholders at all levels.
Have knowledge of SaaS pricing models, usage-based billing, and monetization strategies. Have experience delivering platform products that serve external customers. Work in a way that works for you FlexBase, Akamai's Global Flexible Working Program, is based on the principles that are helping us create the best workplace in the world. When our colleagues said that flexible working was important to them, we listened. We also know flexible working is important to many of the incredible people considering joining Akamai. FlexBase gives 95% of employees the choice to work from their home, their office, or both (in the country advertised). This permanent workplace flexibility program is consistent and fair globally, to help us find incredible talent, virtually anywhere. We are happy to discuss working options for this role and encourage you to speak with your recruiter in more detail when you apply. Learn what makes Akamai a great place to work Connect with us on social and see what life at Akamai is like! We power and protect life online, by solving the toughest challenges, together. At Akamai, we're curious, innovative, collaborative and tenacious. We celebrate diversity of thought and we hold an unwavering belief that we can make a meaningful difference. Our teams use their global perspectives to put customers at the forefront of everything they do, so if you are people-centric, you'll thrive here. Working for you Benefits At Akamai, we will provide you with opportunities to grow, flourish, and achieve great things. Our benefit options are designed to meet your individual needs for today and in the future. We provide benefits surrounding all aspects of your life: Your health Your finances Your family Your time at work Your time pursuing other endeavors Our benefit plan options are designed to meet your individual needs and budget, both today and in the future. About Us Akamai powers and protects life online.
Leading companies worldwide choose Akamai to build, deliver, and secure their digital experiences, helping billions of people live, work, and play every day. With the world's most distributed compute platform, from cloud to edge, we make it easy for customers to develop and run applications, while we keep experiences closer to users and threats farther away. Join us Are you seeking an opportunity to make a real difference in a company with a global reach and exciting services and clients? Come join us and grow with a team of people who will energize and inspire you!
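For the metrics pillar this role owns, a foundational concept is deriving a per-second rate from a monotonic counter, which is roughly what PromQL's rate() computes over a time window. A simplified sketch that ignores counter resets and extrapolation:

```python
def per_second_rate(samples):
    """Per-second increase of a monotonic counter across the window
    covered by (timestamp_seconds, value) samples. Real time-series
    engines also handle counter resets and window extrapolation."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)
```

For example, a request counter rising from 100 to 400 over a 60-second window yields 5 requests per second.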

Posted 1 day ago

Apply

5.0 years

0 Lacs

Ahmedabad, Gujarat, India

Remote

Employment Type: Full-Time Location: Onsite | Remote Experience Required: 5+ Years About Techiebutler Techiebutler partners with startup founders and CTOs to deliver high-quality products quickly. We’re a focused team dedicated to execution, innovation, and solving real-world challenges with minimal bureaucracy. Role Overview We’re seeking a Senior Golang Backend Engineer to lead the design and development of scalable, high-performance backend systems. You’ll play a pivotal role in shaping our solutions and tech stack, driving technical excellence, and mentoring the team to deliver robust solutions. Key Responsibilities Design and develop scalable, high-performance backend services using Go Optimize systems for reliability, efficiency, and maintainability Establish technical standards for development and testing Mentor team members and conduct code reviews to enhance code quality Monitor and troubleshoot systems using tools like Datadog and Prometheus Collaborate with cross-functional teams on API design, integration, and architecture. What We’re Looking For Experience: 5+ years in backend development, with 3+ years in Go Cloud & Serverless: Proficient in AWS (Lambda, DynamoDB, SQS) Containerization: Hands-on experience with Docker and Kubernetes Microservices: Expertise in designing and maintaining microservices and distributed systems Concurrency: Strong understanding of concurrent programming and performance optimization Domain-Driven Design: Practical experience applying DDD principles Testing: Proficient in automated testing, TDD, and BDD CI/CD & DevOps: Familiarity with GitLab CI, GitHub Actions, or Jenkins Observability: Experience with the ELK Stack, OpenTelemetry, or similar tools Collaboration: Excellent communication and teamwork skills in Agile/Scrum environments. Why Join Us?
Work with cutting-edge technologies to shape our platform’s future Thrive in a collaborative, inclusive environment that values innovation Competitive salary and career growth opportunities Contribute to impactful projects in a fast-paced tech company. Apply Now If you’re passionate about building scalable systems and solving complex challenges, join our high-performing team! Apply today to be part of Techiebutler’s journey
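The listing above stresses concurrent programming. The production stack is Go, where this would use goroutines and channels, but the common fan-out/fan-in pattern can be sketched in Python with a bounded worker pool (all names illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(worker, items, max_workers=8):
    """Run `worker` over `items` on a bounded thread pool and return
    results in input order: the fan-out/fan-in backend pattern."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))
```

Bounding the pool size is the key design choice: it caps pressure on downstream services the same way a buffered worker pool of goroutines would.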

Posted 2 days ago

Apply

1.0 - 10.0 years

0 Lacs

Karnataka

On-site

Our client values developer experience and quality infrastructure as crucial components in delivering high-performance, resilient, and secure data products. As the Engineering Manager for the Developer Experience & Services team, you will lead an essential engineering group dedicated to enhancing developer productivity, internal tooling, and quality assurance infrastructure. This role is a blend of platform engineering and quality engineering, where your team's focus will be on constructing systems, tools, and automation frameworks that drive engineering velocity, product reliability, and operational excellence. You will play a pivotal role in evolving the core developer platform and executing strategies for test infrastructure, performance benchmarking, fault tolerance verification, and chaos testing. In this leadership position, your responsibilities will include: - Leading and expanding a high-impact team responsible for developer experience, platform tooling, and quality infrastructure. - Owning and advancing the company-wide developer platform, encompassing internal tools for build and deployment, observability, monitoring, alerting, remote dev environments, local dev tooling, and engineering standards. - Developing quality assurance infrastructure such as scalable test automation frameworks, infrastructure for performance testing and benchmarking, chaos engineering and fault injection systems, and support for deployment strategies. - Driving the adoption of engineering best practices in testing, reliability, and continuous delivery. - Collaborating with engineers to identify and alleviate friction points through tooling and automation. - Defining metrics and SLAs for engineering productivity, test coverage, release confidence, and platform uptime to ensure continuous improvement. - Leading technical architecture discussions to ensure the scalability and maintainability of internal platforms and tooling. 
- Cultivating a culture of ownership, experimentation, and learning within the team. Key Requirements: - 10+ years of software engineering experience with a proven track record in building infrastructure or platforms. - At least 1 year in a team leadership or engineering management role. - Customer-centric mindset, growth mindset, and drive for impact. - Strong coding, design, and architectural skills to serve as a technical leader. - Analytical and problem-solving skills. - Proficiency in data-driven metrics for operational excellence. - Excellent oral and written communication skills. - Cross-team communication abilities with a focus on productivity and quality. - Familiarity with tools and frameworks like GitHub Actions, ArgoCD, Spinnaker, Jenkins, Pytest, Selenium, JUnit, JMeter, Locust, Chaos Mesh, Gremlin, Prometheus, Grafana, OpenTelemetry, Elastic Stack. If you have experience in DevX teams and are passionate about making a hands-on impact on transformative projects, please reach out to rajeshwari.vh@careerxperts.com.
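Among the engineering-productivity metrics such a team might define, change failure rate (a DORA-style metric) is a common one: the share of deployments that caused an incident. A hedged sketch with a made-up input format:

```python
def change_failure_rate(deploys):
    """Share of deployments that caused a failure, from a list of
    (deploy_id, caused_incident) pairs. Real pipelines derive this
    from CI/CD and incident-tracking data rather than hand-built lists."""
    if not deploys:
        return 0.0
    failures = sum(1 for _, caused_incident in deploys if caused_incident)
    return failures / len(deploys)
```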

Posted 2 days ago

Apply

0.0 years

0 Lacs

Chennai, Tamil Nadu, India

On-site

We are seeking a skilled Software Engineer with experience in both .NET (C#) and Java to join our team. The ideal candidate should have a solid background in building scalable backend systems, developing APIs, and working across both technology stacks. Hands-on knowledge of microservices architecture, SQL/NoSQL databases, and RESTful services is essential. Experience with OpenTelemetry or similar observability frameworks is highly preferred, as the role involves implementing monitoring and tracing for distributed systems. Familiarity with containerization (Docker/Kubernetes), CI/CD pipelines, and cloud platforms (AWS, Azure, or GCP) is a plus. Strong problem-solving skills, attention to detail, and the ability to collaborate effectively in a fast-paced environment are key to success in this role.

Posted 2 days ago

Apply

5.0 years

0 Lacs

West Bengal

On-site

Job Information Date Opened 30/07/2025 Job Type Full time Industry IT Services Work Experience 5+ Years City Kolkata Province West Bengal Country India Postal Code 700091 About Us We are a fast-growing technology company specializing in current and emerging internet, cloud and mobile technologies. Job Description CodelogicX is a forward-thinking tech company dedicated to pushing the boundaries of innovation and delivering cutting-edge solutions. We are seeking a Senior DevOps Engineer with at least 5 years of hands-on experience in building, managing, and optimizing scalable infrastructure and CI/CD pipelines. The ideal candidate will play a crucial role in automating deployment workflows, securing cloud environments, and managing container orchestration platforms. You will leverage your expertise in AWS, Kubernetes, ArgoCD, and CI/CD to streamline our development processes, ensure the reliability and scalability of our systems, and drive the adoption of best practices across the team. Key Responsibilities: Design, implement, and maintain CI/CD pipelines using GitHub Actions and Bitbucket Pipelines. Develop and manage Infrastructure as Code (IaC) using Terraform for AWS-based infrastructure. Set up and administer SFTP servers on cloud-based VMs using chroot configurations and automate file transfers to S3-backed Glacier. Manage SNS for alerting and notification integration. Ensure cost optimization of AWS services through billing reviews and usage audits. Implement and maintain secure secrets management using AWS KMS, Parameter Store, and Secrets Manager. Configure, deploy, and maintain a wide range of AWS services, including but not limited to: Compute Services o Provision and manage compute resources using EC2, EKS, AWS Lambda, and EventBridge for compute-driven, serverless and event-driven architectures. Storage & Content Delivery o Manage data storage and archival solutions using S3, Glacier, and content delivery through CloudFront.
Networking & Connectivity o Design and manage secure network architectures with VPCs, Load Balancers, Security Groups, VPNs, and Route 53 for DNS routing and failover. Ensure proper functioning of Network Services like TCP/IP, reverse proxies (e.g., NGINX). Monitoring & Observability o Implement monitoring, logging, and tracing solutions using CloudWatch, Prometheus, Grafana, ArgoCD, and OpenTelemetry to ensure system health and performance visibility. Database Services o Deploy and manage relational databases via RDS for MySQL, PostgreSQL, Aurora, and healthcare-specific FHIR database configurations. Security & Compliance o Enforce security best practices using IAM (roles, policies), AWS WAF, Amazon Inspector, GuardDuty, Security Hub, and Trusted Advisor to monitor, detect, and mitigate risks. GitOps o Apply excellent knowledge of GitOps practices, ensuring all infrastructure and application configuration changes are tracked and versioned through Git commits. Architect and manage Kubernetes environments (EKS), implementing Helm charts, ingress controllers, autoscaling (HPA/VPA), and service meshes (Istio), and troubleshoot advanced issues related to pods, services, DNS, and kubelets. Apply best practices in Git workflows (trunk-based, feature branching) in both monorepo and multi-repo environments. Maintain, troubleshoot, and optimize Linux-based systems (Ubuntu, CentOS, Amazon Linux). Support the engineering and compliance teams by addressing requirements for HIPAA, GDPR, ISO 27001, and SOC 2, and ensuring infrastructure readiness. Perform rollback and hotfix procedures with minimal downtime. Collaborate with developers to define release and deployment processes. Manage and standardize build environments across dev, staging, and production. Work cross-functionally with development and QA teams. Lead incident postmortems and drive continuous improvement.
Perform root cause analysis and implement corrective/preventive actions for system incidents. Set up automated backups/snapshots, disaster recovery plans, and incident response strategies. Ensure on-time patching. Mentor junior DevOps engineers. Requirements Required Qualifications: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience. 5+ years of proven DevOps engineering experience in cloud-based environments. Advanced knowledge of AWS, Terraform, CI/CD tools, and Kubernetes (EKS). Strong scripting and automation mindset. Solid experience with Linux system administration and networking. Excellent communication and documentation skills. Ability to collaborate across teams and lead DevOps initiatives independently. Preferred Qualifications: Experience with infrastructure as code tools such as Terraform or CloudFormation. Experience with GitHub Actions is a plus. Certifications in AWS (e.g., AWS DevOps Engineer, AWS SysOps Administrator) or Kubernetes (CKA/CKAD). Experience working in regulated environments (e.g., healthcare or fintech). Exposure to container security tools and cloud compliance scanners. Experience: 5-10 Years Working Mode: Hybrid Job Type: Full-Time Location: Kolkata Benefits Health insurance Hybrid working mode Provident Fund Parental leave Yearly Bonus Gratuity
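For the automation and incident-response work this listing describes, a recurring building block is retrying flaky cloud API calls with exponential backoff instead of failing a pipeline on the first transient error. A minimal sketch (function name and defaults are assumptions; the injectable `sleep` makes it testable):

```python
import time

def retry(func, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `func`, retrying on any exception with exponentially
    growing delays; re-raise after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Production retry helpers (and the AWS SDK's built-in retries) also add jitter and only retry error codes known to be transient.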

Posted 2 days ago

Apply

5.0 years

0 Lacs

Telangana, India

On-site

Ignite the Future of Language with AI at Teradata! What You'll Do: Shape the Way the World Understands Data At Teradata, we're not just managing data; we're unleashing its full potential. Our ClearScape Analytics™ platform and pioneering Enterprise Vector Store are empowering the world's largest enterprises to derive unprecedented value from their most complex data. We're rapidly pushing the boundaries of what's possible with Artificial Intelligence, especially in the exciting realm of autonomous and agentic systems. We’re building intelligent systems that go far beyond automation — they observe, reason, adapt, and drive complex decision-making across large-scale enterprise environments. As a member of our AI engineering team, you’ll play a critical role in designing and deploying advanced AI agents that integrate deeply with business operations, turning data into insight, action, and measurable outcomes. You’ll work alongside a high-caliber team of AI researchers, engineers, and data scientists tackling some of the hardest problems in AI and enterprise software — from scalable multi-agent coordination and fine-tuned LLM applications, to real-time monitoring, drift detection, and closed-loop retraining systems. If you're passionate about building intelligent systems that are not only powerful but observable, resilient, and production-ready, this role offers the opportunity to shape the future of enterprise AI from the ground up. We are seeking a highly skilled Senior AI Engineer to drive the development and deployment of Agentic AI systems with a strong emphasis on AI observability and data platform integration. You will work at the forefront of cutting-edge AI research and its practical application—designing, implementing, and monitoring intelligent agents capable of autonomous reasoning, decision-making, and continuous learning.
Who You'll Work With: Join Forces with the Best Imagine collaborating daily with some of the brightest minds in the company – individuals who champion diversity, equity, and inclusion as fundamental to our success. You'll be part of a cohesive force, laser-focused on delivering high-quality, critical, and highly visible AI/ML functionality within the Teradata Vantage platform. Your insights will directly shape the future of our intelligent data solutions. You'll report directly to the inspiring Sr. Manager, Software Engineering, who will champion your growth and empower your contributions. What Makes You a Qualified Candidate: Skills in Action Architect and implement Agentic AI systems capable of multi-step reasoning, tool use, and autonomous task execution. Build and maintain AI observability pipelines to monitor agent behavior, decision traceability, model drift, and overall system performance. Design and develop data platform components that support real-time and batch processing, data lineage, and high-availability systems for AI training and inference workflows. Integrate LLMs and multi-modal models into robust AI agents using frameworks like LangChain, OpenAI, Hugging Face, or custom stacks. Collaborate with product, research, and MLOps teams to ensure smooth integration between AI agents and user-facing applications. Implement safeguards, feedback loops, and evaluation metrics to ensure AI safety, reliability, and compliance. Passion for staying current with AI research, especially in the areas of reasoning, planning, and autonomous systems. You are an excellent backend engineer who codes daily and owns systems end-to-end. Strong engineering background (Python/Java/Golang, API integration, backend frameworks). Strong system design skills and understanding of distributed systems.
You're obsessive about reliability, debuggability, and ensuring AI systems behave deterministically when needed. Hands-on experience with machine learning & deep learning frameworks: TensorFlow, PyTorch, Scikit-learn. Hands-on experience with LLMs, agent frameworks (LangChain, AutoGPT, ReAct, etc.), and orchestration tools. Experience with AI observability tools and practices (e.g., logging, monitoring, tracing, metrics for AI agents or ML models). Solid understanding of model performance monitoring, drift detection, and responsible AI principles. What You Bring: Passion and Potential A Bachelor's or Master's degree in Computer Science, Engineering, Data Science, or a related field – your academic foundation is key. A genuine excitement for AI and large language models (LLMs) is a significant advantage – you'll be working at the cutting edge! Design, develop, and deploy agentic systems integrated into the data platform. 5+ years of experience in software architecture, backend systems, or AI infrastructure. Strong experience with LLMs, transformers, and tools like OpenAI API, Anthropic Claude, or open-source LLMs. Deep understanding of AI observability (e.g., tracing, monitoring, model explainability, drift detection, evaluation pipelines). Build dashboards and metrics pipelines to track key AI system indicators: latency, accuracy, tool invocation success, hallucination rate, and failure modes. Integrate observability tooling (e.g., OpenTelemetry, Prometheus, Grafana) with LLM-based workflows and agent pipelines. Familiarity with modern data platform architecture. Strong background in distributed systems, microservices, and cloud platforms (AWS, GCP, Azure). Experience in software development (Python, Go, or Java preferred). Familiarity with backend service development, APIs, and distributed systems. Familiarity with containerized environments (Docker, Kubernetes) and CI/CD pipelines.
Bonus: Research experience or contributions to open-source agentic frameworks. You're knowledgeable about open-source tools and technologies and know how to leverage and extend them to build innovative solutions. Preferred Qualifications Experience with tools such as Arize AI, WhyLabs, Traceloop, or Prometheus + custom monitoring for AI/ML. Contributions to open-source agent frameworks or AI infra. Advanced degree (MS/PhD) in Computer Science, Artificial Intelligence, or related field. Experience working with multi-agent systems, real-time decision systems, or autonomous workflows. Why We Think You'll Love Teradata We prioritize a people-first culture because we know our people are at the very heart of our success. We embrace a flexible work model because we trust our people to make decisions about how, when, and where they work. We focus on well-being because we care about our people and their ability to thrive both personally and professionally. We are an anti-racist company because our dedication to Diversity, Equity, and Inclusion is more than a statement. It is a deep commitment to doing the work to foster an equitable environment that celebrates people for all of who they are. Teradata invites all identities and backgrounds in the workplace. We work with deliberation and intent to ensure we are cultivating collaboration and inclusivity across our global organization. We are proud to be an equal opportunity and affirmative action employer. We do not discriminate based upon race, color, ancestry, religion, creed, sex (including pregnancy, childbirth, breastfeeding, or related conditions), national origin, sexual orientation, age, citizenship, marital status, disability, medical condition, genetic information, gender identity or expression, military and veteran status, or any other legally protected status.
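The dashboards this role calls for track indicators such as p50/p95 latency for agent tool calls. As an illustrative, stdlib-only sketch (not Teradata's stack; all names are hypothetical), such percentile indicators can be computed from raw latency samples like this:

```python
import statistics

def latency_summary(samples_ms):
    """Summarize tool-call latencies into dashboard-style indicators."""
    if not samples_ms:
        raise ValueError("no samples")
    # quantiles(n=100) returns the 1st..99th percentile cut points
    q = statistics.quantiles(samples_ms, n=100)
    return {
        "p50": statistics.median(samples_ms),
        "p95": q[94],          # index 94 holds the 95th-percentile cut point
        "max": max(samples_ms),
    }

# Synthetic latencies for a batch of 100 tool invocations (milliseconds)
samples = [120, 130, 95, 210, 180, 150, 600, 140, 135, 125] * 10
summary = latency_summary(samples)
```

In a production pipeline these numbers would come from an exporter (e.g., Prometheus histograms) rather than an in-memory list, but the percentile math is the same.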

Posted 2 days ago

Apply

5.0 years

0 Lacs

Greater Kolkata Area

On-site

CodelogicX is a forward-thinking tech company dedicated to pushing the boundaries of innovation and delivering cutting-edge solutions. We are seeking a Senior DevOps Engineer with at least 5 years of hands-on experience in building, managing, and optimizing scalable infrastructure and CI/CD pipelines. The ideal candidate will play a crucial role in automating deployment workflows, securing cloud environments, and managing container orchestration platforms. You will leverage your expertise in AWS, Kubernetes, ArgoCD, and CI/CD to streamline our development processes, ensure the reliability and scalability of our systems, and drive the adoption of best practices across the team. Key Responsibilities Design, implement, and maintain CI/CD pipelines using GitHub Actions and Bitbucket Pipelines. Develop and manage Infrastructure as Code (IaC) using Terraform for AWS-based infrastructure. Set up and administer SFTP servers on cloud-based VMs using chroot configurations and automate file transfers to S3-backed Glacier. Manage SNS for alerting and notification integration. Ensure cost optimization of AWS services through billing reviews and usage audits. Implement and maintain secure secrets management using AWS KMS, Parameter Store, and Secrets Manager. Configure, deploy, and maintain a wide range of AWS services, including but not limited to: Compute Services Provision and manage compute resources using EC2, EKS, AWS Lambda, and EventBridge for compute-driven, serverless and event-driven architectures. Storage & Content Delivery Manage data storage and archival solutions using S3, Glacier, and content delivery through CloudFront. Networking & Connectivity Design and manage secure network architectures with VPCs, Load Balancers, Security Groups, VPNs, and Route 53 for DNS routing and failover. Ensure proper functioning of Network Services like TCP/IP, reverse proxies (e.g., NGINX).
Monitoring & Observability Implement monitoring, logging, and tracing solutions using CloudWatch, Prometheus, Grafana, ArgoCD, and OpenTelemetry to ensure system health and performance visibility. Database Services Deploy and manage relational databases via RDS for MySQL, PostgreSQL, Aurora, and healthcare-specific FHIR database configurations. Security & Compliance Enforce security best practices using IAM (roles, policies), AWS WAF, Amazon Inspector, GuardDuty, Security Hub, and Trusted Advisor to monitor, detect, and mitigate risks. GitOps Apply excellent knowledge of GitOps practices, ensuring all infrastructure and application configuration changes are tracked and versioned through Git commits. Architect and manage Kubernetes environments (EKS), implementing Helm charts, ingress controllers, autoscaling (HPA/VPA), and service meshes (Istio), and troubleshoot advanced issues related to pods, services, DNS, and kubelets. Apply best practices in Git workflows (trunk-based, feature branching) in both monorepo and multi-repo environments. Maintain, troubleshoot, and optimize Linux-based systems (Ubuntu, CentOS, Amazon Linux). Support the engineering and compliance teams by addressing requirements for HIPAA, GDPR, ISO 27001, SOC 2, and ensuring infrastructure readiness. Perform rollback and hotfix procedures with minimal downtime. Collaborate with developers to define release and deployment processes. Manage and standardize build environments across dev, staging, and production. Manage release and deployment processes across dev, staging, and production. Work cross-functionally with development and QA teams. Lead incident postmortems and drive continuous improvement. Perform root cause analysis and implement corrective/preventive actions for system incidents. Set up automated backups/snapshots, disaster recovery plans, and incident response strategies. Ensure on-time patching. Mentor junior DevOps engineers.
Requirements Required Qualifications: Bachelor's degree in Computer Science, Engineering, or equivalent practical experience. 5+ years of proven DevOps engineering experience in cloud-based environments. Advanced knowledge of AWS, Terraform, CI/CD tools, and Kubernetes (EKS). Strong scripting and automation mindset. Solid experience with Linux system administration and networking. Excellent communication and documentation skills. Ability to collaborate across teams and lead DevOps initiatives independently. Preferred Qualifications Experience with infrastructure as code tools such as Terraform or CloudFormation. Experience with GitHub Actions is a plus. Certifications in AWS (e.g., AWS DevOps Engineer, AWS SysOps Administrator) or Kubernetes (CKA/CKAD). Experience working in regulated environments (e.g., healthcare or fintech). Exposure to container security tools and cloud compliance scanners. Experience: 5-10 Years Working Mode: Hybrid Job Type: Full-Time Location: Kolkata Benefits Health insurance Hybrid working mode Provident Fund Parental leave Yearly Bonus Gratuity
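The monitoring/logging/tracing responsibilities above typically start with correlating log lines to a trace. A minimal stdlib-only sketch of trace-ID-correlated structured logging (illustrative only; a real setup would use the OpenTelemetry SDK and a log shipper such as Loki or the ELK stack):

```python
import json
import logging
import uuid
import contextvars

# Context variable carrying the current trace ID across a request's log calls
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the active trace ID."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_request():
    trace_id_var.set(uuid.uuid4().hex)   # new trace ID per request
    log.info("request started")
    log.info("request finished")         # both lines share the same trace_id

handle_request()
```

Because every log line carries the trace ID, a dashboard query can pivot from a slow trace to its exact log context.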

Posted 3 days ago

Apply

19.0 years

0 Lacs

India

On-site

About Us Fintricity, our core consulting brand, has been at the forefront of technology and digital transformation for over 19 years, most recently with a focus on bringing innovation to leading brands through expertise in big data, analytics and technologies (such as Generative AI) that are transforming operations, products, services, business models and sectors. Pivoting to an AI services and venture firm, Fintricity is working on a range of exciting new projects and ventures to build new decentralised business models, technology platforms and disrupt and transform multiple industries. PLEASE DO NOT SEND A CONNECT ON LINKEDIN OR YOUR APPLICATION WILL BE IMMEDIATELY REJECTED. TAKE THE NEXT STEP: Are you truly collaborative? Succeeding at Fintricity means respecting, understanding and trusting colleagues and clients. Challenging others and being challenged in return. Being a strong communicator, passionate, entrepreneurial and innovative about what you do. Driving yourself forward, always wanting to do things the right way. Does that sound like you? Together. That's how we do things. We offer a supportive, challenging and diverse working environment. We value your passion and commitment, and reward your performance. Keen to achieve the work-life agility that you desire? We're open to discussing how this could work for you (and us). About the Role We are looking for a Senior Python Engineer (Full Stack) to design and build the next generation of AgenticOps platforms, enabling scalable, observable, and safe deployment of autonomous AI agents and LLM workflows. This is a hands-on engineering role with broad responsibilities—from building backend services that orchestrate LLM pipelines, to designing web UIs that monitor, audit, and control agentic behavior. You'll work at the intersection of Generative AI, DevOps, and full-stack development, helping to productize and operationalize intelligent agents across cloud environments.
Responsibilities Backend Engineering (Python) Build modular services for prompt orchestration, vector DB interactions, agent memory/state stores, and retrieval pipelines. Create scalable APIs for agent task execution, human-in-the-loop control, and logging. AgenticOps & LLM Integration Develop tools to version, deploy, monitor, and roll back LLM agents and RAG-based workflows. Work with cloud-based LLM APIs (OpenAI, Claude, Gemini) and open-source models (LLaMA, Mistral, etc.) via LangChain or similar. DevOps & Infrastructure-as-Code Automate deployment pipelines (Docker, K8s, Terraform), observability stacks (Prometheus, Grafana), and secure API gateways. Own CI/CD and testing for both code and LLM-based behavior. Frontend Development (React, Next.js, or Vue preferred) Build intuitive UIs for monitoring agents in real time, visualizing traces, showing decision logs, version history, and alerts. Enable role-based access, interactive prompt testing, and behavior auditing. Data & Observability Create feedback collection pipelines, behavior traces, token usage dashboards, and incident response views. Integrate OpenTelemetry, custom metrics, and structured logging. Collaboration & Leadership Work cross-functionally with AI scientists, DevOps engineers, and product managers. Provide mentorship and guide best practices in full-stack design and Pythonic architecture.
Requirements 5+ years in Python-based full-stack or backend development Solid experience with modern web frameworks (e.g., FastAPI, Django, Flask) Proficiency in frontend frameworks like React, Vue, or Svelte Strong DevOps/infra skills: Docker, Kubernetes, Terraform, GitHub Actions or similar Experience with LLM APIs and frameworks (LangChain, LlamaIndex, AutoGen, CrewAI) Familiarity with vector DBs (FAISS, Weaviate, Pinecone) and retrieval-based architectures Experience building monitoring/observability tooling (Grafana, Prometheus, ELK) Understanding of secure API design, authN/authZ, and agent sandboxing best practices Nice to Haves Experience building agent frameworks or workflow orchestration UIs Familiarity with LLMOps and Responsible AI patterns (prompt auditing, guardrails, evals) Contributions to open-source infra or AI tooling Experience with serverless (e.g., AWS Lambda, Google Cloud Functions) What You’ll Get Opportunity to shape the agent infrastructure stack of the future High-impact role in a cross-disciplinary team at the cutting edge of LLMs and automation Access to GPU/LLM infra and modern development tools Competitive compensation, flexible work, and continuous learning budget
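The retrieval-based architectures listed above ultimately rest on nearest-neighbor search over embeddings; a vector DB like FAISS, Weaviate, or Pinecone does this at scale, but the core operation is just similarity ranking. A toy stdlib sketch (the embeddings are hand-made stand-ins, not real model output, and all names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "index": document -> embedding (real systems use model-generated vectors)
index = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
    "account login":  [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    ranked = sorted(index, key=lambda doc: cosine(index[doc], query_vec), reverse=True)
    return ranked[:k]

top = retrieve([0.8, 0.2, 0.1])
```

A production pipeline replaces the dict with an approximate-nearest-neighbor index and feeds the retrieved documents into the LLM prompt.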

Posted 3 days ago

Apply

8.0 years

0 Lacs

India

Remote

Voice AI Scheduling (Scale-Ready, Multi-Tenant) — Remote Company: Apex Dental Systems Location: Remote (must overlap 7+ hours with 8am–5pm Pacific / America/Vancouver) Type: Full-Time Engineer with ramp-up into CTO Compensation: Engineer: $2,000 USD/month + 1% equity Upon promotion to CTO: $4,000 USD/month + 2% equity About Us Apex Dental Systems builds voice AI reception for dental/orthodontic clinics. We connect real phone calls (Retell AI + telephony) to booked appointments via NexHealth and, over time, direct PMS connectors. We're moving from pilot to scale across 50–100+ clinics with high reliability and tight cost control. The Mission Own the scale-ready backend platform: multi-tenant onboarding automation, secure configuration management, rate limits and retries, SLO-backed reliability, cost observability, and compliance (HIPAA/PIPEDA). Your work allows us to onboard dozens of clinics per week with minutes, not days, of setup. Outcomes You'll Deliver in the First 4–6 Weeks Multi-tenant architecture with tenant isolation, role-based access (RBAC), and per-clinic secrets (env-less runtime or AWS Secrets Manager). Onboarding automation that reduces per-clinic setup to ≤60 minutes: provider/location/appointment-type sync, ID mapping, test calls, and health checks. Hardened tool endpoints used by the voice agent (Retell function calling): availability_search, appointment_book, appointment_reschedule, appointment_cancel, patient_find_or_create, note_create, warm_transfer. Reliability controls: idempotency keys, timeouts, retries with backoff, circuit breakers; graceful fallbacks + warm transfer. Observability & SLOs: structured logs, metrics, tracing; dashboards for p50/p95 latency, error rates, booking success %, transfers, cost per minute/call; alerts to Slack. Security & compliance: PHI minimization, at-rest and in-transit encryption, access logging, data-retention policy, BAA-aware configuration.
Cost guardrails: per-tenant budget meters for voice minutes/LLM/TTS usage and anomaly alerts. KPIs you'll move: Median tool-call latency < 800 ms (p95 < 1500 ms) ≥ 80% booking/reschedule success without human handoff (eligible calls) 99.9%+ middleware availability < 1% tool-level error rate (after retries) ≤ 60 min time-to-onboard a new clinic (target 30 min by week 6) Responsibilities Design, implement, and document multi-tenant REST/JSON services consumed by the voice agent. Integrate NexHealth now; design extension points for direct PMS (OpenDental/Dentrix/Eaglesoft/Dolphin) later. Build sync jobs to keep providers/locations/appointment types up to date (with caching via Redis, invalidation, and backfills). Implement idempotent booking flows with conflict detection and safe retries; log every state transition. Stand up observability (metrics/logs/traces) and alerting; define SLOs/SLA and on-call basics. Ship CI/CD with linting, tests (unit, contract, integration), and minimal load tests. Enforce secrets management, least-privilege IAM, and a clean audit trail. Partner with our conversation designer to refine tool schemas and edge-case flows (insurance screening, multi-location routing). Mentor a mid-level engineer and coordinate with ops for smooth rollouts. Minimum Qualifications 5–8+ years building production backend systems (you've owned a system in prod). Expert in Node.js (TypeScript, Nest/Express) or Python (FastAPI). Deep experience with external API integrations (auth, pagination, rate limits, webhooks). Postgres (schema design, migrations) and Redis (caching, locks). Production reliability patterns: retries/backoff, timeouts, idempotency, circuit breakers. Observability: metrics, tracing, log correlation; incident triage. Security/compliance mindset; comfortable handling sensitive data flows. Strong written English; crisp architectural docs and PRs. Nice-to-Have Retell AI (or similar voice/LLM with function calling and barge-in), Twilio/SIP.
NexHealth or other healthcare scheduling APIs; PMS/EHR familiarity. HIPAA/PIPEDA exposure, SOC 2-style controls. OpenTelemetry, Prometheus/Grafana, Sentry; AWS/GCP; Terraform; Docker/Kubernetes. High-volume, low-latency systems experience. Our Stack (target) Runtime: Node.js (TypeScript) or Python (FastAPI) Data: Postgres, Redis Infra: AWS (ECS/EKS or Fargate), Terraform, GitHub Actions Integrations: Retell AI (voice), NexHealth (scheduling), Twilio/SIP (telephony) Observability: OpenTelemetry + Prometheus/Grafana or cloud provider equivalents How We Work Remote-first; async-friendly; 4+ hours overlap with Pacific time. Code in company repos, NDAs/PIAs/BAAs, DCO/CLA, and strict access hygiene. We optimize for reliability and patient privacy over quick hacks. Interview Process (fast, 7–10 days) Intro (20–30 min): Your background, past scale/reliability wins. Take-home (90 min, paid for finalists): Implement availability_search + appointment_book against a stubbed NexHealth-like API. Include idempotency keys, retries with backoff, timeouts, and basic tests. Provide a short runbook and a dashboard sketch for p95 latency & error-rate alerts. Deep-dive (60 min): Review your code; discuss multi-tenant design, secrets, SLOs, and cost control. Final (30–45 min): Collaboration & comms. How to Apply Email info@apexdentalsystems.com with subject "Senior Backend — Scale-Ready Voice AI" and include: CV + GitHub/portfolio 5–10 lines on a system you made multi-tenant (what changed?) A time you prevented double bookings or handled idempotency at scale Your preferred stack (Node+TS or Python), availability, and comp expectations
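The reliability controls this posting centers on (idempotency keys, timeouts, retries with exponential backoff) compose naturally. A minimal illustrative sketch, not Apex's actual middleware; `BookingStore`, `with_backoff`, and `flaky_book` are hypothetical names:

```python
import time
import random

class BookingStore:
    """Caches results by idempotency key so retried requests never double-book."""
    def __init__(self):
        self._results = {}

    def book(self, idempotency_key, slot, attempt_fn):
        if idempotency_key in self._results:   # replayed request: return cached outcome
            return self._results[idempotency_key]
        result = attempt_fn(slot)              # may raise; nothing cached on failure
        self._results[idempotency_key] = result
        return result

def with_backoff(fn, retries=3, base_delay=0.01):
    """Retry a flaky call with exponential backoff and jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Simulated flaky downstream (e.g., a scheduling API) that fails exactly once
calls = {"n": 0}
def flaky_book(slot):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient upstream error")
    return {"confirmed": slot}

store = BookingStore()
result = with_backoff(lambda: store.book("key-123", "09:30", flaky_book))
replay = store.book("key-123", "09:30", flaky_book)   # same key: no second booking
```

The key property: the retry loop and a client replay both hit the same idempotency key, so the downstream booking executes at most once per key.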

Posted 3 days ago

Apply

6.0 - 10.0 years

0 Lacs

hyderabad, telangana

On-site

The role of HV Product at Hitachi Vantara is pivotal in the development of the VSP 360 platform's on-premises solution, ensuring strict adherence to delivery objectives. The VSP 360 platform is the cornerstone of the organization's management solution strategy. As a member of our global team, you will play a key role in empowering businesses to automate, optimize, innovate, and wow their customers with high-performance data infrastructure. To excel in this role, you should possess a Bachelor's degree in computer science or a related field, along with 6+ years of experience in DevOps or a related field. Your strong experience with cloud-based services, running Kubernetes as a service, managing Kubernetes clusters, and infrastructure automation and deployment tools such as Terraform, Ansible, Docker, Jenkins, GitHub, and GitHub Actions will be vital in driving the success of the VSP 360 platform. Additionally, your familiarity with monitoring tools like Grafana, Nagios, ELK, OpenTelemetry, Prometheus, Anthos/Istio Service Mesh, Cloud Native Computing Foundation (CNCF) projects, Kubernetes Operators, KeyCloak, and Linux systems administration will be highly beneficial. It would be advantageous to have proficiency in Python, Django, AWS solution design, cloud-based storage (S3, Blob, Google Storage), and storage area networks (SANs). At Hitachi Vantara, we value diversity, equity, and inclusion, as they are integral to our culture and identity. We encourage individuals from all backgrounds to apply, as we believe that diverse thinking and a commitment to allyship lead to powerful results. As part of our team, you will be supported with industry-leading benefits, services, and flexible arrangements to ensure your holistic health and wellbeing. We champion life balance and offer autonomy, freedom, and ownership in your work. Join us in co-creating meaningful solutions to complex challenges and becoming a data-driven leader that positively impacts industries and society. 
If you are passionate about innovation and believe in inspiring the future, Hitachi Vantara is the place for you to fulfill your purpose and reach your full potential.

Posted 3 days ago

Apply

0.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

About the Role We are looking for an experienced DevOps Engineer to join our engineering team. This role involves setting up, managing, and scaling development, staging, and production environments both on the AWS cloud and on-premises (open-source stack). You will be responsible for CI/CD pipelines, infrastructure automation, monitoring, container orchestration, and model deployment workflows for our enterprise applications and AI platform. Key Responsibilities Infrastructure Setup & Management Design and implement cloud-native architectures on AWS and be able to manage on-premises open-source environments when required. Automate infrastructure provisioning using tools like Terraform or CloudFormation. Maintain scalable environments for dev, staging, and production. CI/CD & Release Management Build and maintain CI/CD pipelines for backend, frontend, and AI workloads. Enable automated testing, security scanning, and artifact deployments. Manage configuration and secret management across environments. Containerization & Orchestration Manage Docker-based containerization and Kubernetes clusters (EKS, self-managed K8s). Implement service mesh, auto-scaling, and rolling updates. Monitoring, Security, and Reliability Implement observability (logging, metrics, tracing) using open-source or cloud tools. Ensure security best practices across infrastructure, pipelines, and deployed services. Troubleshoot incidents, manage disaster recovery, and support high availability. Model DevOps / MLOps Set up pipelines for AI/ML model deployment and monitoring (LLMOps). Support data pipelines, vector databases, and model hosting for AI applications. Required Skills and Qualifications Cloud & Infra Strong expertise in AWS services: EC2, ECS/EKS, S3, IAM, RDS, Lambda, API Gateway, etc. Ability to set up and manage on-premises or hybrid environments using open-source tools. DevOps & Automation Hands-on experience with Terraform/CloudFormation.
Strong skills in CI/CD tools such as GitHub Actions, Jenkins, GitLab CI/CD, or ArgoCD. Containerization & Orchestration Expertise with Docker and Kubernetes (EKS or self-hosted). Familiarity with Helm charts, service mesh (Istio/Linkerd). Monitoring / Observability Tools Experience with Prometheus, Grafana, ELK/EFK stack, CloudWatch. Knowledge of distributed tracing tools like Jaeger or OpenTelemetry. Security & Compliance Understanding of cloud security best practices. Familiarity with tools like Vault, AWS Secrets Manager. Model DevOps / MLOps Tools (Preferred) Experience with MLflow, Kubeflow, BentoML, Weights & Biases (W&B). Exposure to vector databases (pgvector, Pinecone) and AI pipeline automation. Preferred Qualifications Knowledge of cost optimization for cloud and hybrid infrastructures. Exposure to infrastructure as code (IaC) best practices and GitOps workflows. Familiarity with serverless and event-driven architectures. Education Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience). What We Offer Opportunity to work on modern cloud-native systems and AI-powered platforms. Exposure to hybrid environments (AWS and open source on-prem). Competitive salary, benefits, and growth-oriented culture.
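Managing configuration and secrets across dev/staging/prod, as described above, commonly uses indirection: config files hold references, and the real value is resolved at deploy time from a secret store (AWS Secrets Manager, Parameter Store, Vault, etc.). A minimal sketch using environment variables as the stand-in store (all names and the `secret://` scheme are hypothetical):

```python
import os

# Per-environment config: plain values plus secret references
CONFIG = {
    "dev":     {"db_url": "postgres://localhost/dev",  "api_key": "secret://DEV_API_KEY"},
    "staging": {"db_url": "postgres://staging-db/app", "api_key": "secret://STG_API_KEY"},
}

def resolve(config, secret_store):
    """Replace secret:// references with values looked up in the secret store."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str) and value.startswith("secret://"):
            ref = value[len("secret://"):]
            if ref not in secret_store:
                raise KeyError(f"missing secret: {ref}")
            resolved[key] = secret_store[ref]
        else:
            resolved[key] = value
    return resolved

# Environment variables stand in for Secrets Manager / Parameter Store here
os.environ["DEV_API_KEY"] = "dev-123"
dev_cfg = resolve(CONFIG["dev"], os.environ)
```

Failing fast on a missing reference is the point: a pipeline step can run `resolve` before deployment and abort if any secret is unprovisioned.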

Posted 3 days ago

Apply

0.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

Job Summary We are looking for a highly skilled and adaptable Site Reliability Engineer to become a key member of our Cloud Engineering team. In this crucial role, you will be instrumental in designing and refining our cloud infrastructure with a strong focus on reliability, security, and scalability. As an SRE, you'll apply software engineering principles to solve operational challenges, ensuring the overall operational resilience and continuous stability of our systems. This position requires a blend of managing live production environments and contributing to engineering efforts such as automation and system improvements. Key Responsibilities: Cloud Infrastructure Architecture and Management: Design, build, and maintain resilient cloud infrastructure solutions to support the development and deployment of scalable and reliable applications. This includes managing and optimizing cloud platforms for high availability, performance, and cost efficiency. Enhancing Service Reliability: Lead reliability best practices by establishing and managing monitoring and alerting systems to proactively detect and respond to anomalies and performance issues. Utilize SLI, SLO, and SLA concepts to measure and improve reliability. Identify and resolve potential bottlenecks and areas for enhancement. Driving Automation and Efficiency: Contribute to the automation, provisioning, and standardization of infrastructure resources and system configurations. Identify and implement automation for repetitive tasks to significantly reduce operational overhead. Develop Standard Operating Procedures (SOPs) and automate workflows using tools like Rundeck or Jenkins. Incident Response and Resolution: Participate in and help resolve major incidents, conduct thorough root cause analyses, and implement permanent solutions. Effectively manage incidents within the production environment using a systematic problem-solving approach.
Collaboration and Innovation: Work closely with diverse stakeholders and cross-functional teams, including software engineers, to integrate cloud solutions, gather requirements, and execute Proof of Concepts (POCs). Foster strong collaboration and communication. Guide designs and processes with a focus on resilience and minimizing manual effort. Promote the adoption of common tooling and components, and implement software and tools to enhance resilience and automate operations. Be open to adopting new tools and approaches as needed. Required Skills and Experience: Cloud Platforms: Demonstrated expertise in at least one major cloud platform (AWS, Azure, or GCP). Infrastructure Management: Proven proficiency in on-premises hosting and virtualization platforms (VMware, Hyper-V, or KVM). Solid understanding of storage internals (NAS, SAN, EFS, NFS) and protocols (FTP, SFTP, SMTP, NTP, DNS, DHCP). Experience with networking and firewall technologies. Strong hands-on experience with Linux internals and operating systems (RHEL, CentOS, Rocky Linux). Experience with Windows operating systems to support varied environments. Extensive experience with containerization (Docker) and orchestration (Kubernetes) technologies. Automation & IaC: Proficiency in scripting languages (shell and Python). Experience with configuration management tools (Ansible or Puppet). Must have exposure to Infrastructure as Code (IaC) tools (Terraform or CloudFormation). Monitoring & Observability: Experience setting up and configuring monitoring tools (Prometheus, Grafana, or the ELK stack). Hands-on experience implementing OpenTelemetry for observability. Familiarity with monitoring and logging tools for cloud-based applications. Service Reliability Concepts: A strong understanding of SLI, SLO, SLA, and error budgeting. Soft Skills & Mindset: Excellent communication and interpersonal skills for effective teamwork. 
We value proactive individuals who are eager to learn and adapt in a dynamic environment. Must possess a pragmatic and adaptable mindset, with a willingness to step outside comfort zones and acquire new skills. Ability to consider the broader system impact of your work. Must be a change advocate for reliability initiatives. Desired/Bonus Skills: Experience with DevOps toolchain elements like Git, Jenkins, Rundeck, ArgoCD, or Crossplane. Experience with database management, particularly MySQL and Hadoop. Knowledge of cloud cost management and optimization strategies. Understanding of cloud security best practices, including data encryption, access controls, and identity management. Experience implementing disaster recovery and business continuity plans. Familiarity with ITIL (Information Technology Infrastructure Library) processes.
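The SLI/SLO/error-budget concepts this role requires reduce to simple arithmetic: a 99.9% availability SLO over a 30-day window leaves about 43.2 minutes of allowed downtime. An illustrative sketch of the budget and burn-rate math (function names are hypothetical):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed downtime for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(downtime_minutes, slo, window_days=30):
    """Fraction of the error budget consumed: 1.0 means exactly on budget."""
    return downtime_minutes / error_budget_minutes(slo, window_days)

budget = error_budget_minutes(0.999)   # 99.9% over 30 days -> ~43.2 minutes
rate = burn_rate(21.6, 0.999)          # 21.6 min of downtime -> half the budget
```

Alerting policies are often expressed on top of this: for example, page when the projected burn rate over a short window exceeds some multiple of 1.0.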

Posted 3 days ago

Apply

9.0 years

0 Lacs

India

Remote

Job Description JB-4: Senior Lead Engineer Our Purpose: At Majesco, we believe in connecting people and business to Insurance in ways that are Innovative, Hyper-Relevant, Compelling and Personal. We bring together the brightest minds to build the future of Insurance; a world where Insurance makes life and business easier, more connected, and better protected. We are seeking a Senior Lead Engineer to deliver platform-scale automation across insurance domains by integrating advanced AI tools, orchestrated workflows, and modular service components. This role sits at the intersection of cloud-native backend engineering, AI-driven experiences, and cross-functional automation. All About the Role: Lead the development and scaling of document ingestion pipelines, classifier engines, and API gateways supporting intelligent P&C workflows. Build modular backend services using FastAPI, Django, or Flask, leveraging asynchronous design, microservices, and cloud-native scalability. Design and deploy event-driven automation using Azure Logic Apps, Functions, and Service Bus across claims, billing, and policy processes. Containerize services using Docker, deploy and scale on Kubernetes, and ensure high availability through best practices. Integrate platform services with Microsoft Copilot Studio and Power Automate, enabling reusable actions, conversational agents, and business rule orchestration. Establish telemetry, traceability, and structured logging standards across APIs and workflows using Azure Monitor, App Insights, and OpenTelemetry. Drive performance profiling and system optimization initiatives across ingestion, classification, and agent orchestration layers. Explore and integrate AI capabilities such as voice embeddings, vector search, and immersive UI elements into the platform. Participate actively in PI planning, backlog grooming, and agile ceremonies across engineering and product teams. 
Mentor junior developers and lead sprint-level technical delivery with a focus on modularity, scalability, and AI-readiness. What You’ll Bring: Passion for staying ahead of AI developments and a builder mindset to turn AI capabilities into practical applications. Demonstrated ability to design systems with observability, orchestration, and automation at the core. A strong performance-first philosophy — ability to analyze, profile, and optimize services at scale. Vision for integrating AI into core insurance workflows, from agent recommendations to customer-facing explainability. Willing to work across time zones and with remote teams. All About You: 9+ years of experience in backend platform development and cloud-native architecture. Strong knowledge of FastAPI, Django, or Flask and event-driven microservice design. Minimum 5 years of experience with frontend frameworks - HTML5, React JS, Node JS, JavaScript, Angular. Exposure to cloud applications including DevOps/DevSecOps, Scaling, Deployment and Automation; Cloud Exposure - MS Azure/AWS, OpenShift, Docker, Kubernetes, Jenkins, GitHub, Jira. Hands-on experience with Azure Cloud Services including Logic Apps, Functions, Cosmos DB, Blob Storage, and Service Bus. Proficient with Docker and Kubernetes for containerized deployment and service scaling. Experience building intelligent orchestration workflows using Power Automate and Copilot Studio. Working knowledge of vector databases, embedding APIs, and LLM integration workflows (OpenAI/Azure OpenAI). Exposure to AI-enhanced UIs — such as embedded assistants, predictive agents, or conversational UI. Proficient in system performance optimization, error tracing, logging frameworks, and monitoring pipelines. Experience working in agile teams with PI planning, story-pointing, sprint demos, and cross-functional delivery. P&C insurance domain familiarity is preferred but not mandatory.
Other Qualifications: Bachelor’s degree in computer science or engineering; master’s degree a plus. Experience with SAFe Agile Development practices and processes; SAFe Practitioner certification preferred. Experience with the Majesco Platforms/Products is a plus. Experience in developing packaged software (products), preferably in the banking or financial services areas, is preferred.
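The event-driven design this posting describes (Service Bus topics feeding decoupled claims/billing workers) reduces to a queue sitting between a publisher and a consumer. Below is a minimal stdlib-only sketch, not Azure-specific — the event names and the handler are illustrative assumptions, not taken from the posting:

```python
import asyncio

async def publisher(queue, events):
    """Push insurance events onto the bus, then signal shutdown."""
    for event in events:
        await queue.put(event)
    await queue.put(None)  # sentinel: no more events

async def consumer(queue, handled):
    """Drain events until the sentinel arrives."""
    while True:
        event = await queue.get()
        if event is None:
            break
        handled.append(f"processed:{event}")

async def main():
    queue = asyncio.Queue()
    handled = []
    events = ["claim.created", "claim.approved", "policy.renewed"]
    # Publisher and consumer run concurrently, coupled only by the queue.
    await asyncio.gather(publisher(queue, events), consumer(queue, handled))
    return handled

results = asyncio.run(main())
```

In production the in-memory queue would be replaced by a durable broker (e.g., Azure Service Bus or RabbitMQ), but the decoupling between producer and consumer is the same.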

Posted 3 days ago

Apply

0 years

0 Lacs

Bengaluru, Karnataka, India

On-site

Responsibilities Design, build, and maintain scalable software development platforms. Lead technical discussions with clients and internal teams. Mentor junior engineers through pairing and collaborative sessions. Embrace dynamic roles and adapt to evolving tech stacks. Drive Agile/Lean practices across projects. Document solutions and promote knowledge sharing. Stay updated on emerging technologies and industry trends. Requirements Experience managing large-scale infrastructure systems. Proficiency in at least one programming language. Strong understanding of software delivery principles. Technical agility with the ability to adapt across stacks and tools. Excellent communication and collaboration skills. Technical Expertise Experience building infrastructure or business platforms (e.g., notification services). Strong knowledge of Linux, cloud, and software-defined networking. Hands-on with configuration management tools like Ansible, Chef. Deep understanding of CI/CD practices and tools like Jenkins, GitHub Actions, and GitLab CI. Experience with GitOps tools like ArgoCD, FluxCD is a plus. Familiarity with Cloud-Native tools (Prometheus, OpenTelemetry, Envoy). Experience with container orchestration (e.g., Kubernetes, Nomad). Proficient in infrastructure-as-code (Terraform, Pulumi, CloudFormation, AWS CDK). Exposure to at least one major cloud platform (AWS, GCP, or Azure). Understanding of observability and monitoring best practices. Knowledge of distributed systems and experience with SQL/NoSQL databases. Awareness of cloud security principles. This job was posted by Akanksha Sharma from Infraspec.

Posted 3 days ago

Apply

0 years

10 Lacs

Hyderābād

On-site

Company Description Entain India is the engineering and delivery powerhouse for Entain, one of the world’s leading global sports and gaming groups. Established in Hyderabad in 2001, we’ve grown from a small tech hub into a dynamic force, delivering cutting-edge software solutions and support services that power billions of transactions for millions of users worldwide. Our focus on quality at scale drives us to create innovative technology that supports Entain’s mission to lead the change in the global sports and gaming sector. At Entain India, we make the impossible possible, together. Job Description We are seeking a talented and motivated SRE Engineer III to join our dynamic team. In this role, you will execute a range of site reliability activities, ensuring optimal service performance, reliability, and availability. You will collaborate with cross-functional engineering teams to develop scalable, fault-tolerant, and cost-effective cloud services. If you are passionate about site reliability engineering and ready to make a significant impact, we would love to hear from you! Key Responsibilities: Lead a team of SRE Engineers. Implement automation tools, frameworks, and CI/CD pipelines, promoting best practices and code reusability. Enhance site reliability through process automation, reducing mean time to detection, resolution, and repair. Identify and manage risks through regular assessments and proactive mitigation strategies. Develop and troubleshoot large-scale distributed systems in both on-prem and cloud environments. Deliver infrastructure as code to improve service availability, scalability, latency, and efficiency. Monitor support processing for early detection of issues and share knowledge on emerging site reliability trends. Analyze data to identify improvement areas and optimize system performance through scale testing.
Take ownership of production issues within assigned domains, performing initial triage and collaborating closely with engineering teams to ensure timely resolution. Qualifications For Site Reliability Engineering (SRE), key skills and tools are essential for maintaining system reliability, scalability, and efficiency. With a focus on observability, compliance, and platform stability, here’s a structured breakdown: Key SRE Skills Infrastructure as Code (IaC) – Automating provisioning with Terraform, Ansible, or Kubernetes. Observability & Monitoring – Implementing distributed tracing, logging, and metrics for proactive issue detection. Security & Compliance – Ensuring privileged access controls, audit logging, and encryption. Incident Management & MTTR Optimization – Reducing downtime with automated recovery mechanisms. Performance Engineering – Optimizing API latency, P99 response times, and resource utilization. Dependency Management – Ensuring resilience in microservices with circuit breakers and retries. CI/CD & Release Engineering – Automating deployments while maintaining rollback strategies. Capacity Planning & Scalability – Forecasting traffic patterns and optimizing resource allocation. Chaos Engineering – Validating system robustness through fault injection testing. Cross-Team Collaboration – Aligning SRE practices with DevOps, security, and compliance teams. Essential SRE Tools Monitoring & Observability: Datadog, Prometheus, Grafana, New Relic. Incident Response: PagerDuty, OpsGenie. Configuration & Automation: Terraform, Ansible, Puppet. CI/CD Pipelines: Jenkins, GitHub Actions, ArgoCD. Logging & Tracing: ELK Stack, OpenTelemetry, Jaeger. Security & Compliance: Vault, AWS IAM, Snyk. Additional Information We know that signing top players requires a great starting package, and plenty of support to inspire peak performance. Join us, and a competitive salary is just the beginning.
Working for us in India, you can expect to receive great benefits like: Safe home pickup and home drop (Hyderabad Office Only), Group Mediclaim policy, Group Critical Illness policy, Communication & Relocation allowance, and Annual Health check. And outside of this, you’ll have the chance to turn recognition from leaders and colleagues into amazing prizes. Join a winning team of talented people and be a part of an inclusive and supportive community where everyone is celebrated for being themselves. At Entain India, we do what’s right. It’s one of our core values and that’s why we're taking the lead when it comes to creating a diverse, equitable and inclusive future - for our people, and the wider global sports betting and gaming sector. However you identify, across any protected characteristic, our ambition is to ensure our people across the globe feel valued, respected and their individuality celebrated. We comply with all applicable recruitment regulations and employment laws in the jurisdictions where we operate, ensuring ethical and compliant hiring practices globally. Should you need any adjustments or accommodations to the recruitment process, at either application or interview, please contact us.
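The dependency-management skill this posting calls out (microservice resilience via circuit breakers and retries) can be illustrated in a few lines of Python. This is a hedged sketch, not tied to any tool named in the listing — the class name and thresholds are made up for illustration:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, then fast-fails calls until `reset_after` seconds elapse."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: fast-failing")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Libraries such as resilience4j or Polly implement the same state machine (closed, open, half-open) with retries and backoff layered on top; the sketch shows only the core idea.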

Posted 3 days ago

Apply

3.0 years

0 Lacs

Bangalore Urban, Karnataka, India

On-site

About Groww We are a passionate group of people focused on making financial services accessible to every Indian through a multi-product platform. Each day, we help millions of customers take charge of their financial journey. Customer obsession is in our DNA. Every product, every design, every algorithm down to the tiniest detail is executed keeping the customers’ needs and convenience in mind. Our people are our greatest strength. Everyone at Groww is driven by ownership, customer-centricity, integrity and the passion to constantly challenge the status quo. Are you as passionate about defying conventions and creating something extraordinary as we are? Let’s chat. Our Vision Every individual deserves the knowledge, tools, and confidence to make informed financial decisions. At Groww, we are making sure every Indian feels empowered to do so through a cutting-edge multi-product platform offering a variety of financial services. Our long-term vision is to become the trusted financial partner for millions of Indians. Our Values Our culture enables us to be what we are — India’s fastest-growing financial services company. Everyone at Groww enjoys the autonomy and flexibility to bring their best work to the table, as well as craft a promising career for themselves. The values that form our foundation are: Radical customer-centricity Ownership-driven culture Keeping everything simple Long-term thinking Complete transparency EXPERTISE AND QUALIFICATIONS Bachelor's degree in Computer Science, Software Engineering, or related technical field (or equivalent practical experience). 3+ years of professional experience developing and deploying React Native applications for iOS and Android. Strong proficiency in JavaScript/TypeScript, ES6+, and related modern frontend technologies. In-depth understanding of React Native internals, including the JavaScript runtime, bridge, Hermes engine, and component lifecycle. 
Proficiency with state management libraries (Redux Toolkit, Zustand, MobX, Recoil). Experience with UI libraries and design systems, ensuring pixel-perfect UI implementation. Demonstrated ability to optimize performance, reduce JS thread blockage, and implement efficient rendering. Strong problem-solving skills, including algorithmic and data structure competencies. Experience integrating native modules and bridging native APIs into React Native applications. Familiarity with modern CI/CD pipelines, unit and integration testing, and debugging tools. GOOD TO HAVE Familiarity with native development in either Android (Kotlin) or iOS (Swift/Objective-C). Experience publishing and managing apps on Google Play and Apple App Store. Knowledge of Firebase, Bugsnag and other cloud services for analytics, crash reporting, and performance monitoring. Exposure to OpenTelemetry or other observability and monitoring frameworks. Understanding of monorepo setups (Yarn workspaces, TurboRepo, etc.) and experience working within them.

Posted 4 days ago

Apply

3.0 years

0 Lacs

Gurugram, Haryana, India

On-site

Job Title: Senior DevOps Engineer (SRE2) Location: Gurugram Experience: 3+ Years About HaaNaa HaaNaa is a skill-based opinion trading platform that lets users trade their opinions on diverse topics using simple Yes/No choices. From politics, crypto, and finance to sports, entertainment, and current affairs—HaaNaa transforms opinions into assets. With a gamified interface, users get rewarded for informed predictions, while tracking real-time trends, analyzing insights, and engaging with a vibrant community. Role Overview We are looking for a Senior DevOps Engineer (SRE2) to lead and scale our infrastructure as we grow our real-time trading platform. This role demands a mix of hands-on DevOps skills and strong ownership of system reliability, scalability, and observability. Key Responsibilities Design, deploy, and manage scalable, secure, and resilient infrastructure on AWS, focusing on EKS (Elastic Kubernetes Service) for container orchestration. Implement and manage service mesh using Istio, enabling traffic control, observability, and security across microservices. Drive Infrastructure-as-Code (IaC) using Terraform for consistent and repeatable provisioning of cloud resources. Build and maintain robust CI/CD pipelines (GitHub Actions, Jenkins, or CircleCI) to ensure efficient and automated delivery workflows. Ensure high system availability, performance, and reliability—taking ownership of SLIs/SLOs/SLAs, alerts, and dashboards. Implement observability practices using tools like Prometheus, Grafana, ELK/EFK, or OpenTelemetry. Manage incident response, root cause analysis (RCA), and drive postmortem culture. Collaborate with cross-functional teams (engineering, QA, product) to ensure DevOps and SRE best practices are followed. Harden platform against security threats (including DDoS) using Cloudflare, Akamai, or equivalent. Automate repetitive tasks using scripting (Python, Bash) and tools like Ansible. 
Contribute to platform cost optimization, auto-scaling, and multi-region failover strategies. Requirements 3+ years of hands-on DevOps/SRE experience including team mentorship or leadership. Proven expertise in managing AWS cloud-native architecture, especially EKS, IAM, VPC, ALB/NLB, S3, RDS, CloudWatch. Hands-on with Istio for service mesh and microservice observability/security. Deep experience with Terraform for managing cloud infrastructure. Proficiency in CI/CD and automation tools (GitHub Actions, Jenkins, CircleCI, Ansible). Strong scripting skills in Python, Bash, or equivalent. Familiar with Kubernetes administration, Helm charts, and container orchestration. Strong understanding of monitoring, alerting, and logging systems. Experience handling DDoS mitigation, WAF rules, and CDN configuration. Excellent problem-solving and incident management skills with a proactive mindset. Strong collaboration and communication skills. Nice to Have Experience in high-growth startups or gaming platforms. Understanding of security best practices, IAM policies, and compliance frameworks (SOC2, ISO, etc.). Experience in backend performance tuning, horizontal scaling, and chaos engineering. Familiarity with progressive delivery techniques like Canary deployments or Blue/Green strategies. Why Join HaaNaa? Ownership: Play a key role in shaping the platform’s infrastructure and reliability. Innovation: Work on scalable, low-latency systems powering real-time gamified trading. Teamwork: Join a dynamic, talented team solving complex engineering challenges. Growth: Be part of a rapidly expanding company with leadership growth opportunities. Perks & Benefits: Competitive salary, health insurance, and the freedom to experiment with the latest cloud-native tools. Skills: devops,terraform,ci/cd,cloudformation,go,networking,datadog,aws,grafana,sre,kubernetes,azure,security,prometheus,infrastructure-as-code,gcp,bash,docker,python,linux system administration,elk stack
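Owning SLIs/SLOs, as this role requires, starts with computing latency SLIs such as P99 from observed request durations. Below is a minimal sketch using the nearest-rank method (an assumption for illustration; monitoring backends like Prometheus estimate percentiles from histogram buckets instead, and the sample values here are invented):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Request latencies in milliseconds for one observation window.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 900, 14]
p50 = percentile(latencies_ms, 50)  # typical request
p99 = percentile(latencies_ms, 99)  # tail latency that an SLO would bound
```

The gap between P50 and P99 is exactly why tail-latency SLOs matter: the median can look healthy while a small fraction of requests is drastically slow.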

Posted 4 days ago

Apply

5.0 years

0 Lacs

Delhi, India

On-site

Position Title: SRE Engineer Position Type: Regular - Full-Time Position Location: New Delhi Requisition ID: 30491 Job Purpose Reporting to the Sr Manager, DevSecOps & SRE, the Site Reliability Engineer will be responsible for improving system reliability and resilience to make it faster and easier to develop and deploy new software capabilities. SREs focus especially on building automation to reduce manual effort and prevent operations incidents. Job Responsibilities Work with stakeholders such as product owners and Engineering to define service level objectives (SLOs) for system operations. Track performance against SLOs in partnership with monitoring teams or other stakeholders, and ensure systems continue to meet SLOs over time. Create dashboards and reports to communicate key metrics. Create software to improve performance, scalability, and stability of systems. Collaborate with development teams to promote the concept of reliability engineering during all phases of the software development lifecycle to detect and correct performance issues and meet availability goals. Design, code, test, and deliver infrastructure software to automate manual operational work (i.e., “toil”). Participate in operational support and on-call rotation shifts for supported systems and products. Conduct blameless postmortems to troubleshoot priority incidents. Perform analytics on previous incidents to understand root causes and better predict and prevent future issues. Use automation to reduce the probability and/or impact of problem recurrence. Identify, evaluate, and recommend monitoring tools and diagnostic techniques to improve system observability. Participate in system design consulting, platform management, capacity planning and launch reviews.
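Defining and tracking SLOs, as described above, usually comes down to error-budget arithmetic: a 99.9% availability SLO over a 30-day window permits roughly 43 minutes of downtime. A back-of-envelope sketch (the function name is illustrative, not from any tool named in the posting):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO
    (e.g. slo=0.999) over a rolling window of `window_days` days."""
    return (1.0 - slo) * window_days * 24 * 60

budget = error_budget_minutes(0.999)  # ~43.2 minutes per 30 days
remaining = budget - 12.5             # budget left after 12.5 min of downtime
burn = 12.5 / budget                  # fraction of the budget consumed
```

Teams typically alert on the burn rate rather than raw downtime: consuming the budget much faster than the window allows is the early-warning signal.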
Collaborate and share lessons learned regarding performance and reliability issues with all stakeholders including developers, other SREs, operations teams, and project management teams. Participate in communities of practice to share knowledge and foster continuous improvement. Remain current on site reliability engineering methods and trends such as observability-driven development and chaos engineering. Drive continuous improvement in software quality and infrastructure reliability and resilience. Oversee, design, implement, and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation. The SRE engineer will focus on Application Performance Monitoring (APM), including design, solution, POC, and profiling and tuning of application compute and data nodes and resources. Some key duties of this role are: Assist in defining SRE and Observability architecture and design. Analyze and implement new features of the SRE and Observability platform. Full-stack monitoring across all layers (Infrastructure/Network/Database/Application/Services/Third Party). Provide technical hands-on leadership in open-source and commercial monitoring tool selection and implementation. Implement SRE-driven automated incident detection -> automated engagement -> triage/mitigation -> RCA/postmortems -> problem-task remediation. AI-driven correlation, de-duplication, noise reduction, and auto-remediation. Provide weekly monitoring and alert analysis and continuous improvement. Create a model of the run-time environment (discovery). Profile the performance and behavior of user-defined transactions. Establish performance metrics from each of the applications/systems technical components (Webserver, App server, Database, etc.)
Application Performance Management (APM) database and tool administration and support. Monitoring tool design and implementation. APM setup/usage policies and guidelines. Capacity planning and monitoring. Monitor selected application performance. Report vital statistics of application performance in production. Make recommendations for improvements with the Service Desk. Make recommendations for adjustments to runtime resources to improve the overall performance profile. Key Qualification & Experiences Strong problem solving and analytical skills. Strong interpersonal and written and verbal communication skills. Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies. Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell). Experience with incident and response management. Experience with Agile and DevOps development methodologies. Experience with container technologies and supporting tools (e.g. Docker Swarm, Podman, Kubernetes, Mesos). Experience with working in cloud ecosystems (Microsoft Azure, AWS, Google Cloud Platform). Experience with monitoring and observability tools (e.g. Splunk, CloudWatch, AppDynamics, New Relic, ELK, Prometheus, OpenTelemetry). Experience with configuration management systems (e.g. Puppet, Ansible, Chef, Salt, Terraform). Experience working with continuous integration/continuous deployment tools (e.g. Git, TeamCity, Jenkins, Artifactory). Experience in GitOps-based automation is a plus. Bachelor’s degree (or equivalent years of experience). 5+ years of relevant work experience. SRE experience preferred. Background in Manufacturing or Platform/Tech companies is preferred. Must have Public Cloud provider certifications (Azure, GCP, or AWS). Having a CNCF certification is a plus. Other Information Travel: as required. Job is primarily performed in a Hybrid office environment. McCain Foods is an equal opportunity employer.
We see value in ensuring we have a diverse, antiracist, inclusive, merit-based, and equitable workplace. As a global family-owned company we are proud to reflect the diverse communities around the world in which we live and work. We recognize that diversity drives our creativity, resilience, and success and makes our business stronger. McCain is an accessible employer. If you require an accommodation throughout the recruitment process (including alternate formats of materials or accessible meeting rooms), please let us know and we will work with you to meet your needs. Your privacy is important to us. By submitting personal data or information to us, you agree this will be handled in accordance with the Global Employee Privacy Policy. Job Family: Information Technology Division: Global Digital Technology Department: I and O Project Delivery Location(s): IN - India : National Capital Territory : New Delhi Company: McCain Foods(India) P Ltd

Posted 4 days ago

Apply

5.0 years

0 Lacs

Delhi, India

Remote

Position Title: Infrastructure Solution Architect Position Type: Regular - Full-Time Position Location: New Delhi Requisition ID: 32004 Job Purpose As a Cloud Infrastructure Solution Architect, you'll drive the success of our IT Architecture program through your design expertise and consultative approach. You'll collaborate with stakeholders to understand their technical requirements, designing and documenting tailored solutions. Your blend of architecture and operations experience will enable you to accurately size work efforts and determine the necessary skills and resources for projects. Strong communication, time management, and process skills are essential for success in this role. You should have deep experience in defining Infrastructure solutions: Design, Architecture and Solution Building blocks. Role Overview The cloud infrastructure architect role helps teams (such as product teams, platform teams and application teams) successfully adopt cloud infrastructure and platform services. It is heavily involved in design and implementation activities that result in new or improved cloud-related capabilities, and it brings skills and expertise to such areas as cloud technical architecture (for a workload’s use of infrastructure as a service [IaaS] and platform as a service [PaaS] components); automating cloud management tasks, provisioning and configuration management; and other aspects involved in preparing and optimizing cloud solutions. Successful outcomes are likely to embrace infrastructure-as-code (IaC), DevOps and Agile ways of working and associated automation approaches, all underpinned by the cloud infrastructure engineer’s solid understanding of networking and security in the cloud. The nature of the work involved means that the cloud infrastructure engineer will directly engage with customer teams, but will also work on cloud infrastructure platform capabilities that span multiple teams. 
The cloud infrastructure architect collaborates closely with other architects, product/platform teams, software developers, Cloud Engineers, site reliability engineers (SREs), security, and network specialists, as well as other roles, particularly those in infrastructure and operations. Being an approachable team-player is therefore crucial for success, and willingness to lead initiatives is important too. The cloud infrastructure engineer also supports colleagues with complex (escalated) operational concerns in areas such as deployment activities, event management, incident and problem management, availability, capacity and service-level management, as well as service continuity. The cloud infrastructure architect is expected to demonstrate strong attention to detail and a customer-centric mindset. Inquisitiveness, determination, creativity, communicative and collaboration skills are important qualities too. Key Responsibilities Provide expert knowledge on cloud infrastructure and platforms solutions architecture, to ensure our organization achieves its goals for cloud adoption. This involves translating cloud strategy and architecture into efficient, resilient, and secure technical implementations. Define cloud infrastructure landing zones, regional subscriptions, and Availability Zones to ensure HA, resiliency, and reliability of infrastructure and applications. Offer cloud-engineering thought leadership in areas such as defining specific cloud use cases, cloud service providers, and/or strategic tools and technologies. Support cloud strategy by working on new cloud solutions, including analysing requirements, supporting technical architecture activities, prototyping, design and development of infrastructure artifacts, testing, implementation, and the preparation for ongoing support.
Work on cloud migration projects, including analyzing requirements and backlogs, identifying migration techniques, developing migration artifacts, executing processes, and ensuring preparations for ongoing support. Design, build, deliver, maintain and improve infrastructure solutions. This includes automation strategies such as IaC, configuration-as-code, policy-as-code, release orchestration and continuous integration/continuous delivery (CI/CD) pipelines, and collaborative ways of working (e.g., DevOps). Participate in change and release management processes, carrying out complex provisioning and configuration tasks manually, where needed. Research and prototype new tools and technologies to enhance cloud platform capabilities. Proactively identify innovative ways to reduce toil, and teach, coach or mentor others to improve cloud outcomes using automation. Improve reliability, scalability and efficiency by working with product engineers and site reliability engineers to ensure well-architected and thoughtfully operationalized cloud infrastructures. This includes assisting with nonfunctional requirements, such as data protection, high availability, disaster recovery, monitoring requirements and efficiency considerations in different environments. Provide subject matter expertise for all approved IaaS and PaaS services, respond promptly to escalated incidents and requests, and build reusable artifacts ready for deployment to cloud environments. Exert influence that lifts cloud engineering competency by participating in (and, where applicable, leading) organizational learning practices, such as communities of practice, dojos, hackathons and centers of excellence (COEs). Actively participate in mentoring. Practice continuous improvement and knowledge sharing (e.g., providing KB articles, training and white papers). Participate in planning and optimization activities, including capacity, reliability, cost management and performance engineering. 
Establish FinOps Practices — Cloud Cost management, Scale up/down, Environment creation/deletion based on consumption. Work closely with security specialists to design, implement and test security controls, and ensure engineering activities align to security configuration guidance. Establish logging, monitoring and observability solutions, including identification of requirements, design, implementation and operationalization. Optimize infrastructure integration in all scenarios — single cloud, multicloud and hybrid. Convey the pros and cons of cloud services and other cloud engineering topics to others at differing levels of cloud maturity and experience, and in different roles (e.g., developers and business technologists). Be forthcoming and open when the cloud is not the best solution. Work closely with third-party suppliers, both as an individual contributor and as a project lead, when required. Engage with vendor technical support as the customer lead role when appropriate. Participate in/lead problem management activities, including post-mortem incident analysis, providing technical insight, documented findings, outcomes and recommendations as part of a root cause analysis. Support resilience activities — e.g., disaster recovery (DR) testing, performance testing and tabletop planning exercises. The role holder is also expected to: Ensure that activities are tracked and auditable by leveraging service enablement systems, logging activity in the relevant systems of record, and following change and release processes. Collaborate with peers from other teams, such as security, compliance, enterprise architecture, service governance, and IT finance to implement technical controls to support governance, as necessary. Work in accordance with the organization’s published standards and ensure that services are delivered in compliance with policy. Promptly respond to requests for engineering assistance from technical customers as needed.
Provide engineering support, present ideas and create best-practice guidance materials. Strive to meet service-level expectations. Foster ongoing, closer and repeatable engagement with customers to achieve better, scalable outcomes. Take ownership of personal development, working with line management to identify development opportunities. Work with limited guidance, independently and/or as part of a team on complex problems, potentially requiring close collaboration with remotely based employees and third-party providers. Follow standard operating procedures, propose improvements and develop new standard operating procedures to further industrialize our approach. Advocate for simplification and workflow optimization, and follow documentation standards. Skills And Experience Skills and experience in the following activities/working styles are essential: Collaboration with developers (and other roles, such as SREs and DevSecOps Engineers) to plan, design, implement, operationalize and problem-solve workloads that leverage cloud infrastructure and platform services. Working in an infrastructure or application support team. Cloud migration project experience. [Data center to Cloud IAAS, Cloud Native, Hybrid Cloud] Securing cloud platforms and cloud workloads in collaboration with security teams. Familiarity or experience with DevOps/DevSecOps. Agile practices (such as Scrum/Sprints, Customer Journey Mapping, Kanban). Proposing new standards, addressing peer feedback and advocating for improvement. Understanding of software engineering principles (source control, versioning, code reviews, etc.). Working in an environment that complies with Health and Manufacturing requirements. Event-based architectures and associated infrastructure patterns. Experience working with specific technical teams: [R&D teams, Data and analytics teams, etc.]
  • Experience where immutable infrastructure approaches have been used.
  • Implementing highly available systems using multi-AZ and multi-region approaches.

Skills And Experience In The Following Technology Areas

  • Experience with Azure, GCP, AWS and SAP cloud provider services (Azure and SAP preferred). Experience with these providers' infrastructure, data, application, API and integration services is preferred.
  • DevOps tooling such as CI/CD (e.g., Jenkins, Jira, Confluence, Azure DevOps/ADO, TeamCity, GitHub, GitLab).
  • Infrastructure-as-code approaches, role-specific automation tools and associated programming languages (e.g., Ansible, ARM, Chef, CloudFormation, Pulumi, Puppet, Terraform, Salt, AWS CDK, Azure SDK).
  • Orchestration tools (e.g., Morpheus Data, env0, Cloudify, Pliant, Quali, RackN, vRA, Crossplane, ArgoCD).
  • Knowledge of software development frameworks and languages (e.g., Spring, Java, Go, PHP, Python).
  • Container management (e.g., Docker, Rancher, Kubernetes, AKS, EKS, GKE, RHOS, VMware Tanzu).
  • Virtualization platforms (e.g., VMware, Hyper-V).
  • Operating systems (e.g., Windows and Linux, including scripting experience).
  • Database technologies and caching (e.g., Postgres, MSSQL, NoSQL, Redis, CDN).
  • Identity and access management (e.g., Active Directory/Azure AD, Group Policy, SSO, cloud RBAC, hierarchy and federation).
  • Monitoring tools (e.g., AWS CloudWatch, Elastic Stack (Elasticsearch/Logstash/Kibana), Datadog, LogicMonitor, Splunk).
  • Cloud networking (e.g., subnetting, route tables, security groups, VPCs, VPC peering, NACLs, VPNs, transit gateways, optimizing for egress costs).
  • Cloud security (e.g., key management services, encryption and other core security services/controls the organization uses).
  • Landing zone automation solutions (e.g., AWS Control Tower).
  • Policy guardrails (e.g., policy-as-code approaches, cloud provider native policy tools, HashiCorp Sentinel, Open Policy Agent).
  • Scalable architectures, including APIs, microservices and PaaS.
  • Analyzing cloud spending and optimizing resources (e.g., Apptio Cloudability, Flexera One, IBM Turbonomic, NetApp Spot, VMware CloudHealth).
  • Implementing resilience (e.g., multi-AZ, multi-region, backup and recovery tools).
  • Cloud provider frameworks (e.g., Well-Architected).
  • Working with architecture tools and associated artifacts.

General skills, behaviors, competencies and experience required include:

  • Strong communication skills (both written and verbal), including the ability to adapt style to a nontechnical audience.
  • Ability to stay calm and focused under pressure.
  • Collaborative working.
  • Proactive and detail-oriented, with strong analytical skills and the ability to leverage a data-driven approach.
  • Willingness to share expertise and best practices, including mentoring and coaching others.
  • Continuous learning mindset: keen to learn and explore new areas, and not afraid of starting from a novice level.
  • Ability to present solutions, defend ideas against criticism, and provide constructive peer reviews.
  • Ability to build consensus, make decisions based on many variables and gain support for initiatives.
  • Business acumen, preferably with industry- and domain-specific knowledge relevant to the enterprise and its business units.
  • Deep understanding of current and emerging I&O and, in particular, cloud technologies and practices.
  • Ability to meet compliance requirements by applying technical capabilities, processes and procedures as required.

Job Requirements: Education and Qualifications

Essential:

  • Bachelor's or master's degree in computer science, information systems, a related field, or equivalent work experience.
  • Ten or more years of related experience in similar roles.
  • Experience implementing cloud at enterprise scale.

Desirable:

  • Cloud provider/hyperscaler certifications.

Must-Have Skills and Experience:

  • Strong problem-solving and analytical skills.
  • Strong interpersonal, written and verbal communication skills.
  • Highly adaptable to changing circumstances.
  • Interest in continuously learning new skills and technologies.
  • Experience with programming and scripting languages (e.g., Java, C#, C++, Python, Bash, PowerShell).
  • Experience with incident response and management.
  • Experience with Agile and DevOps development methodologies.
  • Experience with container technologies and supporting tools (e.g., Docker Swarm, Podman, Kubernetes, Mesos).
  • Experience working in cloud ecosystems (Microsoft Azure, AWS, Google Cloud Platform).
  • Experience with monitoring and observability tools (e.g., Splunk, CloudWatch, AppDynamics, New Relic, ELK, Prometheus, OpenTelemetry).
  • Experience with configuration management systems (e.g., Puppet, Ansible, Chef, Salt, Terraform).
  • Experience with continuous integration/continuous deployment tools (e.g., Git, TeamCity, Jenkins, Artifactory).
  • Experience with GitOps-based automation is a plus.

Qualifications

  • Bachelor's degree (or equivalent years of experience).
  • 5+ years of relevant work experience.
  • SRE experience preferred.
  • Background in manufacturing or platform/tech companies preferred.
  • Public cloud provider certifications (Azure, GCP or AWS) required.
  • CNCF certification is a plus.

McCain Foods is an equal opportunity employer. We see value in ensuring we have a diverse, antiracist, inclusive, merit-based, and equitable workplace. As a global family-owned company, we are proud to reflect the diverse communities around the world in which we live and work. We recognize that diversity drives our creativity, resilience, and success and makes our business stronger. McCain is an accessible employer.
If you require an accommodation throughout the recruitment process (including alternate formats of materials or accessible meeting rooms), please let us know and we will work with you to meet your needs. Your privacy is important to us. By submitting personal data or information to us, you agree this will be handled in accordance with the Global Employee Privacy Policy. Job Family: Information Technology. Division: Global Digital Technology. Department: Infrastructure Architecture. Location(s): IN - India : Haryana : Gurgaon. Company: McCain Foods (India) P Ltd.

Posted 4 days ago


0 years

4 - 6 Lacs

Hyderābād

On-site

Job Summary

We are looking for a highly skilled and adaptable Site Reliability Engineer to become a key member of our Cloud Engineering team. In this crucial role, you will be instrumental in designing and refining our cloud infrastructure with a strong focus on reliability, security, and scalability. As an SRE, you'll apply software engineering principles to solve operational challenges, ensuring the overall operational resilience and continuous stability of our systems. This position blends managing live production environments with contributing to engineering efforts such as automation and system improvements.

Key Responsibilities:

  • Cloud Infrastructure Architecture and Management: Design, build, and maintain resilient cloud infrastructure solutions to support the development and deployment of scalable and reliable applications. This includes managing and optimizing cloud platforms for high availability, performance, and cost efficiency.
  • Enhancing Service Reliability: Lead reliability best practices by establishing and managing monitoring and alerting systems to proactively detect and respond to anomalies and performance issues. Utilize SLI, SLO, and SLA concepts to measure and improve reliability. Identify and resolve potential bottlenecks and areas for enhancement.
  • Driving Automation and Efficiency: Contribute to the automation, provisioning, and standardization of infrastructure resources and system configurations. Identify and implement automation for repetitive tasks to significantly reduce operational overhead. Develop Standard Operating Procedures (SOPs) and automate workflows using tools like Rundeck or Jenkins.
  • Incident Response and Resolution: Participate in and help resolve major incidents, conduct thorough root cause analyses, and implement permanent solutions. Effectively manage incidents within the production environment using a systematic problem-solving approach.
  • Collaboration and Innovation: Work closely with diverse stakeholders and cross-functional teams, including software engineers, to integrate cloud solutions, gather requirements, and execute Proofs of Concept (POCs). Foster strong collaboration and communication. Guide designs and processes with a focus on resilience and minimizing manual effort. Promote the adoption of common tooling and components, and implement software and tools to enhance resilience and automate operations. Be open to adopting new tools and approaches as needed.

Required Skills and Experience:

  • Cloud Platforms: Demonstrated expertise in at least one major cloud platform (AWS, Azure, or GCP).
  • Infrastructure Management: Proven proficiency in on-premises hosting and virtualization platforms (VMware, Hyper-V, or KVM). Solid understanding of storage internals (NAS, SAN, EFS, NFS) and protocols (FTP, SFTP, SMTP, NTP, DNS, DHCP). Experience with networking and firewall technologies. Strong hands-on experience with Linux internals and operating systems (RHEL, CentOS, Rocky Linux), plus experience with Windows operating systems to support varied environments. Extensive experience with containerization (Docker) and orchestration (Kubernetes) technologies.
  • Automation & IaC: Proficiency in scripting languages (shell and Python). Experience with configuration management tools (Ansible or Puppet). Must have exposure to Infrastructure as Code (IaC) tools (Terraform or CloudFormation).
  • Monitoring & Observability: Experience setting up and configuring monitoring tools (Prometheus, Grafana, or the ELK stack). Hands-on experience implementing OpenTelemetry for observability. Familiarity with monitoring and logging tools for cloud-based applications.
  • Service Reliability Concepts: A strong understanding of SLI, SLO, SLA, and error budgeting.
  • Soft Skills & Mindset: Excellent communication and interpersonal skills for effective teamwork.
We value proactive individuals who are eager to learn and adapt in a dynamic environment. You must possess a pragmatic and adaptable mindset, with a willingness to step outside your comfort zone and acquire new skills, the ability to consider the broader system impact of your work, and a readiness to advocate for reliability initiatives.

Desired/Bonus Skills:

  • Experience with DevOps toolchain elements like Git, Jenkins, Rundeck, ArgoCD, or Crossplane.
  • Experience with database management, particularly MySQL and Hadoop.
  • Knowledge of cloud cost management and optimization strategies.
  • Understanding of cloud security best practices, including data encryption, access controls, and identity management.
  • Experience implementing disaster recovery and business continuity plans.
  • Familiarity with ITIL (Information Technology Infrastructure Library) processes.
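The listing above asks for a strong understanding of SLIs, SLOs, and error budgeting. As a rough illustration (not part of the listing itself), the arithmetic behind an error budget fits in a few lines of Python; the SLO target and request counts below are invented values:

```python
# Sketch of the error-budget arithmetic behind SLI/SLO practice.
# All numbers here are illustrative, not from any real service.

def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    The error budget is the allowed failure fraction (1 - slo_target)
    applied to the total request volume in the SLO window.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return 1.0 - failed_requests / allowed_failures

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures;
# 250 observed failures spend a quarter of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")  # 75%
```

In practice the inputs would come from an SLI query (e.g., a ratio of good to total requests in Prometheus), but the budget calculation itself is exactly this simple.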

Posted 4 days ago


2.0 years

4 - 8 Lacs

Bengaluru

On-site

Company Description

Visa is a world leader in payments and technology, with over 259 billion payment transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure payments network, enabling individuals, businesses, and economies to thrive, driven by a common purpose: to uplift everyone, everywhere by being the best way to pay and be paid. Make an impact with a purpose-driven industry leader. Join us today and experience Life at Visa.

Job Description

We are seeking a motivated Site Reliability Engineer (SRE) to join our Observability team. In this role, you will support the team in maintaining and improving the reliability, security, and performance of our systems. You will learn from experienced engineers while gaining hands-on experience with modern monitoring, logging, and automation tools. As an SRE I, you will assist in day-to-day operational tasks, help monitor system health, and participate in basic troubleshooting. You will also contribute to the maintenance of documentation and develop your technical skills through training and on-the-job experience. This is a hybrid position, requiring 2–3 days per week in the office, as determined by leadership.

Responsibilities

  • Assist in maintaining system security by applying hotfixes and operating system patches under guidance to protect against cybersecurity threats.
  • Support the deployment and configuration of monitoring and logging tools.
  • Help automate routine operational tasks to improve efficiency and support system integration.
  • Assist with the maintenance and basic management of observability tools such as Splunk, ClickHouse, Grafana, Prometheus, OpenTelemetry, Fluent Bit, Elasticsearch, OpenSearch, and CloudWatch.
  • Work with team members to help implement and maintain monitoring solutions in development, staging, and production environments.
  • Learn and apply DevOps and SRE best practices as directed by senior engineers.
  • Contribute to the setup and maintenance of CI/CD pipelines to support automated build, test, and deployment processes.
  • Provide support in managing cloud infrastructure (AWS, GCP) to help ensure availability and security.
  • Learn to use infrastructure-as-code tools such as Terraform, Ansible, or CloudFormation to support environment configuration.
  • Monitor system performance and assist in identifying and escalating issues for resolution.
  • Support the implementation and management of containerization technologies like Docker and Kubernetes.
  • Participate in basic troubleshooting and assist with root cause analysis for production incidents.
  • Help create and update documentation for infrastructure, processes, and operational procedures.
  • Provide first-level support for routine infrastructure and deployment issues, escalating complex problems as needed.
  • Look for opportunities to automate repetitive tasks and suggest improvements to workflows.

Justification

Visa's Observability ecosystem includes over 2,000 platform nodes, utilizing approximately 15 different tools for logging, monitoring, and tracing, alongside 80,000 client agents. The system handles daily log ingestion exceeding 100TB and oversees hundreds of critical applications, supporting vital alerts, dashboards, and reports. To maintain this high level of performance and reliability, we need a Site Reliability Engineer (SRE) with comprehensive knowledge and practical experience. This position requires an I4-level engineer who can operate independently with minimal supervision.
About Visa's PRE Observability Team

Visa's Product Reliability Engineering (PRE) Observability team partners with Product Development as well as Operations & Infrastructure teams to build and manage innovative, reliable, scalable, secure, and cost-effective observability platform solutions. We are looking for talented Site Reliability Engineers to join our driven team, with a focus on maximizing system availability, performance, security, and reliability. This dynamic role requires technical leadership, strong problem-solving skills, and expertise in coding, testing, and debugging. This is a hybrid position; the expectation of days in office will be confirmed by your hiring manager.

Qualifications

Basic Qualifications:

  • Bachelor's degree with at least 2 years of relevant work experience, OR
  • Advanced degree (e.g., Master's, MBA, JD, MD) with no required work experience, OR
  • 5+ years of relevant professional experience.

Preferred Qualifications:

  • Academic, internship, or hands-on experience with at least one observability tool (e.g., Splunk, ClickHouse, Grafana, Prometheus, OpenTelemetry, Fluent Bit, Elasticsearch, OpenSearch, or CloudWatch).
  • Familiarity with setting up or configuring exporters (such as Node exporter or Cert exporter) for collecting metrics.
  • Exposure to containerization technologies such as Docker or Kubernetes, through coursework, projects, or internships.
  • Basic understanding of or experience with CI/CD tools and pipelines (e.g., GitHub Actions, Jenkins, or Ansible).
  • Introductory knowledge of infrastructure-as-code concepts and tools like Terraform or Ansible.
  • Awareness of query languages such as PromQL, SQL, or Splunk SPL.
  • Experience using Linux or Unix environments and basic scripting skills in Python and/or shell.
  • Interest in cloud platforms such as AWS or GCP; cloud certifications are a plus.
  • Strong problem-solving and analytical skills, with a willingness to learn and grow in a collaborative environment.
  • Effective verbal and written communication skills.
  • Ability to work well in a team and take initiative in learning new technologies and practices.

Additional Information

Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.
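The preferred qualifications above mention metric exporters such as Node exporter alongside PromQL. As background not taken from the listing itself: an exporter simply serves plain text in the Prometheus exposition format over HTTP, and PromQL queries operate on the metrics scraped from it. A minimal, hand-rolled sketch of that text format (the metric name and values are invented):

```python
# Minimal sketch of the Prometheus text exposition format that
# exporters (e.g. Node exporter) serve at /metrics. The metric
# name, labels, and values below are invented for illustration.

def render_metric(name, help_text, mtype, samples):
    """samples: list of (labels_dict, value) pairs for one metric family."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

exposition = render_metric(
    "demo_http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [({"method": "GET", "code": "200"}, 1024.0),
     ({"method": "POST", "code": "500"}, 3.0)],
)
print(exposition)
```

Real exporters use an official client library rather than string formatting, but the wire format they produce is exactly these `# HELP`/`# TYPE` comments followed by `name{labels} value` sample lines.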

Posted 4 days ago


Exploring OpenTelemetry Jobs in India

The job market for OpenTelemetry professionals in India is growing rapidly, with many companies adopting the technology to improve their observability and monitoring capabilities. If you are a job seeker interested in OpenTelemetry roles, there are plenty of opportunities waiting for you in India.

Top Hiring Locations in India

  1. Bangalore
  2. Hyderabad
  3. Pune
  4. Chennai
  5. Mumbai

Average Salary Range

The average salary range for OpenTelemetry professionals in India varies by experience level:

  • Entry-level: INR 5-8 lakhs per annum
  • Mid-level: INR 10-15 lakhs per annum
  • Experienced: INR 18-25 lakhs per annum

Career Path

A typical career path in OpenTelemetry work may progress as follows:

  1. Junior Developer
  2. Developer
  3. Senior Developer
  4. Tech Lead

Related Skills

In addition to proficiency with OpenTelemetry, employers often look for candidates with the following skills:

  • Proficiency in cloud platforms like AWS, GCP, or Azure
  • Knowledge of monitoring and observability tools
  • Strong programming skills in languages like Java, Python, or Go

Interview Questions

  • What is OpenTelemetry and how does it differ from other monitoring tools? (basic)
  • How would you set up OpenTelemetry in a microservices architecture? (medium)
  • Can you explain the benefits of distributed tracing in OpenTelemetry? (medium)
  • Describe how sampling works in OpenTelemetry. (medium)
  • How would you troubleshoot performance issues using OpenTelemetry data? (advanced)
  • Explain the role of exporters in OpenTelemetry. (basic)
  • What are the key components of an OpenTelemetry instrumentation library? (medium)
  • How does OpenTelemetry handle context propagation between services? (medium)
  • Can you explain the concept of spans and traces in OpenTelemetry? (basic)
  • How would you integrate OpenTelemetry with a logging framework? (medium)
  • Describe the process of creating custom metrics in OpenTelemetry. (advanced)
  • What are the common challenges faced when implementing OpenTelemetry in a large-scale system? (advanced)
  • How does OpenTelemetry handle data collection in a multi-tenant environment? (advanced)
  • What are the best practices for securing OpenTelemetry data transmissions? (advanced)
  • Can you explain the role of the OpenTelemetry Collector in data processing? (medium)
  • How would you monitor the performance of OpenTelemetry itself? (advanced)
  • Describe a scenario where OpenTelemetry helped improve the performance of a system. (advanced)
  • How does OpenTelemetry handle sampling in a distributed system? (medium)
  • What are the key differences between OpenTelemetry and other APM tools? (medium)
  • How can OpenTelemetry be integrated with containerized applications? (medium)
  • Explain the concept of baggage in OpenTelemetry context propagation. (medium)
  • How would you handle log correlation with OpenTelemetry traces? (advanced)
  • Can you share your experience with migrating from a different monitoring tool to OpenTelemetry? (advanced)
  • What are the key considerations for scaling OpenTelemetry in a growing infrastructure? (advanced)
  • How would you contribute to the OpenTelemetry open-source project? (advanced)
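Several of the questions above (spans and traces, context propagation, baggage) come down to the W3C Trace Context standard, which OpenTelemetry propagates between services by default via the `traceparent` HTTP header. The stdlib-only sketch below illustrates the header's wire format only; it is not the OpenTelemetry SDK API, and real services would use the SDK's propagators instead:

```python
# Illustration of the W3C `traceparent` header format:
#   version-trace_id-parent_span_id-flags
# OpenTelemetry's default propagator reads and writes this header;
# this sketch shows only the format, not the SDK.
import secrets

def make_traceparent(trace_id: str = "") -> str:
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars: whole trace
    span_id = secrets.token_hex(8)                # 16 hex chars: this span
    return f"00-{trace_id}-{span_id}-01"          # 01 = "sampled" flag

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "flags": flags}

# Service A starts a trace; service B parses the incoming header and
# continues the same trace_id while minting a new span_id of its own.
outgoing = make_traceparent()
ctx = parse_traceparent(outgoing)
child = make_traceparent(trace_id=ctx["trace_id"])
assert parse_traceparent(child)["trace_id"] == ctx["trace_id"]
```

This is why every span in a distributed request shares one trace ID: each hop keeps the trace ID from the incoming header and contributes only a fresh span ID (baggage travels in a separate `baggage` header alongside it).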

Conclusion

As you prepare for OpenTelemetry job interviews in India, make sure to brush up on your technical knowledge, practice coding exercises, and familiarize yourself with common interview questions. With the right skills and preparation, you can confidently pursue a rewarding career in this exciting field. Good luck!
