
34 Fluentd Jobs


8.0 - 13.0 years

2 - 2 Lacs

Hyderabad

Work from Office

SUMMARY
Job Description: ITOps (Monitoring and Observability) Consultant - Lateral Hire (Minimum Relevant Experience 7 Years)

Overview: We are seeking a skilled IT Operations Consultant specializing in Monitoring and Observability to design, implement, and optimize monitoring solutions for our customers. The ideal candidate will have a minimum of 7 years of relevant experience, with a strong background in monitoring, observability, and IT service management, and will be responsible for ensuring system reliability, performance, and availability by creating robust observability architectures and leveraging modern monitoring tools.

Primary Responsibilities:
- Design end-to-end monitoring and observability solutions that provide comprehensive visibility into infrastructure, applications, and networks.
- Implement monitoring tools and frameworks (e.g., Prometheus, Grafana, OpsRamp, Dynatrace, New Relic) to track key performance indicators and system health metrics.
- Integrate monitoring and observability solutions with IT Service Management (ITSM) tools.
- Develop and deploy dashboards, alerts, and reports to proactively identify and address system performance issues.
- Architect scalable observability solutions to support hybrid and multi-cloud environments.
- Collaborate with infrastructure, development, and DevOps teams to ensure seamless integration of monitoring systems into CI/CD pipelines.
- Continuously optimize monitoring configurations and thresholds to minimize noise and improve incident detection accuracy.
- Automate alerting, remediation, and reporting processes to enhance operational efficiency.
- Utilize AIOps and machine learning capabilities for intelligent incident management and predictive analytics.
- Work closely with business stakeholders to define monitoring requirements and success metrics.
- Document monitoring architectures, configurations, and operational procedures.
Required Skills:
- Strong understanding of infrastructure and platform development principles, with experience in languages and tools such as Python and Ansible for developing custom scripts.
- Strong knowledge of monitoring frameworks, logging systems (ELK stack, Fluentd), and tracing tools (Jaeger, Zipkin), along with open-source solutions such as Prometheus and Grafana.
- Extensive experience with monitoring and observability solutions such as OpsRamp, Dynatrace, and New Relic; must have worked with ITSM integration (e.g., ServiceNow, BMC Remedy).
- Working experience with RESTful APIs and an understanding of API integration with monitoring tools.
- Familiarity with AIOps and machine learning techniques for anomaly detection and incident prediction.
- Knowledge of ITIL processes and Service Management frameworks.
- Familiarity with security monitoring and compliance requirements.
- Excellent analytical and problem-solving skills, with the ability to debug and troubleshoot complex automation issues.
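Since ITSM integration via REST comes up repeatedly in this posting, here is a minimal Python sketch of pushing a monitoring alert into an ITSM tool. The `/api/now/table/incident` path follows ServiceNow's Table API, but the alert shape, the severity-to-urgency mapping, and the field choices are illustrative assumptions, not a vendor-blessed integration.

```python
import json
import urllib.request

def build_incident_payload(alert):
    """Map a monitoring alert (hypothetical shape) to incident fields."""
    severity_to_urgency = {"critical": "1", "warning": "2", "info": "3"}
    return {
        "short_description": f"[{alert['source']}] {alert['summary']}",
        "urgency": severity_to_urgency.get(alert["severity"], "3"),
        "cmdb_ci": alert.get("host", "unknown"),
    }

def create_incident(instance_url, auth_header, alert):
    """POST the incident to the ServiceNow Table API endpoint."""
    req = urllib.request.Request(
        f"{instance_url}/api/now/table/incident",
        data=json.dumps(build_incident_payload(alert)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": auth_header},
        method="POST",
    )
    return urllib.request.urlopen(req)  # HTTP response, not sent here

# Example alert, as a Prometheus-style notification might be summarized:
alert = {"source": "prometheus", "summary": "High CPU on web-01",
         "severity": "critical", "host": "web-01"}
payload = build_incident_payload(alert)
```

In practice the mapping layer (alert fields to incident fields) carries most of the integration logic; the HTTP call itself is boilerplate.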

Posted 4 days ago

Apply

8.0 - 14.0 years

0 Lacs

Karnataka

On-site

As a Platform Development and Machine Learning expert at Adobe, you will play a crucial role in changing the world through digital experiences by building scalable AI platforms and designing ML pipelines. Your responsibilities will include: - Building scalable AI platforms that are customer-facing and evangelizing the platform with customers and internal stakeholders. - Ensuring platform scalability, reliability, and performance to meet business needs. - Designing ML pipelines for experiment management, model management, feature management, and model retraining. - Implementing A/B testing of models and designing APIs for model inferencing at scale. - Demonstrating proven expertise with MLflow, SageMaker, Vertex AI, and Azure AI. - Serving as a subject matter expert in LLM serving paradigms and possessing deep knowledge of GPU architectures. - Expertise in distributed training and serving of large language models and proficiency in model and data parallel training using frameworks like DeepSpeed and service frameworks like vLLM. - Demonstrating proven expertise in model fine-tuning and optimization techniques to achieve better latencies and accuracies in model results. - Reducing training and resource requirements for fine-tuning LLM and LVM models. - Having extensive knowledge of different LLM models and providing insights on the applicability of each model based on use cases. - Delivering end-to-end solutions from engineering to production for specific customer use cases. - Showcasing proficiency in DevOps and LLMOps practices, including knowledge in Kubernetes, Docker, and container orchestration. - Deep understanding of LLM orchestration frameworks like Flowise, Langflow, and Langgraph. 
Your skills matrix should include:
- LLMs: Hugging Face OSS LLMs, GPT, Gemini, Claude, Mixtral, Llama.
- LLM Ops: MLflow, LangChain, LangGraph, LangFlow, Flowise, LlamaIndex, SageMaker, AWS Bedrock, Vertex AI, Azure AI.
- Databases/data warehouses: DynamoDB, Cosmos DB, MongoDB, RDS, MySQL, PostgreSQL, Aurora, Spanner, Google BigQuery.
- Cloud knowledge of AWS/Azure/GCP.
- DevOps knowledge of Kubernetes, Docker, Fluentd, Kibana, Grafana, and Prometheus.
- Cloud certifications (bonus): AWS Professional Solutions Architect, AWS Machine Learning Specialty, Azure Solutions Architect Expert.
Proficiency in Python, SQL, and JavaScript is also required. Adobe is committed to creating exceptional employee experiences and values diversity. If you require accommodations to navigate the website or complete the application process, please contact accommodations@adobe.com or call (408) 536-3015.

Posted 5 days ago

Apply

3.0 - 5.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site

The Role: We are seeking a Site Reliability Engineer (SRE) to ensure our multi-cloud networking platform meets and exceeds the stringent reliability, performance, and availability targets our enterprise customers demand. This is not a traditional operations role: you will apply a software engineering mindset to solve complex infrastructure challenges and automate solutions at scale. You will be the guardian of our production environment, responsible for the uptime of our services and the architect of the systems that allow us to scale with confidence. Your work is critical to building and maintaining the trust of our customers.

Responsibilities:
- Define and Manage Reliability: Establish and own the Service Level Objectives (SLOs) and Service Level Indicators (SLIs) that define the reliability of our platform. Participate in a blameless post-incident analysis culture and an on-call rotation to manage and resolve production incidents.
- Build and Own the Observability Stack: Design, implement, and manage our complete observability stack, leveraging tools like Prometheus for metrics, Grafana for visualization, Elasticsearch for logging, and Jaeger/OpenTelemetry for distributed tracing to provide end-to-end visibility into our distributed system.
- Automate Everything: Write robust automation and tooling in Python or Go to eliminate manual operational tasks, from incident response to infrastructure provisioning.
- Infrastructure as Code (IaC): Use Terraform and Ansible to manage our multi-cloud infrastructure as code, ensuring our environments are consistent, repeatable, and auditable.
- Kubernetes and Cloud Operations: Manage, troubleshoot, and scale our Kubernetes clusters across our multi-cloud footprint (AWS, Azure, GCP). You will be the expert on running our application reliably in a containerized environment.
- CI/CD and Release Engineering: Collaborate with development teams to enhance our CI/CD pipelines, ensuring that every release is safe, reliable, and can be deployed with high velocity.

Required Qualifications:
- 3-5+ years of experience in a Site Reliability Engineering (SRE), DevOps, or similar infrastructure-focused software engineering role.
- Strong programming and automation skills in Python or Go.
- Deep, hands-on expertise with a modern observability stack, including Prometheus, Grafana, and the ELK Stack (Elasticsearch, Logstash/Fluentd, Kibana).
- Proven experience with Infrastructure as Code (Terraform) and configuration management (Ansible).
- In-depth knowledge of running, managing, and troubleshooting applications on Kubernetes in a production, multi-cloud environment.
- A rigorous, data-driven approach to reliability and a deep understanding of distributed systems, their failure modes, and how to make them resilient.

Preferred Qualifications:
- Experience with distributed tracing using Jaeger or OpenTelemetry.
- A strong understanding of cloud networking concepts (VPCs, subnets, routing, security groups).
- Experience defining and tracking SLOs and error budgets.
- Experience in a fast-paced startup environment.
- Relevant certifications such as Certified Kubernetes Administrator (CKA) or cloud provider certifications (AWS, Azure, GCP).
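The SLO and error-budget work this role describes can be made concrete with a small worked example. This is a minimal sketch for a request-based SLI; the function name and the 99.9% target are assumptions for illustration.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left for a request-based SLI.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures

# A 99.9% SLO over 1,000,000 requests allows ~1,000 failures;
# 250 observed failures leaves about 75% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

Teams typically alert on the burn *rate* of this number rather than its absolute value, so that a fast-burning incident pages sooner than a slow drift.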

Posted 1 week ago

Apply

3.0 - 5.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site

We have an exciting and rewarding opportunity for you to take your software engineering career to the next level. As a Software Engineer III at JPMorgan Chase within the Consumer and Community Banking - Banking and Wealth Management Team, you serve as a seasoned member of an agile team to design and deliver trusted market-leading technology products in a secure, stable, and scalable way. You are responsible for carrying out critical technology solutions across multiple technical areas within various business functions in support of the firm's business objectives.

Job responsibilities:
- Executes software solutions, design, development, and technical troubleshooting with the ability to think beyond routine or conventional approaches to build solutions or break down technical problems.
- Creates secure and high-quality production code and maintains algorithms that run synchronously with appropriate systems.
- Produces architecture and design artifacts for complex applications while being accountable for ensuring design constraints are met by software code development.
- Gathers, analyzes, synthesizes, and develops visualizations and reporting from large, diverse data sets in service of continuous improvement of software applications and systems.
- Proactively identifies hidden problems and patterns in data and uses these insights to drive improvements to coding hygiene and system architecture.
- Contributes to software engineering communities of practice and events that explore new and emerging technologies.
- Adds to team culture of diversity, opportunity, inclusion, and respect.

Required qualifications, capabilities, and skills:
- Formal training or certification on software engineering concepts and 3+ years applied experience.
- Hands-on experience with cloud-based applications, technologies and tools, deployment, monitoring, and operations, such as Kubernetes, Prometheus, Fluentd, Slack, Elasticsearch, Grafana, Kibana, etc.
- Experience developing and managing operations with relational and NoSQL databases, leveraging key event streaming, messaging, and DB services such as Cassandra, MQ/JMS/Kafka, Aurora, RDS, Cloud SQL, BigTable, DynamoDB, MongoDB, Cloud Spanner, Kinesis, Cloud Pub/Sub, etc.
- Networking (security, load balancing, network routing protocols, etc.).
- Demonstrated experience in the fields of production engineering and automation.
- Strong understanding of cloud technology standards and practices.
- Proficiency in utilizing tools for monitoring, analysis, and troubleshooting, including Splunk, Dynatrace, Datadog, or equivalent.

Preferred qualifications, capabilities, and skills:
- Ability to conduct detailed analysis on incidents to identify patterns and trends, thereby enhancing operational stability and efficiency.
- Familiarity with digital certificate management and automation tools.
- Knowledge of CI/CD pipeline frameworks.
- Excellent communication and collaboration skills.

Posted 1 week ago

Apply

3.0 - 7.0 years

9 - 13 Lacs

Pune

Work from Office

As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting to deploying the latest software updates and fixes.

Your primary responsibilities include:
- 24x7 Observability: Be part of a worldwide team that monitors the health of production systems and services around the clock, ensuring continuous reliability and optimal customer experience.
- Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively.
- Deployment and Configuration: Leverage Continuous Delivery (CI/CD) tools to deploy services and configuration changes at enterprise scale.
- Security and Compliance Implementation: Implement security measures that meet or exceed industry standards for regulations such as GDPR, SOC2, ISO 27001, PCI, HIPAA, and FBA.
- Maintenance and Support: Apply Couchbase security patches and upgrades, support Cassandra and MongoDB on the pager-duty rotation, and collaborate with Couchbase product support for issue resolution.

Required education: Bachelor's Degree

Required technical and professional expertise:
- System Monitoring and Troubleshooting: Strong skills in monitoring/observability, issue response, and troubleshooting for optimal system performance.
- Automation Proficiency: Proficiency in automating production environment changes, streamlining processes for efficiency, and reducing toil.
- Linux Proficiency: Strong knowledge of Linux operating systems.
- Operation and Support Experience: Demonstrated experience in handling day-to-day operations, alert management, incident support, migration tasks, and break-fix support.
- Experience with Infrastructure as Code (Terraform/OpenTofu).
- Experience with the ELK/EFK stack (Elasticsearch, Logstash/Fluentd, and Kibana).

Preferred technical and professional experience:
- Kubernetes/OpenShift: Experience working with production Kubernetes/OpenShift environments is strongly preferred.
- Automation/Scripting: In-depth experience with Ansible, Python, Terraform, and CI/CD tools such as Jenkins, IBM Continuous Delivery, and ArgoCD.
- Monitoring/Observability: Hands-on experience crafting alerts and dashboards using tools such as Instana and Grafana/Prometheus.
- Experience working in an agile team, e.g., Kanban.

Posted 1 week ago

Apply

5.0 - 9.0 years

0 Lacs

Hyderabad, Telangana

On-site

As a Senior Observability Engineer, you will play a crucial role in leading the design, development, and maintenance of observability solutions across our infrastructure, applications, and services. Your primary responsibility will be to implement cutting-edge monitoring, logging, and tracing solutions to ensure the reliability, performance, and availability of our complex, distributed systems. Collaboration with cross-functional teams, including Development, Infrastructure Engineers, DevOps, and SREs, will be essential to optimize system observability and enhance our incident response capabilities. Key Responsibilities: - Lead the Design & Implementation of observability solutions for cloud and on-premises environments, encompassing monitoring, logging, and tracing. - Drive the Development and maintenance of advanced monitoring tools such as Prometheus, Grafana, Datadog, New Relic, and AppDynamics. - Implement Distributed Tracing frameworks like OpenTelemetry, Jaeger, or Zipkin to enhance application performance diagnostics and troubleshooting. - Optimize Log Management and analysis strategies using tools like Elasticsearch, Splunk, Loki, and Fluentd for efficient log processing and insights. - Develop Advanced Alerting and anomaly detection strategies to proactively identify system issues and improve Mean Time to Recovery (MTTR). - Collaborate with Development & SRE Teams to enhance observability in CI/CD pipelines, microservices architectures, and various platform environments. - Automate Observability Tasks by leveraging scripting languages such as Python, Bash, or Golang to increase efficiency and scale observability operations. - Ensure Scalability & Efficiency of monitoring solutions to manage large-scale distributed systems and meet evolving business requirements. - Lead Incident Response by providing actionable insights through observability data for effective troubleshooting and root cause analysis. 
- Stay Abreast of Industry Trends in observability, Site Reliability Engineering (SRE), and monitoring practices to continuously improve processes. Required Qualifications: - 5+ years of hands-on experience in observability, SRE, DevOps, or related fields, with a proven track record in managing complex, large-scale distributed systems. - Expert-level proficiency in observability tools such as Prometheus, Grafana, Datadog, New Relic, AppDynamics, and the ability to design and implement these solutions at scale. - Advanced experience with log management platforms like Elasticsearch, Splunk, Loki, and Fluentd, optimizing log aggregation and analysis for performance insights. - Deep expertise in distributed tracing tools like OpenTelemetry, Jaeger, or Zipkin, focusing on performance optimization and root cause analysis. - Extensive experience with cloud environments (Azure, AWS, GCP) and Kubernetes for deploying and managing observability solutions in cloud-native infrastructures. - Advanced proficiency in scripting languages like Python, Bash, or Golang, and experience with Infrastructure as Code (IaC) tools such as Terraform and Ansible. - Strong understanding of system architecture, performance tuning, and troubleshooting production environments with scalability and high availability in mind. - Proven leadership experience and the ability to mentor teams, provide technical direction, and drive best practices for observability and monitoring. - Excellent problem-solving skills, emphasizing actionable insights and data-driven decision-making. - Ability to lead high-impact projects, communicate effectively with stakeholders, and influence cross-functional teams. - Strong communication and collaboration skills, working closely with engineering teams, leadership, and external partners to achieve observability and system reliability goals. Preferred Qualifications: - Experience with AI-driven observability tools and anomaly detection techniques. 
- Familiarity with microservices, serverless architectures, and event-driven systems.
- Proven track record of handling on-call rotations and incident management workflows in high-availability environments.
- Relevant certifications in observability tools, cloud platforms, or SRE best practices are advantageous.
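As a sketch of the alerting and anomaly-detection strategies this role calls for, a simple z-score filter over a metric series looks like the following; the threshold and sample data are illustrative assumptions, and production systems typically layer far more robust methods (seasonality models, EWMA, ML-based detectors) on top of this idea.

```python
from statistics import mean, stdev

def zscore_anomalies(samples, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(samples), stdev(samples)
    if sigma == 0:
        return []
    return [(i, x) for i, x in enumerate(samples)
            if abs(x - mu) / sigma > threshold]

# Steady request latencies (ms) with one spike: only the spike is flagged.
latencies = [102, 99, 101, 100, 98, 103, 100, 97, 480, 101]
anomalies = zscore_anomalies(latencies, threshold=2.5)
```

Note that a single large outlier inflates the standard deviation itself, which is one reason real detectors prefer robust statistics (median, MAD) over mean and stdev.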

Posted 2 weeks ago

Apply

2.0 - 4.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

With your expertise in delivering infrastructure solutions, you are a top performer in your field. Come on board as a highly appreciated member of a winning team. As an Infrastructure Engineer II at JPMorgan Chase within the Chief Technology Office team, you develop knowledge of software, applications, and technical processes within the infrastructure engineering discipline. Through this work you begin to apply your proficiency in a single application or technical methodology.

Job responsibilities:
- Applies technical knowledge to assignments with a defined scope, such as testing the performance of the infrastructure and understanding and verifying that requirements were successfully met.
- Drives results, collects and analyzes monitoring data in test and production, and sees assignments through to completion.
- Carries out day-to-day work assignments with some guidance and within documented parameters.
- Escalates issues to appropriate leaders.
- Develops considerations for upstream/downstream data and systems or technical implications.
- Adds to team culture of diversity, opportunity, inclusion, and respect.

Required qualifications, capabilities, and skills:
- Formal training or certification on infrastructure engineering concepts and 2+ years applied experience.
- Experience in Kafka, KSQL, Splunk, Elastic/Kibana, and Fluentd; ability to create dashboards in Grafana, configure alerting, and query Prometheus.
- Prior experience in support and implementation.
- Programming experience in at least one language, such as Python, or the ability to write complex shell scripts is essential.
- Experience managing large clusters of Splunk, Elastic, Kafka, and Fluentd, providing production support to demanding customers with frequent tickets and critical incidents.

Preferred qualifications, capabilities, and skills:
- Familiarity with modern front-end technologies.
- Relevant experience in designing, developing, and implementing software solutions, constantly seeking to be an expert.
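Querying Prometheus, as this role requires, usually means consuming its instant-query JSON. The response shape below matches Prometheus's `/api/v1/query` HTTP API; the sample series and the `down_instances` helper are hypothetical.

```python
import json

# A hypothetical response body from Prometheus's instant-query endpoint
# (GET /api/v1/query?query=up), shaped like the real API's JSON:
body = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"job": "node", "instance": "web-01:9100"},
             "value": [1700000000, "1"]},
            {"metric": {"job": "node", "instance": "web-02:9100"},
             "value": [1700000000, "0"]}]}}
""")

def down_instances(response):
    """Return instances whose `up` sample is 0 (i.e., the scrape failed)."""
    return [series["metric"]["instance"]
            for series in response["data"]["result"]
            if float(series["value"][1]) == 0.0]

down = down_instances(body)
```

Sample values arrive as strings in the API (`"0"`, `"1"`), hence the `float(...)` conversion before comparing.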

Posted 2 weeks ago

Apply

7.0 - 9.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

JOB DESCRIPTION
Are you ready to make an impact at DTCC? Do you want to work on innovative projects, collaborate with a dynamic and supportive team, and receive investment in your professional development? At DTCC, we are at the forefront of innovation in the financial markets. We are committed to helping our employees grow and succeed. We believe that you have the skills and drive to make a real impact. We foster a thriving internal community and are committed to creating a workplace that looks like the world that we serve.

Pay and Benefits:
- Competitive compensation, including base pay and annual incentive
- Comprehensive health and life insurance and well-being benefits, based on location
- Pension / Retirement benefits
- Paid Time Off and Personal/Family Care, and other leaves of absence when needed to support your physical, financial, and emotional well-being.

The Impact you will have in this role: At DTCC, the Observability team is at the forefront of ensuring the health, performance, and reliability of our critical systems and applications. We empower the organization with real-time visibility into infrastructure and business applications by leveraging cutting-edge monitoring, reporting, and visualization tools. Our team collects and analyzes metrics, logs, and traces using platforms like Splunk and other telemetry solutions. This data is essential for assessing application health and availability, and for enabling rapid root cause analysis when issues arise, helping us maintain resilience in a fast-paced, high-volume trading environment. If you're passionate about observability, data-driven problem solving, and building systems that make a real-world impact, we'd love to have you on our team.

Primary Responsibilities: As a member of DTCC's Observability team, you will play a pivotal role in enhancing our monitoring and telemetry capabilities across critical infrastructure and business applications.
Your responsibilities will include:
- Lead the migration from OpenText monitoring tools to Grafana and other open-source platforms.
- Design and deploy monitoring rules for infrastructure and business applications.
- Develop and manage alerting rules and notification workflows.
- Build real-time dashboards to visualize system health and performance.
- Configure and manage OpenTelemetry Collectors and Pipelines.
- Integrate observability tools with CI/CD, incident management, and cloud platforms.
- Deploy and manage observability agents across diverse environments.
- Perform upgrades and maintenance of observability platforms.

Qualifications:
- Minimum of 7+ years of related experience.
- Bachelor's degree preferred or equivalent experience.

Talents needed for success:
- Proven experience designing intuitive, real-time dashboards (e.g., in Grafana) that effectively communicate system health, performance trends, and business KPIs.
- Expertise in defining and tuning monitoring rules, thresholds, and alerting logic to ensure accurate and actionable incident detection.
- Strong understanding of both application-level and operating system-level metrics, including CPU, memory, disk I/O, network, and custom business metrics.
- Experience with structured log ingestion, parsing, and analysis using tools like Splunk, Fluentd, or OpenTelemetry.
- Familiarity with implementing and analyzing synthetic transactions and real user monitoring to assess end-user experience and application responsiveness.
- Hands-on experience with application tracing tools and frameworks (e.g., OpenTelemetry, Jaeger, Zipkin) to diagnose performance bottlenecks and service dependencies.
- Proficiency in configuring and using AWS CloudWatch for collecting and visualizing cloud-native metrics, logs, and events.
- Understanding of containerized environments (e.g., Docker, Kubernetes) and how to monitor container health, resource usage, and orchestration metrics.
- Ability to write scripts or small applications in languages such as Python, Java, or Bash to automate observability tasks and data processing.
- Experience with automation and configuration management tools such as Ansible, Terraform, Chef, or SCCM to deploy and manage observability components at scale.

Actual salary is determined based on the role, location, individual experience, skills, and other considerations. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
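The scripting qualification this posting lists (automating observability tasks and data processing) might look like the following minimal Python sketch of structured log parsing; the log format and sample lines are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical application log lines in a common "timestamp level message" shape:
LOG_LINES = [
    "2024-05-01T10:00:00Z INFO request served in 12ms",
    "2024-05-01T10:00:01Z ERROR upstream timeout",
    "2024-05-01T10:00:02Z INFO request served in 9ms",
    "2024-05-01T10:00:03Z WARN retrying connection",
    "2024-05-01T10:00:04Z ERROR upstream timeout",
]

LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

def level_counts(lines):
    """Count parsed log lines by severity level, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[m.group("level")] += 1
    return counts

counts = level_counts(LOG_LINES)
error_rate = counts["ERROR"] / sum(counts.values())
```

In a real pipeline the same parse-then-aggregate step would run inside Fluentd or Splunk ingestion rather than a standalone script, but the logic is identical.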

Posted 2 weeks ago

Apply

10.0 - 12.0 years

0 Lacs

Chennai, Tamil Nadu, India

On-site

JOB DESCRIPTION
Are you ready to make an impact at DTCC? Do you want to work on innovative projects, collaborate with a dynamic and supportive team, and receive investment in your professional development? At DTCC, we are at the forefront of innovation in the financial markets. We are committed to helping our employees grow and succeed. We believe that you have the skills and drive to make a real impact. We foster a thriving internal community and are committed to creating a workplace that looks like the world that we serve.

Pay and Benefits:
- Competitive compensation, including base pay and annual incentive
- Comprehensive health and life insurance and well-being benefits, based on location
- Pension / Retirement benefits
- Paid Time Off and Personal/Family Care, and other leaves of absence when needed to support your physical, financial, and emotional well-being.

The Impact you will have in this role: At DTCC, the Observability team is at the forefront of ensuring the health, performance, and reliability of our critical systems and applications. We empower the organization with real-time visibility into infrastructure and business applications by leveraging cutting-edge monitoring, reporting, and visualization tools. Our team collects and analyzes metrics, logs, and traces using platforms like Splunk and other telemetry solutions. This data is essential for assessing application health and availability, and for enabling rapid root cause analysis when issues arise, helping us maintain resilience in a fast-paced, high-volume trading environment. If you're passionate about observability, data-driven problem solving, and building systems that make a real-world impact, we'd love to have you on our team.

Primary Responsibilities: As a member of DTCC's Observability team, you will play a pivotal role in enhancing our monitoring and telemetry capabilities across critical infrastructure and business applications.
Your responsibilities will include:
- Lead the migration from OpenText monitoring tools to Grafana and other open-source platforms.
- Design and deploy monitoring rules for infrastructure and business applications.
- Develop and manage alerting rules and notification workflows.
- Build real-time dashboards to visualize system health and performance.
- Configure and manage OpenTelemetry Collectors and Pipelines.
- Integrate observability tools with CI/CD, incident management, and cloud platforms.
- Deploy and manage observability agents across diverse environments.
- Perform upgrades and maintenance of observability platforms.

Qualifications:
- Minimum of 10+ years of related experience.
- Bachelor's degree preferred or equivalent experience.

Talents needed for success:
- Proven experience designing intuitive, real-time dashboards (e.g., in Grafana) that effectively communicate system health, performance trends, and business KPIs.
- Expertise in defining and tuning monitoring rules, thresholds, and alerting logic to ensure accurate and actionable incident detection.
- Strong understanding of both application-level and operating system-level metrics, including CPU, memory, disk I/O, network, and custom business metrics.
- Experience with structured log ingestion, parsing, and analysis using tools like Splunk, Fluentd, or OpenTelemetry.
- Familiarity with implementing and analyzing synthetic transactions and real user monitoring to assess end-user experience and application responsiveness.
- Hands-on experience with application tracing tools and frameworks (e.g., OpenTelemetry, Jaeger, Zipkin) to diagnose performance bottlenecks and service dependencies.
- Proficiency in configuring and using AWS CloudWatch for collecting and visualizing cloud-native metrics, logs, and events.
- Understanding of containerized environments (e.g., Docker, Kubernetes) and how to monitor container health, resource usage, and orchestration metrics.
- Ability to write scripts or small applications in languages such as Python, Java, or Bash to automate observability tasks and data processing.
- Experience with automation and configuration management tools such as Ansible, Terraform, Chef, or SCCM to deploy and manage observability components at scale.

Actual salary is determined based on the role, location, individual experience, skills, and other considerations. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

Posted 2 weeks ago

Apply

3.0 - 5.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site

At Broadridge, we've built a culture where the highest goal is to empower others to accomplish more. If you're passionate about developing your career, while helping others along the way, come join the Broadridge team. We are seeking a highly skilled Syslog Engineer & Splunk Implementation Specialist with practical experience in deploying, configuring, and maintaining enterprise-wide logging solutions in hybrid environments. The ideal candidate will focus on hands-on implementation of syslog-based log aggregation pipelines, ensuring integration with SIEM, cloud logging, and security monitoring tools.

Key Responsibilities:
- Design, implement, and manage Splunk solutions, including architecting and scaling Splunk infrastructure in hybrid environments (AWS, on-premises, Azure).
- Demonstrate proficiency in implementing Splunk Common Information Model (CIM) normalization, ensuring consistent data formatting and enabling advanced correlation, reporting, and analysis within the Splunk platform.
- Implement and manage infrastructure automation using Terraform and Chef cookbooks to support scalable and reliable deployment environments.
- Engineer and design centralized log aggregation solutions using syslog (Logstash, Fluentd, Filebeat, etc.) and related technologies.
- Develop and maintain secure, efficient, and scalable logging architectures across both Linux and Windows operating systems.
- Administer end-to-end log management processes, including centralized aggregation, long-term archival, and swift retrieval for analysis and auditing purposes.
- Develop automation scripts to optimize log ingestion, parsing, and reporting (using Python, Bash, etc.).
- Serve as a key member of the Security Operations Center (SOC), monitoring, analyzing, and responding to security events and incidents.
- Collaborate with cross-functional teams to ensure comprehensive log coverage and compliance with security policies.
Document architecture, policies, and procedures related to logging and security event management. Required Skills and Qualifications Proven hands-on experience with Splunk architecture and SIEM engineering. Hands-on experience with DevOps tools and automation frameworks, including Terraform for infrastructure as code and Chef cookbooks for configuration management. Expertise in implementing large-scale log management, syslog engineering, and log aggregation techniques. Experience with Logstash, AWS OpenSearch, or related technology. Expertise in DevOps, deploying SIEM infrastructure through IaC (Terraform, Chef, Jenkins). Proficiency in Python and shell scripting for automation. Strong operating system knowledge - both Linux and Windows environments. Good to have at least 3 years as a SOC Analyst or similar security monitoring role. Ability to troubleshoot, optimize, and maintain large-scale log management solutions. Strong communication and documentation skills. We are dedicated to fostering a collaborative, engaging, and inclusive environment and are committed to providing a workplace that empowers associates to be authentic and bring their best to work. We believe that associates do their best when they feel safe, understood, and valued, and we work diligently and collaboratively to ensure Broadridge is a company, and ultimately a community, that recognizes and celebrates everyone's unique perspective.
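A syslog-to-Elasticsearch pipeline of the kind described in this posting can be sketched with a minimal Fluentd configuration; the listening port, hostname, and index prefix below are illustrative assumptions, not values from the posting:

```
# Listen for syslog messages (RFC 3164/5424) on UDP 5140
<source>
  @type syslog
  port 5140
  bind 0.0.0.0
  tag system.syslog
</source>

# Buffer to disk and forward to Elasticsearch for indexing
<match system.**>
  @type elasticsearch
  host elasticsearch.internal.example   # assumed hostname
  port 9200
  logstash_format true
  logstash_prefix syslog
  <buffer>
    @type file
    path /var/log/fluentd/buffer/syslog
    flush_interval 10s
  </buffer>
</match>
```

The file buffer guards against downstream outages; in a real deployment the Elasticsearch output comes from the fluent-plugin-elasticsearch gem and would also carry TLS and authentication settings.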

Posted 2 weeks ago

Apply

8.0 - 12.0 years

18 - 22 Lacs

chennai

Work from Office

Do you want to work on innovative projects, collaborate with a dynamic and supportive team, and receive investment in your professional development? At DTCC, we are at the forefront of innovation in the financial markets. We are committed to helping our employees grow and succeed. We believe that you have the skills and drive to make a real impact. We foster a thriving internal community and are committed to creating a workplace that looks like the world that we serve. Pay and Benefits: Competitive compensation, including base pay and annual incentive Comprehensive health and life insurance and well-being benefits, based on location Pension / Retirement benefits Paid Time Off and Personal/Family Care, and other leaves of absence when needed to support your physical, financial, and emotional well-being. DTCC offers a flexible/hybrid model of 3 days onsite and 2 days remote (onsite Tuesdays, Wednesdays and a third day unique to each team or employee). The Impact you will have in this role: At DTCC, the Observability team is at the forefront of ensuring the health, performance, and reliability of our critical systems and applications. We empower the organization with real-time visibility into infrastructure and business applications by leveraging cutting-edge monitoring, reporting, and visualization tools. Our team collects and analyzes metrics, logs, and traces using platforms like Splunk and other telemetry solutions. This data is essential for assessing application health and availability, and for enabling rapid root cause analysis when issues arise, helping us maintain resilience in a fast-paced, high-volume trading environment. If you're passionate about observability, data-driven problem solving, and building systems that make a real-world impact, we'd love to have you on our team. 
Primary Responsibilities: As a member of DTCC's Observability team, you will play a pivotal role in enhancing our monitoring and telemetry capabilities across critical infrastructure and business applications. Your responsibilities will include: Lead the migration from OpenText monitoring tools to Grafana and other open-source platforms. Design and deploy monitoring rules for infrastructure and business applications. Develop and manage alerting rules and notification workflows. Build real-time dashboards to visualize system health and performance. Configure and manage OpenTelemetry Collectors and Pipelines. Integrate observability tools with CI/CD, incident management, and cloud platforms. Deploy and manage observability agents across diverse environments. Perform upgrades and maintenance of observability platforms. Qualifications: Minimum of 10+ years of related experience. Bachelor's degree preferred or equivalent experience. Talent needed for success: Proven experience designing intuitive, real-time dashboards (e.g., in Grafana) that effectively communicate system health, performance trends, and business KPIs. Expertise in defining and tuning monitoring rules, thresholds, and alerting logic to ensure accurate and actionable incident detection. Strong understanding of both application-level and operating system-level metrics, including CPU, memory, disk I/O, network, and custom business metrics. Experience with structured log ingestion, parsing, and analysis using tools like Splunk, Fluentd, or OpenTelemetry. Familiarity with implementing and analyzing synthetic transactions and real user monitoring to assess end-user experience and application responsiveness. Hands-on experience with application tracing tools and frameworks (e.g., OpenTelemetry, Jaeger, Zipkin) to diagnose performance bottlenecks and service dependencies. Proficiency in configuring and using AWS CloudWatch for collecting and visualizing cloud-native metrics, logs, and events. 
Understanding of containerized environments (e.g., Docker, Kubernetes) and how to monitor container health, resource usage, and orchestration metrics. Ability to write scripts or small applications in languages such as Python, Java, or Bash to automate observability tasks and data processing. Experience with automation and configuration management tools such as Ansible, Terraform, Chef, or SCCM to deploy and manage observability components at scale. Actual salary is determined based on the role, location, individual experience, skills, and other considerations. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
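The OpenTelemetry Collector pipelines this role configures follow a receivers → processors → exporters layout; a minimal sketch, where the endpoints are assumed placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus.internal.example/api/v1/write  # assumed
  otlphttp:
    endpoint: http://tracing-backend.internal.example:4318     # assumed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Each named pipeline wires the same OTLP receiver through batching to a signal-appropriate backend; adding a logs pipeline follows the same pattern.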

Posted 3 weeks ago

Apply

5.0 - 7.0 years

3 - 5 Lacs

hyderabad, india

Hybrid

Job Purpose Designs, develops, and implements Java applications to support business requirements. Follows approved life cycle methodologies, creates design documents, writes code and performs unit and functional testing of software. Contributes to the overall architecture and standards of the group, acts as an SME and plays a software governance role. Key Activities / Outputs • Work closely with business analysts to analyse and understand the business requirements and business case, in order to produce simple, cost effective and innovative solution designs • Implement the designed solutions in the required development language (typically Java) in accordance with the Vitality Group standards, processes, tools and frameworks • Testing the quality of produced software thoroughly through participation in code reviews, the use of static code analysis tools, creation and execution of unit tests, functional regression tests, load tests and stress tests and evaluating the results of performance metrics collected on the software. • Participate in feasibility studies, proof of concepts, JAD sessions, estimation and costing sessions, evaluate and review programming methods, tools and standards, etc. 
• Maintain the system in production and provide support in the form of query resolution and defect fixes • Prepare the necessary technical documentation including payload definitions, class diagrams, activity diagrams, ERDs, operational and support documentation, etc • Driving the skills development of team members, coaching of team members for performance and coaching on career development, recruitment, staff training, performance management, etc Technical Skills or Knowledge Java, Object Orientation, Spring, Hibernate, Junit, SOA, SOAP, REST, Microservices, Docker, Data Modelling, UML, SQL, Architectural Styles, Liferay 7 (web), Kotlin (Android), Swift (iOS) Preferred Technical Skills (Would be advantageous) Kafka, Zookeeper, Zuul, Eureka, Obsidian, Elasticsearch, Kibana, Fluentd This position is a hybrid role based in Hyderabad which requires you to be in the office on a Tuesday, Wednesday and Thursday.

Posted 3 weeks ago

Apply

6.0 - 10.0 years

6 - 10 Lacs

Delhi, India

On-site

Support our Akamai edge setup, including CDN, WAF, and Bot Manager. Fix issues and keep traffic flowing fast to AWS/on-prem origins. Write rules to block API threats (like bad JWTs or injections). Set up logging with Fluentd or Logstash to Elasticsearch. Send metrics to CloudWatch (AWS) or Prometheus (on-prem). Build dashboards and alerts for latency, errors, and attacks. Deliverables Enable perpetually-available AWS-based Envoy gateway routing traffic from Akamai. WAF blocking 95%+ of test attacks. Dashboards showing real-time gateway and origin health. Qualifications and Skills Education and Experience: B.E. in Computer Science Engineering, or equivalent technical degree with strong computer science fundamentals Experience in an Agile software development environment Excellent communication and collaboration skills with the ability to work in a team-oriented environment Technical Skills: Experience with edge networks (like Akamai) and Envoy or nginx. WAF skills (any flavor). Hands-on with 2+ of: Fluentd, Logstash, CloudWatch, Elasticsearch Additional Important Requirements Strong Communication : Ability to articulate technical concepts clearly to different audiences. Collaboration & Teamwork : Works well with engineering, operations, and business teams. Problem-Solving : Logical thinking and troubleshooting mindset. Documentation & Knowledge Sharing : Contributes to runbooks and operational guides.
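As a hedged sketch of the latency alerting deliverable above, a nearest-rank p95 check like the following could gate a dashboard alert; the 250 ms threshold is an assumption for illustration, not a value from the posting:

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (in ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest-rank: ceil(pct/100 * n) as a 1-based rank
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


def should_alert(samples, p95_threshold_ms=250):
    """Fire when p95 latency breaches the (assumed) threshold."""
    return percentile(samples, 95) > p95_threshold_ms


latencies = [120, 130, 145, 150, 160, 170, 180, 190, 400, 900]
print(should_alert(latencies))  # p95 is 900 ms here -> True
```

In practice the samples would come from gateway access logs shipped via Fluentd or Logstash, and the threshold would be tuned against normal traffic.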

Posted 1 month ago

Apply

6.0 - 10.0 years

6 - 10 Lacs

Hyderabad, Telangana, India

On-site

Support our Akamai edge setup, including CDN, WAF, and Bot Manager. Fix issues and keep traffic flowing fast to AWS/on-prem origins. Write rules to block API threats (like bad JWTs or injections). Set up logging with Fluentd or Logstash to Elasticsearch. Send metrics to CloudWatch (AWS) or Prometheus (on-prem). Build dashboards and alerts for latency, errors, and attacks. Deliverables Enable perpetually-available AWS-based Envoy gateway routing traffic from Akamai. WAF blocking 95%+ of test attacks. Dashboards showing real-time gateway and origin health. Qualifications and Skills Education and Experience: B.E. in Computer Science Engineering, or equivalent technical degree with strong computer science fundamentals Experience in an Agile software development environment Excellent communication and collaboration skills with the ability to work in a team-oriented environment Technical Skills: Experience with edge networks (like Akamai) and Envoy or nginx. WAF skills (any flavor). Hands-on with 2+ of: Fluentd, Logstash, CloudWatch, Elasticsearch Additional Important Requirements Strong Communication : Ability to articulate technical concepts clearly to different audiences. Collaboration & Teamwork : Works well with engineering, operations, and business teams. Problem-Solving : Logical thinking and troubleshooting mindset. Documentation & Knowledge Sharing : Contributes to runbooks and operational guides.

Posted 1 month ago

Apply

6.0 - 10.0 years

6 - 10 Lacs

Bengaluru, Karnataka, India

On-site

Support our Akamai edge setup, including CDN, WAF, and Bot Manager. Fix issues and keep traffic flowing fast to AWS/on-prem origins. Write rules to block API threats (like bad JWTs or injections). Set up logging with Fluentd or Logstash to Elasticsearch. Send metrics to CloudWatch (AWS) or Prometheus (on-prem). Build dashboards and alerts for latency, errors, and attacks. Deliverables Enable perpetually-available AWS-based Envoy gateway routing traffic from Akamai. WAF blocking 95%+ of test attacks. Dashboards showing real-time gateway and origin health. Qualifications and Skills Education and Experience: B.E. in Computer Science Engineering, or equivalent technical degree with strong computer science fundamentals Experience in an Agile software development environment Excellent communication and collaboration skills with the ability to work in a team-oriented environment Technical Skills: Experience with edge networks (like Akamai) and Envoy or nginx. WAF skills (any flavor). Hands-on with 2+ of: Fluentd, Logstash, CloudWatch, Elasticsearch Additional Important Requirements Strong Communication : Ability to articulate technical concepts clearly to different audiences. Collaboration & Teamwork : Works well with engineering, operations, and business teams. Problem-Solving : Logical thinking and troubleshooting mindset. Documentation & Knowledge Sharing : Contributes to runbooks and operational guides.

Posted 1 month ago

Apply

4.0 - 10.0 years

0 Lacs

hyderabad, telangana

On-site

As a Big Data Architect with 4 years of experience, you will be responsible for designing and implementing scalable solutions using technologies such as Spark, Scala, Hadoop MapReduce/HDFS, PIG, HIVE, and AWS cloud computing. Your role will involve hands-on experience with tools like EMR, EC2, Pentaho BI, Impala, ElasticSearch, Apache Kafka, Node.js, Redis, Logstash, statsD, Ganglia, Zeppelin, Hue, and KETTLE. Additionally, you should have sound knowledge in Machine learning, Zookeeper, Bootstrap.js, Apache Flume, FluentD, Collectd, Sqoop, Presto, Tableau, R, GROK, MongoDB, Apache Storm, and HBASE. To excel in this role, you must have a strong background in development with both Core Java and Advanced Java. A Bachelor's degree in Computer Science, Information Technology, or MCA is required along with 4 years of relevant experience. Your analytical and problem-solving skills will be put to the test as you tackle complex data challenges. Attention to detail is crucial, and you should possess excellent written and verbal communication skills. This position requires you to work independently while also being an effective team player. With 10 years of overall experience, you will be based in either Pune or Hyderabad, India. Join us in this dynamic role where you will have the opportunity to contribute to cutting-edge data architecture solutions.

Posted 1 month ago

Apply

5.0 - 9.0 years

0 Lacs

karnataka

On-site

As a Cloud Application Developer at our organization, you will have the opportunity to help design, develop, and maintain robust cloud-native applications in an as-a-service model on a Cloud platform. Your responsibilities will include evaluating, implementing, and standardizing new tools and solutions to continuously improve the Cloud Platform. You will leverage your expertise to drive the organization's and departments' technical vision in development teams. Additionally, you will liaise with global and local stakeholders to influence technical roadmaps and passionately contribute towards hosting a thriving developer community. Encouraging contribution towards inner and open-sourcing will be a key aspect of your role. To excel in this position, you should have experience and exposure to good programming practices, including coding and testing standards. Your passion and experience in proactively investigating, evaluating, and implementing new technical solutions with continuous improvement will be highly valued. Possessing a good development culture and familiarity with industry-wide best practices is essential. A production mindset with a keen focus on reliability and quality is crucial, along with a passion for being part of a distributed, self-sufficient feature team with regular deliverables. You should be a proactive learner and continuously enhance your skills in areas such as Scrum, Data, and Automation. Your strong technical ability to monitor, investigate, analyze, and fix production issues will be essential in this role. You should also have the ability to ideate and collaborate through inner and open-sourcing and interact effectively with client managers, developers, testers, and cross-functional teams like architects. Experience working in an Agile Team and exposure to Agile/SAFe development methodologies is required, along with a minimum of 5+ years of experience in software development and architecture. 
In terms of technical skills, you should have good experience in design and development, including object-oriented programming in Python, cloud-native application development, APIs, and microservices. Familiarity with relational databases like PostgreSQL and the ability to build robust SQL queries is essential. Knowledge of tools such as Grafana for data visualization, Elasticsearch, Fluentd, and hosting applications using Containerization (Docker, Kubernetes) will be beneficial. Proficiency with CI/CD and DevOps tools like Git, Jenkins, Sonar, good system skills with Linux OS, and bash scripting are also required. An understanding of the Cloud and cloud services is a must-have skill for this role. Joining our team means being part of a company that values people as drivers of change and believes in making a positive impact on the future. We encourage creating, daring, innovating, and taking action. Our employees have the opportunity to engage in solidarity actions and contribute to reducing the carbon footprint through sustainable practices. Diversity and inclusion are core values that we uphold, and we are committed to ESG practices. If you are looking to be directly involved, grow in a stimulating environment, and make a difference, you will find a home with us.

Posted 1 month ago

Apply

5.0 - 9.0 years

0 Lacs

noida, uttar pradesh

On-site

You will be working as an AI Platform Engineer in Bangalore as part of the GenAI COE Team. Your key responsibilities will involve developing and promoting scalable AI platforms for customer-facing applications. It will be essential to evangelize the platform with customers and internal stakeholders, ensuring scalability, reliability, and performance to meet business needs. Your role will also entail designing machine learning pipelines for experiment management, model management, feature management, and model retraining. Implementing A/B testing of models and designing APIs for model inferencing at scale will be crucial. You should have proven expertise with MLflow, SageMaker, Vertex AI, and Azure AI. As an AI Platform Engineer, you will serve as a subject matter expert in LLM serving paradigms, with in-depth knowledge of GPU architectures. Expertise in distributed training and serving of large language models, along with proficiency in model and data parallel training using frameworks like DeepSpeed and service frameworks like vLLM, will be required. Demonstrating proven expertise in model fine-tuning and optimization techniques to achieve better latencies and accuracies in model results will be part of your responsibilities. Reducing training and resource requirements for fine-tuning LLM and LVM models will also be essential. Having extensive knowledge of different LLM models and providing insights on their applicability based on use cases is crucial. You should have proven experience in delivering end-to-end solutions from engineering to production for specific customer use cases. Your proficiency in DevOps and LLMOps practices, along with knowledge of Kubernetes, Docker, and container orchestration, will be necessary. A deep understanding of LLM orchestration frameworks such as Flowise, Langflow, and Langgraph is also required. 
In terms of skills, you should be familiar with LLM models like Hugging Face OSS LLMs, GPT, Gemini, Claude, Mixtral, and Llama, as well as LLM Ops tools like ML Flow, Langchain, Langraph, LangFlow, Flowise, LLamaIndex, SageMaker, AWS Bedrock, Vertex AI, and Azure AI. Additionally, knowledge of databases/data warehouse systems like DynamoDB, Cosmos, MongoDB, RDS, MySQL, PostgreSQL, Aurora, and Google BigQuery, as well as cloud platforms such as AWS, Azure, and GCP, is essential. Proficiency in DevOps tools like Kubernetes, Docker, FluentD, Kibana, Grafana, and Prometheus, along with cloud certifications like AWS Professional Solution Architect and Azure Solutions Architect Expert, will be beneficial. Strong programming skills in Python, SQL, and JavaScript are required for this full-time role, with an in-person work location.

Posted 1 month ago

Apply

5.0 - 10.0 years

13 - 18 Lacs

Bengaluru

Work from Office

Help design, develop and maintain robust cloud-native applications in an as-a-service model on a Cloud platform Evaluate, implement and standardize new tools/solutions to continuously improve the Cloud Platform Leverage expertise to drive the organization's and department's technical vision in development teams Liaise with global and local stakeholders and influence technical roadmaps Passionately contribute towards hosting a thriving developer community Encourage contribution towards inner and open sourcing, leading by example Profile required - Experience and exposure to good programming practices including coding and testing standards - Passion and experience in proactively investigating, evaluating and implementing new technical solutions with continuous improvement - Possess a good development culture and familiarity with industry-wide best practices - Production mindset with keen focus on reliability and quality - Passionate about being a part of a distributed self-sufficient feature team with regular deliverables - Proactive learner who owns skills in Scrum, Data, Automation - Strong technical ability to monitor, investigate, analyze and fix production issues. - Ability to ideate and collaborate through inner and open sourcing - Ability to interact with client managers, developers, testers and cross-functional teams like architects - Experience working in an Agile team and exposure to Agile/SAFe development methodologies. - Minimum 5+ years of experience in software development and architecture. 
- Good experience of design and development including object-oriented programming in Python, cloud-native application development, APIs and microservices - Good experience with relational databases like PostgreSQL and ability to build robust SQL queries - Knowledge of Grafana for data visualization and ability to build dashboards from various data sources - Experience with technologies like Elasticsearch and Fluentd - Experience in hosting applications using Containerization [Docker, Kubernetes] - Good understanding of CI/CD and DevOps and proficiency with tools like Git, Jenkins, Sonar - Good system skills with Linux OS and bash scripting - Understanding of the Cloud and cloud services

Posted 1 month ago

Apply

3.0 - 8.0 years

6 - 12 Lacs

Gurugram

Work from Office

Location: NCR Team Type: Platform Operations Shift Model: 24x7 Rotational Coverage / On-call Support (L2/L3) Team Overview The OpenShift Container Platform (OCP) Operations Team is responsible for the continuous availability, health, and performance of OpenShift clusters that support mission-critical workloads. The team operates under a tiered structure (L2, L3) to manage day-to-day operations, incident management, automation, and lifecycle management of the container platform. This team is central to supporting stakeholders by ensuring the container orchestration layer is secure, resilient, scalable, and optimized. L2 OCP Support & Platform Engineering (Platform Analyst) Role Focus: Advanced Troubleshooting, Change Management, Automation Experience: 3–6 years Resources : 5 Key Responsibilities: Analyze and resolve platform issues related to workloads, PVCs, ingress, services, and image registries. Implement configuration changes via YAML/Helm/Kustomize. Maintain Operators, upgrade OpenShift clusters, and validate post-patching health. Work with CI/CD pipelines and DevOps teams for build & deploy troubleshooting. Manage and automate namespace provisioning, RBAC, NetworkPolicies. Maintain logs, monitoring, and alerting tools (Prometheus, EFK, Grafana). Participate in CR and patch planning cycles. L3 – OCP Platform Architect & Automation Lead (Platform SME) Role Focus: Architecture, Lifecycle Management, Platform Governance Experience: 6+ years Resources : 2 Key Responsibilities: Own lifecycle management: upgrades, patching, cluster DR, backup strategy. Automate platform operations via GitOps, Ansible, Terraform. Lead SEV1 issue resolution, post-mortems, and RCA reviews. Define compliance standards: RBAC, SCCs, Network Segmentation, CIS hardening. Integrate OCP with IDPs (ArgoCD, Vault, Harbor, GitLab). Drive platform observability and performance tuning initiatives. Mentor L1/L2 team members and lead operational best practices. 
Core Tools & Technology Stack - Container Platform: OpenShift, Kubernetes; CLI Tools: oc, kubectl, Helm, Kustomize; Monitoring: Prometheus, Grafana, Thanos; Logging: Fluentd, EFK Stack, Loki; CI/CD: Jenkins, GitLab CI, ArgoCD, Tekton; Automation: Ansible, Terraform; Security: Vault, SCCs, RBAC, NetworkPolicies
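An alerting rule of the kind an L2 analyst would maintain in the Prometheus/Grafana stack above might look like this sketch; the metric comes from kube-state-metrics, while the threshold and labels are illustrative assumptions:

```yaml
groups:
  - name: ocp-platform
    rules:
      - alert: PodCrashLooping
        # restarts in the last 15 minutes, scaled to a count (rate is per-second)
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 900 > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

The `for: 10m` clause keeps transient restarts from paging anyone, which is exactly the noise-reduction work described in the L2 responsibilities.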

Posted 1 month ago

Apply

5.0 - 8.0 years

5 - 8 Lacs

Chennai

Work from Office

Kafka Admin: Consult with inquiring teams on how to leverage Kafka within their pipelines. Architect, build, and support existing and new Kafka clusters via IaC. Partner with Splunk teams to route traffic through Kafka by utilizing open-source agents and collectors deployed via Chef. Remediate any health issues within Kafka. Automate (where possible) any operational processes on the team. Create new and/or update monitoring dashboards and alerts as needed. Manage a continuous integration / continuous delivery (CI/CD) pipeline. Perform PoCs on new components to expand/enhance the team's Kafka offerings. Preferred Qualifications: Knowledge and experience with Splunk, Elastic, Kibana and Grafana. Knowledge and experience with log collection agents such as OpenTelemetry, Fluent Bit, Fluentd, Beats and Logstash. Knowledge and experience with Kubernetes/Docker. Knowledge and experience with Kafka Connect. Knowledge and experience with AWS or Azure. Knowledge and experience with Streaming Analytics. Mandatory Skills: API Microservice Integration. Experience: 5-8 Years.
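Routing log traffic into Kafka with an open-source collector, as this role describes, can be sketched with a minimal Fluent Bit configuration; the file paths, broker addresses, and topic name are assumptions for illustration:

```
[INPUT]
    Name   tail
    Path   /var/log/app/*.log
    Tag    app.logs

[OUTPUT]
    Name    kafka
    Match   app.*
    Brokers kafka-1.internal.example:9092,kafka-2.internal.example:9092
    Topics  app-logs
```

From the topic, downstream consumers (e.g., a Splunk ingestion pipeline or Kafka Connect sink) pick the records up, decoupling producers from the SIEM backend.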

Posted 1 month ago

Apply

3.0 - 6.0 years

5 - 9 Lacs

Bengaluru

Work from Office

Kubernetes (K8s); Python, Java; Ansible; shell scripting; experience with OpenShift Container Platform; Go; Robot Framework; experience with logging stacks like Grafana, OpenSearch, Fluentd, Logstash on K8s; DevOps frameworks (Jenkins, Ansible, etc.); experience in migrating legacy applications to K8s; application development experience on Kubernetes

Posted 2 months ago

Apply

3.0 - 7.0 years

9 - 13 Lacs

Pune

Work from Office

As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting, to deploying the latest software updates & fixes. Your primary responsibilities include: 24x7 Observability: Be part of a worldwide team that monitors the health of production systems and services around the clock, ensuring continuous reliability and optimal customer experience. Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively. Deployment and Configuration: Leverage Continuous Delivery (CI/CD) tools to deploy services and configuration changes at enterprise scale. Security and Compliance Implementation: Implementing security measures that meet or exceed industry standards for regulations such as GDPR, SOC2, ISO 27001, PCI, HIPAA, and FBA. Maintenance and Support: Tasks related to applying Couchbase security patches and upgrades, supporting Cassandra and Mongo for pager duty rotation, and collaborating with Couchbase Product support for issue resolution. Required education Bachelor's Degree Required technical and professional expertise System Monitoring and Troubleshooting: Strong skills in monitoring/observability, issue response, and troubleshooting for optimal system performance. Automation Proficiency: Proficiency in automation for production environment changes, streamlining processes for efficiency, and reducing toil. Linux Proficiency: Strong knowledge of Linux operating systems. Operation and Support Experience: Demonstrated experience in handling day-to-day operations, alert management, incident support, migration tasks, and break-fix support. 
Experience with Infrastructure as Code (Terraform/OpenTofu) Experience with ELK/EFK stack (ElasticSearch, Logstash/Fluentd, and Kibana) Preferred technical and professional experience Kubernetes/OpenShift: Strongly preferred experience in working with production Kubernetes/OpenShift environments. Automation/Scripting: In depth experience with the Ansible, Python, Terraform, and CI/CD tools such as Jenkins, IBM Continuous Delivery, ArgoCD Monitoring/Observability: Hands on experience crafting alerts and dashboards using tools such as Instana, Grafana/Prometheus Experience working in an agile team, e.g., Kanban

Posted 2 months ago

Apply

7.0 - 12.0 years

25 - 32 Lacs

Pune

Work from Office

Hi, Wishes from GSN!!! Pleasure connecting with you!!! We have been in Corporate Search Services, identifying & bringing in stellar talented professionals for our reputed IT / Non-IT clients in India. We have been successfully providing results to various potential needs of our clients for the last 20 years. Who are we looking for? A skilled IT Operations Consultant specializing in Monitoring and Observability to design, implement and optimize monitoring solutions for our customers. A strong background in monitoring, observability and IT service management is a MUST. 1. WORK LOCATION : PUNE 2. Job Role : LEAD ENGINEER 3. EXPERIENCE : 7+ yrs 4. CTC Range : Rs. 25 LPA to Rs. 30 LPA 5. Work Type : WFO ****** Looking for SHORT JOINERS ****** Job Description : Required Skills : Strong understanding of infrastructure and platform development principles and experience with programming languages such as Python and Ansible for developing custom scripts. Strong knowledge of monitoring frameworks, logging systems (ELK stack, Fluentd), and tracing tools (Jaeger, Zipkin), along with open-source solutions like Prometheus and Grafana. Extensive EXP with monitoring and observability solutions such as OpsRamp, Dynatrace, New Relic; must have worked with ITSM integration (e.g. integration with ServiceNow, BMC Remedy, etc.). Working EXP with RESTful APIs and understanding of API integration with monitoring tools. Knowledge of ITIL processes and Service Management frameworks. Familiarity with security monitoring and compliance requirements. Familiarity with AIOps and Machine Learning techniques for anomaly detection and incident prediction. Excellent analytical and problem-solving skills; ability to debug and troubleshoot complex automation issues. Roles & Responsibilities : Design end-to-end monitoring and observability solutions to provide comprehensive visibility into infrastructure, applications and networks. 
Implement monitoring tools and frameworks (e.g., Prometheus, Grafana, OpsRamp, Dynatrace, New Relic) to track key performance indicators and system health metrics. Integration of monitoring and observability solutions with IT Service Management Tools. Develop and deploy dashboards and reports to proactively identify and address system performance issues. Architect scalable observability solutions to support hybrid and multi-cloud environments. Collaborate with infrastructure, development and DevOps teams to ensure seamless integration of monitoring systems into CI/CD pipelines. Continuously optimize monitoring configurations and thresholds to minimize noise and improve incident detection accuracy. Utilize AIOps and machine learning capabilities for intelligent incident management and predictive analytics. Work closely with business stakeholders to define monitoring requirements and success metrics. Document monitoring architectures, configurations and operational procedures. ****** Looking for SHORT JOINERS ****** If interested, don't hesitate to click APPLY for IMMEDIATE response. Best Wishes, GSN HR | Google review : https://g.co/kgs/UAsF9W
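As a hedged sketch of the ITSM integration requirement above, the following builds (but does not send) a ServiceNow-style incident request from an alert. The `/api/now/table/incident` path and field names follow ServiceNow Table API conventions but should be treated as assumptions for any given instance; authentication is omitted:

```python
import json
import urllib.request


def build_incident_request(instance_url, alert):
    """Build a ServiceNow-style Table API request for an incident.

    Field names and the endpoint path are conventional assumptions;
    authentication headers are omitted from this sketch.
    """
    payload = {
        "short_description": alert["summary"],
        "urgency": str(alert.get("urgency", 2)),
        "cmdb_ci": alert.get("ci", ""),  # affected configuration item, if known
    }
    req = urllib.request.Request(
        url=f"{instance_url}/api/now/table/incident",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return req, payload


req, payload = build_incident_request(
    "https://example.service-now.com",
    {"summary": "CPU saturation on node-7", "urgency": 1},
)
print(req.get_method(), req.full_url)
```

In a monitoring tool's webhook handler, the returned request would be dispatched with `urllib.request.urlopen` (or a retrying HTTP client) after attaching credentials.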

Posted 2 months ago.


1.0 - 6.0 years

7 - 17 Lacs

Noida

Work from Office

Job Summary

Site Reliability Engineers (SREs) sit at the intersection of software engineer and systems administrator: they can both write code and manage the infrastructure on which that code runs. This is a very wide skill set, but the end goal of an SRE is always the same: to ensure that all SLAs are met, but not exceeded, so as to balance performance and reliability with operational costs. As a Site Reliability Engineer II, you will be learning our systems, improving your craft as an engineer, and taking on tasks that improve the overall reliability of the VP platform.

Key Responsibilities:
Design, implement, and maintain robust monitoring and alerting systems.
Lead observability initiatives by improving metrics, logging, and tracing across services and infrastructure.
Collaborate with development and infrastructure teams to instrument applications and ensure visibility into system health and performance.
Write Python scripts and tools for automation, infrastructure management, and incident response.
Participate in and improve the incident management and on-call process, driving down Mean Time to Resolution (MTTR).
Conduct root cause analysis and postmortems following incidents, and champion efforts to prevent recurrence.
Optimize systems for scalability, performance, and cost-efficiency in cloud and containerized environments.
Advocate and implement SRE best practices, including SLOs/SLIs, capacity planning, and reliability reviews.

Required Skills & Qualifications:
1+ years of experience in a Site Reliability Engineer or similar role.
Excellent communication skills in English.
Proficiency in Python for automation and tooling.
Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, New Relic, OpenTelemetry, etc.
Experience with log aggregation and analysis tools like the ELK Stack (Elasticsearch, Logstash, Kibana) or Fluentd.
Good understanding of cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).
Familiarity with infrastructure-as-code (Terraform, Ansible, or similar).
Strong debugging and incident response skills.
Knowledge of CI/CD pipelines and release engineering practices.

Posted 2 months ago.

