
49 Lustre Jobs

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

5.0 years

0 Lacs

India

On-site


THIS IS A LONG-TERM CONTRACT POSITION WITH ONE OF THE LARGEST GLOBAL TECHNOLOGY LEADERS. Our client is a Fortune 350 company that designs, manufactures, markets, and services semiconductor processing equipment. We are seeking an experienced High Performance Computing platform consultant to support users in the India/Asia/EU regions and to carry out platform enhancement and reliability improvement projects in alignment with the HPC architect.

Minimum qualifications:
* Bachelor's or Master's degree in Computer Science or equivalent, with 5+ years of experience in High Performance Computing technologies
* HPC environment: familiarity with running Ansys/Fluent over MPI and helping users tune their jobs in an HPC environment
* Linux administration
* Parallel and network file systems (e.g., Gluster, Lustre, ZFS, NFS, CIFS)
* MPI (OpenMPI, MPICH2, Intel MPI) and InfiniBand parallel computing
* Monitoring tools (e.g., Nagios)
* Programming skills, such as Python (especially with MPI), are nice to have
* Hands-on experience with cloud technologies; Azure and Terraform are preferred for VM creation and maintenance
* Effective communication skills (the consultant will independently engage with users and resolve incidents across global regions, including Asia and the EU)
* Ability to work independently with minimal supervision

Preferred qualifications:
* Experience with ANSYS products
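Job-tuning requests like those described above often start with simple layout arithmetic: how many nodes a given MPI rank count needs and how much of each node it leaves idle. A minimal sketch, assuming homogeneous nodes (the helper and its numbers are illustrative, not from the posting):

```python
# Hypothetical helper for the job-tuning work described above: given a
# requested MPI rank count and the cores per node, compute how many nodes
# to request from the scheduler and how ranks spread across them.
import math

def mpi_layout(total_ranks: int, cores_per_node: int) -> dict:
    """Return a node/rank layout for an MPI job (assumed homogeneous nodes)."""
    if total_ranks <= 0 or cores_per_node <= 0:
        raise ValueError("ranks and cores per node must be positive")
    nodes = math.ceil(total_ranks / cores_per_node)
    ranks_per_node = math.ceil(total_ranks / nodes)
    return {
        "nodes": nodes,
        "ranks_per_node": ranks_per_node,
        "idle_cores": nodes * cores_per_node - total_ranks,
    }

print(mpi_layout(96, 40))  # → {'nodes': 3, 'ranks_per_node': 32, 'idle_cores': 24}
```

A layout like this is a starting point only; solvers such as Fluent are often memory-bandwidth-bound, so deliberately under-subscribing cores per node can outperform a fully packed layout.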

Posted 20 hours ago

Apply

10.0 years

0 Lacs

Bengaluru, Karnataka, India

Remote


We are hiring for a product-based client for remote working.

Offering/Product Manager – HPC Services
Experience: 10-20 years

Drive the end-to-end development, positioning, and enhancement of High Performance Computing (HPC) as-a-Service under TruScale offerings. Act as the bridge between customer needs, market insights, and technical capabilities, functioning as a Product Manager for HPC solutions with a blend of offering management, market strategy, and technical understanding.

🔧 Core Responsibilities:
Customer-Centric Offering Design: Translate evolving customer requirements and industry trends into feature roadmaps and service enhancements. Understand end-user demands in HPC workloads, GPU/TPU-based acceleration, storage performance, memory optimization, and file systems like GPFS/Lustre.
Market and Competitive Assessment: Continuously evaluate the HPC landscape, track competitors (AWS HPC, Azure CycleCloud, HPE Cray, Dell Omnia), and derive positioning strategies. Assess use cases across domains such as genomics, oil & gas, AI/ML model training, simulations, and FSI.
Go-to-Market and Launch Readiness: Collaborate with Services Marketing and regional sales teams to craft value propositions, launch collateral, and enablement decks. Lead sales training, positioning sessions, and GTM alignment.
Strategic Partner & Delivery Coordination: Align with partners, ISVs, and hardware platform teams to define tech stacks and solution templates for HPC customers. Track offering readiness across delivery, operations, and post-sales support.

Preferred Profile:
Background: Product Manager / Offering Manager in Cloud, IaaS, HPC, or related high-tech domains
Experience: 8+ years in product/solution management; hands-on understanding of technologies like NVIDIA GPU stacks, containerized HPC (Singularity, Docker), scheduling systems (SLURM, PBS), and Lustre/GPFS; familiarity with as-a-Service constructs, subscription models, and TCO discussions
Skills: Product lifecycle management; project and cross-functional stakeholder management; strong articulation, documentation, and influencing ability; able to interact across sales, delivery, product, and finance
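The TCO discussions mentioned above usually reduce to comparing cumulative cost curves for ownership versus subscription. A minimal sketch with invented prices (none of these figures come from the listing):

```python
# Hypothetical TCO comparison for an HPC-as-a-Service discussion: compare the
# cumulative cost of an owned cluster (capex up front, fixed monthly opex)
# with a pay-per-use subscription. All prices are invented for illustration.
def cumulative_cost(capex: float, opex_per_month: float, months: int) -> float:
    """Total spend on an owned system after the given number of months."""
    return capex + opex_per_month * months

def breakeven_month(capex: float, opex_per_month: float,
                    subscription_per_month: float, horizon: int = 60):
    """First month where ownership becomes cheaper than subscribing, else None."""
    for m in range(1, horizon + 1):
        if cumulative_cost(capex, opex_per_month, m) < subscription_per_month * m:
            return m
    return None

# Owned: $500k up front + $10k/month; subscription: $25k/month.
print(breakeven_month(500_000, 10_000, 25_000))  # → 34
```

The design choice here is deliberate simplicity: real models add depreciation, power, staffing, and utilization, but the crossover-month framing is the core of most as-a-Service positioning conversations.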

Posted 1 day ago

Apply

10.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site


Position Description: Lenovo is seeking an experienced HPC Enterprise Architect for the Lenovo SSG Architecture team. The ideal candidate will play a key role in supporting the design, architecture, development, and deployment of HPC-related service offerings for customers globally. The chosen candidate will be part of the Lenovo Solutions & Services Group (SSG) and work closely with the HPC Practice on service offering design, architecture, definition, development, and build for our Professional Services, Managed Services, and TruScale Services businesses.

* This role is primarily internal but includes some customer-facing work.

The ideal candidate for this role is someone who has been involved in HPC from a solution architect or technical delivery lead perspective and understands the market direction. The candidate should have experience working with pre-sales, sales, and HPC delivery teams.

In this role, you will:
* Support the HPC Services Practice in the development and delivery of HPC-related service offerings
* Act as technical lead for HPC services offerings, working closely with the HPC Practice leader throughout the service offering lifecycle, including:
* Participating in the creation of new service offerings, starting with the Ideation phase, and providing expert guidance
* Collaborating with other architects and technical teams to ensure that HPC solutions are well integrated with other services and technologies
* Assisting with or leading early implementations with Technical Consultants
* Delivering presentations in support of our offering development process, field enablement/training, and service delivery activities
* Collaborate with the HPC Practice to create technical documentation and diagrams for communicating solutions and designs to customers, partners, and internal stakeholders
* Assist in the documentation of project requirements, statements of work, and service descriptions for HPC service offerings in support of the Lenovo TruScale and Hybrid Cloud businesses
* Stay up to date on emerging HPC technologies and trends and provide recommendations for new services or technology offerings
* Handle some business partner interaction

Qualifications:
* Bachelor's or master's degree in Computer Science, Electrical Engineering, or a related field
* 10+ years of experience with HPC environments
* Extensive experience in HPC architecture and design, with a proven track record of delivering complex HPC solutions
* Experience designing and implementing HPC solutions on public cloud, private cloud, and on-premises infrastructure
* Deep knowledge of HPC technologies, including MPI, OpenMP, InfiniBand, GPFS, Lustre, and other file systems, plus cluster management and scheduling tools such as Slurm, Torque, LSF, or PBS Pro
* Excellent communication skills, including the ability to communicate technical concepts to both technical and non-technical audiences
* Experience with virtualization and containerization technologies such as Docker, Kubernetes, and Singularity
* Strong understanding of networking technologies and protocols, including TCP/IP, InfiniBand, and RDMA
* Familiarity with one or more programming languages such as C, C++, Fortran, Python, or Java
* Experience working in a multi-vendor, multi-cloud environment
* Strong problem-solving skills and the ability to work under pressure in a fast-paced environment

Preferred Skills:
* Understanding of the relationship between various HPC software and hardware components
* Deep understanding of one or more specific areas is a plus
* Industry knowledge and experience with HPC solutions as deployed across a variety of environments and use cases
* Demonstrated ability to clearly articulate technical issues, resolutions, and recommendations for customers, development teams, and leadership
* Experience training others in understanding and using HPC-related solutions
* Experience working in a customer-facing environment preferred

Posted 1 day ago

Apply

10.0 - 15.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site


About The Position
Delivers High Performance Computing (HPC) application and infrastructure solutions that support complex, compute-intensive workloads, parallel filesystems, low-latency networks, artificial intelligence (AI), and machine learning (ML). This lead role carries an expectation of 10-15 years of relevant experience and includes mentoring junior members of the team.

Key Responsibilities
As a Cloud Engineer - HPC-Storage, you are responsible for providing application and infrastructure solutions that support complex, compute-intensive workloads, parallel filesystems, low-latency networks, AI, and ML. The successful candidate will partner with scientists, engineers, and other HPC PL experts to deliver HPC solutions that fully leverage computing resources on premises and in the cloud.
Design, implementation, and support of NetApp storage systems
Administration of Azure NetApp Files and Cloud Volumes ONTAP
Administration of on-prem and cloud-hosted Linux systems
Automation of system management, maintenance, and monitoring
Troubleshooting, incident, and problem management
Performance monitoring and optimization

Required Qualifications
Technical knowledge and skills: the preferred candidate will have 10-15 years of experience, including at least 5 years of Linux, cloud, and NetApp storage system administration in a large-scale enterprise environment, plus one or more of the following areas: Azure NetApp Files (ANF), Cloud Volumes ONTAP (CVO), parallel file systems (e.g., Lustre), Azure virtual machines, Azure networking, the underlying infrastructure supporting oil and gas applications, and configuration management technologies (e.g., Ansible, PowerShell, Python, and Azure).
Bachelor's degree in Computer Science, Information Systems, or a comparable field.

Chevron ENGINE supports global operations, supporting business requirements across the world. Accordingly, work hours for employees will be aligned to support business requirements. The standard work week is Monday to Friday, with working hours of 8:00am to 5:00pm or 1:30pm to 10:30pm. Chevron participates in E-Verify in certain locations as required by law.
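The performance monitoring and capacity work described above often begins with a simple utilization check across volumes. A minimal sketch (the volume names and the threshold are invented):

```python
# Hypothetical capacity check of the kind used in storage monitoring:
# flag volumes whose utilization crosses a percentage threshold so they
# can be expanded or cleaned up before jobs start failing on ENOSPC.
def over_threshold(volumes: dict, pct: float = 85.0) -> list:
    """volumes maps name -> (used_bytes, total_bytes); returns names at/above pct."""
    return sorted(
        name for name, (used, total) in volumes.items()
        if total > 0 and used / total * 100 >= pct
    )

vols = {"scratch": (90, 100), "home": (40, 100), "projects": (85, 100)}
print(over_threshold(vols))  # → ['projects', 'scratch']
```

In practice the (used, total) pairs would come from the storage platform's own reporting (e.g., ONTAP or Azure metrics APIs) rather than hard-coded numbers, and the alert would feed an incident pipeline instead of a print.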

Posted 2 days ago

Apply

5.0 years

0 Lacs

Pune, Maharashtra, India

On-site


NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for the highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a "learning machine" that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve researcher productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, and we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What You'll Be Doing
In this role you will build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also maintain and build deep learning AI/HPC GPU clusters at scale and support our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement, and support operational and reliability aspects of large-scale distributed systems, with a focus on performance at scale, real-time monitoring, logging, and alerting.
Design and implement state-of-the-art GPU compute clusters.
Optimize cluster operations for maximum reliability, efficiency, and performance.
Drive foundational improvements and automation to enhance researcher productivity.
Troubleshoot, diagnose, and root-cause system failures, isolating components and failure scenarios while working with internal and external partners.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems.
Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world.
Implement remediations across the software and hardware stack according to plan, keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters.

What We Need To See
Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments, with operational experience of a cluster of at least 2,000 GPUs.
Deep understanding of GPU computing and AI infrastructure.
Passion for solving complex technical challenges and optimizing system performance.
Experience with AI/HPC advanced job schedulers, ideally including familiarity with schedulers such as Slurm.
Working knowledge of cluster configuration management tools such as BCM or Ansible, and infrastructure-level applications such as Kubernetes, Terraform, and MySQL.
In-depth understanding of container technologies like Docker and Enroot.
Experience programming in Python and Bash scripting.

Ways To Stand Out From The Crowd
Interest in crafting, analyzing, and fixing large-scale distributed systems.
Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA.
Experience with cloud deployment, BCM, and Terraform.
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
Multi-cloud experience.

JR1993564
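One small piece of the cluster-reliability work described above is deciding which nodes to drain from the scheduler based on health metrics. A minimal sketch; the metric fields and limits are invented for illustration, not NVIDIA's actual criteria:

```python
# Hypothetical triage helper for GPU cluster health: given per-node metrics,
# return the names of nodes that should be drained (ECC errors present, or
# fewer GPUs visible than the node is supposed to expose).
def nodes_to_drain(metrics: list, max_ecc_errors: int = 0,
                   min_gpus_visible: int = 8) -> list:
    bad = []
    for node in metrics:
        if (node["ecc_errors"] > max_ecc_errors
                or node["gpus_visible"] < min_gpus_visible):
            bad.append(node["name"])
    return bad

fleet = [
    {"name": "gpu-001", "ecc_errors": 0, "gpus_visible": 8},
    {"name": "gpu-002", "ecc_errors": 3, "gpus_visible": 8},  # ECC errors
    {"name": "gpu-003", "ecc_errors": 0, "gpus_visible": 7},  # GPU fell off the bus
]
print(nodes_to_drain(fleet))  # → ['gpu-002', 'gpu-003']
```

In a real pipeline the metrics would be scraped (for example, from nvidia-smi or DCGM exporters) and the drain action issued through the scheduler, so researcher jobs never land on degraded hardware.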

Posted 3 days ago

Apply

15.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site


This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC

"The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI storage in high performance environments." – Marc Hamilton, VP, Solutions Architecture & Engineering, NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence. Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management. Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.

We are seeking a highly skilled Director of Software Engineering to lead the Infinia Storage engineering team in the development of cutting-edge Quality Engineering (QE) tools, strategies, and infrastructure. This leadership position will be responsible for driving strategic initiatives in test automation, software engineering, and the overall quality and performance of storage subsystems. You will lead a talented engineering team, collaborate across functional teams, and build innovative solutions that directly impact the quality and efficiency of product development and delivery. As a core leader in QE, you will bring both leadership and advanced technical skills to lead complex engineering efforts, expedite testing cycles, and improve overall product quality, with a focus on automation and scalable test infrastructure. The Director will play a key role in shaping the future direction of our engineering practices and in ensuring that high standards of quality are maintained throughout the development lifecycle.

Key Responsibilities:
Strategic Leadership and Vision: Develop and execute a comprehensive strategy for Infinia's engineering quality and development processes, driving innovation and improvement across the organization. Lead strategic discussions on the design, development, and deployment of test automation tools and infrastructure that support storage subsystem testing in a highly scalable, reliable, and efficient manner. Collaborate with senior leadership across engineering, product, and quality teams to define long-term goals, priorities, and strategic initiatives that drive impactful results for the business.
Team Leadership and Development: Lead and inspire a high-performing, results-driven team of software engineers focused on test automation and quality infrastructure. Encourage innovation, accountability, and growth within the team. Empower engineers to take ownership of their work and provide the guidance necessary to accomplish their objectives while balancing team priorities. Define and implement team development plans, helping to advance the careers of your engineers through mentorship, periodic check-ins, and long-term growth objectives.
Cross-Functional Collaboration: Forge strong relationships with leaders across functional teams, including Quality Engineering, Product Development, and Operations, to ensure alignment on objectives, priorities, and deliverables. Lead collaborative efforts across multi-site, multicultural engineering teams, driving timely, high-quality results through clear communication and effective problem-solving.
Test Automation and Architecture: Design and implement scalable test automation strategies that ensure flexibility, reusability, and efficiency across multiple storage platforms. Work closely with Test Architects and engineering leaders to develop robust test strategies, plans, and test cases, leveraging automated testing to accelerate development cycles. Drive the continued evolution and optimization of the test automation architecture, ensuring that it meets current and future engineering needs.
Continuous Improvement and Agile Execution: Champion iterative, agile engineering methodologies to ensure the team's output aligns with organizational goals and deliverables. Lead continuous improvement initiatives to enhance the efficiency of testing processes, reduce testing cycle time, and increase overall product quality. Establish metrics for tracking the success and impact of test automation, providing data-driven insights to leadership on key performance indicators.
Global Coordination and Impact: Coordinate with engineering teams globally to align testing efforts and share best practices. Proactively identify and address challenges in the testing and automation process, ensuring that blockers are resolved promptly to maintain project timelines.

Qualifications:
BS/MS/Ph.D. in Computer Science, Computer Engineering, Mathematics, Statistics, or a related technical field.
15+ years of experience in software development or software development in test, with a deep focus on distributed systems, data storage, or cloud computing.
8+ years of experience leading and managing a team of engineers in a software or test engineering capacity, with a strong track record of delivering results.
Significant experience with QE methodologies and functional and structural testing techniques (Agile methodologies preferred).
Expertise in Python or related high-level languages, as well as experience with automation frameworks and tools such as pytest, Bash, and Ansible (Ansible experience is a plus).

Technical Skills:
Strong understanding of distributed systems and storage architectures (parallel file systems, object storage, NVM, and key-value storage systems).
Solid understanding of test automation design and implementation, including test strategy development, architecture, and best practices.
Ability to read and understand code and logic in languages including C++ and Go, and experience with high-level programming languages.
Hands-on experience with high-performance computing system installation and management.

Leadership and Soft Skills:
Exceptional leadership and management skills with a demonstrated ability to inspire, motivate, and guide teams to high performance.
Strong verbal and written communication skills, with the ability to effectively present and discuss complex ideas with both technical and non-technical audiences.
Collaborative, team-oriented mindset with the ability to build relationships and create win-win solutions across cross-functional teams.
Self-motivated, results-driven, and able to thrive in a fast-paced, dynamic environment with evolving responsibilities.

Preferred Experience:
Knowledge of Lustre, GPFS, or other parallel file system solutions.
Familiarity with the installation, management, and optimization of high-performance computing systems.
Experience working with distributed key-value stores and NVM storage technologies.

This position requires participation in an on-call rotation to provide after-hours support as needed.
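The automation metrics mentioned above (pass rate, flakiness, and similar KPIs) can be rolled up from per-test results. A minimal sketch with invented data and field names:

```python
# Hypothetical KPI roll-up for a test-automation dashboard: compute the pass
# rate and count "flaky" tests (ones that needed retries before passing).
# The result schema is invented for illustration.
def kpis(results: list) -> dict:
    """results is a list of {'passed': bool, 'retries': int} records."""
    total = len(results)
    passed = sum(r["passed"] for r in results)
    flaky = sum(1 for r in results if r["retries"] > 0 and r["passed"])
    return {"pass_rate": round(passed / total, 3), "flaky": flaky}

runs = [
    {"passed": True, "retries": 0},
    {"passed": True, "retries": 2},   # flaky: passed only after retries
    {"passed": False, "retries": 1},
    {"passed": True, "retries": 0},
]
print(kpis(runs))  # → {'pass_rate': 0.75, 'flaky': 1}
```

Tracking flaky tests separately from hard failures is a common design choice: a falling pass rate signals product regressions, while a rising flaky count signals eroding trust in the test infrastructure itself.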

Posted 3 days ago

Apply

5.0 years

0 Lacs

India

On-site


THIS IS A LONG-TERM CONTRACT POSITION WITH ONE OF THE LARGEST GLOBAL TECHNOLOGY LEADERS. Our client is a Fortune 350 company that designs, manufactures, markets, and services semiconductor processing equipment. We are seeking an experienced High Performance Computing platform consultant to support users in the India/Asia/EU regions and to carry out platform enhancement and reliability improvement projects in alignment with the HPC architect.

Minimum qualifications:
* Bachelor's or Master's degree in Computer Science or equivalent, with 5+ years of experience in High Performance Computing technologies
* HPC environment: familiarity with running Ansys/Fluent over MPI and helping users tune their jobs in an HPC environment
* Linux administration
* Parallel and network file systems (e.g., Gluster, Lustre, ZFS, NFS, CIFS)
* MPI (OpenMPI, MPICH2, Intel MPI) and InfiniBand parallel computing
* Monitoring tools (e.g., Nagios)
* Programming skills, such as Python (especially with MPI), are nice to have
* Hands-on experience with cloud technologies; Azure and Terraform are preferred for VM creation and maintenance
* Effective communication skills (the consultant will independently engage with users and resolve incidents across global regions, including Asia and the EU)
* Ability to work independently with minimal supervision

Preferred qualifications:
* Experience with ANSYS products

Posted 4 days ago

Apply


3.0 years

0 Lacs

Pune, Maharashtra, India

On-site


This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC

"The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI storage in high performance environments." – Marc Hamilton, VP, Solutions Architecture & Engineering, NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence. Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management. Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.

Job Summary: We are looking for a Software Development Engineer in Test to work on validation of high-performance storage solutions for the HPC and AI markets. The ideal candidate will have experience designing, implementing, debugging, and running both automated and manual software-based tests in a Linux environment using shell tools and scripts.

Responsibilities for this role include, but are not limited to:
Design and develop both automated and manual test cases to validate product features
Run automated and manual tests as needed to validate product defect fixes and functionality
Work with the engineering manager and a geographically distributed team to understand product requirements and features
Triage test failures on a daily basis
Contribute to QA reports and provide input on release metrics
Contribute to and validate product documentation

Qualifications:
BS/MS in Computer Science, Computer Engineering, or an equivalent degree/experience
3+ years of component- and system-level test experience
3+ years of experience working in Linux environments with Python
Experience with distributed filesystem storage environments
Experience with file systems, hardware fundamentals, and common protocols (NFS/CIFS)
Experience in storage system troubleshooting and integration (NAS appliance experience a plus)
Excellent time management skills, with the ability to independently prioritize, multitask, and work under deadlines in a fast-paced environment
Knowledge of parallel file systems, in particular Lustre, is highly preferred
Experience with git, JIRA, Jenkins, and Gerrit preferred

Our team is highly motivated and focused on engineering excellence. We look for individuals who appreciate challenging themselves and thrive on curiosity. Engineers are encouraged to work across multiple areas of the company. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers and researchers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

DataDirect Networks, Inc. is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran status, or any other characteristic protected by applicable federal, state, or local law.

Posted 5 days ago

Apply

5.0 years

0 Lacs

Gurugram, Haryana, India

On-site

Linkedin logo

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other. We use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement and support operational and reliability aspects of large scale distributed systems with focus on performance at scale, real time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. 
Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures and isolate the components/failure scenarios while working with internal & external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience, with a minimum of 5 years of experience designing and operating large scale compute infrastructure. Proven experience in site reliability engineering for high-performance computing environments, with operational experience of a cluster of at least 2,000 GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with AI/HPC advanced job schedulers, and ideally familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure level applications, such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd Interest in crafting, analyzing, and fixing large-scale distributed systems. 
Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA. Experience with Cloud Deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience. JR1993756 Show more Show less
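The Slurm and Python/Bash automation experience this posting asks for typically meets in small glue scripts that feed monitoring and alerting. A minimal sketch of that kind of tool, assuming node-name/state pairs such as an `sinfo -N -h -o "%n %t"` invocation would produce (the format string and node names are illustrative assumptions, not part of the posting):

```python
def unhealthy_nodes(sinfo_text, bad_states=("down", "drain", "drng", "fail")):
    """Return node names whose Slurm state matches a known-bad state.

    Expects one "nodename state" pair per line, as an (assumed)
    `sinfo -N -h -o "%n %t"` call would print. Compact states may carry
    a trailing '*' (non-responding) or '~' (powered down) suffix, which
    is stripped before comparison.
    """
    flagged = []
    for line in sinfo_text.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines rather than crash a cron job
        node, state = parts
        if state.rstrip("*~").lower() in bad_states:
            flagged.append(node)
    return flagged


if __name__ == "__main__":
    sample = "gpu001 alloc\ngpu002 down*\ngpu003 idle\ngpu004 drain\n"
    print(unhealthy_nodes(sample))  # ['gpu002', 'gpu004']
```

In a real rotation this would be wired to capture live `sinfo` output and page on a non-empty result; parsing is kept separate from collection so it can be unit-tested offline.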

Posted 1 week ago

Apply

8.0 - 12.0 years

0 Lacs

Gurugram, Haryana, India

On-site

Linkedin logo

We are seeking a Lead High-Performance Computing Engineer experienced in managing and enhancing HPC environments. The ideal candidate will bring a robust engineering background with proven experience in deploying and optimizing HPC infrastructures, and will thrive in our HPC infrastructure engineering team supporting scientific research teams. Responsibilities Participate in incident resolution, software and hardware upgrades Support and maintain HPC infrastructure Implement Infrastructure as Code (IaC) automation Develop and review system operational procedures Lead troubleshooting efforts in complex systems Requirements Experience range of 8 to 12 years in HPC environments Proficiency in configuring and supporting HPC infrastructure Proficiency in Linux, including capabilities such as kernel module compilation and using debugging tools like strace, coredump, tcpdump Background in job schedulers including IBM LSF and Slurm Expertise in Bright Cluster Manager including installation and configuration tasks Knowledge of GPFS and Lustre file systems Understanding of InfiniBand and OmniPath network interconnect technologies Nice to have Familiarity with cloud-based HPC solutions Experience in system security and data protection best practices Show more Show less
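The scheduler experience these roles call for (IBM LSF, Slurm) usually starts with batch submission scripts. A minimal Slurm sketch for orientation; the partition, module, and binary names are placeholders, not anything from the posting:

```bash
#!/bin/bash
#SBATCH --job-name=cfd-demo
#SBATCH --partition=compute      # placeholder partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00          # HH:MM:SS walltime limit
#SBATCH --output=%x-%j.out       # jobname-jobid.out

module load openmpi              # assumes an environment-modules setup
srun ./solver --input case.dat   # placeholder MPI binary and input
```

Submitted with `sbatch job.sh`; `srun` then launches the requested MPI ranks (2 nodes × 32 tasks here) inside the allocation.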

Posted 1 week ago

Apply

10.0 years

0 Lacs

Chennai, Tamil Nadu, India

On-site

Linkedin logo

Job Title: Senior Consultant - DevOps GCL: D3 Introduction to role At AstraZeneca, we treat Scientific Computing as a strategic asset that underpins our advances in science. Leading-edge research strategies critically depend on elite computing capabilities and high-energy engineering teams supporting them. The Scientific Computing Platform (SCP) is AstraZeneca's innovative computing environment designed to tackle today's and tomorrow's in-silico challenges. It focuses on a platform concept, building capabilities and services around central building blocks. At its heart, it uses compute environments, a classical InfiniBand/Slurm HPC cluster, an OpenStack private cloud, and various public clouds for elasticity and scale. To exploit these resource pools most efficiently, the SCP deploys strong DevOps tooling and cloud-native technologies, aiming to adjust and adapt according to changing requirements and follow the science. The engineers and architects in the SCP team are key to continuously delivering and supporting this critical capability. Accountabilities We are looking for a highly motivated, ambitious, and independently working Scientific Computing engineer to join our global team. If you combine a DevOps way of working with a strong enterprise-ready service attitude, problem-solving skills, excitement for technology, and interest in accelerating science, then you are in good company. You will be engaged in a variety of activities stretching from daily operations and development to exciting technology exploration and strategic project delivery. You will own parts of the platform, drive its operational excellence, prioritize the development backlog, and craft the roadmap. Whether your interest is more operations or development-focused, classical HPC or cloud computing, in hardware or applications, our team operates the whole stack. Everyone on the team takes responsibility for our success. 
Based in one of the science hub sites, you will become a trusted partner to our Science community, ensuring that we deliver technology at the highest standards to enable and push their work to the next level. Translating their needs into efficient solutions and applying engineering excellence to make science successful will be your daily reward. We operate the SCP as a single global team with shared responsibilities. You will join an agile and hardworking team of technologists who share excitement for High-Performance & Big Compute, modern technology stacks, and their application in science to change the lives of patients. To us, teammates are people who have humility, feel accountable for their work (and their teams), as well as the ability and willingness to be a leader of something. Living and breathing technology is a key factor enabling us to keep pace with the life-changing ideas of AstraZeneca's scientists. Essential Skills/Experience Demonstrated expertise and 10+ years of hands-on experience operating, crafting, or engineering large-scale computing environments, with a main focus on High-Performance Computing (HPC) while having experience in DevOps practices. Demonstrable ability to drive innovative computational solutions and leverage emerging technologies within the Life Sciences domain, particularly in the Pharmaceutical Industry or Biotech sectors. Nice to have experience in administering large-scale cluster and server computing environments, utilizing related software such as Slurm, LSF, and Grid Engine. Hands-on experience collaborating within DevOps teams and applying agile methodologies to streamline operations and development processes. Providing comprehensive scientific software support for end-users, including configuration, installation, tuning, and maintenance. Proficiency in operating and managing virtualized private cloud resources, specifically with OpenStack. 
Solid understanding of Linux system administration, the TCP/IP stack, and storage subsystems. Knowledge in administering large-scale parallel filesystems, including GPFS and Lustre. Proven track record of using configuration management tools (e.g., Ansible, Salt, Puppet) and technology frameworks within IT operations. Experience in developing and managing relationships with third-party suppliers. Proficient in scripting and tool development for HPC and DevOps platforms using Bash and Python. Desirable Skills/Experience Familiarity with operating and configuring public cloud computing infrastructure, such as AWS, Azure, or GCP. Experience in managing virtualized private cloud environments, particularly with OpenStack. Knowledge of container technologies (e.g., LXD, Singularity, Docker, Kubernetes). Demonstrated development experience across multiple programming languages, tools, and technologies (e.g., Java, C++, Python, Ruby, Perl, SQL, AWS). Familiarity with HashiCorp tools like Terraform, Vault, Consul, and Nomad. Experience in platform engineering, software engineering, and machine learning best practices, including version control, continuous integration (CI), continuous development (CD), containerization, and shell scripting. Bachelor's degree required; emphasis on Computational, Physical, or Biological Science, Engineering, or related fields preferred. A Master's or Ph.D. is a plus. Solid understanding of analytic products and scientific computing software relevant to pharmaceutical R&D. Excellent written and verbal communication, teamwork, and collaboration skills. When we put unexpected teams in the same room, we unleash bold thinking with the power to inspire life-changing medicines. In-person working gives us the platform we need to connect, work at pace and challenge perceptions. That's why we work, on average, a minimum of three days per week from the office. But that doesn't mean we're not flexible. 
We balance the expectation of being in the office while respecting individual flexibility. Join us in our unique and ambitious world. At AstraZeneca, our work has a direct impact on patients by transforming our ability to develop life-changing medicines. We empower the business to perform at its peak by combining cutting-edge science with leading digital technology platforms and data. Join us at a crucial stage of our journey in becoming a digital and data-led enterprise. Make the impossible possible by building partnerships and ecosystems while driving scale and speed to deliver exponential growth. Here you'll find countless opportunities to learn and grow while working on innovative technologies. Ready to make a difference? Apply now! Show more Show less
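The scheduler administration experience listed above (Slurm, LSF, Grid Engine) plus the Python scripting requirement often shows up as tiny utilities around job accounting. A small illustrative sketch, assuming the common Slurm TimeLimit spellings (not an AstraZeneca tool):

```python
def walltime_to_seconds(s):
    """Convert a Slurm-style time string to seconds.

    Handles the "MM:SS", "HH:MM:SS", and "D-HH:MM:SS" forms of a
    Slurm TimeLimit; other accepted Slurm spellings (e.g. bare minutes)
    are deliberately out of scope for this sketch.
    """
    days = 0
    if "-" in s:
        d, s = s.split("-", 1)
        days = int(d)
    parts = [int(p) for p in s.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # left-pad missing hour/minute fields
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

Useful when summing queue-wide requested walltime or comparing a job's limit against partition policy, e.g. `walltime_to_seconds("1-02:30:00")` gives 95400 seconds.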

Posted 1 week ago

Apply

5.0 years

0 Lacs

Gurugram, Haryana, India

On-site

Linkedin logo

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for the highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life’s work: to amplify human creativity and intelligence. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world! As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other. We use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. 
We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement and support operational and reliability aspects of large scale distributed systems with focus on performance at scale, real time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures and isolate the components/failure scenarios while working with internal & external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience, with a minimum of 5 years of experience designing and operating large scale compute infrastructure. 
Proven experience in site reliability engineering for high-performance computing environments, with operational experience of a cluster of at least 2,000 GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with AI/HPC advanced job schedulers, and ideally familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure level applications, such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA. Experience with Cloud Deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience. JR1993564 Show more Show less

Posted 1 week ago

Apply

5.0 years

0 Lacs

India

On-site

Linkedin logo

THIS IS A LONG-TERM CONTRACT POSITION WITH ONE OF THE LARGEST GLOBAL TECHNOLOGY LEADERS. Our Client is a Fortune 350 company that engages in the design, manufacturing, marketing, and service of semiconductor processing equipment. We are seeking an experienced High Performance Computing platform consultant to provide support to India/Asia/EU region users and carry out platform enhancements and reliability improvement projects aligned with the HPC architect. Minimum qualifications: Bachelor’s or Master’s degree in Computer Science or equivalent with 5+ years of experience in High Performance Computing technologies HPC Environment: Familiar with use of HPC – Ansys/Fluent over MPI, helping users to tune their jobs in an HPC environment Linux administration Parallel file systems (e.g., Gluster, Lustre, ZFS, NFS, CIFS) MPI (OpenMPI, MPICH2, Intel MPI), InfiniBand parallel computing Monitoring tools – e.g., Nagios Programming skills such as in Python would be nice to have, especially using MPI Experienced and hands-on with Cloud technologies: Prefer using Azure and Terraform for VM creation and maintenance Effective communication skills (the resource would independently engage and address user requests and resolve incidents for global regions – Asia, EU included) Ability to work independently with minimal supervision Preferred Qualifications: Experience with ANSYS Products Show more Show less
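Helping users tune MPI jobs, as this posting describes, often reduces to checking how evenly their work divides across ranks before they ask for more nodes. A generic illustration of the block distribution most MPI codes use (not tied to Ansys/Fluent or any specific solver):

```python
def block_partition(n_items, n_ranks):
    """Split n_items into n_ranks contiguous blocks as evenly as possible.

    The first (n_items % n_ranks) ranks get one extra item, mirroring the
    block distribution commonly used for mesh cells or matrix rows. A
    spread larger than one item between ranks signals load imbalance.
    """
    base, extra = divmod(n_items, n_ranks)
    return [base + 1 if r < extra else base for r in range(n_ranks)]


if __name__ == "__main__":
    # 10 cells over 4 ranks: two ranks carry one extra cell
    print(block_partition(10, 4))  # [3, 3, 2, 2]
```

A quick check like this tells a user whether an odd rank count leaves some ranks nearly idle, which is often the cheapest tuning win available.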

Posted 1 week ago

Apply


8.0 - 12.0 years

0 Lacs

Chennai, Tamil Nadu, India

On-site

Linkedin logo

We are seeking a Lead High-Performance Computing Engineer experienced in managing and enhancing HPC environments. The ideal candidate will bring a robust engineering background with proven experience in deploying and optimizing HPC infrastructures, and will thrive in our HPC infrastructure engineering team supporting scientific research teams. Responsibilities Participate in incident resolution, software and hardware upgrades Support and maintain HPC infrastructure Implement Infrastructure as Code (IaC) automation Develop and review system operational procedures Lead troubleshooting efforts in complex systems Requirements Experience range of 8 to 12 years in HPC environments Proficiency in configuring and supporting HPC infrastructure Proficiency in Linux, including capabilities such as kernel module compilation and using debugging tools like strace, coredump, tcpdump Background in job schedulers including IBM LSF and Slurm Expertise in Bright Cluster Manager including installation and configuration tasks Knowledge of GPFS and Lustre file systems Understanding of InfiniBand and OmniPath network interconnect technologies Nice to have Familiarity with cloud-based HPC solutions Experience in system security and data protection best practices Show more Show less

Posted 1 week ago

Apply

5.0 - 8.0 years

0 Lacs

Gurugram, Haryana, India

On-site

Linkedin logo

We are seeking a Senior High-Performance Computing Engineer with a strong engineering background and hands-on experience in deployment and optimization of HPC infrastructure. This role involves daily operations and engineering activities in the HPC environments to support a scientific research team’s use of the HPC cluster. Responsibilities Support HPC infrastructure Implement Infrastructure as Code (IaC) automation Participate in incident resolution, software and hardware upgrades Requirements 5 to 8 years of HPC engineering experience Proficiency in configuring and supporting HPC infrastructure Proficiency in Linux including kernel module compilation, debugging tools (strace, coredump, tcpdump) Competency in job schedulers (IBM LSF, Slurm) Expertise in Bright Cluster Manager including installation and configuration Knowledge of GPFS/Lustre filesystems Background in InfiniBand/OmniPath network interconnect Nice to have Familiarity with efficiency optimization methods for HPC Show more Show less

Posted 1 week ago

Apply


7.0 years

0 Lacs

Pune, Maharashtra, India

Remote

Linkedin logo

We are Lenovo. We do what we say. We own what we do. We WOW our customers. Lenovo is a US$57 billion revenue global technology powerhouse, ranked #248 in the Fortune Global 500, and serving millions of customers every day in 180 markets. Focused on a bold vision to deliver Smarter Technology for All, Lenovo has built on its success as the world’s largest PC company with a full-stack portfolio of AI-enabled, AI-ready, and AI-optimized devices (PCs, workstations, smartphones, tablets), infrastructure (server, storage, edge, high performance computing and software defined infrastructure), software, solutions, and services. Lenovo’s continued investment in world-changing innovation is building a more equitable, trustworthy, and smarter future for everyone, everywhere. Lenovo is listed on the Hong Kong stock exchange under Lenovo Group Limited (HKSE: 992) (ADR: LNVGY). To find out more visit www.lenovo.com and read about the latest news via our StoryHub. Position Description Lenovo is seeking an experienced High-Performance Computing (HPC) Solutions Architect who will play a key role in supporting the development and deployment of HPC-related service offerings, including Professional and Managed Services. The chosen candidate will be part of Lenovo Solutions Services Group (SSG) as a member of the Center of Excellence team, driving service offering design and definition for our Professional Services, Managed Services, and TruScale Services businesses. This role is primarily internal but includes some customer-facing work. This is a remote-only role. In This Role You Will Support the development and delivery of HPC service offerings for enterprise, research, and government customers. Work with leading enterprises, research institutions, and cloud providers to integrate customer and market requirements into service offerings and technology roadmaps. 
Architect full-stack HPC infrastructure solutions, including compute, networking, storage, security, and workload management components, to support high-performance computing workloads. Demonstrate expertise in HPC technologies, including cluster computing, high-speed interconnects (InfiniBand, Ethernet), parallel file systems (Lustre, GPFS), and workload orchestration tools (Slurm, PBS, LSF). Leverage experience with on-premises, hybrid, and cloud-based HPC environments, optimizing workload placement and performance. Work across internal teams, divisions, and external partners to develop and deliver innovative HPC solutions that drive business and research advancements. Demonstrate strong written and verbal communication skills, with the ability to present solutions to customers, internal stakeholders, and partners. Drive strategic relationships with HPC enterprises, research institutions, cloud service providers (CSPs), and technology partners. Act as the technical lead for HPC-based infrastructure solutions, collaborating with the HPC practice leader throughout the service offering lifecycle, including: assisting in the creation of new service offerings, from concept development to implementation guidance; leading initial implementations alongside technical consultants; and delivering presentations in support of offering development and go-to-market strategies, field enablement and training initiatives, and service delivery execution. Assist in documentation of project requirements, statements of work, and service descriptions. Collaborate with business partners to enhance solution capabilities and integration. Qualifications Bachelor's degree, MBA, or equivalent experience required. 7+ years of experience architecting and deploying enterprise HPC solutions. Minimum of 5+ years of experience working with at least one cloud platform (VMware, AWS, Azure, GCP) or on-premises HPC clusters. 
Proven expertise in designing, developing, and delivering large-scale HPC platforms and solutions. Experience implementing high-availability, high-performance computing architectures with defined scalability and fault-tolerance objectives. Strong background in HPC system design, workload scheduling, parallel computing, and data-intensive computing frameworks. Hands-on experience with automation tools such as Terraform, Ansible, and SaltStack for HPC environments. Certification in HPC technologies or cloud-based HPC solutions preferred. Preferred Skills Experience developing infrastructure strategy and reference architectures for HPC deployments. Hands-on expertise in HPC hardware, software, and cloud-native architectures for scientific computing and AI workloads. Experience designing and implementing highly available, secure, and scalable HPC solutions. Strong scripting/programming skills (Python, Bash, C, MPI, OpenMP) for automation and performance tuning. Deep knowledge of high-speed networking (RDMA, InfiniBand), distributed storage, and parallel computing models. Experience with monitoring, observability, and performance optimization for HPC clusters. Expertise in defining and delivering HPC migration, implementation, operations, and optimization initiatives. Industry knowledge in key HPC-driven sectors such as scientific research, healthcare, financial modeling, and engineering simulations. Experience in systems integration and managing direct/channel sales engagements. Strong understanding of HPC containerization, workflow management, and cloud-bursting strategies. Demonstrated ability to present technical solutions and recommendations to customers, engineering teams, and leadership. Experience in a customer-facing HPC services environment, providing professional and managed services. TOGAF 9 Certified or equivalent architecture framework certification preferred. You will report to SSG (Solutions Services Group) organization structure. 
SSG has been focusing on the expanding IT service market, especially the digital workplace services opportunity, the growing demand for aaS (as a Service) models, and customers’ stronger preference for sustainability services. Meanwhile, SSG has continued to invest in software tools, platforms, and repeatable vertical solutions with our own IP, and focus on vertical solutions in manufacturing, retail, healthcare, education, and Smart City. We are expanding TruScale as a Service to include Digital Workplace Solutions, developing our Hybrid Cloud solutions, and exploring Metaverse solutions. We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, religion, sexual orientation, gender identity, national origin, status as a veteran, and basis of disability or any federal, state, or local protected class. Show more Show less
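Sizing and performance-tuning conversations in HPC services of the kind this role describes often lean on Amdahl's law, which bounds the speedup of a partially parallel workload. A small generic illustration (not a Lenovo tool):

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup under Amdahl's law: 1 / ((1 - p) + p / n).

    `parallel_fraction` (p) is the share of runtime that parallelizes;
    the serial remainder caps the achievable speedup at 1 / (1 - p).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)
```

For example, a solver that is 95% parallel speeds up roughly 5.9x on 8 workers, yet can never exceed 20x (1 / 0.05) no matter how many nodes are added, which is why reducing the serial fraction often matters more than buying hardware.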

Posted 1 week ago

Apply

2.0 years

5 - 7 Lacs

Hyderābād

On-site

Job Description:
We are seeking a talented Linux System Administrator with expertise in managing High-Performance Computing (HPC) servers to join our team. The ideal candidate will be responsible for the deployment, configuration, optimization, and maintenance of our HPC infrastructure, ensuring maximum performance and reliability for our clients.

Responsibilities:
- Install, configure, and maintain Linux-based HPC clusters, including hardware and software components such as job schedulers, parallel file systems, and MPI libraries.
- Optimize system performance and resource utilization to meet the computational demands of our clients' workloads.
- Monitor system health and performance, troubleshoot issues, and implement solutions to minimize downtime and disruptions.
- Collaborate with researchers, engineers, and other stakeholders to understand their computational requirements and tailor HPC solutions to meet their needs.
- Develop and maintain documentation, best practices, and standard operating procedures for HPC system administration tasks.
- Implement security measures to protect HPC systems and data from unauthorized access and cyber threats.
- Stay up to date with the latest developments in HPC technologies and best practices, and recommend upgrades or improvements as needed.
- Provide technical support and training to end users, including researchers and application developers, to help them utilize HPC resources effectively.

Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Proven experience as a Linux System Administrator with a focus on HPC environments.
- In-depth knowledge of Linux operating systems, cluster management tools (e.g., Slurm, PBS Pro), and parallel file systems (e.g., Lustre, GPFS).
- Experience with HPC hardware components such as compute nodes, interconnects (e.g., InfiniBand), and storage systems.
- Proficiency in scripting languages (e.g., Bash, Python) for automation and system administration tasks.

Job Types: Full-time, Permanent
Pay: ₹500,000.00 - ₹700,000.00 per year
Benefits: Health insurance, Provident Fund
Schedule: Day shift
Ability to commute/relocate: Hyderabad, Telangana: Reliably commute or planning to relocate before starting work (Preferred)
Experience: System Administrator: 2 years (Required); Linux: 2 years (Required); HPC: 1 year (Required)
Work Location: In person
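The "monitor system health and performance" duty in listings like this one is typically backed by small scripts run on each node. Below is a minimal Python sketch of such a check; the thresholds (ALERT_LOAD_PER_CPU, ALERT_DISK_PCT) are illustrative assumptions, not values from the posting.

```python
# Minimal per-node health check sketch for an HPC cluster (assumed
# thresholds, Linux-style /proc metrics via the standard library).
import os
import shutil

ALERT_LOAD_PER_CPU = 2.0   # 1-min load average per core considered unhealthy
ALERT_DISK_PCT = 90.0      # percent-full threshold for the checked filesystem

def node_health(path="/"):
    """Return basic health metrics for the local node."""
    load1, _, _ = os.getloadavg()
    ncpu = os.cpu_count() or 1
    usage = shutil.disk_usage(path)
    disk_pct = 100.0 * usage.used / usage.total
    return {
        "load_per_cpu": load1 / ncpu,
        "disk_pct_used": disk_pct,
        "healthy": load1 / ncpu < ALERT_LOAD_PER_CPU and disk_pct < ALERT_DISK_PCT,
    }

if __name__ == "__main__":
    print(node_health())
```

In practice the output of a check like this would feed a monitoring system (e.g., a Nagios plugin exit code or a Prometheus gauge) rather than being printed.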

Posted 1 week ago

Apply


2.0 years

5 - 7 Lacs

Hyderābād

On-site

Job Description:
We are seeking a talented Linux System Administrator with expertise in managing High-Performance Computing (HPC) servers to join our team. The ideal candidate will be responsible for the deployment, configuration, optimization, and maintenance of our HPC infrastructure, ensuring maximum performance and reliability for our clients.

Responsibilities:
- Install, configure, and maintain Linux-based HPC clusters, including hardware and software components such as job schedulers, parallel file systems, and MPI libraries.
- Optimize system performance and resource utilization to meet the computational demands of our clients' workloads.
- Monitor system health and performance, troubleshoot issues, and implement solutions to minimize downtime and disruptions.
- Collaborate with researchers, engineers, and other stakeholders to understand their computational requirements and tailor HPC solutions to meet their needs.
- Develop and maintain documentation, best practices, and standard operating procedures for HPC system administration tasks.
- Implement security measures to protect HPC systems and data from unauthorized access and cyber threats.
- Stay up to date with the latest developments in HPC technologies and best practices, and recommend upgrades or improvements as needed.
- Provide technical support and training to end users, including researchers and application developers, to help them utilize HPC resources effectively.

Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Proven experience as a Linux System Administrator with a focus on HPC environments.
- In-depth knowledge of Linux operating systems, cluster management tools (e.g., Slurm, PBS Pro), and parallel file systems (e.g., Lustre, GPFS).
- Experience with HPC hardware components such as compute nodes, interconnects (e.g., InfiniBand), and storage systems.
- Proficiency in scripting languages (e.g., Bash, Python) for automation and system administration tasks.

Job Types: Full-time, Permanent
Pay: ₹500,000.00 - ₹700,000.00 per year
Benefits: Health insurance
Schedule: Day shift, Morning shift
Supplemental Pay: Yearly bonus
Ability to commute/relocate: Hyderabad, Telangana: Reliably commute or planning to relocate before starting work (Preferred)
Experience: Service engineer / System Administrator: 2 years (Required); Linux: 2 years (Required); HPC: 1 year (Required)
Work Location: In person
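The "optimize system performance and resource utilization" responsibility above often starts with scheduler accounting data: comparing CPU time actually consumed against CPU time allocated. The sketch below parses text mimicking `sacct --format=JobID,AllocCPUS,TotalCPU,Elapsed -P` output; the field layout and sample values are assumptions for illustration, not real cluster data.

```python
# Illustrative sketch: flag poorly-utilized jobs from Slurm-style
# accounting lines of the form "JobID|AllocCPUS|TotalCPU|Elapsed".

def hms_to_seconds(hms):
    """Convert an H:MM:SS string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def cpu_efficiency(lines):
    """Yield (jobid, efficiency), efficiency = TotalCPU / (AllocCPUS * Elapsed)."""
    for line in lines:
        jobid, ncpu, totalcpu, elapsed = line.split("|")
        wall = hms_to_seconds(elapsed)
        used = hms_to_seconds(totalcpu)
        yield jobid, used / (int(ncpu) * wall)

sample = [
    "1001|8|3:50:00|0:30:00",   # 8 CPUs for 30 min, 3h50m CPU time used
    "1002|16|0:40:00|1:00:00",  # 16 CPUs for 1 h, only 40 min CPU time used
]

for jobid, eff in cpu_efficiency(sample):
    print(f"job {jobid}: {eff:.0%} CPU efficiency")
```

A job like 1002 (under 5% efficiency) is the kind an administrator would follow up on with the user, e.g. to reduce its core count or check its MPI scaling.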

Posted 1 week ago

Apply


5.0 years

0 Lacs

Pune, Maharashtra, India

On-site

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC

"The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI storage in high-performance environments." – Marc Hamilton, VP, Solutions Architecture & Engineering, NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence. Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management.

Job Description

Location: Pune, India

About The Role:
We are looking for a Senior DevOps Engineer to join our high-impact team in Pune, India. You will lead the design and implementation of scalable, secure, and highly available infrastructure across both cloud and on-premise environments. This role demands a deep understanding of Linux systems, infrastructure automation, and performance tuning, especially in high-performance computing (HPC) setups. As a technical leader, you'll collaborate closely with development, QA, and operations teams to drive DevOps best practices, tool adoption, and overall infrastructure reliability.

Key Responsibilities:
- Design, build, and maintain Linux-based infrastructure across cloud (primarily AWS) and physical data centers.
- Implement and manage Infrastructure as Code (IaC) using tools such as CloudFormation, Terraform, Ansible, and Chef.
- Develop and manage CI/CD pipelines using Jenkins, Git, and Gerrit to support continuous delivery.
- Automate provisioning, configuration, and software deployments with Bash, Python, Ansible, etc.
- Set up and manage monitoring/logging systems such as Prometheus, Grafana, and the ELK stack.
- Optimize system performance and troubleshoot critical infrastructure issues related to networking, filesystems, and services.
- Configure and maintain storage and filesystems including ext4, XFS, LVM, NFS, iSCSI, and potentially Lustre.
- Manage PXE boot infrastructure using Cobbler/Kickstart, and create/maintain custom ISO images.
- Implement infrastructure security best practices, including IAM, encryption, and firewall policies.
- Act as a DevOps thought leader, mentor junior engineers, and recommend tooling and process improvements.
- Maintain clear and concise documentation of systems, processes, and best practices.
- Collaborate with cross-functional teams to ensure reliable and scalable application delivery.

Required Skills & Experience:
- 5+ years of experience in DevOps, SRE, or infrastructure engineering.
- Deep expertise in Linux system administration, especially around storage, networking, and process control.
- Strong proficiency in scripting (e.g., Bash, Python) and configuration management tools (Chef, Ansible).
- Proven experience managing on-premise data center infrastructure, including provisioning and PXE boot tools.
- Familiarity with CI/CD systems, Agile workflows, and Git-based source control (Gerrit/GitHub).
- Experience with cloud services, preferably AWS, and hybrid cloud models.
- Knowledge of virtualization (e.g., KVM, Vagrant) and containerization (Docker, Podman, Kubernetes).
- Excellent communication, collaboration, and documentation skills.

Nice to Have:
- Hands-on experience with Lustre or other distributed/parallel filesystems.
- Experience in High-Performance Computing (HPC) environments.
- Familiarity with Kubernetes deployments in hybrid clusters.

DataDirect Networks (DDN) is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran status, or any other characteristic protected by applicable federal, state, or local law.
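The PXE/Cobbler responsibility in this listing usually means templating a boot menu entry per node rather than hand-editing configs. Below is a hypothetical Python sketch of that pattern; the kernel paths, dracut-style `ip=` arguments, and the `deploy.example.com` Kickstart URL are placeholders, not a real site configuration.

```python
# Hypothetical sketch: generate per-host PXELINUX menu entries for
# Kickstart-based provisioning. All paths/URLs below are placeholders.

PXE_TEMPLATE = """\
LABEL {hostname}
  KERNEL images/rocky9/vmlinuz
  APPEND initrd=images/rocky9/initrd.img \
inst.ks=http://deploy.example.com/ks/{hostname}.cfg ip={ip}::{gw}:255.255.255.0
"""

def render_pxe_entries(hosts, gateway):
    """Render one PXE menu entry per (hostname, ip) pair."""
    return "\n".join(
        PXE_TEMPLATE.format(hostname=h, ip=ip, gw=gateway)
        for h, ip in hosts
    )

nodes = [("node001", "10.0.0.11"), ("node002", "10.0.0.12")]
print(render_pxe_entries(nodes, gateway="10.0.0.1"))
```

In a real deployment this generation step is what tools like Cobbler automate internally; the value of knowing the pattern is being able to debug or extend it when the tool's defaults do not fit.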

Posted 2 weeks ago

Apply
cta

Start Your Job Search Today

Browse through a variety of job opportunities tailored to your skills and preferences. Filter by location, experience, salary, and more to find your perfect fit.

Job Application AI Bot

Job Application AI Bot

Apply to 20+ Portals in one click

Download Now

Download the Mobile App

Instantly access job listings, apply easily, and track applications.

Featured Companies