49 Lustre Jobs - Page 2

JobPe aggregates results for easy application access, but you actually apply on the job portal directly.

3.0 years

3 - 3 Lacs

Bengaluru

On-site

Location: Bengaluru, KA, IN Company: ExxonMobil About us At ExxonMobil, our vision is to lead in energy innovations that advance modern living and a net-zero future. As one of the world’s largest publicly traded energy and chemical companies, we are powered by a unique and diverse workforce fueled by the pride in what we do and what we stand for. The success of our Upstream, Product Solutions and Low Carbon Solutions businesses is the result of the talent, curiosity and drive of our people. They bring solutions every day to optimize our strategy in energy, chemicals, lubricants and lower-emissions technologies. We invite you to bring your ideas to ExxonMobil to help create sustainable solutions that improve quality of life and meet society’s evolving needs. Learn more about our What and our Why and how we can work together . ExxonMobil’s affiliates in India ExxonMobil’s affiliates have offices in India in Bengaluru, Mumbai and the National Capital Region. ExxonMobil’s affiliates in India supporting the Product Solutions business engage in the marketing, sales and distribution of performance as well as specialty products across chemicals and lubricants businesses. The India planning teams are also embedded with global business units for business planning and analytics. ExxonMobil’s LNG affiliate in India supporting the upstream business provides consultant services for other ExxonMobil upstream affiliates and conducts LNG market-development activities. The Global Business Center - Technology Center provides a range of technical and business support services for ExxonMobil’s operations around the globe. ExxonMobil strives to make a positive contribution to the communities where we operate and its affiliates support a range of education, health and community-building programs in India. Read more about our Corporate Responsibility Framework. To know more about ExxonMobil in India, visit ExxonMobil India and the Energy Factor India. 
What role you will play in our team
The HPC Systems Engineer role has the overall responsibility to work within a team to provide a performant, reliable, and secure high-performance computing (HPC) environment. The HPC Systems Engineer will be involved in various aspects of designing and engineering our HPC system and will be responsible for managing day-to-day operations and maintenance activities including, but not limited to, the following: general troubleshooting of any issues that may arise, monitoring overall system health, performing system maintenance tasks, and evaluating new hardware/system software. The job location is Bengaluru, Karnataka.
What you will do
Establish strategies for overall support of the system. Evaluate new hardware and software and understand the potential benefits and impacts they can have in the environment. Perform hardware maintenance. Perform software installations and upgrades, inclusive of operating system. Monitor overall system performance and health. Provide support for the management of data in the environment. Work with users to resolve problems and ensure they are able to effectively utilize the system. Interact with both business customers and technical teams that are globally distributed across varied time zones. Engage with vendors for problem resolution of existing infrastructure and for discussion of roadmaps and new technologies for evaluation. Foster a supportive work environment and maintain open, productive interactions within the team and across organizations. Build and maintain cross-organizational contacts to facilitate execution of work.
About You
Skills and Qualifications
Bachelor of Engineering degree with a score of 70% or above (or equivalent CGPA). Excellent technical, analytical, and communication skills. A minimum of 3 years of hands-on Linux experience (e.g. RHEL, CentOS) and production infrastructure support (e.g. networking, storage, monitoring, compute, installation, configuration, maintenance, upgrade, retirement). Experience in system administration and technical support (e.g. installation, configuration, maintenance, upgrade, retirement, problem resolution). Experience in HPC technologies such as parallel/distributed file systems (e.g. Lustre, GPFS), high-speed interconnect fabrics (e.g. InfiniBand, Omni-Path), and HPC batch scheduling software suites (e.g. PBSPro, SLURM). Proficiency in technical writing and documentation of solutions. Solid understanding of data center operations fundamentals in networking, cooling, and power. Works well in a team environment. Self-motivated. Minimum 7 years of experience working in High Performance Computing systems.
Preferred Qualifications/Experience
Strong IT skills in infrastructure and applications. Experience with supporting large-scale production environments. Experience in implementing changes and security controls in a global framework. Understanding of data center operations fundamentals in networking, cooling, and power. Knowledge and experience with installing/compiling vendor and open-source software. Knowledge and experience with application/infrastructure deployment and support in one or more of the major cloud environments. Comfortable relocating to Bengaluru and working the 1:30 PM to 10:30 PM IST shift.
Your benefits
An ExxonMobil career is one designed to last. Our commitment to you runs deep: our employees grow personally and professionally, with benefits built on our core categories of health, security, finance and life.
We offer you: Competitive compensation Medical plans, maternity leave and benefits, life, accidental death and dismemberment benefits Retirement benefits Global networking & cross-functional opportunities Annual vacations & holidays Day care assistance program Training and development program Tuition assistance program Workplace flexibility policy Relocation program Transportation facility Please note benefits may change from time to time without notice, subject to applicable laws. The benefits programs are based on the Company’s eligibility guidelines. Stay connected with us Learn more about ExxonMobil in India, visit ExxonMobil India and Energy Factor India . Follow us on LinkedIn and Instagram Like us on Facebook Subscribe our channel at YouTube EEO Statement ExxonMobil is an Equal Opportunity Employer: All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, national origin or disability status. Business solicitation and recruiting scams ExxonMobil does not use recruiting or placement agencies that charge candidates an advance fee of any kind (e.g., placement fees, immigration processing fees, etc.). Follow the LINK to understand more about recruitment scams in the name of ExxonMobil. Nothing herein is intended to override the corporate separateness of local entities. Working relationships discussed herein do not necessarily represent a reporting connection, but may reflect a functional guidance, stewardship, or service relationship. Exxon Mobil Corporation has numerous affiliates, many with names that include ExxonMobil, Exxon, Esso and Mobil. For convenience and simplicity, those terms and terms like corporation, company, our, we and its are sometimes used as abbreviated references to specific affiliates or affiliate groups. Abbreviated references describing global or regional operational organizations and global or regional business lines are also sometimes used for convenience and simplicity. 
Similarly, ExxonMobil has business relationships with thousands of customers, suppliers, governments, and others. For convenience and simplicity, words like venture, joint venture, partnership, co-venturer, and partner are used to indicate business relationships involving common activities and interests, and those words may not indicate precise legal relationships.
Job Segment: Cloud, Sustainability, Systems Engineer, Embedded, CSR, Technology, Energy, Engineering, Management

Posted 2 weeks ago

8.0 - 12.0 years

0 Lacs

Bangalore Urban, Karnataka, India

On-site

We are seeking a Lead High-Performance Computing Engineer experienced in managing and enhancing HPC environments. The ideal candidate will bring a robust engineering background with proven experience in deploying and optimizing HPC infrastructures, and will thrive in our HPC infrastructure engineering team supporting scientific research teams.
Responsibilities
Participate in incident resolution and in software and hardware upgrades. Support and maintain HPC infrastructure. Implement Infrastructure as Code (IaC) automation. Develop and review system operational procedures. Lead troubleshooting efforts in complex systems.
Requirements
Experience range of 8 to 12 years in HPC environments. Proficiency in configuring and supporting HPC infrastructure. Proficiency in Linux, including capabilities such as kernel module compilation and using debugging tools like strace, coredump, tcpdump. Background in job schedulers including IBM LSF and Slurm. Expertise in Bright Cluster Manager including installation and configuration tasks. Knowledge of GPFS and Lustre file systems. Understanding of InfiniBand and OmniPath network interconnect technologies.
Nice to have
Familiarity with cloud-based HPC solutions. Experience in system security and data protection best practices.

Posted 2 weeks ago

7.0 years

0 Lacs

Gurgaon, Haryana, India

Remote

We are Lenovo. We do what we say. We own what we do. We WOW our customers. Lenovo is a US$57 billion revenue global technology powerhouse, ranked #248 in the Fortune Global 500, and serving millions of customers every day in 180 markets. Focused on a bold vision to deliver Smarter Technology for All, Lenovo has built on its success as the world’s largest PC company with a full-stack portfolio of AI-enabled, AI-ready, and AI-optimized devices (PCs, workstations, smartphones, tablets), infrastructure (server, storage, edge, high performance computing and software defined infrastructure), software, solutions, and services. Lenovo’s continued investment in world-changing innovation is building a more equitable, trustworthy, and smarter future for everyone, everywhere. Lenovo is listed on the Hong Kong stock exchange under Lenovo Group Limited (HKSE: 992) (ADR: LNVGY). To find out more visit www.lenovo.com and read about the latest news via our StoryHub. Position Description Lenovo is seeking an experienced High-Performance Computing (HPC) Solutions Architect who will play a key role in supporting the development and deployment of HPC-related service offerings, including Professional and Managed Services. The chosen candidate will be part of Lenovo Solutions Services Group (SSG) as a member of the Center of Excellence team, driving service offering design and definition for our Professional Services, Managed Services, and TruScale Services businesses. This role is primarily internal but includes some customer-facing work. This is a remote-only role. In This Role You Will Support the development and delivery of HPC service offerings for enterprise, research, and government customers. Work with leading enterprises, research institutions, and cloud providers to integrate customer and market requirements into service offerings and technology roadmaps. 
Architect full-stack HPC infrastructure solutions, including compute, networking, storage, security, and workload management components, to support high-performance computing workloads. Demonstrate expertise in HPC technologies, including cluster computing, high-speed interconnects (InfiniBand, Ethernet), parallel file systems (Lustre, GPFS), and workload orchestration tools (Slurm, PBS, LSF). Leverage experience with on-premises, hybrid, and cloud-based HPC environments, optimizing workload placement and performance. Work across internal teams, divisions, and external partners to develop and deliver innovative HPC solutions that drive business and research advancements. Demonstrate strong written and verbal communication skills, with the ability to present solutions to customers, internal stakeholders, and partners. Drive strategic relationships with HPC enterprises, research institutions, cloud service providers (CSPs), and technology partners. Act as the technical lead for HPC-based infrastructure solutions, collaborating with the HPC Practice Leader throughout the service offering lifecycle, including: assisting in the creation of new service offerings, from concept development to implementation guidance; leading initial implementations alongside Technical Consultants. Deliver presentations in support of: offering development and go-to-market strategies; field enablement and training initiatives; service delivery execution. Assist in documentation of project requirements, statements of work, and service descriptions. Collaborate with business partners to enhance solution capabilities and integration.
Qualifications
Bachelor's degree, MBA, or equivalent experience required. 7+ years of experience architecting and deploying enterprise HPC solutions. 5+ years of experience working with at least one cloud platform (VMware, AWS, Azure, GCP) or on-premises HPC clusters.
Proven expertise in designing, developing, and delivering large-scale HPC platforms and solutions. Experience implementing high-availability, high-performance computing architectures with defined scalability and fault-tolerance objectives. Strong background in HPC system design, workload scheduling, parallel computing, and data-intensive computing frameworks. Hands-on experience with automation tools such as Terraform, Ansible, and SaltStack for HPC environments. Certification in HPC technologies or cloud-based HPC solutions preferred. Preferred Skills Experience developing infrastructure strategy and reference architectures for HPC deployments. Hands-on expertise in HPC hardware, software, and cloud-native architectures for scientific computing and AI workloads. Experience designing and implementing highly available, secure, and scalable HPC solutions. Strong scripting/programming skills (Python, Bash, C, MPI, OpenMP) for automation and performance tuning. Deep knowledge of high-speed networking (RDMA, InfiniBand), distributed storage, and parallel computing models. Experience with monitoring, observability, and performance optimization for HPC clusters. Expertise in defining and delivering HPC migration, implementation, operations, and optimization initiatives. Industry knowledge in key HPC-driven sectors such as scientific research, healthcare, financial modeling, and engineering simulations. Experience in systems integration and managing direct/channel sales engagements. Strong understanding of HPC containerization, workflow management, and cloud-bursting strategies. Demonstrated ability to present technical solutions and recommendations to customers, engineering teams, and leadership. Experience in a customer-facing HPC services environment, providing professional and managed services. TOGAF 9 Certified or equivalent architecture framework certification preferred. You will report to SSG (Solutions Services Group) organization structure. 
SSG has been focusing on the expanding IT service market, especially the digital workplace services opportunity, the growing demand for the aaS (as a Service) model, and customers' stronger preference for sustainability services. Meanwhile, SSG has continued to invest in software tools, platforms, and repeatable vertical solutions with our own IP, and to focus on vertical solutions in manufacturing, retail, healthcare, education, and Smart City. We are expanding TruScale as a Service to include Digital Workplace Solutions, developing our Hybrid Cloud solutions, and exploring Metaverse solutions. We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, religion, sexual orientation, gender identity, national origin, status as a veteran, disability, or any federal, state, or local protected class.

Posted 2 weeks ago

7.0 years

10 Lacs

Hyderābād

On-site

SMTS Systems Design Eng. Hyderabad, India Engineering 62878
Job Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. AMD together we advance_
THE TEAM
AMD's Data Center GPU organization is transforming the industry with our AI-based graphics processors. Our primary objective is to design exceptional products that drive the evolution of computing experiences, serving as the cornerstone for enterprise data centers, Artificial Intelligence (AI), HPC and embedded systems. If this resonates with you, come and join our Data Center GPU organization, where we are building amazing AI-powered products with amazing people.
THE ROLE
We are seeking an experienced HPC Systems Engineer with 7+ years of expertise in high-performance computing (HPC) environments. This role requires hands-on experience with Python, Kubernetes (K8s), Slurm, OpenStack, and Ansible, along with the ability to support external clients in live troubleshooting sessions.
THE PERSON
The ideal candidate will have deep technical knowledge of drivers, troubleshooting methods, and system-level debugging and will play a key role in managing, optimizing, and troubleshooting HPC clusters and cloud-based HPC environments.
KEY RESPONSIBILITIES
HPC System Administration & Troubleshooting: Manage and optimize HPC clusters, ensuring high availability and performance. Troubleshoot GPU, CPU, network drivers, firmware, and OS-level issues.
Debug storage, networking, and job scheduling bottlenecks in Slurm-based environments. Kubernetes & Cloud HPC Environments Deploy and manage HPC workloads in Kubernetes for AI/ML and parallel computing. Optimize OpenStack-based HPC clusters with Ceph, Cinder, and Neutron for cloud scalability. Implement containerized HPC workflows using Kubernetes and OpenShift. Automation & Infrastructure As Code (IaC) Develop Ansible and Terraform scripts for provisioning and managing HPC resources. Automate job scheduling, cluster monitoring, and log analysis using Python. Optimize CI/CD pipelines for HPC and AI/ML applications. Performance Tuning & Benchmarking Benchmark and optimize multi-node HPC workloads (MPI, NCCL, ROCm, CUDA). Tune OS parameters, networking (InfiniBand, RoCE), and Slurm configurations for peak performance. Enhance HPC storage performance (Ceph, Lustre, NFS) and distributed computing efficiency. Client Support & Collaboration Provide real-time technical support and troubleshooting for HPC users. Engage with developers, DevOps, and system administrators to optimize cluster performance. Document solutions, best practices, and contribute to internal knowledge bases. PREFERRED QUALIFICATIONS: Experience with AMD MI300, MI2X0 GPUs, ROCm, MPI, UCX, or XPMEM. Exposure to containerized workloads using Singularity or Docker in HPC. Familiarity with OpenStack deployment automation (e.g., TripleO, Kolla, or OpenStack-Ansible). Experience in customer-facing technical roles, with a strong ability to troubleshoot live issues. This role is critical in ensuring seamless HPC operations, troubleshooting complex system issues, and supporting high-profile clients with real-time problem resolution in both bare-metal and cloud-based HPC environments. 
ACADEMIC CREDENTIALS: Bachelor or Masters Degree in Computer Engineering or Electrical/Electronics Engineering #LI-PK1 AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

Posted 2 weeks ago

5.0 years

0 Lacs

India

On-site

THIS IS A LONG-TERM CONTRACT POSITION WITH ONE OF THE LARGEST GLOBAL TECHNOLOGY LEADERS. Our Client is a Fortune 350 company that engages in the design, manufacturing, marketing, and service of semiconductor processing equipment. We are seeking an experienced High Performance Computing platform consultant to provide support to India/Asia/EU region users and carry out platform enhancement and reliability improvement projects as aligned with the HPC architect.
Minimum qualifications
Bachelor’s or Master’s degree in Computer Science or equivalent with 5+ years of experience in High Performance Computing technologies. HPC environment: familiar with the use of HPC (Ansys/Fluent over MPI) and with helping users tune their jobs in an HPC environment. Linux administration. Parallel file systems (e.g. Gluster, Lustre, ZFS, NFS, CIFS). MPI (OpenMPI, MPICH2, Intel MPI), InfiniBand parallel computing. Monitoring tools (e.g. Nagios). Programming skills such as Python would be nice to have, especially using MPI. Experienced and hands-on with cloud technologies: Azure and Terraform preferred for VM creation and maintenance. Effective communication skills (the resource would independently engage and address user requests and resolve incidents for global regions, Asia and EU included). Ability to work independently with minimal supervision.
Preferred qualifications
Experience with ANSYS products.

Posted 2 weeks ago

1.0 - 4.0 years

4 - 6 Lacs

Hyderabad

Work from Office

Project Role: Technology OpS Support Practitioner
Project Role Description: Own the integrity and governance of systems, including best practices for delivering services. Develop, deploy and support infrastructures, applications and technology initiatives from an architectural and operational perspective in conjunction with existing standards and methods of delivery.
Must have skills: Storage Area Networks (SAN) Architecture and Design
Good to have skills: NetApp Storage Area Network (SAN) Administration
Minimum 5 year(s) of experience is required
Educational Qualification: 15 years full time education
Project Role: Integration Engineer
Project Role Description: Provide consultative Business and System Integration services to help clients implement effective solutions. Understand and translate customer needs into business and technology solutions. Drive discussions and consult on transformation, the customer journey, functional/application designs and ensure technology and business solutions represent business requirements.
Must have skills: File: ONTAP/Isilon (must have one File product). Block: PowerFlex, SolidFire (rare to find), VMAX, 3PAR, Brocade, Cisco (must have one Block product). Object: StorageGRID (rare to find), storage fabric.
Job Requirements: File storage engineering product experience (e.g. Dell Isilon, NetApp ONTAP, VAST, Lustre, etc.). Datacenter stack experience (storage, compute, networking). Linux/Unix and Windows operating systems, including NAS protocols CIFS/SMB and NFS. Proven experience in automation of manual tasks via code (e.g. Python) or scripts (e.g. bash, PowerShell). Experience with programming languages such as Python; also JSON, YAML, etc. REST API consumption via code or scripts. Ability to lead others and provide Subject Matter Expertise in one or more subjects. Excellent presentation skills. Work with external vendors for new and existing products. Experience of large enterprise infrastructure design. Knowledge of data storage technologies from NetApp, Dell or similar companies. Software and systems security.
Key Responsibilities: Support role L2 and L3 tasks. Closing incident tickets, interacting with customers and vendors. Facilitate migration (file products) and make sure runbooks are in place.
Educational Qualification: Minimum Bachelor's degree. Relevant vendor/technology certifications preferred.
Qualification: 15 years full time education

Posted 2 weeks ago

5.0 years

0 Lacs

India

On-site

THIS IS A LONG-TERM CONTRACT POSITION WITH ONE OF THE LARGEST GLOBAL TECHNOLOGY LEADERS. Our Client is a Fortune 350 company that engages in the design, manufacturing, marketing, and service of semiconductor processing equipment. We are seeking an experienced High Performance Computing platform consultant to provide support to India/Asia/EU region users and carry out platform enhancement and reliability improvement projects as aligned with the HPC architect.
Minimum qualifications
Bachelor’s or Master’s degree in Computer Science or equivalent with 5+ years of experience in High Performance Computing technologies. HPC environment: familiar with the use of HPC (Ansys/Fluent over MPI) and with helping users tune their jobs in an HPC environment. Linux administration. Parallel file systems (e.g. Gluster, Lustre, ZFS, NFS, CIFS). MPI (OpenMPI, MPICH2, Intel MPI), InfiniBand parallel computing. Monitoring tools (e.g. Nagios). Programming skills such as Python would be nice to have, especially using MPI. Experienced and hands-on with cloud technologies: Azure and Terraform preferred for VM creation and maintenance. Effective communication skills (the resource would independently engage and address user requests and resolve incidents for global regions, Asia and EU included). Ability to work independently with minimal supervision.
Preferred qualifications
Experience with ANSYS products.

Posted 2 weeks ago

8.0 - 12.0 years

0 Lacs

Pune, Maharashtra, India

On-site

We are seeking a Lead High-Performance Computing Engineer experienced in managing and enhancing HPC environments. The ideal candidate will bring a robust engineering background with proven experience in deploying and optimizing HPC infrastructures, and will thrive in our HPC infrastructure engineering team supporting scientific research teams.
Responsibilities
Participate in incident resolution and in software and hardware upgrades. Support and maintain HPC infrastructure. Implement Infrastructure as Code (IaC) automation. Develop and review system operational procedures. Lead troubleshooting efforts in complex systems.
Requirements
Experience range of 8 to 12 years in HPC environments. Proficiency in configuring and supporting HPC infrastructure. Proficiency in Linux, including capabilities such as kernel module compilation and using debugging tools like strace, coredump, tcpdump. Background in job schedulers including IBM LSF and Slurm. Expertise in Bright Cluster Manager including installation and configuration tasks. Knowledge of GPFS and Lustre file systems. Understanding of InfiniBand and OmniPath network interconnect technologies.
Nice to have
Familiarity with cloud-based HPC solutions. Experience in system security and data protection best practices.

Posted 2 weeks ago

5.0 years

0 Lacs

Pune, Maharashtra, India

On-site

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, Government, academia, research and manufacturing. "DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC “The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments” - Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence. Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management. Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage. Job Description Job Summary: We are looking for a Lead Quality Assurance (QA) Engineer to work on validation of high-performance storage solutions for HPC and AI markets. The ideal candidate will have experience designing, implementing, debugging and running both automated and manual software-based tests in a Linux environment, using shell tools and scripts. 
Responsibilities for this role include but are not limited to: Design and develop both automated and manual test cases to validate product features. Run automated and manual tests as needed to validate product defect fixes and functionality. Work with the Engineering manager and a geographically distributed team to understand product requirements and features. Triage test failures on a daily basis. Contribute to QA reports and provide input on release metrics. Contribute to and validate product documentation.
Qualifications
BS/MS in Computer Science, Computer Engineering or equivalent degree/experience. 5+ years of component and system-level test experience. 5+ years of experience working in Linux environments with shell script languages; experience with bash is highly desirable, Python experience is a plus. 3+ years of experience working with enterprise-class or HPC storage systems and/or distributed systems. Attention to detail and commitment to high-quality, error-free deliverables. Strong team player with good communication skills who is a self-starter. Excellent time management skills, with the ability to independently prioritize, multitask, and work under deadlines in a fast-paced environment. Knowledge of parallel file systems, in particular Lustre, is highly preferred. Experience with git, JIRA, Jenkins and gerrit preferred.
DataDirect Networks (DDN) is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran Status, or any other characteristic protected by applicable federal, state, or local law.

Posted 3 weeks ago

Apply

5.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site

Linkedin logo

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that powers all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows. You will design, implement and support operational and reliability aspects of large scale distributed systems with focus on performance at scale, real time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. 
Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures and isolate the components/failure scenarios while working with internal & external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See Bachelor’s degree in Computer Science, Electrical Engineering or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure. Proven experience in site reliability engineering for high-performance computing environments, with operational experience of clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally with familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd Interest in crafting, analyzing, and fixing large-scale distributed systems. 
Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, InfiniBand with IPoIB and RDMA. Experience with Cloud Deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience. JR1993756

Posted 3 weeks ago

Apply

5.0 years

0 Lacs

Bengaluru, Karnataka, India

On-site

Linkedin logo

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for highly scalable and massively parallel computation horsepower that NVIDIA GPUs excel. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life’s work , to amplify human creativity and intelligence. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world! As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that powers all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. 
We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows. You will design, implement and support operational and reliability aspects of large scale distributed systems with focus on performance at scale, real time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root cause of system failures and isolate the components/failure scenarios while working with internal & external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems and Be part of an on-call rotation to support production systems Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across software and hardware stack according to plan, while keeping a thorough procedural record and data log and Manage upgrades and automated rollbacks across all clusters. What We Need To See Bachelor’s degree in computer science, Electrical Engineering or related field or equivalent experience with a minimum 5+ years of experience designing and operating large scale compute infrastructure. 
Proven experience in site reliability engineering for high-performance computing environments, with operational experience of clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally with familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd Interest in crafting, analyzing, and fixing large-scale distributed systems. Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, InfiniBand with IPoIB and RDMA. Experience with Cloud Deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience. JR1993564

Posted 3 weeks ago

Apply

5.0 years

0 Lacs

Hyderabad, Telangana, India

On-site

Linkedin logo

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that powers all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow. What You'll Be Doing In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows. You will design, implement and support operational and reliability aspects of large scale distributed systems with focus on performance at scale, real time monitoring, logging, and alerting. Design and implement state-of-the-art GPU compute clusters. 
Optimize cluster operations for maximum reliability, efficiency, and performance. Drive foundational improvements and automation to enhance researcher productivity. Troubleshoot, diagnose, and root-cause system failures and isolate the components/failure scenarios while working with internal & external partners. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems. Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world. Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters. What We Need To See Bachelor’s degree in Computer Science, Electrical Engineering or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure. Proven experience in site reliability engineering for high-performance computing environments, with operational experience of clusters of at least 2K GPUs. Deep understanding of GPU computing and AI infrastructure. Passion for solving complex technical challenges and optimizing system performance. Experience with advanced AI/HPC job schedulers, ideally with familiarity with schedulers such as Slurm. Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc. In-depth understanding of container technologies like Docker, Enroot, etc. Experience programming in Python and Bash scripting. Ways To Stand Out From The Crowd Interest in crafting, analyzing, and fixing large-scale distributed systems. 
Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, InfiniBand with IPoIB and RDMA. Experience with Cloud Deployment, BCM, Terraform. Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads. Multi-cloud experience. JR1993756

Posted 3 weeks ago

Apply

5.0 years

0 Lacs

India

On-site

Linkedin logo

THIS IS A LONG TERM CONTRACT POSITION WITH ONE OF THE LARGEST GLOBAL TECHNOLOGY LEADERS. Our Client is a Fortune 350 company that engages in the design, manufacturing, marketing, and service of semiconductor processing equipment. We are seeking an experienced High Performance Computing platform consultant to provide support to India/Asia/EU region users and carry out platform enhancement and reliability improvement projects as aligned with the HPC architect. Minimum qualifications: Bachelor’s or Master’s degree in Computer Science or equivalent with 5+ years of experience in High Performance Computing technologies. HPC environment: familiar with use of HPC (Ansys/Fluent over MPI) and helping users tune their jobs in an HPC environment. Linux administration. Parallel file systems (e.g. Gluster, Lustre, ZFS, NFS, CIFS). MPI (OpenMPI, MPICH2, Intel MPI), InfiniBand parallel computing. Monitoring tools (e.g. Nagios). Programming skills such as Python would be nice to have, especially using MPI. Experienced and hands-on with cloud technologies: Azure and Terraform preferred for VM creation and maintenance. Effective communication skills (the resource would independently engage and address user requests and resolve incidents for global regions, Asia and EU included). Ability to work independently with minimal supervision. Preferred qualifications: Experience with ANSYS products.

Posted 3 weeks ago

Apply

8.0 years

0 Lacs

Pune, Maharashtra, India

On-site

Linkedin logo

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, Government, academia, research and manufacturing. "DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC “The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments” - Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence. Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management. Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage. Pre-Sales Solutions Architect - India Job Description: We are currently seeking candidate for Pre-Sales Solutions Architect – India, to join our dynamic team of passionate customer-enabling technologists ! 
The ideal candidate will have a deep understanding of AI & HPC infrastructure solutions and a proven track record of driving successful pre-sales engagements, with great communication and presentation skills. The candidate is expected to: design solutions and proposals to meet customer-defined specifications; help customers define solution specifications matching DDN products; provide Proof of Concept (POC) and benchmarking support; compare and contrast with competitive products to highlight DDN’s superior features & functionality; and work with DDN engineering, professional services and sales teams to drive win ratio. Duties and Responsibilities: The duties and responsibilities for this role include but are not limited to: Pre-sales activity supporting HPC and AI customers. Assist in closing new business opportunities by gaining a thorough technical and business understanding of clients' needs and helping sales identify, qualify, and close new opportunities. Understand the sales process and how to utilize company resources to close accounts. Participate in customer-focused seminars, tradeshows, events, and training. Provide RFP responses, technical drawings, presentations, and recommendations. Acquire and maintain a thorough technical and procedural understanding of DDN sales cycles and products/services, and a thorough technical understanding of similar industries. Create Bills of Materials of proposed solutions for DDN products and professional services. Ability to work with DDN Subject Matter Experts from different geographies and time zones. Ability to manage customer relationships post-sale, including strategy to close repeat business. Qualifications: BSc or higher degree or equivalent. PERSONAL SKILLS: Ability to simplify and explain complex tasks, architectures and environments. Good written and oral communication skills. 
Must be able to develop and deliver presentations, and to connect and build a rapport with customers via phone, face-to-face meetings, and written correspondence. Ability to work independently, respond in a timely manner and remain composed in hectic environments. Ability to listen, understand and articulate the customer's needs, along with possible solutions, to the sales team and sales management. IT INCREASES YOUR CHANCES OF GETTING HIRED IF YOU HAVE: 8+ years of “relevant” pre-sales experience. Good understanding of the AI ecosystem of NVIDIA/AMD/Intel GPUs, including but not limited to hardware components, software libraries, containerization technologies like Docker and Kubernetes, middleware and application stack. Good understanding of storage technologies including SAN, NAS, DAS, parallel file systems, object storage, software-defined storage, etc. Good knowledge of storage protocols like block I/O, NFS, SMB and S3 and what’s required to build solutions around them. Good understanding of Ethernet and InfiniBand networking technologies including network topology, blocking ratios, throughput and IOPS capabilities, etc. Experience of working with Lustre/GPFS/Weka/BeeGFS in the capacity of building or deploying the solution is a plus. Experience of architecting or deploying solutions in public or private clouds and multitenant environments. Understanding of appropriate content to be developed to address different sets of customers (as per their seniority level in the organization).

Posted 3 weeks ago

Apply

7.0 - 3.0 years

0 Lacs

Bengaluru, Karnataka

On-site

Indeed logo

About us At ExxonMobil, our vision is to lead in energy innovations that advance modern living and a net-zero future. As one of the world’s largest publicly traded energy and chemical companies, we are powered by a unique and diverse workforce fueled by the pride in what we do and what we stand for. The success of our Upstream, Product Solutions and Low Carbon Solutions businesses is the result of the talent, curiosity and drive of our people. They bring solutions every day to optimize our strategy in energy, chemicals, lubricants and lower-emissions technologies. We invite you to bring your ideas to ExxonMobil to help create sustainable solutions that improve quality of life and meet society’s evolving needs. Learn more about our What and our Why and how we can work together . ExxonMobil’s affiliates in India ExxonMobil’s affiliates have offices in India in Bengaluru, Mumbai and the National Capital Region. ExxonMobil’s affiliates in India supporting the Product Solutions business engage in the marketing, sales and distribution of performance as well as specialty products across chemicals and lubricants businesses. The India planning teams are also embedded with global business units for business planning and analytics. ExxonMobil’s LNG affiliate in India supporting the upstream business provides consultant services for other ExxonMobil upstream affiliates and conducts LNG market-development activities. The Global Business Center - Technology Center provides a range of technical and business support services for ExxonMobil’s operations around the globe. ExxonMobil strives to make a positive contribution to the communities where we operate and its affiliates support a range of education, health and community-building programs in India. Read more about our Corporate Responsibility Framework. To know more about ExxonMobil in India, visit ExxonMobil India and the Energy Factor India. 
What role you will play in our team The HPC Systems Engineer role has the overall responsibility to work within a team to provide a performant, reliable, and secure high-performance computing (HPC) environment. The HPC Systems Engineer will be involved in various aspects of designing and engineering our HPC system as well as be responsible for managing day-to-day operations and maintenance activities including, but not limited to the following: general troubleshooting of any issues that may arise, monitoring overall system health, performing system maintenance tasks, and evaluating new hardware/system software. Job location is based out of Bengaluru, Karnataka What you will do Establish strategies for overall support of the system! Evaluate new hardware and software and understand potential benefits/impacts it can have in the environment. Perform hardware maintenance. Perform software installations and upgrades, inclusive of operating system. Monitor overall system performance and health. Provide support for the management of data in the environment. Work with users to resolve problems and ensure they are able to effectively utilize the system. Interact with both business customers and technical teams that are globally distributed and within varied time zones Engaging with vendors for problem resolution of existing infrastructure and discussion of roadmaps and new technologies for evaluations Foster a supportive work environment and maintains open, productive interactions among team and across organizations Build and maintain cross-organizational contacts to facilitate execution of work. About You Skills and Qualifications Bachelor of Engineering degree and score 70% and above (equivalent CGPA) Excellent technical, analytical, and communication skills A minimum of 3 years of hands-on Linux experience (e.g. RHEL, CentOS) and production infrastructure support (e.g. 
networking, storage, monitoring, compute, installation, configuration, maintenance, upgrade, retirement). Experience in system administration and technical support (e.g. installation, configuration, maintenance, upgrade, retirement, problem resolution). Experience in HPC technologies such as parallel/distributed file systems (e.g. Lustre, GPFS), high-speed interconnect fabrics (e.g. InfiniBand, Omni-Path), and HPC batch scheduling software suites (e.g. PBSPro, SLURM). Proficiency in technical writing and documentation of solutions. Solid understanding of data center operations fundamentals in networking, cooling, and power. Works well in a team environment. Self-motivated. Minimum 7 years of experience working in High Performance Computing systems. Preferred Qualifications/Experience Strong IT skills in infrastructure and applications. Experience with supporting large-scale production environments. Experience in implementing changes and security controls in a global framework. Understanding of data center operations fundamentals in networking, cooling, and power. Knowledge and experience with installing/compiling vendor and open-source software. Knowledge and experience with application/infrastructure deployment and support in one or more of the major cloud environments. Comfortable relocating to Bengaluru and working the 1:30 to 10:30 PM IST shift. Your benefits An ExxonMobil career is one designed to last. Our commitment to you runs deep: our employees grow personally and professionally, with benefits built on our core categories of health, security, finance and life. 
We offer you: Competitive compensation Medical plans, maternity leave and benefits, life, accidental death and dismemberment benefits Retirement benefits Global networking & cross-functional opportunities Annual vacations & holidays Day care assistance program Training and development program Tuition assistance program Workplace flexibility policy Relocation program Transportation facility Please note benefits may change from time to time without notice, subject to applicable laws. The benefits programs are based on the Company’s eligibility guidelines. Stay connected with us Learn more about ExxonMobil in India, visit ExxonMobil India and Energy Factor India . Follow us on LinkedIn and Instagram Like us on Facebook Subscribe our channel at YouTube EEO Statement ExxonMobil is an Equal Opportunity Employer: All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, national origin or disability status. Business solicitation and recruiting scams ExxonMobil does not use recruiting or placement agencies that charge candidates an advance fee of any kind (e.g., placement fees, immigration processing fees, etc.). Follow the LINK to understand more about recruitment scams in the name of ExxonMobil. Nothing herein is intended to override the corporate separateness of local entities. Working relationships discussed herein do not necessarily represent a reporting connection, but may reflect a functional guidance, stewardship, or service relationship. Exxon Mobil Corporation has numerous affiliates, many with names that include ExxonMobil, Exxon, Esso and Mobil. For convenience and simplicity, those terms and terms like corporation, company, our, we and its are sometimes used as abbreviated references to specific affiliates or affiliate groups. Abbreviated references describing global or regional operational organizations and global or regional business lines are also sometimes used for convenience and simplicity. 
Similarly, ExxonMobil has business relationships with thousands of customers, suppliers, governments, and others. For convenience and simplicity, words like venture, joint venture, partnership, co-venturer, and partner are used to indicate business relationships involving common activities and interests, and those words may not indicate precise legal relationships.

Posted 3 weeks ago

Apply

5 - 10 years

7 - 12 Lacs

Gurgaon

Work from Office

Naukri logo

The High-Performance Computing Infrastructure Engineer is primarily responsible for the overall health and maintenance of storage technologies in our managed services customers' environments. Our HPC Infrastructure Engineers are valued members of the Managed Services Infrastructure Practice, responsible for Tier 3 incident management, service request management and change management infrastructure support for all Managed Services customers. Roles & Responsibilities Provide enterprise-level operational support to Managed Services customers for incident, problem, and change management activities. Plan and perform maintenance activities. Assess customer environments for performance and design issues and propose resolutions. Work across technical teams to troubleshoot complex infrastructure issues. Create and maintain detailed documentation. Serve as a subject matter expert and escalation point for storage technologies. Work with vendors to resolve storage issues. Communicate with customers and the internal team with transparency. Participate in on-call rotation. Complete training and certification as assigned to further skills and knowledge. Skills Required Bachelor's degree or equivalent in Information Systems or a related field. Unique education, specialized experience, skills, knowledge, training, or certification may be substituted for education. 5+ years of expert-level experience managing infrastructure in high-performance computing environments, including configuration, troubleshooting, and best practice. 1+ years of experience with Nvidia DGX preferred. Experience with high-performance computing (HPC) schedulers (e.g., SLURM, PBS, Torque) required. Experience configuring, maintaining and troubleshooting Kubernetes. Experience with storage technology (e.g., Ceph, Vast Data Platform) and distributed file systems (e.g., Lustre, GPFS, NFS, GlusterFS). Experience with machine learning or data science workflows in HPC/AI environments. Advanced experience with Linux operating systems. 
Experience configuring, maintaining and troubleshooting Nvidia/Mellanox (Cumulus OS) switches a plus. Experience with both Ethernet and InfiniBand networking a plus. 1+ years working with monitoring platforms (e.g., Prometheus, Grafana); Elastic Observability experience is a bonus. 1+ years working with an enterprise ITSM system; ServiceNow is a bonus. Previous experience with automation tools such as Ansible, Puppet, or Chef a plus. Managed Services or consulting experience is required. Strong background in customer service. High-level problem-solving and communication skills. Strong oral and written communication skills. Related network certifications are a bonus.

Posted 2 months ago

3 - 5 years

1 - 5 Lacs

Chennai, Bengaluru, Hyderabad

Work from Office

Naukri logo

Design, deploy and configure HPC clusters, including compute, storage and networking components.
Handle installation requests on HPC, application upgrades, and troubleshooting processes in coordination with users, software vendors and OEMs.
Administer job schedulers (e.g., Slurm), manage user access, monitor health and troubleshoot system issues both on-prem and in the cloud.
Optimize HPC workloads, tune resource utilization and benchmark system performance.
Install and maintain HPC hardware, software stacks, compilers, libraries (e.g., MPI, OpenMP) and custom tools.
Configure VMs, storage and servers in the cloud.
Assist users in optimizing and running applications on the cluster and cloud, including guidance.
Ensure system stability through regular updates, proactive monitoring and software/hardware troubleshooting.

Responsibilities
Supervise day-to-day support operations for the HPC and Cloud team by supporting ticket SLA adherence.
Manage support ticket systems, primarily using internal IT tools.
Ensure timely resolution of user issues related to CAE applications in HPC and Cloud.
Plan, schedule, and oversee application upgrades and installations.
Collaborate with internal teams and external vendors to ensure seamless issue resolution.
Generate detailed monthly performance reports, analyzing key trends and areas for improvement.

Technical Skills
Operating Systems: Expertise in Linux (RHEL, CentOS, Ubuntu)
HPC Tools and Frameworks:
1. Job Schedulers: Slurm, PBS & Sync-HPC
2. Parallel Programming: MPI, OpenMP, CUDA
3. Scripting: Python, Bash and optionally C/C++
Cloud: Knowledge of AWS, GCP & Azure with HPC toolkits, VM & object storage creation
Networking: Knowledge of high-speed networks (InfiniBand, RDMA, Ethernet)
Storage Systems: Experience with parallel file systems (Lustre, NFS)
Hardware: Familiarity with HPC-specific hardware (RAM, CPU & GPU)

Certifications
Any Cloud Solution Architect certificate (GCP preferred)
RHEL Certified System Administrator (preferred)

Posted 3 months ago

5 years

0 Lacs

Hyderabad, Telangana, India

Hybrid

Linkedin logo

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life’s work: to amplify human creativity and intelligence. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and to drive foundational improvements and automation that improve researcher productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What You'll Be Doing
In this role you will be building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement and support operational and reliability aspects of large-scale distributed systems with a focus on performance at scale, real-time monitoring, logging, and alerting.
Design and implement state-of-the-art GPU compute clusters.
Optimize cluster operations for maximum reliability, efficiency, and performance.
Drive foundational improvements and automation to enhance researcher productivity.
Troubleshoot, diagnose, and root-cause system failures and isolate the components and failure scenarios while working with internal and external partners.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems.
Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on some of the largest and most complex systems in the world.
Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log.
Manage upgrades and automated rollbacks across all clusters.

What We Need To See
Bachelor’s degree in Computer Science, Electrical Engineering or a related field, or equivalent experience, with a minimum of 5 years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments, with operational experience of a cluster of at least 2K GPUs.
Deep understanding of GPU computing and AI infrastructure.
Passion for solving complex technical challenges and optimizing system performance.
Experience with advanced AI/HPC job schedulers, and ideally familiarity with schedulers such as Slurm.
Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications such as Kubernetes, Terraform, MySQL, etc.
In-depth understanding of container technologies like Docker, Enroot, etc.
Experience programming in Python and Bash scripting.

Ways To Stand Out From The Crowd
Interest in crafting, analyzing, and fixing large-scale distributed systems.
Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA.
Experience with cloud deployment, BCM, Terraform.
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
Multi-cloud experience.

JR1993564

Posted 5 months ago

0 years

0 Lacs

Bengaluru, Karnataka, India

On-site

Linkedin logo

Job Description Summary
Delivers High Performance Computing (HPC) application and infrastructure solutions that support complex, compute-intensive geophysics, reservoir simulation, machine learning and AI workloads, parallel filesystems, and low-latency networks.

Job Description

About Chevron IT
The Information Technology function empowers all of Chevron to harness the benefits of information and digital technologies to drive competitive advantage. Our thirteen IT digital platforms maximize value creation by empowering teams to accelerate innovation, capitalize on scale, drive efficiency and reduce duplication, and speed access to quality data to unlock opportunities across business units and functions. Organized around business capabilities and digital products, the digital platforms span Chevron’s global integrated oil and gas value chain. Our Business Units partner with the enterprise digital platforms to drive effective engagement between IT and all of Chevron with a clear focus on efficiently delivering better, faster results. Our IT Engineering personnel chapters offer a flexible staffing and assignment model that allows us to bring the right skills when and where they are needed and to ensure we are focusing on efforts that drive the highest business priorities. The Global Capability Center (GCC) IT department contains platform teams who are members of the four petrotechnical digital platforms, namely Surface, Subsurface, Wells (Drilling), and Health, Safety, Environment (HSE), and of the IT Foundation Digital Platform (ITFP). These teams operate in a matrix environment with priorities determined by global Product Owners and Product Managers.

About This Position
The Global Capability Center (GCC) Cloud Engineer - HPC is a key technical engineer accountable for delivery and support of products and services to the Chevron workforce. The product team is part of the HPC Product line within the IT Foundation Digital Platform. This Product line is responsible for digital delivery and support of HPC products and services to the Chevron Enterprise.

Key Responsibilities
As a Cloud Engineer - HPC you will be responsible for providing application and infrastructure solutions that support complex, compute-intensive geophysics, reservoir simulation, machine learning and AI workloads, parallel filesystems, and low-latency networks.

Required Qualifications
The preferred candidate will have knowledge of and at least 5 years' experience in Linux, cloud, and storage system administration in a large-scale enterprise environment, plus experience in one or more of the following areas: HPC job scheduling systems (e.g., Slurm or PBS), parallel file systems (e.g., Lustre), Azure VM Scale Sets, underlying infrastructure supporting Oil and Gas applications, and configuration management technologies (e.g., Satellite, Ansible, Python, and Azure).
Bachelor's degree in Computer Science, Information Systems, or a comparable field.

Chevron ENGINE supports global operations, supporting business requirements across the world. Accordingly, the work hours for employees will be aligned to support business requirements. The standard work week will be Monday to Friday. Working hours are 8:00am to 5:00pm or 1:30pm to 10:30pm. Chevron participates in E-Verify in certain locations as required by law.

Posted 4 weeks ago

0 years

0 Lacs

Pune, Maharashtra, India

On-site

Linkedin logo

We are seeking a Senior High-Performance Computing Engineer with a strong engineering background and hands-on experience in deployment and optimization of HPC infrastructure. This role involves daily operations and engineering activities of the HPC environments to support a scientific research team’s HPC cluster utilization.

Responsibilities
Support HPC infrastructure
Implement IaC infrastructure automation
Participate in incident resolution, software and hardware upgrades

Requirements
5 to 8 years of HPC engineering experience
Proficiency in configuring and supporting HPC infrastructure
Proficiency in Linux, including kernel module compilation and debugging tools (strace, coredump, tcpdump)
Competency in job schedulers (IBM LSF, Slurm)
Expertise in Bright Cluster Manager, including installation and configuration
Knowledge of GPFS/Lustre filesystems
Background in InfiniBand/OmniPath network interconnect

Nice to have
Familiarity with efficiency optimization methods for HPC

Posted 4 weeks ago

5 - 10 years

4 - 6 Lacs

Hyderabad

Work from Office

Naukri logo

Project Role: Technology Ops Support Practitioner
Project Role Description: Own the integrity and governance of systems, including best practices for delivering services. Develop, deploy and support infrastructures, applications and technology initiatives from an architectural and operational perspective in conjunction with existing standards and methods of delivery.
Must have skills: Storage Area Networks (SAN) Architecture and Design
Good to have skills: NetApp Storage Area Network (SAN) Administration
Minimum 5 year(s) of experience is required
Educational Qualification: 15 years full time education

Project Role: Integration Engineer
Project Role Description: Provide consultative Business and System Integration services to help clients implement effective solutions. Understand and translate customer needs into business and technology solutions. Drive discussions and consult on transformation, the customer journey, functional/application designs and ensure technology and business solutions represent business requirements.

Must have skills:
File: ONTAP/Isilon (at least one File skill is a must)
Block: PowerFlex, SolidFire (rare to find), VMAX, 3PAR, Brocade, Cisco (at least one Block skill is a must)
Object: StorageGRID (rare to find), storage fabric

Job Requirements:
File Storage Engineering product experience (e.g., Dell Isilon, NetApp ONTAP, VAST, Lustre, etc.)
Datacenter stack experience (Storage, Compute, Networking)
Linux/Unix and Windows operating systems, including NAS protocols CIFS/SMB and NFS
Proven experience in automation of manual tasks via code (e.g., Python) or scripts (e.g., Bash, PowerShell)
Experience with programming languages such as Python; also JSON, YAML, etc.
REST API consumption via code or scripts
Ability to lead others and provide Subject Matter Expertise in one or more subjects
Excellent presentation skills
Work with external vendors for new and existing products
Experience of large enterprise infrastructure design
Knowledge of data storage technologies from NetApp, Dell or similar companies
Software and systems security

Key Responsibilities:
Support role covering L2 and L3 tasks
Closing incident tickets and interacting with customers and vendors
Facilitate migration (File products) and make sure runbooks are in place

Educational Qualification: Minimum Bachelor's degree; relevant Vendor/Technology certifications preferred
Qualification: 15 years full time education

Posted 1 month ago

7 - 12 years

20 - 35 Lacs

Bengaluru

Work from Office

Naukri logo

Design and support NetApp-based HPC storage solutions across cloud and on-prem environments. Optimize performance and automate Linux-based storage systems for ML and compute-heavy workflows.

Required Candidate Profile
Engineer with deep experience in NetApp ONTAP, Azure NetApp Files, Linux, storage performance tuning, and automation in HPC environments.

Posted 1 month ago

5 - 10 years

15 - 30 Lacs

Bengaluru

Work from Office

Naukri logo

Design and manage HPC infrastructure for geophysics, simulation, and ML/AI using Azure and Linux. Optimize compute environments and support job schedulers, file systems, and parallel processing workflows.

Required Candidate Profile
Experienced HPC engineer with 5–10 years in Linux, Azure, job schedulers, and supporting scientific workloads in a large-scale enterprise environment.

Posted 1 month ago


Featured Companies