Principal Systems Performance Engineer

12 - 17 years

16 - 20 Lacs

Posted: 3 months ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Principal / Senior Systems Performance Engineer
Micron Data Center and Client Workload Engineering in Hyderabad, India, is seeking a senior/principal engineer to join our dynamic team.
The successful candidate will primarily contribute to ML development, ML DevOps, and the HBM program in the data center. This involves analyzing how AI/ML workloads perform on the latest MU-HBM, Micron main memory, expansion memory, and near memory (HBM/LP) solutions; conducting competitive analysis; showcasing the benefits that workloads see from MU-HBM's capacity, bandwidth, and thermals; contributing to marketing collateral; and extracting AI/ML workload traces to help optimize future HBM designs.
Job Responsibilities:
Responsibilities include, but are not limited to, the following:
  • Design, implement, and maintain scalable & reliable ML infrastructure and pipelines.
  • Collaborate with data scientists and ML engineers to deploy machine learning models into production environments.
  • Automate and optimize ML workflows, including data preprocessing, model training, evaluation, and deployment.
  • Monitor and manage the performance, reliability, and scalability of ML systems.
  • Troubleshoot and resolve issues related to ML infrastructure and deployments.
  • Implement and manage distributed training and inference solutions to enhance model performance and scalability.
  • Use DeepSpeed, TensorRT, and vLLM to optimize and accelerate AI inference and training.
  • Understand the key considerations for ML models, such as transformer architectures, precision, quantization, distillation, attention span and KV cache, and MoE.
  • Build workload memory access traces from AI models.
  • Study system balance ratios for DRAM to HBM in terms of capacity and bandwidth to understand and model TCO.
  • Study data movement between the CPU, the GPU, and their associated memory subsystems (DDR, HBM) in heterogeneous system architectures, over interconnects such as PCIe, NVLink, and Infinity Fabric, to identify data-movement bottlenecks for different workloads.
  • Develop an automated testing framework through scripting.
  • Engage with customers and present at conferences to showcase findings, and develop whitepapers.
Requirements:
  • Strong programming skills in Python and familiarity with ML frameworks such as TensorFlow, PyTorch, or scikit-learn.
  • Experience in data preparation: cleaning, splitting, and transforming data for training, validation, and testing.
  • Proficiency in model training and development: creating and training machine learning models.
  • Expertise in model evaluation: testing models to assess their performance.
  • Skills in model deployment: launching servers, live inference, and batched inference.
  • Experience with AI inference and distributed training techniques.
  • Strong foundation in GPU and CPU processor architecture.
  • Familiarity with server system memory (DRAM).
  • Strong experience with benchmarking and performance analysis.
  • Strong software development skills using leading scripting and programming languages and technologies (Python, CUDA, C, C++).
  • Familiarity with PCIe and NVLink connectivity.
Preferred Qualifications:
  • Experience in quickly building AI workflows: building pipelines and model workflows to design, deploy, and manage consistent model delivery.
  • Ability to easily deploy models anywhere: using managed endpoints to deploy models and workflows across accessible CPU and GPU machines.
  • Understanding of MLOps: the overarching concept covering the core tools, processes, and best practices for end-to-end machine learning system development and operations in production.
  • Knowledge of GenAIOps: extending MLOps to develop and operationalize generative AI solutions, including the management of and interaction with a foundation model.
  • Familiarity with LLMOps: focused specifically on developing and productionizing LLM-based solutions.
  • Experience with RAGOps: focusing on the delivery and operation of RAGs, considered the ultimate reference architecture for generative AI and LLMs.
  • Data management: collect, ingest, store, process, and label data for training and evaluation. Configure role-based access control; dataset search, browsing, and exploration; data provenance tracking, data logging, dataset versioning, metadata indexing, data quality validation, dataset cards, and dashboards for data visualization.
  • Workflow and pipeline management: work with cloud resources or a local workstation; connect data preparation, model training, model evaluation, model optimization, and model deployment steps into an end-to-end automated and scalable workflow combining data and compute.
  • Model management: train, evaluate, and optimize models for production; store and version models along with their model cards in a centralized model registry; assess model risks, and ensure compliance with standards.
  • Experiment management and observability: track and compare different machine learning model experiments, including changes in training data, models, and hyperparameters. Automatically search the space of possible model architectures and hyperparameters for a given model architecture; analyze model performance during inference, monitor model inputs and outputs for concept drift.
  • Synthetic data management: extend data management with a new native generative AI capability. Generate synthetic training data through domain randomization to increase transfer learning capabilities. Declaratively define and generate edge cases to evaluate, validate, and certify model accuracy and robustness.
  • Embedding management: represent data samples of any modality as dense multi-dimensional embedding vectors; generate, store, and version embeddings in a vector database. Visualize embeddings for improvised exploration. Find relevant contextual information through vector similarity search for RAGs.
Education:
  • Bachelor's degree or higher (with 12+ years of experience) in Computer Science or a related field.


AI alert: Candidates are encouraged to use AI tools to enhance their resume and/or application materials. However, all information provided must be accurate and reflect the candidate's true skills and experiences. Misuse of AI to fabricate or misrepresent qualifications will result in immediate disqualification.
Fraud alert: Micron advises job seekers to be cautious of unsolicited job offers and to verify the authenticity of any communication claiming to be from Micron by checking the official Micron careers website.
