Posted: 19 hours ago
On-site
Full Time
Company: Indian / Global Engineering & Manufacturing Organization

Key Skills: Machine Learning (ML), Artificial Intelligence (AI), TensorFlow, Python, PyTorch.

Roles and Responsibilities:
Design, build, and rigorously optimize the complete stack for large-scale model training, fine-tuning, and inference (including data loading, distributed training, and model deployment) to maximize Model FLOPs Utilization (MFU) on compute clusters.
Collaborate closely with research scientists to translate state-of-the-art models and algorithms into production-grade, high-performance code and scalable infrastructure.
Implement, integrate, and test advancements from recent research publications and open-source contributions in enterprise-grade systems.
Profile training workflows to identify and resolve bottlenecks across all layers of the stack, from input pipelines to inference, improving speed and resource efficiency.
Contribute to the evaluation and selection of the hardware, software, and cloud platforms that will define the future AI infrastructure stack.
Use MLOps tools (e.g., MLflow, Weights & Biases) to establish best practices across the entire AI model lifecycle, including development, validation, deployment, and monitoring.
Maintain extensive documentation of infrastructure architecture, pipelines, and training processes to ensure reproducibility and smooth knowledge transfer.
Continuously research and implement improvements in large-scale training strategies and data engineering workflows.
Demonstrate initiative and ownership in developing rapid prototypes and production-scale systems for AI applications in the energy sector.

Experience Requirement:
5-9 years of experience building and optimizing large-scale machine learning infrastructure, including distributed training and data pipelines.
Proven hands-on expertise with deep learning frameworks such as PyTorch, JAX, or PyTorch Lightning in multi-node GPU environments.
Experience scaling models trained on large datasets across distributed computing systems.
Familiarity with writing and optimizing CUDA, Triton, or CUTLASS kernels for performance is preferred.
Hands-on experience with AI/ML lifecycle management using MLOps frameworks and performance profiling tools.
Demonstrated collaboration with AI researchers and data scientists to integrate models into production environments.
A track record of open-source contributions in AI infrastructure or data engineering is a significant plus.

Education: M.E., B.Tech/M.Tech (Dual), BCA, B.E., B.Tech, M.Tech, MCA.
MyCareernet
Gurugram, Haryana, India
Salary: Not disclosed