Key Responsibilities:
LLM Deployment & Optimization
- Deploy, fine-tune, and optimize open-source LLMs (e.g., LLaMA, Mistral, CodeS, DeepSeek).
- Implement quantization (e.g., 4-bit, 8-bit) and pruning for efficient inference on commodity hardware.
- Build and manage inference APIs (REST/gRPC) for production use (a minimal sketch of a quantized-model serving endpoint follows this list).
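For illustration, the kind of work these bullets describe might look like the following minimal sketch, which loads an open-source model with 4-bit quantization and exposes it over a REST endpoint. It assumes the transformers, bitsandbytes, fastapi, and pydantic packages plus a GPU; the model ID and endpoint path are illustrative choices, not a prescribed stack.

```python
# Minimal sketch: serve a 4-bit-quantized open-source LLM over REST.
# Model ID, endpoint path, and defaults are illustrative assumptions.
# Run with: uvicorn serve_llm:app --host 0.0.0.0 --port 8000
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # hypothetical choice

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # 4-bit weights for commodity GPUs
    bnb_4bit_compute_dtype=torch.float16,     # compute in fp16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant_config, device_map="auto"
)

app = FastAPI()

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: Prompt):
    inputs = tokenizer(req.text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"completion": tokenizer.decode(output[0], skip_special_tokens=True)}
```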
Infrastructure Management
- Set up and manage on-premise GPU servers and VM-based deployments.
- Build scalable cloud-based LLM infrastructure using AWS (SageMaker, EC2), Azure ML, or GCP Vertex AI.
- Ensure cost efficiency by choosing appropriate hardware and job scheduling strategies.
MLOps & Reliability Engineering
- Develop CI/CD pipelines for model training, testing, evaluation, and deployment.
- Integrate version control for models, data, and hyperparameters.
- Set up logging, tracing, and monitoring tools (e.g., MLflow, Prometheus, Grafana) for model performance and failure detection (a minimal tracking sketch follows this list).
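As a concrete example of the experiment-tracking side of this work, the sketch below logs a fine-tuning run's parameters, metrics, and an artifact with MLflow. The experiment name, run name, hyperparameters, and loss values are all placeholder assumptions.

```python
# Minimal sketch: track a fine-tuning run's params, metrics, and an
# artifact with MLflow. Names and values are illustrative placeholders.
import mlflow

mlflow.set_experiment("llm-finetune")  # hypothetical experiment name

with mlflow.start_run(run_name="mistral-7b-qlora"):  # hypothetical run name
    mlflow.log_params({"base_model": "mistral-7b", "lr": 2e-4, "epochs": 3})
    for step, loss in enumerate([1.92, 1.41, 1.18]):  # placeholder losses
        mlflow.log_metric("train_loss", loss, step=step)
    mlflow.log_artifact("adapter_config.json")  # assumes this file exists
```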
Security, Compliance & Performance
- Ensure data privacy (FERPA/GDPR) and enforce security best practices across deployments.
- Apply secure coding standards and implement RBAC, encryption, and network hardening across cloud and on-prem environments.
Cross-functional Integration
- Work closely with AI solution engineers, backend developers, and product owners to integrate LLM services into the platform.
- Support performance benchmarking and A/B testing of AI features across modules.
Documentation & Internal Enablement
- Document LLM pipelines, configuration steps, and infrastructure setup in internal playbooks.
- Create guides and reusable templates for future deployments and models.
Key Requirements:
Education:
- Bachelor's or Master's degree in Computer Science, AI/ML, Data Engineering, or a related field.
Technical Skills:
- Strong Python experience with ML libraries (e.g., PyTorch, Hugging Face Transformers).
- Familiarity with LangChain, LlamaIndex, or other RAG frameworks.
- Experience with Docker, Kubernetes, and API gateways (e.g., Kong, NGINX).
- Working knowledge of vector databases (e.g., FAISS, Pinecone, Qdrant); a minimal retrieval sketch follows this list.
- Familiarity with GPU deployment tools (e.g., CUDA, Triton Inference Server, Hugging Face Accelerate).
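To make the vector-database skill concrete, here is a minimal sketch of the retrieval pattern behind RAG: embed a corpus, index it with FAISS, and look up the nearest document for a query. It assumes the faiss and sentence-transformers packages; the embedding model and corpus are illustrative.

```python
# Minimal sketch: exact nearest-neighbour retrieval over document
# embeddings with FAISS. Embedding model and corpus are assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["Reset your password from the account page.",
        "GPU quotas are managed per project."]        # placeholder corpus
encoder = SentenceTransformer("all-MiniLM-L6-v2")     # hypothetical choice

embeddings = encoder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])        # inner product = cosine
index.add(np.asarray(embeddings, dtype="float32"))

query = encoder.encode(["how do I change my password"],
                       normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 1)
print(docs[ids[0][0]], scores[0][0])                  # best match and score
```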
Experience:
- 4+ years in an AI/MLOps role, including experience in LLM fine-tuning and deployment.
- Hands-on work with model inference in production environments (both cloud and on-prem).
- Exposure to SaaS and modular product environments is a plus.