Posted: 9 hours ago
On-site | Full-time
We are building a distributed LLM inference network that combines idle GPU capacity from around the world into a single cohesive plane of compute for running large language models such as DeepSeek and Llama 4. At any given moment, we have over 5,000 GPUs and hundreds of terabytes of VRAM connected to the network.
We are a small, well-funded team working on difficult, high-impact problems at the intersection of AI and distributed systems. We primarily work in-person from our office in downtown San Francisco.
**Responsibilities**
- Design and implement optimization techniques to increase model throughput and reduce latency across our suite of models
- Deploy and maintain large language models at scale in production environments
- Deploy new models as they are released by frontier labs
- Implement techniques like quantization, speculative decoding, and KV cache reuse
- Contribute regularly to open source projects such as SGLang and vLLM
- Deep dive into underlying codebases of TensorRT, PyTorch, TensorRT-LLM, vLLM, SGLang, CUDA, and other libraries to debug ML performance issues
- Collaborate with the engineering team to bring new features and capabilities to our inference platform
- Develop robust and scalable infrastructure for AI model serving
- Create and maintain technical documentation for inference systems
**Requirements**
- 3+ years of experience writing high-performance, production-quality code
- Strong proficiency with Python and deep learning frameworks, particularly PyTorch
- Demonstrated experience with LLM inference optimization techniques
- Hands-on experience with SGLang and vLLM, with contributions to these projects strongly preferred
- Familiarity with Docker and Kubernetes for containerized deployments
- Experience with CUDA programming and GPU optimization
- Strong understanding of distributed systems and scalability challenges
- Proven track record of optimizing AI models for production environments
**Nice to Have**
- Familiarity with TensorRT and TensorRT-LLM
- Knowledge of vision models and multimodal AI systems
- Experience implementing techniques like quantization and speculative decoding
- Contributions to open source machine learning projects
- Experience with large-scale distributed computing
**Compensation**
We offer competitive compensation, equity in a high-growth startup, and comprehensive benefits. The base salary range for this role is $180,000 – $250,000, plus competitive equity and benefits including:
- Full healthcare coverage
- Quarterly offsites
- Flexible PTO
Speak with the employer: FullThrottle Labs Pvt Ltd, +91 9008078505