AI/ML Engineer – Speech & Language Technologies
Roles & Responsibilities

- Design, implement, and train deep learning models for:
  - Text-to-speech (e.g., SpeechT5, StyleTTS2, YourTTS, XTTS-v2, or similar models)
  - Voice cloning with speaker embeddings (x-vectors, d-vectors), few-shot adaptation, and prosody and emotion transfer
- Engineer multilingual audio-text preprocessing pipelines:
  - Text normalization, grapheme-to-phoneme (G2P) conversion, and Unicode normalization (NFC/NFD) (see the normalization sketch below)
  - Silence trimming, VAD-based audio segmentation, audio enhancement for noisy corpora, speech prosody modification, and waveform manipulation
- Build scalable PyTorch data loaders for large-scale, multi-speaker datasets with variable-length sequences and chunked streaming (see the collate sketch below)
- Extract and process acoustic features: log-mel spectrograms, pitch contours, MFCCs, energy, and speaker embeddings (see the feature-extraction sketch below)
- Optimize training using mixed precision (FP16/BFloat16), gradient checkpointing, label smoothing, and quantization-aware training (see the mixed-precision sketch below)
- Build serving infrastructure for batch and real-time inference using TorchServe, ONNX Runtime, Triton Inference Server, and FastAPI for REST endpoints (see the endpoint sketch below)
- Optimize models for production: quantization, model pruning, ONNX conversion, parallel decoding, and GPU/CPU memory profiling (see the export-and-quantize sketch below)
- Create automated and human evaluation pipelines: MOS, PESQ, STOI, BLEU, WER/CER, multi-speaker test sets, and multilingual subjective listening tests (see the WER/CER sketch below)
- Implement ethical deployment safeguards: digital watermarking, impersonation detection, and voice verification for cloned speech
- Conduct literature reviews, reproduce state-of-the-art papers, and adapt and improve on open benchmarks
- Mentor junior contributors, review code, and maintain shared research and model repositories
- Collaborate across teams (MLOps, backend, product, linguists) to translate research into deployable, user-facing solutions

Required Skills

- Advanced proficiency in Python and PyTorch (TensorFlow a plus)
- Strong grasp of deep learning concepts: sequence-to-sequence models, Transformers, autoregressive and non-autoregressive decoders, attention mechanisms, VAEs, and GANs
- Experience with modern speech processing toolkits: ESPnet, NVIDIA NeMo, Coqui TTS, OpenSeq2Seq, or equivalent
- Ability to design custom loss functions (mel loss, GAN loss, KL divergence, attention losses, etc.) and to manage learning-rate schedules and training stability (see the composite-loss sketch below)
- Hands-on experience with multilingual and low-resource language modeling
- Understanding of Transformer architectures and LLMs, and experience working with existing AI models, tools, and APIs
- Model serving and API integration: TorchServe, FastAPI, Docker, ONNX Runtime

Preferred (Bonus) Skills

- CUDA kernel optimization, custom GPU operations, and memory-footprint profiling
- Experience deploying on AWS/GCP with GPU acceleration
- Experience developing RESTful APIs for real-time TTS/voice-cloning endpoints
- Publications or open-source contributions in TTS, ASR, or speech processing
- Working knowledge of multilingual translation pipelines
- Knowledge of speaker diarization, voice anonymization, and speech synthesis for agglutinative/morphologically rich languages

Milestones & Expectations (First 3–6 Months)

- Deliver at least one production-ready TTS or voice-cloning model integrated with India Speaks’ Dubbing Studio or SaaS APIs
- Create a fully reproducible experiment pipeline for multilingual speech modeling, complete with model cards and performance benchmarks
- Contribute to custom evaluation tools for measuring quality across Indian languages
- Deploy optimized models to live staging environments using Triton, TorchServe, or ONNX
- Demonstrate impact through real-world integration in education, media, or defence deployments
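Illustrative Code Sketches

The sketches below illustrate, in Python/PyTorch, some of the techniques named above. Each is a minimal example under stated assumptions, not a reference implementation; any function, parameter, or model name not mentioned in the posting is hypothetical.

First, a minimal text-normalization step, assuming NFC is the target Unicode form; a real pipeline would layer per-language number and abbreviation expansion plus G2P on top of this.

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Canonical Unicode composition plus whitespace cleanup."""
    text = unicodedata.normalize("NFC", text)  # NFC: canonical composition
    return " ".join(text.strip().split())      # collapse runs of whitespace
```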
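A sketch of the variable-length batching mentioned under data loaders: a collate function that pads text and mel sequences to the batch maximum and keeps the true lengths for masking. The item keys "phoneme_ids" and "mel" are assumptions about the dataset's per-item format.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def tts_collate(batch):
    """Pad variable-length text and mel sequences; keep true lengths so the
    model can mask padding in attention and loss computation."""
    text = [item["phoneme_ids"] for item in batch]  # assumed LongTensor [T_text]
    mels = [item["mel"] for item in batch]          # assumed FloatTensor [T_mel, n_mels]
    return {
        "phoneme_ids": pad_sequence(text, batch_first=True, padding_value=0),
        "mel": pad_sequence(mels, batch_first=True, padding_value=0.0),
        "text_lens": torch.tensor([t.size(0) for t in text]),
        "mel_lens": torch.tensor([m.size(0) for m in mels]),
    }

# Usage: torch.utils.data.DataLoader(dataset, batch_size=16, collate_fn=tts_collate)
```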
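Log-mel feature extraction with torchaudio; the sample rate, FFT size, and hop length here are common TTS defaults, not values specified in the posting.

```python
import torch
import torchaudio

# In practice the MelSpectrogram transform would be built once and reused.
def log_mel(wav_path: str, sample_rate: int = 22050) -> torch.Tensor:
    wav, sr = torchaudio.load(wav_path)
    if sr != sample_rate:
        wav = torchaudio.functional.resample(wav, sr, sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
    )(wav)
    return torch.log(torch.clamp(mel, min=1e-5))  # [channels, n_mels, frames]
```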
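A minimal FP16 mixed-precision training step; `model`, `optimizer`, `loader`, and `compute_loss` are assumed to exist and are not defined here.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:                       # `loader` assumed defined
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = compute_loss(model, batch)  # hypothetical loss helper
    scaler.scale(loss).backward()          # scale loss to avoid FP16 underflow
    scaler.step(optimizer)                 # unscales gradients, then steps
    scaler.update()                        # adjust scale for next iteration
```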
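A sketch of a FastAPI REST endpoint for real-time synthesis. The `synthesize` call, its WAV-bytes return value, and the request fields are placeholders for the actual model interface.

```python
from typing import Optional
from fastapi import FastAPI, Response
from pydantic import BaseModel

app = FastAPI()

class TTSRequest(BaseModel):
    text: str
    language: str = "hi"              # illustrative default
    speaker_id: Optional[str] = None  # optional cloned-voice selector

@app.post("/tts")
def tts(req: TTSRequest) -> Response:
    # `synthesize` is a hypothetical function wrapping the TTS model.
    wav_bytes = synthesize(req.text, req.language, req.speaker_id)
    return Response(content=wav_bytes, media_type="audio/wav")
```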
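ONNX export followed by dynamic quantization, assuming a non-autoregressive acoustic model that maps phoneme IDs to mels; autoregressive models typically need the encoder and decoder exported separately. File names and shapes are illustrative.

```python
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

dummy = torch.randint(0, 100, (1, 50))  # fake batch of phoneme IDs
torch.onnx.export(
    model, dummy, "tts.onnx",           # `model` assumed defined
    input_names=["phoneme_ids"], output_names=["mel"],
    dynamic_axes={"phoneme_ids": {0: "batch", 1: "time"}},
    opset_version=17,
)
# Quantize weights to INT8 to shrink the model and speed up CPU inference.
quantize_dynamic("tts.onnx", "tts.int8.onnx", weight_type=QuantType.QInt8)
```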
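WER/CER from edit distance, as named in the evaluation bullet; libraries such as jiwer provide hardened versions, so this shows only the core idea.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with a single rolling row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution/match
    return dp[-1]

def wer(ref: str, hyp: str) -> float:
    words = ref.split()
    return edit_distance(words, hyp.split()) / max(len(words), 1)

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

# Example: wer("the cat sat", "the cat sat down") == 1/3
```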
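Finally, a composite loss of the kind listed under custom loss design: L1 mel reconstruction plus a KL term for a VAE-style model. The `kl_weight` value is an illustrative placeholder, not a recommended setting.

```python
import torch
import torch.nn.functional as F

def tts_loss(mel_pred, mel_target, mu, logvar, kl_weight=0.01):
    """L1 mel reconstruction + KL(q(z|x) || N(0, I)) for a VAE-style model."""
    mel_loss = F.l1_loss(mel_pred, mel_target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mel_loss + kl_weight * kl
```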