Job Description
Role Overview:
You will drive frontier research in speech-to-speech and multimodal AI systems to build natural, self-learning, voice-capable AI Workers at GoCommotion, a fast-growing startup revolutionizing Customer Experience (CX) with AI Workers. You will be responsible for researching and developing direct speech-to-speech modeling, evaluating conversational turn-taking, latency, and VAD for real-time AI, exploring Agentic Reinforcement Training (ART), designing memory-augmented multimodal architectures, creating expressive speech generation systems, contributing to state-of-the-art research, and collaborating with MLEs on model training and evaluation.

Key Responsibilities:
- Research and develop direct speech-to-speech modeling using LLMs and audio encoders/decoders
- Model and evaluate conversational turn-taking, latency, and VAD for real-time AI
- Explore Agentic Reinforcement Training (ART) and self-learning mechanisms
- Design memory-augmented multimodal architectures for context-aware interactions
- Create expressive speech generation systems with emotion conditioning and speaker preservation
- Contribute to state-of-the-art research in multimodal learning, audio-language alignment, and agentic reasoning
- Define the long-term AI research roadmap with the Research Director
- Collaborate with MLEs on model training and evaluation, while leading dataset and experimentation design

Qualifications Required:
- 3+ years of applied or academic experience in speech, multimodal, or LLM research
- Bachelor's or Master's in Computer Science, AI, or Electrical Engineering
- Strong in Python and scientific computing, including JupyterHub environments
- Deep understanding of LLMs, transformer architectures, and multimodal embeddings
- Experience in speech modeling pipelines: ASR, TTS, speech-to-speech, or audio-language models
- Knowledge of turn-taking systems, VAD, prosody modeling, and real-time voice synthesis
- Familiarity with self-supervised learning, contrastive learning, and agentic reinforcement (ART)
- Skilled in dataset curation, experimental design, and model evaluation
- Comfortable with tools like Agno, Pipecat, HuggingFace, and PyTorch
- Exposure to LangChain, vector databases, and memory systems for agentic research
- Strong written communication and clarity in presenting research insights
- High research curiosity, independent ownership, and mission-driven mindset
- Currently employed at a product-based organization