Data Science Intern: AI Data Curation (Paid) – Life Sciences
Data Science Intern: AI & Life Sciences (Paid)

Location: Pune (Hybrid)
Type: Paid Internship (6 months, potential extension/absorption)
Company: Dizzaroo – Transforming Drug Discovery Development with AI

About Dizzaroo

At Dizzaroo, we are building AI-first tools to transform drug discovery and development. We explore, unravel, and organize complex biological, clinical, and scientific data to help bring new treatments to patients faster. We value bold ideas, flexibility, and a “no idea is a crazy idea” mindset.

Role Overview

We are seeking Data Science Interns to join our team in curating and cleaning specialized datasets for AI model fine-tuning in the life sciences domain. You will work with structured data (clinical, biomedical, genomic) and unstructured data (scientific publications, clinical protocols) to build high-quality datasets for our AI workflows supporting drug discovery and development. This is a unique opportunity to gain hands-on experience at the intersection of AI, data science, and life sciences while working with cutting-edge tools and graph-based data infrastructure.

What You Will Do

- Curate, clean, and annotate domain-specific datasets, including:
  - Scientific publications, clinical protocols, and regulatory documents for training large language models.
  - Biomedical and genomic data for structured AI pipelines.
- Use advanced tools and databases (Weaviate, Neo4j, SQL/NoSQL) to organize and manage large-scale, multimodal datasets.
- Support data pipeline validation and quality checks to ensure clean, structured training data for AI models.
- Assist with document chunking, metadata tagging, and knowledge graph development to enhance retrieval and structuring of scientific and clinical data.
- Collaborate with AI engineers and domain experts to align data curation with project goals.

What We’re Looking For

Background: Pursuing or recently completed a Bachelor's/Master’s in Data Science, Computer Science, Life Sciences, Biomedical Engineering, or related fields.
Skills: Strong in at least one domain with working knowledge of the other:

If your strength is data science, you should have:
- Proficiency in Python (libraries like pandas, NumPy, PyTorch).
- Exposure to SQL, and ideally to graph/vector databases (Neo4j, Weaviate).
- Experience with data cleaning, ETL workflows, or text processing (NLP preprocessing).
- Curiosity to understand life sciences contexts.

If your strength is life sciences, you should have:
- Knowledge of biomedical or clinical data structures, scientific literature, or genomics.
- Ability to use Python or spreadsheets for basic data analysis.
- Interest in applying data science tools to life sciences problems.

Mindset:
- Comfortable with ambiguity and learning complex domain contexts.
- High attention to detail with a commitment to data quality.
- Aligns with Dizzaroo’s values of creativity, flexibility, and challenging the status quo.

What You Will Gain

- Exposure to real-world AI model training pipelines using structured and unstructured data in drug discovery development.
- Experience with advanced data infrastructure and tooling for cutting-edge AI workflows.
- Opportunity to contribute to impactful projects across knowledge management, multimodal data integration, and computer vision in diagnostics.
- Potential pathway to full-time opportunities with Dizzaroo based on performance.

How to Apply

Send your CV and a brief note on why you are interested in this role to kalpeshp@dizzaroo.com with the subject: