Team: Product (Evaluation & Safety)
Reports to: Head of Product

Role Purpose
Define, measure, and validate the safety, accuracy, and usability of our AI scribe and CDS. You bridge clinicians and engineers, owning gold standards, bias/safety evaluations, and release gates so models are clinically trustworthy.

What You'll Do

Evaluation Design & Release Gates
Design evaluation protocols for NLP/LLM, CDS, and ASR (accuracy, hallucinations, bias, robustness, latency).
Define go/no-go thresholds and own the release gate tied to clinical risk; open crisp remediation tickets.

Clinical NER & Code Mapping Evaluation (core)
Create gold standards for spans (problem/symptom/med/lab/order/allergy), med attributes, assertion, temporality, and code links (SNOMED, RxNorm, LOINC, with crosswalks to ICD-10/CPT/local).
Run the annotation program with clinicians; achieve inter-annotator agreement (κ) of at least 0.80; maintain guidelines & QA.
Build red-team suites (abbreviation collisions, negation, rare meds, overlapping speech, code-switching).
Publish scorecards: per-type F1, linking accuracy, attribute exact-match, slice metrics (accent, specialty), and safety outcomes.

Benchmarking, Explainability & Reporting
Benchmark vs baselines; deliver stakeholder-ready reports & dashboards (confidence, alt candidates, rationale/citations, span-to-code trace).
Provide audit logs suitable for compliance and clinical review.

Production Monitoring
Monitor drift, bias, and unsafe outputs; analyze clinician overrides and alert fatigue; recommend model/rule tuning.
Partner with Compliance for PDPA/HIPAA/ISO 27799 evidence and regulatory submissions.

Collaboration
Work tightly with the Lead Data & AI Engineer on fixes and regression; align evaluation with real clinical workflows.

Required Qualifications
3 to 5 years as a Data Scientist / AI Evaluator (healthcare preferred).
Strong statistics & experimental design; skilled in error taxonomies and RCA.
Python (pandas, NumPy, scikit-learn; basic PyTorch/HF a plus).
Clinical ontologies (SNOMED, ICD, RxNorm, LOINC) and EHR data familiarity.
Experience running clinician-in-the-loop studies; clear technical + clinical writing.
Knowledge of HIPAA/PDPA/ISO 27799.

Nice to Have
MD/PharmD/Clinical Informaticist background or close clinical research experience.
ASR/voice evaluation; diarization/WER analysis.
Power analysis, IRB/ethics, risk management frameworks.
BI/observability (Metabase/Superset, Grafana, OpenTelemetry).

Success Metrics (you own)
Quality/Safety: hallucination rate less than 1% on audited notes; SOAP completeness greater than 95%. NER per-type F1 at least 0.88 (overall at least 0.92); linking top-1 accuracy at least 0.95; med attribute exact-match at least 0.93.
Safety: drug-allergy recall greater than 99%, precision greater than 95%; zero high-severity safety escapes at gate.
Program Health: inter-annotator agreement (κ) of at least 0.80; weekly regression + monthly red-team updates; pilot time-to-finalize reduced 30 to 50% vs baseline.
Team: Product & Engineering (Data/AI)
Reports to: Head of Product (dotted line to Head of Engineering)
Type: Full-time

Role Purpose
Build, deploy, and maintain the AI/ML stack powering our EMR: clinical NLP/LLM, decision support, and voice scribing. Own end-to-end data engineering, model training, and MLOps with healthcare compliance baked in (PDPA, HIPAA, ISO 27799).

What You'll Do

Data Platform & Pipelines
Architect and operate pipelines for structured/unstructured clinical data (EHR notes, HL7 v2, FHIR, audio).
Build/maintain the feature store for clinical AI (labs, meds, allergies, vitals, orders) with lineage & versioning.
Implement PHI de-identification/re-identification, KMS-backed encryption, DUAs, and access controls.

Clinical NER & Code Mapping (core accountability)
Own the extraction + normalization stack for: problems/diagnoses, symptoms/findings, medications (with attributes), labs, orders, allergies.
Ship a hybrid extractor (transformer NER + rules) with assertion (present/absent/etc.) and temporality.
Build a medication attribute parser (dose, unit/UCUM, route, frequency, duration, PRN, instructions).
Implement a two-stage entity linker (candidate generation via lexicon/vector search + cross-encoder rerank) to SNOMED CT, RxNorm, LOINC; manage crosswalks to ICD-10/CPT and local catalogs.
Operate ontology ops: version pinning, diffs, UMLS/SNOMED licensing, regression tests per ontology release.
Enforce safety guards (drug-allergy, duplicate therapy, dose range) and confidence-driven UI disambiguation.

Modeling, LLMs & Scribing
Build RAG/LLM pipelines for summarization, CDS, and scribe workflows (prompting, tool use, retrieval, guardrails).
Integrate ASR + diarization with streaming partials/hotwords for clinical terms and names.

MLOps, Reliability & Cost
Stand up MLOps: model registry, experiment tracking, CI/CD, canary/shadow deploys, drift & safety monitoring, blue/green rollbacks.
Meet SLOs: p95 speech-to-draft latency less than 2.0 s, ASR partial updates every 300 to 500 ms, 99.9% uptime, rollback in less than 5 min.
Optimize inference (Triton/ONNX Runtime, quantization/distillation, caching) and track cost per encounter.

EHR Integration & APIs
Ship SMART on FHIR apps and CDS Hooks; design gRPC/REST services; run Kafka/PubSub with idempotent consumers.

Security, Privacy & Compliance
PHI-safe prompts/logs, prompt-injection & data-exfiltration guards, constrained tool allowlists.
Audit trails exportable for clinical review & compliance.

Collaboration
Partner with Product & Clinical to encode guidelines/rules alongside ML.
Mentor engineers; uphold code quality, reviews, and on-call.

Required Qualifications
4+ years Data/ML Engineering (healthcare strongly preferred).
Expert: Python, SQL, PyTorch/TensorFlow, Hugging Face.
Deep NLP/LLM (transformers, RAG, prompt engineering, guardrails).
Standards: FHIR/HL7, SNOMED CT, ICD-10, RxNorm, LOINC, CPT; UMLS familiarity.
MLOps (MLflow/Kubeflow/Vertex/SageMaker), containerized inference, CI/CD.
Privacy/security in regulated environments (PDPA/HIPAA/ISO 27799).

Nice to Have
ASR/diarization (Whisper, Vosk, Kaldi), ONNX/TensorRT, Triton; gRPC/WebRTC streaming.
GPU scheduling, vector DBs, OpenTelemetry, Terraform/IaC.