Software engineer and ML researcher. Production platforms, deep learning pipelines, large-scale data.
About
I build software that solves hard problems. Lately that's ML and healthcare, but the engineering is what I love.
I like building things and I like hard problems. Right now that looks like full-stack platforms, deep learning on large-scale data, and getting systems into production. I care a lot about clean architecture and writing code that actually ships.
I'm finishing my MS in Computational Biology at Harvard, cross-registered at MIT EECS, and doing research at the Broad Institute and Harvard Medical School. Day to day I build full-stack applications, write ML pipelines, and work with large-scale data: the kind of problems where good engineering and good science have to work together.
Before Harvard, I studied CS and Biology at the University of Toronto, shipped production software at RBC, and spent three years teaching CS. I speak English, Cantonese, and Mandarin.
Python, TypeScript, Java, SQL. React, Next.js, Node. AWS, GCP, Docker, PostgreSQL, Supabase. Production systems at scale.
PyTorch, Transformers, foundation models, interpretable ML (SHAP), scikit-learn, multi-agent LLM systems
Large-scale data pipelines, ETL, REST APIs, NLP, real-time streaming, multi-source data integration
Proteomics, multi-omics, clinical data, drug target validation, pathway analysis
Publications
Selected Work
Full-stack platforms, ML pipelines, and data-intensive applications. All deployed live on GCP.
Enter a gene target, get back a complete validation dossier: disease associations, pathway biology, druggability, tissue safety, clinical trial landscape, and key literature. Built to replace weeks of manual lit review in early-stage pharma.
Pulls from OpenTargets, PubMed, ChEMBL, ClinicalTrials.gov, UniProt, and GTEx. Produces weighted confidence scores across 5 evidence axes with go/no-go recommendations. Validated against known targets like PCSK9, BRCA1, and TP53.
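As a rough illustration of how weighted scores across evidence axes can roll up into a go/no-go call, here is a minimal sketch. The axis names, weights, and threshold below are hypothetical placeholders chosen for the example, not the values the platform actually uses, and each axis score is assumed to be normalized to the range 0 to 1.

```python
# Hypothetical axis names and weights, for illustration only.
AXIS_WEIGHTS = {
    "disease_association": 0.30,
    "pathway_biology": 0.20,
    "druggability": 0.20,
    "tissue_safety": 0.15,
    "clinical_landscape": 0.15,
}

# Hypothetical cut-off separating "go" from "no-go".
GO_THRESHOLD = 0.6


def confidence_score(axis_scores: dict[str, float]) -> float:
    """Return the weighted mean of per-axis scores (each assumed in [0, 1])."""
    missing = set(AXIS_WEIGHTS) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    for axis in AXIS_WEIGHTS:
        if not 0.0 <= axis_scores[axis] <= 1.0:
            raise ValueError(f"score for {axis} must be between 0 and 1")
    total_weight = sum(AXIS_WEIGHTS.values())
    weighted_sum = sum(AXIS_WEIGHTS[a] * axis_scores[a] for a in AXIS_WEIGHTS)
    return weighted_sum / total_weight


def recommend(axis_scores: dict[str, float]) -> str:
    """Map the overall confidence score to a go/no-go label."""
    return "go" if confidence_score(axis_scores) >= GO_THRESHOLD else "no-go"


if __name__ == "__main__":
    # Made-up scores for a fictional target, just to exercise the functions.
    example = {
        "disease_association": 0.9,
        "pathway_biology": 0.7,
        "druggability": 0.8,
        "tissue_safety": 0.5,
        "clinical_landscape": 0.6,
    }
    print(round(confidence_score(example), 3), recommend(example))
```

Dividing by the total weight keeps the score in the 0 to 1 range even if the weights are later retuned and no longer sum to exactly one.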
Tracks pharmaceutical deals, clinical trials, and regulatory filings across 100 biotech companies. Designed for BD teams and compliance analysts.
Processes 2,400+ records from ClinicalTrials.gov, SEC EDGAR, PubMed, ChEMBL, and international regulatory databases. Multilingual NLP pipeline handles foreign-language drug filings.
Multiple AI models debate scientific and clinical questions in real time, challenging each other instead of agreeing by default.
Orchestrates GPT-4, Claude, and Gemini with anti-sycophancy scoring and a judge model. Real-time streaming deliberation.
Foundation model for irregular time-series data. The architecture applies to clinical time series like longitudinal lab values. Published at NeurIPS 2025.
Self-supervised Conformer-style transformer on 1.5M observations. 70% lower RMSE than benchmarks. Open-sourced on Hugging Face.
Neural networks encoding Reactome pathway structure so disease predictions come with mechanistic explanations. 50K+ plasma samples from major cohorts.
BMI prediction (R² = 0.73, 4.8x over Ridge), heart failure subtype classification (HFpEF vs HFrEF). SHAP interpretability across the SomaLogic and Olink proteomics platforms.
Thoughts on machine learning, computational biology, and building things.
Coming soon.