Case Study · Open Source
A primary-source intelligence platform for cross-border biotech
Biotech Intelligence translates Mandarin regulatory filings and corporate registries into structured, typed, queryable intelligence — then serves it as statically-rendered pages and an interactive ownership graph. This page is the engineering story behind it.
The Problem
The intelligence that moves cross-border deals is locked in Mandarin primary sources
Cross-border biotech licensing between Asian and Western companies reached $137.7 billion in 2025, and accounted for 38% of Big Pharma's $50M+ transactions in the first half of the year. Yet the data that actually de-risks those deals (CDE drug filings, NMPA announcements, ultimate beneficial ownership, VIE structures, BIOSECURE exposure) lives in Chinese-language regulatory systems and corporate registries (Tianyancha, GSXT).
Deal teams are left stitching together machine translations and stale secondary reporting. There was no single, structured, primary-source view that both sides of a transaction could read from the same page. So I built one — and made it free and open source.
What had to be true
- Read the primary sources in the original language — not machine glosses.
- Normalise messy filings into one structured, queryable model.
- Make ownership and deal relationships explorable, not buried in prose.
What the platform does
One platform, four primary-source data products
Weekly briefings
16 long-form MDX issues translating the week's CDE filings and deals.
Deal tracker
Licensing terms benchmarked against comparable transactions, sortable by value.
Company profiles
Pipeline, ownership chains, and BIOSECURE status from corporate registries.
BIOSECURE tracker
Enforcement timeline and BCC list, joined to the entities under analysis.
Architecture & Data Flow
From a Mandarin filing to a typed, statically-rendered page
A Python pipeline does the heavy ingestion offline; the web tier stays a thin, fast, typed read layer. The database is exported to static JSON at build time, so production never depends on a writable filesystem and pages render instantly.
- Scrape
CDE filings, NMPA, corporate registries, deal news
- Translate + Enrich
Mandarin → English, entity resolution, BIOSECURE tagging
- SQLite
entities · deals · filings · subsidiaries · trials
- Typed API
app/api/intel/* — rate-limited, cached JSON
- Render
Static RSC pages + interactive force graph
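The "rate-limited" part of the typed API step can be illustrated with a minimal sketch. It assumes a fixed-window, in-memory limiter keyed by caller; `createLimiter`, its parameters, and the keying scheme are hypothetical and are not taken from the repo.

```typescript
// Hypothetical fixed-window limiter: at most `limit` requests per key per window.
interface Limiter {
  allow(key: string, now?: number): boolean;
}

function createLimiter(limit: number, windowMs: number): Limiter {
  // One counter per key, reset whenever its window has elapsed.
  const windows = new Map<string, { start: number; count: number }>();
  return {
    allow(key: string, now: number = Date.now()): boolean {
      const current = windows.get(key);
      if (current === undefined || now - current.start >= windowMs) {
        // First request from this key, or the previous window expired.
        windows.set(key, { start: now, count: 1 });
        return true;
      }
      if (current.count < limit) {
        current.count += 1;
        return true;
      }
      return false; // Over the limit; a route would typically answer 429 here.
    },
  };
}
```

A route handler would call `allow(clientKey)` before querying and return HTTP 429 when it yields `false`. In-memory counters are per serverless instance, so a strict global limit would need a shared store instead.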
Static surface
42 server-rendered pages
Briefings, deals, company profiles and the BIOSECURE tracker, pre-rendered for speed and SEO.
Interactive surface
Corporate-ownership graph
A client-side force simulation hydrated from the typed graph endpoint.
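To make "force simulation" concrete, here is a minimal sketch of one simulation tick, assuming pairwise repulsion, spring attraction along ownership links, and velocity damping. The type names, `stepSimulation`, and every constant are illustrative assumptions, not the platform's actual implementation.

```typescript
interface GraphNode { id: string; x: number; y: number; vx: number; vy: number; }
interface GraphLink { source: string; target: string; }
interface ForceOptions {
  repulsion: number;      // strength of node-to-node repulsion
  springStrength: number; // pull of a link towards its rest length
  restLength: number;     // preferred link length
  damping: number;        // velocity retained per tick (0..1)
}

// Advances the layout by one tick, mutating node positions and velocities.
function stepSimulation(nodes: GraphNode[], links: GraphLink[], opts: ForceOptions): void {
  const byId = new Map<string, GraphNode>();
  for (const node of nodes) byId.set(node.id, node);

  // Every pair of nodes pushes apart with an inverse-square force.
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const a = nodes[i];
      const b = nodes[j];
      const dx = b.x - a.x;
      const dy = b.y - a.y;
      const distSq = Math.max(dx * dx + dy * dy, 0.01); // avoid division by zero
      const dist = Math.sqrt(distSq);
      const force = opts.repulsion / distSq;
      a.vx -= (dx / dist) * force;
      a.vy -= (dy / dist) * force;
      b.vx += (dx / dist) * force;
      b.vy += (dy / dist) * force;
    }
  }

  // Each link acts as a spring pulling its endpoints towards restLength.
  for (const link of links) {
    const a = byId.get(link.source);
    const b = byId.get(link.target);
    if (a === undefined || b === undefined) continue; // skip dangling links
    const dx = b.x - a.x;
    const dy = b.y - a.y;
    const dist = Math.max(Math.hypot(dx, dy), 0.1);
    const pull = (dist - opts.restLength) * opts.springStrength;
    a.vx += (dx / dist) * pull;
    a.vy += (dy / dist) * pull;
    b.vx -= (dx / dist) * pull;
    b.vy -= (dy / dist) * pull;
  }

  // Damp velocities, then integrate positions.
  for (const node of nodes) {
    node.vx *= opts.damping;
    node.vy *= opts.damping;
    node.x += node.vx;
    node.y += node.vy;
  }
}
```

Calling `stepSimulation` once per animation frame lets linked entities settle near `restLength` while unlinked ones drift apart. The all-pairs loop is quadratic in the node count, which is a reasonable trade-off for a curated graph but would need spatial partitioning for a much larger one.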
Build-time export
A prebuild step exports SQLite to static JSON in public/data/, so the serverless tier reads data without a writable disk.
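The shape of that export can be sketched as follows. This assumes one JSON file per table and takes already-loaded rows as plain arrays; in the real prebuild step the rows would come from SQLite and the result would be written to disk. `Row`, `buildStaticExports`, and the file-naming scheme are hypothetical.

```typescript
// A database row as plain JSON-safe values (assumed shape).
type Row = Record<string, string | number | null>;

// Maps each table to a target path under outDir and its serialised rows.
// Writing the files is left to the caller so the function stays pure.
function buildStaticExports(
  tables: Record<string, Row[]>,
  outDir: string = "public/data",
): Map<string, string> {
  const files = new Map<string, string>();
  for (const [table, rows] of Object.entries(tables)) {
    files.set(`${outDir}/${table}.json`, JSON.stringify(rows));
  }
  return files;
}

// Read side: parse one exported file back into typed rows.
function readExport<T extends Row>(files: Map<string, string>, path: string): T[] {
  const raw = files.get(path);
  return raw === undefined ? [] : (JSON.parse(raw) as T[]);
}
```

Keeping serialisation separate from file I/O means the same function can be exercised in tests without touching the filesystem.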
Resilient fetches
Every external call is wrapped in try/catch with a graceful fallback and hourly ISR, so a flaky API degrades to the fallback instead of failing the build.
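A minimal sketch of that wrapper pattern, assuming a generic helper that takes any async loader and a fallback value. `withFallback`, the `Loaded` shape, and the `stale` flag are hypothetical names introduced for illustration.

```typescript
// Result of a guarded call: the data plus whether the fallback was used.
interface Loaded<T> {
  data: T;
  stale: boolean;
}

// Runs `load`; on any failure returns `fallback` instead of throwing,
// so a page render or build step can continue with degraded data.
async function withFallback<T>(
  load: () => Promise<T>,
  fallback: T,
  onError?: (error: unknown) => void,
): Promise<Loaded<T>> {
  try {
    return { data: await load(), stale: false };
  } catch (error) {
    if (onError !== undefined) onError(error); // e.g. log the failure
    return { data: fallback, stale: true };
  }
}
```

In a Next.js App Router page, hourly revalidation is typically declared with `export const revalidate = 3600`, so a fallback served during an outage is replaced on a later regeneration once the upstream recovers.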
One source of truth
The curated-entity filter and slug rule live in one module each, shared by SQL, API routes, and link builders — so a row always resolves to a real page.
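The idea can be sketched as two small shared functions. The slug rule shown here (lowercase, collapse non-alphanumeric runs into hyphens) is an assumed stand-in for the repo's actual rule, and the `curated` field, `EntityRow`, and the `/companies/` route are hypothetical.

```typescript
interface EntityRow {
  name: string;
  curated: boolean; // assumed flag marking entities that have a published page
}

// Single slug rule: lowercase, non-alphanumeric runs become one hyphen,
// leading and trailing hyphens are trimmed.
function toSlug(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Single curated-entity filter, reused wherever rows are listed or linked.
function isCurated(row: EntityRow): boolean {
  return row.curated;
}

// Link builder: only curated rows get a link, so every link has a page behind it.
function entityHref(row: EntityRow): string | null {
  return isCurated(row) ? `/companies/${toSlug(row.name)}` : null;
}
```

Because the link builder goes through the same `isCurated` and `toSlug` as the page generator would, a listing cannot link to a slug that was never rendered.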
The Tech Stack
Typed end-to-end, from the database row to the rendered cell
Next.js 14
App Router · React Server Components · ISR
TypeScript
End-to-end typing, DB rows to API to UI
Tailwind CSS
Editorial design system, dual light/dark theme
SQLite · better-sqlite3
Read-only intel store, prepared statements
MDX
16 long-form briefings as typed, queryable content
Python
Scrape + translate + enrich intel pipeline
Vitest
20 test files: lib, components, API, integration
Vercel
Static export + serverless API, hourly revalidation
No runtime UI framework beyond React; the interactive graph is a hand-built force simulation rather than a heavyweight charting dependency.
Engineering Highlights
Three things worth opening the repo for
Corporate-ownership force graph
A client-rendered, physics-based network of biotech entities, their subsidiaries, VIE structures, and deal relationships, colour-coded by BIOSECURE exposure. It is hydrated from a typed graph endpoint and ships its own loading skeleton, so the dense full-screen UI never appears as a blank flash on first load.
Open the graph
BIOSECURE compliance tracker
A primary-source timeline of the BIOSECURE Act, the 1260H list, and BCC designations, joined against the entity store so each tracked company carries a live compliance status. The same curated-entity filter drives the tracker, the graph, and every entity page from one place.
View the tracker
Test suite + typed data layer
Twenty Vitest files cover the data layer, components, the intel API, and end-to-end content integrity — slug derivation parity between JS and SQL, deal-value math, markdown sanitisation, and SEO invariants. better-sqlite3 row generics keep query results typed instead of cast to any.
Read the source
Built & Maintained By
Antony Tan
Computational biologist · Software engineer
Antony holds an MS in Computational Biology from the Harvard T.H. Chan School of Public Health and conducted research at the Broad Institute of MIT & Harvard, with a publication at NeurIPS 2025. He also holds a BS in Computer Science from the University of Toronto.
Fluent in English, Mandarin, and Cantonese, Antony reads CDE filings, NMPA documents, and corporate registries (Tianyancha, GSXT) in their original language — which is what makes the primary-source methodology behind this platform possible. This project pairs that domain fluency with the full-stack engineering above.
Credentials
- MS Computational Biology — Harvard T.H. Chan
- Researcher — Broad Institute of MIT & Harvard
- Publication — NeurIPS 2025
- BS Computer Science — University of Toronto
- Languages — English · Mandarin · Cantonese
Editorial Methodology
Every briefing is built from primary sources
The engineering exists to serve a rigorous editorial process. The pipeline gathers, but a human with computational-biology training does the reading and the judgement.
1. SOURCE
Monitor CDE filings, NMPA announcements, Chinese corporate registries (Tianyancha, GSXT), and deal announcements in both English and Chinese media.
2. TRANSLATE
Filings are read by a native Mandarin speaker with computational-biology training, not run through machine translation. Terminology is checked against pharmacological and regulatory standards.
3. ANALYZE
Deal terms are benchmarked against comparable transactions. Corporate structures are mapped through holding-company registries. BIOSECURE exposure is assessed against BCC designation criteria.
4. DELIVER
Analysis is structured for business-development and compliance professionals, closing with concrete takeaways relevant to current deal and regulatory activity.
Independence disclosure: Biotech Intelligence has no financial relationships with any company, institution, or government entity covered. The author holds no stock positions in any company covered. All analysis is independent and cites its sources. This is research, not investment advice, legal counsel, or policy advocacy.
The whole platform is open source
Read the code, fork the pipeline, or build on it. Free to read, free to build on — no account, no paywall.
Open source · MIT licensed