Case Study · Open Source

A primary-source intelligence platform for cross-border biotech

Biotech Intelligence translates Mandarin regulatory filings and corporate registries into structured, typed, queryable intelligence, then serves it as statically rendered pages and an interactive ownership graph. This page is the engineering story behind it.

Next.js 14 RSC · TypeScript · SQLite · Python pipeline · MDX · Tailwind

The Problem

The intelligence that moves cross-border deals is locked in Mandarin primary sources

Cross-border biotech licensing between Asian and Western companies reached $137.7 billion in 2025 — 38% of Big Pharma's $50M+ transactions in the first half. Yet the data that actually de-risks those deals — CDE drug filings, NMPA announcements, ultimate beneficial ownership, VIE structures, BIOSECURE exposure — lives in Chinese-language regulatory systems and corporate registries (Tianyancha, GSXT).

Deal teams are left stitching together machine translations and stale secondary reporting. There was no single, structured, primary-source view that both sides of a transaction could read from the same page. So I built one — and made it free and open source.

What had to be true

  • Read the primary sources in the original language — not machine glosses.
  • Normalise messy filings into one structured, queryable model.
  • Make ownership and deal relationships explorable, not buried in prose.

Architecture & Data Flow

From a Mandarin filing to a typed, statically rendered page

A Python pipeline does the heavy ingestion offline; the web tier stays a thin, fast, typed read layer. The database is exported to static JSON at build time, so production never depends on a writable filesystem and pages are served pre-rendered rather than built per request.

  1. Scrape

    CDE filings, NMPA, corporate registries, deal news

  2. Translate + Enrich

    Mandarin → English, entity resolution, BIOSECURE tagging

  3. SQLite

    entities · deals · filings · subsidiaries · trials

  4. Typed API

    app/api/intel/* — rate-limited, cached JSON

  5. Render

    Static RSC pages + interactive force graph
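
Step 4 describes the API routes as rate-limited. One way that could work is a fixed-window counter keyed by client. The sketch below is an assumption, not the project's actual limiter: `createRateLimiter`, `limit`, and `windowMs` are hypothetical names, and the clock is injected so the behaviour can be checked without waiting.

```typescript
// Hypothetical fixed-window rate limiter; not taken from the project's source.
type Clock = () => number;

interface RateLimiter {
  // Returns true if the request identified by `key` is allowed.
  allow(key: string): boolean;
}

function createRateLimiter(
  limit: number,
  windowMs: number,
  now: Clock = Date.now,
): RateLimiter {
  // For each key, remember when its current window started and how many hits it has seen.
  const windows = new Map<string, { startedAt: number; count: number }>();

  return {
    allow(key: string): boolean {
      const t = now();
      const current = windows.get(key);

      // Start a fresh window on first sight or once the old one has expired.
      if (current === undefined || t - current.startedAt >= windowMs) {
        windows.set(key, { startedAt: t, count: 1 });
        return true;
      }

      if (current.count < limit) {
        current.count += 1;
        return true;
      }
      return false; // over the limit until the window rolls over
    },
  };
}
```

An in-memory map like this only limits traffic per server instance; on a serverless platform a shared store would be needed for a global limit.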

Static surface

42 server-rendered pages

Briefings, deals, company profiles and the BIOSECURE tracker, pre-rendered for speed and SEO.

Interactive surface

Corporate-ownership graph

A client-side force simulation hydrated from the typed graph endpoint.

Build-time export

A prebuild step exports SQLite to static JSON in public/data/, so the serverless tier reads data without a writable disk.
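
As a rough illustration of the export step, the sketch below turns in-memory table rows into one JSON document per table under public/data/. It deliberately leaves out the SQLite read and the file write. The table names come from the pipeline above; `buildExportFiles`, the row shape, and the one-file-per-table layout are assumptions.

```typescript
// Hypothetical shape of the prebuild export; the real script is not shown on this page.
type Row = Record<string, string | number | null>;
type Tables = Record<string, Row[]>;

// Maps each table to a JSON document keyed by its target path under public/data/.
function buildExportFiles(tables: Tables, outDir = "public/data"): Map<string, string> {
  const exported = new Map<string, string>();
  // Sort table names so repeated builds emit files in a stable order.
  for (const table of Object.keys(tables).sort()) {
    exported.set(`${outDir}/${table}.json`, JSON.stringify(tables[table]));
  }
  return exported;
}

// Example input with made-up rows.
const exportedFiles = buildExportFiles({
  entities: [{ id: 1, name: "Example Bio", slug: "example-bio" }],
  deals: [],
});
```

In a real prebuild script the rows would come from database queries and each map entry would be written to disk; both steps are omitted here to keep the sketch free of dependencies.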

Resilient fetches

Every external call is wrapped in try/catch with a graceful fallback and hourly ISR, so a flaky API degrades to fallback data instead of failing the build.
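
A minimal sketch of that pattern, assuming a helper named `fetchWithFallback` (hypothetical, as are the structural types around it). The fetch implementation is passed in so the helper can be exercised without a network. The `next.revalidate` option is the Next.js mechanism for time-based revalidation, set here to one hour.

```typescript
// Minimal structural types so the sketch does not depend on DOM or Node typings.
interface ResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

type FetchLike = (
  url: string,
  init?: { next?: { revalidate?: number } },
) => Promise<ResponseLike>;

// Hypothetical helper: returns parsed JSON, or `fallback` if anything goes wrong.
async function fetchWithFallback<T>(
  fetchImpl: FetchLike,
  url: string,
  fallback: T,
  revalidateSeconds = 3600, // one hour, matching the hourly ISR described above
): Promise<T> {
  try {
    const res = await fetchImpl(url, { next: { revalidate: revalidateSeconds } });
    if (!res.ok) {
      return fallback; // treat non-2xx responses like a network failure
    }
    // The cast is unchecked; a real caller might validate the payload shape here.
    return (await res.json()) as T;
  } catch {
    return fallback; // network error or malformed JSON
  }
}
```

Returning the fallback on both thrown errors and non-2xx responses keeps the calling page free of error handling, at the cost of hiding failures unless they are also logged.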

One source of truth

The curated-entity filter and slug rule live in one module each, shared by SQL, API routes, and link builders — so a row always resolves to a real page.
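
To make the idea concrete, here is a sketch of a slug rule and a curated filter that a link builder reuses. None of it is the project's actual code: `toSlug`, `isCurated`, `entityHref`, the `curated` flag, and the `/companies/` path are all assumptions chosen for illustration.

```typescript
// Hypothetical slug rule; the project's actual rule is not shown on this page.
function toSlug(name: string): string {
  return name
    .normalize("NFKD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse every other run of characters to one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

interface EntityRow {
  name: string;
  curated: boolean; // assumed flag marking entities that have a profile page
}

// The curated-entity filter, defined once and reused wherever rows are listed or linked.
function isCurated(row: EntityRow): boolean {
  return row.curated;
}

// Link builder: returns a path only for rows that actually have a page.
function entityHref(row: EntityRow): string | null {
  return isCurated(row) ? `/companies/${toSlug(row.name)}` : null;
}
```

Under this particular rule a name with no Latin letters or digits would slug to an empty string, so a real implementation would need a stored slug or a transliteration step for such names.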

The Tech Stack

Typed end-to-end, from the database row to the rendered cell

Next.js 14

App Router · React Server Components · ISR

TypeScript

End-to-end typing, DB rows to API to UI

Tailwind CSS

Editorial design system, dual light/dark theme

SQLite · better-sqlite3

Read-only intel store, prepared statements

MDX

16 long-form briefings as typed, queryable content

Python

Scrape + translate + enrich intel pipeline

Vitest

20 test files: lib, components, API, integration

Vercel

Static pre-rendering + serverless API, hourly revalidation

No runtime UI framework beyond React; the interactive graph is a hand-built force simulation rather than a heavyweight charting dependency.
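
For readers unfamiliar with force-directed layouts, the sketch below shows one simulation step of the general technique: pairwise repulsion, spring attraction along links, then damped integration. It is not the platform's implementation; the function name, the node and link shapes, and every constant are illustrative assumptions.

```typescript
// Hypothetical minimal force-directed layout step; constants are illustrative.
interface GraphNode { id: string; x: number; y: number; vx: number; vy: number; }
interface GraphLink { source: string; target: string; }

function stepSimulation(
  nodes: GraphNode[],
  links: GraphLink[],
  opts = { repulsion: 500, springLength: 60, springStrength: 0.05, damping: 0.85 },
): void {
  const byId = new Map<string, GraphNode>();
  for (const n of nodes) byId.set(n.id, n);

  // Repulsion: every pair of nodes pushes apart with force proportional to 1 / distance^2.
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const a = nodes[i];
      const b = nodes[j];
      const dx = b.x - a.x;
      const dy = b.y - a.y;
      const dist = Math.max(Math.hypot(dx, dy), 0.01); // avoid division by zero
      const f = opts.repulsion / (dist * dist);
      a.vx -= (dx / dist) * f;
      a.vy -= (dy / dist) * f;
      b.vx += (dx / dist) * f;
      b.vy += (dy / dist) * f;
    }
  }

  // Springs: linked nodes are pulled toward a preferred separation.
  for (const link of links) {
    const a = byId.get(link.source);
    const b = byId.get(link.target);
    if (!a || !b) continue; // ignore links to unknown nodes
    const dx = b.x - a.x;
    const dy = b.y - a.y;
    const dist = Math.max(Math.hypot(dx, dy), 0.01);
    const f = (dist - opts.springLength) * opts.springStrength;
    a.vx += (dx / dist) * f;
    a.vy += (dy / dist) * f;
    b.vx -= (dx / dist) * f;
    b.vy -= (dy / dist) * f;
  }

  // Integrate: damp velocities so the layout settles, then move each node.
  for (const n of nodes) {
    n.vx *= opts.damping;
    n.vy *= opts.damping;
    n.x += n.vx;
    n.y += n.vy;
  }
}
```

Calling `stepSimulation` once per animation frame and drawing the node positions is enough for a small graph. The pairwise loop is quadratic in the number of nodes, so larger graphs usually need a spatial approximation.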

Built & Maintained By

Antony Tan

Computational biologist · Software engineer

Antony holds an MS in Computational Biology from the Harvard T.H. Chan School of Public Health and conducted research at the Broad Institute of MIT & Harvard, with a publication at NeurIPS 2025. He also holds a BS in Computer Science from the University of Toronto.

Fluent in English, Mandarin, and Cantonese, Antony reads CDE filings, NMPA documents, and corporate registries (Tianyancha, GSXT) in their original language — which is what makes the primary-source methodology behind this platform possible. This project pairs that domain fluency with the full-stack engineering above.

Credentials

  • MS Computational Biology — Harvard T.H. Chan
  • Researcher — Broad Institute of MIT & Harvard
  • Publication — NeurIPS 2025
  • BS Computer Science — University of Toronto
  • Languages — English · Mandarin · Cantonese

Editorial Methodology

Every briefing is built from primary sources

The engineering exists to serve a rigorous editorial process. The pipeline gathers, but a human with computational-biology training does the reading and the judgement.

1. SOURCE

Monitor CDE filings, NMPA announcements, Chinese corporate registries (Tianyancha, GSXT), and deal announcements in both English and Chinese media.

2. TRANSLATE

Filings are read by a native Mandarin speaker with computational-biology training rather than run through machine translation. Terminology is checked against pharmacological and regulatory standards.

3. ANALYZE

Deal terms are benchmarked against comparable transactions. Corporate structures are mapped through holding-company registries. BIOSECURE exposure is assessed against BCC designation criteria.

4. DELIVER

Analysis is structured for business-development and compliance professionals, closing with concrete takeaways relevant to current deal and regulatory activity.

Independence disclosure: Biotech Intelligence has no financial relationships with any company, institution, or government entity covered. The author holds no stock positions in any company covered. All analysis is independent and cites its sources. This is research, not investment advice, legal counsel, or policy advocacy.

The whole platform is open source

Read the code, fork the pipeline, or build on it. No account, no paywall.

Open source · MIT licensed