Technical Documentation

Methodology & Data Sources

Transparent documentation of data sources, parsing techniques, analysis methods, and study limitations.

Data Sources

OpenFDA CRL Database

All Complete Response Letters were obtained from the FDA's public CRL database, launched in 2024 as part of its radical transparency initiative.

  • Approved CRLs: ~200 letters from drugs eventually approved (2020-2024)
  • Unapproved CRLs: ~89 letters from drugs not yet approved (2024-2025)
  • Format: PDF documents with redactions for proprietary information

PDF Parsing & Extraction

CRL PDFs were parsed using PyPDF2 to extract raw text. Regex patterns and keyword matching identified key elements:

Metadata Extraction

  • Application number (NDA/BLA/ANDA)
  • Drug name (when available)
  • Letter date
  • Page count
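The metadata step can be sketched as a few regular expressions run over the raw text produced by PyPDF2's `extract_text()`. The patterns below are illustrative assumptions, not the study's exact rules:

```python
import re

# Illustrative patterns (assumed, not the study's exact rules):
# application numbers like "NDA 212345", dates like "March 4, 2022".
APP_RE = re.compile(r"\b(NDA|BLA|ANDA)[\s#]*(\d{5,6})\b")
DATE_RE = re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b")

def parse_metadata(text: str, page_count: int) -> dict:
    """Pull application number and letter date from raw CRL text."""
    app = APP_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "application": f"{app.group(1)} {app.group(2)}" if app else None,
        "letter_date": date.group(0) if date else None,
        "page_count": page_count,
        "text_length": len(text),
    }
```

Fields that fail to match are left as `None`, which is how missing drug names and dates propagate into the dataset.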

Deficiency Categories

  • Safety
  • Efficacy
  • CMC/Manufacturing
  • Clinical trial design
  • Bioequivalence
  • Labeling
  • Statistical
  • REMS
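Keyword matching for these categories amounts to a lookup table of phrases per category. The keyword lists below are a minimal sketch; the actual lexicons used in the analysis were more extensive:

```python
# Illustrative keyword lists per deficiency category (assumed, abbreviated).
CATEGORY_KEYWORDS = {
    "safety": ["safety signal", "adverse event", "toxicity"],
    "efficacy": ["failed to demonstrate", "primary endpoint", "efficacy"],
    "cmc_manufacturing": ["manufacturing", "facility", "drug substance"],
    "trial_design": ["study design", "trial design"],
    "bioequivalence": ["bioequivalence", "bioavailability"],
    "labeling": ["labeling", "prescribing information"],
    "statistical": ["statistical analysis", "statistically"],
    "rems": ["risk evaluation and mitigation"],
}

def flag_deficiencies(text: str) -> dict:
    """Return a boolean flag per deficiency category based on keyword hits."""
    lowered = text.lower()
    return {
        category: any(keyword in lowered for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
```

A letter can trigger multiple categories at once, which matches how CRLs typically bundle several deficiency types.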

Challenge: Heavy redactions and poor OCR quality in older PDFs limited extraction accuracy, and drug names and dates are missing for some documents.

NLP & Language Analysis

Natural language processing techniques were applied to extract semantic patterns:

FDA-Specific Sentiment Analysis

Custom lexicons score severity ("cannot approve", "inadequate") and detect certainty ("must", "should", "may"). Scores range from 0 to 1.
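A minimal sketch of lexicon-based scoring, assuming illustrative term weights (the study's actual lexicons and aggregation rule are not reproduced here):

```python
# Illustrative term weights in [0, 1]; the real lexicons are larger and tuned.
SEVERITY_TERMS = {"cannot approve": 1.0, "inadequate": 0.7, "insufficient": 0.6}
CERTAINTY_TERMS = {"must": 1.0, "required": 0.9, "should": 0.6, "may": 0.3}

def lexicon_score(text: str, lexicon: dict) -> float:
    """Score a document in [0, 1] as the max weight of any lexicon term present.

    Note: naive substring matching; a production version would tokenize
    to avoid matching "may" inside words like "dismay".
    """
    lowered = text.lower()
    hits = [weight for term, weight in lexicon.items() if term in lowered]
    return max(hits, default=0.0)
```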

Word Frequency & N-grams

TF-IDF vectorization to identify discriminative terms. Bigrams and trigrams captured regulatory phrases like "failed to demonstrate" and "new clinical trial".

Semantic Embeddings

t-SNE and UMAP dimensionality reduction on TF-IDF features to visualize document similarity in latent space. K-means clustering identified document groups.
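The t-SNE branch of this pipeline can be sketched as follows (UMAP is analogous via the `umap-learn` package). The documents and parameters are illustrative; with only a handful of samples, `perplexity` must stay below the sample count:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy documents standing in for the CRL corpus.
docs = [
    "safety concerns with adverse events",
    "serious adverse events observed",
    "manufacturing facility deficiencies",
    "facility inspection deficiencies noted",
    "efficacy not demonstrated",
    "failed to demonstrate efficacy",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# 2-D embedding for visualization (perplexity must be < n_samples).
coords = TSNE(n_components=2, perplexity=2, init="random", random_state=0).fit_transform(X)

# Clustering is run on the TF-IDF features themselves, not the 2-D projection.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```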

Topic Modeling

Latent Dirichlet Allocation (LDA) with 5 topics revealed underlying themes in CRL content (clinical, manufacturing, safety, labeling, statistical).
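A minimal LDA sketch with scikit-learn, using a toy corpus in place of the CRL text; LDA operates on raw term counts rather than TF-IDF:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the CRL corpus.
docs = [
    "clinical trial endpoint efficacy results",
    "manufacturing facility batch records",
    "safety adverse events monitoring",
    "labeling prescribing information revisions",
    "statistical analysis plan deviations",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows are per-document topic mixtures
```

Each row of `doc_topics` is a probability distribution over the 5 topics; topics are then labeled by inspecting their top-weighted words (e.g. clinical, manufacturing, safety, labeling, statistical).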

Machine Learning Models

Binary classification task: predict whether a CRL will lead to eventual approval.

Models Evaluated

  • Logistic Regression: Linear baseline with L2 regularization
  • Random Forest: Ensemble of 100 decision trees, max depth 10
  • Gradient Boosting: XGBoost with 100 estimators, learning rate 0.1

Feature Engineering

  • Binary flags for each deficiency category
  • One-hot encoding of application type (NDA/BLA/ANDA)
  • Document metadata (page count, text length)
  • Key flags (safety concerns, new trial required)
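Assembled as a function, the feature vector looks roughly like this. The field names of the parsed-CRL record are hypothetical:

```python
def build_features(record: dict, categories: tuple,
                   app_types: tuple = ("NDA", "BLA", "ANDA")) -> dict:
    """Build one flat feature dict from a parsed CRL record (field names assumed)."""
    # Binary flag per deficiency category.
    feats = {f"def_{c}": int(c in record["deficiencies"]) for c in categories}
    # One-hot application type.
    for t in app_types:
        feats[f"type_{t}"] = int(record["app_type"] == t)
    # Document metadata.
    feats["page_count"] = record["page_count"]
    feats["text_length"] = record["text_length"]
    # Key flags.
    feats["safety_concern"] = int("safety" in record["deficiencies"])
    feats["new_trial_required"] = int(record["new_trial_required"])
    return feats
```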

Validation

5-fold stratified cross-validation on ~240 CRLs (80% training). Final test set of ~60 CRLs (20%) held out for unbiased performance evaluation.
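The split-and-validate scheme maps directly onto scikit-learn primitives. Synthetic data stands in for the CRL feature vectors here; with 300 samples, a stratified 80/20 split yields the 240/60 counts above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic data standing in for the ~300 CRL feature vectors.
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Stratified 80/20 split: 240 for training/CV, 60 held out for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold stratified cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = GradientBoostingClassifier(random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv)

# The held-out test set is touched exactly once, for the final estimate.
test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)
```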

Best Model: Gradient Boosting achieved 85.6% CV accuracy, significantly outperforming the 68% majority-class baseline implied by the class distribution.

Limitations & Caveats

Small Sample Size

~300 CRLs is limited for robust machine learning. Confidence intervals are wide, and model generalization to future CRLs is uncertain.

Temporal Bias

Unapproved CRLs are recent (2024-2025), while approved CRLs span 2020-2024. "Unapproved" drugs may simply need more time for resubmission, not be fundamentally unapprovable.

Redaction & Data Quality

Heavy redactions obscure proprietary details. Drug names and dates often missing. OCR errors in older PDFs reduce text quality.

Regulatory Context

FDA standards and policies evolve over time. Patterns from 2020-2025 may not generalize to future years or therapeutic areas not well-represented in this dataset.

References & Tools

  • Python 3.10+ with pandas, scikit-learn, matplotlib, NLTK
  • Next.js 15, React 18, Recharts for interactive visualizations
  • OpenFDA API and public CRL database
  • • BMJ 2015 CRL analysis: DOI 10.1136/bmj.h2758