Methodology & Data Sources
Transparent documentation of data sources, parsing techniques, analysis methods, and study limitations.
Data Sources
OpenFDA CRL Database
All Complete Response Letters were obtained from the FDA's public CRL database, launched in 2024 as part of its radical transparency initiative.
- Approved CRLs: ~200 letters from drugs eventually approved (2020-2024)
- Unapproved CRLs: ~89 letters from drugs not yet approved (2024-2025)
- Format: PDF documents with redactions for proprietary information
PDF Parsing & Extraction
CRL PDFs were parsed using PyPDF2 to extract raw text. Regex patterns and keyword matching identified key elements:
Metadata Extraction
- Application number (NDA/BLA/ANDA)
- Drug name (when available)
- Letter date
- Page count
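The exact parsing code is not published alongside the dataset; the sketch below shows the general approach, assuming the PyPDF2 3.x `PdfReader` API and illustrative (not the study's actual) regex patterns for the application number and letter date.

```python
import re
from PyPDF2 import PdfReader

# Illustrative patterns -- the study's actual regexes may differ.
APP_NUMBER = re.compile(r"\b(NDA|BLA|ANDA)\s*#?\s*(\d{5,6})\b", re.IGNORECASE)
LETTER_DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+\d{4}\b"
)

def extract_metadata(pdf_path: str) -> dict:
    """Extract raw text from a CRL PDF and capture basic metadata fields."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    app = APP_NUMBER.search(text)
    date = LETTER_DATE.search(text)
    return {
        "application_type": app.group(1).upper() if app else None,
        "application_number": app.group(2) if app else None,
        "letter_date": date.group(0) if date else None,
        "page_count": len(reader.pages),
        "text": text,
    }
```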
Deficiency Categories
- Safety
- Efficacy
- CMC/Manufacturing
- Clinical trial design
- Bioequivalence
- Labeling
- Statistical
- REMS
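A minimal sketch of the keyword-matching step; the keyword lists below are illustrative stand-ins, not the lexicon actually used.

```python
# Illustrative keyword lists per deficiency category (not the study's lexicon).
DEFICIENCY_KEYWORDS = {
    "safety": ["safety signal", "adverse event", "toxicity"],
    "efficacy": ["failed to demonstrate", "primary endpoint", "efficacy"],
    "cmc_manufacturing": ["chemistry, manufacturing", "facility inspection", "cgmp"],
    "clinical_trial_design": ["trial design", "additional clinical trial"],
    "bioequivalence": ["bioequivalence", "bioavailability"],
    "labeling": ["labeling", "prescribing information"],
    "statistical": ["statistical analysis", "analysis plan"],
    "rems": ["risk evaluation and mitigation", "rems"],
}

def flag_deficiencies(text: str) -> dict[str, bool]:
    """Return one binary flag per deficiency category based on keyword hits."""
    lowered = text.lower()
    return {
        category: any(keyword in lowered for keyword in keywords)
        for category, keywords in DEFICIENCY_KEYWORDS.items()
    }
```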
Challenge: Heavy redactions and poor OCR quality in older PDFs limited extraction accuracy; drug names and dates are missing for some documents.
NLP & Language Analysis
Advanced natural language processing techniques were applied to extract semantic patterns:
FDA-Specific Sentiment Analysis
Custom lexicons score severity ("cannot approve", "inadequate") and detect certainty markers ("must", "should", "may"). Scores are normalized to a 0-1 range.
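The terms and weights below are assumptions for illustration; they show one simple way to aggregate lexicon hits into a 0-1 score, not the study's exact lexicons or weighting.

```python
# Hypothetical lexicons; weights and aggregation are illustrative only.
SEVERITY_LEXICON = {"cannot approve": 1.0, "inadequate": 0.7, "insufficient": 0.6}
CERTAINTY_LEXICON = {"must": 1.0, "should": 0.6, "may": 0.3}

def lexicon_score(text: str, lexicon: dict[str, float]) -> float:
    """Sum weights of matched terms, normalized by lexicon size (stays within 0-1)."""
    lowered = text.lower()
    matched = [weight for term, weight in lexicon.items() if term in lowered]
    return round(sum(matched) / len(lexicon), 3)
```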
Word Frequency & N-grams
TF-IDF vectorization identified discriminative terms; bigrams and trigrams captured regulatory phrases such as "failed to demonstrate" and "new clinical trial".
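A typical scikit-learn setup for this step; the vectorizer parameters and the `documents` argument (the list of extracted CRL texts) are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize(documents: list[str]):
    """TF-IDF over unigrams through trigrams to surface discriminative regulatory phrases."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english",
                                 min_df=2, max_features=5000)
    X = vectorizer.fit_transform(documents)        # documents x n-grams matrix
    return X, vectorizer.get_feature_names_out()   # weights plus the vocabulary
```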
Semantic Embeddings
t-SNE and UMAP dimensionality reduction was applied to TF-IDF features to visualize document similarity in a latent space, and k-means clustering identified document groups.
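A sketch of the projection and clustering step using scikit-learn's t-SNE; the cluster count and perplexity are assumptions, and UMAP (from the umap-learn package) can be substituted via an equivalent `fit_transform` call.

```python
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def embed_and_cluster(X_tfidf, n_clusters: int = 5, random_state: int = 42):
    """Project TF-IDF vectors to 2-D for plotting and assign k-means cluster labels."""
    dense = X_tfidf.toarray()                      # t-SNE expects a dense array
    coords = TSNE(n_components=2, perplexity=30,
                  random_state=random_state).fit_transform(dense)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(dense)
    return coords, labels
```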
Topic Modeling
Latent Dirichlet Allocation (LDA) with 5 topics revealed underlying themes in CRL content (clinical, manufacturing, safety, labeling, statistical).
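A minimal LDA sketch; scikit-learn's implementation is conventionally fit on raw term counts rather than TF-IDF weights, and the vectorizer settings here are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def fit_topics(documents: list[str], n_topics: int = 5):
    """Fit LDA on raw term counts and return per-document topic distributions."""
    counts = CountVectorizer(stop_words="english", min_df=2).fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    doc_topics = lda.fit_transform(counts)         # shape: (n_documents, n_topics)
    return lda, doc_topics
```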
Machine Learning Models
Binary classification task: predict whether a CRL will lead to eventual approval.
Models Evaluated
- Logistic Regression: Linear baseline with L2 regularization
- Random Forest: Ensemble of 100 decision trees, max depth 10
- Gradient Boosting: XGBoost with 100 estimators, learning rate 0.1
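A configuration sketch of the three models with the hyperparameters listed above; arguments not stated in the list (solver defaults, `random_state`, `eval_metric`) are illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

models = {
    "logistic_regression": LogisticRegression(penalty="l2", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=10,
                                            random_state=42),
    "gradient_boosting": XGBClassifier(n_estimators=100, learning_rate=0.1,
                                       eval_metric="logloss", random_state=42),
}
```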
Feature Engineering
- Binary flags for each deficiency category
- One-hot encoding of application type (NDA/BLA/ANDA)
- Document metadata (page count, text length)
- Key flags (safety concerns, new trial required)
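A sketch of how one feature row could be assembled from the parsed metadata and deficiency flags defined in the earlier sketches; the helper name and field choices are hypothetical.

```python
def build_features(doc: dict, flags: dict[str, bool]) -> dict:
    """Assemble one feature row per CRL (rows can then be stacked into a pandas DataFrame)."""
    row = {f"deficiency_{name}": int(hit) for name, hit in flags.items()}
    row.update({
        "page_count": doc["page_count"],
        "text_length": len(doc["text"]),
        "safety_concern": int(flags.get("safety", False)),
        "requires_new_trial": int("new clinical trial" in doc["text"].lower()),
    })
    # One-hot encoding of the application type (NDA/BLA/ANDA).
    for app_type in ("NDA", "BLA", "ANDA"):
        row[f"type_{app_type}"] = int(doc.get("application_type") == app_type)
    return row
```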
Validation
Five-fold stratified cross-validation was performed on ~240 CRLs (the 80% training split), and a final test set of ~60 CRLs (20%) was held out for unbiased performance evaluation.
Best Model: Gradient Boosting achieved 85.6% cross-validation accuracy, significantly outperforming the 68% majority-class baseline.
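A sketch of the split-and-validate procedure, assuming a feature matrix `X` and binary approval labels `y` built as above, with the held-out 20% reserved for the final evaluation.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

def evaluate(model, X, y):
    """80/20 stratified holdout, then 5-fold stratified CV on the training split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std(), (X_test, y_test)  # hold-out kept for final scoring
```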
Limitations & Caveats
Small Sample Size
A corpus of ~300 CRLs is small for robust machine learning. Confidence intervals are wide, and model generalization to future CRLs is uncertain.
Temporal Bias
Unapproved CRLs are recent (2024-2025), while approved CRLs span 2020-2024. "Unapproved" drugs may simply need more time to resubmit rather than being fundamentally unapprovable.
Redaction & Data Quality
Heavy redactions obscure proprietary details, drug names and dates are often missing, and OCR errors in older PDFs reduce text quality.
Regulatory Context
FDA standards and policies evolve over time. Patterns from 2020-2025 may not generalize to future years or therapeutic areas not well-represented in this dataset.
References & Tools
- Python 3.10+ with pandas, scikit-learn, matplotlib, NLTK
- Next.js 15, React 18, Recharts for interactive visualizations
- OpenFDA API and public CRL database
- BMJ 2015 analysis of Complete Response Letters: BMJ 2015;350:h2758 (doi:10.1136/bmj.h2758)