Week 1 Day 3 — DrugComb AML Subset ETL
Summary
Full ETL on the 1.4M-row DrugComb v1.5 summary table (summary_v_1_5.csv, 1.3 GB)
completed in ~6 seconds via 100K-row streaming chunks. Output is three
canonical tables that Week 3 will consume.
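The chunked pass described above can be sketched roughly as follows. This is a minimal stdlib illustration, not the project's actual ETL: the column name `cell_line_name`, the partial `AML_LINES` set, and the helper names are assumptions, and the in-memory CSV stands in for summary_v_1_5.csv. A pandas `read_csv(..., chunksize=...)` loop would follow the same shape.

```python
import csv
import io
from itertools import islice

# Assumed column name and a partial cell-line set, for illustration only.
CELL_LINE_COL = "cell_line_name"
AML_LINES = {"HL-60", "MV4-11", "MOLM-13", "THP-1"}


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv.DictReader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def filter_aml_rows(handle, chunk_size=100_000):
    """Stream a CSV file handle and keep only rows on AML cell lines."""
    reader = csv.DictReader(handle)
    scanned, kept = 0, []
    for chunk in iter_chunks(reader, chunk_size):
        scanned += len(chunk)
        kept.extend(row for row in chunk if row[CELL_LINE_COL] in AML_LINES)
    return scanned, kept


# Tiny in-memory stand-in for the real 1.3 GB file.
sample = io.StringIO(
    "drug_row,drug_col,cell_line_name\n"
    "Cytarabine,,HL-60\n"
    "Erlotinib,Lapatinib,A549\n"
    "Venetoclax,Azacitidine,MOLM-13\n"
)
scanned, kept = filter_aml_rows(sample, chunk_size=2)
print(scanned, len(kept))  # 3 rows scanned, 2 kept
```

Only the filtered rows are held in memory, which is why the chunk size mainly trades peak memory against per-chunk overhead.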
Key numbers
| Stage | Count |
|---|---|
| Total DrugComb rows scanned | 1,432,351 |
| Rows on AML cell lines (13 canonical lines) | 13,877 |
| True combination rows (drug1 ≠ drug2, both present) | 4,938 |
| Monotherapy rows (drug2 = nan) | 8,939 |
| Unique drug names in AML subset | 769 |
| Drug names aligned to BeatAML vocab | 88 (80 exact + 8 fuzzy) |
| Strict combo pairs (both drugs in BeatAML) | 186 |
| Loose combo pairs (≥1 drug in BeatAML) | 1,708 |
| Monotherapy rows with BeatAML drug | 1,603 |
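The strict and loose counts differ only in how many drugs of a pair must be in the BeatAML vocabulary. A small sketch of that rule, assuming names are already normalized; the function name `match_level` and the toy vocabulary are hypothetical:

```python
from collections import Counter


def match_level(drug1, drug2, vocab):
    """Return 'strict' (both drugs in vocab), 'one_drug', or 'none'."""
    hits = (drug1 in vocab) + (drug2 in vocab)
    return {2: "strict", 1: "one_drug", 0: "none"}[hits]


# Toy vocabulary and pairs, not real DrugComb or BeatAML data.
vocab = {"cytarabine", "venetoclax", "sorafenib"}
pairs = [
    ("cytarabine", "venetoclax"),
    ("cytarabine", "topotecan"),
    ("topotecan", "vinblastine"),
    ("sorafenib", "venetoclax"),
]

levels = Counter(match_level(a, b, vocab) for a, b in pairs)
strict_count = levels["strict"]
# "Loose" means at least one drug matched, so it includes the strict pairs.
loose_count = levels["strict"] + levels["one_drug"]
print(strict_count, loose_count)  # 2 3
```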
Critical finding: combination data is ALMANAC-HL-60 only
The 186 strict pairs all live on a single cell line — HL-60 — and all come from the NCI ALMANAC screen. Other AML lines (MV4-11, MOLM-13, THP-1, etc.) appear in DrugComb but only as monotherapy screens from CTRPv2/GDSC1/FIMM/CCLE/gCSI.
Implications for Week 3
This constrains the combo predictor design. We cannot learn how patient biology modulates combination response from one cell line. The architecture must factor:
- Single-drug response → from BeatAML (613 patients × 165 drugs, already trained in Baseline A)
- Combo adjustment → learned from ALMANAC-HL-60 (186 pairs)
- Patient extrapolation → mechanism prior + drug-drug interaction embedding
Revised factorization:
combo_pred(patient, d1, d2) = single_pred(patient, d1)
+ single_pred(patient, d2)
+ combo_adjustment(d1, d2) ← ALMANAC-HL-60 residual
+ mechanism_prior(patient, d1, d2) ← knowledge-graph term
The HL-60 residual captures pair-level chemistry, while the mechanism prior and the per-patient single-drug predictions carry the patient-specific signal.
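The additive factorization can be wired together as a thin composition of three callables. The sketch below is illustrative only: the stub models, the toy numbers, the symmetric `frozenset` key, and the choice of 0.0 for unseen pairs are all assumptions, not the Week 3 design.

```python
def make_combo_predictor(single_pred, combo_adjustment, mechanism_prior):
    """Compose the four additive terms into one combo_pred function."""
    def combo_pred(patient, d1, d2):
        return (
            single_pred(patient, d1)
            + single_pred(patient, d2)
            + combo_adjustment(d1, d2)
            + mechanism_prior(patient, d1, d2)
        )
    return combo_pred


# Toy stand-ins for the trained components (hypothetical values).
single_table = {("p1", "drugA"): 0.5, ("p1", "drugB"): 0.25}
residual_table = {frozenset({"drugA", "drugB"}): -0.125}


def single_pred(patient, drug):
    return single_table[(patient, drug)]


def combo_adjustment(d1, d2):
    # Assumption: the residual is symmetric in (d1, d2); unseen pairs get 0.0.
    return residual_table.get(frozenset({d1, d2}), 0.0)


def mechanism_prior(patient, d1, d2):
    return 0.0  # placeholder for the knowledge-graph term


combo_pred = make_combo_predictor(single_pred, combo_adjustment, mechanism_prior)
print(combo_pred("p1", "drugA", "drugB"))  # 0.625
```

Keeping `combo_adjustment` free of the patient argument makes the constraint explicit: the HL-60 residual cannot vary by patient, so any patient dependence has to enter through the other terms.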
Outputs
data/canonical/
├── drugcomb_aml_pairs.csv # 186 strict pairs (both drugs in BeatAML)
├── drugcomb_aml_pairs_any_match.csv # 1,708 loose pairs (≥1 drug in BeatAML)
├── drugcomb_aml_monotherapy.csv # 1,603 single-drug AML cell-line screens
├── drugcomb_drug_alignment.csv # 769 names × source: no_match / exact / fuzzy
└── drugcomb_filter_manifest.json # reproducibility manifest
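The `no_match / exact / fuzzy` labels in the alignment file could be produced by a two-stage lookup like the one below. The source does not say which fuzzy matcher was used, so this sketch substitutes the standard library's `difflib.get_close_matches`; the 0.9 cutoff, the function name `align_name`, and the toy vocabulary are assumptions.

```python
import difflib


def align_name(name, vocab, cutoff=0.9):
    """Return (matched_vocab_name_or_None, source_label) for one drug name."""
    key = name.strip().lower()
    lookup = {v.lower(): v for v in vocab}
    if key in lookup:
        return lookup[key], "exact"
    # Fall back to the single closest vocabulary entry above the cutoff.
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    if close:
        return lookup[close[0]], "fuzzy"
    return None, "no_match"


# Toy vocabulary standing in for the BeatAML drug list.
vocab = ["Cytarabine", "Venetoclax", "Sorafenib"]
for raw in ["cytarabine", "Venetoclaxx", "Topotecan"]:
    print(raw, align_name(raw, vocab))
```

A high cutoff keeps fuzzy matches rare, which suits a setting where a wrong alignment would silently attach one drug's response data to another.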
Cell line coverage
| Cell line | Total rows | Combo rows | Mono rows |
|---|---|---|---|
| HL-60 | 5,661 | 1,709 (ALMANAC) | 3,952 |
| HEL | 1,167 | 0 | 1,167 |
| OCI-AML5 | 828 | 0 | 828 |
| THP-1 | 748 | 0 | 748 |
| MV4-11 | 732 | 0 | 732 |
| OCI-AML2 | 705 | 0 | 705 |
| OCI-AML3 | 699 | 0 | 699 |
| KASUMI-1 | 674 | 0 | 674 |
| MOLM-13 | 673 | 0 | 673 |
| NB4 | 623 | 0 | 623 |
| SET-2 | 552 | 0 | 552 |
| U-937 | 495 | 0 | 495 |
| KG-1 | 320 | 0 | 320 |
Tests
46/46 pass:
- `tests/test_drug_name_normalizer.py` (26 tests)
- `tests/test_drugcomb_etl.py` (20 tests, including end-to-end synthetic ETL)
Issues resolved
- Salt-form regex bug: original `r"\bmono?hydrochloride\b"` made the `o` optional instead of treating `mono` as an optional prefix. Fixed to `r"\b(?:mono|di|tri)?hydrochloride\b"`.
- Extended salt list: added `citrate, tartrate, phosphate, tosylate, ditosylate, fumarate, succinate, acetate, potassium, (TN)`. Recovered ~54 additional pair rows (Erlotinib, Pazopanib, Lapatinib salts).
- Monotherapy contamination: DrugComb v1.5 unified table mixes monotherapy (CTRPv2/GDSC/FIMM/CCLE/gCSI) with combination (ALMANAC). Added explicit `is_combo_row` filter so pair counts don't double-count single-drug screens.
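The regex fix and the combo-row filter are easy to check in isolation. The two patterns below are copied from the first bullet; `strip_salt`, the sample names, and this particular body of `is_combo_row` are hypothetical illustrations rather than the project's code.

```python
import re

BUGGY = re.compile(r"\bmono?hydrochloride\b", re.IGNORECASE)
FIXED = re.compile(r"\b(?:mono|di|tri)?hydrochloride\b", re.IGNORECASE)


def strip_salt(name, pattern):
    """Remove the salt suffix matched by pattern and collapse whitespace."""
    return " ".join(pattern.sub("", name).split())


# The buggy pattern requires the literal "mon", so plain "hydrochloride" survives.
print(strip_salt("Erlotinib hydrochloride", BUGGY))  # Erlotinib hydrochloride
print(strip_salt("Erlotinib hydrochloride", FIXED))  # Erlotinib
print(strip_salt("DrugX dihydrochloride", FIXED))    # DrugX


def is_combo_row(drug1, drug2):
    """True only when both drugs are present and differ (assumed rule)."""
    def present(value):
        # str(float("nan")) is "nan", so missing values are caught here too.
        return value is not None and str(value).strip().lower() not in {"", "nan"}

    if not (present(drug1) and present(drug2)):
        return False
    return str(drug1).strip().lower() != str(drug2).strip().lower()


print(is_combo_row("Cytarabine", float("nan")))  # False
print(is_combo_row("Cytarabine", "Venetoclax"))  # True
```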
Next (Day 4)
TCGA-LAML alignment — reuse existing align_tcga_public_to_templates.py from
AML-CRAFT. Output: data/canonical/tcga_laml_validation.csv with ~200 patients’
expression + mutation aligned to BeatAML template space for independent-cohort
validation in Week 5.