Building the OpenADMET Data Engine
For those new to OpenADMET, we are an open-science consortium dedicated to building predictive models of safety and toxicity for small molecules, with the goal of helping humanity treat disease more reliably, cheaply, and effectively. We aim to achieve this by developing new assays, datasets, benchmarks, and competitions for ADMET, much as CASP did for protein folding 1.
Our recent ExpansionRx Challenge highlighted a critical gap: among 370 participants, the top four winners relied on proprietary data. This is not surprising, as public ADMET datasets are sparse, poorly documented, poorly controlled, noisy, and contradictory 2. Worse, the decisions made during data generation are often opaque or explicitly hidden. This creates a “black box” in which ML practitioners must guess the quality, context, and limitations of the inputs, so combining data from mismatched assay sources often degrades performance rather than improving it.
OpenADMET is developing a scaled and consistent data engine to support predictive ADMET models. We are applying the technologies and rigor of target-based drug discovery to the mechanisms underlying ADMET properties, starting with metabolism. Octant is a small-molecule drug discovery company that has built a platform to measure the interface between biology and chemistry at scale. As a founding member of OpenADMET, Octant is leveraging this infrastructure to build diverse, dense, and self-consistent ADMET datasets. More importantly, we don’t see data generation as a “black box” service, but as a “glass box” collaboration. We aim to establish a tight feedback loop among experimentalists, data scientists, and ML practitioners to ensure the assays, datasets, benchmarks, and competitions are as useful as possible. We hope that you, the community, will help us improve, and we welcome your feedback.
Our first data release
In this post, we are releasing a preview of our first datasets generated at Octant for OpenADMET: CYP inhibition and reaction phenotyping.
We prioritized metabolism assays because they are among the most critical determinants of both a drug’s exposure and its ADMET liabilities. Cytochrome P450 (CYP) enzymes are the workhorses of phase I drug metabolism 3, driving the oxidation of nonpolar xenobiotics into polar intermediates. Consequently, CYP inhibition is a major driver of drug-drug interactions (DDIs) and is intimately linked to complex, interrelated metabolic phenomena (e.g., PXR induction of CYP3A4 4).
To validate our data generation platform, we screened approximately 1,200 compounds for both CYP reactivity (CYP2J2 and CYP3A4) and CYP inhibition (CYP3A4). Crucially, we aimed to improve cost, capacity, and confidence simultaneously by miniaturizing reaction volumes, introducing robust controls, and replacing standard LC-MS with acoustic ejection mass spectrometry (Echo-MS) for reactivity.
Details of our design decisions are provided in the sections below, and a snapshot of the data is shown in Figures 1, 2, and 3. While this is just a teaser dataset in anticipation of a future blind challenge, it is, to our knowledge, one of the largest publicly available, internally consistent CYP datasets. The full dataset is available on GitHub and HuggingFace and includes the following:
Dataset 1: CYP Reaction Phenotyping
Determination of which CYP enzymes metabolize a drug candidate by testing against individual recombinant CYP enzymes
- Targets: CYP3A4 and CYP2J2
- Scale: 1,221 compounds from a diversity chemical library
- Data: Well-level observed peak areas and calculated depletion ratios.
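To make the relationship between well-level peak areas and a depletion ratio concrete, here is a minimal sketch. It assumes the ratio is the mean treatment peak area divided by the mean control peak area across replicate wells; the exact definition used in the released files may differ, and the function name, dictionary keys, and numbers below are hypothetical.

```python
import math
from statistics import mean


def depletion_summary(control_areas, treatment_areas):
    """Summarize substrate depletion from replicate peak areas.

    Assumption: depletion ratio = mean(treatment) / mean(control).
    """
    ratio = mean(treatment_areas) / mean(control_areas)
    return {
        "fraction_remaining": ratio,
        "percent_depleted": 100.0 * (1.0 - ratio),
        "log10_fold_change": math.log10(ratio),
    }


# Hypothetical peak areas for four replicate wells per condition.
control = [1000, 1100, 900, 1000]
treatment = [400, 420, 380, 400]
summary = depletion_summary(control, treatment)
print(summary)
```

Averaging before taking the ratio is only one option; a per-well ratio or a median may be more robust when replicates contain outliers.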
Dataset 2: CYP Inhibition
- Targets: CYP3A4
- Scale: 1,343 compounds (12-point dose-response curves), intersecting highly with the compounds assayed for reaction phenotyping.
- Assay Conditions: These data were collected following a 30-minute pre-incubation with active CYP3A4 to capture inhibition by both the parent molecule and any metabolites generated.
- Data: Well-level fluorescence values, calculated IC50 values, and curve-fit statistics.
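For readers who want to re-fit IC50 values from the well-level fluorescence, the sketch below illustrates one simple approach: a four-parameter logistic model fitted by brute-force least squares over a grid. This is not the curve-fitting procedure used to produce the released values, which is not described here. The dilution series (12 points, 3-fold steps from 100 µM), the fixed top and bottom plateaus, and all function names are assumptions made for illustration.

```python
def four_pl(conc, bottom, top, ic50, hill):
    """Four-parameter logistic: signal falls from top to bottom as conc rises."""
    return bottom + (top - bottom) / (1.0 + (conc / ic50) ** hill)


def fit_ic50_grid(concs, signals, bottom=0.0, top=100.0):
    """Grid-search least squares over log10(IC50) and Hill slope.

    Assumption: top and bottom are fixed (e.g. from plate controls).
    Returns (ic50, hill) in the same concentration units as concs.
    """
    best = None
    for i in range(-300, 301):  # log10(IC50) from -3.00 to 3.00 in 0.01 steps
        ic50 = 10 ** (i / 100)
        for hill in (0.5, 0.75, 1.0, 1.5, 2.0):
            sse = sum(
                (four_pl(c, bottom, top, ic50, hill) - s) ** 2
                for c, s in zip(concs, signals)
            )
            if best is None or sse < best[0]:
                best = (sse, ic50, hill)
    return best[1], best[2]


# Hypothetical 12-point, 3-fold dilution series starting at 100 µM.
concs_um = [100 / 3 ** k for k in range(12)]
# Synthetic, noise-free signals for a compound with IC50 = 1 µM, Hill = 1.
signals = [four_pl(c, 0.0, 100.0, 1.0, 1.0) for c in concs_um]
ic50_um, hill = fit_ic50_grid(concs_um, signals)
print(ic50_um, hill)
```

In practice a nonlinear optimizer with all four parameters free is the more common choice; the grid search is used here only to keep the example dependency-free.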
Note: Our platform also captures raw data artifacts, such as mzML spectra from the mass spectrometer. To keep the repository accessible, we have not included these large files in the initial release. However, if you believe raw spectral features would improve your modeling, please let us know. We are happy to provide them in future releases.
The data show a rich set of active molecules measured against the three reactivity and inhibition endpoints. For CYP reactivity, we observe 750 molecules (~61%) that are depleted >50% in the CYP3A4 reactivity assay (Fig 1A, 1B), and 166 (~13%) for CYP2J2 (Fig 1C, 1D). The larger set of active molecules for CYP3A4 is consistent with its well-known broad substrate recognition. In this first evaluation of CYP inhibition, we included a CYP3A4 preincubation step prior to the inhibition assay to capture inhibition effects by both parent molecules and any generated metabolites. As a consequence, the IC50 values reported here reflect the combined effect of reversible inhibition and any time-dependent processes that occur during preincubation, rather than reversible inhibition alone. Using this approach, we identify 1,095 molecules (79%) with detectable inhibition (IC50 < 100 µM) (Fig 1E).
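The two activity thresholds quoted above (>50% depletion, IC50 < 100 µM) and the pIC50 scale used in Figure 1 can be expressed in a few lines. This is a minimal sketch assuming IC50 values are given in µM and depletion is expressed as fraction of parent remaining; the function names are hypothetical and do not correspond to columns in the released data.

```python
import math


def pic50_from_um(ic50_um):
    """pIC50 = -log10(IC50 in molar); with IC50 in µM this is 6 - log10(IC50)."""
    return 6.0 - math.log10(ic50_um)


def has_detectable_inhibition(ic50_um, cutoff_um=100.0):
    """Detectable inhibition as defined in the text: IC50 below 100 µM."""
    return ic50_um < cutoff_um


def is_depleted(fraction_remaining, cutoff=0.5):
    """More than 50% depleted means less than half of the parent remains."""
    return fraction_remaining < cutoff


print(pic50_from_um(100.0))  # the 100 µM cutoff corresponds to pIC50 = 4
print(has_detectable_inhibition(2.5), is_depleted(0.4))
```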
Figure 1: CYP reactivity and inhibition screening overview. (A, C) Distribution of log₁₀ fold-change in peak area (treatment vs control) for reactivity with CYP3A4 (A) and CYP2J2 (C) across the compound library. Dashed line indicates no change. (B, D) Hexbin density plots of reactivity data comparing log₁₀ peak area in control vs treatment conditions for CYP3A4 (B) and CYP2J2 (D). Each bin is colored according to the number of molecules captured within. Red dashed line indicates the line of identity (no-change). (E) Distribution of CYP3A4 pIC₅₀ values from the inhibition assay, colored by IC₅₀ potency bin. (F) CYP3A4 reactivity (log₂ fold-change) as a function of inhibition potency (pIC₅₀), colored by IC₅₀ bin.
Staying true to our ‘glass box’ approach, we surface well-level data, as shown in Figure 2 below. This enables exploration not only of the chemical diversity of compounds with variable degradation, but also of how each of the four replicates performs under each condition.
Figure 2: Interactive comparison of CYP3A4 vs CYP2J2 reactivity. Each point represents a single molecule, and marginal density curves show the distributions for each enzyme. Hover to view molecular structure and raw peak area swarm plots per compound.
Taking this further, comparing molecules’ CYP3A4 inhibition with reactivity provides additional context and a more complete picture of the underlying assays (Fig 1F, Fig 3). Compounds that fail to inhibit CYP3A4 are rarely substrates, consistent with a lack of binding affinity. Conversely, potent inhibitors often exhibit modest reactivity, indicating binding modes that enable occupancy of the active site but not catalysis. Some of these potent inhibitors are possibly time-dependent, given the active CYP pre-incubation step. These nuanced relationships are often invisible in public datasets where measurements are aggregated from disparate sources. By capturing self-consistent, multi-endpoint data, we enable the kind of multi-task modeling required to generalize up the complexity stack: from recombinant enzymes to microsomes, hepatocytes, and eventually, in vivo behavior.
Figure 3: Interactive CYP3A4 inhibition potency vs reactivity graph. Each point represents a compound plotted by IC50 (x-axis, log scale) and percent remaining in the CYP3A4 reaction phenotyping assay (y-axis). Marginal densities shown along each axis. Hover to view molecular structure, raw peak area swarm plot, and dose-response curve per compound.
The following sections detail the specific trade-offs we faced in building these assays. For an immediate deep dive into the full protocols and data, including both summary statistics and well-level readouts, please refer to this GitHub repo and HuggingFace dataset.
Importantly, this is just the start of a large-scale data generation campaign across these and other endpoints. Watch for an announcement in the next 2 weeks for an OpenADMET challenge using Octant-generated data.
What’s next?
Beyond the Tier 1 reactivity and inhibition screens described above, we are actively developing additional endpoints to profile CYP activity (Fig 7):
- CYP Clearance Assay: an automated, 2 µL 1536-well time-course reactivity assay to provide kinetic depth (Fig 7A).
- Time-Dependent Inhibition (TDI): Identifies inhibitors that grow more potent with sustained enzyme exposure, a critical DDI liability (Fig 7B).
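One common way to summarize a depletion time course like the clearance assay in the first bullet is a log-linear fit that yields a rate constant and half-life. The sketch below assumes simple first-order depletion, which is an assumption of this example and not a statement about how the clearance data will be analyzed (Figure 7A uses loess fits). The time points, peak areas, and function name are hypothetical.

```python
import math


def first_order_fit(times_min, peak_areas):
    """Least-squares fit of ln(peak area) vs time.

    Assumption: first-order depletion, area(t) = area(0) * exp(-k * t).
    Returns (k in 1/min, half-life in min).
    """
    ys = [math.log(a) for a in peak_areas]
    n = len(times_min)
    t_mean = sum(times_min) / n
    y_mean = sum(ys) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times_min, ys))
    sxx = sum((t - t_mean) ** 2 for t in times_min)
    k = -sxy / sxx
    # A non-positive k means no measurable depletion over the time course.
    half_life = math.log(2) / k if k > 0 else math.inf
    return k, half_life


# Synthetic, noise-free time course with k = 0.02 per minute.
times = [0, 15, 30, 60, 120]
areas = [1000 * math.exp(-0.02 * t) for t in times]
k, t_half = first_order_fit(times, areas)
print(k, t_half)
```

First-order behavior generally holds only when the substrate concentration is well below its Km, which is one reason a multi-concentration design is informative.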
Figure 7: Characterization of CYP metabolic clearance and time-dependent inhibition (TDI) assays in 1536-well format. (A) Metabolic depletion curves for CYP3A4 substrates AdipoRon and Eletriptan across five starting concentrations (1.5–20 µM), measured by Echo-MS peak area over 120 minutes of CYP3A4 incubation. Points show individual replicates; lines are loess fits. (B) Time-dependent inhibition assessment. Left: Troleandomycin dose-response curves with (black) and without (grey) 30-minute CYP3A4 preincubation. Red shaded region and arrow highlight the pIC50 shift between conditions. Right: Interactive pIC50 shift estimates (mean ± 95% CI) for known TDI compounds (troleandomycin, azamulin, verapamil, diltiazem) and non-TDI controls. Dotted line indicates the 2-fold shift cutoff for TDI classification. Mouse over points to see underlying DRC curve shifts.
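The pIC50 shift and 2-fold cutoff described in the caption above reduce to a short calculation. This is a minimal sketch, assuming the shift is the pIC50 with preincubation minus the pIC50 without, and that a compound is flagged when the shift meets or exceeds the cutoff; whether the boundary is inclusive is an assumption, and the IC50 values shown are hypothetical, not measurements.

```python
import math


def pic50_shift(ic50_no_preinc_um, ic50_preinc_um):
    """pIC50(with preincubation) - pIC50(without) = log10(IC50_without / IC50_with)."""
    return math.log10(ic50_no_preinc_um / ic50_preinc_um)


def is_tdi(shift, fold_cutoff=2.0):
    """Flag time-dependent inhibition when the IC50 drops by at least fold_cutoff."""
    return shift >= math.log10(fold_cutoff)


# Hypothetical compound: IC50 falls from 10 µM to 1 µM after preincubation.
shift = pic50_shift(10.0, 1.0)
print(shift, is_tdi(shift))
```

A 2-fold cutoff corresponds to a pIC50 shift of about 0.30 log units, so the confidence interval on the shift matters when an estimate sits close to that line.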
We plan to expand to additional CYP isoforms and scale to tens of thousands of molecules. As our datasets grow, we will launch a blind challenge later in 2026 with over 20,000 CYP reactivity and inhibition data points across more than four CYPs, along with the nuclear receptors PXR and AHR.
Acknowledgements
The authors would like to acknowledge experimental data, analysis and scientific input contributed by Ana Lindahl, Lauren Orr and Ayesha Ghazali for this blog post.
This work was supported by funding from ARPA-H and the Astera Foundation.
We would also like to acknowledge technical support for our CYP assay development from SCIEX and Discovery Life Sciences.
Get Involved
Key to the success of the OpenADMET consortium is hearing from the broader community of ADMET and ML scientists, so tell us what matters. Which endpoints (MetID? reactive metabolites?) should we prioritize next? What data formats would slot seamlessly into your model training? What experimental details are you most curious about? Join the conversation on the OpenADMET assay development Discord, or reach out to us directly.
Data Accessibility
Datasets are available in Parquet format on HuggingFace. Blog post source code and assay protocols can be found on GitHub.
Last updated: March 03, 2026



