
Deepchem
Look up DeepChem loaders, datasets, splitters, and model APIs while building chemistry or molecular ML pipelines in Python.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill deepchem
What is this skill?
- File-format loaders: CSV, UserCSV, SDF, Json, Image
- Biological loaders: FASTA, FASTQ, SAM/BAM/CRAM alignment formats
- Dataset types: NumpyDataset, DiskDataset, ImageDataset for memory vs scale
- Splitter catalog: Random, Index, Specified, RandomStratified, SingletaskStratified
- Specialized loaders including DFTYamlLoader and InMemoryLoader
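As a rough illustration of what the two simplest entries in the splitter catalog do, the sketch below reimplements an order-based split and a seeded random split over plain index lists. It is standard-library Python, not DeepChem code: the function names `index_split` and `random_split` and the 80/10/10 default fractions are assumptions made for this example.

```python
import random


def index_split(n, frac_train=0.8, frac_valid=0.1):
    """Order-based split: the first block trains, the next validates, the rest tests."""
    train_end = int(frac_train * n)
    valid_end = int((frac_train + frac_valid) * n)
    indices = list(range(n))
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


def random_split(n, frac_train=0.8, frac_valid=0.1, seed=0):
    """Same cut points as index_split, applied to a seeded shuffle of the indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, so global random state is untouched
    train_end = int(frac_train * n)
    valid_end = int((frac_train + frac_valid) * n)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


# Hypothetical dataset of 10 samples, referenced only by position.
train, valid, test = index_split(10)
r_train, r_valid, r_test = random_split(10, seed=42)
```

Fixing the seed makes the random split reproducible across runs, while the order-based split depends only on how the dataset is already sorted.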
Adoption & trust: 515 installs on skills.sh; 27.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
DeepChem integration and dataset wiring happen during product build, when you implement predictive or cheminformatics features. The backend subphase fits Python ML services, batch inference, and data pipeline code that agents scaffold around scientific libraries.
Common Questions / FAQ
Is Deepchem safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# DeepChem API Reference

This document provides a comprehensive reference for DeepChem's core APIs, organized by functionality.

## Data Handling

### Data Loaders

#### File Format Loaders

- **CSVLoader**: Load tabular data from CSV files with customizable feature handling
- **UserCSVLoader**: User-defined CSV loading with flexible column specifications
- **SDFLoader**: Process molecular structure files (SDF format)
- **JsonLoader**: Import JSON-structured datasets
- **ImageLoader**: Load image data for computer vision tasks

#### Biological Data Loaders

- **FASTALoader**: Handle protein/DNA sequences in FASTA format
- **FASTQLoader**: Process FASTQ sequencing data with quality scores
- **SAMLoader/BAMLoader/CRAMLoader**: Support sequence alignment formats

#### Specialized Loaders

- **DFTYamlLoader**: Process density functional theory computational data
- **InMemoryLoader**: Load data directly from Python objects

### Dataset Classes

- **NumpyDataset**: Wrap NumPy arrays for in-memory data manipulation
- **DiskDataset**: Manage larger datasets stored on disk, reducing memory overhead
- **ImageDataset**: Specialized container for image-based ML tasks

### Data Splitters

#### General Splitters

- **RandomSplitter**: Random dataset partitioning
- **IndexSplitter**: Split by dataset order (the first portion for training, then validation, then test)
- **SpecifiedSplitter**: Use pre-defined splits
- **RandomStratifiedSplitter**: Stratified random splitting
- **SingletaskStratifiedSplitter**: Stratified splitting for single tasks
- **TaskSplitter**: Split for multitask scenarios

#### Molecule-Specific Splitters

- **ScaffoldSplitter**: Divide molecules by structural scaffolds (prevents data leakage)
- **ButinaSplitter**: Clustering-based molecular splitting
- **FingerprintSplitter**: Split based on molecular fingerprint similarity
- **MaxMinSplitter**: Maximize diversity between training/test sets
- **MolecularWeightSplitter**: Split by molecular weight properties

**Best Practice**: For drug discovery tasks, use ScaffoldSplitter so that molecules sharing a scaffold do not land in both the training and test sets, which would otherwise inflate test performance on similar molecular structures.

### Transformers

#### Normalization

- **NormalizationTransformer**: Standard normalization (mean=0, std=1)
- **MinMaxTransformer**: Scale features to [0,1] range
- **LogTransformer**: Apply log transformation
- **PowerTransformer**: Raise the data to each power in a user-supplied list
- **CDFTransformer**: Cumulative distribution function normalization

#### Task-Specific

- **BalancingTransformer**: Address class imbalance
- **FeaturizationTransformer**: Apply dynamic feature engineering
- **CoulombFitTransformer**: Randomize and binarize Coulomb matrix features for quantum chemistry models
- **DAGTransformer**: Directed acyclic graph transformations
- **RxnSplitTransformer**: Chemical reaction preprocessing

## Molecular Featurizers

### Graph-Based Featurizers

Use these with graph neural networks (GCNs, MPNNs, etc.):

- **ConvMolFeaturizer**: Graph representations for graph convolutional networks
- **WeaveFeaturizer**: "Weave" graph embeddings
- **MolGraphConvFeaturizer**: Graph convolution-ready representations
- **EquivariantGraphFeaturizer**: Maintains geometric invariance
- **DMPNNFeaturizer**: Directed message-passing neural network inputs
- **GroverFeaturizer**: Pre-trained molecular embeddings

### Fingerprint-Based Featurizers

Use these with traditional ML (Random Forest, SVM, XGBoost):

- **MACCSKeysFingerprint**: 167-bit structural keys
- **CircularFingerprint**: Extended connectivity fingerprints (Morgan fingerprints)
  - Parameters: `radius` (default 2), `size` (default 2048), `chiral` (default False)
- **PubChemFingerprint**: 881-bit structural descriptors
- **Mol2VecFingerprint**: Learned molecular vector representations

### Descriptor Featurizers

Calculate molecular properties directly:

- **RDKitDescriptors**: ~200 molecular descriptors (MW, LogP, H-donors, H-acceptors, TPSA, etc.)
- **MordredDescriptors**: Comprehensive structural and physicochemical descriptors
- **CoulombMatrix**: Coulomb interaction matrices computed from 3D structures

### Seq