ChEMBL is a large-scale, manually curated bioactivity database that collects information on the interactions between small molecules (such as drugs and other compounds) and their biological targets. It is maintained by the European Bioinformatics Institute (EMBL-EBI), part of the European Molecular Biology Laboratory (EMBL), from which the name derives.
Researchers can use ChEMBL to search for specific compounds, investigate their interactions with biological targets (such as enzymes, receptors, or ion channels), and analyze bioactivity data to identify potential drug candidates or understand the mechanisms of action for existing drugs. Additionally, ChEMBL provides tools and APIs (Application Programming Interfaces) for accessing and querying the database programmatically, enabling integration with other bioinformatics and cheminformatics workflows.
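# Install the ChEMBL web resource client (notebook shell command)
!pip install chembl-webresource-client

import math
from pathlib import Path
from zipfile import ZipFile
from tempfile import TemporaryDirectory

import numpy as np
import pandas as pd
from rdkit.Chem import PandasTools
from chembl_webresource_client.new_client import new_client
from tqdm.auto import tqdm

# Create the API handles that the cells below rely on
targets_api = new_client.target
bioactivities_api = new_client.activity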
Get target data (EGFR kinase, UniProt ID: P00533)
uniprot_id = "P00533"
# Get target information from ChEMBL for specified values only
targets = targets_api.get(target_components__accession=uniprot_id).only(
    "target_chembl_id", "organism", "pref_name", "target_type"
)
print(f'The type of the targets is "{type(targets)}"')
The type of the targets is "<class 'chembl_webresource_client.query_set.QuerySet'>"
Fetch bioactivity data for the target_chembl_id: CHEMBL203
# Target ChEMBL ID for EGFR, as it appears in the target_chembl_id column of the results
chembl_id = "CHEMBL203"
# Fetch the bioactivity data for this (human) target and keep only IC50 values,
# exact measurements (relation "="), and binding data (assay_type "B")
bioactivities = bioactivities_api.filter(
    target_chembl_id=chembl_id, type="IC50", relation="=", assay_type="B"
).only(
    "activity_id",
    "assay_chembl_id",
    "assay_description",
    "assay_type",
    "molecule_chembl_id",
    "type",
    "standard_units",
    "relation",
    "standard_value",
    "target_chembl_id",
    "target_organism",
)
print(f"Length and type of bioactivities object: {len(bioactivities)}, {type(bioactivities)}")
Length and type of bioactivities object: 10420, <class 'chembl_webresource_client.query_set.QuerySet'>
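The keyword arguments passed to `filter` act as exact-match conditions that must all hold for a record to be returned. The following is a minimal offline sketch of that selection logic, not the client's implementation: `matches_all` is a hypothetical helper, and the four records are made-up stand-ins shaped like activity entries rather than real ChEMBL data.

```python
# Illustrative records only (not real ChEMBL data), shaped like activity entries.
records = [
    {"activity_id": 1, "type": "IC50", "relation": "=", "assay_type": "B"},
    {"activity_id": 2, "type": "Ki", "relation": "=", "assay_type": "B"},
    {"activity_id": 3, "type": "IC50", "relation": ">", "assay_type": "B"},
    {"activity_id": 4, "type": "IC50", "relation": "=", "assay_type": "F"},
]


def matches_all(record, **conditions):
    """Return True if every keyword condition equals the record's value."""
    return all(record.get(key) == wanted for key, wanted in conditions.items())


# Same three conditions as in the query above, applied to the local stand-ins.
selected = [
    r for r in records
    if matches_all(r, type="IC50", relation="=", assay_type="B")
]
print([r["activity_id"] for r in selected])
```

Only the record that satisfies all three conditions survives; a different measurement type, an inexact relation, or a non-binding assay type each excludes a record.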
# What's in here? Look at the first entry
bioactivities[0]
# Download into a DataFrame
bioactivities_df = pd.DataFrame.from_dict(bioactivities)
print(f"DataFrame shape: {bioactivities_df.shape}")
bioactivities_df.head()
DataFrame shape: (10420, 13)
| | activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | standard_units | standard_value | target_chembl_id | target_organism | type | units | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32260 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL68920 | = | nM | 41.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 0.041 |
| 1 | 32267 | CHEMBL674637 | Inhibitory activity towards tyrosine phosphory... | B | CHEMBL69960 | = | nM | 170.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 0.17 |
| 2 | 32680 | CHEMBL677833 | In vitro inhibition of Epidermal growth factor... | B | CHEMBL137635 | = | nM | 9300.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 9.3 |
| 3 | 32770 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL306988 | = | nM | 500000.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 500.0 |
| 4 | 32772 | CHEMBL674643 | Inhibitory concentration of EGF dependent auto... | B | CHEMBL66879 | = | nM | 3000000.0 | CHEMBL203 | Homo sapiens | IC50 | uM | 3000.0 |
# The units column has different values
bioactivities_df["units"].unique()
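In the five rows shown above, `value` and `units` hold each figure in uM, while `standard_value` and `standard_units` hold the same measurement in nM. As a hedged sketch, that relationship can be checked on those rows. The `TO_NM` factor table and the `to_nanomolar` helper are assumptions made for illustration, and the full dataset may contain unit strings that are not listed here.

```python
import math

# Conversion factors to nM (an assumed, non-exhaustive list of molar units).
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def to_nanomolar(value, units):
    """Convert a concentration to nM; raises KeyError for unknown units."""
    return value * TO_NM[units]


# (value, units, standard_value) from the first five rows of bioactivities_df.
rows = [
    (0.041, "uM", 41.0),
    (0.17, "uM", 170.0),
    (9.3, "uM", 9300.0),
    (500.0, "uM", 500000.0),
    (3000.0, "uM", 3000000.0),
]

# Compare with a tolerance, since floating-point multiplication is not exact.
consistent = [math.isclose(to_nanomolar(v, u), std) for v, u, std in rows]
print(consistent)
```

Because the standardized columns already share one unit in these rows, they are the more convenient ones to work with when `units` varies.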
# Keep only canonical_smiles
canonical_smiles = []
for i, compounds in compounds_df.iterrows():
    try:
        canonical_smiles.append(compounds["molecule_structures"]["canonical_smiles"])
    except KeyError:
        canonical_smiles.append(None)
compounds_df["smiles"] = canonical_smiles
compounds_df.shape
(6816, 3)
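The extraction loop can also be written without explicit iteration. This is a sketch of one alternative, not the notebook's method: it uses `Series.apply` with a hypothetical helper `extract_smiles` and runs on a two-row stand-in frame, `demo_df`. The first row reuses a compound that appears in the notebook output; the second row, with the placeholder ID `CHEMBL_EXAMPLE`, is invented to show the missing-structure case.

```python
import pandas as pd


def extract_smiles(structures):
    """Return the canonical SMILES from a molecule_structures dict, or None."""
    if isinstance(structures, dict):
        return structures.get("canonical_smiles")
    return None  # entry has no structure record at all


# Stand-in frame: one real compound and one placeholder without structures.
demo_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL6976", "CHEMBL_EXAMPLE"],
        "molecule_structures": [
            {"canonical_smiles": "COc1cc2c(cc1OC)Nc1ncn(C)c(=O)c1C2"},
            None,
        ],
    }
)

demo_df["smiles"] = demo_df["molecule_structures"].apply(extract_smiles)
print(demo_df[["molecule_chembl_id", "smiles"]])
```

Unlike catching only `KeyError`, the helper also tolerates rows where `molecule_structures` is not a dictionary, which would otherwise raise a `TypeError` on indexing.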
compounds_df.head()
| | molecule_chembl_id | molecule_structures | smiles |
|---|---|---|---|
| 0 | CHEMBL6246 | {'canonical_smiles': 'O=c1oc2c(O)c(O)cc3c(=O)o... | O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23 |
| 1 | CHEMBL10 | {'canonical_smiles': 'C[S+]([O-])c1ccc(-c2nc(-... | C[S+]([O-])c1ccc(-c2nc(-c3ccc(F)cc3)c(-c3ccncc... |
| 2 | CHEMBL6976 | {'canonical_smiles': 'COc1cc2c(cc1OC)Nc1ncn(C)... | COc1cc2c(cc1OC)Nc1ncn(C)c(=O)c1C2 |
| 3 | CHEMBL7002 | {'canonical_smiles': 'CC1(COc2ccc(CC3SC(=O)NC3... | CC1(COc2ccc(CC3SC(=O)NC3=O)cc2)CCCCC1 |
| 4 | CHEMBL414013 | {'canonical_smiles': 'COc1cc2c(cc1OC)Nc1ncnc(O... | COc1cc2c(cc1OC)Nc1ncnc(O)c1C2 |
# Are there missing smiles?
compounds_df[compounds_df["smiles"].isnull()]
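If that check does return rows, one common follow-up is to count and drop them before any structure-based processing. The sketch below assumes that dropping is the desired policy, which the notebook does not state. It works on a small stand-in frame, `demo_df`, whose second row uses the placeholder ID `CHEMBL_EXAMPLE`.

```python
import pandas as pd

# Stand-in frame: one compound from the notebook output, one placeholder row.
demo_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL6246", "CHEMBL_EXAMPLE"],
        "smiles": ["O=c1oc2c(O)c(O)cc3c(=O)oc4c(O)c(O)cc1c4c23", None],
    }
)

# Count rows without a SMILES string, then keep only the complete rows.
n_missing = int(demo_df["smiles"].isnull().sum())
cleaned_df = demo_df.dropna(subset=["smiles"]).reset_index(drop=True)
print(f"Missing SMILES: {n_missing}, rows kept: {len(cleaned_df)}")
```

Restricting `dropna` to the `smiles` column through `subset` avoids discarding rows that are only missing values in unrelated columns.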