Using Django ORM to access the ChEMBL DB without SQL

myChEMBL team, ChEMBL group, EMBL-EBI.

ORM stands from Object Relational Mapping. It is a technology that allows access to a database in a native object oriented manner. Using ORM has several advantages:

  1. You don't have to put SQL strings into your code
  2. Your code is no longer tied to a specific database backend (such us postgres)
  3. Code written using ORM is more intuitive

There is already one ORM interface for ChEMBL, called pychembldb. It's based on the popular SQLAlchemy library. To complete this picture, we provide ORM models based on Django. This notebook presents how complex DB queries can be built using Django ORM within a IPython Notebook.

Django models described here are used in ChEMBL production code, for example in ChEMBL web services. As a result, the models can be loaded into myChEMBL, which uses postgres and they still work in ChEMBL production environment, where Oracle is used. The same codebase works against two different DB engines without modifying a single line of code.

Please note that this notebook covers only most basic usage of Django ORM in the context of ChEMBL data. For more comprehensive description and advanced usage, read Making Queries Django Tutorial and QuerySet API reference.

In [1]:
# In order to use django ORM we have to provide a settings module. This module must be kept in a separate file.
# A minimal settings.py contains database connection details, logging and caching configuration as well as a list of 
# applications (python packages) defining models we are planning to use. Let's see how our settings.py file for this
# notebook will look like:
!cat /home/chembl/ipynb_workbench/settings.py
DEBUG = False

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2', # Add 'postgresql_psycopg2', 'mysql', 'sqlite3' or 'oracle'.
        'NAME': 'chembl_19', # Or path to database file if using sqlite3.
        'USER': 'chembl', # Not used with sqlite3.
        'PASSWORD': '', # Not used witis oracle
        'HOST': '', # Set to empty string for localhost. Not used with sqlite3.
        'PORT': '5432', # Set to empty string for default. Not used with sqlite3.
    },
}

EXPORT_MODE = True
CORE_TABLES_MANAGED = False
APP_SPECIFIC_TABLES_MANAGED = False
COMPOUND_MOLS_TABLE = 'mols_rdkit'
CTAB_COLUMN = 'm'

SECRET_KEY = '3v2xb&@&_kibf0o!4m249njy3!qjxptht0m%q2w&ry8v&ok$na'

INSTALLED_APPS = (
    'chembl_core_db',
    'chembl_core_model',
    )

In [2]:
# In order to configure django to use our settings we have to append the module containing it to system path and set 
# DJANGO_SETTINGS_MODULE environment variable.
import os
import sys
sys.path.append('/home/chembl/ipynb_workbench')
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
# That's it! We are ready to use our models.
Out[2]:
'settings'
In [3]:
# First, lets import our ChEMBL ORM models. The module is called 'chembl_migration_model' because it describes ChEMBL DB
# schema, that is used in my_ChEMBL VM and in public SQL dumps. Internally, ChEMBL group is working on slightly more complex
# schema, containing more tables and columns.
import chembl_migration_model

# OK, now we can take a look at all available models:
print '\n'.join([model for model in dir(chembl_migration_model.models) if not model.startswith('__')])

# As you see each model corresponds to single table in ChEMBL DB. The name of the model is the name of the corresponding table,
# but without underscores and written in camel case, i.e. ASSAY_PARAMETERS (DB table) => AssayParameters (ORM model)

# For convenience we will import model names into default namespece:
from chembl_migration_model.models import *

# We need CompoundMols model to perform substructure and similarity search:
from chembl_core_model.models import CompoundMols
ActionType
Activities
ActivityStdsLookup
AssayParameters
AssayType
Assays
AtcClassification
BindingSites
BioComponentSequences
BiotherapeuticComponents
Biotherapeutics
CellDictionary
ChemblIdLookup
ComponentClass
ComponentDomains
ComponentSequences
ComponentSynonyms
CompoundProperties
CompoundRecords
CompoundStructures
ConfidenceScoreLookup
CurationLookup
DataValidityLookup
DefinedDailyDose
Docs
Domains
DrugMechanism
Formulations
LigandEff
MechanismRefs
MoleculeAtcClassification
MoleculeDictionary
MoleculeHierarchy
MoleculeSynonyms
OrganismClass
ParameterType
PredictedBindingDomains
Products
ProteinClassSynonyms
ProteinClassification
ProteinFamilyClassification
RelationshipType
ResearchCompanies
ResearchStem
SiteComponents
Source
TargetComponents
TargetDictionary
TargetRelations
TargetType
UsanStems
Version

Getting objects count

In [4]:
# For starters, lets see how many unique compounds we currently have at ChEMBL:
MoleculeDictionary.objects.count()
Out[4]:
1411786
In [5]:
# How about tagets?
TargetDictionary.objects.count()
Out[5]:
10579
In [6]:
# Assays?
Assays.objects.count()
Out[6]:
1106285

Filtering objects

In [7]:
# We will learn filtering on CompoundPoperties model. First let's get all CompoundProperties objects with molecular
# formula = C9H8O4
compounds = CompoundProperties.objects.filter(full_molformula="C9H8O4")
# how many of them exist?
print compounds.count()
29

In [8]:
# OK, let's take the first object:
compound = compounds[0]
# We can access all properties of this compound, for example alogp, numner of aromatic rings, molecular weight, etc:
print compound.alogp
print compound.aromatic_rings
print compound.full_mwt
1.23
1
180.16

In [9]:
# What if we would like for a result to be a list of certain fields not a collection of whole objects?
results = CompoundProperties.objects.filter(full_molformula="C9H8O4").values_list('hba', 'hbd', 'psa')
print results
[(4, 1, Decimal('63.60')), (4, 3, Decimal('77.76')), (4, 0, Decimal('60.44')), (4, 1, Decimal('63.60')), (4, 2, Decimal('74.59')), (4, 1, Decimal('63.60')), (4, 2, Decimal('74.60')), (4, 2, Decimal('74.60')), (4, 1, Decimal('63.60')), (4, 1, Decimal('63.60')), (4, 1, Decimal('63.60')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 1, Decimal('55.76')), (4, 3, Decimal('77.75')), (4, 2, Decimal('66.75')), (4, 2, Decimal('74.59')), '...(remaining elements truncated)...']

In [10]:
# Perfect, can I include many conditions in filters?

# Yes! Following query will get only structures with formula = C9H8O4, hbd > 1 and psa between (60 and 70):
compounds = CompoundProperties.objects.filter(full_molformula="C9H8O4", hbd__gt=1, psa__range=(60.0, 70.0))
print compounds.count()
print compounds.values_list('full_mwt', flat=True)
2
[Decimal('180.16'), Decimal('180.16')]

In [11]:
# Additionally to 'filter' there is 'exclude' method as well:

# Let's select all structures, without aromatic rings:
non_aromatic = CompoundProperties.objects.exclude(aromatic_rings__gt=0)
print non_aromatic.count()

# Having non-aromatic structures, can we apply more fileters only to them?

# Yes, we can chain 'filter' and 'exclude' methods:

light_non_aromatic = non_aromatic.filter(full_mwt__lte=180.0)
print light_non_aromatic.count()
134439
7851

In [12]:
# Great, but so har we've only seen a conjunction of filters, what if I would like to have alternative?

# How would I select all structures with 2 aromatic rings OR with formula = C9H8O4?
# This is where we need django helper operator called 'Q' (stands from Query):
from django.db.models import Q

# Now using Q, we can ask for alternative:
aromatic_or_formula = CompoundProperties.objects.filter(Q(aromatic_rings__gt=2) | Q(full_molformula="C9H8O4"))
print aromatic_or_formula.count()
559822

Distinct values and ordering

In [13]:
# I would like to see all distinct avlues of the numer of aromatics rings across all compound structures:

print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct()
[None, 8, 12, 17, 19, 2, 1, 15, 3, 10, 11, 4, 5, 13, 9, 0, 7, 6]

In [14]:
# Hmm, the result is not ordered...

print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct().order_by('aromatic_rings')
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19, None]

In [15]:
# OK, but I need descending order...

print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct().order_by('-aromatic_rings')
[None, 19, 17, 15, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Limiting results

In [16]:
# I want just a ranking of 10 most heavy structures from ChEMBL:

heaviest = CompoundProperties.objects.order_by('-full_mwt')[0:10]
print heaviest.count()

# And the winner is:
winner = heaviest[0]
print winner.full_mwt
print winner.full_molformula

# Lets get a list of molformulas and wiights:
print '\n'.join(map(str, heaviest.values_list('full_molformula', 'full_mwt')))
10
13050.36
C614H1110N39O198P21S6
(u'C614H1110N39O198P21S6', Decimal('13050.36'))
(u'C396H390F252N66O24P42', Decimal('12546.26'))
(u'C540H960N39O198P21S6', Decimal('12010.38'))
(u'C504H888N39O198P21S6', Decimal('11505.43'))
(u'C324H318F252N42P42', Decimal('10888.77'))
(u'C468H828N27O198P21S6', Decimal('10844.48'))
(u'C480H828N27O186P21S6', Decimal('10796.62'))
(u'C325H387N118O209P29', Decimal('10188.45'))
(u'C319H383N118O209P29', Decimal('10112.36'))
(u'C318H382N119O208P29', Decimal('10097.35'))

Checking for existence

In [17]:
# Let's check if there is a compound with 2 aromatic rings and weight lesst than 200:

CompoundProperties.objects.filter(aromatic_rings=2, full_mwt__lt=200.0).exists()
Out[17]:
True

Getting single object

In [18]:
# Sometimes we want to be sure that only a single object has been selected, 
# this time let's take MoleculeDictionary as the example:

chembl_18 = MoleculeDictionary.objects.get(chembl_id='CHEMBL18')
print chembl_18.pref_name
ETHOXZOLAMIDE

In [19]:
# 'get' method will throw an exception in case when no results or more than one results were found:

try:
    imaginary = MoleculeDictionary.objects.get(chembl_id='CHEMBLX')
except MoleculeDictionary.DoesNotExist:
    print 'does not exist'
    
try:
    too_many = MoleculeDictionary.objects.get(prodrug=True)
except MoleculeDictionary.MultipleObjectsReturned:
    print 'too many'
does not exist
too many

The important feature of every chemistry database it the ability to perform substructure and similarity search. By default, Django is copletely chemistry unaware but 'chembl_core_model' brings this functionality.

In [20]:
# We can perform similarity search using CompoundMols object:
similar = CompoundMols.objects.similar_to('CC(=O)Oc1ccccc1C(=O)O', 80)
print similar.count()

# Due to some strane bug in Django we always have to mantion additional 'similarity' field but this is even better
# because this field is so useful :)
print '\n'.join(map(str, similar.values_list('molecule_id', 'similarity')))
5
(1280, 1.0)
(506546, 1.0)
(454071, 0.894736842105263)
(287927, 0.888888888888889)
(954218, 0.8)

In [21]:
# substructure search works the same way:
sub = CompoundMols.objects.with_substructure('CN(CCCN)c1cccc2ccccc12')
print sub.count()
72

The whole idea of having relational database is the ability to join related objecs together. Let's see how this can be achieved using ORM.

In [22]:
# First of all let's join our substructure and similarity search results with other objects to get some useful information.
# For example, we will get CHEMBL_ID and preffered name of similarity results.
# Here we join CompouldMols objects (`similar` variable) with `pref_name` field of MoleculeDictionary
# Double underscore before field name means we are accessing related object:
print '\n'.join(map(str, similar.values_list('molecule_id', 'similarity', 'molecule__pref_name')))
(1280, 1.0, u'ASPIRIN')
(506546, 1.0, u'ACETYLSALICYLATE')
(454071, 0.894736842105263, u'BROMOASPIRIN')
(287927, 0.888888888888889, None)
(954218, 0.8, None)

In [23]:
# And molecular formula and molecular weight of first 10 of substructure results:
print '\n'.join(map(str,sub.values_list('molecule__compoundproperties__full_molformula', 'molecule__compoundproperties__full_mwt')[0:10]))
(u'C45H35N5', Decimal('645.79'))
(u'C23H17N3O5', Decimal('415.40'))
(u'C28H21N3O5', Decimal('479.48'))
(u'C27H20N2O4', Decimal('436.46'))
(u'C23H20BrN3O3', Decimal('466.33'))
(u'C26H27N3O3', Decimal('429.51'))
(u'C25H22N4O3', Decimal('426.47'))
(u'C19H19N3O5', Decimal('369.37'))
(u'C25H23N5O3', Decimal('441.48'))
(u'C21H22N4O3', Decimal('378.42'))

In [24]:
# Get the synonyms of aspirin:
aspirin = MoleculeDictionary.objects.get(chembl_id='CHEMBL25')

# When having a single object we can simply access realted objects using dot (`.`) operator:
print ', '.join(aspirin.moleculesynonyms_set.values_list('synonyms', flat=True))
8-Hour Bayer, Acetylsalicylic Acid, Aspirin, Aspirin, Aspirin, Aspirin, Aspirin, Bayer Extra Strength, Measurin, Acetosalic Acid, Acetylsalicylic Acid, Ecotrin, Equi-Prin, Salicylic Acid Acetate, Acetylsalicylic Acid, Aspirin

In [25]:
# Get 10 preferred names of compounds containing 'CN(CCCN)c1cccc2ccccc12' substructure:
print ', '.join(CompoundMols.objects.filter(molecule__pref_name__isnull=False).with_substructure('c1cccc2ccccc12')[0:10].values_list('molecule__pref_name', flat=True))
METHYLRUBILACTONE, DANSYLSARCOSINE, R-NORDULOXETINE, NAPHTHYLOXYACETIC ACID, TOMEGLOVIR, BENZO[B]FLUORANTHENE, ACETOMENAPHTHONE, LANDOMYCIN R, LANDOMYCIN O, ARISTOLAMIDE II

In [26]:
#Retrieve all the compounds which have an IC50 bioactivity value in nM:
result = MoleculeDictionary.objects.filter(activities__standard_type = 'IC50', activities__standard_units = 'nM')
print result.count()
print ', '.join(result[0:10].values_list('chembl_id', flat=True))
1236586
CHEMBL113081, CHEMBL324340, CHEMBL324340, CHEMBL109600, CHEMBL357278, CHEMBL357119, CHEMBL152968, CHEMBL152968, CHEMBL152968, CHEMBL152968

In [26]: