ORM stands from Object Relational Mapping. It is a technology that allows access to a database in a native object oriented manner. Using ORM has several advantages:
There is already one ORM interface for ChEMBL, called pychembldb. It's based on the popular SQLAlchemy library. To complete this picture, we provide ORM models based on Django. This notebook presents how complex DB queries can be built using Django ORM within a IPython Notebook.
Django models described here are used in ChEMBL production code, for example in ChEMBL web services. As a result, the models can be loaded into myChEMBL, which uses postgres and they still work in ChEMBL production environment, where Oracle is used. The same codebase works against two different DB engines without modifying a single line of code.
Please note that this notebook covers only most basic usage of Django ORM in the context of ChEMBL data. For more comprehensive description and advanced usage, read Making Queries Django Tutorial and QuerySet API reference.
# In order to use django ORM we have to provide a settings module. This module must be kept in a separate file.
# A minimal settings.py contains database connection details, logging and caching configuration as well as a list of
# applications (python packages) defining models we are planning to use. Let's see how our settings.py file for this
# notebook will look like:
!cat /home/chembl/ipynb_workbench/settings.py
# In order to configure django to use our settings we have to append the module containing it to system path and set
# DJANGO_SETTINGS_MODULE environment variable.
import os
import sys
sys.path.append('/home/chembl/ipynb_workbench')
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
# That's it! We are ready to use our models.
# First, lets import our ChEMBL ORM models. The module is called 'chembl_migration_model' because it describes ChEMBL DB
# schema, that is used in my_ChEMBL VM and in public SQL dumps. Internally, ChEMBL group is working on slightly more complex
# schema, containing more tables and columns.
import chembl_migration_model
# OK, now we can take a look at all available models:
print '\n'.join([model for model in dir(chembl_migration_model.models) if not model.startswith('__')])
# As you see each model corresponds to single table in ChEMBL DB. The name of the model is the name of the corresponding table,
# but without underscores and written in camel case, i.e. ASSAY_PARAMETERS (DB table) => AssayParameters (ORM model)
# For convenience we will import model names into default namespece:
from chembl_migration_model.models import *
# We need CompoundMols model to perform substructure and similarity search:
from chembl_core_model.models import CompoundMols
# For starters, lets see how many unique compounds we currently have at ChEMBL:
MoleculeDictionary.objects.count()
# How about tagets?
TargetDictionary.objects.count()
# Assays?
Assays.objects.count()
# We will learn filtering on CompoundPoperties model. First let's get all CompoundProperties objects with molecular
# formula = C9H8O4
compounds = CompoundProperties.objects.filter(full_molformula="C9H8O4")
# how many of them exist?
print compounds.count()
# OK, let's take the first object:
compound = compounds[0]
# We can access all properties of this compound, for example alogp, numner of aromatic rings, molecular weight, etc:
print compound.alogp
print compound.aromatic_rings
print compound.full_mwt
# What if we would like for a result to be a list of certain fields not a collection of whole objects?
results = CompoundProperties.objects.filter(full_molformula="C9H8O4").values_list('hba', 'hbd', 'psa')
print results
# Perfect, can I include many conditions in filters?
# Yes! Following query will get only structures with formula = C9H8O4, hbd > 1 and psa between (60 and 70):
compounds = CompoundProperties.objects.filter(full_molformula="C9H8O4", hbd__gt=1, psa__range=(60.0, 70.0))
print compounds.count()
print compounds.values_list('full_mwt', flat=True)
# Additionally to 'filter' there is 'exclude' method as well:
# Let's select all structures, without aromatic rings:
non_aromatic = CompoundProperties.objects.exclude(aromatic_rings__gt=0)
print non_aromatic.count()
# Having non-aromatic structures, can we apply more fileters only to them?
# Yes, we can chain 'filter' and 'exclude' methods:
light_non_aromatic = non_aromatic.filter(full_mwt__lte=180.0)
print light_non_aromatic.count()
# Great, but so har we've only seen a conjunction of filters, what if I would like to have alternative?
# How would I select all structures with 2 aromatic rings OR with formula = C9H8O4?
# This is where we need django helper operator called 'Q' (stands from Query):
from django.db.models import Q
# Now using Q, we can ask for alternative:
aromatic_or_formula = CompoundProperties.objects.filter(Q(aromatic_rings__gt=2) | Q(full_molformula="C9H8O4"))
print aromatic_or_formula.count()
# I would like to see all distinct avlues of the numer of aromatics rings across all compound structures:
print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct()
# Hmm, the result is not ordered...
print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct().order_by('aromatic_rings')
# OK, but I need descending order...
print CompoundProperties.objects.values_list('aromatic_rings', flat=True).distinct().order_by('-aromatic_rings')
# I want just a ranking of 10 most heavy structures from ChEMBL:
heaviest = CompoundProperties.objects.order_by('-full_mwt')[0:10]
print heaviest.count()
# And the winner is:
winner = heaviest[0]
print winner.full_mwt
print winner.full_molformula
# Lets get a list of molformulas and wiights:
print '\n'.join(map(str, heaviest.values_list('full_molformula', 'full_mwt')))
# Let's check if there is a compound with 2 aromatic rings and weight lesst than 200:
CompoundProperties.objects.filter(aromatic_rings=2, full_mwt__lt=200.0).exists()
# Sometimes we want to be sure that only a single object has been selected,
# this time let's take MoleculeDictionary as the example:
chembl_18 = MoleculeDictionary.objects.get(chembl_id='CHEMBL18')
print chembl_18.pref_name
# 'get' method will throw an exception in case when no results or more than one results were found:
try:
imaginary = MoleculeDictionary.objects.get(chembl_id='CHEMBLX')
except MoleculeDictionary.DoesNotExist:
print 'does not exist'
try:
too_many = MoleculeDictionary.objects.get(prodrug=True)
except MoleculeDictionary.MultipleObjectsReturned:
print 'too many'
The important feature of every chemistry database it the ability to perform substructure and similarity search. By default, Django is copletely chemistry unaware but 'chembl_core_model' brings this functionality.
# We can perform similarity search using CompoundMols object:
similar = CompoundMols.objects.similar_to('CC(=O)Oc1ccccc1C(=O)O', 80)
print similar.count()
# Due to some strane bug in Django we always have to mantion additional 'similarity' field but this is even better
# because this field is so useful :)
print '\n'.join(map(str, similar.values_list('molecule_id', 'similarity')))
# substructure search works the same way:
sub = CompoundMols.objects.with_substructure('CN(CCCN)c1cccc2ccccc12')
print sub.count()
The whole idea of having relational database is the ability to join related objecs together. Let's see how this can be achieved using ORM.
# First of all let's join our substructure and similarity search results with other objects to get some useful information.
# For example, we will get CHEMBL_ID and preffered name of similarity results.
# Here we join CompouldMols objects (`similar` variable) with `pref_name` field of MoleculeDictionary
# Double underscore before field name means we are accessing related object:
print '\n'.join(map(str, similar.values_list('molecule_id', 'similarity', 'molecule__pref_name')))
# And molecular formula and molecular weight of first 10 of substructure results:
print '\n'.join(map(str,sub.values_list('molecule__compoundproperties__full_molformula', 'molecule__compoundproperties__full_mwt')[0:10]))
# Get the synonyms of aspirin:
aspirin = MoleculeDictionary.objects.get(chembl_id='CHEMBL25')
# When having a single object we can simply access realted objects using dot (`.`) operator:
print ', '.join(aspirin.moleculesynonyms_set.values_list('synonyms', flat=True))
# Get 10 preferred names of compounds containing 'CN(CCCN)c1cccc2ccccc12' substructure:
print ', '.join(CompoundMols.objects.filter(molecule__pref_name__isnull=False).with_substructure('c1cccc2ccccc12')[0:10].values_list('molecule__pref_name', flat=True))
#Retrieve all the compounds which have an IC50 bioactivity value in nM:
result = MoleculeDictionary.objects.filter(activities__standard_type = 'IC50', activities__standard_units = 'nM')
print result.count()
print ', '.join(result[0:10].values_list('chembl_id', flat=True))