In [1]:

```
%matplotlib inline
%pylab inline
from IPython.display import Image
from IPython.html.widgets import FloatProgress
from IPython.display import display
import matplotlib.pyplot as plt
import pandas as pd
from scipy.interpolate import interp1d
import random
import time
import sys
import os
from itertools import groupby
```

We are using RDKit for all chemoinformatics calculations:

In [2]:

```
from rdkit import Chem
from rdkit.Chem import AllChem
```

In [3]:

```
sys.path.append('/home/chembl/ipynb_workbench')
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
import chembl_migration_model
from chembl_migration_model.models import *
from chembl_core_model.models import CompoundMols
```

For connecting to mongoDB we will use pymongo.

In [4]:

```
import pymongo
from bson.binary import Binary
from bson.objectid import ObjectId
from pymongo import MongoClient
```

Here we establish connection with mongoDB and create a new database called 'similarity':

In [5]:

```
client = MongoClient()
db = client.similarity
```

Our first task will be to retrieve all compounds from ChEMBL and insert them to MongoDB collection. For each compound in ChEMBL we will select its CHEMBL ID and SMILES string, we will use Django ORM to retrieve this data like this:

In [6]:

```
smiles = MoleculeDictionary.objects.values_list('compoundstructures__canonical_smiles', 'chembl_id')
smi_count = smiles.count()
```

`try`

/`except`

blocks. It is not good practice to catch all exceptions (having single `except`

keyword is called pockemon exception handling) but this is not mission critical code. So for each compound we will store a document containing SMILES, RDKit object and CHEMBL ID.

In [13]:

```
if 'molecules' not in db.collection_names():
print 'populating mongodb collection with compounds from chembl...'
sys.stdout.flush()
molecules = db.molecules
percentile = int(smi_count / 100)
pbar = FloatProgress(min=0, max=smi_count)
display(pbar)
chunk_size = 100
chunk = []
for i, item in enumerate(smiles):
if not (i % percentile):
pbar.value = (i + 1)
try:
rdmol = Chem.MolFromSmiles(item[0])
except:
continue
if not rdmol:
continue
mol_data = {
'smiles': Chem.MolToSmiles(rdmol, isomericSmiles=True),
'chembl_id': item[1],
'rdmol': Binary(rdmol.ToBinary()),
}
chunk.append(mol_data)
if len(chunk) == chunk_size:
molecules.insert_many(chunk)
chunk = []
molecules.insert_many(chunk)
chunk = []
pbar.value = smi_count
print '%s molecules loaded successfully' % molecules.count()
```

In practice, efficient calculation of the similarity of any two molecules is achieved using molecular fingerprints that encode structural information about the molecule as a series of bits (0 or 1). Typically these bits represent the presence or absence of particular patterns or substructures â€” two molecules that contain more of the same patterns will have more bits in common, indicating that they are more similar. The most commonly used measure for quantifying the similarity of two fingerprints is the Tanimoto (a.k.a. Jaccard) coefficient, given by:

$$T = \frac{N_{ab}}{N_{a} + N_{b} - N_{ab}}$$where $N_{a}$ and $N_{b}$ are the number of â€˜onâ€™ fingerprint bits (i.e corresponding to the presence of a pattern) in the fingerprints of molecule $a$ and molecule $b$ respectively. $N_{ab}$ is the number of fingerprint bits common to the fingerprints of both molecule $a$ and $b$. The Tanimoto coefficient ranges from 0 when the fingerprints have no bits in common, to 1 when the fingerprints are identical.

Morgan (a.k.a. circular) fingerprints are an ECFP-like family of fingerprints that RDKit can generate. Instead of being based on bits that correspond to a fixed set of predefined substructural keys, ECFPs consist of integer codes that are generated using a hashing procedure from the local environments around each atom in multiple circular layers up to a given radius. Given a radius (e.g. 2), RDKit will return a Morgan fingerprint as a count vector that consists of a list of unique integers codes, each with a count that corresponds to the number of times the corresponding substructure occurs in the molecule:

```
mfp = AllChem.GetMorganFingerprint(rdmol, 2)
```

While this is the most information-rich representation of this fingerprint, there are two things to note:

Firstly, each fingerprint consists of typically less than 100 integer codes that are sparsely mapped over a vast space of $2^{32}$ possible integers. Storing these in the database as explicit bit vectors would be hugely inefficient and impractical (even impossible?), so RDKit also has the ability to fold the fingerprints down to a fixed length bit vector. Fortunately, for our purposes it is actually preferable to store the fingerprints in MongoDB as an array of just the â€˜onâ€™ bit positions, negating most of the problems with sparse fingerprints and making it possible to use unfolded fingerprints. Nevertheless, it is worth investigating the impact that folding has on both query performance and quality of the query results.

Secondly, itâ€™s worth considering how best to use the count information associated with each integer code. For now, we have just discarded this information and treated the fingerprint as a traditional on/off bit vector. However, it shouldnâ€™t be too difficult to retain and use this information when calculating similarities. We expect this could improve the quality of the query results, but could potentially have a negative impact on query performance.

It's important to pre-calculate the fingerprint for every molecule in the database so it doesnâ€™t have to be redundantly calculated every time a search is performed, we can append fingerprint information to each document in the `molecules`

collection:

In [18]:

```
print 'precalculating fingerprints...'
sys.stdout.flush()
mol_count = molecules.count()
percentile = int(mol_count / 100)
pbar = FloatProgress(min=0, max=mol_count)
display(pbar)
for i, molecule in enumerate(db.molecules.find()):
if not (i % percentile):
pbar.value = (i + 1)
rdmol = Chem.Mol(molecule['rdmol'])
mfp = list(AllChem.GetMorganFingerprintAsBitVect(rdmol, 2, nBits=2048).GetOnBits())
db.molecules.update_one({'_id': molecule['_id']}, {"$set":{"mfp":{'bits': mfp, 'count': len(mfp)}}} )
pbar.value = mol_count
print '...done'
```

In [84]:

```
response = db.molecules.aggregate([{'$group': {'_id': '$mfp.count', 'total': {'$sum': 1}}}])
data = pd.DataFrame().from_dict([r for r in response])
fig = plt.figure()
fig.set_size_inches(18, 6)
plt.ylabel('Number of molecules')
plt.xlabel('Number of 1-bits in fingerprint')
h = plt.hist(data._id, weights=data.total, histtype='bar', bins=105, rwidth=2)
```

`mfp_counts`

with documents that have a count field and an `_id`

that corresponds to the fingerprint bit.

In [85]:

```
print 'computing fingerprint bit counts...'
sys.stdout.flush()
counts = {}
mol_count = molecules.count()
percentile = int(mol_count / 100)
pbar = FloatProgress(min=0, max=mol_count)
display(pbar)
for molecule in db.molecules.find():
if not (i % percentile):
pbar.value = (i + 1)
for bit in molecule['mfp']['bits']:
counts[bit] = counts.get(bit, 0) + 1
counts_it = counts.items()
chunk_size = 100
for chunk in [counts_it[i:i + chunk_size] for i in range(0, len(counts_it), chunk_size)]:
db.mfp_counts.insert_many([{'_id': k, 'count': v} for k,v in chunk])
pbar.value = mol_count
print '... done'
```

In [20]:

```
db.molecules.create_index('mfp.bits')
db.molecules.create_index('mfp.count')
```

Out[20]:

There are two further efficiency tricks that work by placing constraints on easily searchable properties of fingerprints to filter out the most obviously unsuitable molecules before going through the expensive process of calculating the exact Tanimoto coefficient:

Given a query fingerprint with $N_{a}$ bits, any molecule in the database must have $N_{b}$ bits within the range $TN_{a} \leqslant N_{b} \leqslant \frac{N_{a}}{T}$ to have a similarity above a threshold $T$. This is because if the number of bits in each fingerprint differ by too much, it would be impossible for the similarity to be above the threshold even in the most ideal case where one fingerprint is an exact subset of the other.

Out of any $N_{a} - TN_{a} + 1$ bits chosen at random from the query fingerprint $a$, at least one must be present in any other fingerprint $b$ to achieve a similarity above the threshold $T$. This is because even if all the remaining other $TN_{a} - 1$ bits are in common, it still would not be enough for the similarity to be above the threshold (as shown by the previous constraint).

For the second constraint, the bits chosen from the query fingerprint donâ€™t actually need to be random â€” in fact, we can optimise even further by ensuring we choose the rarest fingerprint bits, thus restricting the query set as much as possible.

In [21]:

```
def similarity_client(smiles, threshold=0.8):
"""Perform a similarity search on the client, with initial screening to improve performance."""
if not smiles:
return
mol = Chem.MolFromSmiles(smiles)
if not mol:
return
qfp = list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048).GetOnBits())
qn = len(qfp) # Number of bits in query fingerprint
qmin = int(ceil(qn * threshold)) # Minimum number of bits in results fingerprints
qmax = int(qn / threshold) # Maximum number of bits in results fingerprints
ncommon = qn - qmin + 1 # Number of fingerprint bits in which at least one must be in common
# Get list of bits where at least one must be in result fp. Use least popular bits if possible.
if db.mfp_counts:
reqbits = [count['_id'] for count in db.mfp_counts.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)]
else:
reqbits = qfp[:ncommon]
results = []
for fp in db.molecules.find({'mfp.bits': {'$in': reqbits}, 'mfp.count': {'$gte': qmin, '$lte': qmax}}):
intersection = len(set(qfp) & set(fp['mfp']['bits']))
pn = fp['mfp']['count']
tanimoto = float(intersection) / (pn + qn - intersection)
if tanimoto >= threshold:
results.append((tanimoto, fp['chembl_id'], fp['smiles']))
return results
```

Just checking how it works for aspirin...

In [22]:

```
# aspirin
similarity_client('O=C(Oc1ccccc1C(=O)O)C')
```

Out[22]:

One issue with the query method outlined above is that most of the time-consuming work is performed by the client rather than the database server. The server performs the initial filtering stages and then just sends thousands of molecules over to the client to calculate the exact similarities and produce the actual results. This makes the overall performance mostly dependent on the speed of the client or application server; it also has the potential for a network bottleneck when transferring such a huge amount of data for each query.

The solution to this problem is to use the MongoDB aggregation framework, which allows a pipeline of tasks to be performed on the MongoDB server. This way, the MongoDB server does all the work and only the actual query results are sent to the client.

In his blog post at Datablend, Davy proposed an aggregation pipeline for similarity searching that Rajarshi found to be disappointingly slow in his benchmarks. However, in the time since Davy wrote his blog post, MongoDB version 2.6 introduced some new aggregation features that may have better performance. Of particular interest is the `$setIntersection`

operator, which is exactly what we need to calculate the number of bits in common between two fingerprints.

Hereâ€™s a similarity search aggregation pipeline that uses these new features:

In [7]:

```
def similarity_search_fp(smiles, threshold=0.8):
"""Perform a similarity search using aggregation framework."""
if not smiles:
return
mol = Chem.MolFromSmiles(smiles)
if not mol:
return
qfp = list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048).GetOnBits())
qn = len(qfp) # Number of bits in query fingerprint
qmin = int(ceil(qn * threshold)) # Minimum number of bits in results fingerprints
qmax = int(qn / threshold) # Maximum number of bits in results fingerprints
ncommon = qn - qmin + 1 # Number of fingerprint bits in which at least 1 must be in common
if db.mfp_counts:
reqbits = [count['_id'] for count in db.mfp_counts.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)]
else:
reqbits = qfp[:ncommon]
aggregate = [
{'$match': {'mfp.count': {'$gte': qmin, '$lte': qmax}, 'mfp.bits': {'$in': reqbits}}},
{'$project': {
'tanimoto': {'$let': {
'vars': {'common': {'$size': {'$setIntersection': ['$mfp.bits', qfp]}}},
'in': {'$divide': ['$$common', {'$subtract': [{'$add': [qn, '$mfp.count']}, '$$common']}]}
}},
'smiles': 1,
'chembl_id': 1
}},
{'$match': {'tanimoto': {'$gte': threshold}}}
]
response = db.molecules.aggregate(aggregate)
return [(r['tanimoto'], r['smiles'], r['chembl_id']) for r in response]
```

Checking improved function with aspirin:

In [10]:

```
similarity_search_fp('O=C(Oc1ccccc1C(=O)O)C')
```

Out[10]:

In order to perform benchmarks we will select 1000 random compound SMILES from ChEMBL. Note, that because in MongoDB collection we stored RDKit-generated SMILES, in most cases result SMILES will differ. For each discrete value of threshold, we will perform the same benchmark 5 times and save mean value.

In [11]:

```
sample_size = 1000
rand_smpl = [ smiles[i][0] for i in sorted(random.sample(xrange((smiles.count())), sample_size)) ]
repetitions = 5
```

In [42]:

```
timings = []
for thresh in [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]:
print 'measuring performance for similarity {0}'.format(thresh)
sys.stdout.flush()
rep_times = []
for i in range(repetitions):
start = time.time()
for sample in rand_smpl:
_ = similarity_search_fp(sample, thresh)
stop = time.time()
rep_times.append(stop-start)
timings.append((thresh, np.mean(rep_times)))
```

Once this is complete, we can take a look at results:

In [14]:

```
print timings
```

Plotting them will give us a better idea about time complexity:

In [87]:

```
thresholds = np.array([t[0] for t in reversed(timings)])
times = np.array([t[1] for t in reversed(timings)])
fig = plt.figure()
fig.set_size_inches(6, 6)
xnew = np.linspace(thresholds.min(), thresholds.max(), num=41, endpoint=True)
f2 = interp1d(thresholds, times, kind='quadratic')
plt.plot(thresholds, times, 'o', xnew, f2(xnew), '-')
plt.ylabel('Mean query time (ms)')
plt.xlabel('Similarity threshold')
plt.show()
```

Although this looks better than benchmarks performed by Rajarshi Guha, it's still disappointing. Apart from very high threshold values, all other data points indicate very long query times. The overall shape of the curve indicates polynomial time complexity with respect to decreasing threshold.

It's worth noting, that the results presented on the chart above are actually higher then original results presented by Matt Swain. This is most probably because Matt defined functions with a signatures like this:

```
profile_similarity(mols, fingerprinter, fp_collection, result_collection, threshold=0.8, count_collection=None)
```

So expecting binary RDKit mol object while all similarity functions defined in this notebook have signature like:

```
similarity_search_fp(smiles, threshold=0.8)
```

So expecting SMILES string as input. This require additional expensive step of parsing SMILES string and converting it to RDKit mol. The second form is much more useful in practical applications.

Basically the problem of finding similar compounds is a variation of nearest neighbour search. One of the methods that are helpful in solving this kind of problem is called Locality-sensitive hashing.

The idea behind LSH is pretty simple - while most hashes we use everyday (cryptographic hashes like SHA) generate completely different values for documents even when they differ only slightly, LSH tries to generate similar hash values for similar documents, so to maximize the probability of hash collision.

Since we are using Tanimoto coefficient for measuring similarity, it makes sense to use min-wise independent permutations locality sensitive hashing scheme. In fact, even the very short Wikipedia article about Tanimoto (Jaccard) index, mentions MinHash as an algorithm that "may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets".

In order to define new version of similarity search based on LSH we need to define some functions. We ported them from the code presented in article by Ciumac Sergiu. They generate random permutations, reorder fingerprint bits according to permutations and save the position of the first '1' bit in reordered fingerprints. Then compress the list of the '1' position into smaller list by composing one bigger number from several smaller using binary left shift operation.

In [12]:

```
def get_permutations(len_permutations=2048, num_permutations=100):
return map(lambda _: np.random.permutation(2048), range(num_permutations))
def get_min_hash(mol, permutations):
qfp_bits = [int(n) for n in AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)]
min_hash = []
for perm in permutations:
for idx, i in enumerate(perm):
if qfp_bits[i]:
min_hash.append(idx)
break
return min_hash
def hash_to_buckets(min_hash, num_buckets=25, nBits=2048):
if len(min_hash) % num_buckets:
raise Exception('number of buckets must be divisiable by the hash length')
buckets = []
hash_per_bucket = int(len(min_hash) / num_buckets)
num_bits = (nBits-1).bit_length()
if num_bits * hash_per_bucket > sys.maxint.bit_length():
raise Exception('numbers are too large to produce valid buckets')
for b in range(num_buckets):
buckets.append(reduce(lambda x,y: (x << num_bits) + y, min_hash[b:(b + hash_per_bucket)]))
return buckets
```

We can now generate some random permutations and save them to the separate MongoDB collection:

In [131]:

```
db.permutations.insert_many([{'_id':i, 'permutation': perm.tolist()} for i, perm in enumerate(get_permutations())])
```

Out[131]:

In [248]:

```
permutations = [p['permutation'] for p in db.permutations.find()]
print 'precalculating locality-sensitive hashing groups...'
sys.stdout.flush()
mol_count = db.molecules.count()
percentile = int(mol_count / 100)
pbar = FloatProgress(min=0, max=mol_count)
display(pbar)
for i, molecule in enumerate(db.molecules.find()):
if not (i % percentile):
pbar.value = (i + 1)
rdmol = Chem.Mol(molecule['rdmol'])
min_hash = get_min_hash(rdmol, permutations)
hash_groups = hash_to_buckets(min_hash)
db.molecules.update_one({'_id': molecule['_id']}, {"$set":{"lsh" : hash_groups}} )
pbar.value = mol_count
print '...done'
```

In [366]:

```
print 'constructing hash maps...'
sys.stdout.flush()
mol_count = db.molecules.count()
percentile = int(mol_count / 100)
pbar = FloatProgress(min=0, max=mol_count)
display(pbar)
for i, molecule in enumerate(db.molecules.find()):
if not (i % percentile):
pbar.value = (i + 1)
hash_groups = molecule["lsh"]
for n_hash, lsh_hash in enumerate(hash_groups):
db['hash_' + str(n_hash)].update_one({'_id':lsh_hash}, {'$push':{'molecules': molecule['_id']}}, True)
pbar.value = mol_count
print '...done'
```

`similarity_search_lsh`

function:

In [13]:

```
permutations = [p['permutation'] for p in db.permutations.find()]
def similarity_search_lsh(smiles, threshold=0.8):
"""Perform a similarity search using aggregation framework."""
if not smiles:
return
mol = Chem.MolFromSmiles(smiles)
if not mol:
return
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
qfp = list(fp.GetOnBits())
qfp_bits = [int(n) for n in fp]
min_hash = []
for perm in permutations:
for idx, i in enumerate(perm):
if qfp_bits[i]:
min_hash.append(idx)
break
hash_groups = hash_to_buckets(min_hash)
qn = len(qfp) # Number of bits in query fingerprint
qmin = int(ceil(qn * threshold)) # Minimum number of bits in results fingerprints
qmax = int(qn / threshold) # Maximum number of bits in results fingerprints
ncommon = qn - qmin + 1 # Number of fingerprint bits in which at least 1 must be in common
if db.mfp_counts:
reqbits = [count['_id'] for count in db.mfp_counts.find({'_id': {'$in': qfp}}).sort('count', 1).limit(ncommon)]
else:
reqbits = qfp[:ncommon]
nested_res = [ list(i)[0]['molecules'] for i in
[db['hash_' + str(i)].find({'_id':h},{'molecules':1}) for i,h in enumerate(hash_groups)]]
hashed_ids = [ObjectId(x) for x in (set([str(item) for sublist in nested_res for item in sublist]))]
aggregate = [
{'$match': {'_id':{'$in': hashed_ids}, 'mfp.count': {'$gte': qmin, '$lte': qmax}, 'mfp.bits': {'$in': reqbits}}},
{'$project':{
'tanimoto': {'$let': {
'vars': {'common': {'$size': {'$setIntersection': ['$mfp.bits', qfp]}}},
'in': {'$divide': ['$$common', {'$subtract': [{'$add': [qn, '$mfp.count']}, '$$common']}]}
}},
'smiles': 1,
'chembl_id': 1}},
{'$match': {'tanimoto': {'$gte': threshold}}},
]
response = db.molecules.aggregate(aggregate)
return [(r['tanimoto'], r['smiles'], r['chembl_id']) for r in response]
```

Traditionally, checking results for aspirin:

In [14]:

```
similarity_search_lsh('O=C(Oc1ccccc1C(=O)O)C')
```

Out[14]:

In order to check the performance of our `similarity_search_lsh`

function, we will use the same set of 1000 compounds we were using to test `similarity_search_fp`

.

In [62]:

```
timings_lsh = []
for thresh in [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]: #[0.7, 0.75, 0.8, 0.85, 0.9, 0.95]:
print 'measuring performance for similarity {0}'.format(thresh)
sys.stdout.flush()
rep_times = []
for i in range(repetitions):
start = time.time()
for sample in rand_smpl:
_ = similarity_search_lsh(sample, thresh)
stop = time.time()
rep_times.append(stop-start)
timings_lsh.append((thresh, np.mean(rep_times)))
```

In [11]:

```
timings_lsh
```

Out[11]:

Plotting the results...

In [24]:

```
thresholds = np.array([t[0] for t in reversed(timings_lsh)])
times = np.array([t[1] for t in reversed(timings_lsh)])
print timings_lsh
print times
fig = plt.figure()
fig.set_size_inches(6, 6)
xnew = np.linspace(thresholds.min(), thresholds.max(), num=41, endpoint=True)
f2 = interp1d(thresholds, times, kind='slinear')
plt.plot(thresholds, times, 'o', xnew, f2(xnew), '-')
plt.ylabel('Median query time (ms)')
plt.xlabel('Similarity threshold')
plt.show()
```

`similarity_search_fp`

but we will check them both on the same plot later...

We can check how accurate is function `similarity_search_lsh`

when compared to `similarity_search_fp`

. We will check number of discrepancies defined as the size of symmetrical set difference between results from both functions. We will measure total number of discrepancies over a set of 1000 random compounds as a function of similarity threshold.

In [15]:

```
discrepancies = []
discrepancies_details = []
for thresh in [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]: #[0.7, 0.75, 0.8, 0.85, 0.9, 0.95]:
print 'measuring discrepancies for similarity {0}'.format(thresh)
sys.stdout.flush()
rep_discrepancies = []
rep_coherent = []
rep_err_compounds = []
for i in range(1):
dis_sum = 0
sum_sum = 0
err_comp_sum = 0
for sample in rand_smpl:
sim_lsh = similarity_search_lsh(sample, thresh)
sim_fp = similarity_search_fp(sample, thresh)
lsh_res = set([x[2] for x in sim_lsh if (x and len(x) > 2)]) if sim_lsh else set()
reg_res = set([x[2] for x in sim_fp if (x and len(x) > 2)]) if sim_fp else set()
difference = (lsh_res^reg_res)
if len(difference):
err_comp_sum += 1
dis_sum += len(difference)
sum_sum += len(lsh_res|reg_res)
discrepancies_details.append((sample, thresh, difference))
rep_discrepancies.append(dis_sum)
rep_coherent.append(sum_sum)
rep_err_compounds.append(err_comp_sum)
discrepancies.append((thresh, np.mean(rep_discrepancies), np.mean(rep_coherent), np.mean(rep_err_compounds)))
```

In [16]:

```
print discrepancies
```

In [17]:

```
thresholds = np.array([t[0] for t in reversed(discrepancies)])
errors = np.array([t[1] / t[2] for t in reversed(discrepancies)])
fig = plt.figure()
fig.set_size_inches(6, 6)
xnew = np.linspace(thresholds.min(), thresholds.max(), num=41, endpoint=True)
f2 = interp1d(thresholds, errors, kind='quadratic')
plt.plot(thresholds, errors, 'o', xnew, f2(xnew), '-')
plt.ylabel('Percent of discrepancies')
plt.xlabel('Similarity threshold')
plt.show()
```

In [19]:

```
thresholds = np.array([t[0] for t in reversed(discrepancies)])
errors = np.array([t[3] / 10.0 for t in reversed(discrepancies)])
fig = plt.figure()
fig.set_size_inches(6, 6)
xnew = np.linspace(thresholds.min(), thresholds.max(), num=41, endpoint=True)
f2 = interp1d(thresholds, errors, kind='quadratic')
plt.plot(thresholds, errors, 'o', xnew, f2(xnew), '-')
plt.ylabel('Percent of discrepancies')
plt.xlabel('Similarity threshold')
plt.show()
```

`similarity_search_lsh`

function and `similarity_search_fp`

on a single chart.

In [15]:

```
thresholds = np.array([t[0] for t in reversed(timings)])
times = np.array([t[1] for t in reversed(timings)])
fig = plt.figure()
fig.set_size_inches(6, 6)
xnew = np.linspace(thresholds.min(), thresholds.max(), num=41, endpoint=True)
f2 = interp1d(thresholds, times, kind='cubic')
plt.plot(thresholds, times, 'o', xnew, f2(xnew), '-')
thresholds_lsh = np.array([t[0] for t in reversed(timings_lsh)])
times_lsh = np.array([t[1] for t in reversed(timings_lsh)])
xnew_lsh = np.linspace(thresholds_lsh.min(), thresholds_lsh.max(), num=41, endpoint=True)
f3 = interp1d(thresholds_lsh, times_lsh, kind='cubic')
_, swain = plt.plot(thresholds, times, 'o', xnew, f2(xnew), '-')
_, lsh = plt.plot(thresholds_lsh, times_lsh, 'o', xnew_lsh, f3(xnew_lsh), '--')
plt.legend([swain, lsh], ['Swain', 'LSH'])
plt.ylabel('Median query time (ms)')
plt.xlabel('Similarity threshold')
plt.show()
```

In [67]:

```
repetitions = 1
timings_cartrdge = []
for thresh in [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]:
print 'measuring performance for similarity {0}'.format(thresh)
sys.stdout.flush()
rep_times = []
for i in range(repetitions):
start = time.time()
for sample in rand_smpl:
_ = len(CompoundMols.objects.similar_to(sample, int(thresh * 100)).values_list('molecule_id', 'similarity'))
stop = time.time()
rep_times.append(stop-start)
timings_cartrdge.append((thresh, np.mean(rep_times)))
```

In [17]:

```
timings_cartrdge
```

Out[17]:

We can now plot all three curves:

In [18]:

```
thresholds = np.array([t[0] for t in reversed(timings)])
times = np.array([t[1] for t in reversed(timings)])
fig = plt.figure()
fig.set_size_inches(8, 8)
xnew = np.linspace(thresholds.min(), thresholds.max(), num=41, endpoint=True)
f2 = interp1d(thresholds, times, kind='quadratic')
plt.plot(thresholds, times, 'o', xnew, f2(xnew), '-')
thresholds_lsh = np.array([t[0] for t in reversed(timings_lsh)])
times_lsh = np.array([t[1] for t in reversed(timings_lsh)])
xnew_lsh = np.linspace(thresholds_lsh.min(), thresholds_lsh.max(), num=41, endpoint=True)
f3 = interp1d(thresholds_lsh, times_lsh, kind='quadratic')
thresholds_car = np.array([t[0] for t in reversed(timings_cartrdge)])
times_car = np.array([t[1] for t in reversed(timings_cartrdge)])
xnew_car = np.linspace(thresholds_car.min(), thresholds_car.max(), num=41, endpoint=True)
f4 = interp1d(thresholds_car, times_car, kind='quadratic')
_, swain = plt.plot(thresholds, times, 'o', xnew, f2(xnew), '-')
_, lsh = plt.plot(thresholds_lsh, times_lsh, 'o', xnew_lsh, f3(xnew_lsh))
_, cartridge = plt.plot(thresholds_car, times_car, 'o', xnew_car, f4(xnew_car))
plt.ylabel('Mean query time (ms)')
plt.xlabel('Similarity threshold')
plt.legend([swain, lsh, cartridge], ['Swain', 'LSH', 'postgres'])
plt.show()
```

In [ ]:

```
```