Future of ChEMBL
web services
Michał Nowotka
ChEMBL Group
EMBL-EBI
What has been done so far?
Technology switch
- Java ➡ Python
- Spring ➡ Django
- The ORM made the WS database-agnostic
- WS shipped with myChEMBL (RDKit + PG)
- We can open the source code now
Consequences
- Web services are easier to maintain and deploy for US
- What about our USERS?
- WHO are our users?
- Code developers - just like us!
Are web services easy to use
by developers?
Which programming languages are supported?
- This is a REST API, so every language is supported!
- Not exactly...
- The web application security model implemented in web browsers defines the same-origin policy
- JavaScript code outside the EBI domain can't call the WS directly
- We support JSONP and CORS now
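A minimal sketch of what the two mechanisms look like on the server side, not the actual ChEMBL WS code. The function names, the callback name and the `chemblId` field are assumptions made for illustration. CORS adds a response header that tells the browser a foreign origin may read the reply; JSONP wraps the JSON payload in a call to a function named by the client.

```python
import json


def cors_headers(origin="*"):
    """Response header that lets a script from another origin read the reply."""
    return {"Access-Control-Allow-Origin": origin}


def jsonp_body(callback, payload):
    """Wrap a JSON payload in a call to the client-supplied function name."""
    return "{}({});".format(callback, json.dumps(payload))


# Hypothetical payload and callback name, for illustration only.
print(cors_headers())
print(jsonp_body("handleCompound", {"chemblId": "CHEMBL1"}))
```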
And what is the most important feature
of the RESTful API?
Uniform interface, which manifests itself in:
URI!
(Uniform Resource Identifier)
The uniform interface constraint is fundamental to the design of any REST service. The uniform interface simplifies and decouples the architecture, which enables each part to evolve independently.
URI?
- string of characters used to identify a name of a web resource
- simple
- fixed
- self descriptive
- self documenting - HATEOAS
URI?
Web resource?
A primitive, fundamental element of web architecture.
Web resources in ChEMBL WS:
- Compound
- Target
- Assay
- Bioactivity (not currently)
- Image (not currently)
- Drug Mechanism (not currently)
- Form (not currently)
HATEOAS?
Hypermedia as the Engine of Application State
- A REST client enters a REST application through a simple fixed URL.
- All future actions the client may take are discovered within resource representations returned from the server.
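The two points above can be sketched with a toy client. The in-memory `FAKE_SERVER`, its URLs and the `links` field are hypothetical and stand in for real HTTP responses; the point is only that the client knows one entry URL and discovers everything else from the representations it receives.

```python
# Hypothetical in-memory "server": each URL maps to a representation
# that carries links to the actions available from that point.
FAKE_SERVER = {
    "/": {"links": {"compounds": "/compounds", "targets": "/targets"}},
    "/compounds": {"links": {"first": "/compounds/CHEMBL1"}},
    "/compounds/CHEMBL1": {"chemblId": "CHEMBL1", "links": {}},
    "/targets": {"links": {}},
}


def get(url):
    """Stand-in for an HTTP GET that returns a parsed representation."""
    return FAKE_SERVER[url]


def follow(entry_url, *rels):
    """Start at the entry URL and follow the named link relations in order."""
    doc = get(entry_url)
    for rel in rels:
        doc = get(doc["links"][rel])
    return doc


# The client hard-codes only "/" and the relation names, never the full URLs.
compound = follow("/", "compounds", "first")
print(compound["chemblId"])
```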
So what's wrong with URIs in our WS?
- Redundancy
- Lack of consistency
- Not too flexible
- Not conforming to standards and thus hard to guess
Redundancy in the identifier type
Wrong:
Right:
- compounds/CHEMBL1
- compounds/QFFGVLORLPOAEC-SNVBAGLBSA-N
- compounds/CCCCCC
The identifier's form can be used to distinguish its type.
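A minimal sketch of how a server could tell the three identifier kinds apart, assuming a ChEMBL ID is `CHEMBL` followed by digits and a standard InChIKey has the 14-10-1 upper-case layout. The function name and the returned labels are hypothetical, and anything that matches neither pattern is simply treated as SMILES without validation.

```python
import re

CHEMBL_ID = re.compile(r"^CHEMBL\d+$")
INCHI_KEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def identifier_type(identifier):
    """Guess the identifier type from its form alone."""
    if CHEMBL_ID.match(identifier):
        return "chembl_id"
    if INCHI_KEY.match(identifier):
        return "inchi_key"
    # Fallback: no SMILES validation is attempted in this sketch.
    return "smiles"


for example in ("CHEMBL1", "QFFGVLORLPOAEC-SNVBAGLBSA-N", "CCCCCC"):
    print(example, identifier_type(example))
```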
Redundancy in resource type
Wrong:
Right:
- bioactivities/CHEMBL2
- bioactivities/CHEMBL240
- bioactivities/CHEMBL1217643
In order to get bioactivity data I have to know entity type - why???
Redundancy in resource type
This type of redundancy is especially unpleasant:
- No ability to loop over ids
- To get data you need to know a piece of this data
- This is not a canonical URI
- Bioactivity should be a resource of its own
Lack of consistency
- There are currently 9,414 targets in ChEMBL, which is about 5 MB when saved as XML.
- But there are 1,566,998 compounds and 1,042,374 assays...
- Gigabytes of data in single request
- Potential DDoS threat
- And the solution is...
Pagination
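A minimal sketch of offset-based pagination, assuming hypothetical `limit` and `offset` parameters and a response split into `meta` and `objects`. It is one common way to cap the size of a single response, not necessarily the scheme the WS will adopt.

```python
def paginate(items, limit=20, offset=0):
    """Return one page of items plus the metadata needed to fetch the next."""
    page = items[offset:offset + limit]
    has_more = offset + limit < len(items)
    return {
        "meta": {
            "limit": limit,
            "offset": offset,
            "total_count": len(items),
            # None signals that this is the last page.
            "next_offset": offset + limit if has_more else None,
        },
        "objects": page,
    }


# Toy identifiers generated for the example, not real query results.
ids = ["CHEMBL%d" % i for i in range(1, 46)]
first = paginate(ids, limit=20, offset=0)
print(first["meta"])
```

A client loops by calling again with `offset` set to `next_offset` until it is `None`.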
Benefits of filtering
- Reduces the data returned to only the items of interest
- Batch retrieval without specifying a list of ids
- Easy to cache
- Easy to implement
- Makes web services almost as flexible as SQL
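A minimal sketch of why filtering is easy to implement, assuming a hypothetical `field__lookup` naming scheme for query arguments. The records, field names and lookup names below are toy values invented for the example; they are not ChEMBL data or the real WS parameters.

```python
import operator

# Hypothetical lookup suffixes mapped to comparison functions.
LOOKUPS = {
    "exact": operator.eq,
    "gte": operator.ge,
    "lte": operator.le,
    "contains": lambda value, needle: needle in value,
}


def apply_filters(records, **filters):
    """Keep the records that satisfy every field__lookup=value filter."""
    result = []
    for record in records:
        for key, expected in filters.items():
            field, _, lookup = key.partition("__")
            if not LOOKUPS[lookup or "exact"](record[field], expected):
                break
        else:
            # No filter rejected this record.
            result.append(record)
    return result


# Toy records with made-up values.
records = [
    {"id": "A", "weight": 100.0, "name": "alpha"},
    {"id": "B", "weight": 250.0, "name": "beta"},
    {"id": "C", "weight": 400.0, "name": "alphabet"},
]
print(apply_filters(records, weight__gte=200.0, name__contains="alpha"))
```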
Benefits of filtering
Keyword based searches
- One of the big missing bits from existing services
- Solr/Lucene can be used to improve speed
- Web interface can use web services to provide keyword based search
Lack of consistency - continued
Lack of consistency - GET and POST
- In POST, data is not embedded into URI
- POST is the only way to send SMILES that contain URI-unsafe characters
- This is why all 3 API methods requiring SMILES (compound by SMILES, substructure, similarity) support POST
- But others don't - why?
- Let users decide
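A small sketch of the SMILES problem, using the standard library's `urllib.parse.quote`. The helper name is hypothetical. A SMILES string can go into a URI path segment verbatim only if percent-encoding would leave it unchanged; bonds, branches and stereo markers such as `=`, `(`, `/`, `#` and `@` break that condition.

```python
from urllib.parse import quote


def is_uri_path_safe(smiles):
    """True if the string contains only characters that need no escaping."""
    # safe="" makes quote() escape "/" as well, since it would split the path.
    return quote(smiles, safe="") == smiles


for smiles in ("CCCCCC", "C[C@H](N)C(=O)O", "C/C=C/C", "C#N"):
    print(smiles, is_uri_path_safe(smiles))
```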
Conforming to standards - POST arguments
- Support for x-www-form-urlencoded POST arguments should be dropped
- In the early days of the Web, POST was used to send data from online forms
- A special encoding called x-www-form-urlencoded was designed for that purpose
- REST APIs use POST in a different context
Conforming to standards - POST arguments
- Modern REST APIs don't allow x-www-form-urlencoded
- The rule is: you get your data back in the same format in which your POST arguments were encoded
- This means that only JSON and XML should be supported in our case
- Currently we accept x-www-form-urlencoded by default and use the 'Accept' and 'Content-Type' headers to get around some problems
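A minimal sketch of the rule above, assuming a hypothetical handler that simply echoes its arguments: the `Content-Type` of the request selects both the parser and the response format, and anything other than JSON or XML is rejected. The function names and the `request`/`response` element names are assumptions.

```python
import json
import xml.etree.ElementTree as ET


def parse_body(content_type, body):
    """Decode POST arguments according to the request Content-Type."""
    if content_type == "application/json":
        return json.loads(body)
    if content_type == "application/xml":
        return {child.tag: child.text for child in ET.fromstring(body)}
    raise ValueError("unsupported Content-Type: " + content_type)


def render(content_type, data):
    """Encode the response in the same format the request used."""
    if content_type == "application/json":
        return json.dumps(data)
    root = ET.Element("response")
    for key, value in data.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def handle_post(content_type, body):
    """Echo handler: returns (response Content-Type, response body)."""
    args = parse_body(content_type, body)
    return content_type, render(content_type, args)


print(handle_post("application/json", '{"smiles": "CCCCC"}'))
print(handle_post("application/xml", "<request><smiles>CCCCC</smiles></request>"))
```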
To keep URIs simple we should have new kinds of resources:
- Image
- Bioactivity
- Compound Form
- Mechanism of Action
- Approved Drug
- Similarity
- Substructure
New kinds of resources:
- compounds/CHEMBL2/image ➡ images/CHEMBL2
- targets/CHEMBL240/bioactivities ➡ bioactivities/CHEMBL240
- compounds/CHEMBL278020/form ➡ form/CHEMBL278020
- compounds/CHEMBL1642/drugMechanism ➡ drugMechanisms/CHEMBL1642
- compounds/substructure/CCCCC ➡ substructure/CCCCC
- compounds/similarity/CCCCC/70 ➡ similarity/CCCCC?simscore=70
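The mapping above can be sketched as a small rewrite table, for example to redirect old-style requests during the transition. The patterns are derived only from the six examples listed; the function name and the idea of using it for redirects are assumptions.

```python
import re

# Old nested path pattern -> proposed flat path, taken from the examples above.
REWRITES = [
    (re.compile(r"^compounds/(CHEMBL\d+)/image$"), r"images/\1"),
    (re.compile(r"^targets/(CHEMBL\d+)/bioactivities$"), r"bioactivities/\1"),
    (re.compile(r"^compounds/(CHEMBL\d+)/form$"), r"form/\1"),
    (re.compile(r"^compounds/(CHEMBL\d+)/drugMechanism$"), r"drugMechanisms/\1"),
    (re.compile(r"^compounds/substructure/(.+)$"), r"substructure/\1"),
    (re.compile(r"^compounds/similarity/(.+)/(\d+)$"), r"similarity/\1?simscore=\2"),
]


def to_flat_uri(old_path):
    """Translate an old nested path to its flat equivalent, if one is known."""
    for pattern, replacement in REWRITES:
        if pattern.match(old_path):
            return pattern.sub(replacement, old_path)
    # Paths without a known rewrite are returned unchanged.
    return old_path


print(to_flat_uri("compounds/CHEMBL2/image"))
print(to_flat_uri("compounds/similarity/CCCCC/70"))
```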
Conforming to standards - HATEOAS
- With flat resources, the SPORE endpoint can be generated automatically
- The current SPORE endpoint is written by hand because that was faster
- We use SPORE to generate JavaScript based live documentation
- This documentation is shipped with myChEMBL
- It should be presented on web services index page
- This would be a nice example of using the WS from JavaScript
- Proof we have CORS up and running
Other useful things
- Make the web services code open source and available on GitHub
- Everyone can have their own WS instance, even without myChEMBL
- GitHub's excellent issue tracking
- Community support
Include Beaker in web services
- So far the WS only serve data
- Pairing them with Beaker would add the ability to manipulate compounds
- Still free and open source
- Ability to create web and mobile apps without installing software
Add vector formats for images
- More web friendly
- Smaller
- Better quality
- Interactive
How much work is this?
- Most of the features described are already implemented in our other apps
- Some of them are simply commented out of the WS code to maintain backwards compatibility
- Very small amount of development required
- Still needs to be carefully tested
- Won't break anything (v2)
Timeline
- Development - summertime
- Community access to beta version - 6 months
- Base URL format change - by the end of year
- Deprecation - based on stats
Things not mentioned
- Client libraries
- New Python client lib
- Coming out this week or next
- Blog post is ready