Reverse engineering Marvin

Michał Nowotka
ChEMBL Group
EMBL-EBI

New Java vulnerability discovered

Unspecified vulnerability in the Java Runtime Environment (JRE) component in Oracle Java SE 7 Update 21 and earlier, 6 Update 45 and earlier, and 5.0 Update 45 and earlier, and OpenJDK 7, allows remote attackers to affect confidentiality, integrity, and availability via unknown vectors related to 2D. NOTE: the previous information is from the June 2013 CPU. Oracle has not commented on claims from another vendor that this issue allows remote attackers to bypass the Java sandbox via vectors related to “Incorrect image attribute verification” in 2D.

Consequences

  • Java enabled browsers are highly vulnerable (thehackernews.com, 27-03-2013)
  • Firefox 26 Released With On Demand Java Plugin Feature (omgubuntu.co.uk, 10-12-2013)
  • Apple updates Safari web plugin blocker to disable new Java vulnerability (9to5mac.com, 29-08-2013)
  • Chrome does not support Java 7 on the Mac platform anyway...

More consequences

JavaScript to the rescue


The Curation Interface detects Java availability and decides at runtime which version to load.

DEMO

Viewer - technologies

ChEMBL contribution to RDKit:

https://github.com/rdkit/rdkit/pull/124/files

Compounds as JSON

  • Smaller than images
  • Highly compressible
  • Better quality
  • Web friendly


Use cases:

  • Interactive web widgets
  • Easily stored in DBs
  • Webservices?
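
To illustrate the "highly compressible" and "easily stored" points, here is a minimal sketch of a compound held as JSON and compressed with zlib. The schema (`atoms`, `bonds`, `element`, `x`, `y`, `begin`, `end`, `order`) is a hypothetical one chosen for illustration; it is not the format produced by RDKit or Beaker.

```python
import json
import math
import zlib


def benzene_as_dict():
    """Build a hypothetical JSON-ready description of benzene (2D layout)."""
    atoms = [
        {
            "element": "C",
            "x": round(math.cos(math.radians(60 * i)), 4),
            "y": round(math.sin(math.radians(60 * i)), 4),
        }
        for i in range(6)
    ]
    # Alternate double and single bonds around the ring.
    bonds = [
        {"begin": i, "end": (i + 1) % 6, "order": 2 if i % 2 == 0 else 1}
        for i in range(6)
    ]
    return {"atoms": atoms, "bonds": bonds}


compound = benzene_as_dict()
raw = json.dumps(compound).encode("utf-8")

# The repeated keys make the text compress well.
compressed = zlib.compress(raw)
restored = json.loads(zlib.decompress(compressed))

print(len(raw), "bytes raw,", len(compressed), "bytes compressed")
```

Because the result is plain text, the same string can go into a database column or be sent to a browser widget without any further conversion.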

Problems with viewer

  • Every web application is sandboxed by the browser
  • No way to access the clipboard
  • How to copy from the viewer and paste into the sketcher?

Solution - SO

Stack Overflow question:
How does Trello access the user's clipboard?


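    # Wait until the current call stack has cleared
    _.defer =>
        # Show the clipboard container and clear any previous textarea
        $clipboardContainer = $("#clipboard-container")
        $clipboardContainer.empty().show()
        # Add a textarea holding the text to copy, then focus and select it
        $("<textarea id='clipboard'></textarea>")
        .val(@value)
        .appendTo($clipboardContainer)
        .focus()
        .select()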
    

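Append an invisible textarea, fill it with the molfile, give it focus, and select its whole text when the user presses Ctrl. The browser's own Ctrl+C then copies the selection, so no clipboard API is needed.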

Demo

Marvin for JS (sketcher) limitations

  • Not all functionality present in the Java version can be reimplemented in JavaScript
  • Format conversion / 2D coords / stereo info...
  • These need to be performed on the server side
  • Webservices!
  • ChemAxon requires a separate licence to use their webservices!
  • But at least they publish the specification

Open source solution - Beaker

RDKit and OSRA in Bottle on Tornado

What is Beaker?

  • A portable, lightweight webserver
  • REST-speaking
  • CORS-ready
  • Wraps RDKit and OSRA
  • Built on Bottle and Tornado
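
The shape of such a service can be sketched with the standard library alone. This is not Beaker's code: Beaker runs on Bottle and Tornado and delegates chemistry to RDKit and OSRA, whereas the sketch below uses a bare WSGI callable, and the `/counts` route and `count_atoms_and_bonds` helper are hypothetical stand-ins for a real chemistry call.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def count_atoms_and_bonds(molfile):
    """Stand-in for a chemistry call: read the V2000 counts line (line 4)."""
    counts_line = molfile.splitlines()[3]
    return {"atoms": int(counts_line[0:3]), "bonds": int(counts_line[3:6])}


def app(environ, start_response):
    """Tiny WSGI app: one POST route, JSON out, CORS header on every reply."""
    headers = [
        ("Content-Type", "application/json"),
        ("Access-Control-Allow-Origin", "*"),
    ]
    status, body = "404 Not Found", {"error": "unknown route"}
    if environ["REQUEST_METHOD"] == "POST" and environ["PATH_INFO"] == "/counts":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        molfile = environ["wsgi.input"].read(length).decode("utf-8")
        try:
            status, body = "200 OK", count_atoms_and_bonds(molfile)
        except (IndexError, ValueError):
            status, body = "400 Bad Request", {"error": "not a V2000 molfile"}
    start_response(status, headers)
    return [json.dumps(body).encode("utf-8")]


def call(method, path, payload=b""):
    """Invoke the app in-process, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ.update(
        REQUEST_METHOD=method,
        PATH_INFO=path,
        CONTENT_LENGTH=str(len(payload)),
    )
    environ["wsgi.input"] = io.BytesIO(payload)
    seen = {}

    def start_response(status, headers):
        seen["status"], seen["headers"] = status, dict(headers)

    body = b"".join(app(environ, start_response))
    return seen["status"], seen["headers"], json.loads(body)


# A hand-written two-atom molfile used only to exercise the route.
ETHANE = "\n".join([
    "ethane",
    "  hand-written",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])

status, headers, result = call("POST", "/counts", ETHANE.encode("utf-8"))
print(status, result)
```

To try it over HTTP, the same `app` can be passed to `wsgiref.simple_server.make_server`.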

What does it do?

  • Format conversion
  • Compound recognition
  • Image generation (including JSON)
  • Fingerprints, descriptors
  • Marvin for JS compatible webservices

Code & demo!

Beaker code is available as a GitHub repository:
https://github.com/mnowotka/chembl_beaker

Potential use cases

  • Access from languages like JavaScript and Ruby
  • Web applications
  • Mobile apps (camera + OSRA + RDKit)
  • Small desktop apps (clippy)
  • Marvin Backend
  • Part of webservices?

New webservices

  • Released last week
  • Different software stack: Java/Spring -> Python/Django
  • Can run on Oracle, Postgres, (MySQL)

New webservices

Can run on any machine:

New webservices

Image generation:

  • Improved quality
  • Two engines: RDKit and Indigo
  • Computing coordinates
  • Dimensions

New webservices

JSONP and CORS support

Can be used from JavaScript

Bio.js ChEMBL component can be improved
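
As a rough illustration of what JSONP and CORS support means on the server side, the sketch below renders the same payload either as plain JSON or wrapped in a caller-supplied callback. The function name `render_response`, the payload field, and the callback-name check are assumptions made for this example, not the webservices' actual code.

```python
import json
import re

# Accept only dotted JavaScript identifiers, so the callback name
# cannot smuggle arbitrary script into the response.
_CALLBACK_RE = re.compile(r"^[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*$")


def render_response(payload, callback=None):
    """Return (headers, body) as plain JSON, or as JSONP if a callback is given."""
    body = json.dumps(payload)
    # CORS: let scripts served from any origin read the response.
    headers = {"Access-Control-Allow-Origin": "*"}
    if callback is None:
        headers["Content-Type"] = "application/json"
        return headers, body
    if not _CALLBACK_RE.match(callback):
        raise ValueError("invalid JSONP callback name")
    # JSONP: the browser loads this via a <script> tag and runs the callback.
    headers["Content-Type"] = "application/javascript"
    return headers, "%s(%s);" % (callback, body)


json_headers, json_body = render_response({"molecule": "CHEMBL25"})
jsonp_headers, jsonp_body = render_response({"molecule": "CHEMBL25"}, "showCompound")
print(json_body)
print(jsonp_body)
```

CORS covers modern browsers with a single header, while JSONP remains a fallback for clients that can only load scripts across origins.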

New webservices

NoSQL approach to caching:

  • New webservices are intended to be used outside ChEMBL
  • They can use only the public part of the DB schema
  • No materialised views
  • Requests can be expensive
  • Caching is required

Cache characteristics

  • Once cached, a response won't change until the next ChEMBL release
  • Cache should be shared across many production machines
  • Independent of output format
  • Available from Python, supported by EBI
  • Fail-safe, with a timeout
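
One way to read the fail-safe requirement is that a slow or broken cache must never break a request. The sketch below is an assumption about how that could look, not the actual implementation: `cached_or_fresh`, `cache_get` and `compute` are hypothetical names, and a thread pool stands in for whatever timeout mechanism the cache client offers.

```python
import concurrent.futures
import time

# Shared worker pool; a module-level pool avoids blocking on slow lookups.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def cached_or_fresh(cache_get, compute, key, timeout=1.0):
    """Return the cached value, or compute it if the cache misses, fails or is slow."""
    future = _executor.submit(cache_get, key)
    try:
        hit = future.result(timeout=timeout)
    except Exception:
        # Covers both a timeout and any error raised by the cache backend.
        hit = None
    if hit is not None:
        return hit
    return compute()


def slow_cache(key):
    """Simulates a cache that answers too late."""
    time.sleep(0.3)
    return "late answer"


def broken_cache(key):
    """Simulates a cache backend that is down."""
    raise RuntimeError("cache backend is down")


print(cached_or_fresh(lambda key: "cached", lambda: "fresh", "k"))
print(cached_or_fresh(slow_cache, lambda: "fresh", "k", timeout=0.05))
print(cached_or_fresh(broken_cache, lambda: "fresh", "k"))
```

In every failure case the caller still gets an answer, only more slowly, which is the point of treating the cache as optional.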

Our implementation

  • Key-value store built on MongoDB
  • Key is an MD5 hash of certain request parameters
  • Value is a base64-encoded, zlib-compressed pickle of a Django QuerySet
  • Values are split into 16 MB chunks to work around MongoDB's document size limit
  • Timeout set to 1 second
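
The key and value encoding described above can be sketched with the standard library. Several things here are assumptions for illustration: a plain dict stands in for the MongoDB collection, a list stands in for the Django QuerySet, and the choice to drop a `format` parameter from the key is a guess at how the cache stays independent of output format.

```python
import base64
import hashlib
import pickle
import zlib

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB, as on the slide


def make_key(params):
    """MD5 of the request parameters, ignoring the (assumed) 'format' parameter."""
    relevant = sorted((k, str(v)) for k, v in params.items() if k != "format")
    return hashlib.md5(repr(relevant).encode("utf-8")).hexdigest()


def encode_value(obj, chunk_size=CHUNK_SIZE):
    """Pickle, zlib-compress and base64-encode obj, then split it into chunks."""
    blob = base64.b64encode(zlib.compress(pickle.dumps(obj)))
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]


def decode_value(chunks):
    """Reverse encode_value. Only unpickle data the service wrote itself."""
    return pickle.loads(zlib.decompress(base64.b64decode(b"".join(chunks))))


class DictCache:
    """In-memory stand-in for the MongoDB-backed key-value store."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self._docs = {}
        self._chunk_size = chunk_size

    def set(self, params, obj):
        self._docs[make_key(params)] = encode_value(obj, self._chunk_size)

    def get(self, params):
        chunks = self._docs.get(make_key(params))
        return None if chunks is None else decode_value(chunks)


# A tiny chunk size forces several chunks so the splitting is visible.
cache = DictCache(chunk_size=8)
rows = [{"chembl_id": "CHEMBL25", "name": "ASPIRIN"}]
cache.set({"chembl_id": "CHEMBL25", "format": "json"}, rows)

# The same query in a different output format hits the same entry.
print(cache.get({"chembl_id": "CHEMBL25", "format": "xml"}))
```

Caching the query result rather than the rendered response is what lets one entry serve every output format.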

How to monitor cache?

Sentry!

  • Realtime event logging and aggregation platform
  • Specializes in monitoring errors
  • An alternative to the standard user feedback loop
  • Can be used with any programming language

Demo

Sentry is just one part of the deployment environment

Other parts

  • Fabric
  • pip
  • virtualenv / virtualenvwrapper
  • Rsync

Relationship between components

Deployment in practice

Almost zero downtime (Apache restart)

Curation Interface update

  • GitHub-like image comparison
  • Document segmentation
  • Use case scenarios

Demo

Document segmentation

CI use cases

Thank you!

Questions?