Java Language MapReduce Views
David Hardtke
September 08, 2010
One of the core principles at Cloudant is “push code to data.” In the era of big data and open source software, it’s much easier to move your code around than to move the data. Code is not tied to a machine by software licenses, and it is generally smaller than the data one wishes to analyze. When I first heard the phrase “push code to data,” the retained marketing synapse in my brain immediately fired with “write once, run anywhere,” the Sun Microsystems marketing slogan for Java.
The Cloudant Hosted CouchDB service uses MapReduce views to analyze data. Code is contained in a design document. The design document is passed to a view server along with each document in the corresponding database. CouchDB view servers have been written for many interpreted languages (see the CouchDB wiki for a list), but until now there has not been a Java Language View Server for CouchDB.
Today we are releasing the Java Language MapReduce View Interface for Cloudant’s Hosted CouchDB service. This interface defines the protocol for writing MapReduce views in Java that can be run on our hosted CouchDB platform. The Java language has several features that make it extremely useful for large-scale analytics within CouchDB:
- Compile-Time Type Checking. CouchDB views are generally written in dynamically typed interpreted languages (JavaScript, Python). Anyone who has worked extensively with CouchDB knows that debugging complicated views can be very challenging. Sometimes it is easier to let the compiler find the syntax and type errors before the view ever runs.
- Libraries, Libraries, and More Libraries. Java is one of the most widely used programming languages at the moment. As such, there is a library for almost everything. CouchJava allows for the re-use of existing Java language libraries within the CouchDB framework. Libraries can be packaged in jar files and uploaded directly to the view server.
When might you use this? Let’s assume you want to use a neural network to analyze financial options data in real time. You can import your favorite neural network library (there are many written for Java) directly into CouchDB, along with the weights that you have calculated during training (the weights will be configuration parameters in the design document). Each new piece of data will be automatically analyzed in real time. You can update the weights by modifying the design document. In fact, one could build the unsupervised learning cycle directly into CouchDB.
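To make the idea concrete, here is a minimal sketch of a map step that scores each incoming document using weights read from configuration. Everything in it is an assumption made for illustration: the `OptionScorer` class, the `configure` and `map` method names, the `bias` key, and the single logistic unit that stands in for a real neural network library. It does not reproduce the actual Cloudant interface.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one logistic unit standing in for a neural network library.
class OptionScorer {
    private final Map<String, Double> weights = new LinkedHashMap<>();
    private double bias = 0.0;

    // Assumed hook: receives the configuration parameters stored in the design document.
    // The key "bias" is treated as the bias term; every other key names a document field.
    void configure(Map<String, Double> config) {
        weights.clear();
        bias = 0.0;
        for (Map.Entry<String, Double> e : config.entrySet()) {
            if (e.getKey().equals("bias")) {
                bias = e.getValue();
            } else {
                weights.put(e.getKey(), e.getValue());
            }
        }
    }

    // Assumed map step: emits at most one (document id, score) pair per document.
    List<Map.Entry<String, Double>> map(Map<String, Object> doc) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        double z = bias;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            Object field = doc.get(w.getKey());
            if (!(field instanceof Number)) {
                return out; // skip documents that lack a numeric feature
            }
            z += w.getValue() * ((Number) field).doubleValue();
        }
        double score = 1.0 / (1.0 + Math.exp(-z)); // logistic activation
        out.add(new SimpleEntry<>(String.valueOf(doc.get("_id")), score));
        return out;
    }
}
```

In this sketch, retraining only means calling `configure` again with new values, which mirrors the idea of updating the weights by editing the design document rather than rebuilding the jar.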
The Java view server works differently from a standard CouchDB view server. The design document does not contain code. Instead, the design document specifies which class should be called for the Map and Reduce steps. The code (a jar) is attached to the design document as a binary attachment. This jar contains both the user-defined classes and any external libraries they need. This paradigm (libraries as binary attachments) is a non-standard extension of the CouchDB view server API.
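The division of labor described above can be sketched roughly as follows: the jar holds a user-defined class, and the server only needs the class name from the design document to instantiate it. This is an illustration under stated assumptions, not the published interface. `JavaViewSketch`, `CountByType`, and `ViewLoaderSketch` are hypothetical names, and a real server would load the class from the attached jar rather than from its own classpath.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical contract between the view server and user code (not the real interface).
interface JavaViewSketch {
    List<Map.Entry<String, Integer>> map(Map<String, Object> doc);
    int reduce(List<Integer> values);
}

// A user-defined class of the kind that would be packaged in the attached jar.
class CountByType implements JavaViewSketch {
    // Map step: emit (type, 1) for every document that has a "type" field.
    public List<Map.Entry<String, Integer>> map(Map<String, Object> doc) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Object type = doc.get("type");
        if (type != null) {
            out.add(new SimpleEntry<>(type.toString(), 1));
        }
        return out;
    }

    // Reduce step: sum the emitted counts.
    public int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }
}

// Stand-in for the server side: resolve the class named in the design document.
class ViewLoaderSketch {
    static JavaViewSketch load(String className) {
        try {
            return (JavaViewSketch) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load view class " + className, e);
        }
    }
}
```

Because the design document carries only a class name, swapping in a different analysis would mean attaching a new jar and changing one string, with no inline source to edit.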
The interface and instructions for writing Java Language MapReduce views are now available on GitHub. Let us know what you think.