"Improving MySQL performance using Hadoop" was the talk which I and Manish Kumar gave at Java One & Oracle Develop 2012, India. Based on the response and interest of the audience, we decided to summarize the talk in a blog post. The slides of this talk can be found here. They also include a screen-cast of a live Hadoop system pulling data from MySQL and working on the popular 'word count' problem.
MySQL and Hadoop have been popularly considered as 'Friends with benefits' and our talk was aimed at showing how!
The benefits of MySQL to developers are the speed, reliability, data integrity and scalability. It can successfully process large amounts of data (upto terabytes). But for applications that require massive parallel processing we may need the benefits of a parallel processing system such as Hadoop.
Hadoop is a highly scalable distributed framework and extremely powerful in terms of computation. Hadoop is fault tolerant and parallelizes data processing across many nodes. Popular users of Hadoop are Yahoo, Facebook, Twitter and Netflix.
Combine the speed and reliability of MySQL with the high scalability of Hadoop and what you get is a powerful system capable of using relational data and crunching it at tremendous speeds!
Submitting a task to a Hadoop system is done by writing a map-reduce job. A map-reduce job splits input data into independent chunks where each chunk is processed by the map task in a parallel manner. During the 'Reduce' phase, data from data nodes is merge sorted so that the key-value pairs for a given key are contiguous. The merged data is read sequentially and the valus are passed to the reduce method with an iterator reading the input file until the next key in encountered.
You can find more about Sqoop here.
In the demo session, we showed a live hadoop system with 1 name node and 2 data nodes. The idea was to show how easily a live Hadoop cluster can be brought up. We demonstrated the various phases of setting up the Hadoop system: installation, formatting the HDFS, editing the configuration files, starting the Hadoop cluster and finally running the map reduce job. We also addressed questions related to trouble shooting Hadoop deployments and common mistakes done while setting up a Hadoop cluster. After this, we wrote a basic mapreduce job for the word count problem in Java and ran the job on our Hadoop cluster. Although a 2-node cluster did not give much improvement in the time required to complete the task, the increased speed at which the map-reduce job returned the results was clearly noticeable in comparison to the same problem solved using simple SQL queries.
The following image depicts the word count problem solved using a map-reduce job: