Pavlo Comparison


The paper “A Comparison of Approaches to Large-Scale Data Analysis” by Pavlo et al. compares and analyzes the MapReduce framework against parallel DBMSs for large-scale data analysis. It benchmarks open-source Hadoop, which is built on the MapReduce model, against two parallel SQL databases, Vertica and a second system from a major relational vendor (DBMS-X), and concludes that the parallel databases clearly outperform Hadoop on the same hardware on a 100-node cluster. Averaged across five tasks on 100 nodes, Vertica was 2.3 times faster than DBMS-X, which in turn was 3.2 times faster than MapReduce. In general, the parallel SQL DBMSs were significantly faster and required less code to implement each task, but took longer to tune and to load the data. Finally, the paper discusses how the APIs of these two classes of systems are moving toward each other, and ends with a forward-looking note on integrating SQL with MapReduce. I found many flaws in this paper and feel that it was written by relational database experts who are not proficient in using the MapReduce framework.

The paper gives a strong impression of having been written by proponents of RDBMSs, and it turns out that two of the authors were involved in the creation of Vertica. The paper benchmarks its results on 100 nodes and states that anything beyond that is not useful, which is not true. Google, Facebook, Yahoo! and other corporations run their MapReduce jobs efficiently on around 1000 nodes. This is also evident from the paper “PigLatin: A Not So Foreign Language” [4], which was presented by team Cloud Nine. As the team presented, PigLatin is used effectively at Yahoo! and is built on Hadoop. They also stated that part of the motivation for building PigLatin was the cost and rigidity of parallel databases, which even the Pa...

... middle of paper ...

...ogeneous environment. Another important factor that I feel Pavlo’s paper lacks is cost. The authors never discuss cost in the paper. MapReduce is designed to work on cheap commodity hardware, whereas DBMSs may not necessarily perform well on such systems.

In a nutshell, I feel the authors failed to identify the problem domain for their analysis. Their claims were too general and lacked sufficient evidence. It would have been much better if they had run their tests on selected domains to determine the areas where a DBMS or MapReduce is more efficient.

Works Cited

6. http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/

7. Health paper by team phoenix

8. Team nimbus

9. http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html

10. MapReduce paper

11. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
