The Benefits of Apache Hadoop


Introduction to Apache Hadoop
Nowadays, people are living in a data world. It is not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006, forecasting tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That is roughly the same order of magnitude as one disk drive for every person in the world [1].
So there is a lot of data out there. The storage capacity of hard drives has increased massively over the years, but access speeds, the rate at which data can be read from a drive, have not kept up. A typical drive from 1990 could store 1370 MB of data and had a transfer speed of 4.4 MB/s [2], so reading all the data off a full drive took (1370 MB)/(4.4 MB/s) = 311 s, around 5 minutes. Twenty years later, one-terabyte drives are the norm, but transfer speeds are around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
That is a long time to read everything on a single drive, and writing is even slower. The obvious way to reduce the time is to read from and write to multiple disks at once. Imagine we had 100 drives, each holding one hundredth of the data: working in parallel, we could read all of it in less than two minutes.
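The arithmetic behind these estimates can be sketched in a few lines; the figures are the ones quoted above, and the function name is purely illustrative:

```python
# Back-of-the-envelope read times from the drive figures quoted in the text.

def read_time_seconds(capacity_mb: float, speed_mb_s: float, drives: int = 1) -> float:
    """Time to read a data set split evenly across `drives` disks read in parallel."""
    return capacity_mb / speed_mb_s / drives

# 1990: a 1370 MB drive at 4.4 MB/s
t_1990 = read_time_seconds(1370, 4.4)
print(f"1990 drive: {t_1990 / 60:.1f} minutes")      # ~5.2 minutes

# ~2010: a 1 TB drive at 100 MB/s
t_2010 = read_time_seconds(1_000_000, 100)
print(f"2010 drive: {t_2010 / 3600:.1f} hours")      # ~2.8 hours

# The same 1 TB spread over 100 drives, read in parallel
t_parallel = read_time_seconds(1_000_000, 100, drives=100)
print(f"100 drives: {t_parallel / 60:.1f} minutes")  # ~1.7 minutes
```

This ignores coordination overhead and the cost of combining results from many disks, which is exactly the bookkeeping a framework like Hadoop takes on.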
Apache Hadoop is one such solution: an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware [3]. It is a scalable, fault-tolerant distributed system for data storage and processing. The core of Hadoop has ...

... middle of paper ...

...type constraints of the destination database, errors occur and the offending data being transferred is rejected. This is what we call schema-on-write.
By contrast, Hadoop takes a schema-on-read approach. When we write data into HDFS, we simply copy it in without any gatekeeping rules. When we read it back, we apply the rules in the code that reads the data, rather than preconfiguring the structure of the data ahead of time.
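A minimal sketch of the idea, using plain Python rather than Hadoop itself (the file contents, field names, and the choice to skip bad rows are all illustrative assumptions):

```python
import csv
import io

# A raw file as it might sit in HDFS: nothing validated it at write time,
# so a malformed row made it into storage alongside the good ones.
raw = io.StringIO("alice,34\nbob,not_a_number\ncarol,29\n")

def read_users(f):
    """Schema-on-read: the reading code, not the storage layer, interprets
    each record and decides what to do with rows that don't fit."""
    for name, age in csv.reader(f):
        try:
            yield {"name": name, "age": int(age)}  # schema applied at read time
        except ValueError:
            continue  # tolerated in storage, filtered on read

print(list(read_users(raw)))
# [{'name': 'alice', 'age': 34}, {'name': 'carol', 'age': 29}]
```

Under schema-on-write, the load of `bob,not_a_number` would have been rejected up front; under schema-on-read, the write always succeeds and each consumer imposes its own interpretation later.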
The concept of schema-on-write versus schema-on-read has profound implications for how data is stored in Hadoop versus an RDBMS. In an RDBMS, data is stored in a logical form, as interrelated tables with defined columns. In Hadoop, data is stored as files of text or any other format, often compressed, and each file is replicated across multiple nodes in HDFS as it enters the system.
