Contents
List of Figures
Literature Review
History of Hadoop Technology
Applications of Hadoop
Main Components of Hadoop
Map Step
Reduce Step
Hadoop Distributed File System (HDFS)
Advantages and Disadvantages of using Hadoop
Competitors to the Hadoop Technology
List of Figures
Figure 1: MapReduce Programming Model
Figure 2: HDFS Architecture
Figure 3: HDFS Operation Process
Literature Review
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Apache Hadoop is a cluster computation framework that provides support for data-intensive distributed applications under a free license. The program was inspired by the Google File System and Google's MapReduce system (Eadline, 2013). According to Eadline (2013), the Hadoop technology was designed to provide fast and reliable analysis of both complex and clustered data. Consequently, many enterprises have deployed Hadoop alongside their existing IT systems, allowing them to combine old and new data within a powerful framework. Major industrial players that have used Hadoop technology include IBM, Yahoo, and Google (Lavalle, Lesser, Shockley, Hopkins & Kruschwitz, 2011).
History of Hadoop Technology
Hadoop technology was created and developed by Doug Cutting, who is also recognized as the brain behind Apache Lucene, a popular text search library. Hadoop originated from Apache Nutch, an open-source search engine that formed part of the Lucene project. According to the project's creator, Doug Cutting, the name Hadoop is not an acronym but simply a made-up name. Subprojects such as the 'contrib' module were likewise given names that were basically unrelated to their functions (Krishnan, 2013).
According to Lynch (2008), creating a web-based search engine from scratch was an ambitious objective, both for the software required and for the cost of indexing the web. The process of developing the system was expensive, but Doug Cutting and Mike Cafarella believed it was worth the cost, because success would ultimately democratize search engine algorithms. Nutch was started in 2002 as a working crawler and gave rise to the emergence of various search engines.
However, the developers came to realize that the Nutch system's architecture could not scale to the billions of pages available on the web. In 2003, Google published a paper on the Google File System (GFS), describing a storage architecture productive enough for the very large files generated by web-scale crawling and indexing.
"How Hadoop Saved the World." 123HelpMe.com. 18 Jan 2020.
In 2004, Google published a paper that introduced MapReduce. One year later, the Nutch developers already had an operational MapReduce implementation in their system, and by mid-2005 most of the Nutch algorithms had been ported to run on NDFS and MapReduce. The MapReduce and NDFS implementations in Nutch proved applicable beyond the search realm, and in 2006 the developers moved them out of Nutch to create an independent subproject of Lucene that came to be known as Hadoop. At the same time, Cutting joined Yahoo, and this move provided a more dedicated team and the resources to turn Hadoop into a system that could run at web scale (Eadline, 2013).
In February 2008, Yahoo reported that its production search engine index was being generated by a 10,000-core Hadoop cluster. Shortly afterwards, Hadoop was made its own top-level project at Apache, which confirmed its success and its diverse, active online community. Currently, Hadoop is used by many companies, including Yahoo, Facebook, eBay, Twitter, Salesforce, Yelp and Etsy.com. In other areas of the physical world, forward-looking, market-oriented companies in fields such as entertainment, satellite imaging and energy management use Hadoop to analyze the unique types of data they collect and generate internally (Krishnan, 2013).
Applications of Hadoop
Hadoop was built to solve an existing problem: governments and businesses had tremendous amounts of data that needed to be processed and analyzed quickly and efficiently. The technology chops large chunks of data into smaller pieces, spreads them over multiple machines, and has all the machines process their assigned portions of the data in parallel. Hadoop is able to run applications on systems with thousands of nodes involving hundreds or thousands of terabytes. The distributed file system in Hadoop supports rapid data transfer rates between the nodes and allows the system to continue operating uninterrupted in case of node failure. This approach reduces the risk of catastrophic system failure even when a significant number of nodes stop operating (Jain, Tate & International Business Machines Corporation, 2013).
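The chop-and-process-in-parallel idea can be sketched on a single machine with Python's standard library. This is only an illustration of the principle, not Hadoop's API: the chunk size, the worker count and the summing task are illustrative assumptions, and a process pool stands in for the cluster's machines.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker processes its assigned portion of the data independently.
    return sum(chunk)

def parallel_total(data, chunk_size=4, workers=2):
    # Chop the large data set into smaller pieces ...
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # ... and have all the workers process their pieces in parallel.
    with Pool(workers) as pool:
        partials = pool.map(process_chunk, chunks)
    # Combine the partial results into the final answer.
    return sum(partials)

if __name__ == "__main__":
    print(parallel_total(list(range(100))))  # 4950
```

In Hadoop itself, each chunk would live on a different machine and the framework, rather than the programmer, would handle scheduling and failure recovery.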
Main Components of Hadoop
The main technology behind Hadoop is the MapReduce programming model and the Hadoop Distributed File System (HDFS). Operating on and managing very large data sets is not feasible in a serial programming paradigm. MapReduce performs many tasks in parallel to complete data operations in less time, and this is the main objective of the technology. MapReduce requires a special file system in order to operate on petabyte-scale data in real-time scenarios. Hadoop was invented to store and maintain data of this intensity on commodity hardware, in a design inspired by the Google File System (Perera & Gunarathne, 2013).
MapReduce is a framework for processing highly distributable problems over large datasets using large numbers of computers (nodes), collectively termed a cluster or a grid depending on the supporting hardware. Computational processing is performed on stored data, either in an unstructured file system or in a structured database (Tomasic, Rashkovska & Depolli, 2013).
Figure 1. MapReduce programming model. Source: Holt (2013). Retrieved on 2nd Feb, 2014 from http://mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python/
MapReduce, as a software development framework and programming model, was first developed by Google in 2004. It was intended to simplify and facilitate the processing of large amounts of data in parallel on large clusters of commodity hardware, in a reliable and fault-tolerant manner (Tomasic, Rashkovska & Depolli, 2013).
Map Step
In the map step, a master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may repeat this process, leading to a multilevel tree structure. Each worker node processes its smaller problem and passes the answer back to the master node (O'Reilly Strata Conference & Hadoop World, 2012).
Reduce Step
After the master node collects the answers to all the sub-problems, it combines them in a particular way to form the output: the answer to the original problem. Producing this combined answer is the purpose of the reduce step. MapReduce allows for distributed processing of the map and reduce operations. As long as each map operation is independent of the others, all mapping operations can be performed in parallel, although in practice the parallelism is limited by the number of independent data sources or by the number of CPUs available near each source (Tomasic, Rashkovska & Depolli, 2013).
Similarly, a set of reducers can perform the reduction phase in parallel, provided that all outputs of the map operations that share the same key are presented to the same reducer at the same time. While this process may appear inefficient compared to a sequential algorithm, MapReduce can be applied to significantly larger datasets than a single commodity server can handle (Warden, 2011). For instance, a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also gives an organization some options for recovering from a partial server or storage failure during the operation: if a single mapper or reducer fails, its work can be rescheduled, on the assumption that the input data is still available (Warden, 2011).
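The map, shuffle-by-key, and reduce phases described above can be sketched as a minimal single-process word count. This is a simulation of the model, not the actual Hadoop API; the function names are illustrative:

```python
from collections import defaultdict

def map_step(document):
    # The map step emits a (key, value) pair for every word in its input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all values that share the same key, so that a single reducer
    # sees every pair emitted for a given key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # The reduce step combines the grouped values into the answer for one key.
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_step(doc)]
    return dict(reduce_step(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data big cluster", "big data"]))
# {'big': 3, 'data': 2, 'cluster': 1}
```

Because each call to map_step is independent, and each key's values go to exactly one reduce_step call, both phases could run on different machines, which is precisely what Hadoop's runtime arranges.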
Hadoop Distributed File System (HDFS)
HDFS is designed to run on commodity hardware. It is highly fault tolerant, since it is designed for deployment on low-cost hardware. The system provides high-throughput access to application data, which makes it well suited to applications with large datasets. HDFS relaxes a few POSIX requirements in order to enable streaming access to the data in the file system. HDFS was originally created as infrastructure for the Apache Nutch search engine project (Lynch, 2008).
Figure 2. HDFS architecture. Source: Krishnan (2013). Retrieved on 2nd Feb, 2014 from http://hadoop.apache.org/docs/stable1/hdfs_design.html
HDFS has a master-slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, the architecture comprises a number of DataNodes, normally one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, which are stored in a set of DataNodes. File namespace operations such as opening, closing, and renaming files and directories are executed by the NameNode, which also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients, and also perform block creation, replication, and deletion upon receiving instructions from the NameNode (Krishnan, 2013).
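The split-into-blocks and block-to-DataNode mapping can be illustrated with a small, hypothetical simulation. The 4-byte block size, the round-robin placement and the node names below are illustrative assumptions only; real HDFS uses much larger blocks (traditionally 64 MB or more) and rack-aware replica placement:

```python
def split_into_blocks(data, block_size=4):
    # A file is split internally into one or more fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=2):
    # The NameNode decides which DataNodes hold each block; here a
    # simple round-robin scheme stands in for HDFS's placement policy.
    block_map = {}
    for i, block in enumerate(blocks):
        holders = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        block_map[i] = (block, holders)
    return block_map

blocks = split_into_blocks(b"hello hdfs!", block_size=4)
mapping = place_blocks(blocks, ["dn1", "dn2", "dn3"], replication=2)
for block_id, (block, holders) in mapping.items():
    print(block_id, block, holders)
```

The block map is exactly the metadata the NameNode arbitrates: clients consult it to find which DataNodes to read from or write to, while the block contents themselves never pass through the NameNode.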
Figure 3. HDFS Operation Process. Source: Zikopoulos and Eaton (2011). Retrieved on 2nd Feb, 2014 from http://yoyoclouds.wordpress.com/2011/12/15/hdfsarchitecture/
Both the DataNode and the NameNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system. HDFS is built using the Java programming language, so any machine that supports Java can run either the NameNode or the DataNode software. The use of the highly portable Java language means that HDFS can be deployed on a wide range of hardware (Krishnan, 2013).
In a typical deployment, each machine in the cluster runs a single instance of the DataNode software; the architecture does not prevent running multiple DataNodes on one machine, but this is rare in real deployments. The existence of a single NameNode per cluster simplifies the architecture of the system. The NameNode arbitrates and is the repository for all HDFS metadata, and the system is designed so that user data never flows through the NameNode (Hurwitz, Halper & Kaufman, 2013).
Advantages and Disadvantages of using Hadoop
Hadoop is an open-source processing system for large data. It leverages the MapReduce framework, introduced by Google, which expresses computation as map and reduce functions in a style drawn from functional programming. Although the framework itself is written in Java, Hadoop allows the deployment of custom programs written either in Java or in other compatible programming languages. This enables an organization to process and analyze in-house data in parallel across thousands of commodity machines (Lin & Dyer, 2010). Hadoop is optimized for continuous reads, such as full data scans, regardless of the complexity of the data. In addition, Hadoop is regarded as one of the fastest data processors, and massive scalability is one of its major advantages. Lastly, Hadoop can handle a wider range of data formats, including data of considerable complexity (Zikopoulos & Eaton, 2011).
As the world's level of instrumentation increases, the volume of big data rises by orders of magnitude. This data remains critical to firms if properly analyzed, yet analyzing it with current processes proves difficult. Businesses have realized that Hadoop can be deployed to ingest their data when performing ETL operations. Because of the simplicity of the open-source Hadoop tools, an organization can assess the value of its data easily: once the data is loaded, the system simply retains it, allowing the organization to solve its most crucial problems first (Agrawal, Das & El Abbadi, 2011).
However, according to Perera and Gunarathne (2013), putting a Hadoop system in place does not remove the need for an analytics data warehouse. Hadoop's database-like qualities are not meant to replace a real analytical platform. Hadoop was fundamentally not designed for highly interactive analytics, which is why the Hadoop community uses it mainly for archival and ETL. Specific Hadoop limitations include:
(1) Multiple copies of already existing data: HDFS was not developed with storage efficiency in mind, and it creates multiple copies of the data. Because data locality must be maintained for performance, even more copies are required in the process.
(2) Limited SQL support: some open-source components have attempted to make Hadoop a queryable warehouse, but the system's SQL support remains limited (Holt, 2013).
(3) No query optimization: HDFS has no notion of optimizing a query, so it cannot pick an efficient, cost-based execution plan. As a result, Hadoop clusters are significantly larger than would be required for a comparable database (Lin & Dyer, 2010).
(4) A challenging MapReduce framework: the framework is generally difficult to leverage for anything beyond simple transformation logic. Open-source components simplify this, but their proprietary languages pose problems of their own.
Competitors to the Hadoop Technology
For a decade now, several big-data companies have tried to leverage massive data warehousing systems and other important big-data tools in order to optimize business processes and customer targeting. However, attempts by newer companies to compete with the leading big-data technologies have proved futile. This is due to the open-source nature of the leading big-data technologies, Hadoop and NoSQL, which gives them an unfair advantage over small companies in the industry. About 95% of an organization's data is unstructured, and the old order of relational databases finds such data hard to manage. Hadoop and NoSQL have proved to be the best solutions for big-data enterprises. The other minor competitor is Oracle, but that rivalry does not appear to threaten Hadoop's technological base (Eadline, 2013).
NoSQL, the major competitor to Hadoop technology, applies a different default scaling approach from the fundamentally centralized architecture of relational database technology. To scale, additional servers are added to the cluster and the database operations are then spread across the bigger cluster. Because commodity servers are expected to fail from time to time, NoSQL databases are designed to tolerate failures and enhance the recovery process, which makes them more resilient (Eaton, 2011).
Big-data organizations have faced several alternatives for analyzing big data outside the relational database. Adopting Hadoop, which uses the MapReduce paradigm to process data, has given companies a flexible solution, albeit one that requires software development skills to access the data. Hadoop, NoSQL, and Oracle are different technologies that are at times used to complement each other rather than to compete. Given the market dominance Hadoop has established, this technology appears likely to last long in the market.
Hadoop appears to provide the best technology in the industry. By feeding data into the database and analyzing it further, users are able to gain better insight from the Hadoop cluster. By leveraging Hadoop's processing techniques, a client can take advantage of supporting programs that simplify the results for better application in business and industry.
References
Agrawal, D., Das, S., & El Abbadi, A. (2011). Big data and cloud computing: Current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology (pp. 530-533). ACM.
Eadline, D. (2013). Hadoop fundamentals. Addison-Wesley Professional.
Eaton, C. (2011). Understanding big data: Analytics for enterprise class Hadoop and streaming data. McGraw-Hill.
Holt, B. (2013). Writing and querying MapReduce views in CouchDB. O'Reilly Media.
Hurwitz, J., Halper, F., & Kaufman, M. (2013). Big data for dummies. Hoboken, NJ: Wiley.
Jain, P., Tate, S., & International Business Machines Corporation. (2013). Big data networked storage solution for Hadoop. IBM, International Technical Support Organization.
Krishnan, K. (2013). Data warehousing in the age of big data. Burlington: Elsevier Science.
Lavalle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.
Lin, J., & Dyer, C. (2010). Data-intensive text processing with MapReduce. San Rafael, CA: Morgan & Claypool Publishers.
Lynch, C. (2008). Big data: How do your data grow? Nature, 455(7209), 28-29.
O'Reilly Strata Conference & Hadoop World. (2012). Strata Conference New York + Hadoop World 2012: Complete video compilation. O'Reilly Media.
Perera, S., & Gunarathne, T. (2013). Hadoop MapReduce cookbook. Birmingham, UK: Packt Publishing.
Tomasic, I., Rashkovska, A., & Depolli, M. (2013). Using Hadoop MapReduce in a multicluster environment. MIPRO 2013, 369-374.
Zikopoulos, P., & Eaton, C. (2011). Understanding big data: Analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media.