Apache Hadoop is an open source project that provides a framework for storing and processing large amounts of data. The Apache Hadoop project aims to develop open source software for reliable, scalable, distributed computing. It is designed to scale from single servers up to thousands of machines, with each machine offering local computation and storage.
Hadoop is released under the Apache License 2.0 and is cross-platform, so it runs on a wide range of systems. It is an Apache top-level project backed by a large community, and because it is open source, people from all over the world have contributed ideas and code to it in many ways.
The project consists of the following modules:
• Hadoop Common: The common utilities and libraries that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. It is one of the core modules of the project.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (a minimal word-count sketch follows this list).
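To make the MapReduce module a little more concrete, below is the classic word-count job. It is only a sketch: it assumes a Hadoop MapReduce client library on the classpath, and the input and output paths are hypothetical command-line arguments.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for every input line, emit (word, 1) for each token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sum all the 1s emitted for a given word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each mapper processes one block of the input in parallel, the framework shuffles the (word, 1) pairs so that all pairs for the same word reach the same reducer, and the reducer sums them; YARN is what schedules these map and reduce tasks across the cluster.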
History of Hadoop
The history of Hadoop is quite interesting. Google published the Google File System paper in 2003 and the MapReduce paper in 2004. In 2005 the Nutch project implemented MapReduce, and in early 2006 Hadoop was split out of Nutch as its own project. In 2008 Hadoop became a top-level Apache project, a large community formed around it, and development took off from there.
The Hadoop Distributed File System (HDFS) includes the following features:
• Very large distributed file system
– Designed to scale to around 10K nodes, 100 million files, and about 10 PB of storage.
• Assumes commodity hardware
– Files are replicated to handle hardware and software failures (a brief client-side sketch appears after this section).
– Failures are detected and data is recovered from the remaining replicas.
• Optimized for batch processing
– Data locations are exposed so that computation can move to wherever the data resides.
– This provides very high aggregate bandwidth.
These are some of the key HDFS features that have made the project successful.
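To see the replication feature from a client's point of view, here is a minimal sketch that writes a file to HDFS with an explicit replication factor using the standard FileSystem API. The NameNode address hdfs://localhost:9000 and the file path are hypothetical placeholders; on a real cluster these values normally come from the configuration files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally read from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write a small file; HDFS splits files into blocks and replicates each block.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("Hello, HDFS!\n");
    }

    // Request three replicas of each block of this file (a common default).
    fs.setReplication(file, (short) 3);

    fs.close();
  }
}
```

If a node holding one replica fails, the NameNode notices the missing replica and schedules a new copy on another node, which is how the replication described above translates into fault tolerance.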
The following are the main advantages of the Hadoop project:
• Flexible
• Fast
• Secure
• Scalable
• Reliable
• Cost effective
• Cross platform
• High storage capacity
• Failure resistant
• Data replication to guard against data loss
These features have helped Hadoop become a very successful project, and it has been adopted by many well-known companies such as Facebook, Amazon, Yahoo, and IBM.
Other projects related to Hadoop include Ambari, Avro, Cassandra, Pig, Hive, Chukwa, Spark, etc., all of which relate to big data and cluster computing. In later posts we will discuss some other big data projects, so stay connected.