

I was inspired by Revolution’s blog and Jeffrey Breen’s step-by-step tutorial on setting up a local virtual instance of Hadoop with R. However, that tutorial describes the setup using VMware’s application, and one downside of VMware is that it’s not free. Most people, me included, like to hear the words open-source and free, especially when the ride is smooth. VirtualBox offers an open-source alternative, so I chose it. Most of the trouble started after a hassle-free installation of VirtualBox and the creation of Cloudera’s demo VM. I ran into several hurdles when adding VirtualBox Guest Additions, which are intended to enhance the virtual machine with features such as a folder shared with the host OS. Solutions exist, but the resources are scattered and obscure. I did manage to clear these hurdles and went on to install R and RStudio along with the RHadoop packages. Since I have spent some time dealing with the quirks in the process, I thought it would be useful to self-taught enthusiasts like me if I laid out the steps in a comprehensive manner.

Hadoop

Apache Hadoop is an open-source software framework, licensed under the Apache v2 license, that supports data-intensive distributed applications. It supports running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
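
To make the map/reduce idea a little more concrete, here is a minimal single-process sketch of a word count written in that style. It is not Hadoop's API and it does not run on a cluster; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. The point is that each fragment is mapped independently, which is what would let a framework run or re-run that piece of work on any node.

```python
from collections import defaultdict


def map_phase(fragment):
    """Map step: emit a (word, 1) pair for every word in one text fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Group emitted values by key, standing in for the step a framework
    performs between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(fragments):
    """Run map, shuffle, and reduce over a list of text fragments."""
    pairs = []
    for fragment in fragments:
        # Each fragment is processed on its own, so this call could be
        # repeated for a single fragment without touching the others.
        pairs.extend(map_phase(fragment))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    fragments = ["hadoop stores data", "hadoop moves data"]
    print(word_count(fragments))
```

In a real cluster the fragments would be blocks of a file in the distributed file system and the map calls would run on the nodes holding those blocks; here everything happens in one Python process purely to show the shape of the computation.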
