Somebody once said - if you're going to stick around in this business, you have to have the ability to reinvent yourself, whether consciously or unconsciously.
This is the first in a series of blog posts about Big Data from a Microsoft perspective. I have always used my blog as a notebook, as it helps me get a clearer view of different topics. I hope that you will stay with me on this journey into the exciting world of big data.
One of the challenges of working with big data is that the volume of data, and how fast it will grow, can be quite hard to predict. When starting out with Big Data, a cloud platform is an ideal choice given its pay-per-use pricing and flexible scalability. Another thing to consider is that Big Data technology evolves quite rapidly, and cloud providers such as Microsoft evolve along with it, giving you the opportunity to work with the latest technology. So if you are just getting started and you have a Microsoft background, Windows Azure HDInsight might be a good place to start. Also remember that if you have an MSDN account, you are eligible for monthly Azure credits of up to 150 USD.
Microsoft worked together with Hortonworks to build its Hadoop-based big data solution on the Hortonworks Data Platform (HDP). It comes in three flavors:
- HDInsight is an Apache Hadoop-based distribution running in the cloud on Azure. Apache Hadoop is an open source framework that supports data-intensive distributed applications. It uses HDFS storage to let applications work with thousands of nodes and petabytes of data in a scale-out model.
- Hortonworks Data Platform (HDP) for Windows is a complete installation package that can be installed on Windows Server machines running on premises or on virtual machines in the cloud.
- Microsoft Analytics Platform System (formerly called Microsoft PDW)
HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware and is highly fault tolerant by nature. HDFS grew out of the work Doug Cutting and Mike Cafarella did on the Nutch search project around 2005, was inspired by the Google GFS white paper, and was developed further after Cutting joined Yahoo (see An interview with Doug Cutting, the founder of Hadoop (April 2014) and How Yahoo spawned Hadoop, the future of Big Data). In Hadoop, a cluster of servers stores the data using HDFS. Each worker node in the cluster is a data node and contains an HDFS data store and an execution engine. The cluster is managed by a server called the name node.
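To make the split between the name node and the data nodes a bit more concrete, here is a toy model in Python. It is only a sketch to illustrate the idea, not how HDFS is actually implemented: the class names, the tiny block size, the round-robin placement and the replication factor of 3 are all my own assumptions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """A worker that stores the actual block contents."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes


class NameNode:
    """Holds only metadata: which blocks form a file and where the copies live."""

    def __init__(self, data_nodes, replication=3):
        self.nodes = {node.name: node for node in data_nodes}
        # Hypothetical setting for this sketch; never more copies than nodes.
        self.replication = min(replication, len(data_nodes))
        self.file_blocks = {}      # file name -> ordered list of block ids
        self.block_locations = {}  # block id -> names of the nodes holding a copy

    def write(self, filename, data, block_size=8):
        """Split data into blocks and place each block on several data nodes."""
        names = list(self.nodes)
        block_ids = []
        for i, start in enumerate(range(0, len(data), block_size)):
            block_id = f"{filename}#{i}"
            chunk = data[start:start + block_size]
            # Simple round-robin placement (an assumption, not the HDFS policy).
            targets = [names[(i + r) % len(names)] for r in range(self.replication)]
            for target in targets:
                self.nodes[target].blocks[block_id] = chunk
            self.block_locations[block_id] = targets
            block_ids.append(block_id)
        self.file_blocks[filename] = block_ids

    def read(self, filename, failed=frozenset()):
        """Reassemble a file, skipping data nodes that are marked as failed."""
        parts = []
        for block_id in self.file_blocks[filename]:
            live = [n for n in self.block_locations[block_id] if n not in failed]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            parts.append(self.nodes[live[0]].blocks[block_id])
        return b"".join(parts)


if __name__ == "__main__":
    cluster = NameNode([DataNode(f"dn{i}") for i in range(4)])
    cluster.write("demo.txt", b"hello distributed file system")
    # Two of the four data nodes are down, yet the file can still be read.
    print(cluster.read("demo.txt", failed={"dn1", "dn2"}))
```

Because every block is copied to several data nodes, losing a node or two does not lose the file, which is the basic idea behind the fault tolerance mentioned above.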
This distributed file system, however, poses some challenges for processing the data, and this is where the MapReduce paradigm comes in, which was also inspired by Google (MapReduce: Simplified Data Processing on Large Clusters, 2004). The term refers to the two basic steps of the computation: map (the work is moved to the nodes where the data is located, and each node turns its part of the input into intermediate key/value pairs) and reduce (the intermediate results are brought back together per key and combined into the final result). These MapReduce functions are typically written in Java, but you can use Hadoop Streaming to plug in other languages.
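The classic way to illustrate this is a word count. The sketch below runs the map, shuffle and reduce steps in a single Python process, so it only mimics what Hadoop does across a cluster; the function names and the sample input are mine and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all intermediate values for one key into a single result."""
    return word, sum(counts)


def word_count(splits):
    """Run map over every split, then shuffle and reduce the intermediate pairs."""
    intermediate = []
    for split in splits:  # each split stands in for data stored on one node
        for line in split:
            intermediate.extend(map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    splits = [["big data big"], ["Data on Azure"]]
    print(word_count(splits))
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and only the much smaller intermediate pairs travel over the network to the reducers.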
There are a number of advantages of using HDInsight:
- You can quickly spin up a Hadoop cluster using the Azure Portal or using Windows PowerShell
- You only pay for what you use. When your Hadoop processing jobs are complete, you can deprovision the cluster and retain the data, because Azure HDInsight uses Azure Blob storage as the default file system, which allows you to store data outside of the HDFS cluster.
- Microsoft provides deep integration with the rest of its BI stack, such as PowerPivot, Power View, Excel, etc.
- HDInsight is exposed using familiar interfaces for Microsoft developers such as a .NET SDK (see for example Submit Hive jobs using HDInsight .NET SDK) and PowerShell
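To get a feeling for what "pay for what you use" can mean, here is a small back-of-the-envelope calculation. The hourly rate, the cluster size and the usage pattern are invented for the example and are not actual Azure prices, and the cost of keeping the data in Blob storage is left out.

```python
def cluster_cost(nodes, hours, rate_per_node_hour):
    """Compute cost of a cluster that is billed per node per hour."""
    return nodes * hours * rate_per_node_hour


# All numbers below are hypothetical and only serve the illustration.
HYPOTHETICAL_RATE = 0.50   # USD per node per hour, NOT a real Azure price
NODES = 4
HOURS_IN_MONTH = 30 * 24   # cluster left running for the whole month
HOURS_ON_DEMAND = 2 * 20   # 2 hours of jobs on each of 20 working days

always_on = cluster_cost(NODES, HOURS_IN_MONTH, HYPOTHETICAL_RATE)
on_demand = cluster_cost(NODES, HOURS_ON_DEMAND, HYPOTHETICAL_RATE)

if __name__ == "__main__":
    print(f"always on: {always_on:.2f} USD, on demand: {on_demand:.2f} USD")
```

With these made-up numbers the always-on cluster costs 18 times as much as the one that is only provisioned while jobs are running, which is why being able to keep the data in Blob storage after deprovisioning matters.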
References:
- Windows Azure HDInsight
- HDFS Architecture Guide
- Introduction to Hadoop in HDInsight
- Microsoft Big Data
- Hortonworks Data Platform