Thursday, July 31, 2014

Microsoft Big Data– looking into the HDInsight Emulator

As outlined in a previous post Introducing Windows Azure HDInsight, HDInsight is a pay-as-you go solution for Hadoop-based big data processing running on Windows Azure. But you don’t even have to use Azure to develop and test your code. Microsoft has an HDInsight Emulator that runs locally on your machine and that simulates the HDInsight environment in Azure locally.

HDInsight emulator is installed through the Microsoft Web Platform Installer (WebPI) a 64-bit version of Windows (Windows 7 SP1, Windows Server 2008 R2 SP1, Windows 8 or Windows Server 2012). It also seems to install without issues on Windows 8.1.

After you have installed it you will see three shortcuts on your desktop: Hadoop command line, Hadoop Name Node status and Hadoop MapReduce Status. You should also see three folders created on your local disk (Hadoop, HadoopFeaturePackSetup and HadoopInstallFiles) as well as a number of new windows services being added. The HDInsight emulator is actually a single node HDInsight cluster including both name and data node and use local CPU for compute.

One of the sample Mapreduce jobs which is included is WordCount. To try it out we will first download the complete works of Shakespeare from project Gutenberg. The file is called pg100.txt and we copy it into the folder c:\hadoop\hadoop-1.1.0-SNAPSHOT – next just follow the steps as outlined in Run a word count MapReduce job. For most HDFS-file specific commands use hadoop fs –<command name>  where most of the commands are Linux system commands. For a full list of support Hadoop file system commands, type hadoop fs at the Hadoop command prompt, for a full explanation check out the Hadoop FS Shell Guide. The Hadoop command line shell can also run other Hadoop applications such Hive, Pig, HCatalog, Sqoop, Oozie,etc … To actually run the Wordcount sample job on the HDInsight emulator run:

hadoop jar hadoop-examples.jar wordcount /user/hdiuser/input /user/hdiuser/output

Once the job is submitted you can track its progress using MapReduce status webpage. When the job is finished, the results are typically stored in a fixed file named part_r_NNNNN where N is the file counter.

A second sample which is also included is estimating PI using Monte Carlo Simulation – check out this excellent video explaining the principle behind estimating PI using Monte Carlo Simulation. For an explanation of how to use this with an actual HDInsight cluster check out the Pi estimator Hadoop sample in HDInsight

hadoop jar hadoop-examples.jar pi "16","10000000"

Tags van Technorati: ,,,,

Wednesday, July 30, 2014

Microsoft Big Data - Introducing Windows Azure HDInsight

Somebody once said - if you're going to stick around in this business, you have to have the ability to reinvent yourself, whether consciously or unconsciously.

This is the first in a series of blog posts about Big Data from a Microsoft perspective. I have always used my blog as a notebook as it helped me to get a clearer view on different topics. I hope that you stay with me in this journey in the exciting world of big data.

One of the challenges with working with big data is that the volume and the expected growth of data volume can be quite hard to predict. When starting with Big Data a cloud platform is an ideal way to start given its pay per use model and the flexible scalability model. Another thing to consider is the fact that Big Data technology evolves quite rapidly and cloud providers such as Microsoft will evolve along giving you the opportunity to work with the latest technology. So if you are just getting started and you have a Microsoft background Windows Azure HDInsight might be a good place to start. Also remember that if you have an MSDN account you are eligible for Azure monthly credits up to 150 USD.

Microsoft worked together with Hortonworks to build their Hadoop-based big data solution, the Hortonworks Data Platform (HDP). It exists in 3 different flavors:
  • HDInsight is an Apache Hadoop-based distribution running in the cloud on Azure. Apache Hadoop is an open source framework that supports data-intensive distributed applications. It uses HDFS storage to enable applications to work with 1000s of nodes and petabytes of data using a scale-out model.
  • Hortonworks Data Platform( HDP) for Windows is a complete installation package which can be installed on Windows Servers running on premise or on virtual machines running in the cloud.
  • Microsoft Analytics Platform System (formerly called Microsoft PDW)
In this post we will focus on Windows Azure HDInsight. HDInsight 3.1 which contains HDP 2.1 and is currently the default version for new Hadoop clusters (Always check Microsoft HDInsight release notes for the latest version info). At the core HDInsight is providing the HDFS/MapReduce software framework but related projects such as Pig, Hive, Oozie, Sqoop and Mahout are also included. 

HDFS (Hadoop File System) is a distributed file system designed to run on commodity hardware and is highly fault tolerant by nature. HDFS was developed by Doug Cutting and Mike Cafarella when they worked at Yahoo on the Nutch search project in 2005 and was inspired by the Google GFS white paper (See an interview with Doug Cutting, the founder of Hadoop (April 2014) and How Yahoo spawned Hadoop, the future of Big Data). In Hadoop, a cluster of servers stores the data using HDFS, each node in the cluster is a data node and contains a HDFS data store and execution engine. The cluster is managed by a server called the name node.

This distributed file system however poses some challenges for the processing of data and this is where the MapReduce paradigm  comes in which was also inspired by Google (MapReduce: Simplified Data Processing on Large Clusters, 2004). The term itself refers to the two basic computations in distributed computing, map (determining where the data is located in the different nodes and moving the work to these nodes) and reduce (bringing the intermediate results back together and computing them). These Mapreduce functions are typically written in Java, but you can use Apache streaming to plug in other languages

There are a number of advantages of using HDInsight:
  • You can quickly spin up a Hadoop cluster using the Azure Portal or using Windows PowerShell
  • You only pay for what you use. When your Hadoop processing jobs are complete, you can deprovision the the cluster and retain the data because Azure HDInsight uses the Azure Blob storage as the default file system which allows you to store data outside of the HDFS clusters.
  • Microsoft provides deep integration into the rest of their BI stack such as PowerPivot, Powerview, Excel, etc. …
  • HDInsight is exposed using familiar interfaces for Microsoft developers such as a .NET SDK (see for example Submit Hive jobs using HDInsight .NET SDK) and PowerShell
That’s it for now – in a next post I will delve a little deeper in how you can setup your first HDInsight cluster, the different architecture components and I will look into the HDInsight emulator.


Tags van Technorati: ,,,,