Tuesday, March 31, 2015

Overview of Apache Hadoop components in HDInsight, from Ambari to Zookeeper

A couple of months ago I wrote a first post about Microsoft Big Data – Introducing Windows Azure HDInsight. In this post I will delve a little deeper into the different components which are used in HDInsight. This is not an exhaustive list of components but it lists a number of components which you might encounter when working on your first big data project using Microsoft Azure HDInsight.


  • Ambari – provides provisioning, monitoring and management layer on top of Apache Hadoop clusters. It provides a web interface for easy management as well as a REST  API.
  • Flume – allows you to collect, aggregate and move large volumes of streaming data into HDFS in a fault tolerant fashion.
  • HBase – provides NoSQL database functionality on top of HDFS. It is a columnar store, which provides fast access to large quantities of data. HBase tables can have billions of rows and these rows can have almost unlimited number of columns.
  • HCatalog – provides a tabular abstraction on top of HDFS. Pig, Hive and Mapreduce use this layer to make it easier to work with files in Hadoop. HCatalog has been merged into the Hive project. Hive uses it kind of a like a master database. For more details check out Apache HCatalog – a  table management layer that exposes Hive metadata to other Hadoop applications.
  • Hive – allows you to perform data warehouse operations using HiveQL. HiveQL is a SQL like language and provides an abstraction layer on top of MapReduce. Hive allows you to use Hive tables to project a schema onto the data (schema on read). Through the use of HiveQL you can view your data as a table and create queries just as you would in a normal database with support for selects, filters, group by, equi-joins, etc…. Hive inherits schema and location information from HCatalog.  Hive will act as a bridge to many BI products which expect tabular data. One of the recent developments around Hive is the Stinger initiative – its main aim is to deliver performance improvements while keeping SQL compatibility
  • Kafka – is a fast, scalable, durable and fault-tolerant messaging system. It is commonly used together with Storm and HBase for stream processing, website activity tracking, metrics collection and monitoring or log aggregation. It is provides similar functionality as AMQP, JMS or Azure Event Hub
  • Mahout – the goal of Mahout is build scalable machine learning libraries. The main machine learning use cases Apache Mahout support are recommender systems (people who buy x also buy y), classification (assigning data to discrete categories e.g. is a credit card transaction fraudelent or not) and clustering (grouping unstructured data without any training data). For more details take a look at Introducing Mahout (IBM)
  • Oozie – enables you to create repeatable, dynamic workflows for tasks to be performed in a Hadoop cluster. An Oozie workflow can include Sqoop transfers, Hive jobs, HDFS commands, Mapreduce jobs, etc … Oozie will submit the jobs but Mapreduce will execute them.  Oozie also has built-in callback and pollback mechanisms to check for the status of jobs
  • Pegasus provides large scale graph mining capabilities by offering important graph mining algorithms such as degree calculation, pagerank calculation, random walk with restart (RWR), etc .. Most graph mining algorithms have limited scalability, they support up to millions of nodes. Pegasus billion-node graphs. Graphs (also referred to as networks) are everywhere in real life going from web pages, social networks, biological networks and many more… Finding patterns, rules etc within these networks allow you to rank web pages (or documents), measure viral marketing, discover disease patterns, etc … The details of Pegasus can be found in the white paper  Pegasus: a peta-scale graph mining system – implementation and observations.
  • Pig is developed to make data analysis on Hadoop easier. It is made up of two components: a high level scripting language (which is called Pig Latin but most people just reference it as Pig) and an execution environment. Pig Latin is a procedural language which allows you to build data flows, it contains a number of built in User Defined Functions (UDFs) to manipulate data. These UDFs allow you to ingest data from files, streams or other sources, make selections and transform the data. Finally Pig will store the results back into HDFS.  Pig scripts are translated into a series of MapReduce jobs that are run on Apache Hadoop. Users can create their own functions or invoke code in other languages such as JRuby, Jython and Java. Pig will gives you more control and optimization over the flow of the data than Hive does.
  • RHadoop – is a collection of R packages that allow users to manage and analyze data with Hadoop in R, including the creation of map-reduce jobs. Check out Step-by-step guide to setting up an R-Hadoop system and Using RHadoop to predict website visitors to get started with some hands-on examples.
  • Storm – distributed real-time computation system, it supports a set of common stream analytics operations, provides guaranteed message processing with support for transactions. It was originally created by Nathan Marz (see History of Apache Storm and lessons learned) – the guy who cam up with the term Lambda architecture for a generic, scalable and fault tolerant data processing architecture.
  • SQOOP – was built to transfer data from relational structured data stores (such as SQL Server, MySQL or Oracle) to Apache Hadoop and vice versa. Because Sqoop can handle database metadata, it is able to perform type-safe data movement using the data types specified in the metadata.
  • Zookeeper – manages and store configuration information. It is responsible for managing and mediating conflicting updates across your Hadoop cluster.

Thursday, March 26, 2015

People insights– data driven insights regarding people

Whereas marketing and sales as well as financial departments have been using advanced analytics for quite a while, it seems that HR is still in one of the early maturity phases of analytics usage. This  is a view which seemed to be shared by CEOs. In a recent study CEOs gave their HR department a 5.9 (out of 10) for their analytical skills.  (See CEO niet overtuigd van analytische skills HR )

Whereas HR controls a lot of data (and needs to keep it up to date) it does not seem to be able to use this data to provide strategic advise to the board of directors. HR can only deliver truly added value by providing data-driven insights regarding people that are both compelling to business leaders and actionable by HR. This is a view which is also quite nicely outlined by consultancy firm Inostix in their HR Analytics Value Pyramid (See The HR Analytics Value Pyramid (Part 3) ). To make sure that HR team stays current and viable, they will need to adopt a whole need set of skills of which analytics is just one (See The reskilled HR team – transform HR professionals into skilled business consultants  and the capability gap across the 2015 Human Capital Trends)

In a number of upcoming posts I will delve a little deeper into this topic and will show some practical examples of how you can realize some quick wins without a huge upfront investment.

Related links:

SharePoint Saturday 2015 : How to build your own Delve, combining machine learning, big data and SharePoint

BIWUG is organizing the fifth edition of SharePoint Saturday Belgium – this year in Antwerp – for more information check out the site http://www.spsevents.org/city/Antwerp/Antwerp2015/ . Here is the excerpt of the session I will be delivering.

How to build your own Delve: combining machine learning, big data and SharePoint

You are experiencing the benefits of machine learning everyday through product recommendations on Amazon & Bol.com, credit card fraud prevention, etc… So how can we leverage machine learning together with SharePoint and Yammer. We will first look into the fundamentals of machine learning and big data solutions and next we will explore how we can combine tools such as Windows Azure HDInsight, R, Azure Machine Learning to extend and support collaboration and content management scenarios within your organization.

Related posts:

Wednesday, March 04, 2015

BIWUG session on advanced integration between SharePoint Online and Yammer

On the 19th of March BIWUG (www.biwug.be) is organizing its next session – don’t forget to register for BIWUG1903 – we have planned a great speaker and an interesting session

Advanced integration between SharePoint Online and Yammer using Yammer Apps (Speaker: Stephane Eyskens, SharePoint Technical Architect - http://www.silver-it.com/ )

First things first, the session will start describing what are the required steps to bind an Office 365 Tenant with an Enteprise Domain, how to federate on-premises users with Office 365 in order to have a SSO in place and how to bind Yammer to the Office 365 Tenant. Next, developers will learn how to leverage the Yammer App Model in order to build deeper integration between SPO(+on-prem) and Yammer. Business scenarios such as leveraging Yammer's Open Graph in SPO Workflows and associating Yammer Groups to SPO Team sites (& groups) will be covered. Security aspects will be discussed as well : from acting on behalf of a user with his consent to impersonating it completely, we'll see how to manage tokens and discuss some best practices.

Intended audience: The session is primarily intended for developers.

Key benefits: After this session, developers should have a good visibility on how to go beyond the OOTB Yammer App integration with
SPO and what Open Graph is all about.

Also thanks to Xylos for hosting this session

Monday, March 02, 2015

Resetting content index in SharePoint Server 2013: why and how

When you are developing against SharePoint Server 2013 search, you might forced to reset the search index. You can do this using the SharePoint user interface through the screen shown below or using PowerShell. I prefer to use PowerShell since resetting through the user interface seems to give me timeouts especially when the index is a quite large. One of the reasons why you are required to reset your content index is when your Search Service Application got into an unhealthy state because of insufficient disk space (See Fixing the Search Service after the Index Drive fills) but I also noticed that when you are working on your development machine and are making lots of changes to the search schema – it might also be useful to reset the search index for your changes to be picked up. If you want to change it using the user interface go to the Search Administration screen of the Search Service Application and select the “Index Reset” option underneath the crawling section of the left menu.



Don’t just reset your search index in a production environment since this will also impact the analytics processing component (Read Reset the index in SharePoint Server 2013). Listed below is the syntax for the PowerShell command (the snippet below assumes that you only have one SearchServiceApplication)

(Get-SPEnterpriseSearchServiceApplication).Reset($true,$true)

The SearchServiceApplication.Reset method takes two parameters -  public void Reset(    bool disableAlerts,   bool ignoreUnreachableServer) – I would recommend always setting disableAlerts to true if necessary. The value for the second parameter will depend on your specific case. If you also get a timeout when using the PowerShell cmdlet – you can use the steps outlined in SharePoint 2013 Content Index Reset Timeout – they worked for me.