Monday, January 11, 2016

Data science for the rest of us

A couple of weeks ago I followed an interesting webinar from Microsoft called Data Science for the rest of us. I have been interested in data science ever since I read the excellent book Doing Data Science: Straight Talk from the Frontline by Cathy O’Neil and Rachel Schutt, and articles like Data Scientist: The Sexiest Job of the 21st Century sparked this interest even more.

In this webinar Brandon Rohrer (@_brohrer_) explains some key ingredients, or trade secrets, of doing data science in easy-to-understand terms, using a number of great examples. Here’s a quick recap (although I really recommend that you watch the video):
  • Trade secret 1: You can’t use just any data (and you have to ask sharp questions). I really like the definition formulated by Jeff Leek (@jtleek) in Data science done well looks easy, which is a big problem: “Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.” So you first need a precise question, and then you need to look for the right data or, as indicated in the webinar, data that is relevant, connected, accurate and sufficient. I’m not a data scientist, but this really seems like the hardest part (or as phrased here: For Big Data scientists, ‘janitor work’ is the key hurdle to insights).
  • Trade secret 2: Turn your data into a picture (check out the example used in the webinar). It is important to understand that people effortlessly recognize and classify objects among tens of thousands of possibilities, so visualizing your data can help you make sense of it. (For an interesting scientific article on this topic, take a look at How does the brain solve visual object recognition?)

  • Trade secret 3: Data science can only answer five questions: how much or how many? [regression], which category does something belong to? [classification], which groups exist in a dataset? [clustering], is something weird? [anomaly detection], and which action should you take? [reinforcement learning].
  • Trade secret 4: Machine learning is simple. This statement is a little exaggerated, but the analogy between mastering a foreign language and mastering machine learning is indeed correct. You need to learn the lingo: everyone probably knows tables, either in Excel or in a database, but data scientists refer to the rows of a table as data points or samples. The columns in your table typically describe a specific characteristic, which data scientists call a feature.
  • Trade secret 5: There are a lot of right ways to solve a specific problem. If you look at the Machine learning algorithm cheat sheet for Microsoft Azure Machine Learning Studio, you will notice that there are a lot of different ways to solve a specific problem (with certain nuances such as the number of features available or the speed of calculating the model), but in most cases the choice apparently does not matter that much.
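
To make the lingo from trade secret 4 a bit more concrete, here is a minimal Python sketch. The table contents and column names are invented purely for illustration and do not come from the webinar; the point is only that each row becomes a sample and each column a feature.

```python
# Hypothetical table: the values and column names are made up for illustration.
table = [
    {"surface_m2": 80, "bedrooms": 2, "price_k": 210},
    {"surface_m2": 120, "bedrooms": 3, "price_k": 305},
    {"surface_m2": 150, "bedrooms": 4, "price_k": 390},
]

feature_names = ["surface_m2", "bedrooms"]  # columns used as inputs (features)
target_name = "price_k"                     # column to predict (a "how much" question)

# Each row of the table becomes one sample: a list of feature values.
samples = [[row[name] for name in feature_names] for row in table]
targets = [row[target_name] for row in table]

print(len(samples), "samples with", len(feature_names), "features each")
```

Predicting the price column from the other two would be a regression question in the terminology of trade secret 3.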

To get an overview of other Microsoft webinars on similar topics, check out Big Data and Advanced Analytics: On-demand and upcoming live webinars.

Resolving the Dynamics CRM minFreeThreads error

A while ago I booted up a new Hyper-V virtual machine with Dynamics CRM installed and received the following error when opening the Dynamics CRM web page: “The value for ‘minFreeThreads’ must be less than the thread pool limit of 400 threads.”

Apparently the Microsoft .NET thread pool settings in machine.config were changed based on the guidelines defined in Optimizing and maintaining a Microsoft Dynamics CRM 2011 Server Infrastructure:

Parameter | Value
maxWorkerThreads | 100
maxIoThreads | 100
maxconnection | 12*n (where n is the number of CPUs)
minFreeThreads | 88*n
minLocalRequestFreeThreads | 76*n
minWorkerThreads | 50 (manually add this parameter to the file)

So if you change the number of processors assigned to the machine, you will also need to update the values in machine.config that depend on the number of CPUs.
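
As a small illustration, the table above can be turned into a helper that recomputes the values for a given number of CPUs. This is only a sketch in Python: the function name is hypothetical, and it neither reads nor writes machine.config, it only encodes the formulas from the table.

```python
def crm_threadpool_settings(cpu_count):
    """Return the values from the guideline table for a given CPU count."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return {
        "maxWorkerThreads": 100,
        "maxIoThreads": 100,
        "maxconnection": 12 * cpu_count,
        "minFreeThreads": 88 * cpu_count,
        "minLocalRequestFreeThreads": 76 * cpu_count,
        "minWorkerThreads": 50,
    }

# Compare the CPU-dependent values for a few machine sizes.
for n in (2, 4, 8):
    settings = crm_threadpool_settings(n)
    print(n, "CPUs -> minFreeThreads =", settings["minFreeThreads"])
```

For example, a minFreeThreads value that was calculated for 8 CPUs (88*8 = 704) is well above the limit of 400 threads quoted in the error message, which is the kind of mismatch you can run into when the virtual machine gets fewer processors than the configuration was written for.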


Thursday, January 07, 2016

Problem with filled Maps (choropleths) in Power BI for Belgian provinces

Update 2016/02/06: Thanks to the Power BI team feedback – I managed to get this working correctly – check out Using filled maps in Microsoft Power BI for provinces, regions and counties in European countries for the explanation.

A couple of weeks ago I wanted to try out the new filled map functionality (also referred to as choropleth) in Power BI (see Tutorial: Filled Maps (Choropleths) in Power BI). I wanted to start with a very simple data set:

Province | Dutch name | French name | Capital | Surface (km²) | Population
Antwerp | Antwerpen | Anvers | Antwerpen | 2860 | 1813282
East-Flanders | Oost-Vlaanderen | Flandre orientale | Gent | 2982 | 1477346
Flemish Brabant | Vlaams-Brabant | Brabant flamand | Leuven | 2106 | 1114299
Limburg | Limburg | Limbourg | Hasselt | 2414 | 860204
West-Flanders | West-Vlaanderen | Flandre occidentale | Brugge | 3151 | 1178996
Hainaut | Henegouwen | Hainaut | Mons | 3800 | 1335360
Liège | Luik | Liège | Liège | 3844 | 1094791
Luxembourg | Luxemburg | Luxembourg | Arlon | 4443 | 278748
Namur | Namen | Namur | Namur | 3664 | 487145
Brabant-Walloon | Waals-Brabant | Brabant wallon | Wavre | 1093 | 393700

Unfortunately I could not get the filled map to display correctly – I tried the province names in three different languages but nothing seemed to work.

According to Bing Maps Geographic Coverage, geocoding precision for Belgium should be fairly good. What are your experiences with this: do filled maps work correctly for provinces/regions outside the US? Leave a comment.

Wednesday, January 06, 2016

Using Microsoft Power BI Desktop to build Dynamics CRM Online Reports Part 5 – Refreshing data and custom visuals

This is the fifth part in a series of blog posts about Power BI and Dynamics CRM Online.

When you publish a Power BI report, the data will not be refreshed automatically (except for direct query data sources, e.g. connectivity with SQL Server Analysis Services), so you will need to define a data refresh schedule.

But before you can define the refresh schedule, Power BI needs to be able to access the Dynamics CRM Online OrganizationData.svc service. Fortunately, this service supports certain authentication capabilities found in the OAuth 2.0 protocol. According to the definition in the Internet Engineering Task Force (IETF) spec, the OAuth 2.0 authorization framework enables a third-party application to obtain limited access to an HTTP service, either on behalf of a resource owner by orchestrating an approval interaction between the resource owner and the HTTP service, or by allowing the third-party application to obtain access on its own behalf. So OAuth is one of the industry standards around federated identity. Its main goal is to eliminate the need to give system A your user name and password for accessing system B, and it allows you to determine what system A can get from system B once it has been allowed access. In simple terms, OAuth allows Power BI to talk to Dynamics CRM Online using the access token that you got back when you first authenticated in the sign-in screen, so Power BI does not need to store your user name and password.
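
To give an idea of what "using the access token" means in practice, here is a minimal Python sketch that builds (but does not send) a request to the OData endpoint with a bearer token. The organization URL, the token value and the function name are placeholders of my own, and the AccountSet entity set is just an example. How Power BI actually calls the service internally is not documented in this post.

```python
import urllib.request

def build_odata_request(org_url, access_token, entity_set="AccountSet"):
    """Build (but do not send) a GET request that carries an OAuth 2.0 bearer token."""
    url = f"{org_url.rstrip('/')}/XRMServices/2011/OrganizationData.svc/{entity_set}"
    request = urllib.request.Request(url, method="GET")
    # The bearer token takes the place of the user name and password.
    request.add_header("Authorization", f"Bearer {access_token}")
    request.add_header("Accept", "application/json")
    return request

# Placeholder values: use your own organization URL and a token from the sign-in flow.
req = build_odata_request("https://contoso.crm.dynamics.com", "<access-token>")
print(req.full_url)
print(req.get_header("Authorization"))
```

The request only contains the token, so the calling application never sees or stores the password of the user who granted access.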

Before you can set up the refresh schedule, you have to make sure that the credentials for the different data sources are up to date, so specify the credentials and make sure that you use OAuth as the authentication method.

In the free edition of Power BI you then have the option to schedule a daily or weekly refresh; for an hourly data refresh you will need to upgrade to Power BI Pro. The table below lists the available refresh options and the Power BI subscription they require (Source: Data Refresh in Power BI).

Data refresh | Power BI (free) | Power BI Pro
Datasets scheduled to refresh | Daily | Hourly
Streaming data in your dashboards and reports using the Microsoft Power BI REST API or Azure Stream Analytics | 10K rows/hour | 1M rows/hour
Live data sources with full interactivity (Azure SQL Data Warehouse, Spark on HDInsight) | Not supported | Supported
On-premises data sources requiring Power BI Personal Gateway, and on-premises SQL Server Analysis Services requiring Analysis Services Connector | Not supported | Supported

In the second part of this post we will explore how you can use custom visuals developed by third parties in Power BI (a feature introduced with the October 2015 update; see Visualize your data, your way using custom visuals in Power BI for more details). In this post I will not focus on how you can build your own custom visuals, but here is some background information for those who want to get started. To help developers, Microsoft published the code for all their visualizations on GitHub (Power BI Visuals) as an open source project. The project contains over 20 visualization types, the framework to run them and the testing framework. The visuals are built using D3.js, a JavaScript library for manipulating (HTML) documents based on data. From the website:
D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document. For example, you can use D3 to generate an HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar chart with smooth transitions and interaction.
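
The "HTML table from an array of numbers" idea can be illustrated without D3 itself. The following is only a rough analogy in plain Python, not D3 code and not how the Power BI visuals are implemented: the markup is derived entirely from the data, so changing the array changes the document.

```python
from html import escape

def numbers_to_html_table(values):
    """Render one table row per number; the markup is derived entirely from the data."""
    rows = "".join(f"<tr><td>{escape(str(v))}</td></tr>" for v in values)
    return f"<table>{rows}</table>"

print(numbers_to_html_table([4, 8, 15]))
```

D3 goes a step further by keeping the data bound to the DOM elements, so that later changes in the data can be applied as transitions on the existing elements.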

So what does this look like from a report designer’s perspective? First you can take a look at the Power BI Visual Gallery for some example custom visuals. Here you will need to download the definition file of the custom visual. In this example I will use the TadPole Spark Grid Plus; when you click the download link, you will see that it downloads a pbiviz file.

As a data source I will start from the sales and marketing sample, which you can download from the Power BI industry samples (Excel workbooks). You can import this Excel file in Power BI Desktop, and Power BI Desktop will try to import the Power Query queries, Power Pivot models and Power View worksheets, which you can refine later on using Power BI Desktop. (See Import Excel workbooks into Power BI Desktop for more details.)

Next I will create a new report page using data from the sales fact table (Total Units and Sales $) per manufacturer and per year. Afterwards you will need to import the definition file for your custom visual by selecting File > Import > Power BI Custom Visual, or by clicking the three dots in the visualizations pane, and then selecting the pbiviz file that you just downloaded. Next you can apply your visualization to the report data.

As you can see in the example, the visual shows a sparkline (for sales in units and dollars) with colored and thickened line segments. The black segments mean that the value has gone up since the last period (desirable), and the red segments mean that the value has gone down (undesirable). (This behavior is configurable using the properties of the visualization.)
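
The coloring rule described above is simple enough to sketch in a few lines of Python. This is my own illustration of the rule, not the code of the custom visual; the function name, the sample values and the treatment of flat segments are assumptions.

```python
def segment_colors(values, up_color="black", down_color="red"):
    """Return one color per line segment between consecutive values."""
    colors = []
    for previous, current in zip(values, values[1:]):
        # Assumption: a flat segment (no change) is treated like an increase.
        colors.append(down_color if current < previous else up_color)
    return colors

# Hypothetical unit sales per period.
units = [120, 135, 128, 128, 150]
print(segment_colors(units))
```

A series with n values yields n-1 segments, and swapping the two color arguments corresponds to the case where a decrease is the desirable direction.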

In a next post I will take a look at how you can embed Power BI reports in other web applications as well as Dynamics CRM.