CTRL+ALT+JOPX: Synapse


Sunday, November 26, 2023

Implementing Azure Synapse Link for Dataverse: gotchas and tips

Azure Synapse Link for Dataverse allows you to easily export data from a Dataverse (or Dynamics 365) instance to Azure Data Lake Storage Gen2 (ADLS) and/or Azure Synapse. Azure Synapse Link for Dataverse provides a continuous replication of standard and custom entities/tables to Azure Synapse and Azure Data Lake.

I highly recommend watching the awesome YouTube playlist Azure Synapse Link and Dataverse - better together from Scott Sewell (@Scottsewell) as an introduction.

This blog post provides a number of tips & tricks but is not an exhaustive list - I highly recommend going through the links to the Microsoft documentation listed in the reference section below. You can also take a look at the presentation I delivered at Techorama in May 2023, which is available on GitHub - Azure Synapse Link for Dataverse from 0 to 100

1. Check the region of your Dataverse/Dynamics 365 instance

The configuration of Azure Synapse Link for Dataverse is done through the Power Platform maker portal, but before you can get started you should first set up Azure Data Lake Storage Gen2 and Azure Synapse in your Azure subscription.

It is however best to first check in the configuration screen in which region your instance is located, since the storage account and Synapse workspace must be created in the same region as the Power Apps environment for which you want to enable Azure Synapse Link. From the Power Platform admin center (PPAC) user interface it is currently not possible to create a Dataverse/Dynamics 365 instance in a specific region, but this is possible with PowerShell - see Creating a Dataverse instance in a specific Azure region using Power Apps Admin PowerShell module.

If you need to move a Dataverse or Dynamics 365 instance to a different Azure region, you can open a Microsoft support ticket. Based on recent experience this specific type of Microsoft support request is handled fairly quickly (within 1-2 business days).

Azure Data Lake Storage is a set of capabilities built on Azure Blob Storage. When you create a storage account and check the "enable hierarchical namespace" checkbox on the advanced tab, you create an Azure Data Lake Storage Gen2 account.

2. Make sure all prerequisites are in place before enabling Azure Synapse Link

Definitely make sure that all security configuration outlined in Create an Azure Synapse Link for Dataverse with your Azure Synapse Workspace (Microsoft docs) is correctly set up. The exception messages shown in the Azure Synapse Link configuration pages aren't always very helpful.

3. Azure Synapse Link for Dataverse is a Lake Database

In the documentation from Microsoft (Understand lake database concepts) a lake database is defined as:

A lake database provides a relational metadata layer over one or more files in a data lake. You can create a lake database that includes definitions for tables, including column names and data types as well as relationships between primary and foreign key columns. The tables reference files in the data lake, enabling you to apply relational semantics to working with the data and querying it using SQL. However, the storage of the data files is decoupled from the database schema; enabling more flexibility than a relational database system typically offers.

The data is stored in ADLS Gen2 in accordance with the Common Data Model (CDM) - the folders used conform to well-defined and standardized metadata structures (mapped 1:1 with Dataverse tables/entities). At the root you will see a metadata file (called model.json) which contains semantic information about all of the entity/table records, attributes and relationships between the tables/entities.
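To get a feel for what model.json contains, here is a minimal Python sketch that lists the entities, their columns and their partitions. The sample document is a hypothetical, heavily trimmed stand-in for a real file, and the property names used (entities, attributes, dataType, partitions) are the ones I would expect in a CDM folder metadata file - verify them against the file in your own storage account.

```python
import json

# Hypothetical, heavily trimmed model.json content; a real file exported by
# Azure Synapse Link contains many more entities, attributes and properties.
SAMPLE_MODEL_JSON = """
{
  "name": "cdm",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "account",
      "attributes": [
        {"name": "Id", "dataType": "guid"},
        {"name": "name", "dataType": "string"}
      ],
      "partitions": [
        {"name": "2023-05", "location": "https://example.invalid/account/2023-05.csv"}
      ]
    }
  ]
}
"""


def summarize_model(model_text):
    """Return {entity name: {"columns": [...], "partitions": [...]}}."""
    model = json.loads(model_text)
    summary = {}
    for entity in model.get("entities", []):
        summary[entity["name"]] = {
            # (column name, data type) pairs in the order they are declared
            "columns": [
                (attribute["name"], attribute.get("dataType"))
                for attribute in entity.get("attributes", [])
            ],
            # partition names, one per exported CSV file
            "partitions": [p["name"] for p in entity.get("partitions", [])],
        }
    return summary


summary = summarize_model(SAMPLE_MODEL_JSON)
for entity_name, details in summary.items():
    print(entity_name, details["columns"], details["partitions"])
```

The CSV files themselves have no header row to rely on here, so the column order in model.json is what ties a value in a file to an attribute name.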

The way the files are being written depends on the Azure Synapse Link for Dataverse configuration - both the partitioning mode and in place vs append only mode can be configured - see Advanced Configuration Options in Azure Synapse Link

4. Synapse Link for Dataverse uses passthrough authentication using ACLs in Azure Data Lake - no support for SQL authentication

Since all the data for the tables in Azure Synapse Link for Dataverse is stored as CSV files in Azure Data Lake Storage, security needs to be set at the level of the files in Azure Data Lake Storage Gen2. There is no support for SQL authentication in the lake database which is created by Azure Synapse Link for Dataverse.

References:

Wednesday, November 22, 2023

Near real-time and snapshots in Azure Synapse Link for Dataverse

The Azure Synapse Link for Dataverse documentation contains a section about Access near real-time data and read-only snapshot data, but it does not really explain why you would want to use one or the other.

When you open an Azure Synapse SQL Serverless LakeDB in SQL Server Management Studio you see a clear distinction between the two versions of the table data, whereas in Azure Synapse Studio there is no obvious distinction besides the name - you will see the "account" table and the "account_partitioned" view:

Near real-time data: external tables on top of all the underlying CSV files exported by the Azure Synapse Link for Dataverse sync engine. There is a soft SLA for the data to be present in these tables within 15 minutes.
Snapshot data/partitioned views: views on top of the near real-time data which are updated on an hourly interval.

In most scenarios, it is best to run queries against these partitioned views, since you will avoid read conflicts and you are sure that a full transaction has been written to the CSV files in Azure Data Lake Storage.
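A small Python sketch of that naming convention: given a Dataverse table name, it builds a query against either the near real-time table or the hourly snapshot view. The "_partitioned" suffix comes from the "account_partitioned" example above; the helper names and the column names in the usage example are hypothetical.

```python
def dataverse_object_name(table, prefer_snapshot=True):
    """Return the lake database object to query for a Dataverse table.

    prefer_snapshot=True  -> hourly snapshot view (e.g. account_partitioned)
    prefer_snapshot=False -> near real-time external table (e.g. account)
    """
    return f"{table}_partitioned" if prefer_snapshot else table


def build_select(table, columns, prefer_snapshot=True):
    """Build a simple SELECT statement against the chosen object."""
    column_list = ", ".join(f"[{column}]" for column in columns)
    object_name = dataverse_object_name(table, prefer_snapshot)
    return f"SELECT {column_list} FROM [{object_name}]"


# Default: query the snapshot view to avoid read conflicts.
print(build_select("account", ["accountid", "name"]))
# Opt in to near real-time data only when freshness matters more.
print(build_select("account", ["accountid", "name"], prefer_snapshot=False))
```

Defaulting to the snapshot view keeps the safer choice as the path of least resistance; near real-time access then becomes an explicit decision per query.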

A typical exception that you might receive when doing queries directly against the "tables" is "https://[datalakestoragegen2name].dfs.core.windows.net/[lakedbname]/[tablename]/Snapshot/2023-05_1684234580/2023-05.csv does not exist or you don't have file access rights" but this also depends on your specific context. If you have a lot of creates, updates or deletes on Dataverse tables this might happen more regularly. Even though the partitioned views are updated on an hourly basis, it might be that the Synapse Link engine is refreshing the views at the moment you perform a query, which will give you a similar exception, but the chances of this occurring are much lower.
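Because this kind of failure is transient, one way to cope with it on the client side is to retry the query after a short pause. The sketch below is a generic retry helper with exponential back-off, not something Synapse Link provides; run_query stands for whatever callable executes your query, and the attempt count and delays are arbitrary assumptions.

```python
import time


def run_with_retry(run_query, attempts=3, base_delay_seconds=1.0, sleep=time.sleep):
    """Call run_query() and retry on failure with exponential back-off.

    run_query is any zero-argument callable that executes the query.
    The sleep parameter exists so the waiting can be replaced in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return run_query()
        except Exception as error:  # narrow this to your driver's error type
            last_error = error
            if attempt < attempts - 1:
                # wait 1s, 2s, 4s, ... before the next attempt
                sleep(base_delay_seconds * (2 ** attempt))
    raise last_error
```

In real code you would only retry on the specific "file does not exist" error raised by your database driver, so that genuine permission problems still fail fast.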

You can check the last sync timestamp and sync status in the Power Platform maker portal (see screenshot below)

For the moment, you will also have to manually check the monitoring page (which can be quite tedious if you have a lot of environments). There is an item in the Microsoft release planner, "Receive notifications about the state of Azure Synapse Link for Dataverse", which is apparently in public preview, but I haven't seen it in the environments I have access to (neither in https://make.powerapps.com nor in https://make.preview.powerapps.com/).

It is also not easy to see if something went wrong with the refresh of the partitioned views - up until now the easiest way to find out is running a SQL query against the LakeDB: select name, create_date from sys.views order by create_date desc
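If you want to automate that check, the rows returned by the sys.views query can be compared against the current time. The Python sketch below assumes that a successful refresh recreates the view, so that create_date reflects the last refresh; the two-hour threshold and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta


def find_stale_views(rows, now, max_age=timedelta(hours=2)):
    """Return the names of views that were last (re)created too long ago.

    rows: iterable of (view name, create_date) tuples, i.e. the result of
          select name, create_date from sys.views
    now:  the reference time, in the same time zone as create_date
    """
    return sorted(name for name, created in rows if now - created > max_age)


# Hypothetical query result: one view refreshed recently, one lagging behind.
sample_rows = [
    ("account_partitioned", datetime(2023, 11, 22, 11, 30)),
    ("contact_partitioned", datetime(2023, 11, 22, 8, 0)),
]
stale = find_stale_views(sample_rows, now=datetime(2023, 11, 22, 12, 0))
print(stale)
```

Using a threshold of two refresh intervals rather than one avoids false alarms when a refresh is merely a little late.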

Friday, April 28, 2023

Azure Synapse Link for Dataverse playlist by Scott Sewell

A great starting point if you are new to Azure Synapse Link for Dataverse is Scott Sewell's YouTube playlist on Azure Synapse Link for Dataverse

Wednesday, November 16, 2022

Visualize S&P 500 data in Power BI using Azure Synapse Serverless SQL Pool

In Explore and analyze stock ticker data in Azure data lake with Azure Synapse serverless SQL Pool, I showed how you can download stock ticker data from Yahoo Finance, store it in Azure Data Lake and retrieve the data using standard T-SQL in Azure Synapse Studio. In this post, I will show how easy it is to consume the data from Synapse SQL Serverless using Power BI.

For the standard visual with the evolution of the S&P 500 closing price, I connected directly to the SP500 external table in Synapse SQL. You can connect to Synapse SQL Serverless using either the Azure SQL Database or the Azure Synapse Analytics SQL connector, and you will need to enter the Serverless SQL endpoint, which looks something like this: <yoursynapse>-ondemand.sql.azuresynapse.net

With the second report I want to visualize the S&P 500 yearly return and the average return since December 1927. To make it easier, I created a separate view on top of the external table which calculates the yearly returns.

As you can see from the visual, returns can vary quite a lot both on the negative side and on the positive side - in the last 20 years, there was a huge drop in 2008 (-38%) and this year is not looking great either (-22%), but 2013, 2019 and 2021 all had returns above 20%. On average the S&P 500 returned 7% per year (dividends not included).
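The yearly return calculation itself is simple: take the last closing price of each year and compare it with the last closing price of the previous year. The Python sketch below illustrates this on made-up prices; it is not the SQL view used for the report, and it assumes the input is sorted by date and contains every year in the range.

```python
def yearly_returns(closes):
    """Return {year: return} from (iso_date, close) rows sorted by date.

    The return for a year is last close of that year divided by the
    last close of the previous year, minus 1 (dividends are ignored).
    """
    last_close = {}
    for iso_date, close in closes:
        # later rows overwrite earlier ones, leaving the year's final close
        last_close[int(iso_date[:4])] = close
    years = sorted(last_close)
    return {
        year: last_close[year] / last_close[previous] - 1.0
        for previous, year in zip(years, years[1:])
    }


# Made-up closing prices, purely for illustration.
sample_closes = [
    ("2020-12-31", 100.0),
    ("2021-06-30", 110.0),
    ("2021-12-31", 125.0),
    ("2022-12-30", 100.0),
]
returns = yearly_returns(sample_closes)
average_return = sum(returns.values()) / len(returns)
print(returns, average_return)
```

Note that the average here is a plain arithmetic mean of the yearly figures, which is not the same as the compound annual growth rate over the whole period.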

For the last visual in the Power BI report, I wanted to show a histogram with the S&P 500 yearly returns. I based myself on Power BI Histogram example using DAX, since Power BI does not have a standard histogram and I did not want to use a custom visual (I have used Power BI custom visuals from Pragmatic Works in the past).

Equity returns roughly follow a normal distribution or "bell curve", meaning that most values cluster near the central peak and values farther from the average are less common. Stock returns however have fat tails - meaning that occurrences at the extremes are far more common than expected in a normal distribution. The Great Depression (1931) and the Global Financial Crisis (2008) led to two of the largest stock market losses of the S&P 500. With a loss between -20% and -30% this year, we are in the same category/bin as 1930, 1974 and 2002.
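To make the binning concrete, here is a minimal Python sketch that assigns a yearly return to a 10% wide bin and counts the years per bin. It is an illustration of the idea rather than the DAX used in the report; the -38% and -22% values come from the text above, the other values are made up.

```python
from collections import Counter


def bin_label(yearly_return, width_pct=10):
    """Return the label of the bin a yearly return (as a fraction) falls in.

    Bins are closed on the lower bound, so -0.20 falls in "-20% to -10%".
    Integer arithmetic on basis points avoids floating point edge cases.
    """
    basis_points = round(yearly_return * 10_000)
    lower = (basis_points // (width_pct * 100)) * width_pct
    return f"{lower:+d}% to {lower + width_pct:+d}%"


# -0.38 and -0.22 are mentioned in the post; the rest is made-up sample data.
sample_returns = [-0.38, -0.22, 0.25, 0.07, 0.12, -0.05]
histogram = Counter(bin_label(value) for value in sample_returns)
for label, count in sorted(histogram.items()):
    print(label, count)
```

With these bins a -22% year lands in the "-30% to -20%" bucket, which is the category referred to above.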

You can download the synapsestockdemo.pbix file and the benchmark.csv file from my Power BI repo on GitHub

References:

Sunday, September 18, 2022

Speaking engagements in coming months

With all Covid bans lifted and summer holidays well over, the conference season kicks off.

I will be speaking at a couple of events in the coming weeks and months:

Dataminds evening session. Upcoming in-person event on September 29th organized by dataMinds.be at the Inetum-Realdolmen offices in Kontich, together with Benni De Jagere. First session a little bit off the beaten track for data professionals: #dataviz for investors. Second session: #PowerBI roadmap and #AMA by Benni.
Collabdays Belgium 2022. Free community-driven event in Brussels, Belgium. Focus is Microsoft 365 with some Power Platform and Azure sprinkled on top. I am particularly excited to be speaking at this conference which was born out of the SharePoint Saturday conferences which I helped organize many years ago. I will be delivering Dataverse Deep Dive: watch out for sharks.
Cloudbrew 2022. A two-day conference focusing on all things Azure on November 18-19 in Mechelen, Belgium. I will be delivering Using Python and Azure Cloud for trading and investing.

Tuesday, August 02, 2022

Explore and analyze stock ticker data in Azure data lake with Azure Synapse serverless SQL Pool

In this walkthrough, I will show how you can perform exploratory data analysis on stock market data using Azure Synapse serverless SQL pools. To simplify things I will just focus on daily quotes for the S&P 500.

The S&P 500 (short for Standard & Poor's 500) tracks the performance of 500 large companies listed on exchanges in the United States. The composition of the S&P 500 is typically rebalanced four times per year. The S&P 500 is a capitalization-weighted index, meaning that the stocks with a higher market capitalization have a bigger impact on the changes in the index (see Top 10 S&P 500 stocks by index weight).
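A toy Python example shows what capitalization weighting means in practice. The three companies and their market caps are made up, and the real index methodology is more involved (it uses float-adjusted market caps and an index divisor), so treat this as a simplified sketch.

```python
def index_weights(market_caps):
    """Return each constituent's weight as its share of total market cap."""
    total = sum(market_caps.values())
    return {ticker: cap / total for ticker, cap in market_caps.items()}


def index_change(market_caps, price_changes):
    """Return the index move as the weighted sum of constituent moves."""
    weights = index_weights(market_caps)
    return sum(weights[ticker] * price_changes[ticker] for ticker in market_caps)


# Hypothetical market caps (in billions) for a three-stock index.
caps = {"BIG": 800.0, "MID": 150.0, "SMALL": 50.0}

# The same +10% move has a very different effect depending on who makes it.
big_moves = index_change(caps, {"BIG": 0.10, "MID": 0.0, "SMALL": 0.0})
small_moves = index_change(caps, {"BIG": 0.0, "MID": 0.0, "SMALL": 0.10})
print(big_moves, small_moves)
```

In this toy index a 10% rise in the largest stock moves the index by 8%, while the same rise in the smallest stock moves it by only 0.5%.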

I downloaded all daily data for the S&P 500 stock market index (ticker symbol is ^GSPC) from Yahoo Finance using the historical data tab in CSV format. The S&P CSV file contains the date, open, high, low, close, volume, dividends and stock splits for the S&P 500 from December 1927 (but the index in its current form was only created in 1957) until now (dividends and stock splits are not relevant). I manually downloaded the file but take a look at Using Python and Pandas Datareader to retrieve financial data part 3: Yahoo Finance and Using the yFinance Python package to download financial data from Yahoo Finance for ways to automate retrieving data from Yahoo Finance using Python.

Serverless SQL Pools in Azure Synapse

Serverless SQL Pool is an auto-scaling SQL query engine that is built into Azure Synapse - as the term serverless indicates, you don't need to worry about provisioning underlying hardware or software resources. Serverless SQL Pool uses a pay-per-use model, so you are only charged for the data processed by the queries you run. Like Synapse dedicated SQL pool, serverless SQL pool also distributes processing across multiple nodes using a scale-out architecture (check out the Microsoft research publication Polaris: the distributed SQL engine in Azure Synapse for an in-depth discussion).

Synapse Serverless SQL enables you to query external data stored in Azure Storage (including Data Lake Gen1 and Data Lake Gen2), Cosmos DB and Dataverse. The data remains stored in Azure storage in a supported file format (CSV, JSON, Parquet or Delta) and query processing is handled by the Synapse SQL engine.

Walkthrough: analyzing S&P 500 data with Synapse serverless SQL

In this post I will not show you how to set up Azure Synapse - take a look at Quickstart: Create a Synapse Workspace for a detailed walkthrough - the Microsoft Learn learning paths which I added in the references are also quite useful.

In this post, I will be primarily using SQL to analyze the data but this is a matter of preference (having a coding background I prefer Python to do exploratory data analysis)

After you downloaded the data you will need to upload the CSV file to the Azure data lake storage associated with Synapse Link (you can also use a different Azure storage).

The OpenRowset (Bulk..) function allows you to access files in Azure storage. The SP500.csv file has a header row specifying the different columns in use - it contains all daily ticker data since December 1927. I am using Parser_Version 2.0 since it is more performant but it has some limitations (see the Arguments section in Microsoft's OpenRowSet documentation) - also check out How do Synapse serverless SQL pools deal with different file schemas (or schema evolution) part 1 CSV for some interesting info on how schema changes are handled.
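As an illustration, the Python sketch below builds such an OPENROWSET query as a string that you could paste into Synapse Studio. The storage URL in the usage example is a placeholder, the helper name is hypothetical, and HEADER_ROW = TRUE is the option I would use with parser version 2.0 to treat the first row as column names - double check the options against the OPENROWSET documentation for your scenario.

```python
def openrowset_csv_query(file_url, top=10):
    """Build a serverless SQL query that reads a CSV file with a header row."""
    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{file_url}',\n"
        "    FORMAT = 'CSV',\n"
        "    PARSER_VERSION = '2.0',\n"
        "    HEADER_ROW = TRUE\n"
        ") AS [rows]"
    )


# Placeholder URL: replace the storage account and container with your own.
query = openrowset_csv_query(
    "https://<storageaccount>.dfs.core.windows.net/<container>/SP500.csv", top=5
)
print(query)
```

Keeping the query in a small helper makes it easy to point the same statement at a different file while you are exploring.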

If you will be using the data quite frequently, it might make more sense to use a CETAS process (CREATE EXTERNAL TABLE AS SELECT) to generate a dataset pointing to the data residing in the data lake ready for querying. In the Synapse Studio data hub, you can simply right click on a file and select the option to create an external table.

Next, select the database and the name of the table. You will need to create the external table by selecting "Use SQL Script" since you will need to adapt the script to skip the header row for reading data. For CSV files you have the option to infer column names.

You will need to modify the generated script for creating the external file format so that it skips the header row. You are still able to modify the database in which you want to create the external table (1), and I added a line to indicate that the external file contains a header row, so reading data should start on row 2 (2). Once you understand the script, it is also possible to modify it to use wildcards, so that you can read from multiple files in multiple folders.

Now let's try out some queries in Azure Synapse Studio:

Let's get all closing prices for this century ([date]> '2020-01-01') - you will notice that you can also visualize the data using some basic graphs.
Which were the years with the largest percentage difference between the highest and lowest close for the S&P 500? No surprises here - we have the Wall Street crash of 1929 followed by the Great Depression of the 1930s, the Financial Crisis of 2007-2008 and the Covid crash in 2020 in the top 10
Which were the days with the largest difference between the day's closing price and the previous closing price - so the days on which the market crashed? In this example I used the SQL Lag() function. Besides the 1930s we also see Black Monday with a 20% decline in the S&P 500 - this triggered a global sell-off (take a look at this Black Monday documentary (YouTube) with traders actually still working on the market floor).
You can also use common table expressions (CTE) for working with temporary named result sets for more complex queries and data manipulations. In the example below I want to find the 3-day trend for the S&P 500 (see Introduction to the SQL With clause if you are new to CTEs). The idea behind this query is to create a three-day trend variable for any given row. If the closing price on a day is greater than the closing price on the previous day, that day is assigned +1, otherwise it is assigned -1 (minx_close columns). If the majority in the previous 3 days consists of positive values, the trend is positive, otherwise the trend is negative. (Example taken from Coursera: Introduction to Trading, Machine Learning & GCP)
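For readers who prefer Python, the logic behind the last two queries can be sketched as follows. This mirrors the idea of the SQL (comparing each close with the previous one, then taking a majority vote over three days) on made-up prices; it is not the original query, and it assumes the three-day window consists of the current day and the two days before it.

```python
def daily_changes(closes):
    """Return [(date, pct change vs previous close)], like LAG(close) in SQL.

    closes: list of (iso_date, close) tuples sorted by date.
    """
    return [
        (date, close / previous_close - 1.0)
        for (_, previous_close), (date, close) in zip(closes, closes[1:])
    ]


def three_day_trend(closes):
    """Return [(date, trend)] where trend is +1 or -1.

    Each day gets +1 if it closed above the previous close, otherwise -1.
    The trend is the majority sign over the current and two previous days.
    """
    moves = [
        (date, 1 if close > previous_close else -1)
        for (_, previous_close), (date, close) in zip(closes, closes[1:])
    ]
    trends = []
    for i in range(2, len(moves)):
        # the sum of three +1/-1 values is odd, so it is never zero
        window_sum = sum(move for _, move in moves[i - 2 : i + 1])
        trends.append((moves[i][0], 1 if window_sum > 0 else -1))
    return trends


# Made-up closing prices, purely for illustration.
sample_closes = [
    ("2022-08-01", 100.0),
    ("2022-08-02", 101.0),
    ("2022-08-03", 102.0),
    ("2022-08-04", 101.0),
    ("2022-08-05", 100.0),
    ("2022-08-08", 99.0),
]
worst_day = min(daily_changes(sample_closes), key=lambda row: row[1])
trend = three_day_trend(sample_closes)
print(worst_day, trend)
```

The first three rows produce no trend value, because a full three-day window of moves is only available from the fourth row onwards.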

As seen in this post, Synapse serverless SQL is quite useful for data professionals in different situations. Data engineers can use it to explore data in the data lake and optimize data transformations, while data scientists and data analysts can use it to quickly carry out exploratory data analysis (EDA). Take a look at Build data analytics solutions using Azure Synapse serverless SQL pools (Microsoft Learn) if you want to learn more. In an upcoming post I will show how easy it is to consume the data from Azure Synapse SQL Serverless in Power BI.

References:

Wednesday, March 23, 2022

Recreating an Azure Synapse Link for Dataverse connection

If you encounter an exception during the initial setup of Azure Synapse Link for Dataverse, it is best that you check in Azure Synapse workspace whether the lake database was only partially created. When you want to retry the configuration, you will first need to remove the lake database.

Previously you had to manually write a script, but Microsoft has now added a handy delete button which will generate an Azure Synapse Analytics notebook for you. To be able to run the script you will however need to set up a serverless Apache Spark pool.

The smallest default configuration (4 vCores/32 GB) is sufficient to run this notebook - double check the pause settings of the Spark pool after the initial setup, or just delete the pool to save costs if you don't expect to need it anymore afterwards.

Other blog posts on Azure Synapse and Dataverse:

Getting started with Azure Synapse Analytics on demand webinars

Friday, January 22, 2021

Getting started with Azure Synapse Analytics on demand webinars

Mid December 2020, Microsoft organized some webinars to get Microsoft partners up to speed with Azure Synapse Analytics. All these webinars are now available on demand on the Microsoft events page - just search for synapse and add the filters "category: on demand" and "events for: Partners"

On demand webinars for Microsoft Partners - Introduction to Azure Synapse Analytics

Session 1: Introduction to Azure Synapse Analytics: Studio and orchestration
Session 2: Introduction to Azure Synapse Analytics: Azure Synapse Analytics - Synapse and SQL Serverless
Session 3: Introduction to Azure Synapse Analytics: Apache Spark and Synapse Workspace security
Session 4: Introduction to Azure Synapse Analytics: Azure Synapse Analytics Security, Metastore and Synapse link for Cosmos DB
Session 5: Introduction to Azure Synapse Analytics: Azure Synapse Analytics - Power BI and Azure Synapse Analytics Integration