Azure Purview–notes

Reading notes

Where exactly in your organization is the data you are looking for? In any large enterprise, finding information becomes a familiar fiasco when an employee resigns, because the ‘catalog’ knowledge resides with a few people in the organization, which creates a dependency on them.

Here are my reading notes on Azure Purview:

Azure Purview provides:

Unified data governance service
Manage and govern
     – on-premises,
     – multi-cloud, and
     – SaaS data

Create a
     – holistic,
     – up-to-date map of your data landscape
with
     – automated data discovery,
     – sensitive data classification,
     – and end-to-end data lineage.

Unified Map
  – Automate and Manage metadata from hybrid sources
  – Classify data using built-in and custom classifiers
  – Label sensitive data
  – Integrate all your data systems using Apache Atlas API
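
To make the idea of a custom classifier concrete for myself, here is a minimal Python sketch of a regex rule with a match threshold. It only illustrates the concept: the function name, the employee ID pattern, and the 60% threshold are my own assumptions, and this is not how Purview implements classification internally.

```python
import re

def classify_column(values, pattern, min_match_ratio=0.6):
    """Return True when enough non-empty values fully match the regex."""
    regex = re.compile(pattern)
    non_empty = [v for v in values if v]
    if not non_empty:
        return False
    matches = sum(1 for v in non_empty if regex.fullmatch(v))
    return matches / len(non_empty) >= min_match_ratio

# Hypothetical custom rule: employee IDs such as "EMP-00123".
EMPLOYEE_ID_PATTERN = r"EMP-\d{5}"

# Three of the four sampled values match, which clears the 60% threshold.
sample_column = ["EMP-00123", "EMP-00456", "n/a", "EMP-00789"]
is_employee_id = classify_column(sample_column, EMPLOYEE_ID_PATTERN)
```

The threshold is there so that a few dirty values in a column do not stop it from being classified.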

Catalogue Insights:
– Asset Insights
– Glossary Insights
– Scan Insights
– Classification Insights
– Sensitive Label Insights

Docs
https://docs.microsoft.com/en-gb/azure/purview/overview

Supported data sources
https://docs.microsoft.com/en-us/azure/purview/purview-connector-overview

Pricing
https://azure.microsoft.com/en-in/pricing/details/azure-purview/

Questions for planning:

Scenarios:
Persona – Who are the users?
Source system – What are the data sources such as Azure Data Lake Storage Gen2 or Azure SQL Database?
Impact Area – What is the category of this scenario?
Detail scenarios – How do the users use Purview to solve problems?
Expected outcome – What are the success criteria?

Deployment:
What are the main organization data sources and data systems?
For data sources that are not supported yet by Purview, what are my options?
How many Purview instances do we need?
Who are the users?
Who can scan new data sources?
Who can modify content inside of Purview?
What process can I use to improve the data quality in Purview?
How to bootstrap the platform with existing critical assets, glossary terms, and contacts?
How to integrate with existing systems?
How to gather feedback and build a sustainable process?

Ref: https://docs.microsoft.com/en-gb/azure/purview/deployment-best-practices

SSAS: Dimension Relationships in Cubes

“Dimension relationship” refers to the direct or indirect relationship between a dimension and the measure groups in a cube.

Regular – A standard relationship, where a key column in the dimension is joined directly to the fact table.
Reference – A key column in the dimension is joined indirectly to the fact table through another dimension.
Fact / Degenerate – The dimension is constructed from attribute columns in the fact table rather than from attribute columns in a dimension table.
Many-to-Many – The dimension is joined to the fact table through an intermediate fact table, so a single fact can be associated with multiple dimension members.
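
To see how the relationship types differ, here is a small Python sketch that mimics them with plain dictionaries and lists. The tables, keys, and values are all hypothetical, and the code only imitates the join paths; it is not how SSAS processes a cube.

```python
# Hypothetical miniature star schema held in plain Python structures.
customers = {1: {"name": "Asha", "geography_key": 10},
             2: {"name": "Ben", "geography_key": 20}}
geographies = {10: "India", 20: "UK"}   # reachable only through customers
reasons = {100: "Price", 200: "Quality"}

# Fact table. order_number lives in the fact rows themselves, which is
# what a fact (degenerate) dimension is built from.
sales = [
    {"order_number": "SO1", "customer_key": 1, "amount": 50},
    {"order_number": "SO2", "customer_key": 2, "amount": 30},
]

# Intermediate fact table linking orders to sales reasons.
sales_reasons = [("SO1", 100), ("SO1", 200), ("SO2", 100)]

def amount_by_customer():
    """Regular: the fact row carries the dimension key directly."""
    totals = {}
    for row in sales:
        name = customers[row["customer_key"]]["name"]
        totals[name] = totals.get(name, 0) + row["amount"]
    return totals

def amount_by_country():
    """Reference: fact -> customer -> geography."""
    totals = {}
    for row in sales:
        geography_key = customers[row["customer_key"]]["geography_key"]
        country = geographies[geography_key]
        totals[country] = totals.get(country, 0) + row["amount"]
    return totals

def amount_by_reason():
    """Many-to-many: fact -> intermediate fact table -> reason."""
    amounts = {row["order_number"]: row["amount"] for row in sales}
    totals = {}
    for order_number, reason_key in sales_reasons:
        reason = reasons[reason_key]
        totals[reason] = totals.get(reason, 0) + amounts[order_number]
    return totals
```

Note that in the many-to-many case order SO1 is counted under both "Price" and "Quality", so the per-reason amounts add up to more than the total sales amount.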

Read more: https://docs.microsoft.com/en-us/sql/analysis-services/multidimensional-models-olap-logical-cube-objects/dimension-relationships?view=sql-server-2017

Note: My study notes

Getting started with Azure Databricks

Introduction

What is Azure Databricks?

Azure Databricks is the Databricks platform offered as a managed service on Azure. This managed service allows data scientists, developers, and analysts to create, analyse, and visualize data science projects in the cloud.

Databricks is a user-friendly analytics platform built on top of Apache Spark. Databricks acts as a UI layer, a WYSIWYG dashboard where you can create clusters, manage notebooks, write code, and analyse data without knowing the internals of the system. Apache Spark is a unified analytics engine for large-scale data processing, and it currently supports popular languages such as Python, Scala, SQL, and R.

About the article

If you already know Databricks, you do not need a tutorial to get started, because Azure Databricks uses the same management portal as Databricks.

Though there are different possible strategies for creating and managing Databricks projects, I have followed the flow below in this article:

image

Screenshots and steps provided in this article are valid as of 20 Sept 2018. Technology advances quickly, and so do the Azure portal upgrades, so please be aware that the portal flow may have changed when you try this out. I will try to keep this tutorial up to date.

Login to Azure Portal

You need at least a trial account to get started. Visit the Azure home page to get one – https://azure.microsoft.com/

Step 1: Create your first Databricks workspace

The first step in creating a Databricks project is to create a workspace.

The typical steps are to click “+ Create a resource” → “Analytics” → “Azure Databricks”.

image

In the workspace creation wizard, you will have to provide the details below:

A. Workspace name: Give a unique name (retry until you get a green tick mark at the right; a red X mark means someone has already taken that name).

B. Subscription: Choose an appropriate subscription plan, or leave the default value if you do not know what this is about

C. Resource Group: Choose an existing resource group, or give a new one. (Provide a new name if you do not know what this box is about)

D. Location: This is the data center. Select your nearest location in the dropdown, or keep the default

E. Pricing Tier: This is about cost, so be careful. I would go with the free trial if I were doing this for learning purposes. You can read more about the pricing tiers here.

image

Click the “Create” button and wait until the workspace is created. This will take a couple of minutes, and you will get a notification once it is completed.

image

Once the workspace is created, you can go to “All resources” and click your newly created workspace name in the list.

image

The resource dashboard will look like this:

image

Now it is time for some action. Click “Launch Workspace” button, and you will be directed to a new browser page. You will be signed into the portal automatically.

Your Azure Databricks journey starts here.

image

From here, there are different strategies possible to execute projects. Since a full-fledged project which includes a meaningful data analysis is out of scope of this article, we will try out a simple example like querying a dataset or plotting a bar chart.

Let us load a dataset and visualize using a notebook.

For this purpose, I have downloaded a dataset from the internet, which is about the literacy rate in India. You may also download a freely available one, or create a dataset of your own. We are not going to do any complex analysis in this example, so this simple dataset is enough. Please note that the values in the dataset are not real values. My CSV file looks like this, with the first row as the header row.

image
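
If you would rather create a dataset of your own, a small Python sketch like the one below can generate a CSV with a header row. The column names, the file name, and the values are all made up for illustration; they are not the columns or values of my actual file.

```python
import csv
import io

# Hypothetical column names and made-up values, for illustration only.
ROWS = [
    ("State", "LiteracyRate"),   # header row
    ("StateA", 91.5),
    ("StateB", 78.2),
    ("StateC", 84.0),
]

def build_csv(rows):
    """Serialize the rows as CSV text, one line per row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

csv_text = build_csv(ROWS)

# To save the text as a file that can be uploaded later:
# with open("literacy.csv", "w", newline="") as f:
#     f.write(csv_text)
```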

Create Cluster

For storing and processing the data, we need some powerful machines. These are called clusters, and we will create one in this section.

On the dashboard, click on “New Cluster”.

I am giving the cluster the name “MyFirstCluster”. If you are already comfortable with the Azure portal, you will know most of the input parameters on the page. If you are a beginner, I suggest you leave all the other settings as they are and click the “Create Cluster” button to proceed.

image

It will take some time to complete the cluster creation. For me it took about 5-10 minutes. You can see the status of the cluster creation on the next screen.

image

Once the cluster is created, the status will change from “Pending” to “Running”.

image

Once the cluster is created, we are ready to upload data or create notebooks. Let us upload the data first.

Upload data

Upload the already prepared/downloaded dataset to the newly created cluster.

Go back to the dashboard and click “Upload Data”.

image

In the next screen, give the dataset a name and upload the dataset. In my case I am using a CSV file with some 35 rows. Your dataset can be a bigger one but note that depending on the size of the dataset the upload and processing can take more time.

image

Once upload is completed, you can create the Notebook.

Create Notebook

A Notebook in this context is an interactive web-based editor which allows data scientists, analysts, and developers to write scripts and notes, collaborate on them, and analyse and visualize data.

You can either create the Notebook by clicking “Create Table” in the Dashboard screen, or as the continuation of the last step. When you click the “Create Table in Notebook” button in the above screen, the Databricks service will create a sample notebook for you with sufficient sample code, with Python as the default language.

image

Make sure that you have the cluster attached to this notebook. If you see a “Detached” status at the top left, choose a cluster by clicking on the “Detached” text. Without a cluster, you cannot run the scripts.

image

Now it is time to test the script. You can see the sample Python scripts in various script boxes on the page. You can click on the play button at the top right of any script snippet box:

image

You should be able to see the script getting executed, and the result will be displayed below it in the form of a table. If there are errors, you will be given error messages which you can use to debug the script.

image
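
For reference, the kind of code that reads the uploaded CSV into a table looks roughly like the sketch below. This is my own minimal version and not the exact sample that Databricks generates. The function name and the file path are hypothetical; `spark` and `display` are objects that a Databricks notebook provides for you.

```python
def load_csv_with_header(spark, path):
    """Read a CSV file into a Spark DataFrame, treating row one as the header."""
    return (
        spark.read.format("csv")
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark guess the column types
        .load(path)
    )

# Inside a Databricks notebook, where `spark` and `display` are predefined:
# df = load_csv_with_header(spark, "/FileStore/tables/literacy.csv")  # hypothetical path
# display(df)  # renders the DataFrame as a table below the cell
```

Passing `spark` in as a parameter keeps the function easy to reuse in any cell of the notebook.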

Now it is your time for experimenting and more learning.

As a bonus, let us see how to visualize the same data using a bar chart. Click on the bar chart icon. If you do not see any charts auto generated, then click “Plot Options” and play around with the parameters.

image

image

Click “Apply”, and now you can see the bar chart updated in the Notebook.

image

Happy Learning!

References:

  1. https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
  2. https://databricks.com/
  3. http://spark.apache.org/

Change Server Mode in SSAS

You can easily change the Analysis Services mode from Multidimensional Mode to Tabular Mode, or vice versa, by following the steps below. I did this on SQL Server Analysis Services 2016.

Step 1: Edit msmdsrv.ini file

Go to folder X:\Program Files\Microsoft SQL Server\MSAS13.MSSQLSERVER\OLAP\Config

Replace X: with your installation drive.

Open the file msmdsrv.ini in Notepad.

It is recommended to take a backup first. This folder may require additional permissions to edit, so I would suggest opening the file “as Administrator”.

Find the tag <DeploymentMode>0</DeploymentMode>

If your current mode is Multidimensional, the DeploymentMode value will be 0; if it is Tabular, the value will be 2. Change it to 0 or 2 as per your requirement.
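
If you prefer to script this edit, the Python sketch below backs up the file and then rewrites the tag. The function names are hypothetical, and I am assuming the file can be read as UTF-8 text and contains exactly one DeploymentMode tag; check both on your own server before relying on it.

```python
import re
import shutil

# DeploymentMode values as described above.
MODES = {"multidimensional": 0, "tabular": 2}

def set_deployment_mode(ini_text, mode):
    """Return the file text with the DeploymentMode tag set to the given mode."""
    value = MODES[mode]
    new_text, count = re.subn(
        r"<DeploymentMode>\s*\d+\s*</DeploymentMode>",
        f"<DeploymentMode>{value}</DeploymentMode>",
        ini_text,
    )
    if count != 1:
        raise ValueError(f"expected one DeploymentMode tag, found {count}")
    return new_text

def update_ini_file(path, mode):
    """Back up msmdsrv.ini next to the original, then rewrite it in place."""
    shutil.copy2(path, path + ".bak")
    # Assumption: the file is UTF-8 text; newline="" keeps line endings as-is.
    with open(path, "r", encoding="utf-8", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(set_deployment_mode(text, mode))

# Run from an elevated prompt, with ini_path pointing at the file from Step 1:
# update_ini_file(ini_path, "tabular")
```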

Step 2: Restart SSAS

Open SQL Server Configuration Manager from the Start menu, right-click “SQL Server Analysis Services”, and click “Restart” in the context menu.
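
As an alternative to the Configuration Manager, the restart can be scripted. The sketch below is a minimal Python version that calls the Windows `net stop` and `net start` commands. The function names are my own, and I am assuming the usual service names: MSSQLServerOLAPService for a default instance and MSOLAP$ followed by the instance name for a named one. Verify the name on your server first.

```python
import subprocess

def ssas_service_name(instance=None):
    """Windows service name for a default or named SSAS instance (assumed)."""
    if instance is None:
        return "MSSQLServerOLAPService"
    return f"MSOLAP${instance}"

def restart_commands(instance=None):
    """Build the stop and start commands without running them."""
    name = ssas_service_name(instance)
    return [["net", "stop", name], ["net", "start", name]]

def restart_ssas(instance=None):
    """Stop and then start the service; needs an elevated prompt on Windows."""
    for command in restart_commands(instance):
        subprocess.run(command, check=True)

# restart_ssas()  # restarts the default instance
```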

Finished! Try connecting to SSAS instance in SSMS to test.

 

OLAP vs ROLAP vs MOLAP vs HOLAP vs DOLAP vs WOLAP

  • OLAP – OnLine Analytical Processing
  • ROLAP – Relational OnLine Analytical Processing
  • MOLAP – Multidimensional OnLine Analytical Processing
  • HOLAP – Hybrid OnLine Analytical Processing
  • DOLAP – Desktop/Database OnLine Analytical Processing
  • WOLAP – Web Enabled OnLine Analytical Processing

Read more here – https://social.technet.microsoft.com/wiki/contents/articles/19898.differences-between-olap-rolap-molap-and-holap.aspx