This week I got an opportunity to be part of the program “Analytics In A Day”, organized by my employer, Orion Innovation. This is a full-day workshop from Microsoft, delivered through its partners, and the target audience is usually technology leaders, architects, managers and developers. I have worked with various components related to Azure Synapse Analytics individually, such as ADF (Azure Data Factory), Azure SQL Data Warehouse, Databricks and Data Lake, but it is a very nice experience to work with a unified platform that provides seamless integration. In other words, each stakeholder (data scientists, analysts, architects, business users and IT staff) only has to bother about the area they really need to bother about. For example, a data scientist can focus on data and models without worrying about how to bring the data into the tool. The program starts with data sources and covers ingestion, processing, storage, machine learning and visualization.
I previously attended “DIAD” (Dashboard In A Day) as well, which is all about preparing reports and dashboards in the Power BI stack.
When it comes to costing and pricing on cloud platforms such as Azure or AWS, beginners in particular are often confused about what the various units mean. Let us have a look at some of the common unit terminologies.
- SKU – Stock Keeping Unit – A purchasable unit in a platform. Ref – https://en.wikipedia.org/wiki/Stock_keeping_unit
- ACU – Azure Compute Unit – A unit used to compare compute performance across Azure SKUs. Ref – https://docs.microsoft.com/en-us/azure/virtual-machines/acu
- TU – Transaction Unit – Usually 10K transactions = 1 Transaction Unit
- DTU – Database Transaction Unit – Ref – https://docs.microsoft.com/en-us/azure/azure-sql/database/purchasing-models#understanding-dtus
- eDTU – elastic DTU – Ref – https://docs.microsoft.com/en-us/azure/azure-sql/database/purchasing-models#dtu-based-purchasing-model
- RU – Request Unit – Ref – https://docs.microsoft.com/en-us/azure/cosmos-db/request-units
- DBCU – Databricks Commit Unit – Ref – https://azure.microsoft.com/en-in/pricing/details/databricks/
- DBU – Databricks Unit – Ref – https://docs.databricks.com/administration-guide/capacity-planning/cmbp.html
- DPU – Data Processing Unit – Ref – https://aws.amazon.com/glue/pricing/
This list will be updated regularly.
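To make the idea of a billing unit concrete, here is a minimal sketch that converts a raw transaction count into Transaction Units, using the "10K transactions = 1 Transaction Unit" rule of thumb from the list above. The function names and the price figure are my own assumptions for illustration; always check the actual pricing page of the service you use.

```python
# Rule of thumb from the list above: 10,000 transactions = 1 Transaction Unit.
TRANSACTIONS_PER_UNIT = 10_000


def transaction_units(transaction_count: int) -> float:
    """Convert a raw transaction count into Transaction Units."""
    return transaction_count / TRANSACTIONS_PER_UNIT


def estimated_cost(transaction_count: int, price_per_unit: float) -> float:
    """Estimate a charge. price_per_unit is a made-up figure, not a real price."""
    return transaction_units(transaction_count) * price_per_unit


if __name__ == "__main__":
    # 250,000 transactions at a hypothetical price of 0.05 per unit.
    print(transaction_units(250_000))
    print(estimated_cost(250_000, 0.05))
```

The same pattern (usage divided by unit size, multiplied by unit price) is a reasonable first approximation for many of the other units as well, although each service defines its unit differently.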
Registration & Agenda here.
Well, the question is slightly wrong until the context is specified, because it is possible to build a Modern Data Warehouse that includes Cosmos DB in the architecture. This is very relevant today because data is no longer just straightforward content with human-readable entities and relations (structured); it can be unstructured and/or streaming too. Also, the pace of the data flow and of business requirements is becoming near real-time.
See a reference architecture below:
In this blog, the context is the traditional data warehouse, where you model the data, specify relationships, and so on. Let us look at the definition of a data warehouse given in the Oracle docs:
“A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.”
Now let us ask the right question: why may Cosmos DB not be apt as the data store in a data warehouse? It is not apt because Cosmos DB is a NoSQL database, in which it is not easy to draw relationships between entities/tables/data. Check what an MSDN blog said about this:
“Cosmos DB is not a relational database. You cannot just take your relational database and expect it to run in Cosmos DB. You could move tables of data into Cosmos, but not the relational aspects of your existing data structures.”
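To illustrate the difference, here is a small sketch of my own (not taken from the Oracle or MSDN sources). The relational side uses Python's built-in sqlite3 module as a stand-in for a relational warehouse, and the document side uses a plain dictionary as a stand-in for a JSON document of the kind a NoSQL store holds. The table, column and field names are all hypothetical.

```python
import sqlite3

# Relational side: two tables linked by a foreign key, combined with a JOIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        amount REAL
    );
    INSERT INTO customer VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0);
""")
relational_total = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customer c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchone()

# Document side: the related orders are embedded inside the customer document,
# so there is no relationship for the database to enforce or join on.
customer_document = {
    "id": "1",
    "name": "Asha",
    "orders": [
        {"id": "10", "amount": 250.0},
        {"id": "11", "amount": 100.0},
    ],
}
document_total = (
    customer_document["name"],
    sum(order["amount"] for order in customer_document["orders"]),
)

print(relational_total)
print(document_total)
```

Both sides arrive at the same total, but in the document model the "relationship" lives in how you shaped the document, not in the database. That is why a relational schema cannot simply be moved over as it is.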
As of today, this is the conclusion. But we cannot say what will happen to these concepts tomorrow, because Cosmos DB is becoming more powerful, and I am already in love with it.
You can read about common scenarios (use cases) for Cosmos DB, and the companies that use it, here.
Do you have different thoughts on this? Please comment.
What is Azure Databricks?
Azure Databricks is the same Databricks platform, offered as a managed service on Azure. This managed service allows data scientists, developers and analysts to create, analyse and visualize data science projects in the cloud.
Databricks is a user-friendly analytics platform built on top of Apache Spark. Databricks acts as a UI layer, a WYSIWYG dashboard where you can create clusters, manage notebooks, write code and analyse data without knowing the internals of the system. Apache Spark is a unified analytics engine for large-scale data processing, and it currently supports popular languages such as Python, Scala, SQL and R.
About the article
If you know Databricks already, then a tutorial is not necessary to get started, because Azure Databricks uses the same management portal as Databricks.
Though different strategies are possible for creating and managing Databricks projects, I have followed the flow below in this article:
Screenshots and steps provided in this article are valid as of 20 Sept 2018. Technology advances at a fast pace, and so do the Azure portal upgrades, so please be aware of possible changes in the portal flow when you try this out. I will try to keep this tutorial up to date.
Login to Azure Portal
You need at least a trial account to get started. Visit the Azure home page to get one – https://azure.microsoft.com/
Step 1: Create your first Databricks workspace
The first step in creating a Databricks project is to create a workspace.
The typical steps are to click “+ Create a resource” → “Analytics” → “Azure Databricks”.
In the workspace creation wizard, you will have to provide the details below:
A. Workspace name: Give a unique name. Retry until you get a green tick mark at the right; a red X mark means someone has already taken your favourite name.
B. Subscription: Choose an appropriate subscription, or leave the default value if you do not know what this is about.
C. Resource Group: Choose an existing resource group, or create a new one. (Provide a new name if you do not know what this box is about.)
D. Location: This is the data center region. Select your nearest location in the dropdown, or keep the default.
E. Pricing Tier: This is about cost, so be careful. I would go with the free trial if I am doing this for learning purposes. You can read more about the pricing tiers here.
Click the “Create” button and wait until the workspace is created. This will take a couple of minutes, and you will get a notification once it is completed.
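If you prefer scripting over the portal, the same inputs the wizard asks for (name, resource group, location and pricing tier) can be passed to the Azure CLI. The sketch below only builds the command as a list of arguments and does not run anything. It assumes the Azure CLI with its `databricks` extension is installed, and the helper name and example values are hypothetical.

```python
from typing import List

# Pricing tier values accepted by the CLI's --sku option (assumed here).
ALLOWED_SKUS = {"standard", "premium", "trial"}


def workspace_create_command(
    name: str, resource_group: str, location: str, sku: str = "trial"
) -> List[str]:
    """Build an 'az databricks workspace create' command from the wizard inputs."""
    if sku not in ALLOWED_SKUS:
        raise ValueError(f"Unknown pricing tier: {sku}")
    return [
        "az", "databricks", "workspace", "create",
        "--name", name,                      # A. Workspace name
        "--resource-group", resource_group,  # C. Resource Group
        "--location", location,              # D. Location
        "--sku", sku,                        # E. Pricing Tier
    ]


# Example values are placeholders, not real resources.
command = workspace_create_command("my-first-workspace", "my-resource-group", "westus")
print(" ".join(command))
```

To actually create the workspace you would run the printed command in a terminal after `az login`, or pass the list to `subprocess.run`. The subscription (item B) defaults to whichever one is active in the CLI.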
Once the workspace is created, you can go to “All resources” and click your newly created workspace name in the list.
The resource dashboard will look like this:
Now it is time for some action. Click “Launch Workspace” button, and you will be directed to a new browser page. You will be signed into the portal automatically.
Your Azure Databricks journey starts here.
From here, there are different strategies possible to execute projects. Since a full-fledged project which includes a meaningful data analysis is out of scope of this article, we will try out a simple example like querying a dataset or plotting a bar chart.
Let us load a dataset and visualize using a notebook.
For this purpose, I have downloaded a dataset from the internet about the literacy rate in India. You may also download a freely available one, or create a dataset of your own. We are not going to do any complex analysis in this example, so this simple dataset is enough. Please note that the values in the dataset are not real values. My CSV file looks like this, with the first row as the header row.
For storing and processing the data, we need some powerful machines. These are called clusters, and we will create one in this section.
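If you want to create a small dataset of your own, a CSV file with a header row can be as simple as the sketch below. The column names and the values are made up for illustration and do not come from my actual file. The snippet also reads the text back with Python's csv module to confirm that the first row is treated as the header.

```python
import csv
import io

# Hypothetical sample data: the first row is the header row.
SAMPLE_CSV = """State,LiteracyRate
StateA,91.5
StateB,78.2
StateC,84.0
"""


def read_rows(text: str) -> list:
    """Parse CSV text, using the first row as column names."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"State": row["State"], "LiteracyRate": float(row["LiteracyRate"])}
        for row in reader
    ]


rows = read_rows(SAMPLE_CSV)
for row in rows:
    print(row["State"], row["LiteracyRate"])
```

Save text like `SAMPLE_CSV` to a `.csv` file and you have something you can upload in the later steps.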
On the dashboard, click on “New Cluster”
I am naming the cluster “MyFirstCluster”. If you are already familiar with the Azure portal, you will know most of the input parameters on the page. If you are a beginner, I suggest you leave all the other settings as they are and click the “Create Cluster” button to proceed.
It will take some time to complete the cluster creation. For me it took about 5-10 minutes. You can see the status of the cluster creation on the next screen.
Once the cluster is created, the status will change from “Pending” to “Running”
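If you ever automate this wait instead of watching the screen, the logic is a simple polling loop. The sketch below is a generic illustration: `get_state` is a hypothetical function you would supply yourself (for example, a wrapper around the Databricks REST API), and the state names are assumptions that mirror the “Pending” and “Running” labels shown in the portal.

```python
import time
from typing import Callable


def wait_until_running(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 40,
) -> bool:
    """Poll until the cluster reports RUNNING; return False if we give up."""
    for _ in range(max_polls):
        state = get_state()
        if state == "RUNNING":
            return True
        if state != "PENDING":
            # Any other state is treated as a failure in this sketch.
            raise RuntimeError(f"Unexpected cluster state: {state}")
        time.sleep(poll_seconds)
    return False


# Simulated cluster that becomes ready on the third check (no real API call).
_fake_states = iter(["PENDING", "PENDING", "RUNNING"])
result = wait_until_running(lambda: next(_fake_states), poll_seconds=0)
print(result)
```

With the defaults, the loop checks every 30 seconds for up to 20 minutes, which comfortably covers the 5-10 minutes it took in my case.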
Once the cluster is created, we are ready to upload data or create notebooks. Let us upload the data first.
Upload the already prepared/downloaded dataset so that the newly created cluster can use it.
Go back to the dashboard and click “Upload Data”
In the next screen, give the dataset a name and upload the dataset. In my case I am using a CSV file with some 35 rows. Your dataset can be a bigger one but note that depending on the size of the dataset the upload and processing can take more time.
Once upload is completed, you can create the Notebook.
A Notebook, in this context, is an interactive web-based editor that allows data scientists, analysts and developers to write scripts and notes, and collaborate on them, to analyse and visualize data.
You can create the Notebook either by clicking “Create Table” on the Dashboard screen, or as a continuation of the last step. When you click the “Create Table in Notebook” button in the above screen, the Databricks service will create a sample notebook for you with sufficient sample code, with Python as the default language.
Make sure that you have the cluster attached to this notebook. If you see the “Detached” status at the top left, choose a cluster by clicking on the “Detached” text. Without a cluster, you cannot run the scripts.
Now it is time to test the script. You can see the sample Python scripts in various script boxes on the page. You can click the play button at the top right of any script snippet box:
You should be able to see the script being executed, and the result will be displayed below it in the form of a table. If there are errors, you will be provided with proper error messages which you can use to debug the script.
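For reference, the kind of Python code that loads an uploaded CSV in a notebook looks roughly like the sketch below. This is my own simplified version, not the exact code the service generates. It assumes a notebook attached to a running cluster, where a `spark` session object is already provided, and the helper names are hypothetical.

```python
def csv_reader_options(first_row_is_header: bool = True, delimiter: str = ",") -> dict:
    """Options for Spark's CSV reader; values are passed as strings."""
    return {
        "header": str(first_row_is_header).lower(),  # "true" if row 1 holds column names
        "inferSchema": "true",                       # let Spark guess the column types
        "sep": delimiter,
    }


def load_csv(spark, file_location: str):
    """Read a CSV file into a Spark DataFrame using the options above."""
    return (
        spark.read.format("csv")
        .options(**csv_reader_options())
        .load(file_location)
    )


print(csv_reader_options())
```

In a notebook cell you would call `df = load_csv(spark, file_location)` with the file path shown after your upload, and then `display(df)` to render the table you see below the cell.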
Now it is your time for experimenting and more learning.
As a bonus, let us see how to visualize the same data using a bar chart. Click on the bar chart icon. If you do not see any charts auto generated, then click “Plot Options” and play around with the parameters.
Click “Apply”, and now you can see the bar chart updated in the Notebook.
Architectures to help you design and implement secure, highly-available, performant and resilient solutions on Azure.
Nice set of design references for your next project.
I have started an Azure training series under the brand ‘Tech Hour’ at the company I work for, Orion. The plan is to deliver short sessions of 30 to 45 minutes each, spanning 100 days. Below are the Azure topics I plan to cover:
- Day 001 Azure: Cloud Computing
- Day 002 Azure: Portal
- Day 003 Azure: XaaS
- Day 004 Azure: Web Apps
- Day 005 Azure: App Service
- Day 006 Azure: Virtual Machines
- Day 007 Azure: Linux
- Day 008 Azure: Functions – Part I
- Day 009 Azure: Functions – Part II
- Day 010 Azure: SQL Database – Part I
- Day 011 Azure: SQL Database – Part II
- Day 012 Azure: Storage – Part I
- Day 013 Azure: Storage – Part II
- Day 014 Azure: Storage – Part III
- Day 015 Azure: Logic Apps – Part I
- Day 016 Azure: Logic Apps – Part II
- Day 017 Azure: Logic Apps – Part III
- Day 018 Azure: Service Fabric
- Day 019 Azure: Cloud Services
- Day 020 Azure: Cognitive Services – Part I
- Day 021 Azure: Cognitive Services – Part II
- Day 022 Azure: Cognitive Services – Part III
- Day 023 Azure: Cognitive Services – Part IV
- Day 024 Azure: Key Vault
- Day 025 Azure: Data and BigData
- Day 026 Azure: Data Factory – Part I
- Day 027 Azure: Data Factory – Part II
- Day 028 Azure: HDInsight
- Day 029 Azure: API Management
- Day 030 Azure: Machine Learning – Part I
- Day 031 Azure: Machine Learning – Part II
- Day 032 Azure: Application Insights
- Day 033 Azure: Unstructured Data
- Day 034 Azure: Cosmos DB
- Day 035 Azure: Spark for HDInsight
- Day 036 Azure: Storm for HDInsight
- Day 037 Azure: R Server for HDInsight
- Day 038 Azure: IoT Suite
- Day 039 Azure: Active Directory
- Day 040 Azure: Mobile Services
- Day 041 Azure: CDN
- Day 042 Azure: SQL Data Warehouse
- Day 043 Azure: Multi-Factor Authentication
- Day 044 Azure: Media Services
- Day 045 Azure: Stream Analytics
- Day 046 Azure: Event Hubs
- Day 047 Azure: Service Bus
- Day 048 Azure: Scheduler
- Day 049 Azure: Notification Hub
- Day 050 Azure: Automation
- Day 051 Azure: Log Analytics
- Day 052 Azure: Redis Cache
- Day 053 Azure: Search
- Day 054 Azure: Application Gateway
- Day 055 Azure: Data Catalog
- Day 056 Azure: Data Lake Store
- Day 057 Azure: Data Lake Analytics
- Day 058 Azure: Bot Service
- Day 059 Azure: Containers
- Day 060 Azure: Container Service
- Day 061 Azure: SQL Server Stretch Database
- Day 062 Azure: Media Player
- Day 063 Azure: Monitor
- Day 064 Azure: Insight & Analytics
- Day 065 Azure: Analysis Services
- Day 066 Azure: Time Series Insights
- Day 067 Azure: MySQL
- Day 068 Azure: PostgreSQL
- Day 069 Azure: Virtual Machine Scale Sets
- Day 070 Azure: Bing integration
- Day 071 Azure: PowerShell – Part I
- Day 072 Azure: PowerShell – Part II
- Day 073 Azure: Cloud Shell
- Day 074 Azure: Service Bus
- Day 075 Azure: Serverless computing
- Day 076 Azure: High Availability
- Day 077 Azure: DevOps
- Day 078 Azure: Load Balancer
- Day 079 Azure: Virtual Private Networks
- Day 080 Azure: Web App for Containers
- Day 081 Azure: SQL Elastic database pool
- Day 082 Azure: DaaS for MongoDB
- Day 083 Azure: Developer Tools
- Day 084 Azure: Design of applications
- Day 085 Azure: ASP.NET Core – Part I
- Day 086 Azure: ASP.NET Core – Part II
- Day 087 Azure: Securing your Apps
- Day 088 Azure: Event Grid
- Day 089 Azure: Stack
- Day 090 Azure: Cost Consciousness
- Day 091 Azure: vs Amazon Web Services
- Day 092 Azure: with PowerBI
- Day 093 Azure: Flow vs LA vs Functions
- Day 094 Azure: Best Practices
- Day 095 Azure: Resource Manager
- Day 096 Azure: Deep Learning with CNTK
- Day 097 Azure: Case Study – Part I
- Day 098 Azure: Case Study – Part II
- Day 099 Azure: Case Study – Part III
- Day 100 Azure: Learning Roadmap