Azure Data Factory

Loading and Moving Internal Data using Azure Data Factory

  1. Locate the SQL Data Warehouse instance in the Azure portal
  2. Select Load Data under the Common Tasks tab
  3. Select Azure Data Factory
  4. Create a Data Factory
  5. Specify the following details:

    • Data Factory name
    • Subscription
    • Resource group
    • Region
    • Load data
  6. Specify the following configurations:

    • Task name
    • Task description
    • Task cadence
    • Expiration time
  7. Select the data source
  8. Select the destination
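
Once the copy task has run, a quick T-SQL check against the warehouse confirms that rows actually landed; the table name below is a placeholder for whatever destination table was chosen in step 8.

-- Placeholder destination table; substitute the table selected as the destination
SELECT COUNT(*) AS loaded_rows
FROM dbo.<Destination_Table>;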

Loading and Moving External Data using Polybase

  1. Run the following queries against the database, from either Azure SQL Data Warehouse or Azure Data Studio
  2. Create an external Hadoop data source
CREATE EXTERNAL DATA SOURCE LabAzureStorage
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://labdata@<Name_Of_Storage_Account>.blob.core.windows.net/'
);
  3. Define the external file format
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);
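
With the data source and file format in place, the load itself is done by defining an external table over the files and then copying it into an internal table with CREATE TABLE AS SELECT (CTAS). The sketch below assumes a hypothetical three-column schema for the lab files; if the container isn't publicly readable, a database scoped credential would also need to be created and referenced from the external data source.

-- Hypothetical schema; adjust the columns to match the files in the labdata container
CREATE EXTERNAL TABLE dbo.LabDataExternal
(
    [Id] INT NOT NULL,
    [RecordedAt] DATETIME2 NOT NULL,
    [Reading] NVARCHAR(50) NOT NULL
)
WITH
(
    LOCATION = '/',
    DATA_SOURCE = LabAzureStorage,
    FILE_FORMAT = TextFileFormat
);

-- Load the external data into a distributed internal table via CTAS
CREATE TABLE dbo.LabData
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.LabDataExternal;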

Integrating Data Factory with Databricks

  1. Create an Azure storage account
  2. Create a Data Factory instance
  3. Create a data workflow pipeline

    • This involves copying data from our source by using a copy activity in Data Factory
    • A copy activity lets us copy data between a variety of on-premises and cloud data stores
  4. Add a Databricks notebook to the pipeline
  5. Analyze the data
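
The analysis step happens inside the Databricks notebook. As a rough sketch, assuming the copy activity landed delimited files under a mounted path such as /mnt/labdata, a Spark SQL cell could expose them as a temporary view and aggregate them; the path and column names here are assumptions, not part of the lab.

-- Spark SQL in a Databricks notebook cell; path and column names are placeholders
CREATE TEMPORARY VIEW lab_data
USING csv
OPTIONS (path '/mnt/labdata/', header 'true', inferSchema 'true');

SELECT Category, COUNT(*) AS RecordCount
FROM lab_data
GROUP BY Category
ORDER BY RecordCount DESC;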

Defining Best Practices

  • To save on compute costs, we should pause the SQL Data Warehouse instance whenever we don't need to run any queries
  • If we plan to run several queries against the same SQL Data Warehouse table, it's recommended to save the extracted data in a format like Parquet, since each query can extract a large amount of data to Blob storage (see the sketch after this list)
  • Linked services define the connection information needed for Data Factory to connect to external resources
  • When Data Factory triggers a Databricks notebook activity, the target cluster starts automatically if it isn't already running
  • We can connect our Spark cluster in Databricks to Azure Blob storage by mounting the Blob storage container on the cluster
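
As a sketch of the Parquet recommendation above: in a Databricks notebook, data pulled from the warehouse once (lab_data below stands in for an already-registered view or table) can be materialized as a Parquet table, and later queries then hit the Parquet copy instead of re-extracting from SQL Data Warehouse.

-- Persist the extracted data as Parquet once, then query the copy repeatedly
CREATE TABLE lab_data_parquet
USING parquet
AS SELECT * FROM lab_data;

SELECT COUNT(*) AS row_count FROM lab_data_parquet;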
