Loading and Moving Internal Data using Azure Data Factory
- Locate the SQL Data Warehouse instance in the Azure portal
- Select Load Data under the Common Tasks tab
- Select Azure Data Factory
- Create a Data Factory
- Specify the following details:
- Data Factory name
- Subscription
- Select resource group
- Select region
- Select load data
- Specify the following configurations:
- Task name
- Task description
- Task cadence
- Expiration time
- Select the data source
- Select the destination
Loading and Moving External Data using PolyBase
- Run the following queries against the database, using either the Azure SQL Data Warehouse query editor or Azure Data Studio
- Create an external Hadoop data source
CREATE EXTERNAL DATA SOURCE LabAzureStorage
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://labdata@<Name_Of_Storage_Account>.blob.core.windows.net/'
);
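- If the Blob storage container isn't publicly accessible, PolyBase also needs a database scoped credential holding the storage account key, created before the data source and referenced from it with a CREDENTIAL clause (a minimal sketch; the credential name, identity string, and key placeholder are assumptions)
-- Required once per database before creating a scoped credential
CREATE MASTER KEY;

-- The identity string is descriptive only; the secret is the storage account key
CREATE DATABASE SCOPED CREDENTIAL LabStorageCredential
WITH
    IDENTITY = 'LabStorageUser',
    SECRET = '<Storage_Account_Key>';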
- Define the external file format
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);
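- With the data source and file format in place, declare an external table over the files and load it into the warehouse with CREATE TABLE AS SELECT (a minimal sketch; the table names and columns are assumptions and would need to match the actual lab files)
-- External table over the delimited files in Blob storage
CREATE EXTERNAL TABLE dbo.LabDataExternal
(
    Id INT NOT NULL,
    EventTime DATETIME2,
    Value NVARCHAR(100)
)
WITH
(
    LOCATION = '/',
    DATA_SOURCE = LabAzureStorage,
    FILE_FORMAT = TextFileFormat
);

-- Load the external data into a distributed warehouse table via CTAS
CREATE TABLE dbo.LabData
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.LabDataExternal;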
Integrating Data Factory with Databricks
- Create an Azure storage account
- Create a Data Factory instance
- Create a data workflow pipeline
- This involves copying data from our source by using a copy activity in Data Factory
- A copy activity allows us to copy data between a wide range of on-premises and cloud data stores
- Add a Databricks notebook to the pipeline
- Analyze the data
Defining Best Practices
- To save on compute costs, we should pause the SQL Data Warehouse instance whenever we don't need to run any queries
- If we plan to run several queries against the same SQL Data Warehouse table, the recommended approach is to save that data once in a format such as Parquet, since each query can extract a large amount of data to Blob storage (as sketched at the end of this list)
- Linked services define the connection information needed for Data Factory to connect to external resources
- When Data Factory runs a Databricks notebook activity, the target Azure Databricks cluster starts automatically if it isn't already running
- We can give the Spark cluster in Databricks access to Azure Blob storage by mounting the storage container to the Databricks File System
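- One way to save a SQL Data Warehouse table as Parquet files in Blob storage is CREATE EXTERNAL TABLE AS SELECT, reusing the LabAzureStorage data source defined earlier (a minimal sketch; the file format, external table, and source table names are assumptions)
-- Parquet needs its own external file format
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH
(
    FORMAT_TYPE = PARQUET
);

-- Write the query result as Parquet files under /parquet/labdata/ in the container
CREATE EXTERNAL TABLE dbo.LabDataParquet
WITH
(
    LOCATION = '/parquet/labdata/',
    DATA_SOURCE = LabAzureStorage,
    FILE_FORMAT = ParquetFileFormat
)
AS
SELECT * FROM dbo.LabData;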