
Saturday, 4 September 2021

Azure Data Factory Components - An Introduction



PIPELINES

A pipeline is a logical grouping of activities that together perform a unit of work. A data factory can contain multiple pipelines, and each pipeline can contain multiple activities. The activities inside a pipeline can be structured to run sequentially or in parallel, depending on your needs. The pipeline allows users to easily manage and schedule multiple activities together instead of handling each one individually.


Examples of activities include the Copy activity, Data Flow activity, Stored Procedure activity, Lookup activity, ForEach activity, and Execute Pipeline activity.

A pipeline can contain many activities; they execute in sequence when connected by dependencies and in parallel when they are independent of one another.


One pipeline can call another pipeline using the Execute Pipeline activity, with each pipeline containing its own set of activities. This is similar to the parent and child package concept in SSIS.
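As a rough sketch of how this fits together, the following shows what a pipeline definition looks like in the JSON behind ADF Studio, written here as a Python dictionary. The pipeline, dataset, and activity names (PL_Load_Sales, DS_Sales_Csv, and so on) are made up for illustration.

```python
import json

# Illustrative pipeline: a Copy activity followed by an Execute Pipeline activity
# that calls a child pipeline once the copy has succeeded. All names are hypothetical.
pipeline = {
    "name": "PL_Load_Sales",
    "properties": {
        "activities": [
            {
                "name": "CopySalesData",
                "type": "Copy",
                "inputs": [{"referenceName": "DS_Sales_Csv", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "DS_Sales_Table", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "AzureSqlSink"}
                }
            },
            {
                "name": "CallChildPipeline",
                "type": "ExecutePipeline",   # parent/child pipelines, like SSIS packages
                "dependsOn": [
                    {"activity": "CopySalesData", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {
                    "pipeline": {"referenceName": "PL_Transform_Sales", "type": "PipelineReference"}
                }
            }
        ]
    }
}

print(json.dumps(pipeline, indent=2))
```

Because CallChildPipeline depends on CopySalesData, the two run in sequence; activities with no dependency on each other would run in parallel.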


ACTIVITIES

Activities are the processing steps that perform individual tasks. Data Factory supports data movement, data transformation, and control activities, and activities can be executed either sequentially or in parallel.

You can have more than one activity in a pipeline. If you have multiple activities in a pipeline and subsequent activities are not dependent on previous activities, the activities may run in parallel. Activity Dependency defines how subsequent activities depend on previous activities, determining the condition of whether to continue executing the next task. An activity can depend on one or multiple previous activities with different dependency conditions.
The different dependency conditions are: Succeeded, Failed, Skipped, Completed.
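As a small, hedged illustration of activity dependency, the snippet below (a Python dictionary mirroring the underlying JSON) attaches a dependsOn entry to a downstream Stored Procedure activity; the activity, procedure, and linked service names are invented.

```python
# Hypothetical downstream activity that runs only when "CopySalesData" succeeds.
# Swapping "Succeeded" for "Failed", "Skipped", or "Completed" changes when it fires.
log_success_activity = {
    "name": "LogSuccess",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        {"activity": "CopySalesData", "dependencyConditions": ["Succeeded"]}
    ],
    "linkedServiceName": {"referenceName": "LS_AzureSql", "type": "LinkedServiceReference"},
    "typeProperties": {"storedProcedureName": "dbo.usp_LogSuccess"}
}
```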
Limits: the maximum number of activities per pipeline, including inner activities for containers, is 40; the maximum timeout for pipeline activity runs is 7 days.


Mapping Data Flows

Mapping data flows let you create and manage data transformation logic that can handle data of any size. Data Factory executes your logic on a Spark cluster that spins up and shuts down when you need it, so you never have to manage or maintain clusters. Azure Data Factory controls all of the data flow execution and code translation; behind the scenes it uses a Spark cluster to run the code, which lets it handle large amounts of data with ease.


When you call a data flow from a pipeline using the Data Flow activity, that is where you provide the cluster details.


It is similar to the SSIS Data Flow component, where you design data transformation logic to move data from a source to a destination.
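As a hedged sketch, the Data Flow activity that runs a mapping data flow from a pipeline looks roughly like this (again a Python dictionary mirroring the JSON); the data flow name is hypothetical, and the compute block is where the Spark cluster details are supplied.

```python
# Hypothetical Data Flow activity: references an existing mapping data flow and
# specifies the size of the Spark cluster that will execute it.
run_dataflow_activity = {
    "name": "TransformSales",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataflow": {"referenceName": "DF_Clean_Sales", "type": "DataFlowReference"},
        "compute": {
            "computeType": "General",   # type of Spark compute
            "coreCount": 8              # cores allocated for this run
        }
    }
}
```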


DATASETS

A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents.
For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read the data. If you read from or write to a table, the dataset points to the server, database, and specific table the activity works against. Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked services are much like connection strings: they define the connection information Data Factory needs to connect to external resources.
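A minimal sketch of a dataset definition, assuming a delimited text file sitting in Blob storage (the names are made up and mirror the earlier examples):

```python
# Hypothetical Azure Blob dataset: names the container/folder/file and points back
# to a linked service that holds the actual connection information.
blob_dataset = {
    "name": "DS_Sales_Csv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "LS_BlobStorage", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "sales",
                "folderPath": "incoming",
                "fileName": "sales.csv"
            },
            "columnDelimiter": ","
        }
    }
}
```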


LINKED SERVICES

Linked services are much like connection strings, which define the connection information that's needed for Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data source, and a dataset represents the structure of the data.

For example, an Azure Storage-linked service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the blob container and the folder that contains the data.
Linked services are used for two purposes in Data Factory:
1. To represent a data store, such as a SQL Server database, an Oracle database, a file share, or Azure Blob storage.
2. To represent a compute resource that can host the execution of an activity. For example, an HDInsight Hive activity runs on an HDInsight Hadoop cluster, and a Python script or notebook activity runs on a Spark cluster.


This is very similar to the concept of a connection string in SQL Server, where you specify the source or destination of your data.
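A minimal sketch of the Azure Blob Storage linked service the dataset above refers to (the account name and key are placeholders; in practice the secret would normally come from Azure Key Vault):

```python
# Hypothetical Azure Blob Storage linked service: essentially a named connection string.
# Never hard-code real account keys like this in source control.
blob_linked_service = {
    "name": "LS_BlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        }
    }
}
```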


Trigger

A trigger is a unit of processing that determines when a pipeline run needs to be kicked off. Triggers can fire on a schedule or be set off by another event.

Azure Data Factory supports three main types of triggers: a schedule trigger, which invokes the pipeline at a specific time and frequency; a tumbling window trigger, which fires on a fixed periodic interval; and an event-based trigger, which invokes the pipeline in response to an event, such as a blob being created or deleted.
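As a hedged example, a schedule trigger that runs a pipeline once a day looks roughly like this (the pipeline and trigger names are hypothetical):

```python
# Hypothetical schedule trigger: fires the referenced pipeline daily at 02:00 UTC.
daily_trigger = {
    "name": "TR_Daily_Load",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",    # Minute, Hour, Day, Week, or Month
                "interval": 1,
                "startTime": "2021-09-04T02:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "PL_Load_Sales", "type": "PipelineReference"}}
        ]
    }
}
```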


Parameter

Parameters are key-value pairs of read-only configuration. Parameters are defined on the pipeline, and the arguments for those parameters are passed in when the pipeline is executed. Parameters can also be used in linked services, datasets, and pipelines to make execution dynamic, and activities within the pipeline consume the parameter values.
Maximum parameters per pipeline - 50
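A minimal sketch of a parameterized pipeline, assuming the referenced dataset defines a matching fileName parameter (all names are invented):

```python
# Hypothetical pipeline parameter "fileName": its value is supplied at run time and
# consumed through the @pipeline().parameters expression inside the Copy activity.
parameterized_pipeline = {
    "name": "PL_Load_File",
    "properties": {
        "parameters": {
            "fileName": {"type": "String", "defaultValue": "sales.csv"}
        },
        "activities": [
            {
                "name": "CopyNamedFile",
                "type": "Copy",
                "inputs": [{
                    "referenceName": "DS_Sales_Csv",
                    "type": "DatasetReference",
                    "parameters": {"fileName": "@pipeline().parameters.fileName"}
                }],
                "outputs": [{"referenceName": "DS_Sales_Table", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "AzureSqlSink"}
                }
            }
        ]
    }
}
```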


Control flow

Control flow is the orchestration of pipeline activities: it specifies the execution flow of the pipeline, in sequence or in parallel, and includes the ability to define execution branches and loops, define parameters at the pipeline level, and pass arguments when invoking the pipeline on demand or from a trigger.
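As one hedged example of a control flow construct, a ForEach activity loops over an array parameter and runs its inner activities once per item (the names are hypothetical; @item() refers to the current element):

```python
# Hypothetical ForEach loop over a "fileList" pipeline parameter.
foreach_activity = {
    "name": "LoopOverFiles",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.fileList", "type": "Expression"},
        "isSequential": False,   # False allows iterations to run in parallel
        "activities": [
            {
                "name": "CopyOneFile",
                "type": "Copy",
                "inputs": [{
                    "referenceName": "DS_Sales_Csv",
                    "type": "DatasetReference",
                    "parameters": {"fileName": "@item()"}
                }],
                "outputs": [{"referenceName": "DS_Sales_Table", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "AzureSqlSink"}
                }
            }
        ]
    }
}
```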


Variables

Variables can be used inside of pipelines to store temporary values and can also be used in conjunction with parameters to enable passing values between pipelines, data flows, and other activities.
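A small sketch, assuming a preceding Copy activity named CopySalesData exists in the same pipeline: a variable is declared on the pipeline and a Set Variable activity assigns it; other activities can then read it with the @variables('rowCount') expression.

```python
# Hypothetical pipeline variable plus a Set Variable activity that assigns it
# from the output of an earlier Copy activity.
pipeline_with_variable = {
    "name": "PL_With_Variable",
    "properties": {
        "variables": {
            "rowCount": {"type": "String", "defaultValue": "0"}
        },
        "activities": [
            {
                "name": "SetRowCount",
                "type": "SetVariable",
                "typeProperties": {
                    "variableName": "rowCount",
                    "value": "@string(activity('CopySalesData').output.rowsCopied)"
                }
            }
        ]
    }
}
```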


Integration Runtime

In Data Factory, an activity defines the action to be performed, and a linked service defines a target data store or compute service. An integration runtime provides the bridge between the activity and the linked service. It is referenced by the linked service or activity, and it provides the compute environment where the activity either runs or from which it gets dispatched.
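As a hedged illustration, a linked service points at a specific integration runtime through its connectVia reference; here an on-premises SQL Server linked service (all names and the connection string are hypothetical) is routed through a self-hosted integration runtime:

```python
# Hypothetical on-premises SQL Server linked service that routes its connections
# through a self-hosted integration runtime via "connectVia".
onprem_sql_linked_service = {
    "name": "LS_OnPrem_Sql",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Server=myserver;Database=Sales;Integrated Security=True"
        },
        "connectVia": {
            "referenceName": "SelfHostedIR-OnPrem",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```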


Refer to this post for a detailed description: Integration Runtime (IR) in Azure Data Factory - An Introduction.



