
A step-by-step guide to migrating Microsoft Data Quality Services to Azure


With HEDDA.IO, your data is optimally prepared for all your purposes at all times.


Contents

Microsoft Data Quality Services
HEDDA.IO
Advantages of migration to HEDDA.IO
DQS Knowledge Base
Exporting a DQS Knowledge Base
Installing HEDDA.IO
Importing a DQS Knowledge Base to HEDDA.IO
Creating an SSIS Package
Creating an SSIS-IR with HEDDA.IO
Publishing the SSIS Package
Executing the SSIS Package
Conclusion


Microsoft Data Quality Services

SQL Server Data Quality Services (DQS) is a knowledge-driven data quality product. DQS enables you to build a knowledge base and use it to perform a variety of critical data quality tasks, including correction, enrichment, standardization, and de-duplication of your data. The service enables you to perform data cleansing by using cloud-based reference data services provided by reference data providers. It also provides profiling that is integrated into its data quality tasks, enabling you to analyze the integrity of your data.

DQS consists of Data Quality Server and Data Quality Client, both of which are installed as part of SQL Server 2017. Data Quality Server is a SQL Server instance feature that consists of three SQL Server catalogs with data quality functionality and storage. Data Quality Client is a SQL Server shared feature that business users, information workers, and IT professionals can use to perform computer-assisted data quality analyses and manage their data quality interactively. You can also perform data quality processes by using the DQS Cleansing component in Integration Services and the Master Data Services (MDS) data quality functionality, both of which are based on DQS.
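The three catalogs mentioned above are the databases DQS_MAIN, DQS_PROJECTS and DQS_STAGING_DATA. If you want to verify that they exist on your Data Quality Server before you start the migration, a minimal check from the command line could look like this (a sketch assuming a local default instance with Windows authentication; adjust the server name to your environment):

rem List the DQS databases on the local default instance
sqlcmd -S . -E -Q "SELECT name FROM sys.databases WHERE name LIKE 'DQS%'"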

DQS was introduced with SQL Server 2012 and has been maintained since the first version. In 2012, oh22information services GmbH developed additional SSIS components as open source solutions and published them on Codeplex. With these components, duplicate matching as well as the loading of domain values can also be carried out within the ETL process.

SSIS DQS Matching Transformation: https://archive.codeplex.com/?p=ssisdqsmatching

SSIS DQS Domain Value Import: https://archive.codeplex.com/?p=domainvalueimport


HEDDA.IO

oh22's HEDDA.IO is a knowledge-driven data quality product built entirely in Microsoft Azure. HEDDA.IO enables you to build a knowledge base and use it to perform a variety of critical data quality tasks, including correction, enrichment and standardization of your data. HEDDA.IO enables you to perform data cleansing by using cloud-based reference data services provided by reference data providers or developed and provided by yourself.

HEDDA.IO consists of a Web API, a Web UI, an Excel Add-in and an SSIS component, all fully hosted in Microsoft Azure and ready to be integrated into your cloud and on-premises processes. The HEDDA.IO Excel Add-in is a local feature that covers the complete scope of DQS. You can also perform data quality processes by using the HEDDA.IO cleansing component in Integration Services on premises and in the new Azure Data Factory SSIS Integration Runtime.


Advantages of migration to HEDDA.IO

Over the last months, more and more functions of the classic on-premises Microsoft SQL Server product have become available in Azure. For example, the Azure Data Factory SSIS Integration Runtime is a complete PaaS offering that is fully compatible with on-premises SQL Server Integration Services. Also, with the current version of SQL Server Managed Instance, Microsoft SQL Server Master Data Services can now be used in Azure.

Microsoft Data Quality Services, on the other hand, cannot be used in Azure except on an Azure VM as IaaS. Many processes involved in loading a data warehouse or moving data between different services are increasingly taking place in the cloud, and the need to carry out pure data quality processes in the cloud is increasing as well. At this point a gap remains: necessary data quality processes either cannot be integrated into cloud processes, or the corresponding processes cannot be migrated to Azure.

HEDDA.IO is a data quality service developed entirely for the cloud. With its concepts of knowledge bases, domains and composite domains, HEDDA.IO is compatible with Microsoft Data Quality Services. Through an SSIS component that is fully aligned with the SSIS-IR, the validation and cleansing of data within ETL processes can be performed easily. Existing on-premises processes based on Microsoft Data Quality Services can thus be migrated quickly and easily to Azure using HEDDA.IO.

With the discontinuation of the Azure Data Marketplace, Reference Data Services were removed from Data Quality Services. This means that various checks and cleansing operations based on composite domains can no longer be performed with Microsoft Data Quality Services. HEDDA.IO has an open API with which Reference Data Services can be quickly and easily integrated into the product. Various services can be deployed directly with HEDDA.IO from the Marketplace, and some of them are available as open source on GitHub so that end users can create their own services.


DQS Knowledge Base

Let's start with a DQS Knowledge Base and a domain in Microsoft Data Quality Services. Open the SQL Server 2017 Data Quality Client. In the start screen, the Knowledge Base Management area on the left displays the Knowledge Bases that you have already defined. Click the Open Knowledge Base button and select the Knowledge Base DQS Data in the following dialog. You may also find the DQS Data Knowledge Base under Recent Knowledge Base.

DQS Data is a standard Knowledge Base that is automatically created on your system when Data Quality Services is installed.


Once you have opened the Knowledge Base, you will see the domains that belong to it in the Domain area on the left. On the right you will see the domain properties for the selected domain, in this example the domain Country/Region. In addition to the name, this includes a more detailed description of the domain and its data type.

Via the Domain Values tab you can switch to the actual values within your domain. Here you see the individual values for the selected domain as well as the assigned leading values in the column "Correct To".

As you can see in the selection, a domain can also contain values from completely different character sets.


Exporting a DQS Knowledge Base

To export a Knowledge Base, click the second icon from the right in the Domain Management area on the left. The icon looks like a small table with an arrow pointing to the right.

You can now select whether you want to export the full Knowledge Base or only the selected domain. Click Export Knowledge Base to export the entire KB.


In the next dialog, "Export to Data File", select the storage location for the KB and enter a name for it. Exporting the Knowledge Base may take a few seconds, depending on the amount of data you have stored in it.

After the export, you can import the Knowledge Base into HEDDA.IO.
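As an alternative to the client UI, DQS also ships with DQSInstaller.exe, which can export all Knowledge Bases of an instance into a single .dqsb file. A minimal sketch, assuming a SQL Server 2017 default instance (the paths, the switch and the output file name are examples; the per-Knowledge-Base export described above is the one used in the following chapters):

rem Export all Knowledge Bases of the local DQS instance to a .dqsb file
cd /d "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\Binn"
DQSInstaller.exe -exportkbs C:\Temp\DQS_KnowledgeBases.dqsb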


Installing HEDDA.IO

Because HEDDA.IO is a full cloud service, you do not need to install it on a local server. You can deploy the application, including all resources, directly from the Azure Marketplace into your Azure subscription.

Open the Azure Portal at https://portal.azure.com and click the "Create a resource" button. Then enter the name HEDDA.IO in the search field and press Enter. In the search result list, select HEDDA.IO.


In the next screen, select the software plan for HEDDA.IO and click Create. Now follow the seven steps to deploy HEDDA.IO via the Azure Portal directly into your subscription. You can find complete instructions on deploying HEDDA.IO at https://hedda.io/documentation/
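The deployment wizard will also ask you for a target resource group. If you prefer to prepare that resource group from the command line beforehand, a minimal sketch with the Azure CLI (name and region are placeholders):

rem Create the resource group that will hold the HEDDA.IO resources
az group create --name rg-heddaio --location westeurope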

After deploying HEDDA.IO, you must install the HEDDA.IO Excel Add-in to manage knowledge bases and domains. You can download the Excel Add-in either from the home page of the HEDDA.IO service you created or from the HEDDA.IO website. After you have installed the Add-in, you will see a new HEDDA.IO tab the next time you start Excel.


Importing a DQS Knowledge Base to HEDDA.IO

To import the previously exported DQS Knowledge Base into HEDDA.IO, start Excel and log on to your HEDDA.IO service via the HEDDA.IO tab. You can get the URL and the API key from the properties of your service via the Azure Portal. Further information can be found in the HEDDA.IO documentation at https://hedda.io/documentation

Once you are connected to your HEDDA.IO instance, select Import from the Configuration group.


The Import dialog allows you to import both HEDDA.IO exports and DQS exports. To import the previously exported DQS file, select DQS Import from the HEDDA.IO Import dialog.

When you import a DQS file, it is uploaded from your local computer to the HEDDA.IO service. The HEDDA.IO service then imports the DQS file and creates a new knowledge base with the corresponding domains from it. Depending on the file size, the import may take some time. During the import, the Import button is disabled on every connected client.


After the backup has been imported from the Data Quality Server, you can access the Knowledge Base and its corresponding domains. All members, including synonyms and validation status, have been imported into HEDDA.IO. Both DQS and HEDDA.IO handle UTF-8 and can export or import the corresponding members.


Creating an SSIS Package

To create an SSIS package that cleanses data using the previously imported DQS Knowledge Base and the HEDDA.IO service, you must first install the HEDDA.IO SSIS component. You can download the component either from the portal of your previously created service or from the HEDDA.IO website at https://hedda.io/download

After installing the component, open SQL Server Data Tools. Create a new SSIS project with a data flow that uses the HEDDA.IO Domain Cleansing component. Configure the component to use the previously imported DQS domain, and then write the data back to a database.


Creating an SSIS-IR with HEDDA.IO

To use an Azure Data Factory SSIS-IR with HEDDA.IO components, you must specify a custom setup script when creating the SSIS-IR. Follow the next steps to create an Azure Data Factory with HEDDA.IO components.

Create a new Azure Data Factory using the Azure Portal. You can create the Azure Data Factory in the same resource group in which you created the HEDDA.IO service, or you can of course use a new or another existing resource group. For performance reasons, however, make sure that the Azure Data Factory is created in the same region as the HEDDA.IO service.


After the Azure Data Factory has been successfully created, go to the Author & Monitor area by clicking the corresponding button in the Azure Portal. Click "Configure SSIS Integration" to create a new SSIS-IR.

In the next dialog you have to configure several parameters to create the new SSIS Integration Runtime. Since this step-by-step guide concentrates on the migration from DQS to HEDDA.IO, we recommend Microsoft Docs for complete information on the individual parameters: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime

When you create a new SSIS-IR, you must also create an SSIS Catalog. You can use the same SQL Server on which your HEDDA.IO database is hosted. To use the component with the Azure Data Factory SSIS Integration Runtime, it must be installed and configured on the corresponding Azure nodes. The actual installation of the component takes place via a batch file, which is automatically executed when the Azure SSIS-IR is started.


The batch file, or rather its location, must be defined when creating the Azure SSIS-IR. The batch file runs the installer of the HEDDA.IO Domain Cleansing SSIS component via msiexec. For this to work, the MSI file must reside in the same blob storage container as the batch file.

The content of the batch file, called "main.cmd", is:

msiexec /i oh22is.HeddaDomainCleansing.msi /qn
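If you want the installation to leave a log that you can inspect when a node fails to start, a hedged variant of main.cmd also writes a verbose msiexec log. The %CUSTOM_SETUP_SCRIPT_LOG_DIR% variable is assumed here to be provided by the Azure-SSIS IR custom setup environment; check the current Microsoft documentation for your runtime version.

rem main.cmd - install the HEDDA.IO Domain Cleansing component silently
rem and write a verbose installer log for troubleshooting (log directory is an assumption)
msiexec /i oh22is.HeddaDomainCleansing.msi /qn /L*V %CUSTOM_SETUP_SCRIPT_LOG_DIR%\install.log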

Create a shared access signature for the blob storage container in which the batch file was saved. You can easily create a Shared Access Signature (SAS) with the Azure Storage Explorer. Note that an expiry date is specified when creating the SAS. In general, a short-lived SAS is more secure, but it may prevent the Azure SSIS-IR from accessing the blob storage when it is restarted. For this reason, choose an expiry date that fits the life cycle of your Azure SSIS-IR, and copy the URL of the SAS before closing the window.
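Instead of the Azure Storage Explorer, you can also generate the container SAS with the Azure CLI. A minimal sketch, assuming a storage account named mycustomsetupstore with a container named customsetup that holds main.cmd and the MSI (the names, permissions and expiry are examples; the runtime needs at least read and list access):

rem Generate a read/list SAS token for the custom setup container
rem (the CLI will look up the account key if your login has access to it)
az storage container generate-sas --account-name mycustomsetupstore --name customsetup --permissions rl --expiry 2025-12-31T00:00Z --output tsv
rem The SAS URI to enter in the portal is then:
rem https://mycustomsetupstore.blob.core.windows.net/customsetup?<generated token>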

With the new Integration Runtime Setup, you can now easily install third-party components via the Azure Portal. Open your already created Azure Data Factory V2 in the Azure Portal and click "Author & Monitor". Then click "Configure SSIS Integration Runtime" in the Azure Data Factory to configure a new Azure SSIS-IR.

In the next windows, enter all parameters for the configuration of the Azure SSIS-IR. In the last window, under "Custom Setup Container SAS URI", enter the URL of the previously created Shared Access Signature. The UI then automatically validates the specified SAS. Then click the "Validate VNet" button to check your defined VNet.


Once all the information has been entered and validated correctly, the Azure SSIS-IR can be created using the "Finish" button. When the Azure SSIS-IR is created, the setup is automatically loaded from the given SAS and executed on the nodes.

For more information on installing HEDDA.IO and the SSIS components, please refer to the documentation:

https://hedda.io/documentation

https://hedda.io/download


Publishing the SSIS Package

To publish your newly created SSIS package, click Deploy in the context menu of your SSIS solution. Then follow the steps of the wizard to deploy your SSIS package to the Azure Data Factory SSIS-IR. The deployment steps do not differ from those of an on-premises deployment.

As with on-premises SQL Server Integration Services, the deployment process does not check whether the necessary SSIS components are installed on the runtime. Make sure that you have done this as described in the step "Creating an SSIS-IR with HEDDA.IO".
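If you prefer a scripted deployment over the SSDT context menu, the Integration Services Deployment Wizard can also be driven from the command line. A hedged sketch: the .ispac path, the Azure SQL server hosting your SSISDB and the folder/project names are placeholders, and depending on your tool version you may still be prompted for credentials to the destination server.

rem Deploy the compiled .ispac project to the SSISDB catalog used by the Azure SSIS-IR
ISDeploymentWizard.exe /Silent /ModelType:Project /SourcePath:"C:\Projects\HeddaCleansing\bin\Development\HeddaCleansing.ispac" /DestinationServer:"yourserver.database.windows.net" /DestinationPath:"/SSISDB/HeddaDemo/HeddaCleansing"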

With the current version of SSDT, you can enable your SSIS project directly for Azure. If you have activated this setting, you can add an Azure-SSIS Integration Runtime as a linked Azure service to your project. You can then debug the packages created in SSDT directly on the Azure SSIS-IR.


Executing the SSIS Package

To run your SSIS package, create a new pipeline within your Azure Data Factory. In the Activities toolbox, expand General, then drag and drop an Execute SSIS Package activity onto the pipeline designer surface. Define all necessary task settings so that you can run the previously created package. If you are not sure how to configure the task, the following article can help you: https://docs.microsoft.com/en-us/azure/data-factory/how-to-invoke-ssis-package-ssis-activity

After you have configured the task correctly, you can publish the new pipeline by clicking the Publish All button. To execute the pipeline, click the Trigger button and then Trigger Now. In the Pipeline Run window, select Finish.

You can check the execution of the SSIS packages via the integrated monitor within the Azure Data Factory or via the SSIS reports of the SSISDB, which you can open in SQL Server Management Studio.
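If you prefer T-SQL over the SSIS reports, you can also query the SSISDB catalog directly. A minimal sketch against the SQL Server hosting your SSISDB (server name and credentials are placeholders):

rem Show the most recent SSIS executions with their status (7 = succeeded, 4 = failed)
sqlcmd -S yourserver.database.windows.net -d SSISDB -U youradmin -P yourpassword -Q "SELECT TOP 10 execution_id, folder_name, project_name, package_name, status, start_time FROM catalog.executions ORDER BY start_time DESC"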


Conclusion

As you can see, you can easily export existing DQS Knowledge Bases and import them into HEDDA.IO, and you can migrate your SSIS and DQS workloads completely to Azure. All processes and data remain in Azure, and you can modernize your processes end to end.


oh22information services GmbH

Otto-Hahn-Str. 22, 65520 Bad Camberg, Germany
Am Turm 34, 53721 Siegburg, Germany

[email protected]

https://www.oh22.is
https://www.hedda.io
