Implement Data Masking – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu link ➢ select your dedicated SQL pool ➢ start the SQL pool ➢ after the SQL pool is running, select the Dynamic Data Masking blade ➢ select the + Add Mask menu item ➢ make the configurations as shown in Figure 8.39 ➢ and then click the Add button.

FIGURE 8.39 Implement data masking and masking rule.

  2. Click the Save button ➢ log in to the dedicated SQL pool using Azure Data Studio, with the credentials created in Exercise 8.7 ➢ and then execute the following:
    SELECT ID, FIRSTNAME, LASTNAME, EMAIL, COUNTRY, CREATE_DATE FROM dbo.SUBJECTS
  3. Notice that the data in the EMAIL column has the mask configured in step 1 and illustrated in Figure 8.39. Stop the dedicated SQL pool.

It is possible to configure a custom mask instead of using the preconfigured masks. This requires that you provide the number of starting characters to show, followed by the padding string (something like xxxxx.), followed by the number of ending characters to display. Using a prefix and suffix value of three and the padding on the EMAIL column would result in benxxxxx.net, for example, which is a bit more useful than what is provided using the default.
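If you prefer to apply such a custom mask with T-SQL rather than the portal, a statement along the following lines should work (a sketch, assuming the EMAIL column does not already have a mask; the partial() masking function takes the prefix length, the padding string, and the suffix length):

ALTER TABLE dbo.SUBJECTS
ALTER COLUMN EMAIL ADD MASKED WITH (FUNCTION = 'partial(3,"xxxxx.",3)');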

Manage Identities, Keys, and Secrets Across Different Data Platform Technologies

Protecting credentials has historically been very challenging. Developers and data engineers need to access data, and that data is protected by an ID and password. As the complexity and size of your organization grows, it is easy to lose control over who has what credentials. Add that loss of control to the potential impact changing a password can have on a production environment. This scenario is commonly referred to as credential leakage. An initial solution to credential leakage was to store connection details in a managed credential store, something like Azure Key Vault. However, access to the credential store also requires credentials, so you are back in the same place as before the implementation of the credential store. The ultimate solution is to use a combination of Azure Key Vault and managed identities. Instead of using a credential to make a connection to a storage account or a database from application code, you instead reference the Azure Key Vault endpoint. An Azure Key Vault secret endpoint resembles the following:

https://<accountName>.vault.azure.net/secrets/<secretName>/5db1a9b5…

The code that uses that endpoint must use the DefaultAzureCredential class from the Azure Identity library. The library works with all popular programming languages: .NET, Python, Go, Java, etc. Passing a new DefaultAzureCredential instance to the SecretClient class results in the acquisition of the managed identity credential, which is a token. The client then holds all the attributes necessary to retrieve a secret from the Azure Key Vault endpoint. The following C# code performs this activity:

 var kvUri = "https://" + accountName + ".vault.azure.net";
 var client = new SecretClient(new Uri(kvUri), new DefaultAzureCredential());

You can use the client to get a secret by using the following C# syntax:

 var secret = await client.GetSecretAsync(secretName);

Now you know how a managed identity can avoid credential leakage, but you might be wondering what exactly managed identities are and what you must know in order to implement them safely and securely. Table 8.3 compares the two types of managed identities: system‐assigned and user‐assigned.

TABLE 8.3 Managed identity types

Characteristic | System‐assigned managed identity | User‐assigned managed identity
Provisioning | Azure resources receive an identity by default, where supported. | Created manually
Removal | The identity is deleted when the associated Azure resource is deleted. | Deleted manually
Sharing | The identity cannot be shared among Azure resources. | Can be shared

A system‐assigned managed identity is created during the provisioning of the Azure resource. For example, an Azure Synapse Analytics workspace and a Microsoft Purview account both have a system‐assigned identity by default. Azure products that are generally used to make connections to other Azure products or features have this managed identity created by default. In contrast, an Azure storage account receives data but does not commonly push data out to other systems, which is the kind of outbound activity that would require an identity. This is why you see no managed identities for Azure storage accounts. A system‐assigned managed identity can be used only by the Azure resource to which it is bound, and it is deleted when the Azure resource is deleted. A user‐assigned managed identity is a standalone Azure resource with its own lifecycle and can be shared across Azure products. Perform Exercise 8.9, where you create a user‐assigned managed identity.
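Granting a managed identity access to a data platform resource typically does not involve a password at all. For a dedicated SQL pool, for example, a sketch like the following creates a database user from the identity's Azure AD name and adds it to a role; the identity name brainjammerMI is hypothetical:

-- Run on the dedicated SQL pool; the identity authenticates with an Azure AD token, not a secret
CREATE USER [brainjammerMI] FROM EXTERNAL PROVIDER;
EXEC sp_addrolemember 'db_datareader', 'brainjammerMI';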

Encrypt Data at Rest and in Motion – Keeping Data Safe and Secure

With very few exceptions, data stored on the Azure platform is encrypted at rest by default. This means that if a hard drive that contains your data is stolen or the server where your data is stored is unlawfully accessed, the data cannot be deciphered. Keep in mind that when you give a group, individual, or a service principal access to the data, the unencrypted data is available to them. From an Azure Synapse Analytics dedicated SQL pool perspective, the means for encrypting server‐side data files, data backups, and log files at rest rely on the Transparent Data Encryption (TDE) feature. As shown in Figure 8.34, TDE can be enabled on the Transparent Data Encryption blade in the Azure portal.

FIGURE 8.34 Encrypt data at rest, TDE, dedicated SQL pool.
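TDE can also be toggled with T-SQL instead of the portal. The following is a minimal sketch, assuming the dedicated SQL pool is named sqlpool and the statements are executed while connected to the master database on that endpoint:

ALTER DATABASE sqlpool SET ENCRYPTION ON;

-- Verify the state; is_encrypted returns 1 once encryption is complete
SELECT name, is_encrypted FROM sys.databases WHERE name = 'sqlpool';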

Encryption of data in motion is achieved by using the strongest possible version of TLS when accessing your data using the HTTP protocol. When HTTPS is utilized, a secure connection is made between the data consumer and the server. When the data is transferred through the network, it is encrypted using the strongest cipher that both the client and server support. If you want to make the network transmission even more secure, you might consider isolating the network communication channel using a virtual private network (VPN). This can be achieved by implementing the Azure VPN Gateway product or an Azure ExpressRoute. There are three common VPN models: point‐to‐point, point‐to‐site, and site‐to‐site. A point‐to‐point connection is restricted to a single connection between two machines, whereas a point‐to‐site connection means that a single machine has access to all the machines within a given site. A site in this context means an on‐premises network. A site‐to‐site model would allow all machines contained in two networks to connect with each other. An ExpressRoute connection is a dedicated connection from Azure to your on‐premises datacenter. It is configured with the help of an Internet service provider and is costly; however, it is the most secure means for transferring data, as the transmission does not traverse the Internet at all.

An additional concept, encryption‐in‐use, was explained earlier in the “Design Data Encryption for Data at Rest and in Transit” section. Recall that encryption‐in‐use is enabled using a product called Always Encrypted. As mentioned previously, TDE is used to encrypt data on the server‐side; however, there may be a need to encrypt data client‐side within an application. Always Encrypted is useful for encrypting data client‐side, directly within an application, before storing it on the Azure platform. Encrypting data client‐side ensures that individuals with highly privileged server‐side credentials cannot view or decrypt the data without being specifically granted permission. This gives you complete control over your data, while allowing a third party, like Microsoft, to perform the database administration responsibilities.
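To make the concept concrete, the following sketch shows what an Always Encrypted column definition looks like in T-SQL on a platform that supports the feature, such as Azure SQL Database; the key name CEK_Subjects is hypothetical, and the column master key and column encryption key must be created beforehand with CREATE COLUMN MASTER KEY and CREATE COLUMN ENCRYPTION KEY:

CREATE TABLE dbo.SUBJECTS_SENSITIVE (
    ID INT NOT NULL,
    -- Deterministic encryption still permits equality lookups; randomized is stronger
    EMAIL NVARCHAR(256) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (
            COLUMN_ENCRYPTION_KEY = CEK_Subjects,
            ENCRYPTION_TYPE = DETERMINISTIC,
            ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
        )
);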

Implement Row‐Level and Column‐Level Security

Row‐level security (RLS) is implemented at the information protection level of the layered security model (refer to Figure 8.1). Row‐level security restricts access to rows in a database table and is realized using the CREATE SECURITY POLICY command and predicates. Figure 8.35 illustrates row‐level security.

FIGURE 8.35 Row‐level security
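The policy itself is built from a predicate function and a CREATE SECURITY POLICY statement. The following is a minimal sketch, assuming the BRAINWAVES table has a USERNAME column that stores the database user permitted to read each row; the function and policy names are hypothetical:

-- Predicate returns a row only when the row's USERNAME matches the current database user
CREATE FUNCTION dbo.fn_brainwavesPredicate (@USERNAME AS NVARCHAR(128))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result WHERE @USERNAME = USER_NAME();
GO
-- Bind the predicate to the table as a filter
CREATE SECURITY POLICY dbo.BrainwavesFilter
ADD FILTER PREDICATE dbo.fn_brainwavesPredicate(USERNAME) ON dbo.BRAINWAVES
WITH (STATE = ON);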

Using the EXECUTE AS statement prior to the SELECT statement, as follows, results in the result set being filtered based on the user permissions:

 EXECUTE AS USER = 'BrainwaveBrainjammer1'

 SELECT * FROM BRAINWAVES



Column‐level security restricts access to specific columns on a database table, as illustrated in Figure 8.36.

To get hands‐on experience implementing column‐level security, complete Exercise 8.7.

FIGURE 8.36 Column‐level security

Apply Sensitivity Labels and Data Classifications Using Microsoft Purview and Data Discovery – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Microsoft Purview Governance Portal you provisioned in Exercise 8.2 ➢ select the Data Estate Insights hub ➢ select the link under the Assets heading on the Data Stewardship blade (the link is represented by the number 49 in Figure 8.26) ➢ scroll down and select the SUBJECTS table link ➢ select the Schema tab ➢ and then click the Edit button. The configuration should resemble Figure 8.27.

FIGURE 8.27 Microsoft Purview Data estate insights schema data classification

  2. Click the Save button ➢ observe the results ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu link ➢ select your dedicated SQL pool ➢ start the SQL pool ➢ and then select Data Discovery & Classification. The following message will appear at the top of the Data Discovery & Classification blade: “Currently using SQL Information Protection policy. We have found 6 columns with classification recommendations,” as shown in Figure 8.28. Click that message. The six columns and the recommendations should resemble Figure 8.28.

FIGURE 8.28 SQL Information Protection policy classification recommendations

  3. Click the + Add Classification menu button ➢ select dbo from the Schema Name drop‐down ➢ select SUBJECTS from the Table Name drop‐down ➢ select COUNTRY from the Column Name drop‐down ➢ select Contact Info from the Information Type drop‐down ➢ select General from the Sensitivity Label drop‐down ➢ and then click Add Classification. The configuration should resemble Figure 8.29.

FIGURE 8.29 Data Discovery & Classification, Add classification 2

  4. Select the Select All check box ➢ click the Accept Selected Recommendations button ➢ and then click Save. The Overview tab will display something similar to Figure 8.30.

FIGURE 8.30 Data Discovery & Classification overview

  5. Execute the following two SQL statements on your dedicated SQL pool. The statements are in the auditSubjects.sql file in the Chapter08/Ch08Ex05 directory on GitHub.
  6. Navigate back to your Dedicated SQL Pool blade in the Azure portal ➢ select the Auditing navigation menu item ➢ select View Audit Logs ➢ select Log Analytics ➢ and then execute the following query; the query is in the Chapter08/Ch08Ex05 directory on GitHub.
  7. Notice the contents added to the DataSensitivityInformation column. Consider stopping the dedicated SQL pool.

Microsoft Purview is an extensive tool, and many of its capabilities are outside the scope of this book. A note in Figure 8.30 nicely summarizes Microsoft Purview: “For advanced classification capabilities, use Azure Purview.” This is because Microsoft Purview can span a much greater scope of data sources when compared to the Auditing capabilities available for the dedicated SQL pool and an Azure Synapse Analytics workspace. The exercises that included Microsoft Purview are meant as an introduction to get you started. In step 1 of Exercise 8.5, you added column‐level classification values to the SUBJECTS table. In step 3, you added information type values (aka column‐level classification values) and sensitivity labels to the SUBJECTS table again. You also added an additional classification on the COUNTRY column of the SUBJECTS table with a sensitivity label of General.

After these data labeling activities were completed, and because Auditing is enabled on this dedicated SQL pool, the INSERT and SELECT statements were logged. Navigating to your Log Analytics workspace and executing the query that searches the SQLSecurityAuditEvents table, you notice some new results being populated into the DataSensitivityInformation column. The following is a summary of the result. The full value from that column is in the DataSensitivityInformation.xml file in the Chapter08/Ch08Ex05 directory on GitHub.

The information contained in the DataSensitivityInformation column describes the security label and type of information being retrieved by the SELECT statement. Remember that the SELECT statement is stored in the Statement column. Using the data in this table in combination with the user identity stored in the ServerPrincipalName column provides good information about who accessed what information and how often. There are many methods for adding sensitivity labels. Consider, for example, the following SQL statement, which sets the CREATE_DATE column on the SUBJECTS table to Public:
ADD SENSITIVITY CLASSIFICATION TO dbo.SUBJECTS.CREATE_DATE
WITH ( LABEL='Public', INFORMATION_TYPE='Administrative', RANK=LOW )
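You can confirm the classification from T-SQL as well. A query like the following joins the sys.sensitivity_classifications catalog view to the object and column metadata:

SELECT o.name AS table_name,
       c.name AS column_name,
       sc.label,
       sc.information_type,
       sc.rank_desc
FROM sys.sensitivity_classifications sc
JOIN sys.objects o ON o.object_id = sc.major_id
JOIN sys.columns c ON c.object_id = sc.major_id AND c.column_id = sc.minor_id;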

The Azure CLI also provides many options for managing data classifications. To view the sensitivity label that was placed on the CREATE_DATE column by the SQL statement, execute an Azure CLI command like the following:
az synapse sql pool classification show --name sqlpool \

You can also create sensitivity labels using Azure CLI cmdlets. The last topic to cover concerning managing sensitive information has to do with the management of files. Up to this point, the focus has been on tables within a relational database. However, while working on a data analytics solution, you will very likely come across the scenario of sensitive data sent and received within files.

To protect sensitive data, you can create directory structures like the following, which include a directory named Confidential, for example:
EMEA\brainjammer\raw-files\Confidential\YYYY\MM\DD\HH
EMEA\brainjammer\cleansed-data\Confidential\YYYY\MM\DD
EMEA\brainjammer\business-data\Confidential\YYYY\MM

Then, as shown in Figure 8.31, the directories are protected using ACLs.

FIGURE 8.31 Protecting sensitive data in files

Figure 8.31 is taken from Microsoft Azure Storage Explorer and illustrates that one individual and the Azure Synapse Analytics service principal identity have ACL access to the Confidential directory. Consider creating a folder for each sensitivity label—for example, Public, General, Highly Confidential, and GDPR—and granting the necessary permissions to groups and service principals based on your business requirements.

Implement a Data Retention Policy

In Exercise 4.5, you implemented an Azure Storage account lifecycle management policy. Adding the deleteAfter90Days policy definition, as discussed previously, implements a data retention policy in that context. To implement a data retention policy that applies to data stored in a relational database, for example, an Azure Synapse Analytics dedicated SQL pool, complete Exercise 8.6.

Configure and Perform a Data Asset Scan Using Microsoft Purview – Keeping Data Safe and Secure-2

If you have not created the SUBJECTS table on your dedicated SQL pool, create the table using the SUBJECTS.sql file located in the Chapter08 directory on GitHub.

The first action you took after accessing the Azure portal was to add the Microsoft Purview account identity to the Reader role of the Azure Synapse Analytics workspace. Note that adding this role assignment at the workspace level results in the Reader permissions being granted to all resources that exist in the workspace. This is a good level of access for Microsoft Purview to perform proper governance and auditing activities. It is also possible to provide this level of access specifically to a SQL or Spark pool using the same approach via the Access control (IAM) role assignments feature while those analytics pools are in focus. Next, you navigated to the Manage hub on the Azure Synapse Analytics workspace and bound the Microsoft Purview account together with the workspace. This provided easy access to the Microsoft Purview Governance portal.

 Until you configure the new credential, as shown in Figure 8.21, you may receive a Failed to load serverless databases from Synapse workspace error message. Once you select the new credential (for example, sqladminuser), the error will go away. In this example, the username and password are the same for both the serverless and dedicated SQL pools.

Once in the Microsoft Purview Governance portal, you registered a source named ASA‐csharpguitar under the R&D parent collection. After registering the source that targets your Azure Synapse Analytics workspace, you began configuring an asset scan. A credential that can access both the serverless and dedicated SQL pool is required at this point. Selecting the + New item from the Credential drop‐down list box provided the option to do this. You added an Azure Key Vault connection that targets the secret created in Exercise 8.1. The secret contains the password of your dedicated SQL pool, which is, in this example, the same as the built‐in serverless SQL pool. Once configured and selected from the Credential drop‐down list box, you were able to select the dedicated SQL pool as the target data source of the scan.

When you selected to use the System Default scan rule set, you chose to use all the supported classification rules. While configuring the scan, you might have noticed the View Details link below that value. Clicking the View Details link currently renders a list of 208 classification rules grouped together with names such as Government, Financial, Base, Personal, Security, and Miscellaneous. You also have the option to create a custom rule that allows you to include your own additional set of items to scan for. The Security scan checks for passwords that match common patterns; the Government scan checks for values that match an ID; and the Personal scan checks for birth dates, email addresses, and phone numbers, for example. If you didn’t look at that, go back and check it out for the full set of attributes that are searched for when running an asset scan. The next window gives you the option to schedule the audit scan weekly or monthly. In a live scenario, where you have a lot of activity on your data sources, this would be a good idea. Lastly, you ran the scan, viewed the results shown in Figure 8.22, and then stopped the dedicated SQL pool. In Exercise 8.5 you will use those results to classify and curate the data existing in the SUBJECTS table.

Azure Synapse Analytics includes an Auditing feature for dedicated SQL pools. Complete Exercise 8.4 to configure and implement Auditing on an Azure Synapse Analytics dedicated SQL pool.

Configure and Perform a Data Asset Scan Using Microsoft Purview – Keeping Data Safe and Secure-1

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the Access Control (IAM) navigation menu item ➢ click the + Add menu button ➢ select Add Role Assignment from the drop‐down list ➢ select Reader from the Role list ➢ click Next ➢ select the Managed Identity radio button ➢ click the + Select member link ➢ select Microsoft Purview Account from the Managed Identity drop‐down list box ➢ select the Microsoft Purview account you created in Exercise 8.2 ➢ click the Select button ➢ click the Review + Assign button ➢ navigate to the Overview blade ➢ click the Open link in the Open Synapse Studio tile ➢ select the Manage hub ➢ select SQL Pools from the menu list ➢ and then start the dedicated SQL pool.
  2. Select the Manage hub ➢ select the Microsoft Purview item in the External Connections section ➢ click the Connect to a Purview Account button ➢ select the Purview account created in Exercise 8.2 (for example, brainjammer) ➢ click Apply ➢ and then select the link to your Microsoft Purview account tab, as shown in Figure 8.20.

FIGURE 8.20 Connecting Microsoft Purview to Azure Synapse Analytics workspace

  3. Select the Data Map hub ➢ select Sources from the navigation menu ➢ select the Register menu item ➢ select Azure Synapse Analytics from the Register Source window ➢ click Continue ➢ enter a name (I used ASA‐csharpguitar) ➢ select the workspace you configured in step 2 from the Workspace Name drop‐down list box ➢ select the R&D collection from the Select a Collection drop‐down list box ➢ and then click Register.
  4. Select the View Details link on the just registered source in the Map view ➢ select the New Scan menu item ➢ enter a name (I used ScanDedicatedSQLPool) ➢ select + New from the Credential drop‐down list box ➢ enter a name (I used sqladminuser) ➢ enter the user ID/name of your Azure Synapse Analytics dedicated SQL pool (I used sqladminuser) ➢ select + New from the Key Vault Connection drop‐down list box ➢ enter a name (I used brainjammerKV) ➢ select the Key Vault you created in Exercise 8.1 ➢ click Create ➢ enter the Azure Key Vault secret name that stores your Azure Synapse Analytics password (I used azureSynapseSQLPool) ➢ click Create ➢ select your dedicated SQL pool from the SQL Database drop‐down list box ➢ select the Test Connection link ➢ and then click Continue. The configuration resembles Figure 8.21.

FIGURE 8.21 Configuring scanning in Microsoft Purview

  5. Click the Continue button to perform the scan using the default scan rule set ➢ select the Once radio button ➢ click Continue ➢ and then click the Save and Run button. When the scan is complete, you will see something like Figure 8.22.

FIGURE 8.22 The result of a Microsoft Purview scan

  6. Stop your Azure Synapse Analytics dedicated SQL pool.

Audit an Azure Synapse Analytics Dedicated SQL Pool – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ enter Log Analytics Workspaces in the Search box and select it from the drop‐down ➢ click the + Create button ➢ select a subscription ➢ select a resource group ➢ enter a name (I used brainjammer) ➢ select a region ➢ click the Review + Create button ➢ and then click Create.
  2. Navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu item ➢ select your dedicated SQL pool ➢ start the dedicated SQL pool ➢ select the Auditing navigation menu item ➢ set the Enable Azure Auditing toggle switch to on ➢ select the Log Analytics check box ➢ select the subscription where you placed the Log Analytics workspace from step 1 ➢ select the workspace from the Log Analytics drop‐down list box ➢ and then click Save. The configuration should resemble Figure 8.23.

FIGURE 8.23 Dedicated SQL pool auditing configuration

  3. After some minutes, navigate to the Diagnostic Settings navigation menu item for the dedicated SQL pool ➢ notice the setting prefixed with SQLSecurityAuditEvents_ ➢ click the Edit Setting link associated with that diagnostic setting ➢ check the Sql Requests check box ➢ and then click Save. The configuration should resemble Figure 8.24.

FIGURE 8.24 Dedicated SQL pool Diagnostic setting configuration

  4. After some minutes, execute the following query on the dedicated SQL pool configured in the previous steps:
    SELECT * FROM [dbo].[SUBJECTS]
  5. After some minutes, select the View Audit Logs menu button (refer to Figure 8.23) ➢ click the Log Analytics menu button ➢ and then execute the default query. An example query is available on GitHub in the Chapter08/Ch08Ex04 directory. The output should resemble Figure 8.25.

FIGURE 8.25 View dedicated SQL pool audit logs in Log Analytics.

  6. Stop the dedicated SQL pool.

The first step taken in Exercise 8.4 resulted in the provisioning of an Azure Log Analytics workspace. This is the recommended location to store, analyze, and manage audit logs. Chapter 9, “Monitoring Azure Data Storage and Processing,” includes detailed coverage of Azure Log Analytics, including monitoring.

The provisioning was very straightforward, so at this point all you need to know is that Log Analytics is an optimal datastore for storing audit and diagnostic logs generated by Azure products. After selecting the Auditing navigation menu item bound to the dedicated SQL pool, you were presented with a blade that looked similar to Figure 8.23. There are a few points to call out on this blade.

The exercise instructed you to configure auditing for the dedicated SQL pool; however, there is another option that enables auditing for all SQL pools in the Azure Synapse Analytics workspace. This level of auditing was not needed to experience the desired learning from the exercise, but in a real enterprise implementation of a data solution that uses multiple SQL pools, you should consider enabling auditing for all SQL pools.

Diagnostic Settings is a very helpful feature for most of the Azure products. Like Log Analytics, a full description of this feature is coming in Chapter 9. When you enabled auditing, a default setting was created to capture SQL security audit events. You modified the setting to add SQL requests to the events that are captured and stored into Log Analytics. After completing the configuration and running a SQL statement on any table hosted in the dedicated SQL pool, you reviewed the logs. When you clicked the View Audit Logs menu button, as shown in Figure 8.23, you were navigated to the Log Analytics workspace.
The Log Analytics query editor enabled you to execute queries against the audit data being stored into it. Exercise 8.4 used the following query:
SQLSecurityAuditEvents
| where Category == 'SQLSecurityAuditEvents'
| project EventTime, Statement, Succeeded, AffectedRows, ResponseRows,
ServerPrincipalName, ClientIp, ApplicationName, AdditionalInformation,
DataSensitivityInformation, DurationMs, ClientTlsVersion,
IsServerLevelAudit, IsColumnPermission
| order by EventTime desc
| take 100

The query retrieves audit records from a Log Analytics table named SQLSecurityAuditEvents, where the value stored in the Category column is equal to SQLSecurityAuditEvents.
The SQLSecurityAuditEvents table includes many columns. By using project, the returned dataset is limited by the columns identified following that keyword. The dataset is ordered by EventTime and limited to 100 rows. Some interesting data included in the result is the SQL statement that was executed, who executed it, and the IP address from the machine that performed the query. There will be more discussion around these columns later, after you set some sensitivity labels and data classifications for data columns.

Manage Sensitive Information

The tool capable of discovering and auditing your data is Microsoft Purview, which can search your entire data estate hosted on Azure, on‐premises, and even on other cloud provider platforms. In Exercise 8.3, you used Microsoft Purview to scan a dedicated SQL pool to identify the assets within it. Figure 8.26 shows the result of that scan. The result is accessible on the Data Estate Insights hub within the Microsoft Purview Governance portal.

FIGURE 8.26 Scanning a dedicated SQL pool with Microsoft Purview

The scan found 49 total assets that have neither sensitivity labels nor classifications. This is because neither of those activities has been performed yet. Something like what you saw previously in Figure 8.11 needs to be carried out on the data. That example of data classification was illustrated from another tool for managing sensitive data, Data Discovery & Classification, which, when used in combination with Auditing, can be used to target dedicated SQL pools running in an Azure Synapse Analytics workspace. Complete Exercise 8.5, where you will apply sensitivity labels and data classifications.

Implement a Data Retention Policy – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu link ➢ select your dedicated SQL pool ➢ start the SQL pool ➢ select the Overview blade ➢ and then click the Open link in the Open Synapse Studio tile.
  2. Navigate to the Data hub ➢ select the ellipse (…) to the right of your dedicated SQL pool ➢ select New SQL script from the pop‐up menu ➢ select Empty Script ➢ and then execute the following command. The command text is available in the uspApply90DayRetentionPolicySubjects.sql file in the Chapter08/Ch08Ex06 directory on GitHub.
    CREATE PROCEDURE dbo.uspApply90DayRetentionPolicySubjects
    AS
    DELETE FROM dbo.SUBJECTS WHERE CREATE_DATE < DATEADD(DAY, -90, GETDATE())
    GO

  3. Expand your dedicated SQL pool ➢ expand the Programmability folder ➢ expand the Stored Procedures folder ➢ select the ellipse to the right of the stored procedure you created in step 2 ➢ select Add to Pipeline from the pop‐up menu ➢ select New Pipeline from the pop‐out menu ➢ enter a name (I used Subjects 90 day Retention Policy) ➢ select the Settings tab ➢ click Commit ➢ click Publish ➢ and then click OK. The configuration should resemble Figure 8.32.

FIGURE 8.32 Implement a data retention policy in Azure Synapse Analytics.

  4. Click the Add Trigger button ➢ select New/Edit ➢ select + New from the Add Triggers drop‐down list box ➢ configure the scheduled task to run on the first day of every month, similar to that shown in Figure 8.33 ➢ click the Commit button ➢ and then publish the trigger.

FIGURE 8.33 Implement a data retention policy schedule pipeline trigger.

  5. Stop the dedicated SQL pool.

Exercise 8.6 begins with the creation of a stored procedure that removes data from the SUBJECTS table. The WHERE clause checks if the date contained in the CREATE_DATE column is older than 90 days. If it is, then the row is deleted. You then added that stored procedure to an Azure Synapse Analytics pipeline using the SQL pool Stored Procedure activity. Once committed and published, you configured a schedule trigger to run the stored procedure once per month. This results in the deletion of data based on a 90‐day retention. You might consider not committing and publishing the trigger, unless you really want to implement the retention policy.
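If you want to confirm the behavior without waiting for the monthly trigger, a quick check like the following (a sketch) previews how many rows fall outside the 90‐day window and then runs the procedure on demand:

SELECT COUNT(*) AS RowsToDelete
FROM dbo.SUBJECTS
WHERE CREATE_DATE < DATEADD(DAY, -90, GETDATE());

EXEC dbo.uspApply90DayRetentionPolicySubjects;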

Implement Column‐level Security – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu link ➢ select your dedicated SQL pool ➢ start the SQL pool ➢ after the SQL pool is running, select the Overview blade ➢ and then click the Open link in the Open Synapse Studio tile.
  2. Navigate to the Data hub ➢ select the ellipse (…) to the right of your dedicated SQL pool ➢ select New SQL Script from the pop‐up menu ➢ select Empty Script ➢ select master from the Use Database drop‐down list box ➢ and then execute the following:
    CREATE LOGIN brainjammer WITH PASSWORD = 'csh@rpgu1tar'
  3. Select your dedicated SQL pool from the Use Database drop‐down list box, and then execute the following. The SQL is located in the column‐level.sql file in the Chapter08/Ch08Ex07 directory on GitHub.
    CREATE USER brainjammer FROM LOGIN brainjammer WITH DEFAULT_SCHEMA = dbo
    EXEC sp_addrolemember 'db_datareader', 'brainjammer'
    DENY SELECT ON dbo.SUBJECTS TO brainjammer
    GRANT SELECT ON SUBJECTS
    (ID, FIRSTNAME, LASTNAME, EMAIL, COUNTRY, CREATE_DATE) TO brainjammer
  4. Log in to the dedicated SQL pool using Azure Data Studio; for example, executing the following with the user created in step 2 will give you the error shown in Figure 8.37.
    SELECT * FROM [dbo].[SUBJECTS]

FIGURE 8.37 Column‐level security enforcement exception

  5. Stop the dedicated SQL pool.

Before you can create a user for a specific database, the login must first exist on the master database. Executing the following SQL command while targeting the master database hosted on your dedicated SQL pool accomplishes this:
CREATE LOGIN brainjammer WITH PASSWORD = 'csh@rpgu1tar'

Redirecting focus back to the SQLPool database, you then create a user who uses the login credentials created on the master database. You accomplished this with the following SQL statement. In addition to the user account creation, the default schema is set to dbo, which is where the SUBJECTS table is located.
CREATE USER brainjammer FROM LOGIN brainjammer WITH DEFAULT_SCHEMA = dbo

Adding the user account to a read‐only group, as follows, provides read access permissions to all the tables on that database:
EXEC sp_addrolemember 'db_datareader', 'brainjammer'

To view all roles and the members of those roles, execute the following SQL command. This SQL text is located in the databaseRole.sql file in the Chapter08/Ch08Ex07 directory on GitHub. Figure 8.38 shows the result of the query.
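A query along these lines (a sketch; the file on GitHub contains the exact text) lists each database role together with its members:

SELECT r.name AS RoleName, m.name AS MemberName
FROM sys.database_role_members drm
JOIN sys.database_principals r ON drm.role_principal_id = r.principal_id
JOIN sys.database_principals m ON drm.member_principal_id = m.principal_id;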

FIGURE 8.38 Role and membership details on a dedicated SQL pool database

When the user is granted read access to the tables, it defaults to all the columns on that table. In order to remove access to all the columns for the SUBJECTS table, you first execute the following SQL statement, which removes SELECT access to the table:
DENY SELECT ON dbo.SUBJECTS TO brainjammer

Then you use the following SQL command to grant access only to the columns that the user should be permitted to retrieve:
GRANT SELECT ON SUBJECTS
(ID, FIRSTNAME, LASTNAME, EMAIL, COUNTRY, CREATE_DATE) TO brainjammer

From that point on, when the user, brainjammer, accesses the SQLPool database and retrieves data from the SUBJECTS table, only the columns that have been granted permission can be retrieved, for example, with the following SQL query:
SELECT ID, FIRSTNAME, LASTNAME, EMAIL, COUNTRY, CREATE_DATE FROM dbo.SUBJECTS

Always shut down the dedicated SQL pool when not in use, to reduce costs.

Implement Data Masking

The primary difference between column‐level security and dynamic data masking is that column‐level security prevents the retrieval of the column completely. For example, if a user does not have permission on a column named BIRTHDATE and they attempt to retrieve it, a permission denied exception is rendered. Data masking, however, does allow the retrieval of the column data, although the data is partially concealed. Consider, for example, the EMAIL column from Exercise 8.7, which contains an individual’s email address. This is not data you would want everyone to have access to, for numerous reasons. For example, someone writes a program that is intended to notify your customers of downtime or outage events. Consider that there are hundreds of thousands of customers in your database. During testing, the data is selected and emails are sent to all your customers. Placing a mask on the EMAIL column would change the retrieved value from the full address, something like benperk@csharpguitar.net, to a masked value such as bXX@XXXX.com. Doing so would prevent sending all your customers an email in error. At the same time, however, you might want to see part of the column value for validation, perhaps the last four digits of a credit card number or identification number. Complete Exercise 8.8, where you will configure a dynamic data mask on the EMAIL column of the SUBJECTS table.
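Exercise 8.8 applies the mask through the Azure portal, but the same default email mask can be configured with T-SQL. The following is a sketch; the GRANT UNMASK statement shows how a specific user, such as brainjammer, could be excluded from the mask if required:

ALTER TABLE dbo.SUBJECTS
ALTER COLUMN EMAIL ADD MASKED WITH (FUNCTION = 'email()');

-- Optionally allow a specific principal to see unmasked values
GRANT UNMASK TO brainjammer;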

Design a Data Masking Strategy – Keeping Data Safe and Secure

A mask is an object that partially conceals what is behind it. From a data perspective, a mask would conceal a particular piece of the data but not all of it. Consider, for example, email addresses, names, credit card numbers, and telephone numbers. Those classifications of data can be helpful if there is ever a need to validate a person’s identity. However, you would not want all of the data rendered in a query; instead, you can show only the last four digits of the credit card number or the first letter of an email address and the top level domain value like .com, .net, or .org, like the following:

 bxxxxx.net

There is a built‐in capability for this masking in the Azure portal related to an Azure Synapse Analytics dedicated SQL pool. As shown in Figure 8.13, navigating to the Dynamic Data Masking blade renders the masking capabilities.

The feature will automatically scan your tables and find columns that may contain data that would benefit from masking. You apply the mask by selecting the Add Mask button, selecting the mask, and saving it. Then, when a user who is not in the excluded list, as shown in Figure 8.13, accesses the data, the mask is applied to the resulting dataset. Finally, the primary objective of masking is to conceal enough of the data in a column so that it can be used but not exploited. That partial data visibility demonstrates the difference between masking and encryption. When data in a column is encrypted, none of it is readable, whereas a mask can be configured to allow partial data recognition.

FIGURE 8.13 Dynamic Data Masking dedicated SQL pool

Design Access Control for Azure Data Lake Storage Gen2

There are four authorization methods for Azure storage accounts. The method you have been using in most scenarios up to now has been through access keys. An access key resembles the following:

 4jRwk0Ho7LB+si85ax…yuZP+AKrr1FbWbQ==

The access key is used in combination with the protocol and storage account name to build the connection string. Clients can then use this to access the storage account and the data within it. The connection string resembles the following:

 DefaultEndpointsProtocol=https;AccountName=<name>;AccountKey=<account-key>

On numerous occasions you have created linked services in the Azure Synapse Analytics workspace. When configuring a linked service for an Azure storage account, you might remember seeing the configuration shown in Figure 8.14, which requests the information required to build the Azure storage account connection string.

FIGURE 8.14 ADLS access control access keys

Notice that the options request the authentication type, which is set to access key, sometimes also referred to as an account key, to be used as part of a connection string, followed by the storage account name and the storage account key. Those values are enough for the client—in this case, an Azure Synapse Analytics linked service—to successfully make the connection to the storage account. HTTPS is used by default, which enforces data encryption in transit; therefore, the protocol is not requested. Another authorization method similar to access keys is called shared access signature (SAS) authorization. This authorization method gives you a bit more control over what services, resources, and actions a client can access on the data stored in the account. Figure 8.15 shows the Shared Access Signature blade in the Azure portal for Azure storage accounts.
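A SAS token can also be used from within a SQL pool itself, rather than a linked service. The following sketch, using hypothetical names and run in a serverless SQL pool database, stores the token in a database scoped credential and references it from an external data source; the placeholder values are not real:

-- A database master key must exist before a scoped credential can be created
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strongPassword>';

CREATE DATABASE SCOPED CREDENTIAL adlsSasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';  -- the SAS token, without the leading ?

CREATE EXTERNAL DATA SOURCE brainjammerAdls
WITH (
    LOCATION = 'https://<accountName>.dfs.core.windows.net/<container>',
    CREDENTIAL = adlsSasCredential
);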

When you use either an access key or a SAS URL, any client with that token will get access to your storage account. There is no identity associated with either of those authorization methods; therefore, protecting that token is very important. This is one reason that retrieving the account key from an Azure key vault is offered as an option, as you saw in Figure 8.14. Storing the access key and/or the SAS URL in an Azure key vault removes the need to store the key within the realm of an Azure Synapse Analytics workspace. Although storing keys in a key vault is safe, reducing the number of clients that possess your authorization keys is good design. Any entity that needs these keys can be granted access to the Azure key vault and retrieve the keys for making the connection to your storage account. The other two remaining authorization methods are RBAC and ACL, which are covered in the following sections. As an introduction to those sections, Table 8.2 provides some details about both the Azure RBAC and ACL authorization methods.

TABLE 8.2 Azure storage account authorization methods

Method | Scope | Requires an identity | Granularity level
Azure RBAC | Storage account, container | Yes | High
ACL | File, directory | Yes | Low

FIGURE 8.15 ADLS Access control shared access signature

Authorization placed on an Azure storage account using Azure RBAC is achieved through a role assignment at the storage account or container level. ACLs are implemented by assigning read, write, or delete permissions on a file or directory. As shown in Figure 8.16, if the identity of the person or service performing the operation is associated with an RBAC group whose role assignment allows file deletion, then that access is granted, regardless of the ACL permission.

However, if the identity associated with the group that is assigned RBAC permissions does not have the authorization to perform a delete but does have the ACL permission, then the file can be deleted. The following sections describe these authentication methods in more detail.

Optimize Pipelines for Analytical or Transactional Purposes – Design and Implement a Data Stream Processing Solution

There are numerous approaches for optimizing data stream pipelines. Two such approaches are parallelization and compute resource management. You have learned that setting a partition key in the data message results in the splitting of messages across multiple nodes. By doing this, your data streams are processed in parallel. You can see how this looks in Figure 7.31 and the discussion around it. Managing the compute resources available for processing the data stream will have a big impact on the speed and availability of your data to downstream consumers. As shown in Figure 7.48, the increase in the watermark delay was caused by the CPU utilization on the processor nodes reaching 100 percent.

FIGURE 7.48 Azure Stream Analytics job metrics, CPU at 99 percent utilization

When the CPU is under such pressure, event messages get queued, which causes a delay in processing and providing output. Increasing the number of SUs allocated to the Azure Stream Analytics job will have a positive impact on the pipeline, making it more optimal for analytical and transactional purposes.
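The parallelization approach mentioned at the start of this section is expressed directly in the job query. The following is a minimal sketch of a partitioned Stream Analytics query, using hypothetical input, output, and column names; keeping the partition key in both the FROM clause and the GROUP BY clause lets each node process its partition independently:

SELECT Scenario, AVG(ALPHA) AS AverageAlpha
INTO powerBIOutput
FROM brainwavesInput PARTITION BY PartitionId
GROUP BY Scenario, PartitionId, TumblingWindow(second, 30)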

Scale Resources

As mentioned in the previous section, adding more resources to the Azure Stream Analytics Job will result in faster processing of the data stream. This assumes that you have noticed some processing delays and see that the current level of allocated resources is not sufficient. To increase the amount of compute power allocated to an Azure Stream Analytics job, select the Scale navigation link for the given job, as shown in Figure 7.49.

FIGURE 7.49 Azure Stream Analytics job scaling

Once compute power is allocated, when you start the job, instead of seeing 3, as shown in Figure 7.18, you will see the number described in the Manual Scale Streaming Unit configuration. In this case, the number of SUs would be 60.