
Implement Data Masking – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu link ➢ select your dedicated SQL pool ➢ start the SQL pool ➢ after the SQL pool is running, select the Dynamic Data Masking blade ➢ select the + Add Mask menu item ➢ make the configurations as shown in Figure 8.39 ➢ and then click the Add button.

FIGURE 8.39 Implement data masking and masking rule.

  2. Click the Save button ➢ log in to the dedicated SQL pool using Azure Data Studio, with the credentials created in Exercise 8.7 ➢ and then execute the following:
    SELECT ID, FIRSTNAME, LASTNAME, EMAIL, COUNTRY, CREATE_DATE FROM dbo.SUBJECTS
  3. Notice that the data in the EMAIL column has the mask configured in step 1 and illustrated in Figure 8.39. Stop the dedicated SQL pool.

It is possible to configure a custom mask instead of using the preconfigured masks. This requires that you provide the number of starting characters to show, followed by the padding string (something like xxxxx.), followed by the number of ending characters to display. Using a prefix and suffix value of three and the padding on the EMAIL column would result in benxxxxx.net, for example, which is a bit more useful than what is provided using the default.
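
If you prefer T-SQL over the portal, a custom mask of that shape can be applied with a statement like the following. This is a minimal sketch that assumes the dbo.SUBJECTS table and EMAIL column used in the exercise; the prefix length, padding string, and suffix length are the three arguments to the partial() function.

 -- Apply a custom mask: show the first 3 and last 3 characters, pad the middle
 ALTER TABLE dbo.SUBJECTS
 ALTER COLUMN EMAIL ADD MASKED WITH (FUNCTION = 'partial(3, "xxxxx.", 3)');

 -- Remove the mask again if needed
 ALTER TABLE dbo.SUBJECTS
 ALTER COLUMN EMAIL DROP MASKED;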

Manage Identities, Keys, and Secrets Across Different Data Platform Technologies

Protecting credentials has historically been very challenging. Developers and data engineers need to access data, and that data is protected by an ID and password. As the complexity and size of your organization grows, it is easy to lose control over who has what credentials. Add that loss of control to the potential impact changing a password can have on a production environment. This scenario is commonly referred to as credential leakage. An initial solution to credential leakage was to store connection details in a managed credential store, something like Azure Key Vault. However, access to the credential store also requires credentials, so you are back in the same place as before the implementation of the credential store. The ultimate solution is to use a combination of Azure Key Vault and managed identities. Instead of using a credential to make a connection to a storage account or a database from application code, you instead reference the Azure Key Vault endpoint. An Azure Key Vault secret endpoint resembles the following:

https://<accountName>.vault.azure.net/secrets/<secretName>/5db1a9b5…

The code that uses that endpoint must use the DefaultAzureCredential class from the Azure Identity library. The library works with all popular programming languages: .NET, Python, Go, Java, etc. Passing a new DefaultAzureCredential instance to the SecretClient constructor results in the acquisition of the managed identity credential, which is a token. The client then stores all the attributes necessary to retrieve a secret from the Azure Key Vault endpoint. The following C# code performs this activity:

 var kvUri = "https://" + accountName + ".vault.azure.net";
 var client = new SecretClient(new Uri(kvUri), new DefaultAzureCredential());

You can use the client to get a secret by using the following C# syntax:

 var secret = await client.GetSecretAsync(secretName);

Now you know how a managed identity can help avoid credential leakage, but you might be wondering what exactly managed identities are and what you must know in order to implement them safely and securely. Table 8.3 compares the two types of managed identities: system‐assigned and user‐assigned.

TABLE 8.3 Managed identity types

Characteristic | System‐assigned managed identity | User‐assigned managed identity
Provisioning | Azure resources receive an identity by default, where supported. | Created manually.
Removal | The identity is deleted when the associated Azure resource is deleted. | Deleted manually.
Sharing | The identity cannot be shared among Azure resources. | Can be shared.

A system‐assigned managed identity is created during the provisioning of the Azure resource. For example, an Azure Synapse Analytics workspace and a Microsoft Purview account both have a system‐assigned identity by default. Azure products that are generally used to make connections to other Azure products or features have this managed identity created by default. In contrast, an Azure storage account receives data but does not commonly push data out to other systems, so it has no need for an identity to do so. This is why you see no managed identities for Azure storage accounts. A system‐assigned managed identity can be used only by the Azure resource to which it is bound, and it is deleted when that resource is deleted. A user‐assigned managed identity is a standalone Azure resource with its own lifecycle and can be shared across Azure products. Perform Exercise 8.9, where you create a user‐assigned managed identity.
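
Exercise 8.9 uses the Azure portal, but for reference, a user-assigned managed identity can also be created from the Azure CLI. The following is a minimal sketch; the identity and resource group names are placeholders.

 # Create a user-assigned managed identity in an existing resource group
 az identity create \
   --name brainjammerIdentity \
   --resource-group <resourceGroupName>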

Apply Sensitivity Labels and Data Classifications Using Microsoft Purview and Data Discovery – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Microsoft Purview Governance Portal you provisioned in Exercise 8.2 ➢ select the Data Estate Insights hub ➢ select the link under the Assets heading on the Data Stewardship blade (the link is represented by the number 49 in Figure 8.26) ➢ scroll down and select the SUBJECTS table link ➢ select the Schema tab ➢ and then click the Edit button. The configuration should resemble Figure 8.27.

FIGURE 8.27 Microsoft Purview Data estate insights schema data classification

  2. Click the Save button ➢ observe the results ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu link ➢ select your dedicated SQL pool ➢ start the SQL pool ➢ and then select Data Discovery & Classification. The following message will appear at the top of the Data Discovery & Classification blade: “Currently using SQL Information Protection policy. We have found 6 columns with classification recommendations,” as shown in Figure 8.28. Click that message. The six columns and the recommendations should resemble Figure 8.28.

FIGURE 8.28 SQL Information Protection policy classification recommendations

  3. Click the + Add Classification menu button ➢ select dbo from the Schema Name drop‐down ➢ select SUBJECTS from the Table Name drop‐down ➢ select COUNTRY from the Column Name drop‐down ➢ select Contact Info from the Information Type drop‐down ➢ select General from the Sensitivity Label drop‐down ➢ and then click Add Classification. The configuration should resemble Figure 8.29.

FIGURE 8.29 Data Discovery & Classification, Add classification 2

  4. Select the Select All check box ➢ click the Accept Selected Recommendations button ➢ and then click Save. The Overview tab will display something similar to Figure 8.30.

FIGURE 8.30 Data Discovery & Classification overview

  5. Execute the following two SQL statements on your dedicated SQL pool. The statements are in the auditSubjects.sql file in the Chapter08/Ch08Ex05 directory on GitHub.
  6. Navigate back to your Dedicated SQL Pool blade in the Azure portal ➢ select the Auditing navigation menu item ➢ select View Audit Logs ➢ select Log Analytics ➢ and then execute the following query; the query is in the Chapter08/Ch08Ex05 directory on GitHub.
  7. Notice the contents added to the DataSensitivityInformation column. Consider stopping the dedicated SQL pool.

Microsoft Purview is an extensive tool, and many of its capabilities are outside the scope of this book. A note in Figure 8.30 nicely summarizes Microsoft Purview: “For advanced classification capabilities, use Azure Purview.” This is because Microsoft Purview can span a much greater scope of data sources when compared to the Auditing capabilities available for the dedicated SQL pool and an Azure Synapse Analytics workspace. The exercises that included Microsoft Purview are meant as an introduction to get you started. In step 1 of Exercise 8.5, you added column‐level classification values to the SUBJECTS table. In step 3, you added information type values (aka column‐level classification values) and sensitivity labels to the SUBJECTS table again. You also added an additional classification on the COUNTRY column of the SUBJECTS table with a sensitivity label of General.

After these data labeling activities were completed, and because Auditing is enabled on this dedicated SQL pool, the INSERT and SELECT statements were logged. When you navigate to your Log Analytics workspace and execute the query that searches the SQLSecurityAuditEvents table, you will notice new results populated in the DataSensitivityInformation column. The following is a summary of the result. The full value from that column is in the DataSensitivityInformation.xml file in the Chapter08/Ch08Ex05 directory on GitHub.

The information contained in the DataSensitivityInformation column describes the sensitivity label and the type of information being retrieved by the SELECT statement. Remember that the SELECT statement itself is stored in the Statement column. Using the data in this table in combination with the user identity stored in the ServerPrincipalName column provides good information about who accessed what information and how often. There are many methods for adding sensitivity labels. Consider, for example, the following SQL statement, which sets the CREATE_DATE column on the SUBJECTS table to Public:
ADD SENSITIVITY CLASSIFICATION TO dbo.SUBJECTS.CREATE_DATE
WITH ( LABEL='Public', INFORMATION_TYPE='Administrative', RANK=LOW )
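
To verify classifications from T-SQL, you can query the sys.sensitivity_classifications catalog view. The following is a sketch that assumes the view is available on your dedicated SQL pool; it joins the classification metadata back to the table and column names.

 -- List classified columns along with their information type, label, and rank
 SELECT o.name AS table_name,
        c.name AS column_name,
        sc.information_type,
        sc.label,
        sc.rank_desc
 FROM sys.sensitivity_classifications AS sc
 JOIN sys.objects AS o
   ON o.object_id = sc.major_id
 JOIN sys.columns AS c
   ON c.object_id = sc.major_id
  AND c.column_id = sc.minor_id;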

The Azure CLI also provides many options for managing data classifications. To view the sensitivity label that was placed on the CREATE_DATE column by the SQL statement, execute the following Azure CLI command, which is followed by the output:
az synapse sql pool classification show --name sqlpool \

You can also create sensitivity labels using Azure CLI commands. The last topic to cover concerning the management of sensitive information is the management of files. Up to this point the focus has been on tables within a relational database. However, while working on a data analytics solution, you will very likely come across the scenario of sensitive data sent and received within files.

To protect sensitive data, you can create directory structures like the following, which include a directory named Confidential, for example:
EMEA\brainjammer\raw-files\Confidential\YYYY\MM\DD\HH
EMEA\brainjammer\cleansed-data\Confidential\YYYY\MM\DD
EMEA\brainjammer\business-data\Confidential\YYYY\MM

Then, as shown in Figure 8.31, the directories are protected using ACLs.

FIGURE 8.31 Protecting sensitive data in files

Figure 8.31 is taken from Microsoft Azure Storage Explorer and illustrates that one individual and the Azure Synapse Analytics service principal identity have ACL access to the Confidential directory. Consider creating a folder for each sensitivity label—for example, Public, General, Highly Confidential, and GDPR—and granting the necessary permissions to groups and service principals based on your business requirements.
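
Granting those directory-level permissions can also be scripted. The following Azure CLI sketch sets the ACL on a Confidential directory; the account, container, path, and object ID values are placeholders, and note that --acl replaces the existing ACL, so the owning user, group, and other entries are included.

 # Set the ACL on the Confidential directory, granting a specific identity rwx
 az storage fs access set \
   --account-name <storageAccountName> \
   --file-system <containerName> \
   --path "brainjammer/raw-files/Confidential" \
   --acl "user::rwx,group::r-x,other::---,user:<objectId>:rwx" \
   --auth-mode login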

Implement a Data Retention Policy

In Exercise 4.5, you implemented an Azure Storage account lifecycle management policy. Adding the deleteAfter90Days policy definition, as discussed previously, implements a data retention policy in that context. To implement a data retention policy that applies to data stored in a relational database, such as an Azure Synapse Analytics dedicated SQL pool, complete Exercise 8.6.
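
Exercise 8.6 walks through the details; conceptually, a relational retention policy often comes down to a statement like the following, run on a schedule (for example, from a pipeline). This sketch assumes the dbo.SUBJECTS table, its CREATE_DATE column, and a 90-day retention period.

 -- Remove rows that have aged beyond the 90-day retention period
 DELETE FROM dbo.SUBJECTS
 WHERE CREATE_DATE < DATEADD(DAY, -90, GETDATE());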

Design a Data Masking Strategy – Keeping Data Safe and Secure

A mask is an object that partially conceals what is behind it. From a data perspective, a mask conceals a particular piece of the data but not all of it. Consider, for example, email addresses, names, credit card numbers, and telephone numbers. Those classifications of data can be helpful if there is ever a need to validate a person’s identity. However, you would not want all of the data rendered in a query; instead, you can show only the last four digits of the credit card number, or the first letter of an email address and the top‐level domain value like .com, .net, or .org, as in the following:

 bXXX@XXXX.net

There is a built‐in capability for this masking in the Azure portal related to an Azure Synapse Analytics dedicated SQL pool. As shown in Figure 8.13, navigating to the Dynamic Data Masking blade renders the masking capabilities.

The feature will automatically scan your tables and find columns that may contain data that would benefit from masking. You apply the mask by selecting the Add Mask button, selecting the mask, and saving it. Then, when a user who is not in the excluded list, as shown in Figure 8.13, accesses the data, the mask is applied to the resulting dataset. Finally, the primary objective of masking is to conceal enough of the data in a column so that it can be used but not exploited. That partial data visibility demonstrates the difference between masking and encryption. When data in a column is encrypted, none of it is readable, whereas a mask can be configured to allow partial data recognition.

FIGURE 8.13 Dynamic Data Masking dedicated SQL pool
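
The excluded users list shown in Figure 8.13 corresponds to the UNMASK permission in T-SQL. A minimal sketch, assuming a database user named dataAnalyst exists, looks like this:

 -- Let a specific user see unmasked values in query results
 GRANT UNMASK TO [dataAnalyst];

 -- Return that user to seeing masked values
 REVOKE UNMASK FROM [dataAnalyst];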

Design Access Control for Azure Data Lake Storage Gen2

There are four authorization methods for Azure storage accounts. The method you have been using in most scenarios up to now has been through access keys. An access key resembles the following:

 4jRwk0Ho7LB+si85ax…yuZP+AKrr1FbWbQ==

The access key is used in combination with the protocol and storage account name to build the connection string, which clients can then use to access the storage account and the data within it. The connection string resembles the following:

 DefaultEndpointsProtocol=https;AccountName=<name>;AccountKey=<account-key>
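
For illustration, a client can consume that connection string directly. The following C# sketch uses the Azure.Storage.Files.DataLake package; the account values and container name are placeholders.

 // Requires the Azure.Storage.Files.DataLake NuGet package
 using Azure.Storage.Files.DataLake;

 var connectionString =
     "DefaultEndpointsProtocol=https;AccountName=<name>;AccountKey=<account-key>";

 // The connection string alone authorizes the client; no separate identity is involved
 var serviceClient = new DataLakeServiceClient(connectionString);
 var fileSystemClient = serviceClient.GetFileSystemClient("<containerName>");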

On numerous occasions you have created linked services in the Azure Synapse Analytics workspace. When configuring a linked service for an Azure storage account, you might remember seeing the options shown in Figure 8.14, which request the information required to build the Azure storage account connection string.

FIGURE 8.14 ADLS access control access keys

Notice that the options request the authentication type, which is set to access key, sometimes also referred to as an account key, to be used as part of a connection string, followed by the storage account name and the storage account key. Those values are enough for the client—in this case, an Azure Synapse Analytics linked service—to successfully make the connection to the storage account. HTTPS is used by default, which enforces encryption of data in transit; this is why the protocol is not requested. Another authorization method similar to access keys is shared access signature (SAS) authorization. This method gives you a bit more control over which services, resources, and actions a client can access on the data stored in the account. Figure 8.15 shows the Shared Access Signature blade in the Azure portal for Azure storage accounts.
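
A SAS token can be generated from the portal blade shown in Figure 8.15 or from the Azure CLI. The following sketch creates a read/list SAS for a single container; the account, container, key, and expiry values are placeholders.

 # Generate a SAS token scoped to one container with read and list permissions
 az storage container generate-sas \
   --account-name <storageAccountName> \
   --name <containerName> \
   --permissions rl \
   --expiry 2030-01-01T00:00Z \
   --account-key <account-key>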

When you use either an access key or a SAS URL, any client with that token will get access to your storage account. There is no identity associated with either of those authorization methods; therefore, protecting the token is very important. This is a reason that retrieving the account key from an Azure key vault is also offered as an option, as you saw in Figure 8.14. Storing the access key and/or the SAS URL in an Azure key vault removes the need to store the key within the realm of an Azure Synapse Analytics workspace. Although storing it there is safe, reducing the number of clients that possess your authorization keys is good design. Any entity that needs these keys can be granted access to the Azure key vault and retrieve the keys for making the connection to your storage account. The two remaining authorization methods are RBAC and ACLs, which are covered in the following sections. As an introduction to those sections, Table 8.2 provides some details about both the Azure RBAC and ACL authorization methods.

TABLE 8.2 Azure storage account authorization methods

Method | Scope | Requires an identity | Granularity level
Azure RBAC | Storage account, container | Yes | High
ACL | File, directory | Yes | Low

FIGURE 8.15 ADLS Access control shared access signature

Authorization placed on an Azure storage account using Azure RBAC is achieved through a role assignment at the storage account or container level. ACLs are implemented by assigning read, write, and execute permissions on a file or directory. As shown in Figure 8.16, if the identity of the person or service performing the operation is associated with an RBAC role assignment that allows file deletion, then the operation is permitted, regardless of the ACL permissions.

However, if the identity's RBAC role assignments do not authorize a delete but the identity does have the required ACL permissions, the file can still be deleted. The following sections describe these authorization methods in more detail.
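
An RBAC role assignment like the one just described can be created from the Azure CLI. The following sketch uses placeholder identifiers; Storage Blob Data Contributor is one of the built-in data-plane roles for Azure Storage.

 # Assign a data-plane role at the storage account scope
 az role assignment create \
   --assignee <objectIdOrUserPrincipalName> \
   --role "Storage Blob Data Contributor" \
   --scope "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Storage/storageAccounts/<storageAccountName>"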

Design for Data Privacy – Keeping Data Safe and Secure

Keeping data private depends primarily on two factors. The first factor is the implementation of a security model, like the layered approach illustrated in Figure 8.1. Implementing a security model has the greatest probability of preventing malicious behaviors and bad actors from having an impact on your data and your IT solution in general. The tools and features that can help prevent those scenarios include VNets, private endpoints, Azure Active Directory, RBAC, and Azure threat protection functionality. The second factor has to do with the classification and labeling of the data that needs to be kept private and requires an additional level of protection. A very helpful tool is linked to your SQL pools and other Azure SQL products: Data Discovery & Classification. As shown in Figure 8.10, when Data Discovery & Classification is selected, the database is analyzed for content that might benefit from some additional form of classification and sensitivity labeling.

FIGURE 8.10 Data Discovery & Classification

Notice that selecting the Data Discovery & Classification navigation menu item resulted in a scan of the dbo schema on the Azure Synapse Analytics dedicated SQL pool. A table named SUBJECTS was discovered to potentially contain PII data, and the tool recommended a sensitivity label and a classification of the data in the Information Type column. The SUBJECTS schema is located on GitHub in the Chapter08 directory. Recognize that you are the expert when it comes to the data hosted on your database. Columns may be named in a way that prevents their contents from being recognized as requiring additional protection. For this scenario, you can use the + Add Classification menu button to add both the classification and the sensitivity label. When you select the + Add Classification menu button, the pop‐up window in Figure 8.11 is rendered.

FIGURE 8.11 Data Discovery & Classification, Add Classification window

This feature exposes all the schemas, tables, and columns that exist on the database. After selecting the schema, table, and column, you can then set the classification and sensitivity label. In this case, the SUBJECTS table contains a column named BIRTHDATE, which is classified as Date Of Birth. The Sensitivity Label drop‐down is set to Confidential, which means that some kind of protective measure should be taken to control who can access the data. The other sensitivity levels are Public, General, and Highly Confidential. Reading through those examples, you can conclude that a sensitivity level of Public would likely not need any masking or encryption. As you progress through the levels, the amount of security also increases, with the maximum level of auditing, encryption, and access restriction applied to data at the Highly Confidential sensitivity level.

Data privacy also has to do with compliance regulations. There are many scenarios where data that is captured in one country cannot be stored outside the boundaries of that country. Data residency is therefore an important point to consider when you are configuring redundancies into your data analytics solution. On Azure it is very common for regions to be paired together and act as a disaster recovery option in case of catastrophic events. By default, for example, the West Europe region is paired with North Europe, and the two are physically located in different countries. If your industry regulations require that your data remain in the country where it is collected, then take the appropriate actions to keep it there. This capability is built into the platform for many Azure products, and when paired regions are located in different countries, you will be presented with the option to disable this replication.

Design and Create Tests for Data Pipelines – Design and Implement a Data Stream Processing Solution

The data pipeline used in most examples in this chapter has included a data producer, an Event Hubs endpoint, and an Azure Stream Analytics job. Those components create the data, ingest the data, and then process the data stream. The processed data then flows into datastores like ADLS, an Azure Synapse Analytics SQL pool, Azure Cosmos DB, and Power BI. All those different parts make up the data pipeline.

To design a test for the data pipeline, you must identify the different components, which were mentioned in the preceding paragraph. The next step is to analyze each component to determine exactly what the data input format is, what is used to process that data, and what the output should be. If you take the BCI that produces brain wave readings, for example, the input consists of analog vibrations originating from a brain. The BCI is connected to a computer via Bluetooth, and a program converts the analog reading to a numeric value, which is then formatted into a JSON document and streamed to an event hub. Therefore, changes to the code that captures, transforms, and streams the brain wave readings to the event hub must be tested through the entire pipeline. Data is not modified as it flows through the event hub, so the next step in the pipeline to analyze is the Azure Stream Analytics job query. If the format of the incoming data stream changes, a change to the query is likely required. For example, the addition of a new column to the event message would require a change to the query. The final step is to validate that the output of the stream processing has the expected result in all downstream datastores that receive the processed data.

In most mid‐ to large‐size projects, you would perform these tests in a testing environment that has an exact replica of the production data pipeline. As you learned in Chapter 6, you can use an Azure DevOps component named Azure Test Plans for more formal new‐feature and regression testing. You will learn more about Azure Test Plans in Chapter 9, “Monitoring Azure Data Storage and Processing,” and Chapter 10, “Troubleshoot Data Storage Processing.”

Monitor for Performance and Functional Regressions

The “Configure Checkpoints/Watermarking During Processing” section discussed some metrics that are useful for monitoring performance, such as Watermark Delay, Resource Utilization, and Events Count (refer to Figure 7.44). Many other metrics are available. The following are a few of the more interesting metrics:

  • Backlogged Input Events
  • CPU % Utilization
  • Data Conversion Errors
  • Early Input Events
  • Input Events
  • Late Input Events
  • Out of order Events
  • Output Events
  • Runtime errors
  • SU (Memory) % Utilization

Each of these metrics can potentially help you find out the cause of issues your Azure Stream Analytics job is having. A regression means that a bug in your code that was previously fixed has been reintroduced into the application. From an Azure Stream Analytics perspective, this would happen in the query, or, if your job contains a function, the code may have been corrupted in that area of the processing logic. To help determine when this happened, you can review the Activity Log blade, which provides a list of changes made over a given time frame. If you have your queries and function code in a source code repository, then you could also take a look in the file history to see who changed and merged the code, when, and how.

Exam Essentials – Design and Implement a Data Stream Processing Solution

Azure Event Hubs, Azure Stream Analytics, and Power BI. When you are designing your stream processing solution, one consideration is interoperability. Azure Event Hubs, Azure Stream Analytics, and Power BI are compatible with each other and can be used seamlessly to implement your data stream processing design. Other products are available on the Azure platform for streaming, such as HDInsight 3.6, Hadoop, Azure Databricks, Apache Storm, and WebJobs.

Windowed aggregates. Windowing is provided through temporal features like tumbling, hopping, sliding, session, and snapshot windows. Aggregate functions are methods that calculate averages, maximums, minimums, and medians. Windowed aggregates enable you to apply those aggregate functions to the events that fall within a temporal window.

Partitions. Partitioning is the grouping of similar data together in close physical proximity in order to gain more efficient storage and faster query execution. Both efficiency and speed are attained when data with matching partition keys is stored and retrieved from a single node. Queries that pull data from remote datastores, across different partition keys, or from more than a single node take longer to complete.

Time management. The tracking of the time when data is streaming into an ingestion point is crucial when it comes to recovering from a disruption. The timestamps linked to an event message, such as event time, arrival time, and the watermark, all help in this recovery. The event time identifies when the data message was created on the data‐producing IoT device. The arrival time is the enqueued time and reflects when the event message arrived at the ingestion endpoint, like Event Hubs.

Watermark. As shown in Figure 7.41, the watermark is a time that reflects the temporal time frame in which the data was processed by the stream processor. If the time window is 5 seconds, all event messages processed within that time window will receive the same watermark.

Create an Azure Key Vault Resource – Keeping Data Safe and Secure

The Networking tab enables you to configure a private endpoint or the binding to a virtual network (VNet). The provisioning and binding of Azure resources to a VNet and private endpoints are discussed and performed in a later section. After provisioning the key vault, you created a key, a secret, and a certificate. The keys in an Azure key vault refer to those used in asymmetric/public‐key cryptography. As shown in Figure 8.2, two types of keys are available with the Standard tier: Rivest‐Shamir‐Adleman (RSA) and elliptic curve (EC). These keys are used to encrypt and decrypt data (i.e., public‐key cryptography). Both key types are asymmetric, which means each has a public and a private key. The private key must be protected, because any client can use the public key to encrypt, but only the private key can decrypt. Each key type has multiple options for the strength of encryption; RSA, for example, supports key sizes of 2,048, 3,072, and 4,096 bits. The higher the number, the higher the level of security. Larger RSA key sizes come with caveats concerning speed and compatibility: the higher level of encryption requires more time to decrypt, and not all platforms support such high levels of encryption. Therefore, you need to consider which level of security is best for your use case and security compliance requirements.
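
Keys can be created from the portal, as in Exercise 8.1, or from the Azure CLI. The following sketch creates a 2,048-bit RSA key; the vault and key names are placeholders.

 # Create a 2,048-bit RSA key in an existing key vault
 az keyvault key create \
   --vault-name <keyVaultName> \
   --name brainjammerKey \
   --kty RSA \
   --size 2048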

A secret is something like a password, a connection string, or any text up to 25 KB that needs protection. Connection strings and passwords are commonly stored in configuration files or hard‐coded into application code. This is not secure, because anyone who has access to the code or to the server hosting the configuration file has access to the credentials and therefore to the resources they protect. Instead of that approach, applications can be coded to retrieve the secret from a key vault and then use it to make the required connections. In a production environment, the request for the secret is authenticated using a managed identity or a service principal.

Secrets stored in a key vault are encrypted when added and decrypted when retrieved. The certificate support in Azure Key Vault provides the management of x509 certificates. If you have ever worked on an Internet application that uses the HTTP protocol, you have likely used an x509 certificate. When this type of certificate is applied to that protocol, it secures communication between the entities engaged in the conversation. To employ the certificate, you must use HTTPS, which is commonly referred to as Transport Layer Security (TLS). In addition to securing communication over the Internet, certificates can also be used for authentication and for signing software. Consider the certificate you created in Exercise 8.1. When you click the certificate, you will notice it has a Certificate Version that resembles a GUID but without the dashes. Associated with that Certificate Version is a Certificate Identifier, which is a URL that gives you access to the certificate details. When you retrieve the certificate using the Azure CLI, you will see information like the base‐64–encoded certificate, the link to the key (kid), and the link to the secret (sid), similar to that shown in Figure 8.6.
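
A command like the following returns those details; the vault and certificate names are placeholders for the ones you created in Exercise 8.1.

 # Show the certificate details, including the cer, kid, and sid attributes
 az keyvault certificate show \
   --vault-name <keyVaultName> \
   --name <certificateName>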

FIGURE 8.6 Azure Key Vault x509 certificate details

The kid and sid links can be used to retrieve further details in a similar fashion. The first Azure CLI command retrieves the key identified by the kid attribute, and the second retrieves the secret identified by the sid attribute.
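
Those two commands are presumably of the following form, with the kid and sid values copied from the output of the certificate command above.

 # Retrieve the key referenced by the certificate's kid attribute
 az keyvault key show --id <kid>

 # Retrieve the secret referenced by the certificate's sid attribute
 az keyvault secret show --id <sid>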

The ability to list, get, and use keys, secrets, and certificates is controlled by the permissions you set up while creating the key vault. Put some thought into who and what gets which kind of permissions to these resources.

Summary – Design and Implement a Data Stream Processing Solution

This chapter focused on the design and development of a stream processing solution. You learned about data stream producers, which are commonly IoT devices that send event messages to an ingestion endpoint hosted in the cloud. You learned about stream processing products that read, transform, and write the data stream to a location for consumers to access. The location of the data storage depends on whether the data insights are required in real time or near real time. Both scenarios flow through the speed layer, where real‐time insights flow directly into a consumer like Power BI and near real‐time data streams flow into the serving layer. While the insights are in the serving layer, additional transformation can be performed by batch processing prior to consumption. In addition to the time demands on your streaming solution, other considerations, such as the data stream format, programming paradigm, programming language, and product interoperability, are all important when designing your data streaming solution.

Azure Stream Analytics has the capacity to process data streams in parallel. Performing work in parallel increases the speed at which the transformation is completed, which results in a faster gathering of business insights. This is achieved using partition keys, which provide the platform with the information needed to group data together and process it on a dedicated partition. The concept of time is very important in data stream solutions. Arrival time, event time, checkpoints, and watermarks all play a very important role when interruptions to the data stream occur. You learned that when an OS upgrade, node exception, or product upgrade happens, the platform uses these time management properties to get your stream back on track without losing any of the data. Replaying a data stream is possible only if you have created or stored the data required to replay it; the streaming platform provides no data archival feature to achieve this.

There are many metrics you can use to monitor the performance of your Azure Stream Analytics job. For example, the Resource Utilization, Events Count, and Watermark Delay metrics can help you determine why the stream results are not being processed as expected, or at all. Diagnostic settings, alerts, and activity logs can also help determine why your stream processing is not achieving the expected results. Once you determine the cause of the problem, you can increase capacity by scaling, configure the error policy, or change the query to fix a bug.

Handle Interruptions – Design and Implement a Data Stream Processing Solution

As previously mentioned, the platform will handle data stream interruptions when caused by a node failure, an OS upgrade, or product upgrades. However, how can you handle an exception that starts happening unexpectedly? Look back at Figure 7.51 and notice an option named + New Alert Rule, just above the JSON tab. When you select that link, a page will render that walks you through the configuration of actions to be taken when an operation completes with a Failed status. This is covered in more detail in Chapter 10.

Ingest and Transform Data

Chapter 5 covered data ingestion and transformation in detail. In this chapter you learned how a data stream is ingested and how it can then be transformed. Technologies like Azure Event Hubs, Azure IoT Hub, and Azure Data Lake Storage containers are all very common products used for data stream ingestion. Azure Stream Analytics and Azure Databricks receive the data stream and then process the data. The processed data stream is then passed along to a downstream consumer.

Transform Data Using Azure Stream Analytics

Exercise 7.5 is a very good example of transforming data that is ingested from a stream. You created a tumbling window with a size of 5 seconds, and all the brain wave readings received in that window were transformed. The query that transformed those brain wave readings calculated the median for each frequency. Once calculated, the median values were compared against estimated frequency values for the meditation scenario, and the result was then passed to Power BI for real‐time visualization.
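
For reference, the general shape of such a windowed query is shown here. This is a sketch only: the input, output, and column names are illustrative, and it uses AVG in place of the median calculation from the exercise.

 -- Aggregate readings per frequency over 5-second tumbling windows
 SELECT
     frequency,
     AVG(reading) AS averageReading,
     System.Timestamp() AS windowEnd
 INTO powerBIOutput
 FROM brainwavesInput TIMESTAMP BY readingTimestamp
 GROUP BY frequency, TumblingWindow(second, 5)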

Monitor Data Storage and Data Processing

The monitoring capabilities concerning stream processing are covered in Chapter 9. This section is added to help you navigate through the book while referring to the official online DP‐203 exam objectives/Exam Study Guide, currently accessible from https://learn.microsoft.com/en-us/certifications/exams/dp-203.

Monitor Stream Processing

You will find some initial information about monitoring stream processing in the “Monitor for Performance and Functional Regressions” section in this chapter. You will find a more complete discussion of this topic in Chapter 9, in the section with the same name, “Monitor for Performance and Functional Regressions.”

Implement a Data Auditing Strategy – Keeping Data Safe and Secure

Before you can design and implement a data security solution, you need to discover and classify your data. As you learned, Microsoft Purview has features for discovering, classifying, and proposing a sensitivity level. In Exercise 8.2 you provisioned a Microsoft Purview account, viewed the Collection Admins role assignments, and added a few collections. In Exercise 8.3 you will perform a scan that discovers data assets within the targeted collection and identifies whether they meet basic classification and sensitivity levels. Before you begin Exercise 8.3, it is important to call out three security actions you took in the previous two exercises that are required for Exercise 8.3 to work. Recall step 4 in Exercise 8.1, where you created an AKV secret named azureSynapseSQLPool that contains the password for your Azure Synapse Analytics dedicated SQL pool.

You will configure Microsoft Purview to use this Azure Key Vault secret to access and analyze the assets within that dedicated SQL pool. In Exercise 8.2, step 2, you validated that your account was in the Collection Admins group on the Role Assignments tab for the root collection. Additionally, in step 5 of Exercise 8.2 you granted your Microsoft Purview account identity Get and List permissions on the Azure Key Vault secret. As you will configure in Exercise 8.3, one more permission is required to make this work for your Azure Synapse Analytics dedicated SQL pool: the same Microsoft Purview account identity that you granted access to Azure Key Vault must be added to the Reader role via Access control (IAM) on your Azure Synapse Analytics workspace.

Note that each Azure product that you want to perform a scan on from Microsoft Purview will likely have its own set of permissions and role access requirements. You will need to find this out using online documentation on a product‐by‐product basis. Exercise 8.3 and previous exercises provide the instructions to perform a scan on an Azure Synapse Analytics dedicated SQL pool. Complete Exercise 8.3 to gain hands‐on experience with this product and feature.