Category: Design for Data Privacy

Implement Data Masking – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu link ➢ select your dedicated SQL pool ➢ start the SQL pool ➢ after the SQL pool is running, select the Dynamic Data Masking blade ➢ select the + Add Mask menu item ➢ make the configurations as shown in Figure 8.39 ➢ and then click the Add button.

FIGURE 8.39 Implement data masking and masking rule.

  2. Click the Save button ➢ log in to the dedicated SQL pool using Azure Data Studio, with the credentials created in Exercise 8.7 ➢ and then execute the following:
    SELECT ID, FIRSTNAME, LASTNAME, EMAIL, COUNTRY, CREATE_DATE FROM dbo.SUBJECTS
  3. Notice that the data in the EMAIL column has the mask configured in step 1 and illustrated in Figure 8.39. Stop the dedicated SQL pool.

It is possible to configure a custom mask instead of using the preconfigured masks. This requires that you provide the number of starting characters to show, followed by the padding string (something like xxxxx.), followed by the number of ending characters to display. Using a prefix and suffix value of three and the padding on the EMAIL column would result in benxxxxx.net, for example, which is a bit more useful than what is provided using the default.
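If you prefer to script the mask rather than configure it in the portal, the same custom mask can be applied with T‐SQL. The following is a minimal sketch, assuming the dbo.SUBJECTS table from the exercise; the partial() masking function takes the number of starting characters to show, the padding string, and the number of ending characters to show:

 -- custom mask: show the first 3 and last 3 characters, pad the middle
 ALTER TABLE dbo.SUBJECTS
 ALTER COLUMN EMAIL ADD MASKED WITH (FUNCTION = 'partial(3, "xxxxx.", 3)');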

Manage Identities, Keys, and Secrets Across Different Data Platform Technologies

Protecting credentials has historically been very challenging. Developers and data engineers need to access data, and that data is protected by an ID and password. As the complexity and size of your organization grows, it is easy to lose control over who holds which credentials, and changing a password can have a serious impact on a production environment. This loss of control is commonly referred to as credential leakage. An initial solution to credential leakage was to store connection details in a managed credential store, something like Azure Key Vault. However, access to the credential store also requires credentials, so you are back in the same place as before the implementation of the credential store. The ultimate solution is to use a combination of Azure Key Vault and managed identities. Instead of embedding a credential in application code to connect to a storage account or a database, you reference the Azure Key Vault endpoint. An Azure Key Vault secret endpoint resembles the following:

https://<accountName>.vault.azure.net/secrets/<secretName>/5db1a9b5…

The code that uses that endpoint relies on the DefaultAzureCredential class from the Azure Identity library. The library works with all popular programming languages: .NET, Python, Go, Java, etc. Passing a new DefaultAzureCredential instance to the SecretClient constructor results in the acquisition of the managed identity credential, which is a token. The client then stores all the attributes necessary to retrieve a secret from the Azure Key Vault endpoint. The following C# code performs this activity:

 var kvUri = "https://" + accountName + ".vault.azure.net";
 var client = new SecretClient(new Uri(kvUri), new DefaultAzureCredential());

You can use the client to get a secret by using the following C# syntax:

 var secret = await client.GetSecretAsync(secretName);

Now you know how a managed identity can help avoid credential leakage, but you might be wondering what exactly managed identities are and what you must know in order to implement them safely and securely. Table 8.3 compares the two types of managed identities: system‐assigned and user‐assigned.

TABLE 8.3 Managed identity types

Characteristic | System‐assigned managed identity | User‐assigned managed identity
Provisioning | Azure resources receive an identity by default, where supported. | Created manually.
Removal | The identity is deleted when the associated Azure resource is deleted. | Deleted manually.
Sharing | The identity cannot be shared among Azure resources. | Can be shared.

A system‐assigned managed identity is created during the provisioning of the Azure resource. For example, an Azure Synapse Analytics workspace and a Microsoft Purview account both have a system‐assigned identity by default. Azure products that are generally used to make connections to other Azure products or features have this managed identity created by default. In contrast, an Azure storage account receives data but does not commonly push data out to other systems, so it has no need for an identity of its own. This is why you see no managed identities for Azure storage accounts. A system‐assigned managed identity can be used only by the Azure resource to which it is bound, and it is deleted when that resource is deleted. A user‐assigned managed identity is a standalone Azure resource with its own lifecycle and can be shared across Azure products. Perform Exercise 8.9, where you create a user‐assigned managed identity.

Encrypt Data at Rest and in Motion – Keeping Data Safe and Secure

With very few exceptions, data stored on the Azure platform is encrypted at rest by default. This means that if a hard drive that contains your data is stolen or the server where your data is stored is unlawfully accessed, the data cannot be deciphered. Keep in mind that when you give a group, individual, or a service principal access to the data, the unencrypted data is available to them. From an Azure Synapse Analytics dedicated SQL pool perspective, the means for encrypting server‐side data files, data backups, and log files at rest rely on the Transparent Data Encryption (TDE) feature. As shown in Figure 8.34, TDE can be enabled on the Transparent Data Encryption blade in the Azure portal.

FIGURE 8.34 Encrypt data at rest, TDE, dedicated SQL pool.
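If you would rather script this setting, TDE can also be enabled with T‐SQL. The following is a minimal sketch, run while connected to the master database of the logical server; the pool name is hypothetical, so substitute your own:

 -- enable Transparent Data Encryption on the dedicated SQL pool
 ALTER DATABASE [csharpguitar] SET ENCRYPTION ON;
 -- verify: is_encrypted returns 1 once encryption is complete
 SELECT [name], [is_encrypted] FROM sys.databases;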

Encryption of data in motion is achieved by using the strongest possible version of TLS when accessing your data using the HTTP protocol. When HTTPS is utilized, a secure connection is made between the data consumer and the server. When the data is transferred through the network, it is encrypted using the strongest cipher that both the client and server support. If you want to make the network transmission even more secure, you might consider isolating the network communication channel using a virtual private network (VPN). This can be achieved by implementing the Azure VPN Gateway product or an Azure ExpressRoute. There are three common VPN models: point‐to‐point, point‐to‐site, and site‐to‐site. A point‐to‐point connection is restricted to a single connection between two machines, whereas a point‐to‐site connection means that a single machine has access to all the machines within a given site. A site in this context means an on‐premises network. A site‐to‐site model would allow all machines contained in two networks to connect with each other. An ExpressRoute connection is a dedicated connection from Azure to your on‐premises datacenter. It is configured with the help of an Internet service provider and is costly; however, it is the most secure means for transferring data, as the transmission does not traverse the Internet at all.

An additional concept, encryption‐in‐use, was explained earlier in the “Design Data Encryption for Data at Rest and in Transit” section. Recall that encryption‐in‐use is enabled using a product called Always Encrypted. As mentioned previously, TDE is used to encrypt data on the server‐side; however, there may be a need to encrypt data client‐side within an application. Always Encrypted is useful for encrypting data client‐side, directly within an application, before storing it on the Azure platform. Encrypting data client‐side ensures that individuals with highly privileged server‐side credentials cannot view or decrypt the data without being specifically granted permission. This gives you complete control over your data, while allowing a third party, like Microsoft, to perform the database administration responsibilities.
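For reference, a column protected by Always Encrypted is declared with an ENCRYPTED WITH clause. The following is a minimal sketch using Azure SQL Database syntax; the table, column, and key names are hypothetical, and a column master key plus the column encryption key CEK_Auto1 must already exist:

 CREATE TABLE dbo.PATIENTS (
     ID INT IDENTITY(1,1) NOT NULL,
     SSN CHAR(11) COLLATE Latin1_General_BIN2
         ENCRYPTED WITH (
             COLUMN_ENCRYPTION_KEY = CEK_Auto1,
             ENCRYPTION_TYPE = DETERMINISTIC,
             ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NOT NULL
 );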

Implement Row‐Level and Column‐Level Security

Row‐level security (RLS) is implemented at the information protection level of the layered security model (refer to Figure 8.1). Row‐level security restricts access to rows in a database table and is realized using the CREATE SECURITY POLICY command and predicates. Figure 8.35 illustrates row‐level security.

FIGURE 8.35 Row‐level security

Using the EXECUTE AS statement prior to the SELECT statement, as follows, results in the result set being filtered based on the user permissions:

 EXECUTE AS USER = 'BrainwaveBrainjammer1'

 SELECT * FROM BRAINWAVES



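For reference, the filtering illustrated in Figure 8.35 is typically built from an inline table‐valued function used as a filter predicate. The following is a minimal sketch, assuming a hypothetical USER_NAME column on the BRAINWAVES table that stores the name of the user permitted to see each row:

 CREATE SCHEMA Security;
 GO
 CREATE FUNCTION Security.fn_securitypredicate(@UserName AS NVARCHAR(128))
     RETURNS TABLE
 WITH SCHEMABINDING
 AS
     RETURN SELECT 1 AS fn_securitypredicate_result
     WHERE @UserName = USER_NAME();
 GO
 CREATE SECURITY POLICY BrainwavesFilter
     ADD FILTER PREDICATE Security.fn_securitypredicate(USER_NAME)
     ON dbo.BRAINWAVES
 WITH (STATE = ON);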
Column‐level security restricts access to specific columns on a database table, as illustrated in Figure 8.36.

To get hands‐on experience implementing column‐level security, complete Exercise 8.7.

FIGURE 8.36 Column‐level security
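Column‐level security is granted by listing the permitted columns explicitly. The following is a minimal sketch using the SUBJECTS table and the user from the row‐level security example; assuming the user has no broader grant, every column except EMAIL remains selectable:

 GRANT SELECT ON dbo.SUBJECTS
     (ID, FIRSTNAME, LASTNAME, COUNTRY, CREATE_DATE)
     TO BrainwaveBrainjammer1;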

Configure and Perform a Data Asset Scan Using Microsoft Purview – Keeping Data Safe and Secure

If you have not created the SUBJECTS table on your dedicated SQL pool, create the table using the SUBJECTS.sql file located in the Chapter08 directory on GitHub.

The first action you took after accessing the Azure portal was to add the Microsoft Purview account identity to the Reader role of the Azure Synapse Analytics workspace. Note that adding this role assignment at the workspace level results in the Reader permissions being granted to all resources that exist in the workspace. This is a good level of access for Microsoft Purview to perform proper governance and auditing activities. It is also possible to grant this level of access to a specific SQL or Spark pool using the same approach, via the Access Control (IAM) role assignments feature, while that pool is in focus. Next, you navigated to the Manage hub in the Azure Synapse Analytics workspace and bound the Microsoft Purview account to the workspace. This provided easy access to the Microsoft Purview Governance portal.

 Until you configure the new credential, as shown in Figure 8.21, you may receive a Failed to load serverless databases from Synapse workspace error message. Once you select the new credential (for example, sqladminuser), the error will go away. In this example, the username and password are the same for both the serverless and dedicated SQL pools.

Once in the Microsoft Purview Governance portal, you registered a collection named ASA‐csharpguitar under the R&D parent collection. After registering the collection that targeted your Azure Synapse Analytics workspace, you began configuring an asset scan. A credential that can access both the serverless and dedicated SQL pools is required at this point. Selecting + New from the Credential drop‐down list box provided the option to create one. You added a connection to the Azure Key Vault that holds the secret created in Exercise 8.1. The secret contains the password of your dedicated SQL pool, which, in this example, is the same as the password for the built‐in serverless SQL pool. Once the credential was configured and selected from the Credential drop‐down list box, you were able to select the dedicated SQL pool as the target data source of the scan.

When you selected to use the System Default scan rule set, you chose to use all the supported classification rules. While configuring the scan, you might have noticed the View Details link below that value. Clicking the View Details link currently renders a list of 208 classification rules grouped together with names such as Government, Financial, Base, Personal, Security, and Miscellaneous. You also have the option to create a custom rule that allows you to include your own additional set of items to scan for. The Security scan checks for passwords that match common patterns; the Government scan checks for values that match an ID; and the Personal scan checks for birth dates, email addresses, and phone numbers, for example. If you didn’t look at that, go back and check it out for the full set of attributes that are searched for when running an asset scan. The next window gives you the option to schedule the audit scan weekly or monthly. In a live scenario, where you have a lot of activity on your data sources, this would be a good idea. Lastly, you ran the scan, viewed the results shown in Figure 8.22, and then stopped the dedicated SQL pool. In Exercise 8.5 you will use those results to classify and curate the data existing in the SUBJECTS table.

Azure Synapse Analytics includes an Auditing feature for dedicated SQL pools. Complete Exercise 8.4 to configure and implement Auditing on an Azure Synapse Analytics dedicated SQL pool.

Design a Data Masking Strategy – Keeping Data Safe and Secure

A mask is an object that partially conceals what is behind it. From a data perspective, a mask conceals a particular piece of the data but not all of it. Consider, for example, email addresses, names, credit card numbers, and telephone numbers. Those classifications of data can be helpful if there is ever a need to validate a person’s identity. However, you would not want all of the data rendered in a query; instead, you might show only the last four digits of the credit card number, or the first letter of an email address and the top‐level domain, such as .com, .net, or .org, as in the following:

 bXXX@XXXX.com

There is a built‐in capability for this masking in the Azure portal for an Azure Synapse Analytics dedicated SQL pool. As shown in Figure 8.13, navigating to the Dynamic Data Masking blade renders the masking capabilities.

The feature will automatically scan your tables and find columns that may contain data that would benefit from masking. You apply the mask by selecting the Add Mask button, selecting the mask, and saving it. Then, when a user who is not in the excluded list, as shown in Figure 8.13, accesses the data, the mask is applied to the resulting dataset. Remember that the primary objective of masking is to conceal enough of the data in a column so that it can be used but not exploited. That partial data visibility is what distinguishes masking from encryption. When data in a column is encrypted, none of it is readable, whereas a mask can be configured to allow partial data recognition.

FIGURE 8.13 Dynamic Data Masking dedicated SQL pool
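To tie the pieces together, the following minimal sketch applies the built‐in email() function to the EMAIL column of the SUBJECTS table and then exempts one user from the mask; the user name is only an example:

 -- built-in email() mask: exposes the first letter and a constant .com suffix
 ALTER TABLE dbo.SUBJECTS
 ALTER COLUMN EMAIL ADD MASKED WITH (FUNCTION = 'email()');
 -- users granted UNMASK see the original, unmasked values
 GRANT UNMASK TO BrainwaveBrainjammer1;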

Design Access Control for Azure Data Lake Storage Gen2

There are four authorization methods for Azure storage accounts. The method you have been using in most scenarios up to now has been through access keys. An access key resembles the following:

 4jRwk0Ho7LB+si85ax…yuZP+AKrr1FbWbQ==

The access key is used in combination with the protocol and storage account name to build the connection string. Clients can then use this to access the storage account and the data within it. The connection string resembles the following:

 DefaultEndpointsProtocol=https;AccountName=<name>;AccountKey=<account-key>

On numerous occasions you have created linked services in the Azure Synapse Analytics workspace. When configuring a linked service for an Azure storage account, you might remember seeing the form shown in Figure 8.14, which requests the information required to build the Azure storage account connection string.

FIGURE 8.14 ADLS access control access keys

Notice that the options request the authentication type, which is set to access key, sometimes also referred to as an account key, to be used as part of a connection string, followed by the storage account name and the storage account key. Those values are enough for the client—in this case, an Azure Synapse Analytics linked service—to successfully make the connection to the storage account. HTTPS is used by default, which enforces encryption of data in transit; therefore, the endpoint protocol is not requested. Another authorization method similar to access keys is called shared access signature (SAS) authorization. This authorization method gives you a bit more control over which services, resources, and actions a client can access on the data stored in the account. Figure 8.15 shows the Shared Access Signature blade in the Azure portal for Azure storage accounts.

When you use either an access key or a SAS URL, any client with that token will get access to your storage account. There is no identity associated with either of those authorization methods; therefore, protecting that token is very important. This is why retrieving the account key from an Azure key vault is also offered as an option, as you saw in Figure 8.14. Storing the access key and/or the SAS URL in an Azure key vault removes the need to store the key within the realm of an Azure Synapse Analytics workspace. Although storing it in the workspace is safe, reducing the number of clients that possess your authorization keys is good design. Any entity that needs these keys can be granted access to the Azure key vault and retrieve the keys for making the connection to your storage account. The two remaining authorization methods are Azure RBAC and ACLs, which are covered in the following sections. As an introduction to those sections, Table 8.2 provides some details about both the Azure RBAC and ACL authorization methods.

TABLE 8.2 Azure storage account authorization methods

Method | Scope | Requires an identity | Granularity level
Azure RBAC | Storage account, container | Yes | High
ACL | File, directory | Yes | Low

FIGURE 8.15 ADLS Access control shared access signature

Authorization on an Azure storage account using Azure RBAC is achieved through a role assignment at the storage account or container level. ACLs are implemented by assigning read, write, or delete permissions on a file or directory. As shown in Figure 8.16, if the identity of the person or service performing the operation is associated with an RBAC group whose assignment allows file deletion, then that access is granted, regardless of the ACL permission.

However, if the group to which the identity belongs has an RBAC assignment that does not authorize a delete but the identity does have the ACL delete permission, then the file can still be deleted. The following sections describe these authorization methods in more detail.

Optimize Pipelines for Analytical or Transactional Purposes – Design and Implement a Data Stream Processing Solution

There are numerous approaches for optimizing data stream pipelines. Two such approaches are parallelization and compute resource management. You have learned that setting a partition key in the data message results in the splitting of messages across multiple nodes. By doing this, your data streams are processed in parallel. You can see how this looks in Figure 7.31 and the discussion around it. Managing the compute resources available for processing the data stream will have a big impact on the speed and availability of your data to downstream consumers. As shown in Figure 7.48, the increase in the watermark delay was caused by the CPU utilization on the processor nodes approaching 100 percent.

FIGURE 7.48 Azure Stream Analytics job metrics, CPU at 99 percent utilization

When the CPU is under such pressure, event messages get queued, which causes a delay in processing and providing output. Increasing the number of streaming units (SUs) allocated to the Azure Stream Analytics job will have a positive impact on the pipeline, making it more optimal for analytical and transactional purposes.
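To make the parallelization point concrete, the following is a minimal sketch of an Azure Stream Analytics query that processes each input partition independently. The input alias brainwaves and output alias ADLS are assumptions based on the earlier exercises, and the PARTITION BY clause is only required on older compatibility levels; level 1.2 and later align partitions automatically:

 SELECT Scenario, ReadingDate, ALPHA, BETA_H, BETA_L, GAMMA, THETA
 INTO ADLS
 FROM brainwaves PARTITION BY PartitionId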

Scale Resources

As mentioned in the previous section, adding more resources to the Azure Stream Analytics job will result in faster processing of the data stream. This assumes that you have noticed some processing delays and see that the current level of allocated resources is not sufficient. To increase the amount of compute power allocated to an Azure Stream Analytics job, select the Scale navigation link for the given job, as shown in Figure 7.49.

FIGURE 7.49 Azure Stream Analytics job scaling

Once the additional compute power is allocated and you start the job, instead of seeing 3, as shown in Figure 7.18, you will see the number configured in the Manual Scale Streaming Units setting. In this case, the number of SUs would be 60.

Design to Purge Data Based on Business Requirements – Keeping Data Safe and Secure

The primary difference between purging and deleting has to do with whether or not the data is gone for good. When you purge data, it means there is no way to recover it. If something called a soft delete is enabled, it means that the data can be recovered during a preconfigured timeframe. After that timeframe, the data will be purged. Soft‐deleted data continues to consume storage space in your database or on your datastore, like an Azure storage container. The storage consumption is freed only when the data is purged. Like all scenarios related to retention and data deletion discussed up to now, you need to first decide which data has a sensitivity level that must adhere to a retention policy. Once you determine which data must be deleted, you need to determine at what age the data should be removed. After identifying those two pieces of information, you might consider deleting the data from your database using the DELETE SQL command. The following command removes all the data from the SUBJECTS table where the CREATE_DATE value is more than 3 months before the current date:

 DELETE FROM SUBJECTS WHERE CREATE_DATE < DATEADD(month, -3, GETDATE())

When the amount of data is large, this kind of query can have a significant impact on performance. The impact can result in latency experienced by other data clients inserting, updating, or reading data from the same database. A much faster procedure for removing data is to place the data onto a partition based on the column that determines the lifecycle of the data, for example, using the CREATE_DATE column in the SUBJECTS table as the basis for a partition. When the data on that partition has breached the retention threshold, remove the partition, and the data is removed with it. Another approach is to select the data you want to keep, use the result to insert it into another table, and then switch the tables. This is achieved using CTAS, which was introduced in Chapter 2, “CREATE DATABASE dbName; GO,” along with the partitioning concept mentioned previously. The following SQL snippet is an example of how to purge data without using the DELETE SQL command:

 SELECT * INTO SUBJECTS_NEW FROM SUBJECTS 
 WHERE CREATE_DATE > DATEADD(month, -3, GETDATE())
 RENAME OBJECT SUBJECTS TO SUBJECTS_OLD
 RENAME OBJECT SUBJECTS_NEW TO SUBJECTS
 DROP TABLE SUBJECTS_OLD

The SELECT statement retrieves the data with a creation date that is not older than 3 months and places the data into a new table. The existing primary table named SUBJECTS is renamed by appending _OLD to the end. Then the newly populated table that was appended with _NEW is renamed to the primary table name of SUBJECTS. Lastly, the table containing data that is older than 3 months is dropped, resulting in its deletion.
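The partition‐based approach mentioned earlier avoids row‐by‐row deletion entirely. The following is a minimal sketch, assuming SUBJECTS is partitioned on CREATE_DATE and that an empty table named SUBJECTS_STAGE exists with an identical schema and partition scheme (both assumptions, not part of the exercise); switching a partition out is a metadata‐only operation:

 -- move the partition holding expired rows out of the primary table
 ALTER TABLE dbo.SUBJECTS SWITCH PARTITION 1 TO dbo.SUBJECTS_STAGE PARTITION 1;
 -- the expired data is then purged along with the staging data
 TRUNCATE TABLE dbo.SUBJECTS_STAGE;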

Replay an Archived Stream Data in Azure Stream Analytics – Design and Implement a Data Stream Processing Solution

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Stream Analytics job you created in Exercise 3.17 ➢ select Input from the navigation menu ➢ select the + Add Stream Input drop‐down menu ➢ select Blob Storage/ADLS Gen2 ➢ provide an input alias name (I used archive) ➢ select Connection String from the Authentication Mode drop‐down list box ➢ select the ADLS container where you stored the output in Exercise 7.9 ➢ and then enter the path to the file into the Path Pattern text box, perhaps like the following. Note that the filename is shortened for brevity and is not the actual filename of the output from Exercise 7.9.
    EMEA/brainjammer/in/2022/09/09/10/0_bf325906d_1.json
  2. The configuration should resemble Figure 7.46. Click the Save button.

FIGURE 7.46 Configuring an archive input alias

  3. Select Query from the navigation menu ➢ enter the following query into the query window ➢ click the Save Query button ➢ and then start the job. The query is available in the StreamAnalyticsQuery.txt file located in the Chapter07/Ch07Ex10 directory on GitHub.
    SELECT Scenario, ReadingDate, ALPHA, BETA_H, BETA_L, GAMMA, THETA
    INTO ADLS
    FROM archive

  4. Wait until the job has started ➢ navigate to the ADLS container and location you configured for the archive input alias in step 1 ➢ download the 0_bf325906d_1.json file from the Chapter07/Ch07Ex10 directory on GitHub ➢ upload that file to the ADLS container ➢ and then navigate to the location configured for your ADLS Output alias. A duplicate of the file is present, as shown in Figure 7.47.

FIGURE 7.47 Archive replay data result

  5. Stop the Azure Stream Analytics job.

Exercise 7.10 illustrates how a file generated by the output of a data stream can be replayed and stored on a targeted datastore. You may be wondering why something like this would be needed. Testing new features, populating testing environments with data, and taking a backup of the data are a few reasons for archiving and replaying data streams. One additional reason to replay data is that downstream systems did not receive the output. The reason for missing data might be an outage on the downstream system or a timing issue. If an outage happened downstream and data is missing from that store, the cause is easy to understand: the data is missing because the datastore was not available to receive it.

The timing issue that can result in missing data can be caused by the timestamp assigned to the data file. The timestamp for a data file stored in an ADLS container is in an attribute named BlobLastModifiedUtcTime. Consider the action you took in step 4 of Exercise 7.10, where you uploaded a new file into the location you configured as the input for your Azure Stream Analytics job. Nothing happened when you initially started the job, for example, at 12:33 PM. This is because files that already exist in that location are not processed; their timestamp is earlier than the start time of the Azure Stream Analytics job. When you start a job with a Job Output Start Time of Now (refer to Figure 7.18), only files that arrive after that time are processed. Once you added the file with a timestamp after the job had already been started, for example, at 12:40 PM, it was processed.

The same issue could exist for downstream systems, in that the data file could arrive at a datastore but the processor is experiencing an outage of some kind. When it starts back up and is online, it may be configured to start processing files only from the current time forward, which would mean the files received during the downtime will not be processed. In some cases, it might be better and safer to replay the data stream instead of reconfiguring and restarting all the downstream systems to process data received while having availability problems. Adding the file in step 4 is not the only approach for getting the timestamp updated to the current time. If you open the file, make a small change that will not have any impact on downstream processing, and then save it, the timestamp is updated, and it will be processed. Perform Exercise 7.10 again but instead of uploading the 0_bf325906d_1.json file, make a small change to the file already in that directory and notice that it is processed and passed into the output alias ADLS container.

Azure Role‐Based Access Control – Keeping Data Safe and Secure

RBAC has been discussed in Chapter 1, “Gaining the Azure Data Engineer Associate Certification,” Chapter 3, “Data Sources and Ingestion,” and Chapter 5. What RBAC is, its purpose, and how it achieves controlling access to Azure resources should be somewhat clear at this point. This section will go into a bit more detail in the context of an Azure storage account that includes an ADLS container. The implementation of RBAC assignments for a specific resource is performed on the Access Control (IAM) blade for the given resource, as you have seen previously in Figure 1.28 and Figure 5.47. Figure 8.17 provides an example of the permission related to an Azure storage account.

FIGURE 8.16 RBAC and ACL permission evaluation

FIGURE 8.17 RBAC Access Control (IAM) Azure storage account

Notice that the Role Assignments tab is selected and that there are numerous roles and users within them. The roles highlighted are Owner, Storage Blob Data Contributor, and Storage Blob Data Reader. An account that is part of the Owner role has full access to manage all resources within the subscription. In addition to that, members with an Owner role can assign roles to other members. Notice that the Owner group does not grant access to the data layer. The user in the Owner group will not be able to read the data in the storage account with just the Owner permission, although it can look this way in the portal if the Account Key Authentication option is enabled, since the owner does have access to those keys. The permissions section of the JSON role resembles the following, which supports the preceding statement of having full access to all resources:

 "permissions": [{ "actions": [ "*" …]

The Storage Blob Data Contributor role concentrates specifically on the storage accounts in the subscription. The following permissions are granted for this role. Notice that members of this role can delete, read, and write blobs to all the containers within the storage account. You might notice the account of type app named csharpguitar. That is the identity linked to the Azure Synapse Analytics workspace and is the way in which it has been granted permission to the ADLS container used for the workspace.

 "permissions": [{ "actions": [
  "Microsoft.Storage/storageAccounts/blobServices/containers/delete",
  "Microsoft.Storage/storageAccounts/blobServices/containers/read",
  "Microsoft.Storage/storageAccounts/blobServices/containers/write",
  "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action" …]

The ability to obtain a user delegation SAS key for accessing data within the container is also granted to the Storage Blob Data Contributor role. This is achieved via the generateUserDelegationKey permission. The permissions for the Storage Blob Data Reader role are as follows.

 "permissions": [{ "actions": [
  "Microsoft.Storage/storageAccounts/blobServices/containers/read",
  "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action" …]

Looking at the permission details, it is easy to see the difference between the Storage Blob Data Contributor role and Storage Blob Data Reader role. There are no permissions for delete or write operations on any container in the storage account for the reader. As illustrated in more granular detail by Figure 8.18, when the member attempts to delete a blob, the platform checks the RBAC role. If the role has permission to delete the blob, then the operation is allowed. If the role does not have the delete permission but the member does have the ACL permission, then the operation will be performed. Without an ACL delete permission, the operation is denied due to access restriction.

FIGURE 8.18 RBAC role and ACL permission evaluation

Remember that if a built‐in RBAC role does not meet your requirements, you can create a custom RBAC role to include or exclude permissions. As shown in Figure 8.17, a group named BRAINJAMMER has been added to a custom RBAC role named Storage Account Custom. The group is an Azure Active Directory security group that contains members who are also part of the Azure Active Directory tenant. Those members receive the permissions associated with the RBAC custom permission list. As you know, adding individual members to RBAC roles at the resource level is the most inefficient approach. Instead, you should create groups that effectively describe the role, add members to it, and then add the group to the RBAC role. You will create this group in Exercise 8.12.

Exam Essentials – Design and Implement a Data Stream Processing Solution

Azure Event Hubs, Azure Stream Analytics, and Power BI. When you are designing your stream processing solution, one consideration is interoperability. Azure Event Hubs, Azure Stream Analytics, and Power BI are compatible with each other and can be used seamlessly to implement your data stream processing design. Other products are available on the Azure platform for streaming, such as HDInsight 3.6, Hadoop, Azure Databricks, Apache Storm, and WebJobs.

Windowed aggregates. Windowing is provided through temporal features like tumbling, hopping, sliding, session, and snapshot windows. Aggregate functions are methods that can calculate averages, maximums, minimums, and medians. Windowed aggregates combine the two, applying an aggregate function to the events that fall within a temporal window.
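As an illustration, the following minimal sketch shows a windowed aggregate in the Stream Analytics query language. The input alias brainwaves, the ReadingDate timestamp, and the output alias ADLS are assumptions drawn from the earlier exercises; the query averages the ALPHA reading over 5‐second tumbling windows:

 SELECT Scenario, System.Timestamp() AS WindowEnd, AVG(ALPHA) AS AverageAlpha
 INTO ADLS
 FROM brainwaves TIMESTAMP BY ReadingDate
 GROUP BY Scenario, TumblingWindow(second, 5)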

Partitions. Partitioning is the grouping of similar data together in close physical proximity in order to gain more efficient storage and query execution speed. Both efficiency and speed of execution are attained when data with matching partition keys is stored and retrieved from a single node. Data queries that pull data from remote datastores, different partition keys, or data that is located on more than a single node take longer to complete.

Time management. The tracking of the time when data is streaming into an ingestion point is crucial when it comes to recovering from a disruption. The timestamps linked to an event message, such as event time, arrival time, and the watermark, all help in this recovery. The event time identifies when the data message was created on the data‐producing IoT device. The arrival time is the enqueued time and reflects when the event message arrived at the ingestion endpoint, like Event Hubs.

Watermark. As shown in Figure 7.41, the watermark is a time that reflects the temporal time frame in which the data was processed by the stream processor. If the time window is 5 seconds, all event messages processed within that time window will receive the same watermark.

Handle Interruptions – Design and Implement a Data Stream Processing Solution

As previously mentioned, the platform will handle data stream interruptions when caused by a node failure, an OS upgrade, or product upgrades. However, how can you handle an exception that starts happening unexpectedly? Look back at Figure 7.51 and notice an option named + New Alert Rule, just above the JSON tab. When you select that link, a page will render that walks you through the configuration of actions to be taken when an operation completes with a Failed status. This is covered in more detail in Chapter 10.

Ingest and Transform Data

Chapter 5 covered data ingestion and transformation in detail. In this chapter you learned how a data stream is ingested and how it can then be transformed. Technologies like Azure Event Hubs, Azure IoT Hub, and Azure Data Lake Storage containers are all very common products used for data stream ingestion. Azure Stream Analytics and Azure Databricks receive the data stream and then process the data. The processed data stream is then passed along to a downstream consumer.

Transform Data Using Azure Stream Analytics

Exercise 7.5 is a very good example of transforming data that is ingested from a stream. You created a tumbling window size of 5 seconds, and all the brain wave readings that were received in that time window were transformed. The query that transformed those brain wave readings calculated the median for each frequency. Once calculated, the median values were compared against estimated frequency values for the meditation scenario, and the result was then passed to Power BI for real‐time visualization.

Monitor Data Storage and Data Processing

The monitoring capabilities concerning stream processing are covered in Chapter 9. This section is added to help you navigate through the book while referring to the official online DP‐203 exam objectives/Exam Study Guide, currently accessible from https://learn.microsoft.com/en-us/certifications/exams/dp-203.

Monitor Stream Processing

You will find some initial information about monitoring stream processing in the “Monitor for Performance and Functional Regressions” section in this chapter. You will find a more complete discussion of this topic in Chapter 9, in the section with the same name, “Monitor for Performance and Functional Regressions.”