
Implement Data Masking – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu link ➢ select your dedicated SQL pool ➢ start the SQL pool ➢ after the SQL pool is running, select the Dynamic Data Masking blade ➢ select the + Add Mask menu item ➢ make the configurations as shown in Figure 8.39 ➢ and then click the Add button.

FIGURE 8.39 Implement data masking and masking rule.

  2. Click the Save button ➢ log in to the dedicated SQL pool using Azure Data Studio, with the credentials created in Exercise 8.7 ➢ and then execute the following:
    SELECT ID, FIRSTNAME, LASTNAME, EMAIL, COUNTRY, CREATE_DATE FROM dbo.SUBJECTS
  3. Notice that the data in the EMAIL column has the mask configured in step 1 and illustrated in Figure 8.39. Stop the dedicated SQL pool.

It is possible to configure a custom mask instead of using the preconfigured masks. This requires that you provide the number of starting characters to show, followed by the padding string (something like xxxxx.), followed by the number of ending characters to display. Using a prefix and suffix value of three and the padding on the EMAIL column would result in benxxxxx.net, for example, which is a bit more useful than what is provided using the default.
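A custom mask can also be applied with T‐SQL rather than through the portal. The following is a minimal sketch, assuming the EMAIL column does not already have a mask applied; it uses the partial() masking function with the prefix, suffix, and padding values described above:

 -- Show the first 3 and last 3 characters and pad the middle with 'xxxxx.'
 ALTER TABLE dbo.SUBJECTS
 ALTER COLUMN EMAIL ADD MASKED WITH (FUNCTION = 'partial(3,"xxxxx.",3)');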

Manage Identities, Keys, and Secrets Across Different Data Platform Technologies

Protecting credentials has historically been very challenging. Developers and data engineers need to access data, and that data is protected by an ID and password. As the complexity and size of your organization grows, it is easy to lose control over who has what credentials. Add that loss of control to the potential impact changing a password can have on a production environment. This scenario is commonly referred to as credential leakage. An initial solution to credential leakage was to store connection details in a managed credential store, something like Azure Key Vault. However, access to the credential store also requires credentials, so you are back in the same place as before the implementation of the credential store. The ultimate solution is to use a combination of Azure Key Vault and managed identities. Instead of using a credential to make a connection to a storage account or a database from application code, you instead reference the Azure Key Vault endpoint. An Azure Key Vault secret endpoint resembles the following:

https://<accountName>.vault.azure.net/secrets/<secretName>/5db1a9b5…

The code that uses that endpoint must use the DefaultAzureCredential class from the Azure Identity library. The library is available for all popular programming languages: .NET, Python, Go, Java, etc. Passing a new DefaultAzureCredential instance to the SecretClient class results in the acquisition of the managed identity credential, which is a token. The client then holds all the attributes necessary to retrieve a secret from the Azure Key Vault endpoint. The following C# code performs this activity:

 var kvUri = "https://" + accountName + ".vault.azure.net";
 var client = new SecretClient(new Uri(kvUri), new DefaultAzureCredential());

You can use the client to get a secret by using the following C# syntax:

 var secret = await client.GetSecretAsync(secretName);

Now you know how a managed identity can avoid credential leakage, but you might be wondering what exactly managed identities are and which aspects you must know in order to implement them safely and securely. Table 8.3 compares the two types of managed identities: system‐assigned and user‐assigned.

TABLE 8.3 Managed identity types

Characteristic | System‐assigned managed identity | User‐assigned managed identity
Provisioning | Azure resources receive an identity by default, where supported. | Created manually.
Removal | The identity is deleted when the associated Azure resource is deleted. | Deleted manually.
Sharing | The identity cannot be shared among Azure resources. | Can be shared.

A system‐assigned managed identity is created during the provisioning of the Azure resource. For example, an Azure Synapse Analytics workspace and a Microsoft Purview account both have a system‐assigned identity by default. Azure products that are generally used to make connections to other Azure products or features have this managed identity created by default. In contrast, an Azure storage account receives data but does not commonly push data out to other systems, so it has no need for an identity of its own. This is why you see no managed identities for Azure storage accounts. A system‐assigned managed identity can be used only by the Azure resource to which it is bound, and it is deleted when the Azure resource is deleted. A user‐assigned managed identity is a standalone Azure resource with its own lifecycle and can be shared across Azure products. Perform Exercise 8.9, where you create a user‐assigned managed identity.
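Regardless of type, a managed identity must still be granted access to the resources it will use. The following T‐SQL is a minimal sketch of granting an identity read access to a dedicated SQL pool; the identity name brainjammer-identity is a hypothetical placeholder.

 -- Create a database user for the managed identity (the name is a placeholder)
 CREATE USER [brainjammer-identity] FROM EXTERNAL PROVIDER;
 -- Grant the identity read access to the tables in the dedicated SQL pool
 EXEC sp_addrolemember 'db_datareader', 'brainjammer-identity';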

Audit an Azure Synapse Analytics Dedicated SQL Pool – Keeping Data Safe and Secure

  1. Log in to the Azure portal at https://portal.azure.com ➢ enter Log Analytics Workspaces in the Search box and select it from the drop‐down ➢ click the + Create button ➢ select a subscription ➢ select a resource group ➢ enter a name (I used brainjammer) ➢ select a region ➢ click the Review + Create button ➢ and then click Create.
  2. Navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ select the SQL Pools navigation menu item ➢ select your dedicated SQL pool ➢ start the dedicated SQL pool ➢ select the Auditing navigation menu item ➢ set the Enable Azure Auditing toggle switch to on ➢ select the Log Analytics check box ➢ select the subscription where you placed the Log Analytics workspace from step 1 ➢ select the workspace from the Log Analytics drop‐down list box ➢ and then click Save. The configuration should resemble Figure 8.23.

FIGURE 8.23 Dedicated SQL pool auditing configuration

  3. After some minutes, navigate to the Diagnostic Settings navigation menu item for the dedicated SQL pool ➢ notice the setting prefixed with SQLSecurityAuditEvents_ ➢ click the Edit Setting link associated with that diagnostic setting ➢ check the Sql Requests check box ➢ and then click Save. The configuration should resemble Figure 8.24.

FIGURE 8.24 Dedicated SQL pool Diagnostic setting configuration

  4. After some minutes, execute the following query on the dedicated SQL pool configured in the previous steps:
    SELECT * FROM [dbo].[SUBJECTS]
  5. After some minutes, select the View Audit Logs menu button (refer to Figure 8.23) ➢ click the Log Analytics menu button ➢ and then execute the default query. An example query is available on GitHub in the Chapter08/Ch08Ex04 directory. The output should resemble Figure 8.25.

FIGURE 8.25 View dedicated SQL pool audit logs in Log Analytics.

  6. Stop the dedicated SQL pool.

The first step taken in Exercise 8.4 resulted in the provisioning of an Azure Log Analytics workspace. This is the recommended location to store, analyze, and manage audit logs. Chapter 9, “Monitoring Azure Data Storage and Processing,” includes detailed coverage of Azure Log Analytics, including monitoring.

The provisioning was very straightforward, so at this point all you need to know is that Log Analytics is an optimal datastore for storing audit and diagnostic logs generated by Azure products. After selecting the Auditing navigation menu item bound to the dedicated SQL pool, you were presented with a blade that looked similar to Figure 8.23. There are a few points to call out on this blade.

The exercise instructed you to configure auditing for the dedicated SQL pool; however, there is another option that enables auditing for all SQL pools in the Azure Synapse Analytics workspace. This level of auditing was not needed to experience the desired learning from the exercise, but in a real enterprise implementation of a data solution that uses multiple SQL pools, you should consider enabling auditing for all SQL pools.

Diagnostic Settings is a very helpful feature for most of the Azure products. Like Log Analytics, a full description of this feature is coming in Chapter 9. When you enabled auditing, a default setting was created to capture SQL security audit events. You modified the setting to add SQL requests to the events that are captured and stored into Log Analytics. After completing the configuration and running a SQL statement on any table hosted in the dedicated SQL pool, you reviewed the logs. When you clicked the View Audit Logs menu button, as shown in Figure 8.23, you were navigated to the Log Analytics workspace.
The Log Analytics query editor enabled you to execute queries against the audit data being stored into it. Exercise 8.4 used the following query:
SQLSecurityAuditEvents
| where Category == 'SQLSecurityAuditEvents'
| project EventTime, Statement, Succeeded, AffectedRows, ResponseRows,
ServerPrincipalName, ClientIp, ApplicationName, AdditionalInformation,
DataSensitivityInformation, DurationMs, ClientTlsVersion,
IsServerLevelAudit, IsColumnPermission
| order by EventTime desc
| take 100

The query retrieves audit records from a Log Analytics table named SQLSecurityAuditEvents, where the value stored in the Category column is equal to SQLSecurityAuditEvents.
The SQLSecurityAuditEvents table includes many columns. By using project, the returned dataset is limited to the columns listed after that keyword. The dataset is ordered by EventTime and limited to 100 rows. Some interesting data included in the result is the SQL statement that was executed, who executed it, and the IP address of the machine that performed the query. There will be more discussion around these columns later, after you set some sensitivity labels and data classifications for data columns.

Manage Sensitive Information

The tool capable of discovering and auditing your data is Microsoft Purview, which can search your entire data estate hosted on Azure, on‐premises, and even on other cloud provider platforms. In Exercise 8.3, you used Microsoft Purview to scan a dedicated SQL pool to identify the assets within it. Figure 8.26 shows the result of that scan. The result is accessible on the Data Estate Insights hub within the Microsoft Purview Governance portal.

FIGURE 8.26 Scanning a dedicated SQL pool with Microsoft Purview

The scan found 49 total assets that have neither sensitivity labels nor classifications. This is because neither of those activities has been performed yet. Something like what you saw previously in Figure 8.11 needs to be carried out on the data. That example of data classification was illustrated using another tool for managing sensitive data, Data Discovery & Classification, which, when used in combination with Auditing, can target dedicated SQL pools running in an Azure Synapse Analytics workspace. Complete Exercise 8.5, where you will apply sensitivity labels and data classifications.

Optimize Pipelines for Analytical or Transactional Purposes – Design and Implement a Data Stream Processing Solution

There are numerous approaches for optimizing data stream pipelines. Two such approaches are parallelization and compute resource management. You have learned that setting a partition key in the data message results in the splitting of messages across multiple nodes. By doing this, your data streams are processed in parallel. You can see how this looks in Figure 7.31 and the discussion around it. Managing the compute resources available for processing the data stream will have a big impact on the speed and availability of your data to downstream consumers. As shown in Figure 7.48, the increase in the watermark delay was caused by the CPU utilization on the processor nodes reaching 100 percent.

FIGURE 7.48 Azure Stream Analytics job metrics, CPU at 99 percent utilization

When the CPU is under such pressure, event messages get queued, which causes a delay in processing and providing output. Increasing the number of SUs allocated to the Azure Stream Analytics job will have a positive impact on the pipeline, making it more optimal for analytical and transactional purposes.

Scale Resources

As mentioned in the previous section, adding more resources to the Azure Stream Analytics Job will result in faster processing of the data stream. This assumes that you have noticed some processing delays and see that the current level of allocated resources is not sufficient. To increase the amount of compute power allocated to an Azure Stream Analytics job, select the Scale navigation link for the given job, as shown in Figure 7.49.

FIGURE 7.49 Azure Stream Analytics job scaling

Once compute power is allocated, when you start the job, instead of seeing 3, as shown in Figure 7.18, you will see the number described in the Manual Scale Streaming Unit configuration. In this case, the number of SUs would be 60.

Design for Data Privacy – Keeping Data Safe and Secure

Keeping data private depends primarily on two factors. The first factor is the implementation of a security model, like the layered approach illustrated in Figure 8.1. Implementing a security model has the greatest probability of preventing malicious behaviors and bad actors from having an impact on your data and your IT solution in general. The tools and features that can help prevent those scenarios include VNets, private endpoints, Azure Active Directory, RBAC, and Azure threat protection functionality. The second factor has to do with the classification and labeling of the data that needs to be kept private and requires an additional level of protection. A very helpful tool is linked to your SQL pools and other Azure SQL products: Data Discovery & Classification. As shown in Figure 8.10, when Data Discovery & Classification is selected, the database is analyzed for content that might benefit from some additional form of classification and sensitivity labeling.

FIGURE 8.10 Data Discovery & Classification

Notice that selecting the Data Discovery & Classification navigation menu item resulted in a scan of the dbo schema on the Azure Synapse Analytics dedicated SQL pool. A table named SUBJECTS was discovered to potentially contain PII data, and the tool recommended a sensitivity label and the classification of the data shown in the Information Type column. The SUBJECTS schema is located on GitHub in the Chapter08 directory. Recognize that you are the expert when it comes to the data hosted on your database. Information stored in columns may be named in a way that results in it not being recognized as requiring additional protection. For this scenario, you can use the + Add Classification menu button to add both the classification and sensitivity label. When you select the + Add Classification menu button, the pop‐up window in Figure 8.11 is rendered.

FIGURE 8.11 Data Discovery & Classification, Add Classification window

This feature exposes all the schemas, tables, and columns that exist on the database. After selecting the schema, table, and column to classify, you can then set the classification and sensitivity label. In this case, the BIRTHDATE column of the SUBJECTS table is classified with the Date Of Birth information type. The Sensitivity Label drop‐down is set to Confidential, which means that some kind of protective measure should be taken to control who can access the data. The other sensitivity labels are Public, General, and Highly Confidential. Reading through those examples, you can conclude that a sensitivity level of Public would likely not need any masking or encryption. As you progress through the levels, the amount of security also increases, with the maximum level of auditing, encryption, and access restriction applied to datasets labeled Highly Confidential.
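Classifications can also be applied with T‐SQL on platforms that support the ADD SENSITIVITY CLASSIFICATION statement. The following is a minimal sketch, assuming the dbo.SUBJECTS table from earlier exercises, that applies the label and information type shown in Figure 8.11:

 -- Apply an information type and sensitivity label to the BIRTHDATE column
 ADD SENSITIVITY CLASSIFICATION TO dbo.SUBJECTS.BIRTHDATE
 WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Date Of Birth');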

Data privacy also has to do with compliance regulations. There are many scenarios where data that is captured in one country cannot be stored outside the boundaries of that country. Data residency is an important point to consider when you are configuring redundancies into your data analytics solution. It is very common on Azure that datacenters are paired together and act as a disaster recovery option in case of catastrophic events. By default, for example, the West Europe region’s paired region is North Europe, and the two are physically located in different countries. If your scenario requires that your data remain in the country where it is collected in order to comply with your industry regulations, then you must take the appropriate actions. This capability is built into the platform for many Azure products; when the paired regions are located in different countries, you are presented with the option to disable the cross‐region redundancy.

Design a Data Auditing Strategy – Keeping Data Safe and Secure

When you audit something, you analyze it and gather data about that data from the results. Many times the findings result in actions necessary to resolve inefficient or incorrect scenarios. What you analyze, what you are looking for, and what kind of analysis you need to perform are based on the requirements of the object being audited. The data management perspective (refer to Figure 5.41) includes disciplines such as quality, governance, and security. Each of those is a good example of a scenario to approach when creating a data auditing strategy. From a data quality perspective, you have been exposed to cleansing, deduplicating, and handling missing data using the MAR, MCAR, and MNAR principles, as discussed in Chapter 5, “Transform, Manage, and Prepare Data,” and Chapter 6, “Create and Manage Batch Processing and Pipelines.” This chapter focuses on the governance and security of data and how you can design and implement strategies around those topics.

Governance encompasses a wide range of scenarios. You can optimize the scope of governance by identifying what is important to you, your business, and your customers. The necessary aspects of data governance include maintaining an inventory of data storage, enforcing policies, and knowing who is accessing what data and how often. The Azure platform provides products to achieve these aspects of data governance (refer to Figure 1.10). Microsoft Purview, for example, is used to discover and catalog your cloud‐based and on‐premises data estate. Azure Policy gives administrators the ability to control who can provision cloud resources and how, with Azure Blueprints helping to enforce that compliance. Compliance is a significant area of focus concerning data privacy, especially when it comes to PII, its physical location, and how long it can be persisted before purging. In addition to those products, you can find auditing capabilities built into products like Azure Synapse Analytics and Azure Databricks. When auditing is enabled on those two products specifically, failed and successful login attempts, SQL queries, and stored procedures are logged by default. The audit logs are stored in a Log Analytics workspace for analysis, and alerts can be configured in Azure Monitor when certain behaviors or activities are recognized. Auditing, when enabled, is applied across the entire workspace and can be extended to log any action performed that affects the workspace.

Microsoft Azure provides policy guidelines for many compliance standards, including ISO, GDPR, PCI DSS, SOX, HIPAA, and FISMA, to name just a few of the most common standards. From a security perspective, you have seen the layered approach (refer to Figure 8.1) and have learned about some of the information protection layer features, with details about other layers coming later. Data sensitivity levels, RBAC, data encryption, Log Analytics, and Azure Monitor are all tools for protecting, securing, and monitoring your data hosted on the Azure platform.

Microsoft Purview

Microsoft Purview is especially useful for automatically discovering, classifying, and mapping your data estate. You can use it to catalog your data across multiple cloud providers and on‐premises datastores. You can also use it to discover, monitor, and enforce policies, and classify sensitive data types. Purview consists of four components: a data map, a data catalog, data estate insights, and data sharing. A data map graphically displays your datastores along with their relationships across your data estate. A data catalog provides the means for browsing your data assets, which is helpful with data discovery and classification. Data estate insights present an overview of all your data resources and are helpful for discovering where your data is and what kind of data you have. Finally, data sharing provides the necessary features to securely share your data internally and with business customers. To get some hands‐on experience with Microsoft Purview, complete Exercise 8.2, where you will provision a Microsoft Purview account.

Design to Purge Data Based on Business Requirements – Keeping Data Safe and Secure

The primary difference between purging and deleting has to do with whether or not the data is gone for good. When you purge data, it means there is no way to recover it. If something called a soft delete is enabled, it means that the data can be recovered during a preconfigured timeframe. After that timeframe, the data will be purged. Soft‐deleted data continues to consume storage space in your database or on your datastore, like an Azure storage container. The storage consumption is freed only when the data is purged. Like all scenarios related to retention and data deletion discussed up to now, you need to first decide which data has a sensitivity level that must adhere to a retention policy. Once you determine which data must be deleted, you need to determine at what age the data should be removed. After identifying those two pieces of information, you might consider deleting the data from your database using the DELETE SQL command. The following command removes all the data from the SUBJECTS table where the CREATE_DATE value is more than 3 months before the current date:

 DELETE FROM SUBJECTS WHERE CREATE_DATE < DATEADD(month, -3, GETDATE())

When the amount of data is large, this kind of query can have a significant impact on performance. The impact can result in latency experienced by other data clients inserting, updating, or reading data from the same database. A very fast procedure for removing data is to place the data onto a partition that is defined by the column that defines the lifecycle of the data, for example, using the CREATE_DATE in the SUBJECTS table as the basis for a partition. When the data on that partition has breached the retention threshold, remove the partition, and the data is removed. Another approach is to select the data you want to keep, use the result to insert it into another table, and then switch the tables. This is achieved using CTAS, which was introduced in Chapter 2, “CREATE DATABASE dbName; GO,” along with the partitioning concept mentioned previously. The following SQL snippet is an example of how to achieve the purging of data without using the DELETE SQL command:

 SELECT * INTO SUBJECTS_NEW FROM SUBJECTS
 WHERE CREATE_DATE > DATEADD(month, -3, GETDATE());
 RENAME OBJECT SUBJECTS TO SUBJECTS_OLD;
 RENAME OBJECT SUBJECTS_NEW TO SUBJECTS;
 DROP TABLE SUBJECTS_OLD;

The SELECT statement retrieves the data with a creation date that is not older than 3 months and places the data into a new table. The existing primary table named SUBJECTS is renamed by appending _OLD to the end. Then the newly populated table that was appended with _NEW is renamed to the primary table name of SUBJECTS. Lastly, the table containing data that is older than 3 months is dropped, resulting in its deletion.
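Because the preceding snippet uses SELECT ... INTO, a CTAS variation may be preferable on a dedicated SQL pool, since CTAS lets you state the table distribution explicitly. The following sketch assumes a ROUND_ROBIN distribution is acceptable for the SUBJECTS table:

 -- CTAS variation: copy the rows to keep, then swap the tables and drop the old one
 CREATE TABLE SUBJECTS_NEW
 WITH (DISTRIBUTION = ROUND_ROBIN)
 AS SELECT * FROM SUBJECTS
 WHERE CREATE_DATE > DATEADD(month, -3, GETDATE());
 RENAME OBJECT SUBJECTS TO SUBJECTS_OLD;
 RENAME OBJECT SUBJECTS_NEW TO SUBJECTS;
 DROP TABLE SUBJECTS_OLD;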

Replay an Archived Stream Data in Azure Stream Analytics – Design and Implement a Data Stream Processing Solution

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Stream Analytics job you created in Exercise 3.17 ➢ select Input from the navigation menu ➢ select the + Add Stream Input drop‐down menu ➢ select Blob Storage/ADLS Gen2 ➢ provide an input alias name (I used archive) ➢ select Connection String from the Authentication Mode drop‐down list box ➢ select the ADLS container where you stored the output in Exercise 7.9 ➢ and then enter the path to the file into the Path Pattern text box, perhaps like the following. Note that the filename is shortened for brevity and is not the actual filename of the output from Exercise 7.9.
    EMEA/brainjammer/in/2022/09/09/10/0_bf325906d_1.json
  2. The configuration should resemble Figure 7.46. Click the Save button.

FIGURE 7.46 Configuring an archive input alias

  3. Select Query from the navigation menu ➢ enter the following query into the query window ➢ click the Save Query button ➢ and then start the job. The query is available in the StreamAnalyticsQuery.txt file located in the Chapter07/Ch07Ex10 directory on GitHub.
    SELECT Scenario, ReadingDate, ALPHA, BETA_H, BETA_L, GAMMA, THETA
    INTO ADLS
    FROM archive

  4. Wait until the job has started ➢ navigate to the ADLS container and location you configured for the archive input alias in step 1 ➢ download the 0_bf325906d_1.json file from the Chapter07/Ch07Ex10 directory on GitHub ➢ upload that file to the ADLS container ➢ and then navigate to the location configured for your ADLS Output alias. A duplicate of the file is present, as shown in Figure 7.47.

FIGURE 7.47 Archive replay data result

  5. Stop the Azure Stream Analytics job.

Exercise 7.10 illustrates how a file generated by the output of a data stream can be replayed and stored on a targeted datastore. You may be wondering why something like this would be needed. Testing new features, populating testing environments with data, or taking a backup of the data are a few reasons for archiving and replaying data streams. One additional reason to replay data is that downstream systems did not receive the output. The reason for missing data might be an outage on the downstream system or a timing issue. If an outage happened downstream and data was missing from that store, the cause is easy to understand: the data is missing because the datastore was not available to receive it.

The timing issue that can result in missing data can be caused by the timestamp assigned to the data file. The timestamp for a data file stored in an ADLS container is in an attribute named BlobLastModifiedUtcTime. Consider the action you took in step 4 of Exercise 7.10, where you uploaded a new file into the location you configured as the input for your Azure Stream Analytics job. Nothing happened when you initially started the job, for example, at 12:33 PM. This is because files that already exist in that location are not processed; their timestamp is earlier than the start time of the Azure Stream Analytics job. When you start a job with a Job Output Start Time of Now (refer to Figure 7.18), only files that arrive after that time are processed. Once you added the file with a timestamp after the job had already been started, for example, 12:40 PM, it was processed.

The same issue could exist for downstream systems, in that the data file could arrive at a datastore but the processor is experiencing an outage of some kind. When it starts back up and is online, it may be configured to start processing files only from the current time forward, which would mean the files received during the downtime will not be processed. In some cases, it might be better and safer to replay the data stream instead of reconfiguring and restarting all the downstream systems to process data received while having availability problems. Adding the file in step 4 is not the only approach for getting the timestamp updated to the current time. If you open the file, make a small change that will not have any impact on downstream processing, and then save it, the timestamp is updated, and it will be processed. Perform Exercise 7.10 again but instead of uploading the 0_bf325906d_1.json file, make a small change to the file already in that directory and notice that it is processed and passed into the output alias ADLS container.

POSIX‐like Access Control Lists – Keeping Data Safe and Secure

There was an extensive discussion about ACLs in Chapter 1. Two concepts need to be summarized again: the types of ACLs and the permission levels. The two kinds of ACLs are Access and Default, and the three permission levels are Execute (X), Read (R), and Write (W). Access ACLs control access to both directories and files, as shown in Table 8.2. Default ACLs are templates set on a directory that determine the access ACLs for child items created below that directory. Files do not have default ACLs, and changing the ACL of a parent does not affect the default or access ACLs of existing child items. The feature to configure and manage ACLs is available on the Manage ACL blade for a given ADLS container, as shown in Figure 8.19.

FIGURE 8.19 The Manage ACL blade

Figure 8.19 shows two tabs; the Access Permissions tab is in focus. The Azure AD security group BRAINJAMMER has been granted Access‐Read and Access‐Execute permissions, which means the content within the ADLS container directory can be listed. Listing the contents of a directory requires both Read (R) and Execute (X) permissions. The individual who was added to the Storage Blob Data Reader RBAC group shown in Figure 8.17 has been granted Access‐Write and Access‐Execute. Write (W) and Execute (X) ACL permissions are required to create items in the targeted directory. The other tab, Default Permissions, is where you can configure permissions that will be applied to the child items created below the root directory, which is the one in focus, as shown under the Set and Manage Permissions For heading. Now that you have some insights into data security concepts, features, and products, continue to the next section, where you will implement some of what you just learned.

Implement Data Security

Until your security objectives and requirements are finalized, you should not proceed with any form of implementation. You need to know specifically what your goal is before you begin taking action. If your business model requires that your data comply with industry regulations, the requirements to meet those rules must be part of your design. Remember that Microsoft Purview and Azure Policy are helpful tools for guiding you through regulatory compliance. Those tools are also helpful for discovering your data sources and determining which sensitivity levels they require. Those sensitivity levels provide guidance on the level of security to apply to the dataset. After completing those steps, use something like the layered security diagram shown in Figure 8.1 as a guide for implementing security. The following sections cover the information protection, access management, and network security layers. The threat protection layer, which includes features like anomaly activity detection and malware detection, is best designed and implemented by security professionals. Note, however, that Microsoft Defender for Cloud is specifically designed for securing, detecting, alerting, and responding to bad actors and malicious activities, preventing them from doing harm.

Azure Policy – Keeping Data Safe and Secure

One of the first experiences with a policy in this book came in Exercise 4.5, where you implemented data archiving. You used a lifecycle management policy and applied it directly to an Azure storage account. The policy, which you can view in the Chapter04/Ch04Ex04 directory on GitHub, identifies when blobs should be moved to the Cool or Archive tier and when the data should be purged. Azure Policy, on the other hand, enables an Azure administrator to define and assign policies at the Subscription and/or Management Group level. Figure 8.9 illustrates how the Azure Policy Overview blade may look.

FIGURE 8.9 The Azure Policy Overview blade

Each policy shown in Figure 8.9 links to the policy definition constructed using a JSON document, as follows. You can also assign the policy to the Subscription and a Resource group.

The policy rule applies to all resources with an ARM resource type of Microsoft.Storage and evaluates the minimumTlsVersion attribute. When this policy is applied to the Subscription, TLS 1.2 is the default value when a storage account is provisioned, and the policy also allows TLS 1.1; however, TLS 1.0 is not allowed. The provisioning of an Azure storage account that uses TLS 1.0 would fail because of this policy.

Design a Data Retention Policy

Data discovery and classification are required for determining the amount of time you need to retain your data. Some regulations require maximum and/or minimum data retention timelines, depending on the kind of data, which means you need to know what your data is before designing the retention policy. The policy needs to include not only your live, production‐ready data but also backups, snapshots, and archived data. You can achieve this discovery and classification using what was covered in the previous few sections of this chapter. You might recall the mention of Exercise 4.5, where you created a lifecycle management policy that included data archiving logic. The concept and approach are the same here. The scenario in Exercise 4.5 covered the movement and deletion of a blob based on the number of days since it was last accessed. However, in this scenario the context is the removal of data based on a retention period. Consider the following policy example, which applies to the blobs and snapshots stored in Azure storage account containers. The policy states that the data is to be deleted after 90 days from the date of creation. This represents a retention period of 90 days.

When it comes to defining retention periods in relational databases like Azure SQL and Azure Synapse Analytics SQL pools, there are numerous creative approaches. One approach might be to add a column to each data row that contains a timestamp identifying its creation date. You can then run a stored procedure, executed from a cron scheduler or triggered by a pipeline activity, that removes rows older than the retention period. As this additional information can be substantial when your datasets are large, you need to apply it only to datasets that are required to adhere to such policies. It is common to perform backups of relational databases, so remember that the backups, snapshots, and restore points need to be managed and bound to retention periods, as required. In the context of Azure Databricks, you are most commonly working with files or delta tables. Files contain metadata that identifies the creation and last modified date, which can be referenced and used for managing their retention. Delta tables can also include a column for the data’s creation date; this column is used for the management of data retention. When working with delta tables, you can use the VACUUM command to remove data files that are no longer referenced by the table. Any data retention policy you require in that workspace should include the execution of the VACUUM command.
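As a minimal sketch of that approach, the following Spark SQL assumes a hypothetical delta table named subjects with a CREATE_DATE column; it removes rows older than a 90‐day retention period and then vacuums the data files that are no longer referenced:

 -- Delete rows that have exceeded the 90-day retention period (table name is hypothetical)
 DELETE FROM subjects WHERE CREATE_DATE < date_sub(current_date(), 90);
 -- Permanently remove unreferenced data files older than the 7-day default threshold
 VACUUM subjects RETAIN 168 HOURS;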

A final subject that should be covered in the context of data retention is something called time‐based retention policies for immutable blob data. “Immutable” means that once a blob is created, it can be read but not modified or deleted. This is often referred to as a write once, read many (WORM) tactic. The use case for such a feature is to store critical data and protect it against removal or modification. Numerous regulatory and compliance laws require documents to be stored for a given amount of time in their original state.

Exam Essentials – Design and Implement a Data Stream Processing Solution

Azure Event Hubs, Azure Stream Analytics, and Power BI. When you are designing your stream processing solution, one consideration is interoperability. Azure Event Hubs, Azure Stream Analytics, and Power BI are compatible with each other and can be used seamlessly to implement your data stream processing design. Other products are available on the Azure platform for streaming, such as HDInsight 3.6, Hadoop, Azure Databricks, Apache Storm, and WebJobs.

Windowed aggregates. Windowing is provided through temporal features like tumbling, hopping, sliding, session, and snapshot windows. Aggregate functions are methods that can calculate averages, maximums, minimums, and medians. Windowed aggregates enable you to apply those aggregate functions over temporal windows.
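The following Azure Stream Analytics query is a minimal sketch of a windowed aggregate; the input alias brainwaves and output alias ADLS are assumptions used for illustration:

 -- Average ALPHA reading per scenario over a 5-second tumbling window
 SELECT Scenario, AVG(ALPHA) AS AverageAlpha, COUNT(*) AS Readings
 INTO ADLS
 FROM brainwaves TIMESTAMP BY ReadingDate
 GROUP BY Scenario, TumblingWindow(second, 5)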

Partitions. Partitioning is the grouping of similar data together in close physical proximity in order to gain more efficient storage and query execution speed. Both efficiency and speed of execution are attained when data with matching partition keys is stored and retrieved from a single node. Data queries that pull data from remote datastores, different partition keys, or data that is located on more than a single node take longer to complete.
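As a sketch, assuming an input alias named brainwaves whose event messages carry a partition key, a Stream Analytics query can process each input partition in parallel, as follows:

 -- Count messages per partition; each input partition can be processed on a separate node
 SELECT PartitionId, COUNT(*) AS MessageCount
 INTO ADLS
 FROM brainwaves PARTITION BY PartitionId
 GROUP BY PartitionId, TumblingWindow(minute, 1)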

Time management. The tracking of the time when data is streaming into an ingestion point is crucial when it comes to recovering from a disruption. The timestamps linked to an event message, such as event time, arrival time, and the watermark, all help in this recovery. The event time identifies when the data message was created on the data‐producing IoT device. The arrival time is the enqueued time and reflects when the event message arrived at the ingestion endpoint, like Event Hubs.

Watermark. As shown in Figure 7.41, the watermark is a time that reflects the temporal time frame in which the data was processed by the stream processor. If the time window is 5 seconds, all event messages processed within that time window will receive the same watermark.