Category: Handle Interruptions

Design and Configure Exception Handling – Design and Implement a Data Stream Processing Solution

An exception is an unexpected event that occurs in the execution of your program code or query. Exceptions do happen and need to be managed because, in most cases, unmanaged exceptions stop the execution of your code at that point. This can leave your data and your program in an undesirable state, which can result in further data corruption. You need to understand which kinds of known exceptions can happen in your Azure Stream Analytics processing. Table 7.8 provides a list of Azure Stream Analytics exceptions.

TABLE 7.8 Azure Stream Analytics exceptions

Exception                                   Description
InputDeserializationError                   Unable to deserialize input data.
InputEventTimestampNotFound                 Unable to retrieve a timestamp for a resource.
InputEventTimestampByOverValueNotFound      Cannot get the value of the TIMESTAMP BY OVER COLUMN.
InputEventLateBeyondThreshold               An event was sent later than the configured tolerance.
InputEventEarlyBeyondThreshold              An event has an arrival time earlier than the application timestamp.
AzureFunctionMessageSizeExceeded            The output message exceeds the size limit for Azure Functions.
EventHubOutputRecordExceedsSizeLimit        The record exceeds the message size limit for Event Hubs.
CosmosDBOutputInvalidId                     The type or value of a column is invalid.
CosmosDBOutputInvalidIdCharacter            There is an invalid character in the document ID for the record.
CosmosDBOutputMissingId                     The record does not contain an ID to use as a primary key.
CosmosDBOutputMissingIdColumn               The record is missing a document ID property.
CosmosDBOutputMissingPartitionKey           The record is missing the partition key property.
CosmosDBOutputSingleRecordTooLarge          A single record is too large to write.
SQLDatabaseOutputDataError                  Cannot write to Azure SQL Database due to data issues.
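
Several of these exceptions trace back to how the application timestamp is declared in the query. The following is a minimal, hypothetical sketch of an Azure Stream Analytics query that uses TIMESTAMP BY; the input and output aliases and the AlphaReading column are assumptions, and only the ReadingDate column comes from the earlier exercises. If ReadingDate cannot be read from an incoming event, an InputEventTimestampNotFound exception is the likely result, and events whose ReadingDate falls outside the configured tolerances surface as InputEventLateBeyondThreshold or InputEventEarlyBeyondThreshold.

 SELECT
     System.Timestamp() AS WindowEnd,        -- end of each 30-second window
     AVG(AlphaReading) AS AverageAlpha       -- AlphaReading is an assumed measurement column
 INTO
     [streamOutput]                          -- assumed output alias
 FROM
     [streamInput] TIMESTAMP BY ReadingDate  -- use the event's own timestamp, not its arrival time
 GROUP BY
     TumblingWindow(second, 30)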

When you begin processing your data stream and nothing is happening, the reason is likely one of the exceptions listed in Table 7.8. There are numerous locations where you can view exceptions thrown during stream processing. The first is diagnostic logging, which is configurable from the Diagnostic Settings blade in the Azure portal. There is more on this feature in Chapter 9; for now, look at Figure 7.50 to get an idea of what it looks like.

FIGURE 7.50 Azure Stream Analytics Diagnostics Setting

Notice the different log categories and that capturing performance metrics is also possible. The locations where you can store the logs are listed in the Destination Details column. The options are a Log Analytics workspace, a storage account, an event hub, or a partner solution, which is an endpoint provided by a Microsoft Azure partner. The other location to view exceptions is on the Activity Log blade, as shown in Figure 7.51.

FIGURE 7.51 Azure Stream Analytics Activity log warnings and errors

Line 21 of the Send Events operation warning message shows a CosmosDBOutputInvalidIdCharacter error. After some analysis, it turned out that Azure Cosmos DB would not accept the timestamp, in the format it was being sent, as the document ID. The Azure Stream Analytics query had to be changed to get the data into a format that the output could handle. If you look back at the query for Exercise 7.9, you will notice some special handling of the ReadingDate column; that query pattern is the solution to this exception.
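
To illustrate, the following is a minimal sketch of the kind of query change that resolves a CosmosDBOutputInvalidIdCharacter error; the input and output aliases and the DeviceId and AlphaReading columns are assumptions, and only ReadingDate comes from the exercise. The idea is to cast the timestamp to a string, strip out the characters that caused the problem, and expose the result as a column that the Azure Cosmos DB output is configured to use as its Document ID.

 SELECT
     CONCAT(DeviceId, '-',
            REPLACE(REPLACE(REPLACE(CAST(ReadingDate AS NVARCHAR(MAX)),
                    '/', ''), ':', ''), '.', '')) AS documentId,  -- timestamp reformatted into a safe ID
     ReadingDate,
     AlphaReading
 INTO
     [cosmosOutput]                          -- assumed Cosmos DB output alias
 FROM
     [streamInput] TIMESTAMP BY ReadingDate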

The last topic necessary for designing and configuring exception handling is what to do when an exception actually occurs. Azure Stream Analytics offers two options: Retry and Drop. Look again at Figure 7.51; there is a navigation menu option named Error Policy, and on that blade you choose between the two. Retry is the default and means the data stream processor will retry writing the message to the output, indefinitely, until it succeeds. This setting will ultimately block the output of that message and of any other message streaming to that point. If you experience a scenario where the data stream has stopped flowing, there is a blockage, and you need to find and resolve the exception before processing resumes. The other option is to drop the message and not process it. If you choose this option, realize that the message cannot be recovered or replayed; it will be purged and lost.

Summary – Design and Implement a Data Stream Processing Solution

This chapter focused on the design and development of a stream processing solution. You learned about data stream producers, which are commonly IoT devices that send event messages to an ingestion endpoint hosted in the cloud. You learned about stream processing products that read, transform, and write the data stream to a location for consumers to access. The location of the data storage depends on whether the data insights are required in real time or near real time. Both scenarios flow through the speed layer, where real‐time insights flow directly into a consumer like Power BI and near real‐time data streams flow into the serving layer. While the insights are in the serving layer, additional transformation can be performed by batch processing prior to consumption. In addition to the time demands on your streaming solution, other considerations, such as the data stream format, programming paradigm, programming language, and product interoperability, are all important when designing your data streaming solution.

Azure Stream Analytics has the capacity to process data streams in parallel. Performing work in parallel increases the speed at which the transformation is completed, and the result is a faster gathering of business insights. This is achieved using partition keys, which provide the platform with the information it needs to group related data together and process each group on a dedicated partition (a brief sketch follows this paragraph). The concept of time is very important in data stream solutions. Arrival time, event time, checkpoints, and watermarks all play a very important role when interruptions to the data stream occur. You learned that when an OS upgrade, node exception, or product upgrade happens, the platform uses these time management properties to get your stream back on track without losing any of the data. Replaying a data stream is possible only if you have created or stored the data required to replay it; the streaming platform itself provides no data archival feature to do this for you.
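
To make the partition key concept concrete, the following is a brief, hedged sketch of an Azure Stream Analytics query that reads from a partitioned input; the aliases and the PartitionId column are assumptions and must match how your input is actually partitioned. Because each partition is processed independently, the work is spread across the job's compute in parallel.

 SELECT
     PartitionId,
     COUNT(*) AS EventCount
 INTO
     [streamOutput]                              -- assumed output alias
 FROM
     [streamInput] PARTITION BY PartitionId      -- process each input partition independently
 GROUP BY
     PartitionId, TumblingWindow(minute, 1)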

There are many metrics you can use to monitor the performance of your Azure Stream Analytics job. For example, the Resource Utilization, Event Counts, and Watermark Delay metrics can help you determine why the stream results are not being processed as expected, or at all. Diagnostic settings, alerts, and the Activity log can also help you determine why your stream processing is not achieving the expected results. Once you determine the cause of the problem, you can increase capacity by scaling, configure the error policy, or change the query to fix a bug.

Design Data Encryption for Data at Rest and in Transit – Keeping Data Safe and Secure

Encryption is a very scientific, mathematics‐heavy concept. The internals are outside the scope of this book, but in simple terms, when data is encrypted, it looks like a bunch of scrambled letters and numbers that are of no value on their own. The following is an example of the word csharpguitar encrypted using the key created in Exercise 8.1:

 p0syrFCPufrCr+9dN7krpFe7wuwIeVwQNFtySX0qaX3UcqzlRifuNdnaxiTu1XgZoKwKmeu6LTfrH
 rGQHq4lDClbo/KoqjgSm+0d0Ap/y2HR34TFgoxTeN0KVCoVKAtu35jZ52xeZgj1eYZ9dww2n6psGG
 nMRlux/z3ZDvm4qlvrv55eAoSawbCGWOql3mhdfHFZZxLBCN2eZzvBpaTSNaramME54ELMr6ScIJI
 ITq6XJYTFH8BGvPaqhfTTO4MbizwenpijIFZvdn3bzQGbnPElht0j+EQ7aLvWOOyzJjlKcR8MN4jO
 oYNULCZTBi/BVvlhYpUsKxxN+YW27POMAw==

There is no realistic method for anyone or any computer to revert that set of characters back into the original word. That is the power of encryption implemented using public and private keys. Only by having access to the private key can one make sense of that character sequence. The only means of decryption is to use the az keyvault key decrypt Azure CLI command or a REST API call that has access to the private key. This leads well into two very important security concepts that pertain greatly to the storage of data on Azure: encryption‐at‐rest and encryption‐in‐transit.

Data stored in an Azure storage account is encrypted by default. No action is required by you to encrypt your data that is stored in a container. It is encrypted even if it is not used, which is where the name encryption‐at‐rest comes from. The data is simply stored, idle, doing nothing, but is secured by encryption. This kind of protection is intended to defend against a bad actor getting access to the physical hard drive that contains data. When the bad actor attempts to access the data, they will see only the scrambled characters. If they do not have the associated keys, which should be accessible only from a key vault, there is no chance of decrypting the data. Therefore, your data is safe, even when it is resting and not being used. Back in Exercise 3.1, where you created an Azure storage account and an ADLS container, there was a tab named Encryption. That tab includes two radio buttons, as shown in Figure 8.12. The default was to use a Microsoft‐Managed Key (MMK) for the encryption‐at‐rest operation; the other option is named Customer‐Managed Key (CMK). If you select CMK, you can reference a key you have created in an Azure Key Vault to use as the default encryption key.

These storage account encryption options exist for customers who need the maximum amount of security due to compliance or regulations. Also notice the Enable Infrastructure Encryption check box. When this box is selected, the data stored in the account is doubly encrypted. Double encryption is available for both data at rest and data in transit. Instead of being encrypted with just one key, the data is encrypted with two separate keys, the second key being applied at the infrastructure level. This covers the scenario where one of the encryption keys or algorithms is compromised: with Enable Infrastructure Encryption selected, your data is still protected with 256‐bit AES encryption by the other key, so the data remains safe. Another common encryption technology on the Azure platform that is targeted towards databases is Transparent Data Encryption (TDE). TDE protects data at rest in Azure SQL databases, Azure SQL data warehouses, and Azure Synapse Analytics SQL pools. The entire database, its data files, and its backups are encrypted using an AES encryption algorithm by default, but as with Azure Storage, the encryption key can be managed by the customer or by Microsoft and stored in an Azure key vault.
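
As a brief, hedged illustration, the following T-SQL shows how TDE can be checked and enabled on a database; the database name is assumed, and on Azure SQL Database TDE is already enabled by default, so the ALTER statement is only needed if it was previously turned off.

 -- Report the encryption state of each database (3 = encrypted)
 SELECT DB_NAME(database_id) AS database_name,
        encryption_state
 FROM sys.dm_database_encryption_keys;

 -- Enable TDE on an assumed database named csharpguitar
 ALTER DATABASE [csharpguitar] SET ENCRYPTION ON;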

FIGURE 8.12 Azure storage account encryption type

The opposite of resting is active, which in this context means data being retrieved from its storage location by some remote consumer. As the data moves from the location where it is stored to the consumer, it can be vulnerable to traffic capture. This is where the concept of encryption‐in‐transit comes into scope. You encrypt data in transit by using TLS, with version 1.2 currently the minimum version you should allow. As previously mentioned, TLS is achieved by using an X.509 certificate in combination with the HTTP protocol. Consider common Azure product endpoints such as <account>.blob.core.windows.net for Azure Storage, <server>.database.windows.net for Azure SQL Database, and <account>.documents.azure.com for Azure Cosmos DB.

In all cases, the transfer of data happens over HTTPS, meaning the data is encrypted while in transit between the service that hosts it and the consumer who is authorized to retrieve it. When working with Linux, the protocol to use is Secure Shell (SSH), which ensures the encryption of data in transit; HTTPS is also a supported protocol. An additional encryption concept should be mentioned here: encryption‐in‐use. This concept is implemented using a feature named Always Encrypted and is focused on the protection of sensitive data stored in specific columns of a database. Identification numbers, credit card numbers, PII, and need‐to‐know data are examples of data that typically reside in such columns. This kind of encryption, which is handled client‐side, is intended to prevent DBAs or administrators from viewing sensitive information when there is no business justification to do so.
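
To make encryption‐in‐use a bit more concrete, the following is a minimal sketch of an Always Encrypted column definition in T-SQL; the table, column, and column encryption key names are all assumptions, and the column master key and column encryption key must already exist (they are typically created through SQL Server Management Studio or PowerShell) before a statement like this will run.

 CREATE TABLE dbo.Subjects
 (
     SubjectId  INT IDENTITY(1,1) PRIMARY KEY,
     NationalId CHAR(11) COLLATE Latin1_General_BIN2   -- deterministic encryption requires a BIN2 collation
                ENCRYPTED WITH
                (
                    COLUMN_ENCRYPTION_KEY = CEK_Subjects,        -- assumed, pre-created key
                    ENCRYPTION_TYPE = DETERMINISTIC,             -- allows equality comparisons on the column
                    ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
                ) NOT NULL
 );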

The final topic to discuss in this section is the WITH ENCRYPTION SQL clause. In Exercise 2.3 you created a view using a statement similar to the following:

 CREATE VIEW [views].[PowThetaClassicalMusic]

In Exercise 5.1 you created a stored procedure using the following command:

 CREATE PROCEDURE brainwaves.uspCreateAndPopulateFactReading

Each of those statements can be extended by placing the WITH ENCRYPTION clause after the object name in the CREATE command, like the following:

 CREATE VIEW [views].[PowThetaClassicalMusic] WITH ENCRYPTION
 CREATE PROCEDURE brainwaves.uspCreateAndPopulateFactReading WITH ENCRYPTION

If you then attempt to view the text of the stored procedure, you will not see it; instead, you will see a message explaining that it is encrypted. The WITH ENCRYPTION clause provides a relatively low level of security, and a technically savvy individual can decrypt the definition without much effort; however, it is quick and simple to implement, making it worthy of consideration.
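
The following is a small sketch of that behavior; the view name is taken from the snippet above, while the column list and source table are assumptions for illustration. Once the object is created WITH ENCRYPTION, metadata functions no longer return its definition.

 CREATE VIEW [views].[PowThetaClassicalMusic]
 WITH ENCRYPTION
 AS
 SELECT SCENARIO, ELECTRODE, FREQUENCY, [VALUE]   -- assumed column list
 FROM dbo.READING;                                -- assumed source table
 GO

 -- The definition is now hidden from metadata queries
 SELECT OBJECT_DEFINITION(OBJECT_ID('views.PowThetaClassicalMusic'));  -- returns NULL
 EXEC sp_helptext 'views.PowThetaClassicalMusic';                      -- reports that the text is encrypted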