Design and Create Tests for Data Pipelines – Design and Implement a Data Stream Processing Solution

The data pipeline used in most examples in this chapter includes a data producer, an Event Hubs endpoint, and an Azure Stream Analytics job. Those components create, ingest, and process the data stream. The processed data then flows into datastores such as ADLS, an Azure Synapse Analytics SQL pool, Azure Cosmos DB, and Power BI. Together, these parts make up the data pipeline.
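
To make the producer stage concrete, the following is a minimal sketch of that component in Python, assuming the azure-eventhub SDK. The connection details and the shape of the reading are hypothetical placeholders, not the chapter's exact implementation.

```python
import json

from azure.eventhub import EventHubProducerClient, EventData

# Hypothetical connection details; substitute your own values.
CONNECTION_STR = "<event-hubs-namespace-connection-string>"
EVENT_HUB_NAME = "<event-hub-name>"

def send_reading(reading: dict) -> None:
    """Serialize one reading to JSON and stream it to the event hub."""
    producer = EventHubProducerClient.from_connection_string(
        conn_str=CONNECTION_STR, eventhub_name=EVENT_HUB_NAME
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(reading)))
        producer.send_batch(batch)

# Illustrative brain wave reading; the field names are assumptions,
# not the exact schema used in the chapter's examples.
send_reading({"Scenario": "Meditation", "ALPHA": 5.801, "BETA_H": 1.061})
```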

To design a test for the data pipeline, you must identify the different components, which were mentioned in the preceding paragraph. The next step is to analyze each component to determine exactly what the data input format is, what is used to process that data, and what the output should be. If you take the BCI that produces brain wave readings, for example, the input consists of analog vibrations originating from a brain. The BCI is connected to a computer via Bluetooth, and a program converts the analog reading to a numeric value, which is then formatted into a JSON document and streamed to an event hub. Therefore, changes to the code that captures, transforms, and streams the brain wave readings to the event hub must be tested through the entire pipeline. Data is not modified as it flows through the event hub, so the next step in the pipeline to analyze is the Azure Stream Analytics job query. If the format of the incoming data stream changes, a change to the query is likely required. For example, the addition of a new column to the data event message would require a change to the query. The final step is to validate that the output of the stream processing has the expected result in all downstream datastores that receive the processed data content.
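
A unit test for the first component might, for example, pin down the JSON shape the downstream query expects, so that adding or renaming a column fails the test and forces a review of the query. Here is a minimal sketch, assuming a hypothetical serialize_reading function in the producer code and a pytest-style test runner:

```python
import json

# Hypothetical producer function under test: converts a numeric reading
# into the JSON document that is streamed to the event hub.
def serialize_reading(scenario: str, alpha: float) -> str:
    return json.dumps({"Scenario": scenario, "ALPHA": alpha})

def test_reading_matches_expected_schema():
    doc = json.loads(serialize_reading("Meditation", 5.801))
    # If a new column is added to the event message, this assertion fails,
    # signaling that the Stream Analytics query must be updated and the
    # change retested through the entire pipeline.
    assert set(doc) == {"Scenario", "ALPHA"}
    assert isinstance(doc["ALPHA"], float)
```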

In most mid‐ to large‐size projects, you would perform these tests in a testing environment that is an exact replica of the production data pipeline. As you learned in Chapter 6, you can use an Azure DevOps component named Azure Test Plans for more formal new‐feature and regression testing. You will learn more about Azure Test Plans in Chapter 9, “Monitoring Azure Data Storage and Processing,” and Chapter 10, “Troubleshoot Data Storage Processing.”

Monitor for Performance and Functional Regressions

The “Configure Checkpoints/Watermarking During Processing” section discussed some metrics that are useful for monitoring performance, such as Watermark Delay, Resource Utilization, and Events Count (refer to Figure 7.44). Many other metrics are available. The following are a few of the more interesting ones; a sketch showing how to retrieve them programmatically appears after the list:

  • Backlogged Input Events
  • CPU % Utilization
  • Data Conversion Errors
  • Early Input Events
  • Input Events
  • Late Input Events
  • Out of Order Events
  • Output Events
  • Runtime Errors
  • SU (Memory) % Utilization
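
Outside the portal, these metrics can also be retrieved programmatically. The following is a minimal sketch using the azure-monitor-query SDK with a hypothetical resource ID. Note that Azure Monitor exposes metrics under identifier names (for example, Late Input Events appears as LateInputEvents); confirm the exact names on the job's Metrics blade.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Hypothetical Stream Analytics job resource ID.
RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.StreamAnalytics/streamingjobs/<job-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    RESOURCE_ID,
    metric_names=["InputEvents", "OutputEvents", "LateInputEvents"],
    timespan=timedelta(hours=1),
    aggregations=[MetricAggregationType.TOTAL],
)

# Print each metric's totals so spikes (for example, late events) stand out.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)
```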

Each of these metrics can help you determine the cause of issues your Azure Stream Analytics job is having. A regression means that a bug in your code that was previously fixed has been reintroduced into the application. From an Azure Stream Analytics perspective, a regression would most likely appear in the query or, if your job contains a function, in that area of the processing logic. To help determine when a change happened, you can review the Activity Log blade, which provides a list of changes made over a given time frame. If you keep your queries and function code in a source code repository, you can also review the file history to see who changed and merged the code, when, and how.
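
If you want that change history outside the portal, the Activity Log can also be queried through the SDK. Below is a minimal sketch, assuming the azure-mgmt-monitor package and hypothetical subscription and resource identifiers:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # hypothetical
RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.StreamAnalytics/streamingjobs/<job-name>"
)

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Entries over the last seven days: who changed the job, what, and when.
# This mirrors what the Activity Log blade shows in the portal.
start = datetime.now(timezone.utc) - timedelta(days=7)
odata_filter = (
    f"eventTimestamp ge '{start.isoformat()}' "
    f"and resourceUri eq '{RESOURCE_ID}'"
)
for event in client.activity_logs.list(filter=odata_filter):
    print(event.event_timestamp, event.operation_name.localized_value, event.caller)
```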