Big Data Architecture with Azure


Diagrams






Book Recommendations

Mastering Azure Analytics, by Zoiner Tejada.

Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud, by Robert Ilijason. I haven't read this but have had it recommended to me.

Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions, by Sudhir Rawat and Abhishek Narain. I haven't read this but have had it recommended to me.


Architecture

There are a number of different high-level architecture diagrams available for big data processing, with various names for the phases.

The most common version has nine phases: Data Sources, Data Storage, Real-Time Message Ingestion, Batch Processing, Stream Processing, Machine Learning, Analytical Data Store, Analytics & Reporting, Orchestration.

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/






Some Microsoft docs simplify it into four phases: Load & Ingest, Store, Process, Serve. Annoyingly, they often name them differently. For example, course DP-201 and its exam use the terms Ingestion, Data Storage, Analysis, and Visualization. Except where they use Ingest, Process, Store, and Analyse/Report. Courses DP-200 and DP-203 (and their associated exams) use Ingest, Store, Prep & Train, and Model & Serve. Sheesh.

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

https://docs.microsoft.com/en-us/azure/architecture/example-scenario/dataplate2e/data-platform-end-to-end
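To keep the two views straight, here is a rough Python sketch that files the nine components under the four simplified phases. The component names come from the list above and the linked Microsoft guide. The grouping itself is my own assumption, not something the Microsoft docs spell out, and treating Orchestration as cross-cutting (no single phase) is likewise a guess.

```python
from enum import Enum


class Phase(Enum):
    """The four simplified phases (Load & Ingest, Store, Process, Serve)."""
    INGEST = "Load & Ingest"
    STORE = "Store"
    PROCESS = "Process"
    SERVE = "Serve"


# ASSUMPTION: this grouping is illustrative only; it is not taken from
# Microsoft documentation. None marks a component that spans all phases.
COMPONENT_TO_PHASE = {
    "Data Sources": Phase.INGEST,
    "Real-Time Message Ingestion": Phase.INGEST,
    "Data Storage": Phase.STORE,
    "Batch Processing": Phase.PROCESS,
    "Stream Processing": Phase.PROCESS,
    "Machine Learning": Phase.PROCESS,
    "Analytical Data Store": Phase.SERVE,
    "Analytics & Reporting": Phase.SERVE,
    "Orchestration": None,
}


def components_in(phase):
    """Return the component names filed under the given phase (or None)."""
    return [name for name, p in COMPONENT_TO_PHASE.items() if p is phase]


if __name__ == "__main__":
    for phase in Phase:
        print(f"{phase.value}: {', '.join(components_in(phase))}")
    print(f"Cross-cutting: {', '.join(components_in(None))}")
```

The same dictionary could be extended with the DP-200/DP-203 or DP-201 names as extra aliases if you need to translate between course materials.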


Choice of Batch Processing services

Azure Databricks, Azure Synapse Analytics and Azure HDInsight have a lot of overlap between their use cases (the Batch Processing section of the big picture). I guess Fabric is also going to be a choice, when Microsoft release it in a working state. :-)

https://adatis.co.uk/databricks-vs-synapse-spark-pools-what-when-and-where/

https://docs.microsoft.com/en-us/answers/questions/587071/differnce-between-synapse-and-databricks.html

https://www.clearpeaks.com/cloud-analytics-on-azure-databricks-vs-hdinsight-vs-data-lake-analytics/

https://stackoverflow.com/questions/50679909/azure-data-lake-vs-azure-hdinsight

https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/

We could also mention Azure Batch, though it is more an HPC service than a BI service.

https://azure.microsoft.com/en-us/services/batch/

Note that Azure Data Lake Analytics hasn't seen any updates for a couple of years (and its query language, U-SQL, doesn't support Data Lake Storage Gen2). It seems to have been abandoned.