Best Practice Data Architectures in 2017

Clive Skinner, Thu 28 September 2017

With an ever-increasing number of technologies available for data processing and three highly competitive cloud platform vendors, we at Dativa have to stay on top of exactly which technologies are the best choices for our clients.

Most of the work we undertake involves processing and loading data into a data lake, providing post-processing on top of that, and then reporting on it. We've developed a standard data processing pipeline architecture that covers both historical data analysis in a data lake and real-time data processing through a separate stream.

A standard data pipeline architecture

[Diagram: standard data pipeline architecture]

There are many options for how we can implement this in 2017, but they broadly fall into four categories:

  • Amazon-centric using the AWS platform
  • Google-centric using the Google Cloud Platform
  • Microsoft-centric using the Azure platform
  • Platform independent using open source software

Most of our customers tend to align themselves with one of the cloud vendors and then take some components from the platform-independent option.

Best-of-breed options for each platform

Each of the components is covered below together with the options for implementing that component using the standard tools.

For each component below, a short description is followed by the relevant contexts and the technology options on Amazon Web Services, Google Cloud, Microsoft Azure and platform-independent tooling.

Data Source
Point of delivery, or source, of batched or streamed data to be ingested into the system. Any data transfer is undertaken via a secure mechanism such as HTTPS, SFTP or SCP.
  • Pull small batch: AWS Lambda (AWS); Cloud Functions (Google); Azure Functions (Azure); Python over HTTPS, SFTP, SCP etc. (platform independent)
  • Pull large batch: AWS Batch (AWS); Azure Batch (Azure)
  • Push batch: Amazon S3 (AWS); Cloud Storage (Google); Azure Storage (Azure)
  • Streamed: Amazon Kinesis (AWS); Cloud Pub/Sub, Cloud Dataflow (Google); Azure Stream Analytics, Azure Event Hub (Azure)
  • Discover and map: AWS Glue (AWS); Cloud Dataprep (Google); Azure Data Catalog (Azure)
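
As an illustration of the pull small batch context, here is a minimal sketch of a Python function in the style of an AWS Lambda handler: it pulls a file over HTTPS and lands it in an S3 landing area. The source URL, bucket and key are hypothetical, and the same logic ports to Cloud Functions or Azure Functions with the respective storage SDKs.

```python
import boto3
from urllib.request import urlopen

# Hypothetical source and destination; replace with the real feed URL and landing bucket.
SOURCE_URL = "https://example.com/exports/daily_batch.csv"
DEST_BUCKET = "my-data-lake-landing"

s3 = boto3.client("s3")

def handler(event, context):
    """Pull a small batch file over HTTPS and land it in the data lake's S3 landing zone."""
    body = urlopen(SOURCE_URL).read()
    key = "landing/daily_batch/{}.csv".format(event.get("date", "latest"))
    s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=body)
    return {"bucket": DEST_BUCKET, "key": key, "bytes": len(body)}
```
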
Orchestration
Optimised automation of the processes and workflow required to extract, clean, transform, aggregate and load the Source Data into the Data Lake or Warehouse.
  • Distributed services: AWS Step Functions (AWS); App Engine (Google); Microsoft Flow (Azure); Apache Airflow (platform independent)
  • Complex processing: AWS Data Pipeline (AWS); Cloud Dataflow (Google); Azure Data Factory (Azure)
  • Managed service: AWS Glue (AWS); Cloud Dataprep (Google); Azure Data Catalog (Azure)
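
For the platform-independent option, a minimal Apache Airflow DAG chaining the extract, clean and load steps might look like the sketch below; the DAG name, schedule and task callables are placeholders for the real pipeline logic.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder callables standing in for the real extract/clean/load code.
def extract(**kwargs):
    pass

def clean(**kwargs):
    pass

def load(**kwargs):
    pass

dag = DAG(
    "daily_pipeline",                 # hypothetical DAG name
    start_date=datetime(2017, 9, 1),
    schedule_interval="@daily",
)

t_extract = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
t_clean = PythonOperator(task_id="clean", python_callable=clean, dag=dag)
t_load = PythonOperator(task_id="load", python_callable=load, dag=dag)

# Run the three steps in sequence once per day.
t_extract >> t_clean >> t_load
```
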
Pre-processing and Loader
Loading of the extracted, cleansed and transformed data into the Data Lake or Warehouse. The frequency and strategy used (e.g. append or replace) will vary depending on the business case.
  • Scripted loading: AWS Lambda (AWS); Cloud Functions (Google); Azure Functions (Azure); Python, Apache Spark, Apache Hive (platform independent)
  • Managed loading: AWS Glue (AWS); Cloud Dataprep (Google); Azure Data Catalog (Azure)
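
A scripted load with Apache Spark is often no more than the sketch below, which appends the day's cleansed extract to a partitioned Parquet table in the data lake; the paths and partition column are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily_loader").getOrCreate()

# Illustrative paths: cleansed staging area in, data lake table out.
staging_path = "s3://my-staging/cleansed/2017-09-28/"
lake_path = "s3://my-data-lake/events/"

df = spark.read.csv(staging_path, header=True, inferSchema=True)

# Append today's data as a new partition so reloads never rewrite history.
(df.write
   .mode("append")
   .partitionBy("event_date")
   .parquet(lake_path))
```
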
Cleansing Service
Filtering of data to improve reliability and quality by correcting errors, checking for consistency and testing against pre-defined acceptance criteria.
  • Scripted: AWS Lambda (AWS); Cloud Functions (Google); Azure Functions (Azure); Python, Pipeline API (platform independent)
  • Managed service: Pipeline API
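
A scripted cleansing step is frequently just a handful of pandas checks, as in this sketch; the column names and acceptance criteria here are hypothetical and stand in for whatever the business case defines.

```python
import pandas as pd

def cleanse(path):
    """Filter a raw extract against simple acceptance criteria before loading."""
    df = pd.read_csv(path)

    df = df.drop_duplicates()                          # remove exact duplicate rows
    df = df.dropna(subset=["device_id", "event_ts"])   # mandatory fields must be present
    df["event_ts"] = pd.to_datetime(df["event_ts"], errors="coerce")
    df = df[df["event_ts"].notnull()]                  # reject unparseable timestamps
    df = df[df["duration_s"].between(0, 86400)]        # reject implausible durations

    return df
```
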
Fact Generation
Aggregation and processing of data to improve query efficiency for Historical Reporting and Other Tools.
  • Small scale: AWS Lambda (AWS); Cloud Functions (Google); Azure Functions (Azure); SQL/HiveQL, Python, Presto (platform independent)
  • Large scale: AWS Batch (AWS); Azure Batch (Azure)
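
Small-scale fact generation is typically a SQL or HiveQL aggregation; the sketch below runs one through Spark SQL and writes the result back to the lake. The table and column names are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fact_generation").getOrCreate()

# Register the detail data, then aggregate it into a daily fact table.
spark.read.parquet("s3://my-data-lake/events/").createOrReplaceTempView("events")

daily_facts = spark.sql("""
    SELECT event_date,
           device_id,
           COUNT(*)        AS event_count,
           SUM(duration_s) AS total_duration_s
    FROM events
    GROUP BY event_date, device_id
""")

daily_facts.write.mode("overwrite").parquet("s3://my-data-lake/facts/daily_device/")
```
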
Data Lake
Storage repository that can hold large amounts of structured, semi-structured and/or unstructured data. Where only structured data is stored, a data warehouse may be used instead to improve query efficiency.
  • Structured data: Amazon Redshift (AWS); BigQuery (Google); Azure SQL Data Warehouse (Azure); Apache Hadoop, Apache Hive (platform independent)
  • Semi-structured, unstructured data: Amazon S3 (AWS); Cloud Storage (Google); Azure Storage (Azure)
  • Any data: Amazon EMR (AWS); Cloud Dataflow, Cloud Dataproc (Google); Azure HDInsight (Hadoop) (Azure)
Machine Learning Data Enrichment
Discovering known properties of data sets, or learning from and making predictions about previously unknown properties of data sets.
  • Managed service: Amazon Machine Learning (AWS); Cloud Machine Learning Engine (Google); Azure Machine Learning (Azure); Apache Spark MLlib (platform independent)
  • Scripted: AWS Batch, Amazon EMR (AWS); Cloud Dataflow, Cloud Dataproc (Google); Azure Batch, Azure HDInsight (Hadoop) (Azure)
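
As a platform-independent sketch of data enrichment, the example below uses Spark's ML library to cluster devices by usage so the cluster label can be joined back onto the data as a new attribute; the feature columns and cluster count are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("ml_enrichment").getOrCreate()

facts = spark.read.parquet("s3://my-data-lake/facts/daily_device/")

# Assemble the numeric columns into the feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["event_count", "total_duration_s"],
                            outputCol="features")
features = assembler.transform(facts)

# Cluster devices into an (arbitrary) five behavioural segments.
model = KMeans(k=5, featuresCol="features", predictionCol="segment").fit(features)
enriched = model.transform(features)

enriched.select("device_id", "segment") \
        .write.mode("overwrite").parquet("s3://my-data-lake/enriched/device_segments/")
```
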
Historical Reporting
Analytics tools allowing visualisation and reporting of historical (i.e. non-real-time) data.
  • Platform: Amazon QuickSight (AWS); Data Studio (Google); Power BI (Azure)
  • Third-party: Tableau, Periscope, Qlik
Other Tools
Interfaces giving third-party tools and services access to historical data.
  • Structured data APIs: Amazon API Gateway (AWS); Cloud Endpoints (Google); Azure API Management (Azure); Apache Impala, Apache Spark SQL (platform independent)
  • Semi-structured and unstructured data queries: Amazon Athena, Amazon Redshift Spectrum (AWS); BigQuery (Google); Azure Data Lake Analytics (Azure)
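
For ad hoc queries over semi-structured data sitting in S3, Amazon Athena can be driven from Python via boto3, as sketched below; the database name, query and output location are placeholders.

```python
import time
import boto3

athena = boto3.client("athena")

# Kick off the query; Athena writes its results to the given S3 location.
response = athena.start_query_execution(
    QueryString="SELECT device_id, COUNT(*) AS events FROM events GROUP BY device_id",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```
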
Real Time Data Shipping
Either batched or streamed information collected and delivered in near real-time.
  • Streamed data: Amazon Kinesis (AWS); Cloud Pub/Sub, Cloud Dataflow (Google); Azure Stream Analytics, Azure Event Hub (Azure); Apache Spark Streaming, Logstash/Beats, Apache Storm (platform independent)
  • Log data: Amazon CloudWatch (AWS); Cloud Monitoring, Cloud Logging (Google); Azure Application Insights, Azure Log Analytics (Azure)
  • Managed service: Logstash/Beats
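
On AWS, the shipping side of the real-time stream is usually just producers writing to Kinesis, as in this sketch; the stream name and record shape are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def ship_event(event):
    """Push one event onto the real-time stream; partition by device for even sharding."""
    kinesis.put_record(
        StreamName="realtime-events",            # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["device_id"],
    )

ship_event({"device_id": "abc-123", "event": "channel_change", "ts": "2017-09-28T10:00:00Z"})
```
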
Real Time Data Store
Storage repository for real-time data, either shared with historical data or stored separately.
  • Structured data: Amazon Redshift (AWS); BigQuery (Google); Azure SQL Data Warehouse (Azure); Apache Hadoop, Elasticsearch/Beats, Splunk (platform independent)
  • Semi-structured, unstructured data: Amazon S3 (AWS); Cloud Storage (Google); Azure Storage (Azure)
  • Managed service: Elasticsearch/Beats, Splunk
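
Where the real-time store is Elasticsearch, indexing the shipped events is a one-liner with the official Python client; the host, index name and document layout below are illustrative.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])   # illustrative cluster address

# Index one real-time event; Kibana dashboards can then query the 'events-*' indices.
es.index(
    index="events-2017.09.28",
    doc_type="event",   # document types were still current in the Elasticsearch 5.x era
    body={"device_id": "abc-123", "event": "channel_change", "ts": "2017-09-28T10:00:00Z"},
)
```
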
Real Time Reporting
Typically 'live' dashboards or reports based on the most recently delivered real-time data.
  • Streamed data: Amazon Kinesis Analytics (AWS); Cloud Pub/Sub, Cloud Dataflow, Cloud Dataprep (Google); Azure Stream Analytics, Azure Event Hub (Azure); Kibana, Splunk (platform independent)
  • Managed service: Kibana, Splunk
Containers
Software packaged to run in isolation on a shared operating system, guaranteeing that it will always run the same regardless of where it is deployed.
  • Managed service: EC2 Container Service (AWS); Container Engine, Container Registry (Google); Azure Container Service (Azure); Docker (platform independent)
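
In practice, platform independence here means packaging each pipeline step as a Docker image. The sketch below launches one such step from Python using the Docker SDK; the image name, command and environment are made up for the example.

```python
import docker

client = docker.from_env()

# Run a containerised pipeline step exactly as it would run on ECS, Container Engine or ACS.
logs = client.containers.run(
    "example/daily-loader:latest",                    # hypothetical image for the load step
    command=["python", "load.py", "--date", "2017-09-28"],
    environment={"TARGET_BUCKET": "my-data-lake"},
    remove=True,                                      # clean up the container when it exits
)
print(logs.decode("utf-8"))
```
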
Autoscaling
Allows automated scaling up or down of services based on definable conditions, to ensure optimum performance versus use of resources.
  • Managed service: Auto Scaling (AWS); Autoscaler (Google); Azure Autoscale (Azure); platform specific, e.g. Apache CloudStack, Docker Swarm (platform independent)
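
As an AWS example, a simple scaling policy can be attached to an Auto Scaling group from Python; the group name and adjustment values below are placeholders, and the other platforms offer equivalent APIs.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Add two instances to the (hypothetical) processing fleet whenever the linked alarm fires.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="fact-generation-workers",
    PolicyName="scale-out-on-queue-depth",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)
```
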
