Pocket Dictionary of Data Engineering
Data engineering is a crucial aspect of data science that entails the design, construction, integration, and management of data pipelines tailored for analytical purposes.
- Data Pipeline — a set of data processing elements connected in series, where the output of one element is the input of the next. The elements of a pipeline are often executed in parallel or in a time-sliced fashion. A pipeline can be batch (historical) or streaming (real-time).
Cycle: Data Ingestion — Data Processing — Data Storage — Data Analysis
- Ingestion — the process of obtaining and importing data for immediate use or storage in a database. To ingest something is to “take something in or absorb something.” Data can be streamed in real time or ingested in batches.
- Extract Transform Load (ETL) — the process of extracting data from source systems, transforming it, and loading it into a data warehouse to make it easier to analyze.
- Data Warehouse (DW) — a single central store of current and historical data used for reporting and data analysis.
- Business Intelligence Platforms — enable enterprises to build BI applications by providing capabilities in three categories: analysis, information delivery and platform integration.
Batch vs Streaming Processing
Batch processing deals with bounded, non-continuous data. It handles large data sets efficiently, but it does not come close to the real-time requirements of most of today’s businesses.
Stream processing does deal with continuous data and is really the golden key to turning big data into fast data.
Hadoop
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
It comprises three main components:
- HDFS: the bottom layer component for storage.
- YARN: for job scheduling and cluster resource management.
- MapReduce: for parallel processing.
HDFS: The Hadoop Distributed File System — is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. HDFS breaks up files into chunks and distributes them across the nodes of the cluster.
YARN: Yet Another Resource Negotiator — is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.
MapReduce: a processing technique and programming model for distributed computing, based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). Reduce takes the output of a Map as its input and combines those tuples into a smaller set of tuples.
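As an illustrative sketch, here is a toy, single-process Python simulation of the classic word-count example (not an actual Hadoop job): the Map phase emits (word, 1) pairs and the Reduce phase groups them by key and sums the counts.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: break each input line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + Reduce: group pairs by key, then sum the values per word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["to be or not to be", "to do is to be"]
print(reduce_phase(map_phase(lines)))
# {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```

In a real Hadoop job the same Map and Reduce logic runs in parallel across the cluster, with the framework handling partitioning, shuffling, and fault tolerance.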
Hadoop Ecosystem
Apache Hive: is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Apache Flink: is an open-source stream-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner.
Apache Pig: is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
Apache HBase: is an open-source, non-relational, distributed database modeled after Google’s Bigtable and written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop.
Apache Beam: a big data processing standard created by Google and open-sourced in 2016. It provides a unified programming model for processing both batch and streaming data, and pipelines can be executed on popular engines such as Spark, Flink, and of course Google’s commercial product, Dataflow. Beam’s model is based on earlier Google work known as FlumeJava and MillWheel, and addresses data processing tasks such as ETL, analysis, and stream processing. It currently provides SDKs in two languages, Java and Python.
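A minimal sketch with the Beam Python SDK (assuming the apache-beam package is installed; with no runner specified, the local DirectRunner is used):

```python
import apache_beam as beam

# A small batch word-count pipeline; the same code can run on Spark,
# Flink, or Dataflow by choosing a different runner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["to be or not to be"])
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```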
Apache Kafka: is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
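A sketch of producing and consuming messages, assuming the third-party kafka-python client, a broker running on localhost:9092, and a topic named "events" (all assumptions, not part of Kafka itself):

```python
from kafka import KafkaProducer, KafkaConsumer

# Produce a message to the "events" topic on the assumed local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user_signed_up")
producer.flush()

# Consume messages from the same topic, starting from the earliest offset.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
```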
Apache Spark
Apache Spark is an open-source distributed cluster-computing framework. Spark is a data processing engine developed to provide faster and easier-to-use analytics than Hadoop MapReduce.
Concepts and terminology:
RDD: Resilient Distributed Datasets are Spark’s core abstraction: an immutable distributed collection of objects. The data can be held in memory or on disk across the cluster and is logically partitioned, which allows operations to run in parallel. Because an RDD cannot be changed, new RDDs are derived from it through transformations. RDDs are also fault tolerant: if a failure occurs, lost data can be rebuilt automatically from the lineage graph.
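A minimal PySpark sketch (assuming a local Spark installation) showing that transformations derive new RDDs from existing ones and that nothing is computed until an action such as collect() is called:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

# Create an RDD from an in-memory collection, partitioned across the cluster.
numbers = sc.parallelize(range(10))

# Transformations (filter, map) return new RDDs; the original is immutable.
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# An action triggers execution; lost partitions can be recomputed from lineage.
print(evens_squared.collect())  # [0, 4, 16, 36, 64]

sc.stop()
```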
Dataframe: an immutable distributed collection of data, like an RDD, but organized into named columns, like a table in a relational database. This design makes processing large datasets easier and lets developers impose a structure, a higher-level abstraction, on a distributed collection.
Dataset: an API for expressing transformations on domain objects while retaining the performance and optimizations of the robust Spark SQL execution engine.
Spark SQL: a Spark module for working with structured data. It supports mixed workloads, allowing SQL queries to be combined with complex, algorithm-based analytics.
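A small sketch (again assuming a local PySpark installation) showing a DataFrame with named columns and the same data queried through Spark SQL after registering a temporary view; the column names are arbitrary examples:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-sql-example").getOrCreate()

# A DataFrame is a distributed collection organized into named columns.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# DataFrame API: structured, optimizable operations.
df.filter(df.age > 30).show()

# Spark SQL: register the DataFrame as a view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```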
Spark Streaming: uses Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, thus facilitating easy implementation of lambda architecture.
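A sketch of the classic network word count with the DStream API, assuming a local Spark installation and a text source on localhost:9999 (for example `nc -lk 9999`); each one-second mini-batch is processed with ordinary RDD transformations:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-example")
ssc = StreamingContext(sc, batchDuration=1)  # one-second mini-batches

# Each mini-batch of lines becomes an RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (
    lines.flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```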
PySpark: is the collaboration of Apache Spark and Python. The Spark Python API (PySpark) exposes the Spark programming model to Python.
Pandas: is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
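A minimal pandas sketch (the column names are arbitrary examples):

```python
import pandas as pd

# Build a small in-memory DataFrame and compute a per-group aggregate.
df = pd.DataFrame(
    {"city": ["Paris", "Paris", "Lyon"], "sales": [120, 80, 95]}
)
print(df.groupby("city")["sales"].sum())
```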
Data Storage Format
AVRO: A remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.
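A sketch of writing and reading Avro with the third-party fastavro library (an assumption; the Avro project also ships its own Python package). Note that the schema itself is plain JSON:

```python
from fastavro import writer, reader, parse_schema

# Avro schemas are defined in JSON.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "alice", "age": 34}, {"name": "bob", "age": 45}]

# Serialize to a compact binary file, then read it back.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```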
JSON: is a language-independent data format. It was derived from JavaScript, but as of 2017 many programming languages include code to generate and parse JSON-format data.
Parquet: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
ORC: A self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written.
Protobuf: Protocol Buffers is a method of serializing structured data. It is useful in developing programs to communicate with each other over a wire or for storing data. The method involves an interface description language that describes the structure of some data and a program that generates source code from that description for generating or parsing a stream of bytes that represents the structured data.
Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query.
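A small sketch with pandas (assuming the pyarrow or fastparquet engine is installed): writing a Parquet file and reading back only one column, which is exactly the kind of selective access a columnar format enables.

```python
import pandas as pd

df = pd.DataFrame(
    {"user_id": [1, 2, 3], "country": ["FR", "DE", "FR"], "amount": [9.5, 3.0, 7.25]}
)

# Write the table in columnar Parquet format (requires pyarrow or fastparquet).
df.to_parquet("orders.parquet")

# Read back only the column the query needs; other columns are never decoded.
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
print(amounts)
```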
Data Orchestration
Apache Airflow: Airflow is a platform to programmatically author, schedule and monitor workflows. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
- DAG: In Airflow, a DAG, or Directed Acyclic Graph, is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies; a description of the order in which work should take place (a minimal DAG is sketched after this list).
- Operator: While DAGs describe how to run a workflow, Operators determine what actually gets done; a class that acts as a template for carrying out some work.
- Task: Once an operator is instantiated, it is referred to as a “task”. The instantiation defines specific values when calling the abstract operator, and the parameterized task becomes a node in a DAG; a parameterized instance of an operator.
- Task Instance: A task instance represents a specific run of a task and is characterized as the combination of a dag, a task, and a point in time. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.
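A minimal sketch of these concepts, assuming Airflow 1.x-style imports (the operator import path differs in later versions) and an arbitrary DAG name:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x path

# The DAG collects tasks and defines their schedule and relationships.
dag = DAG(
    dag_id="example_pipeline",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Instantiating an operator inside a DAG produces a task (a node in the graph).
extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# Dependency: "extract" must succeed before "load" runs.
extract >> load
```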
Databases
PostgreSQL: is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance. PostgreSQL is ACID-compliant, transactional, has updatable and materialized views, triggers, and foreign keys. It also supports functions and stored procedures.
MongoDB: uses JSON-like documents to store schema-free data. In MongoDB, collections of documents do not require a predefined structure and columns can vary for different documents. MongoDB has many of the features of a relational database, including an expressive query language and strong consistency. However, since it is schema-free MongoDB allows you to create documents without having to create the structure for the document first.
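A sketch with the pymongo driver, assuming a MongoDB server on the default localhost port; the database and collection names are arbitrary examples:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["app_db"]["users"]  # collections need no predefined schema

# Documents in the same collection may have different fields.
users.insert_one({"name": "alice", "age": 34})
users.insert_one({"name": "bob", "email": "bob@example.com"})

# Expressive query language: find documents matching a filter.
for doc in users.find({"age": {"$gt": 30}}):
    print(doc)
```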
DynamoDB: uses tables, items and attributes as the core components that you work with. A table is a collection of items, and each item is a collection of attributes. DynamoDB uses primary keys to uniquely identify each item in a table and secondary indexes to provide more querying flexibility.
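A sketch with the AWS SDK for Python (boto3), assuming credentials are configured and that a table named "users" already exists with "user_id" as its partition key (hypothetical names):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("users")  # hypothetical table, partition key "user_id"

# An item is a collection of attributes identified by its primary key.
table.put_item(Item={"user_id": "42", "name": "alice", "plan": "pro"})

# Retrieve a single item by its primary key.
response = table.get_item(Key={"user_id": "42"})
print(response.get("Item"))
```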
Cassandra: a distributed, wide-column NoSQL database. One of Cassandra’s biggest strengths is its ability to handle massive amounts of unstructured data. In cases where your database needs to scale rapidly with minimal increase in administrative work, Cassandra may be a good choice.
MySQL: is an open-source relational database management system (RDBMS). Just like all other relational databases, MySQL uses tables, constraints, triggers, roles, stored procedures and views as the core components that you work with. A table consists of rows, and each row contains the same set of columns. MySQL uses primary keys to uniquely identify each row (a.k.a. record) in a table, and foreign keys to ensure referential integrity between two related tables.
Row vs Columnar Database
Row-Oriented Database: a database management system that stores a table by serializing each row of data together, one complete record after another.
Column Oriented Database: A column-oriented DBMS (or columnar database management system) is a database management system (DBMS) that stores data tables by column rather than by row.
Practical use of a column store versus a row store differs little in the relational DBMS world. Both columnar and row databases can use traditional database query languages like SQL to load data and perform queries. Both row and columnar databases can become the backbone in a system to serve data for common extract, transform, load (ETL) and data visualization tools. However, by storing data in columns rather than rows, the database can more precisely access the data it needs to answer a query rather than scanning and discarding unwanted data in rows. Query performance is increased for certain workloads.
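To make the difference concrete, here is a tiny illustrative sketch of the same table laid out both ways in plain Python; the columnar layout lets a query touch only the values it needs:

```python
# Row-oriented: each record's values are stored together.
rows = [
    {"id": 1, "name": "alice", "amount": 9.5},
    {"id": 2, "name": "bob", "amount": 3.0},
]

# Column-oriented: each column's values are stored together.
columns = {
    "id": [1, 2],
    "name": ["alice", "bob"],
    "amount": [9.5, 3.0],
}

# SELECT SUM(amount): the columnar layout reads one contiguous list,
# while the row layout must scan every record and discard unused fields.
print(sum(columns["amount"]))
print(sum(row["amount"] for row in rows))
```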
OLAP vs OLTP
Data Warehouse can be broken down into two main database systems:
- Online Analytical Processing (OLAP): its primary objective is data analysis. It is an online analysis and data-retrieval process, characterized by large volumes of data and complex queries, and it typically uses data warehouses.
- Online Transactional Processing (OLTP): its primary objective is data processing. It manages database modifications and is characterized by large numbers of short online transactions, simple queries, and traditional DBMSs.
Data Model
Three perspectives of data model are:
- Conceptual data model: describes the semantics of a domain, being the scope of the model. For example, it may be a model of the interest area of an organization or industry. This consists of entity classes, representing kinds of things of significance in the domain, and relationship assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial ‘language’ with a scope that is limited by the scope of the model.
- Logical data model: describes the semantics, as represented by a particular data manipulation technology. This consists of descriptions of tables and columns, object oriented classes, and XML tags, among other things.
- Physical data model: describes the physical means by which data are stored. This is concerned with partitions, CPUs, tablespaces, and the like.
ER-Diagram: An entity-relationship model (ERM), sometimes referred to as an entity-relationship diagram (ERD), could be used to represent an abstract conceptual data model (or semantic data model or physical data model) used in software engineering to represent structured data.
Star schema: The simplest style of data warehouse schema. The star schema consists of a few “fact tables” (possibly only one, justifying the name) referencing any number of “dimension tables”.
Snowflake schema: a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. “Snowflaking” is a method of normalizing the dimension tables in a star schema. When it is completely normalized along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle. The principle behind snowflaking is normalization of the dimension tables by removing low cardinality attributes and forming separate tables.
The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema’s dimensions are denormalized with each dimension represented by a single table.
Star and snowflake schemas are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.
Normalization: splits up data to avoid redundancy (duplication) by moving commonly repeating groups of data into new tables. Normalization therefore tends to increase the number of tables that need to be joined in order to perform a given query, but reduces the space required to hold the data and the number of places where it needs to be updated if the data changes.
Slowly Changing Dimension: Dimensions in data management and data warehousing contain relatively static data about such entities as geographical locations, customers, or products. Data captured by Slowly Changing Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule.
Dealing with these issues involves SCD management methodologies referred to as Type 0 through Type 6; Type 6 SCDs are also sometimes called hybrid SCDs. The most common types are listed below, with a small illustration after the list.
- Type 0: retain original
- Type 1: overwrite
- Type 2: add new row
- Type 3: add new attribute
- Type 4: add history table
- Type 6: combined approach of types 1, 2 and 3 (1 + 2 + 3 = 6)
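An illustrative sketch (a hypothetical customer dimension) of the two most common approaches: Type 1 overwrites the old value, while Type 2 keeps history by closing the current row and adding a new one with validity dates.

```python
from datetime import date

# Current state of a customer dimension row.
customer = {"customer_id": 7, "city": "Lyon",
            "valid_from": date(2018, 1, 1), "valid_to": None, "current": True}

# Type 1: overwrite in place; the history of the old city is lost.
type1 = dict(customer, city="Paris")

# Type 2: close the existing row and add a new current row.
today = date(2019, 6, 1)
closed_row = dict(customer, valid_to=today, current=False)
new_row = {"customer_id": 7, "city": "Paris",
           "valid_from": today, "valid_to": None, "current": True}
type2_history = [closed_row, new_row]

print(type1)
print(type2_history)
```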
Data Structure: A data structure is a way of storing data in a computer so that it can be used efficiently. It is an organization of mathematical and logical concepts of data. Often a carefully chosen data structure will allow the most efficient algorithm to be used. The choice of the data structure often begins from the choice of an abstract data type (Array, Hash Table, List, Stack, Struct, Union, Object).
Cloud Computing
The practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer. There are cloud providers like Amazon Web Services, Google Cloud Platform, IBM Cloud and Microsoft Azure — with their analytics products.
Google Cloud Platform (GCP)
- Google Cloud Storage: A RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure.
- BigQuery: A fully managed, highly scalable data warehouse with built-in ML.
- Dataflow: A fully managed service for both stream (real-time) and batch data processing.
- Dataproc: Managed Spark and Hadoop service.
- Dataprep: Cloud data service to explore, clean, and prepare data for analysis.
- Pub/Sub: A fully-managed real-time messaging service to send and receive messages between independent applications.
- Cloud Composer: A fully managed workflow orchestration service built on Apache Airflow.
Amazon Web Services (AWS)
- Amazon S3: Amazon Simple Storage Service (Amazon S3) is an object storage service for scalability, data availability, security, and performance.
- Amazon Redshift: A fully managed petabyte-scale data warehouse service.
- AWS Glue: A fully managed ETL service.
- AWS Lambda: An event-driven, serverless computing platform.