What are data pipelines?
Data pipelines are processes (typically represented as DAGs) that result in the production of data products, including datasets and models.
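The DAG structure can be made concrete with a small sketch. This is a minimal illustration, not any particular orchestrator's API: the task names (`extract`, `clean`, `train`, `report`) are hypothetical, and Python's standard-library `graphlib` stands in for a real scheduler by computing a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each name maps to the set of upstream tasks it
# depends on. Because a pipeline is a DAG (no cycles), a topological
# sort always yields a valid run order.
dag = {
    "extract": set(),        # no dependencies
    "clean":   {"extract"},  # runs after extract
    "train":   {"clean"},    # runs after clean
    "report":  {"clean"},    # runs after clean, independently of train
}

run_order = list(TopologicalSorter(dag).static_order())
print(run_order)  # "extract" comes first; "clean" precedes "train" and "report"
```

Real orchestrators such as Airflow build on the same idea, adding scheduling, retries, and monitoring on top of the dependency graph.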
Types of data pipelines
Until recently, ETL pipelines were the dominant form of data pipeline. ETL stands for Extract, Transform, and Load. Data would be extracted from its original source, ‘transformed’ into a consumable form, and then ‘loaded’ into a database.
ELT pipelines re-order the Transform and Load operations, instead loading all extracted data from a source into a data warehouse first, and allowing users to decide what transformations to run later.
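The difference in ordering can be sketched in a few lines. This is an illustrative toy, assuming nothing beyond the source text: the function names are hypothetical and a plain dict stands in for the warehouse.

```python
# Contrast ETL and ELT ordering. A dict stands in for the warehouse.

def extract():
    return [{"user": "a", "amount": "10"}, {"user": "b", "amount": "5"}]

def transform(rows):
    # Example transformation: cast amounts from strings to integers.
    return [{**r, "amount": int(r["amount"])} for r in rows]

def etl(warehouse):
    # Transform happens *before* load: only cleaned data is stored.
    warehouse["events"] = transform(extract())

def elt(warehouse):
    # Load happens first: raw data is preserved in the warehouse,
    # and transformations run later, on demand.
    warehouse["events_raw"] = extract()
    warehouse["events"] = transform(warehouse["events_raw"])

etl_wh, elt_wh = {}, {}
etl(etl_wh)
elt(elt_wh)
print("events_raw" in etl_wh)  # False: ETL discards the raw form
print("events_raw" in elt_wh)  # True: ELT keeps raw data for provenance
```

The retained `events_raw` table is what makes the provenance and agility benefits described below possible.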
ELT pipelines are often configured to capture all raw data of interest that a source emits, and the products of transformations are appended as additional columns within the warehouse in which that data is stored. The raw data extracted meanwhile remains accessible in its original form. This has a number of benefits:
- Provenance: It becomes possible to view the original source data that was used in the production of a finalized data product.
- Robustness: ELT pipelines are typically less brittle than ETL pipelines; a failing transformation doesn’t block the ingestion of raw data.
- Speed: Data loading is quicker, as transformations are deferred until after the raw data is in the warehouse.
- Agility and accessibility: ELT pipelines don’t require their users to know ahead of time exactly how they’ll use the data; querying, slicing, and transforming can be conducted at any time. This is particularly important for enabling self-service analytics.
- Differing cost profile: All of this typically means that data storage and processing costs are higher than in ETL solutions, but bandwidth costs are lower, as data is only ever loaded once.
Whereas ETL and ELT pipelines result in the production of data files and tables, Machine Learning (ML) pipelines produce ML models. An ML model captures what was learned by executing an ML algorithm on a dataset: the rules, parameters, and other data structures required to make predictions or decisions.
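A tiny end-to-end sketch of this idea, with no external libraries: a dataset goes in, and the pipeline's product is not a table but a “model” (here just the slope and intercept of an ordinary least-squares line, the simplest learnable parameters). The function names and dict-based model format are illustrative assumptions.

```python
# Minimal ML pipeline sketch: dataset in, fitted model out.

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    # The returned parameters *are* the data product of this pipeline.
    return {"slope": slope, "intercept": intercept}

def predict(model, x):
    return model["slope"] * x + model["intercept"]

data = [(0, 1), (1, 3), (2, 5), (3, 7)]  # generated by y = 2x + 1
model = fit_line(data)
print(model)           # {'slope': 2.0, 'intercept': 1.0}
print(predict(model, 10))  # 21.0
```

Real ML pipelines add stages for feature engineering, validation, and model serialization, but the shape is the same: the terminal node of the DAG emits learned parameters rather than rows.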
Data pipeline tools
Different software packages exist for orchestrating data pipelines, such as Airflow, Prefect, and Dagster. Tools also exist for transforming data within pipelines, such as dbt and Dataform. Then there are utilities for testing data, such as Great Expectations. You can set up, manage, and maintain all of these yourself, or you can use the HASH Platform to access pre-integrated, best-in-class, always-maintained, secure versions of these tools. In some cases, support may be coming soon. Check our roadmap for more.
How data pipelines work in HASH
There are three ways to utilize data pipelines in HASH:
- With an external data warehouse and pipelines, using HASH solely as a modeling engine
- With an external data warehouse, using HASH to manage data pipelines as well as models
- Using HASH as your data warehouse, pipeline, and modeling engine
Users are free to create, run, and maintain their data pipelines outside of HASH, operating on and feeding external data warehouses whose contents can then be accessed from, or imported into, HASH.
hCore offers the ability to set up HASH Flows, which combine tooling like Airflow, dbt, and Great Expectations into a single easy-to-use interface, usable by anybody with the correct permissions. None of the usual setup is required: complexity, maintenance, and monitoring concerns are abstracted away, allowing data analysts as well as engineers to focus on the important work of simulation modeling.
HASH Flows are currently in closed beta. If you’d like to beta test HASH Flows in hCore, please get in touch.