Yesterday, Dremio hosted the Subsurface Conference, the first conference on cloud data lakes. More than 5000 people registered, and more than 2500 attended. If one had doubts that cloud data lakes are a strategic area for many in the data ecosystem, those figures should quash them.
I delivered a presentation at the end of the day that I’ll share here. Entitled 5 Data Trends You Should Know, the presentation covers the major trends we observe in the data world. Here’s a quick narrative of the talk.
There is a mega-trend underpinning the changes in data design philosophy and tooling: the rise of the data engineer. Data engineers are the people who move, shape, and transform data from the source to the tools that extract insight. We believe data engineers are the change agents in a decade-long process that will revolutionize data.
Data systems used to be purchased by IT. But in the last 20 years, individual departments started to purchase their own data systems. Each team uses its data systems to develop its own proprietary data products: analyses, dashboards, machine learning systems, even new product features.
Data systems rely on data from other teams. So all of these teams share data. And just like that, the company has built a data mesh: a network of producers and consumers of data who share data via standard APIs or open-source formats like Apache Arrow & Parquet. When the data is stored in the cloud, we call it a cloud data lake.
Data engineers stand on the shoulders of 70 years of software development experience and take many of the learnings from that discipline. One example is developing a data engineering lifecycle. This is our current understanding of a typical data engineering software development lifecycle.
There are six steps:
- Ingesting data from the systems that produce it and writing it into open formats in the cloud
- Planning the software to build
- Querying data using a compute engine which runs across the cloud data lake
- Modeling the data to ensure there is one centralized definition of every metric with an owner, a lineage, and a status
- Developing the data product, which could be an analysis, a BI report, a machine learning model, or a production feature
- Monitoring and testing the data to ensure data consistency & integrity over time
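To make the modeling step concrete, here is a minimal sketch of a metric catalog in which each metric is defined exactly once and carries an owner, a lineage, and a status. The names `Metric` and `MetricCatalog`, the status values, and the sample metric are all hypothetical and chosen for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One definition of a metric (all field names are illustrative)."""
    name: str
    definition: str        # e.g. a SQL expression; not executed here
    owner: str             # team or person accountable for the metric
    lineage: tuple         # upstream tables or metrics it derives from
    status: str = "draft"  # hypothetical lifecycle: draft, certified, deprecated


class MetricCatalog:
    """Keeps a single definition per metric name."""

    def __init__(self):
        self._metrics = {}

    def register(self, metric):
        # Reject a second definition so every consumer sees the same one.
        if metric.name in self._metrics:
            raise ValueError(f"metric {metric.name!r} is already defined")
        self._metrics[metric.name] = metric

    def get(self, name):
        return self._metrics[name]


catalog = MetricCatalog()
catalog.register(
    Metric(
        name="monthly_revenue",
        definition="SUM(amount)",
        owner="finance-data",
        lineage=("orders",),
        status="certified",
    )
)
print(catalog.get("monthly_revenue").owner)
```

Rejecting duplicate names is the design choice that matters here: a second team that tries to define `monthly_revenue` gets an error instead of a silently divergent number.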
As the profession of data engineering matures, engineers need new tools to help them with each step in the process. The five trends that we are observing within the data world are the rise of those tools at each step. Here are those 5:
- New data pipelines use modern programming languages to create reusable abstractions for data processing, to monitor data pipelines, and to visualize the flow of data as a DAG (directed acyclic graph). Innovators here are Dagster, Airflow, and Prefect.
- Compute engines query data in the cloud without having to move it. They leverage the separation of data and compute to accelerate queries, enable secure and compliant access, and future-proof the infrastructure against new advances in tools and use cases that haven’t been built yet. Innovators are Dremio and Databricks.
- Data modeling curates a data catalog for all the metrics within a company. When metrics are modeled, they are defined once, accurately, and everyone uses that definition. Innovators are Transform Data and Looker (with LookML).
- Data products are analyses, experiments, reports, and machine learning models/products built on data. Innovators in this category include Preset, Streamlit, and Tecton among others.
- Data quality tools monitor data streams, identify anomalies, and create testing harnesses to ensure data is always accurate. Data quality innovators include MonteCarlo, SodaData, Great Expectations, and Data Gravity.
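Two of the ideas in this list, the pipeline DAG and the data quality check, are easy to sketch with Python's standard library. This is a toy illustration and not how Dagster, Airflow, Prefect, or any of the quality tools named above work internally. The task names, the sample rows, and the `find_anomalies` helper are assumptions made up for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "transform": {"ingest"},
    "quality_check": {"transform"},
    "publish": {"quality_check"},
}


def execution_order(dag):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


def find_anomalies(rows, column, low, high):
    """Return rows whose value is missing or falls outside [low, high]."""
    return [
        r for r in rows
        if r.get(column) is None or not low <= r[column] <= high
    ]


# Made-up sample data: one valid row, one negative amount, one missing amount.
rows = [
    {"order_id": 1, "amount": 40.0},
    {"order_id": 2, "amount": -5.0},
    {"order_id": 3, "amount": None},
]

print(execution_order(PIPELINE))
print(find_anomalies(rows, "amount", 0.0, 10_000.0))
```

Because the graph is declared as data, the same structure can drive scheduling, monitoring, and visualization. A cycle such as `{"a": {"b"}, "b": {"a"}}` is rejected with `CycleError` before anything runs.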
All of these tools need to be synthesized to achieve the vision of a modern data mesh, and data engineers will pioneer that change.