11 minute read / Jan 22, 2025
Top Themes in Data Transcript
Slide 1
Clearing: While the data world consolidates, capabilities have exploded with AI.
Content:
- AI is rewriting every rule about what’s possible with data
- Those two forces in tension will make for an exciting 2025
Slide 2
Clearing: My name is Tomasz Tunguz, founder and general partner at Theory.
Content:
- I’ve been investing in data for the last 17 years and have worked with companies like Looker, Monte Carlo, Hex, Omni, Tobiko Data, and MotherDuck
- I founded Theory, a venture firm managing $700M with the idea that all modern software companies will be underpinned by data and AI
- We run a research-oriented firm, informed by 200 buyers of data and AI software
Transition:
- These are the themes that we predict within the world of data
Slide 3
Clearing: Every transformation follows a pattern. Today, three powerful movements are reshaping how enterprises work with data.
Content:
- First, we’re witnessing the Great Consolidation. After a decade of expanding complexity in the modern data stack, companies are dramatically simplifying their architectures - and getting better results
- Second, we’re seeing a renaissance of scale-up computing. The distributed systems that dominated the 2010s are giving way to powerful single machines and Python-first workflows
- Third, we’re entering the age of agentic data - where AI doesn’t just analyze data, but actively manages it. Production AI systems are transforming both how we operate our data systems and how we extract insights from them
Transition:
- These aren’t isolated trends. They’re converging to create a fundamentally new way of working with data
Slide 4
Clearing: Let’s talk about the great consolidation.
Content:
- We’ve seen the modern data stack explode over the last few years
- There’s a tool for everything
Transition:
- But this has led to a lot of complexity
Slide 5
Clearing: Buyers are overwhelmed. I’m hearing more and more of them say, “Don’t sell me another tool!”
Content:
- They want simplification, not more point solutions
- Companies want to optimize costs. Fewer vendors mean fewer licenses and less overhead
- The office of the CFO is pressuring data leaders for ROI from billions invested over the last decade
- We will see enterprises standardizing on particular technologies, particularly the broadest ones, even if the individual point solutions are not the best in that layer
- Expect more mergers and acquisitions as companies try to assemble their versions of the most prized data layers
Transition:
- This consolidation is pushing us towards more flexible and scalable data architectures, driven not only by cost and simplicity but also capabilities, which brings us to…
Slide 6
Clearing: That MacBook Pro should be called a Mainframe Pro. It’s just that powerful.
Content:
- I use my MacBook Pro to run 70-billion-parameter models, which are roughly equivalent to GPT-3.5
- With that kind of power, I can develop the vast majority of data workloads on my local machine
Transition:
- As a new generation of developers, especially Python developers, starts working with data, they prefer local-first development. Scale-up architectures let them start small and migrate their workloads to bigger machines, which satisfy more than 80% of current workloads
Slide 7
Clearing: Decoupling storage and compute is all about unlocking flexibility.
Content:
- We are not talking about the scale-out architecture that separated storage and compute for Snowflake
- Instead, we’re talking about a logical separation between the query engine and the data storage
- Traditionally, these have been tightly coupled. But now, we’re seeing them decoupled, with technologies like Iceberg leading the way
- This allows us to:
- Use different query engines for different tasks, optimizing for both price and performance
- Create intellectual property around AI by building proprietary models
- Improve data governance, access control, and privacy compliance
- New query engines emerging:
- DuckDB is an in-process analytical database designed for efficient queries on larger datasets
- DataFusion is an extensible query engine written in Rust
- We’re also seeing greater use of Python data wrangling tools (a local-first sketch follows below):
- dlt (data load tool) is a Python library for building data ingestion pipelines
- Polars is a fast and efficient DataFrame library similar to Dask
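To make this concrete, here is a minimal sketch of the local-first pattern, assuming the duckdb and polars Python packages and a hypothetical local flights.parquet file:

```python
# Local-first analytics: query a Parquet file in-process with DuckDB,
# then hand the result to Polars for further wrangling.
# Assumes `pip install duckdb polars` and a local flights.parquet file (hypothetical).
import duckdb
import polars as pl

con = duckdb.connect()  # in-process engine, no server or cluster to manage

# The query engine reads the file directly, so storage and compute stay decoupled.
result = con.execute("""
    SELECT origin, COUNT(*) AS total_flights
    FROM 'flights.parquet'
    GROUP BY origin
    ORDER BY total_flights DESC
    LIMIT 10
""").pl()  # return the result as a Polars DataFrame

# Continue wrangling in Polars on the laptop.
top_airports = result.with_columns(
    (pl.col("total_flights") / pl.col("total_flights").sum()).alias("share")
)
print(top_airports)
```

The same query could later run against Iceberg tables or a bigger machine without rewriting the analysis.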
Transition:
- Centralized control of data and purpose-built data engines enable AI
Slide 8
Clearing: AI is changing the way software and data engineering teams work together.
Content:
- Jensen Huang, the CEO of NVIDIA, has a great way of putting it. He says the IT department of the future will be like the HR department for AI agents
- We’ll be managing and ’training’ these agents to work with our data
Transition:
- This change starts first within the engineering org
Slide 9
Clearing: Historically, there’s been a divide between software engineering and AI/ML teams.
Content:
- AI teams often worked downstream of the application, building offline models for analysis, clustering, and segmentation, often combined with the work of financial analysts
- Data engineering teams and software engineering teams are writing separate pipelines
- Operating in separate environments with different technologies
- Merging the two over the last decade has been extremely difficult
- At the same time, managing these parallel systems can be extremely expensive
Transition:
- AI changes this topology
Slide 10
Clearing: AI is a core part of many products, and in the future, every software company will be an AI company.
Content:
- Data scientists are now building production models
- Software engineers are hitting AI endpoints to build agents inside modern applications
- Python has become the dominant language of AI and a popular language for software development
- There’s an opportunity to fuse those two environments
- Data teams need to adopt software engineering best practices including:
- Virtual development environments
- Regression and integration testing (a sketch follows below)
- Cost optimization
- Tobiko Data’s SQLMesh reduces cloud data warehouse costs by 50% while also enabling this transition to virtual development environments
- We’re seeing this occur within our startups
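As an illustration of what those practices can look like on a data team, here is a minimal, hypothetical regression test written with pytest and Polars; the revenue_by_region transform and its expected totals are invented for the example:

```python
# test_transformations.py - a sketch of regression testing a data transform.
# Assumes `pip install polars pytest`; revenue_by_region is a hypothetical transform.
import polars as pl


def revenue_by_region(orders: pl.DataFrame) -> pl.DataFrame:
    """Aggregate order revenue by region (the transform under test)."""
    return (
        orders.group_by("region")
        .agg(pl.col("revenue").sum().alias("total_revenue"))
        .sort("region")
    )


def test_revenue_by_region_totals():
    # A small, fixed input acts as the regression fixture.
    orders = pl.DataFrame(
        {"region": ["east", "west", "east"], "revenue": [100.0, 250.0, 50.0]}
    )
    result = revenue_by_region(orders)

    # The expected output is pinned; any behavior change fails the build.
    expected = pl.DataFrame(
        {"region": ["east", "west"], "total_revenue": [150.0, 250.0]}
    )
    assert result.equals(expected)
```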
Transition:
- Speaking of cost, let’s talk about the expense of AI
Slide 11
Clearing: In the 24 months after ChatGPT was released, a parameter race was unleashed in which model sizes became ever larger, culminating most recently with Llama 3.1 at 405 billion parameters.
Content:
- These electron-guzzling monoliths are incredibly powerful, containing a compressed version of the roughly 20 trillion words written on the internet and the ability to process them
- At the same time, there have been parallel research efforts optimizing smaller and smaller models
Transition:
- While large models are essential in use cases where the universe of inputs is infinite, not every business workload needs a Wikipedia on every API call
Slide 12
Clearing: Databricks’ most recent State of Data + AI report, published earlier this year, shows that small models are the most popular.
Content:
- Small models now represent a majority of deployed AI models
- In interviews with AI buyers, the pressure from the CFO is stark
- In contrast to the decade of data, when spending grew unabated for the 12 years before 2022, cost pressure on AI has existed from day one
- With financial pressure, resourceful data teams have resorted to smaller models
Transition:
- But it is not performance at any price
Slide 13
Clearing: Plotting MMLU (roughly a high-school-equivalency benchmark) over time, you can see that small, medium, and large models are converging around 70 to 80% accuracy.
Content:
- This isn’t a one-time trend
- Overall AI inference costs have fallen 1000x in the US in the last three years
- Newer models might cost two orders of magnitude less to train
- Jevons Paradox is in full force - OpenAI materially underestimated how much people would use their software
Transition:
- With performance relatively similar, it’s no surprise enterprises are moving to smaller models. But it’s not just about performance equivalency
Slide 14
Clearing: In addition, smaller models offer significantly better latency.
Content:
- Latency is three to four times better with a smaller model
- Google found that latency has a significant, roughly linear impact on engagement with search results
- It’s no different within modern software applications
- Smaller models offer significantly better user experience
Transition:
- And they do it at a fraction of the cost. Just how big is the cost difference?
Slide 15
Clearing: Docspot tracks these prices and plots them on a logarithmic chart.
Content:
- Gemini’s 8-billion-parameter Flash model costs about 10 cents
- OpenAI’s GPT-4 costs more than $60
- That’s more than two orders of magnitude of difference - roughly 600x more expensive (see the quick arithmetic below)
- Some new AI architectures run multiple queries for the same user workflow to ensure higher accuracy
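The back-of-the-envelope math makes the gap tangible; the monthly token volume below is an assumption, and the per-million-token prices are the rough figures from the chart:

```python
# Rough cost comparison between a small and a frontier model.
# Prices are approximate dollars per 1M tokens; the volume is a made-up assumption.
SMALL_MODEL_PRICE = 0.10   # e.g., a small "flash"-class model
LARGE_MODEL_PRICE = 60.00  # e.g., a frontier model

monthly_tokens_millions = 5_000  # assume 5B tokens per month across an application

small_cost = SMALL_MODEL_PRICE * monthly_tokens_millions
large_cost = LARGE_MODEL_PRICE * monthly_tokens_millions

print(f"Small model: ${small_cost:,.0f}/month")       # $500/month
print(f"Frontier model: ${large_cost:,.0f}/month")    # $300,000/month
print(f"Difference: {large_cost / small_cost:,.0f}x")  # 600x
```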
Transition:
- Smaller models offer near-equivalent performance, significantly lower latency, and orders-of-magnitude lower cost. We believe they will be dominant within the enterprise. But smaller models do require one thing
Slide 16
Clearing: Data modeling isn’t just back - it’s become the foundation of reliable AI.
Content:
- Without it, we’re building AI castles on sandy data
- Our current AI models are text models, not numerical models
- To drive maximum performance we need to model the data
- This limits the universe of potential outcomes and dramatically improves quality
- Data modeling significantly improves the developer experience for software engineers
Transition:
- Let me show you what I mean
Slide 17
Clearing: Here I created a little TypeScript application that processes the famous FAA data. I did this in 15 minutes.
Content:
- I recorded a video of my request to show me the busiest airports by total flights in 2023
- The text-to-SQL model underpinning this hits a data model
- The data model provides additional context that helps translate questions into the structure of the underlying database (a sketch of this pattern follows below)
- For large enterprises with tens of thousands of tables, this is the only way to drive accuracy
- This provides a great API endpoint for software engineers to hit
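A minimal sketch of the pattern (not the actual application): the schema description stands in for the data model, the table and column names are invented, and the OpenAI Python SDK is used only as one possible text-to-SQL backend:

```python
# Text-to-SQL constrained by a data model: the model description limits the
# universe of queries the LLM can generate, and DuckDB executes the result.
# Assumes `pip install duckdb openai` and an OPENAI_API_KEY in the environment.
import duckdb
from openai import OpenAI

# A tiny "data model": the only tables, columns, and semantics the LLM may use.
DATA_MODEL = """
Table flights(origin_airport TEXT, dest_airport TEXT, flight_date DATE)
- One row per scheduled flight.
- "Busiest airport" means the most rows where the airport is the origin_airport.
"""

client = OpenAI()

def text_to_sql(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Translate the user's question into DuckDB SQL. "
                    f"Use only this data model:\n{DATA_MODEL}\n"
                    "Return SQL only, with no explanation."
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

sql = text_to_sql("Show me the busiest airports by total flights in 2023")
print(duckdb.sql(sql))  # run the generated query against the local flight data
```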
Transition:
- The impact of enabling AI to work within data organizations is not trivial
Slide 18
Clearing: Many leading organizations are already starting to use AI in a meaningful way.
Content:
- 25% of new code at Google is written by AI
- Microsoft and ServiceNow have both reported 50% developer productivity boosts
- Amazon saved $275 million migrating one version of Java to another using AI
- These productivity impacts will benefit data teams
- Models need to understand the underlying data through data models
- Once a data model is in place, we can build applications on top
- This data model will basically be an ORM for the entire data stack
Transition:
- Imagine being the first data team to save your company $10 million by producing the right analysis for the CFO or the board, especially in this environment of consolidation. That’s a surefire way to earn a promotion! One of the first applications of these models is BI, and BI is changing too
Slide 19
Clearing: Data governance isn’t about control anymore - it’s about enablement.
Content:
- The best governance frameworks today are built on collaboration, not restriction
- The core of BI is data governance
- It may look like fancy charts, but the most important thing is providing accurate data
- Data teams face a dilemma:
- Decentralized access means greater accessibility but more risk of misinterpretation
- Data centralization means higher quality data but less velocity
Transition:
- We’re finally reaching a place where you can have both
Slide 20
Clearing: The business intelligence ecosystem has been a pendulum oscillating between centralized and decentralized control.
Content:
- Early 2000s: The Era of Centralized BI
- Companies like MicroStrategy, Cognos, BusinessObjects, and Hyperion
- Powerful but slow and IT-dependent reporting solutions
- High accuracy, low agility
- 2003: The Rise of Self-Service Analytics
- Tableau revolutionized the industry
- Empowered business users to directly access and analyze data
- The Cloud Data Warehouse Revolution:
- Cloud platforms like Snowflake and BigQuery enabled massive scalability
- Tools like Looker emerged for consistent and governed access
- The Challenge of Balancing:
- Data democratization is crucial
- Centralized control is essential
- Omni enables a hybrid approach:
- Both centralized teams and individual marketers can define and share metrics
- Everyone uses the same trusted data while maintaining flexibility
Transition:
- Underpinning BI, data models, and new architectures is observability
Slide 21
Clearing: I believe data pipelines are the backbone of any modern AI system.
Content:
- They’re not just for analytics anymore; they’re essential for the entire machine learning lifecycle
- Key functions of an intelligent pipeline:
- Ensures data quality through cleaning, transformation, and validation
- Enforces consistency using standardized formats
- Guarantees timely delivery
- Data observability acts as a health monitor (a minimal check is sketched below):
- Detect issues proactively
- Troubleshoot problems faster
- Build more trust in data
- Pipelines are getting more complex:
- Data coming from everywhere
- Need for real-time processing is growing rapidly
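For a sense of what proactive detection can look like in practice, here is a minimal, hypothetical set of observability checks (the table, columns, and thresholds are assumptions; dedicated tools like Monte Carlo automate this class of monitoring):

```python
# Simple data observability checks: freshness, volume, and null rate.
# Illustrative only - assumes a local DuckDB file with an `orders` table
# that has loaded_at and customer_id columns.
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Freshness: has new data arrived in the last 6 hours?
fresh = con.execute(
    "SELECT max(loaded_at) >= now() - INTERVAL 6 HOUR FROM orders"
).fetchone()[0]
assert fresh, "Stale table: orders"

# Volume: did today's load fall off a cliff versus the trailing daily average?
today_count, daily_avg = con.execute("""
    SELECT
        count(*) FILTER (WHERE loaded_at >= current_date),
        count(*) / 30.0
    FROM orders
    WHERE loaded_at >= current_date - INTERVAL 30 DAY
""").fetchone()
assert today_count > 0.5 * daily_avg, "Row count anomaly in orders"

# Quality: key columns should rarely be null.
null_rate = con.execute(
    "SELECT avg(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) FROM orders"
).fetchone()[0]
assert null_rate < 0.01, f"customer_id null rate too high: {null_rate:.2%}"
```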
Transition:
- With reliable and observable data flowing, we can leverage powerful new techniques, like…
Slide 22
Clearing: This slide really captures the essence of why intelligent data pipelines are so vital.
Content:
- They’re the backbone of any modern AI system
- Key elements include:
- INPUTS: databases, APIs, streaming data, IoT sensors
- Processing: ensuring quality, consistency, and timely delivery
- OUTPUTS: machine learning models, dashboards, applications
- Critical components:
- OBSERVABILITY and EVALS
- Constant monitoring
- Proactive issue detection
- Growing demands for:
- Speed and accuracy
- Consistency across AI and BI systems
- Meeting regulatory requirements
Slide 23
Clearing: Every transformation follows a pattern. Today, three powerful movements are reshaping how enterprises work with data.
Content:
- The Great Consolidation:
- After a decade of expanding complexity
- Companies are dramatically simplifying architectures
- Renaissance of scale-up computing:
- Distributed systems giving way to powerful single machines
- Python-first workflows
- Age of agentic data:
- AI actively manages data
- Production AI systems transform operations and insights
Transition:
- These aren’t isolated trends. They’re converging to create a fundamentally new way of working with data