11 minute read / Jan 22, 2025
Top Themes in Data Transcript
Slide 1
Clearing: While the data world consolidates, capabilities have exploded with AI.
Content:
- AI is rewriting every rule about what’s possible with data
- Those two forces in tension will make for an exciting 2025
Slide 2
Clearing: My name is Tomasz Tunguz, founder and general partner at Theory.
Content:
- I’ve been investing in data for the last 17 years and have worked with companies like Looker, Monte Carlo, Hex, Omni, Tobiko Data, and MotherDuck
- I founded Theory, a venture firm managing $700M with the idea that all modern software companies will be underpinned by data and AI
- We run a research-oriented firm, informed by 200 buyers of data and AI software
Transition:
- These are the themes that we predict within the world of data
Slide 3
Clearing: Every transformation follows a pattern. Today, three powerful movements are reshaping how enterprises work with data.
Content:
- First, we’re witnessing the Great Consolidation. After a decade of expanding complexity in the modern data stack, companies are dramatically simplifying their architectures - and getting better results
- Second, we’re seeing a renaissance of scale-up computing. The distributed systems that dominated the 2010s are giving way to powerful single machines and Python-first workflows
- Third, we’re entering the age of agentic data - where AI doesn’t just analyze data, but actively manages it. Production AI systems are transforming both how we operate our data systems and how we extract insights from them
Transition:
- These aren’t isolated trends. They’re converging to create a fundamentally new way of working with data
Slide 4
Clearing: Let’s talk about the great consolidation.
Content:
- We’ve seen the modern data stack explode over the last few years
- There’s a tool for everything
Transition:
- But this has led to a lot of complexity
Slide 5
Clearing: Buyers are overwhelmed. I’m hearing more and more of them say, “Don’t sell me another tool!”
Content:
- They want simplification, not more point solutions
- Companies want to optimize costs. Fewer vendors mean fewer licenses and less overhead
- The office of the CFO is pressuring data leaders for ROI from billions invested over the last decade
- We will see enterprises standardizing on particular technologies, particularly the broadest ones, even if the individual point solutions are not the best in that layer
- Expect more mergers and acquisitions as companies try to assemble their versions of the most prized data layers
Transition:
- This consolidation is pushing us towards more flexible and scalable data architectures, driven not only by cost and simplicity but also capabilities, which brings us to…
Slide 6
Clearing: That MacBook Pro should be called a Mainframe Pro. It’s just that powerful.
Content:
- I use my MacBook Pro to run 70-billion-parameter models, which are roughly equivalent to GPT-3.5
- With that kind of power, I can develop the vast majority of data workloads on my local machine
Transition:
- As a new generation of developers, especially Python developers, starts working with data, they prefer local-first development. Scale-up architectures let them start small and migrate their workloads to bigger machines, which satisfy more than 80% of current workloads
Slide 7
Clearing: Decoupling storage and compute is all about unlocking flexibility.
Content:
- We are not talking about the scale-out architecture that separated storage and compute for Snowflake
- Instead, we’re talking about a logical separation between the query engine and the data storage
- Traditionally, these have been tightly coupled. But now, we’re seeing them decoupled, with technologies like Iceberg leading the way
- This allows us to:
- Use different query engines for different tasks, optimizing for both price and performance
- Create intellectual property around AI by building proprietary models
- Improve data governance, access control, and privacy compliance
- New query engines emerging:
- DuckDB is an in-process analytical database designed for efficient queries on larger datasets
- DataFusion is an extensible query engine written in Rust
- We’re also seeing greater use of Python data wrangling tools (a local-first sketch follows below):
- dlt (data load tool) is a Python library for building data ingestion pipelines
- Polars is a fast and efficient DataFrame library similar to Dask
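To make this concrete, here is a minimal sketch of the local-first pattern, assuming the duckdb and polars Python packages and a hypothetical local flights.parquet file:

```python
# Local-first analytics: query a Parquet file in-process with DuckDB,
# then hand the result to Polars for further wrangling.
# Assumes `pip install duckdb polars` and a local flights.parquet file (hypothetical).
import duckdb
import polars as pl

con = duckdb.connect()  # in-process engine, no server or cluster to manage

# The query engine reads the file directly, so storage and compute stay decoupled.
result = con.execute("""
    SELECT origin, COUNT(*) AS total_flights
    FROM 'flights.parquet'
    GROUP BY origin
    ORDER BY total_flights DESC
    LIMIT 10
""").pl()  # return the result as a Polars DataFrame

# Continue wrangling in Polars on the laptop.
top_airports = result.with_columns(
    (pl.col("total_flights") / pl.col("total_flights").sum()).alias("share")
)
print(top_airports)
```

The same query could later run against Iceberg tables or a bigger machine without rewriting the analysis.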
Transition:
- Centralized control of data and purpose-built data engines enable AI
Slide 8
Clearing: AI is changing the way software and data engineering teams work together.
Content:
- Jensen Huang, the CEO of NVIDIA, has a great way of putting it. He says the IT department of the future will be like the HR department for AI agents
- We’ll be managing and ’training’ these agents to work with our data
Transition:
- This change starts first within the engineering org
Slide 9
Clearing: Historically, there’s been a divide between software engineering and AI/ML teams.
Content:
- AI teams often worked downstream of the application, building offline models for analysis, clustering, and segmentation, often combined with the work of financial analysts
- Data engineering teams and software engineering teams are writing separate pipelines
- Operating in separate environments with different technologies
- Merging the two over the last decade has been extremely difficult
- At the same time, managing these parallel systems can be extremely expensive
Transition:
- AI changes this topology
Slide 10
Clearing: AI is a core part of many products, and in the future, every software company will be an AI company.
Content:
- Data scientists are now building production models
- Software engineers are hitting AI endpoints to build agents inside modern applications
- Python has become the dominant language of AI and a popular language for software development
- There’s an opportunity to fuse those two environments
- Data teams need to adopt software engineering best practices including:
- Virtual development environments
- Regression and integration testing (a sketch follows below)
- Cost optimization
- Tobiko Data’s SQLMesh reduces cloud data warehouse costs by 50% while also enabling this transition to virtual development environments
- We’re seeing this occur within our startups
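As an illustration of what those practices can look like on a data team, here is a minimal, hypothetical regression test written with pytest and Polars; the revenue_by_region transform and its expected totals are invented for the example:

```python
# test_transformations.py - a sketch of regression testing a data transform.
# Assumes `pip install polars pytest`; revenue_by_region is a hypothetical transform.
import polars as pl


def revenue_by_region(orders: pl.DataFrame) -> pl.DataFrame:
    """Aggregate order revenue by region (the transform under test)."""
    return (
        orders.group_by("region")
        .agg(pl.col("revenue").sum().alias("total_revenue"))
        .sort("region")
    )


def test_revenue_by_region_totals():
    # A small, fixed input acts as the regression fixture.
    orders = pl.DataFrame(
        {"region": ["east", "west", "east"], "revenue": [100.0, 250.0, 50.0]}
    )
    result = revenue_by_region(orders)

    # The expected output is pinned; any behavior change fails the build.
    expected = pl.DataFrame(
        {"region": ["east", "west"], "total_revenue": [150.0, 250.0]}
    )
    assert result.equals(expected)
```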
Transition:
- Speaking of cost, let’s talk about the expense of AI
Slide 11
Clearing: In the 24 months after ChatGPT was released, a parameter race was unleashed in which model sizes became ever larger, culminating most recently with Llama 3.1 at 405 billion parameters.
Content:
- These electron-guzzling monoliths are incredibly powerful, containing a compressed version of the roughly 20 trillion words written on the internet and the ability to process them
- At the same time, there have been parallel research efforts optimizing smaller and smaller models
Transition:
- While large models are essential in use cases where the universe of inputs is infinite, not every business workload needs a Wikipedia on every API call
Slide 12
Clearing: Databricks’ most recent State of Data + AI report, published earlier this year, shows that small models are the most popular.
Content:
- Small models now represent a majority of deployed AI models
- In interviews with AI buyers, the pressure from the CFO is stark
- In contrast to the decade of data, when spending grew unabated for the 12 years before 2022, cost pressure on AI has existed from day one
- With financial pressure, resourceful data teams have resorted to smaller models
Transition:
- But it is not performance at any price
Slide 13
Clearing: Plotting MMLU (roughly a high-school-equivalency benchmark) over time, you can see that small, medium, and large models are converging around 70 to 80% accuracy.
Content:
- This isn’t a one-time trend
- Overall AI inference costs have fallen 1000x in the US in the last three years
- Newer models might cost two orders of magnitude less to train
- Jevons Paradox is in full force - OpenAI materially underestimated how much people would use their software
Transition:
- With performance relatively similar, it’s no surprise enterprises are moving to smaller models. But it’s not just about performance equivalency
Slide 14
Clearing: In addition, smaller models offer significantly better latency.
Content:
- Latency is three to four times better with a smaller model
- Google found that latency has a significant, roughly linear impact on engagement with search results
- It’s no different within modern software applications
- Smaller models offer significantly better user experience
Transition:
- And they do it at a fraction of the cost. Just how big is the cost difference?
Slide 15
Clearing: Docspot tracks these prices and plots them on a logarithmic chart.
Content:
- Gemini’s 8-billion-parameter Flash model costs about 10 cents
- OpenAI’s GPT-4 costs more than $60
- That’s more than two orders of magnitude of difference - roughly 600x more expensive (see the quick arithmetic below)
- Some new AI architectures run multiple queries for the same user workflow to ensure higher accuracy
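The back-of-the-envelope math makes the gap tangible; the monthly token volume below is an assumption, and the per-million-token prices are the rough figures from the chart:

```python
# Rough cost comparison between a small and a frontier model.
# Prices are approximate dollars per 1M tokens; the volume is a made-up assumption.
SMALL_MODEL_PRICE = 0.10   # e.g., a small "flash"-class model
LARGE_MODEL_PRICE = 60.00  # e.g., a frontier model

monthly_tokens_millions = 5_000  # assume 5B tokens per month across an application

small_cost = SMALL_MODEL_PRICE * monthly_tokens_millions
large_cost = LARGE_MODEL_PRICE * monthly_tokens_millions

print(f"Small model: ${small_cost:,.0f}/month")       # $500/month
print(f"Frontier model: ${large_cost:,.0f}/month")    # $300,000/month
print(f"Difference: {large_cost / small_cost:,.0f}x")  # 600x
```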
Transition:
- Smaller models offer near-equivalent performance, significantly lower latency, and orders-of-magnitude lower cost. We believe they will be dominant within the enterprise. But smaller models do require one thing
Slide 16
Clearing: Data modeling isn’t just back - it’s become the foundation of reliable AI.
Content:
- Without it, we’re building AI castles on sandy data
- Our current AI models are text models, not numerical models
- To drive maximum performance we need to model the data
- This limits the universe of potential outcomes and dramatically improves quality
- Data modeling significantly improves the developer experience for software engineers
Transition:
- Let me show you what I mean
Slide 17
Clearing: Here I created a little TypeScript application that processes the famous FAA data. I did this in 15 minutes.
Content:
- I recorded a video of my request to show me the busiest airports by total flights in 2023
- The text-to-SQL model underpinning this hits a data model
- The data model provides additional context that helps translate questions into the structure of the underlying database (a sketch of this pattern follows below)
- For large enterprises with tens of thousands of tables, this is the only way to drive accuracy
- This provides a great API endpoint for software engineers to hit
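A minimal sketch of the pattern (not the actual application): the schema description stands in for the data model, the table and column names are invented, and the OpenAI Python SDK is used only as one possible text-to-SQL backend:

```python
# Text-to-SQL constrained by a data model: the model description limits the
# universe of queries the LLM can generate, and DuckDB executes the result.
# Assumes `pip install duckdb openai` and an OPENAI_API_KEY in the environment.
import duckdb
from openai import OpenAI

# A tiny "data model": the only tables, columns, and semantics the LLM may use.
DATA_MODEL = """
Table flights(origin_airport TEXT, dest_airport TEXT, flight_date DATE)
- One row per scheduled flight.
- "Busiest airport" means the most rows where the airport is the origin_airport.
"""

client = OpenAI()

def text_to_sql(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Translate the user's question into DuckDB SQL. "
                    f"Use only this data model:\n{DATA_MODEL}\n"
                    "Return SQL only, with no explanation."
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

sql = text_to_sql("Show me the busiest airports by total flights in 2023")
print(duckdb.sql(sql))  # run the generated query against the local flight data
```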
Transition:
- The impact of enabling AI to work within data organizations is not trivial
Slide 18
Clearing: Many leading organizations are already starting to use AI in a meaningful way.
Content:
- 25% of new code at Google is written by AI
- Microsoft and ServiceNow have both reported 50% developer productivity boosts
- Amazon saved $275 million migrating one version of Java to another using AI
- These productivity impacts will benefit data teams
- Models need to understand the underlying data through data models
- Once a data model is in place, we can build applications on top
- This data model will basically be an ORM for the entire data stack
Transition:
- Imagine being the first data team to save your company $10 million by producing the right analysis for the CFO or the board, especially in this environment of consolidation. That’s a surefire way to earn a promotion! One of the first applications of these models is BI, and BI is changing too
Slide 19
Clearing: Data governance isn’t about control anymore - it’s about enablement.
Content:
- The best governance frameworks today are built on collaboration, not restriction
- The core of BI is data governance
- It may look like fancy charts, but the most important thing is providing accurate data
- Data teams face a dilemma:
- Decentralized access means greater accessibility but more risk of misinterpretation
- Data centralization means higher quality data but less velocity
Transition:
- We’re finally reaching a place where you can have both
Slide 20
Clearing: The business intelligence ecosystem has been a pendulum oscillating between centralized and decentralized control.
Content:
- Early 2000s: The Era of Centralized BI
- Companies like MicroStrategy, Cognos, BusinessObjects, and Hyperion
- Powerful but slow and IT-dependent reporting solutions
- High accuracy, low agility
- 2003: The Rise of Self-Service Analytics
- Tableau revolutionized the industry
- Empowered business users to directly access and analyze data
- The Cloud Data Warehouse Revolution:
- Cloud platforms like Snowflake and BigQuery enabled massive scalability
- Tools like Looker emerged for consistent and governed access
- The Challenge of Balancing:
- Data democratization is crucial
- Centralized control is essential
- Omni enables a hybrid approach:
- Both centralized teams and individual marketers can define and share metrics
- Everyone uses the same trusted data while maintaining flexibility
Transition:
- Underpinning BI, data models, and new architectures is observability
Slide 21
Clearing: I believe data pipelines are the backbone of any modern AI system.
Content:
- They’re not just for analytics anymore; they’re essential for the entire machine learning lifecycle
- Key functions of an intelligent pipeline:
- Ensures data quality through cleaning, transformation, and validation
- Enforces consistency using standardized formats
- Guarantees timely delivery
- Data observability acts as a health monitor (a minimal check is sketched below):
- Detect issues proactively
- Troubleshoot problems faster
- Build more trust in data
- Pipelines are getting more complex:
- Data coming from everywhere
- Need for real-time processing is growing rapidly
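For a sense of what proactive detection can look like in practice, here is a minimal, hypothetical set of observability checks (the table, columns, and thresholds are assumptions; dedicated tools like Monte Carlo automate this class of monitoring):

```python
# Simple data observability checks: freshness, volume, and null rate.
# Illustrative only - assumes a local DuckDB file with an `orders` table
# that has loaded_at and customer_id columns.
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Freshness: has new data arrived in the last 6 hours?
fresh = con.execute(
    "SELECT max(loaded_at) >= now() - INTERVAL 6 HOUR FROM orders"
).fetchone()[0]
assert fresh, "Stale table: orders"

# Volume: did today's load fall off a cliff versus the trailing daily average?
today_count, daily_avg = con.execute("""
    SELECT
        count(*) FILTER (WHERE loaded_at >= current_date),
        count(*) / 30.0
    FROM orders
    WHERE loaded_at >= current_date - INTERVAL 30 DAY
""").fetchone()
assert today_count > 0.5 * daily_avg, "Row count anomaly in orders"

# Quality: key columns should rarely be null.
null_rate = con.execute(
    "SELECT avg(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) FROM orders"
).fetchone()[0]
assert null_rate < 0.01, f"customer_id null rate too high: {null_rate:.2%}"
```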
Transition:
- With reliable and observable data flowing, we can leverage powerful new techniques, like…
Slide 22
Clearing: This slide really captures the essence of why intelligent data pipelines are so vital.
Content:
- They’re the backbone of any modern AI system
- Key elements include:
- INPUTS: databases, APIs, streaming data, IoT sensors
- Processing: ensuring quality, consistency, and timely delivery
- OUTPUTS: machine learning models, dashboards, applications
- Critical components:
- OBSERVABILITY and EVALS
- Constant monitoring
- Proactive issue detection
- Growing demands for:
- Speed and accuracy
- Consistency across AI and BI systems
- Meeting regulatory requirements
Slide 23
Clearing: Every transformation follows a pattern. Today, three powerful movements are reshaping how enterprises work with data.
Content:
- The Great Consolidation:
- After a decade of expanding complexity
- Companies are dramatically simplifying architectures
- Renaissance of scale-up computing:
- Distributed systems giving way to powerful single machines
- Python-first workflows
- Age of agentic data:
- AI actively manages data
- Production AI systems transform operations and insights
Transition:
- These aren’t isolated trends. They’re converging to create a fundamentally new way of working with data