
Adventures in Multimodal AI

Thoughts on the transformative and unprecedented power of today's tools.
Image Source: Google Gemini/Nano Banana via Claude Code

Readers of my site know that I'm all in on Claude. That doesn't mean, though, that we're in an exclusive relationship. As I recently wrote, foolish are the souls who only carry one AI club in their bags.

Against that backdrop and inspired by Joanna Stern's hysterical new book I Am Not a Robot: My Year Using AI to Do (Almost) Everything, I decided to noodle a little more with Google's NotebookLM.

Going From Text to Audio

Yesterday, I launched NotebookLM and promptly uploaded Chapter 1 of Low-Code/No-Code: Citizen Developers and the Surprising Future of Business Applications. (Download the original PDF excerpt here.) I then gave it the following prompt:

Image: Prompt to NotebookLM

Half an hour later, NotebookLM dutifully spit out two professional, twentyish-minute interviews on topics discussed in that chapter. The final products are anything but robotic. Rather, they sound remarkably natural, complete with the occasional vocal tic.

Here they are, along with their AI-generated descriptions.

Broken IT Departments

In this interview-style podcast, author Phil Simon discusses the struggle of modern workplace technology, focusing on the critical shortage of software developers and how IT backlogs lead employees to resort to shadow IT.

Trillions Spent

This overview examines the persistent chasm between leadership's positive view of workplace tech and the reality of employee frustration, despite trillions of dollars in global IT spending.

A Minor Modification

Ultimately, I wanted these AI-generated conversations to live here. The only issue: unnecessarily large file sizes. Each .m4a file ran about 40 MB. Before uploading them to Cloudflare and embedding them here, a little shrinkage was in order. (Cue obligatory Seinfeld reference.) Claude Code wrote the Terminal command to shrink them. Its reply:

Before deciding on hosting, you should compress these first. M4A files at 40+ MB are almost certainly encoded at higher than necessary bitrates for voice audio. FFmpeg can shrink them dramatically without audible quality loss:

ffmpeg -i Why_your_company_IT_is_so_broken.m4a -b:a 64k output.m4a

64 kbps is fine for speech. That would take a 40 MB file down to roughly 8–10 MB, a 75–80% reduction.

True to form, Claude's Terminal command shrank each file considerably.
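For the curious, the 8–10 MB estimate is easy to sanity-check. Here is a minimal sketch, not anything Claude produced. It assumes a constant 64 kbps audio stream, a run time of about 20 minutes (my "twentyish" figure above), and 1 MB = 1,000,000 bytes, and it ignores container overhead. The function names estimate_mb and percent_saved are my own hypothetical helpers.

```shell
#!/bin/sh
# Rough size estimate for constant-bitrate audio.
# Assumptions: constant bitrate, no container overhead, 1 MB = 1,000,000 bytes.

# Usage: estimate_mb BITRATE_KBPS DURATION_MINUTES
estimate_mb() {
  # kilobits per second -> bytes per second: * 1000 / 8
  # minutes -> seconds: * 60
  bytes=$(( $1 * 1000 / 8 * $2 * 60 ))
  # Print megabytes with one decimal place
  awk -v b="$bytes" 'BEGIN { printf "%.1f\n", b / 1000000 }'
}

# Usage: percent_saved SIZE_BEFORE SIZE_AFTER (same unit for both)
percent_saved() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.0f\n", (before - after) / before * 100 }'
}

estimate_mb 64 20       # prints 9.6
percent_saved 40 9.6    # prints 76
```

A 20-minute file at 64 kbps lands at about 9.6 MB, roughly 76% smaller than the 40 MB original, which squares with the range quoted above. Real-world results will vary a little because encoders treat the bitrate as a target rather than a guarantee.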


Simon Says: AI jargon is everywhere, but expect this term to stick.

I could quibble a bit with the results, but NotebookLM's output impressed the hell out of me. Just three years ago, transforming raw text, much less a sophisticated PDF, into audio of this quality would have taken much, much longer.

This simple example illustrates the unprecedented power of today's AI tools. They are truly new, and it's no coincidence that use of the term multimodal is spiking. Unlike much AI jargon, expect this one and AI slop to stick.


🤖
Disclosure: I wrote this post myself but used Claude to finesse its ending.
