
Meta Tech Podcast 72: Multimodal AI for Ray-Ban Meta glasses
Feb 28, 2025
Explore the fascinating world of multimodal AI and its application in Ray-Ban Meta glasses. Discover how the integration of image recognition technology enhances user interactions, and the challenges faced in wearable tech. Learn about the collaborative efforts among researchers and engineers that drive innovation forward. Delve into the empowering Be My Eyes initiative, which aids visually impaired users with audio guidance. Unlock the transformative potential of open source contributions in advancing AI, and experience the future of smart wearable technology!
AI Snips
AnyMAL: A Multimodal AI Model
- Shane's team published a paper on AnyMAL (Any-Modality Augmented Language Model).
- This model efficiently extends large language models to process multiple modalities like images, videos, and audio.
Encoder Zoo in AnyMAL
- AnyMAL leverages an "Encoder Zoo": a collection of pre-trained encoders, one per modality.
- Acting as perception modules, these encoders translate raw input signals into a feature space the language model can understand (a minimal sketch follows below).
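To make the encoder-zoo idea concrete, here is a minimal sketch assuming one frozen, pre-trained encoder per modality plus a small learned projector that maps its features into the language model's embedding space. All class names, dimensions, and token counts below are illustrative placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps a frozen encoder's pooled features into the LLM's token embedding space."""
    def __init__(self, encoder_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(encoder_dim, llm_dim * num_tokens)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, encoder_dim), e.g. the pooled output of a frozen encoder
        out = self.proj(features)  # (batch, llm_dim * num_tokens)
        # Reshape into a short sequence of "soft tokens" for the language model.
        return out.view(features.size(0), self.num_tokens, -1)

# "Encoder zoo": a frozen, pre-trained encoder per modality, each paired with
# its own lightweight projector. Dimensions here are assumed placeholders.
encoder_dims = {"image": 1024, "audio": 768, "video": 1024}
llm_dim = 4096  # hidden size of the (frozen) language model, assumed

projectors = nn.ModuleDict({
    modality: ModalityProjector(dim, llm_dim)
    for modality, dim in encoder_dims.items()
})

# At inference time, the projected modality tokens would be prepended to the
# text token embeddings and fed through the language model as one sequence.
image_features = torch.randn(1, encoder_dims["image"])  # stand-in for a CLIP-style output
image_tokens = projectors["image"](image_features)      # (1, 32, 4096)
```

In this setup only the projectors are trained; the encoders and the language model stay frozen, which is what makes extending an LLM to new modalities relatively cheap.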
Zero-Shot Performance of AnyMAL
- AnyMAL demonstrated strong zero-shot performance when reasoning across modalities, after being trained on captioning tasks.
- This suggests that teaching a model to describe a modality can unlock its ability to reason about that modality in new contexts (see the training-objective sketch below).
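A hedged sketch of what such a captioning objective could look like: standard next-token prediction on the caption text, conditioned on the projected modality tokens from the sketch above. The `inputs_embeds`/`logits` interface mirrors Hugging Face causal LMs; treat this as an assumption, not AnyMAL's exact training code.

```python
import torch
import torch.nn.functional as F

def captioning_loss(llm, modality_tokens: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss on a caption, conditioned on modality tokens.

    llm is assumed to accept `inputs_embeds` and return `.logits`,
    as Hugging Face causal language models do.
    """
    # Embed the caption tokens and prepend the projected modality tokens.
    text_embeds = llm.get_input_embeddings()(caption_ids)      # (B, T, D)
    inputs = torch.cat([modality_tokens, text_embeds], dim=1)  # (B, M+T, D)
    logits = llm(inputs_embeds=inputs).logits                  # (B, M+T, V)

    # Position M-1 (the last modality token) predicts the first caption
    # token, and each caption position predicts the next one; only the
    # caption tokens are supervised.
    M = modality_tokens.size(1)
    pred = logits[:, M - 1 : -1, :]  # (B, T, V)
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        caption_ids.reshape(-1),
    )
```

If describing an image, clip of video, or audio segment is all the model is ever trained to do, the zero-shot result described in the snip is the interesting part: that same conditioning apparently transfers to reasoning tasks it was never trained on.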
