How AI Is Built

#026 Embedding Numbers, Categories, Locations, Images, Text, and The World

Oct 10, 2024
Mór Kapronczay, Head of ML at Superlinked, unpacks the nuances of embeddings beyond just text. He emphasizes that traditional text embeddings fall short, especially with complex data. Mór introduces multi-modal embeddings that integrate various data types, improving search relevance and user experiences. He also discusses challenges in embedding numerical data, suggesting innovative methods like logarithmic transformations. The conversation delves into balancing speed and accuracy in vector searches, highlighting the dynamic nature of real-time data prioritization.
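The logarithmic-transformation idea for numerical data can be sketched as follows. This is an illustrative example only, not Superlinked's exact method: it maps a positive number (say, a price) onto a bounded log scale so that relative differences (10 vs. 100) carry more weight than absolute gaps at the high end.

```python
import math

def embed_number(x: float, x_max: float) -> float:
    """Map a positive number to [0, 1] on a log scale.

    Hypothetical helper: log1p compresses large values so that
    relative differences dominate, then we normalize by the
    corpus maximum to keep the feature comparable across items.
    """
    return math.log1p(x) / math.log1p(x_max)

embed_number(10, 1000)   # small but nonzero
embed_number(1000, 1000) # maps the maximum to 1.0
```

The resulting scalar can then be concatenated (with a weight) alongside text-embedding dimensions in a single vector.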
ADVICE

Optimize Size And Iterate With Query Weights

  • Keep vectors only as large as needed and use quantization (e.g. float16 or 8-bit) to save memory in vector DBs.
  • Iterate quickly by changing query-side weights instead of re-embedding the corpus.
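Both tips above can be sketched in a few lines of NumPy. The dimension split (text in dims 0–127, a numeric modality in dims 128–255) is a made-up layout for illustration; the point is that casting to float16 halves storage, and re-weighting the query vector changes modality importance without touching the stored corpus.

```python
import numpy as np

# Hypothetical corpus of full-precision embeddings from any encoder.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 256)).astype(np.float32)

# Quantize to float16: half the memory, usually little retrieval loss.
corpus_fp16 = corpus.astype(np.float16)
print(corpus.nbytes // corpus_fp16.nbytes)  # → 2

def weight_query(q: np.ndarray, w_text: float, w_num: float) -> np.ndarray:
    """Re-weight modality blocks on the QUERY side only.

    Assumes (for illustration) dims 0:128 hold the text part and
    dims 128:256 the numeric part of a concatenated embedding.
    """
    out = q.copy()
    out[:128] *= w_text
    out[128:] *= w_num
    return out
```

Because dot-product similarity is linear in the query, scaling a query block scales that modality's contribution to every score, which is why no re-embedding is needed.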
ADVICE

Set Modality Weights Per Query

  • Dynamically set modality weights per query by detecting intent in the natural-language query.
  • Train user-specific weight predictors if you have labels to personalize modality importance.
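A minimal sketch of per-query intent detection, using a toy keyword heuristic (a real system might use a trained classifier or an LLM, as the snip suggests; the keywords and weight values here are invented for illustration):

```python
def modality_weights(query: str) -> dict[str, float]:
    """Toy intent detector: boost the numeric modality when the
    query signals price sensitivity, otherwise favor text.
    Illustrative only; weights and triggers are made up.
    """
    q = query.lower()
    if any(tok in q for tok in ("cheap", "price", "under", "$")):
        return {"text": 0.4, "number": 0.6}
    return {"text": 0.8, "number": 0.2}

modality_weights("cheap laptops under $500")  # boosts the number modality
modality_weights("sci-fi novels about AI")    # favors the text modality
```

The returned weights would then feed a query-side re-weighting step, so no corpus re-embedding is required per query.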
ADVICE

Evaluate With Labels Or A/B Tests

  • Evaluate new embeddings with IR metrics when labels exist and run A/B tests when they don't.
  • Start by eyeballing common queries to justify larger experiments before full A/B testing.
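When labels exist, a standard IR metric such as nDCG@k makes embedding comparisons concrete. A self-contained sketch (binary relevance for simplicity; libraries like scikit-learn offer graded versions):

```python
import math

def ndcg_at_k(ranked_ids: list, relevant: set, k: int = 10) -> float:
    """nDCG@k with binary relevance: discounted gain of relevant
    docs in the top k, normalized by the best achievable ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

ndcg_at_k([1, 2, 3], {1}, k=3)  # relevant doc ranked first → 1.0
```

Comparing mean nDCG@k across a labeled query set for the old vs. new embeddings gives the offline signal; the A/B test then confirms it on live traffic.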