The Stack Overflow Podcast

Even GenAI uses Wikipedia as a source

Feb 20, 2026
Philippe Saade, AI project lead at Wikimedia Deutschland and lead of the Wikidata Embedding Project, discusses vectorizing millions of Wikidata items for semantic search. Topics include reducing scraping pressure on Wikimedia servers, how graph items were transformed into searchable text, and combining vector search with SPARQL for discovery and precise queries.
INSIGHT

Graph-To-Text Conversion Is Essential

  • Transforming graph items into text is necessary because most embedding models operate on textual inputs.
  • Aggregation across connected items required multiple passes over the data dump to build suitable text for embeddings.
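The two-pass idea above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the item structure and field names are simplified stand-ins for the real Wikidata dump format, and `label_index` represents the first pass that collects labels so connected item IDs can be resolved on the second pass.

```python
def item_to_text(item, label_index):
    """Render a graph item as plain text for an embedding model,
    resolving connected item IDs to labels via label_index
    (built in a separate, earlier pass over the dump)."""
    parts = [item["label"], item.get("description", "")]
    for prop_label, value in item.get("statements", []):
        # Resolve Q-IDs of connected items to human-readable labels.
        resolved = label_index.get(value, value)
        parts.append(f"{prop_label}: {resolved}")
    return ". ".join(p for p in parts if p)

# Pass 1 (stand-in): a label index built over the whole dump.
label_index = {"Q145": "United Kingdom", "Q5": "human"}

# Pass 2: render each item with connected labels resolved.
douglas = {
    "label": "Douglas Adams",
    "description": "English writer and humorist",
    "statements": [("instance of", "Q5"),
                   ("country of citizenship", "Q145")],
}
print(item_to_text(douglas, label_index))
# → Douglas Adams. English writer and humorist. instance of: human. country of citizenship: United Kingdom
```

The second pass is what makes the aggregation step necessary: without the prebuilt index, each statement value would stay an opaque Q-ID and contribute nothing useful to the embedding.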
ADVICE

Choose Useful Fields For Embedding Text

  • Include labels, descriptions, aliases, and statement labels when building textual representations for embeddings.
  • Exclude opaque external IDs and irrelevant raw identifiers from embedding input to avoid noise.
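A field-selection step along these lines might look like the sketch below. The statement layout and the `datatype` marker are assumptions modeled loosely on Wikidata's data model, where external-identifier properties carry an `external-id` datatype; the exact filtering logic used in the project is not specified here.

```python
def embedding_fields(item):
    """Select text-worthy fields: label, description, aliases, and
    (property label, value) pairs, skipping statements whose
    datatype marks an opaque external identifier."""
    texts = [item.get("label", ""), item.get("description", "")]
    texts.extend(item.get("aliases", []))
    for st in item.get("statements", []):
        if st.get("datatype") == "external-id":
            continue  # e.g. a VIAF or IMDb ID: noise for an embedding model
        texts.append(f'{st["property"]}: {st["value"]}')
    return [t for t in texts if t]

item = {
    "label": "Douglas Adams",
    "description": "English writer and humorist",
    "aliases": ["Douglas Noel Adams"],
    "statements": [
        {"property": "occupation", "value": "novelist",
         "datatype": "wikibase-item"},
        {"property": "IMDb ID", "value": "nm0010930",
         "datatype": "external-id"},
    ],
}
print(embedding_fields(item))
# → ['Douglas Adams', 'English writer and humorist', 'Douglas Noel Adams', 'occupation: novelist']
```

The raw identifier string (`nm0010930`) is dropped: it carries no semantic content an embedding model could use, while the human-readable fields all survive.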
ADVICE

Publish Parquet Dumps To Ease Consumption

  • Publish preprocessed Parquet files (e.g., on Hugging Face) so consumers can read row-by-row without stressing Wikimedia servers.
  • Use columnar Parquet with connected labels to simplify downstream vectorization and reduce duplicate processing.