The Stack Overflow Podcast

Even GenAI uses Wikipedia as a source

Feb 20, 2026
Philippe Saade, AI project lead at Wikimedia Deutschland and lead of the Wikidata Embedding Project, discusses vectorizing millions of Wikidata items for semantic search. Topics include reducing scraping pressure on Wikimedia servers, how graph items were transformed into searchable text, and combining vector search with SPARQL for discovery and precise queries.
INSIGHT

Graph-To-Text Conversion Is Essential

  • Transforming graph items into text is necessary because most embedding models operate on textual inputs.
  • Aggregation across connected items required multiple passes over the data dump to build suitable text for embeddings.
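The two-pass idea above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the item structure and field names are simplified stand-ins for the real Wikidata dump format, and `label_index` represents the first pass that collects labels so connected item IDs can be resolved on the second pass.

```python
def item_to_text(item, label_index):
    """Render a graph item as plain text for an embedding model,
    resolving connected item IDs to labels via label_index
    (built in a separate, earlier pass over the dump)."""
    parts = [item["label"], item.get("description", "")]
    for prop_label, value in item.get("statements", []):
        # Resolve Q-IDs of connected items to human-readable labels.
        resolved = label_index.get(value, value)
        parts.append(f"{prop_label}: {resolved}")
    return ". ".join(p for p in parts if p)

# Pass 1 (stand-in): a label index built over the whole dump.
label_index = {"Q145": "United Kingdom", "Q5": "human"}

# Pass 2: render each item with connected labels resolved.
douglas = {
    "label": "Douglas Adams",
    "description": "English writer and humorist",
    "statements": [("instance of", "Q5"),
                   ("country of citizenship", "Q145")],
}
print(item_to_text(douglas, label_index))
# → Douglas Adams. English writer and humorist. instance of: human. country of citizenship: United Kingdom
```

The second pass is what makes the aggregation step necessary: without the prebuilt index, each statement value would stay an opaque Q-ID and contribute nothing useful to the embedding.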
ADVICE

Choose Useful Fields For Embedding Text

  • Include labels, descriptions, aliases, and statement labels when building textual representations for embeddings.
  • Exclude opaque external IDs and irrelevant raw identifiers from embedding input to avoid noise.
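A field-selection step along these lines might look like the sketch below. The statement layout and the `datatype` marker are assumptions modeled loosely on Wikidata's data model, where external-identifier properties carry an `external-id` datatype; the exact filtering logic used in the project is not specified here.

```python
def embedding_fields(item):
    """Select text-worthy fields: label, description, aliases, and
    (property label, value) pairs, skipping statements whose
    datatype marks an opaque external identifier."""
    texts = [item.get("label", ""), item.get("description", "")]
    texts.extend(item.get("aliases", []))
    for st in item.get("statements", []):
        if st.get("datatype") == "external-id":
            continue  # e.g. a VIAF or IMDb ID: noise for an embedding model
        texts.append(f'{st["property"]}: {st["value"]}')
    return [t for t in texts if t]

item = {
    "label": "Douglas Adams",
    "description": "English writer and humorist",
    "aliases": ["Douglas Noel Adams"],
    "statements": [
        {"property": "occupation", "value": "novelist",
         "datatype": "wikibase-item"},
        {"property": "IMDb ID", "value": "nm0010930",
         "datatype": "external-id"},
    ],
}
print(embedding_fields(item))
# → ['Douglas Adams', 'English writer and humorist', 'Douglas Noel Adams', 'occupation: novelist']
```

The raw identifier string (`nm0010930`) is dropped: it carries no semantic content an embedding model could use, while the human-readable fields all survive.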
ADVICE

Publish Parquet Dumps To Ease Consumption

  • Publish preprocessed Parquet files (e.g., on Hugging Face) so consumers can read row-by-row without stressing Wikimedia servers.
  • Use columnar Parquet with connected labels to simplify downstream vectorization and reduce duplicate processing.