
The Stack Overflow Podcast: Even GenAI uses Wikipedia as a source
Feb 20, 2026
Philippe Saade, AI project lead at Wikimedia Deutschland who led the Wikidata Embedding Project, talks about vectorizing millions of Wikidata items for semantic search. They cover reducing scraping pressure, how items were transformed into searchable text, and combining vector search with SPARQL for discovery and precise queries.
AI Snips
Graph-To-Text Conversion Is Essential
- Transforming graph items into text is necessary because most embedding models operate on textual inputs.
- Aggregation across connected items required multiple passes over the data dump to build suitable text for embeddings.
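The conversion described above can be sketched as a small function that flattens an item's label, description, and statements into one string. The item structure and the `labels` lookup here are illustrative stand-ins, not the project's actual schema; in practice the labels of connected items have to be gathered in an earlier pass over the dump.

```python
# Hypothetical sketch: render a Wikidata-style graph item as embedding text.
# Label table for referenced entities and properties, built in a prior pass.
labels = {"Q42": "Douglas Adams", "Q5": "human", "P31": "instance of"}

def item_to_text(item: dict) -> str:
    """Flatten an item's label, description, and statements into one string."""
    parts = [item["label"], item["description"]]
    for prop, value_qid in item["statements"]:
        # Resolve the property and connected item to human-readable labels.
        parts.append(f"{labels[prop]}: {labels[value_qid]}")
    return ". ".join(parts)

item = {
    "label": "Douglas Adams",
    "description": "English writer",
    "statements": [("P31", "Q5")],
}
print(item_to_text(item))
# → Douglas Adams. English writer. instance of: human
```

The resulting string is what gets fed to the embedding model in place of the raw graph record.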
Choose Useful Fields For Embedding Text
- Include labels, descriptions, aliases, and statement labels when building textual representations for embeddings.
- Exclude opaque external IDs and irrelevant raw identifiers from embedding input to avoid noise.
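A minimal sketch of that field selection, assuming a lookup of property datatypes (Wikidata marks opaque identifiers with the `external-id` datatype); the specific properties and item shown are illustrative:

```python
# Hypothetical property-datatype table; "external-id" marks opaque identifiers.
PROPERTY_DATATYPES = {
    "P31": "wikibase-item",  # instance of
    "P227": "external-id",   # GND ID (a raw identifier, noise for embeddings)
}

def build_embedding_text(item: dict) -> str:
    """Keep human-readable fields; drop external-ID statements."""
    parts = [item["label"], item["description"], *item.get("aliases", [])]
    for prop, value_label in item["statements"]:
        if PROPERTY_DATATYPES.get(prop) == "external-id":
            continue  # skip raw identifiers: they add noise, not meaning
        parts.append(value_label)
    return "; ".join(parts)

item = {
    "label": "Douglas Adams",
    "description": "English writer",
    "aliases": ["Douglas Noel Adams"],
    "statements": [("P31", "human"), ("P227", "119033364")],
}
print(build_embedding_text(item))
# → Douglas Adams; English writer; Douglas Noel Adams; human
```

Note the GND identifier is dropped entirely while the `instance of` value survives as its label.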
Publish Parquet Dumps To Ease Consumption
- Publish preprocessed Parquet files (e.g., on Hugging Face) so consumers can read row-by-row without stressing Wikimedia servers.
- Use columnar Parquet with connected labels to simplify downstream vectorization and reduce duplicate processing.
