
Data Engineering Podcast: Unlocking The Power of Data Lineage In Your Platform with OpenLineage
May 18, 2021
Julien Le Dem, a data engineer and CTO of Datakin, discusses the significance of data lineage in understanding data quality and pipeline impacts. He introduces OpenLineage, a project aimed at standardizing lineage metadata across various platforms, promoting collaboration among competing companies. Julien explains its core model and how it benefits data observability, trust, and reliability. He emphasizes the importance of community contributions and outlines the integration process, highlighting the pressing need for better tooling in pipeline observability.
Career Path That Spawned OpenLineage
- Julien traced the path that led him to lineage work: Yahoo and Pig, then Twitter, Parquet, Arrow, and WeWork.
- Building Marquez at WeWork exposed the need that led to Datakin and OpenLineage.
Publish Facets With JSON Schema Links
- When creating custom facets, publish a reachable JSON Schema URL so consumers can discover semantics.
- Include documentation and examples inside the schema to aid registry and tooling integration.
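As a rough illustration of the two points above, the sketch below builds a custom facet that carries a link to its own JSON Schema, with documentation and an example embedded in the schema. The `_producer` and `_schemaURL` fields follow the OpenLineage facet convention; the facet name `MyTeamOwnershipFacet`, the `team` field, the `make_facet` helper, and the `example.com` URLs are hypothetical and only stand in for whatever a real producer would publish.

```python
import json

# Hypothetical location of the published schema; consumers must be able to fetch it.
SCHEMA_URL = "https://example.com/schemas/MyTeamOwnershipFacet.json"

# JSON Schema for the hypothetical custom facet. The descriptions and the
# "examples" entry are the embedded documentation a registry or tool could show.
FACET_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": SCHEMA_URL,
    "title": "MyTeamOwnershipFacet",
    "description": "Hypothetical facet recording which team owns a dataset.",
    "type": "object",
    "properties": {
        "_producer": {
            "type": "string",
            "description": "URI identifying the code that emitted this facet.",
        },
        "_schemaURL": {
            "type": "string",
            "description": "URL of the JSON Schema describing this facet.",
        },
        "team": {
            "type": "string",
            "description": "Name of the owning team.",
        },
    },
    "required": ["_producer", "_schemaURL", "team"],
    "examples": [
        {
            "_producer": "https://example.com/my-producer",
            "_schemaURL": SCHEMA_URL,
            "team": "data-platform",
        }
    ],
}


def make_facet(team: str, producer: str) -> dict:
    """Build a facet instance that points back at its published schema."""
    return {"_producer": producer, "_schemaURL": SCHEMA_URL, "team": team}


if __name__ == "__main__":
    facet = make_facet("data-platform", "https://example.com/my-producer")
    # The schema would be served at SCHEMA_URL; the facet travels inside the event.
    print(json.dumps(FACET_SCHEMA, indent=2))
    print(json.dumps(facet, indent=2))
```

A consumer that has never seen this facet can follow `_schemaURL`, read the descriptions, and decide how to display or index the `team` field.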
Decouple Collection From Storage Design
- OpenLineage records raw metadata; storage/indexing is the consumer's responsibility.
- Storing facets as JSON and indexing only the fields you query balances flexibility with query performance.
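One way a consumer might apply this split is sketched below, using SQLite from the Python standard library. The facets are kept as an opaque JSON string, and only the fields used for lookups get their own indexed columns. The table name `lineage_events`, its columns, and the helper functions are assumptions made for illustration; they are not part of OpenLineage or Marquez.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create a hypothetical event table: indexed lookup columns plus raw JSON facets."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS lineage_events (
            run_id     TEXT NOT NULL,
            job_name   TEXT NOT NULL,
            event_time TEXT NOT NULL,
            facets     TEXT NOT NULL  -- facets stored as-is, as a JSON string
        )
        """
    )
    # Index only the fields this consumer actually queries on.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_events_job "
        "ON lineage_events (job_name, event_time)"
    )
    return conn


def store_event(conn, run_id: str, job_name: str, event_time: str, facets: dict) -> None:
    """Persist one event; unknown or custom facets need no schema change."""
    conn.execute(
        "INSERT INTO lineage_events VALUES (?, ?, ?, ?)",
        (run_id, job_name, event_time, json.dumps(facets)),
    )
    conn.commit()


def facets_for_job(conn, job_name: str) -> list:
    """Return the facets of every event for a job, oldest first."""
    rows = conn.execute(
        "SELECT facets FROM lineage_events WHERE job_name = ? ORDER BY event_time",
        (job_name,),
    )
    return [json.loads(row[0]) for row in rows]


if __name__ == "__main__":
    conn = open_store()
    store_event(conn, "run-1", "daily_orders", "2021-05-18T00:00:00Z",
                {"myTeam_ownership": {"team": "data-platform"}})
    print(facets_for_job(conn, "daily_orders"))
```

If a new facet later turns out to matter for queries, the consumer can add a column or index for it without changing how events are collected.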
