
Machine Learning Street Talk (MLST): Facebook Research - Unsupervised Translation of Programming Languages
Jun 24, 2020
Marie-Anne Lachaux, Baptiste Roziere, and Guillaume Lample are researchers at Facebook AI Research (FAIR) in Paris who work on the unsupervised translation of programming languages. They discuss their method, which uses shared embeddings and a shared tokenization to translate between programming languages without parallel training data. The conversation covers the balance between human insight and machine learning in coding, the challenges posed by structural differences between languages, and the collaborative culture at FAIR.
AI Snips
Shared Vocabulary
- Shared vocabularies and word piece tokenization help align different languages in unsupervised translation.
- Special language tokens guide the decoder to generate the correct target language.
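As a rough illustration of the two points above, here is a minimal sketch of a vocabulary shared across languages plus a language token that starts the decoder input. It is not the researchers' implementation: the function names, the `<java>`/`<python>` token spellings, and the crude whole-token tokenizer (used here in place of real word-piece tokenization) are all assumptions made for the example.

```python
import re

def tokenize(code):
    # Crude stand-in for subword tokenization (assumption): identifiers,
    # numbers, or single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def build_shared_vocab(corpora):
    # One vocabulary for every language, so an identical token such as
    # "for" gets the same id (and hence the same embedding row) everywhere.
    specials = ["<pad>", "<s>", "</s>"] + [f"<{lang}>" for lang in sorted(corpora)]
    vocab = {tok: i for i, tok in enumerate(specials)}
    for lang in sorted(corpora):
        for snippet in corpora[lang]:
            for tok in tokenize(snippet):
                vocab.setdefault(tok, len(vocab))
    return vocab

def encode(vocab, code):
    # Map source code to ids in the shared vocabulary.
    return [vocab[tok] for tok in tokenize(code)]

def decoder_prefix(vocab, target_lang):
    # The first decoder input is a language token (hypothetical spelling)
    # that tells the model which language to generate.
    return [vocab[f"<{target_lang}>"]]

corpora = {
    "python": ["for i in range(n): total += i"],
    "java": ["for (int i = 0; i < n; i++) { total += i; }"],
}
vocab = build_shared_vocab(corpora)

print(encode(vocab, "total += i"))    # same ids whichever language it came from
print(decoder_prefix(vocab, "java"))  # differs from the prefix for "python"
```

Because both languages index into one table, any token they have in common is tied to a single embedding, and only the prefix token changes when the target language changes.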
Unsupervised Translator
- The researchers trained an unsupervised translator for programming languages like Java, Python, and C++.
- Previous methods were mostly rule-based; they required extensive expertise and did not generalize well.
Anchor Points
- Unsupervised translation of code relies on common tokens (anchor points) like keywords and variable names.
- These anchor points are crucial for aligning cross-lingual representations.
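To make the idea of anchor points concrete, the sketch below collects the tokens that appear in both a Python and a C++ sample. The snippets, the `anchor_points` helper, and the simple tokenizer are hypothetical and only illustrate the idea of shared tokens; they are not taken from the episode.

```python
import re

def tokenize(code):
    # Crude tokenizer (assumption): identifiers, numbers, or single
    # punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def anchor_points(snippets_a, snippets_b):
    # Tokens that occur in both corpora. With a shared vocabulary these
    # map to the same embedding in both languages.
    tokens_a = {tok for snippet in snippets_a for tok in tokenize(snippet)}
    tokens_b = {tok for snippet in snippets_b for tok in tokenize(snippet)}
    return tokens_a & tokens_b

python_code = ["while count < limit: count += 1", "return count"]
cpp_code = ["while (count < limit) { count++; }", "return count;"]

anchors = anchor_points(python_code, cpp_code)
print(sorted(anchors))
```

Keywords such as `while` and `return` and identifiers such as `count` end up in the intersection, while language-specific punctuation such as `:` or `{` does not.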



