AI Breakdown

Mar 18, 2024 • 4min

arxiv preprint - MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

In this episode, we discuss MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training by Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang. This study investigates how different architectural components and data types impact the performance of Multimodal Large Language Models (MLLMs). The authors discovered that using a combination of different data types is crucial for high performance, and that the design of the image encoder is more influential than the vision-language connector. They applied these insights to create MM1, a series of state-of-the-art multimodal models with up to 30 billion parameters, which excel at few-shot learning and complex reasoning tasks.
Mar 15, 2024 • 4min

arxiv preprint - Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

In this episode, we discuss Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking by Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman. The paper presents Quiet-STaR, a method by which a language model teaches itself to generate internal rationales ("thoughts") at each token in order to improve its text predictions. It addresses the computational cost of generating continuations at every token position with a novel tokenwise parallel sampling algorithm and an extended teacher-forcing technique. The resulting model shows improved zero-shot performance on reasoning benchmarks and reduced perplexity without any task-specific fine-tuning, indicating a more scalable and general reasoning capability in language models.
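One piece of the approach can be illustrated in miniature: the model produces next-token logits both with and without an internal rationale, and a learned "mixing head" interpolates the two, so unhelpful thoughts can be down-weighted. A minimal numpy sketch of that interpolation (all values here are made up for illustration; the real mixing weight is learned):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab = 10
rng = np.random.default_rng(0)

# Next-token logits predicted directly from the context...
base_logits = rng.normal(size=vocab)
# ...and logits predicted after an internally generated rationale.
thought_logits = rng.normal(size=vocab)

# A mixing weight (produced by a learned head in the paper) interpolates
# the two predictions; w near 0 means the thought is effectively ignored.
w = 0.7
mixed = w * softmax(thought_logits) + (1 - w) * softmax(base_logits)
assert np.isclose(mixed.sum(), 1.0)  # still a valid distribution
```

Because the mixed output is a convex combination of two distributions, it is itself a valid distribution, which lets the thoughts be trained end-to-end against the ordinary language-modeling objective.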
Mar 14, 2024 • 4min

arxiv preprint - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

In this episode, we discuss WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? by Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, Alexandre Lacoste. The paper introduces WorkArena, a benchmark created to evaluate large language model-based agents that interact with web-based enterprise software like ServiceNow, along with BrowserGym, a tool for creating and testing these agents. The study assesses the agents' abilities to complete typical knowledge worker tasks, finding that while agents have potential in this area, there is still a substantial gap before achieving complete task automation. The results also reveal differences in the performances of open versus closed-source language models, pointing to a key direction for continued research and improvement.
Mar 13, 2024 • 5min

arxiv preprint - Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

In this episode, we discuss Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings by Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino. The paper introduces a method that combines Large Language Models (LLMs) and image generation models to synthetically create image-text pairs for training Visual-Language Models (VLMs), thus circumventing the need for extensive human-labeled data. Synthetic image embeddings, generated from LLM-produced captions, are used to effectively train VLMs, achieving a 17% performance improvement over baselines while using less data. Additionally, this synthetic data creation in the image embedding space is shown to be 25% faster than working in the pixel space, offering a scalable and efficient solution for enhancing VLM training.
Mar 12, 2024 • 4min

arxiv preprint - Is Cosine-Similarity of Embeddings Really About Similarity?

In this episode, we discuss Is Cosine-Similarity of Embeddings Really About Similarity? by Harald Steck, Chaitanya Ekanadham, Nathan Kallus. The paper investigates the use of cosine similarity to quantify semantic similarity between embedding vectors, and reveals potential pitfalls when it is applied to embeddings from regularized linear models. An analytical study of these models shows that cosine similarity can yield arbitrary or non-unique similarity values, because the choice of regularization implicitly (and often invisibly) determines the scaling of the latent dimensions. In light of these findings, the authors caution against uncritical use of cosine similarity on learned embeddings and suggest considering alternatives to ensure that semantic-similarity assessments are valid and interpretable.
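The core non-uniqueness argument is easy to demonstrate: a factored linear model W = A·B is unchanged if the latent dimensions are rescaled (A·D and D⁻¹·B give the same product), yet that rescaling changes the cosine similarities between embedding rows. A minimal numpy sketch (toy matrices, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# A factored linear model: predictions are A @ B, rows of A are "embeddings".
A = rng.normal(size=(3, 2))
B = rng.normal(size=(2, 4))

# Rescale the latent dimensions: A @ D and inv(D) @ B define the SAME model...
D = np.diag([10.0, 0.1])
A2, B2 = A @ D, np.linalg.inv(D) @ B
assert np.allclose(A @ B, A2 @ B2)  # identical predictions everywhere

# ...yet the cosine similarity between items 0 and 1 changes arbitrarily.
print(cos(A[0], A[1]), cos(A2[0], A2[1]))
```

Since both factorizations fit the training objective equally well, which one a regularized solver returns is an artifact of the regularization, not of the data, which is exactly why the resulting cosine similarities can be meaningless.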
Mar 11, 2024 • 4min

arxiv preprint - A Generative Approach for Wikipedia-Scale Visual Entity Recognition

In this episode, we discuss A Generative Approach for Wikipedia-Scale Visual Entity Recognition by Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid. The paper introduces a new Generative Entity Recognition (GER) framework for visual entity recognition, aimed at associating images with corresponding entities on Wikipedia, surpassing the typical dual-encoder and captioning model methods. GER functions by decoding a unique "code" linked to an entity from the image, facilitating effective identification. The authors' tests show that GER outperforms existing methods according to the OVEN benchmark, advancing the capabilities of web-scale image-based entity recognition.
Mar 8, 2024 • 4min

arxiv preprint - Self-correcting LLM-controlled Diffusion Models

In this episode, we discuss Self-correcting LLM-controlled Diffusion Models by Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell. The paper introduces Self-correcting LLM-controlled Diffusion (SLD), a novel approach that improves text-to-image generation through a closed loop in which an image is generated, evaluated against the text prompt by a large language model (LLM), and then corrected iteratively. SLD can be applied to existing diffusion models and is shown to produce more accurate images, particularly for prompts requiring an understanding of object counts, attributes, and spatial relations. The authors also highlight SLD's capability for image editing through prompt modification and announce their intention to make the code publicly available to foster further research.
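The control flow of such a generate–evaluate–correct loop can be sketched with stand-in functions. Everything below is hypothetical scaffolding (a toy scene represented as object counts, a checker that lists discrepancies with the prompt, a corrector that applies them), not SLD's actual generator, LLM checker, or latent-space editing operations:

```python
# Hypothetical stand-ins for SLD's three components.
def generate(prompt_spec):
    # toy "generator" that gets one object count wrong on the first try
    return {"cats": 1, "dogs": 2}

def check(scene, prompt_spec):
    # LLM-style evaluation: report every requirement the scene violates
    return {k: v for k, v in prompt_spec.items() if scene.get(k, 0) != v}

def correct(scene, fixes):
    # SLD edits the image's latents; this toy version just patches the dict
    return {**scene, **fixes}

prompt_spec = {"cats": 2, "dogs": 2}   # "two cats and two dogs"
scene = generate(prompt_spec)
for _ in range(3):                     # iterative self-correction loop
    fixes = check(scene, prompt_spec)
    if not fixes:                      # checker is satisfied: stop early
        break
    scene = correct(scene, fixes)
assert scene == prompt_spec
```

The bounded loop with an early exit mirrors the paper's design: correction rounds continue only until the evaluator finds no remaining discrepancies.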
Mar 8, 2024 • 4min

arxiv preprint - tinyBenchmarks: evaluating LLMs with fewer examples

In this episode, we discuss tinyBenchmarks: evaluating LLMs with fewer examples by Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin. The paper discusses strategies to minimize the number of evaluations required to effectively assess the performance of large language models on major benchmarks. By analyzing a popular QA benchmark called MMLU, the authors demonstrate that evaluating a language model on merely 100 well-chosen examples can yield an accurate estimate of its performance. The authors have developed and released evaluation tools and condensed versions of benchmarks including Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0, which have been empirically shown to reliably replicate the outcomes of the original expansive evaluations.
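The statistical intuition behind the paper's headline claim is easy to simulate: an accuracy computed over a well-chosen subset of ~100 items tracks the full-benchmark accuracy closely. The toy below uses plain random sampling on simulated correctness labels (the paper's actual method selects representative items via item response theory, and the 0.63 "true accuracy" here is made up):

```python
import random

random.seed(0)

# Hypothetical full benchmark: 14,000 items, each marked 1 if the model
# answered correctly (simulated here with a true accuracy of 0.63).
true_acc = 0.63
benchmark = [1 if random.random() < true_acc else 0 for _ in range(14_000)]

full_acc = sum(benchmark) / len(benchmark)

# Estimate from only 100 items; random sampling shown for simplicity,
# where tinyBenchmarks would pick 100 *representative* items instead.
subset = random.sample(benchmark, 100)
est_acc = sum(subset) / 100

print(f"full: {full_acc:.3f}  estimate from 100 items: {est_acc:.3f}")
```

With random sampling the standard error at n=100 is about √(p(1−p)/100) ≈ 0.05; the point of the paper's curated selection is to shrink that gap well below what random subsets achieve.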
Mar 6, 2024 • 4min

arxiv preprint - Asymmetry in Low-Rank Adapters of Foundation Models

In this episode, we discuss Asymmetry in Low-Rank Adapters of Foundation Models by Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh Ghassemi, Mikhail Yurochkin, Justin Solomon. The paper presents an analysis of Low-Rank Adaptation (LoRA), revealing an asymmetry in the roles of the two matrices (denoted B and A) that form the low-rank update to the network's weights. Fine-tuning the B matrix turns out to be far more important than fine-tuning A, to the extent that leaving A at its random initialization can suffice. Training only B improves parameter efficiency and yields tighter generalization bounds, which the authors validate experimentally on models such as RoBERTa and BART-Large, with resources shared on GitHub.
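In LoRA, the weight update is the low-rank product ΔW = B·A, so the asymmetric variant studied here simply freezes A at a random draw and trains B alone. A minimal numpy sketch of one such training step (toy dimensions and learning rate; not the paper's code, and omitting LoRA's usual α/r scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 16, 4

W0 = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in)) / r    # frozen at its random initialization
B = np.zeros((d_out, r))              # the only matrix that gets trained

def forward(x, B):
    # effective weight is W0 + B @ A
    return x @ (W0 + B @ A).T

x = rng.normal(size=(5, d_in))
y = rng.normal(size=(5, d_out))

# With B = 0 at init, the adapter leaves the base model's output unchanged.
assert np.allclose(forward(x, B), x @ W0.T)

# One gradient step on B alone (squared loss), A held fixed throughout.
err = forward(x, B) - y               # dL/d(pred) for L = 0.5 * ||err||^2
B -= 1e-3 * (err.T @ x) @ A.T         # chain rule through W = W0 + B @ A
```

Starting B at zero is standard LoRA practice: the adapted model initially matches the pretrained one exactly, and only the trained factor moves it away.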
Mar 5, 2024 • 4min

arxiv preprint - When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

In this episode, we discuss When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method by Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat. The paper investigates how various scaling factors impact the effectiveness of finetuning large language models (LLMs), focusing on full-model tuning (FMT) and parameter-efficient tuning (PET). Through experiments with bilingual LLMs and tasks like machine translation and summarization, the authors find that finetuning follows a joint scaling law where increasing model size is more beneficial than increasing the size of the pretraining data, and that PET's additional parameters typically don't improve performance. They conclude that the best finetuning approach depends on the specific task and the amount of finetuning data available, providing insights for selecting and improving LLM finetuning methods.
