AI Breakdown

agibreakdown
Oct 9, 2023 • 4min

arXiv Preprint - Improved Baselines with Visual Instruction Tuning

In this episode we discuss Improved Baselines with Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee. The authors propose enhancements to the LLaVA framework for large multimodal models (LMMs) with visual instruction tuning. By incorporating CLIP-ViT-L-336px with an MLP projection and academic-task-oriented VQA data, they achieve superior performance on multiple benchmarks. These simple modifications to LLaVA yield state-of-the-art multimodal understanding while using a smaller dataset and a shorter training time than comparable methods.
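The MLP projection mentioned above can be sketched in plain Python. This is an illustrative toy, not the paper's implementation: all dimensions and weights are made up, and the real connector maps high-dimensional CLIP visual features into the LLM's embedding space.

```python
# Toy sketch of a two-layer MLP vision-language projector
# (illustrative sizes and weights, not the paper's code).

VISION_DIM, HIDDEN_DIM, LLM_DIM = 4, 6, 8

def linear(x, weight, bias):
    # weight is a list of output-neuron weight vectors (one per output dim).
    return [sum(xi * wi for xi, wi in zip(x, w)) + b
            for w, b in zip(weight, bias)]

def relu(v):
    # Stand-in nonlinearity between the two layers.
    return [max(0.0, x) for x in v]

def mlp_projector(visual_feat, w1, b1, w2, b2):
    # Two-layer MLP: vision dim -> hidden dim -> LLM embedding dim.
    return linear(relu(linear(visual_feat, w1, b1)), w2, b2)

# Deterministic toy weights so the example runs end to end.
w1 = [[0.5] * VISION_DIM] * HIDDEN_DIM
b1 = [0.0] * HIDDEN_DIM
w2 = [[0.1] * HIDDEN_DIM] * LLM_DIM
b2 = [0.0] * LLM_DIM

token = mlp_projector([1.0] * VISION_DIM, w1, b1, w2, b2)
print(len(token))  # one LLM-space token per visual feature vector
```

The point of the design is that the projector is the only new piece between a frozen-style vision encoder and the language model: each visual feature vector becomes one token in the LLM's embedding space.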
Oct 8, 2023 • 3min

arXiv Preprint - Tree of Thoughts: Deliberate Problem Solving with Large Language Models

In this episode we discuss Tree of Thoughts: Deliberate Problem Solving with Large Language Models by Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan. The authors introduce a framework called "Tree of Thoughts" (ToT) to enhance language model inference. The ToT framework allows language models to make deliberate decisions by exploring multiple reasoning paths and self-evaluating choices. The authors demonstrate the effectiveness of ToT on three tasks requiring non-trivial planning or search (Game of 24, Creative Writing, and Mini Crosswords), showing significant improvement in problem-solving ability compared to standard prompting methods.
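The search pattern behind ToT can be sketched with a toy breadth-first search. Everything here is illustrative: the task (pick numbers from a pool that sum to a target) is hypothetical, and the `propose` and `evaluate` functions stand in for what are LLM calls in the actual framework.

```python
# Toy sketch of Tree-of-Thoughts-style beam search. In the real framework,
# propose() and evaluate() are both language-model calls; here they are
# simple stand-ins on a made-up numeric task.

POOL = [2, 3, 5, 7, 8]
TARGET = 15
BEAM_WIDTH = 3

def propose(state):
    # Propose next "thoughts": extend the partial pick with one more number.
    return [state + [n] for n in POOL if n not in state]

def evaluate(state):
    # Self-evaluate a partial solution; here, closeness to the target sum.
    return -abs(TARGET - sum(state))

def tree_of_thoughts(depth=3):
    frontier = [[]]                     # start from the empty thought
    for _ in range(depth):
        candidates = [s for state in frontier for s in propose(state)]
        # Keep only the most promising partial solutions (beam search).
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:BEAM_WIDTH]
    return frontier[0]

best = tree_of_thoughts()
print(best, sum(best))
```

The contrast with chain-of-thought prompting is that many partial "thoughts" are kept alive at once and pruned by explicit self-evaluation, rather than committing to a single left-to-right continuation.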
Oct 7, 2023 • 4min

NeurIPS 2023 - Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

In this episode we discuss Evaluating Cognitive Maps and Planning in Large Language Models with CogEval by Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, Jonathan Larson. The paper presents CogEval, a protocol designed to evaluate the cognitive abilities of Large Language Models (LLMs). Noting the lack of rigorous evaluation in previous studies claiming human-level cognitive abilities in LLMs, the authors propose CogEval as a framework for systematic evaluation. They apply CogEval to assess the cognitive maps and planning skills of eight different LLMs, finding that while the models perform well on simpler planning tasks, they exhibit significant failure modes, such as hallucinations and getting trapped in loops, indicating a lack of understanding of the underlying cognitive structures.
Oct 6, 2023 • 4min

ICCV 2023 - Diffusion Models as Masked Autoencoders

In this episode we discuss Diffusion Models as Masked Autoencoders by Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer. The authors present a method called Diffusion Models as Masked Autoencoders (DiffMAE) that combines generative pre-training with diffusion models for visual data. They show that DiffMAE can be a strong initialization for recognition tasks, perform high-quality image inpainting, and achieve state-of-the-art classification accuracy for video. The paper emphasizes the need to consider the specific challenges and requirements of downstream tasks when using generative pre-training.
Oct 5, 2023 • 4min

arXiv Preprint - Conditional Diffusion Distillation

In this episode we discuss Conditional Diffusion Distillation by Kangfu Mei, Mauricio Delbracio, Hossein Talebi, Zhengzhong Tu, Vishal M. Patel, Peyman Milanfar. The authors propose a new method called conditional distillation to speed up the sampling of diffusion models in text-to-image generation. The method incorporates image conditions to enhance the diffusion priors and enable conditional sampling with fewer steps. It simplifies distillation by directly distilling an unconditionally pre-trained model in a single stage through joint learning, and it outperforms existing distillation techniques at the same sampling time.
Oct 4, 2023 • 4min

arXiv Preprint - Enable Language Models to Implicitly Learn Self-Improvement From Data

In this episode we discuss Enable Language Models to Implicitly Learn Self-Improvement From Data by Ziqi Wang, Le Hou, Tianjian Lu, Yuexin Wu, Yunxuan Li, Hongkun Yu, Heng Ji. The paper introduces a framework called ImPlicit Self-ImprovemenT (PIT) that allows large language models (LLMs) to learn self-improvement from data. PIT learns the improvement goal from human preference data without requiring explicit rubrics, making it more efficient and effective compared to previous approaches that rely on explicit inputs. Experimental results show that PIT outperforms prompting-based methods in enhancing LLM performance.
Oct 3, 2023 • 4min

arXiv Preprint - Efficient Streaming Language Models with Attention Sinks

In this episode we discuss Efficient Streaming Language Models with Attention Sinks by Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. The paper proposes StreamingLLM, a framework that allows Large Language Models (LLMs) to generalize to infinite sequence lengths without fine-tuning. Observing the "attention sink" phenomenon, in which models allocate disproportionately high attention to the initial tokens regardless of their relevance, the authors show that keeping the Key and Value states of these tokens cached preserves the efficiency and stability of window attention. They demonstrate that StreamingLLM outperforms the sliding-window recomputation baseline in streaming applications, with a speedup of up to 22.2x.
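The cache-eviction policy described above can be sketched in a few lines. This is a toy illustration rather than the authors' code: integers stand in for per-token Key/Value pairs, and the sink and window sizes are arbitrary.

```python
# Toy sketch of the attention-sink cache policy: retain the KV entries of
# the first few "sink" tokens plus a recent window, evicting the middle
# as the stream grows. Integers stand in for real key/value tensors.

NUM_SINKS = 4
WINDOW = 8

def evict(cache):
    # Keep sink tokens + the most recent WINDOW tokens; drop the middle.
    if len(cache) > NUM_SINKS + WINDOW:
        return cache[:NUM_SINKS] + cache[-WINDOW:]
    return cache

cache = []
for t in range(20):          # token positions streaming in
    cache.append(t)          # stand-in for the token's key/value pair
    cache = evict(cache)

print(cache)                 # sinks 0-3 survive; recent tokens fill the rest
```

The key observation is that a plain sliding window discards the initial tokens and degrades badly; pinning those few sink positions in the cache keeps the memory footprint constant while maintaining quality.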
Oct 2, 2023 • 4min

NeurIPS 2023 - PuzzleFusion: Unleashing the Power of Diffusion Models for Spatial Puzzle Solving

In this episode we discuss PuzzleFusion: Unleashing the Power of Diffusion Models for Spatial Puzzle Solving by Sepidehsadat Hosseini, Mohammad Amin Shabani, Saghar Irandoust, Yasutaka Furukawa. The paper introduces PuzzleFusion, a neural architecture based on diffusion models for spatial puzzle solving. It focuses on jigsaw puzzle solving and room arrangement tasks, using new datasets that include synthetic ones generated from Voronoi diagrams and a real dataset from MagicPlan. PuzzleFusion outperforms competing methods in both qualitative and quantitative evaluations.
Oct 1, 2023 • 3min

arXiv Preprint - Vision Transformers Need Registers

In this episode we discuss Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski. The paper addresses artifacts found in the feature maps of Vision Transformers (ViTs), which appear in low-informative background areas of images. By appending additional tokens called "registers" to the input sequence, the feature maps and attention maps become cleaner, leading to better visual processing. The fix works for both supervised and self-supervised ViT models, sets a new state of the art for self-supervised models on dense visual prediction tasks, and enables object discovery methods to work with larger models.
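The register mechanism amounts to a small change in how the input sequence is assembled. A toy sketch follows (hypothetical token counts; strings stand in for embedding vectors): register tokens are appended before the transformer and simply discarded at the output.

```python
# Toy sketch of register tokens in a ViT input sequence. Strings stand in
# for learnable embedding vectors; the transformer itself is elided.

NUM_PATCHES = 196    # e.g. 14x14 patches of a 224px image with 16px patches
NUM_REGISTERS = 4    # illustrative register count

def with_registers(patch_tokens, register_tokens):
    # [CLS] token + patch tokens + register tokens form the input sequence.
    return ["CLS"] + patch_tokens + register_tokens

patches = [f"patch_{i}" for i in range(NUM_PATCHES)]
registers = [f"reg_{i}" for i in range(NUM_REGISTERS)]
seq = with_registers(patches, registers)

# After the transformer, the register outputs are dropped; only the CLS
# and patch outputs are used downstream.
outputs = seq                        # stand-in for the transformer forward pass
kept = outputs[: 1 + NUM_PATCHES]
print(len(seq), len(kept))
```

The registers give the model dedicated "scratch" slots for the global computation it would otherwise stash in low-information patch tokens, which is what produced the artifacts.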
Sep 30, 2023 • 5min

arXiv Preprint - VPA: Fully Test-Time Visual Prompt Adaptation

In this episode we discuss VPA: Fully Test-Time Visual Prompt Adaptation by Jiachen Sun, Mark Ibrahim, Melissa Hall, Ivan Evtimov, Z. Morley Mao, Cristian Canton Ferrer, Caner Hazirbas. The paper presents Visual Prompt Adaptation (VPA), a framework that extends prompt tuning to visual recognition tasks. VPA allows for test-time adaptation without source-domain information and improves out-of-distribution generalization, corruption robustness, domain adaptation, and zero-shot recognition. Experimental results show improvements of 3.3% in OOD generalization, 6.5% in corruption robustness, and 5.2% in domain adaptation.
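A toy sketch of test-time prompt adaptation in the spirit of VPA: only a small prompt vector is updated while the model stays frozen. Everything here is illustrative, including the stand-in "model" and the entropy-minimization objective, which is a generic test-time adaptation signal and not necessarily the paper's exact loss.

```python
import math

# Toy sketch of fully test-time prompt adaptation: the model is frozen,
# and a small prompt vector is updated on the test input to reduce the
# entropy of the prediction (a common test-time adaptation objective).

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def frozen_model(x, prompt):
    # Stand-in "model": logits depend on the input plus the prompt offset.
    return [xi + pi for xi, pi in zip(x, prompt)]

def adapt_prompt(x, prompt, lr=0.1, steps=50, eps=1e-4):
    # Finite-difference gradient descent on prediction entropy w.r.t. the
    # prompt only; the model parameters are never touched.
    for _ in range(steps):
        base = entropy(softmax(frozen_model(x, prompt)))
        grad = []
        for i in range(len(prompt)):
            bumped = prompt[:]
            bumped[i] += eps
            grad.append((entropy(softmax(frozen_model(x, bumped))) - base) / eps)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

x = [1.0, 0.5, 0.2]
before = entropy(softmax(frozen_model(x, [0.0, 0.0, 0.0])))
prompt = adapt_prompt(x, [0.0, 0.0, 0.0])
after = entropy(softmax(frozen_model(x, prompt)))
print(before, after)
```

Because only the prompt is optimized, adaptation is cheap and needs no source-domain data at test time, which is the "fully test-time" aspect emphasized in the summary.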
