AI Breakdown

agibreakdown
Nov 13, 2023 • 3min

ArXiv Preprint - CogVLM: Visual Expert for Pretrained Language Models

In this episode we discuss CogVLM: Visual Expert for Pretrained Language Models by Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang. CogVLM is an open-source visual language foundation model that improves the integration of vision and language by adding a trainable visual expert module to the attention and feed-forward layers of a pre-trained language model. Unlike shallow-alignment approaches, CogVLM deeply fuses visual and language features without sacrificing any natural language processing capability. It delivers state-of-the-art results on several cross-modal benchmarks, is competitive on others, and its code and resources are publicly available.
Nov 10, 2023 • 3min

ArXiv Preprint - De-Diffusion Makes Text a Strong Cross-Modal Interface

In this episode we discuss De-Diffusion Makes Text a Strong Cross-Modal Interface by Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu. The paper introduces De-Diffusion, a new approach that uses text to represent images. An autoencoder is used to transform an image into text, which can be reconstructed back into the original image using a pre-trained text-to-image diffusion model. The De-Diffusion text representation of images is shown to be accurate and comprehensive, making it compatible with various multi-modal tasks and achieving state-of-the-art performance on vision-language tasks.
Nov 9, 2023 • 3min

ArXiv Preprint - E3 TTS: Easy End-to-End Diffusion-based Text to Speech

In this episode we discuss E3 TTS: Easy End-to-End Diffusion-based Text to Speech by Yuan Gao, Nobuyuki Morioka, Yu Zhang, Nanxin Chen. The paper introduces Easy End-to-End Diffusion-based Text to Speech (E3 TTS), a text-to-speech model that converts text to audio via a diffusion process without intermediate representations or alignment information. E3 TTS iteratively refines an audio waveform directly from plain text, and its flexible latent structure enables zero-shot tasks such as editing. Experiments show it generates high-fidelity audio comparable to state-of-the-art neural TTS systems, with samples available online for evaluation.
Nov 8, 2023 • 3min

ArXiv Preprint - Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges

In this episode we discuss Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges by Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, Huaxiu Yao. The study introduces the Bingo benchmark to analyze hallucination behavior in GPT-4V(ision), a model processing both visual and textual data. Hallucinations, categorized as either bias or interference, reveal that GPT-4V(ision) prefers Western-centric images and is sensitive to how questions and images are presented, with established mitigation strategies proving ineffective. The findings expose similar issues in other leading visual-language models, suggesting an industry-wide challenge that necessitates novel solutions.
Nov 7, 2023 • 4min

ArXiv Preprint - Learning From Mistakes Makes LLM Better Reasoner

In this episode we discuss Learning From Mistakes Makes LLM Better Reasoner by Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, Weizhu Chen. The paper introduces LEarning from MistAkes (LEMA), a method that improves large language models' (LLMs) ability to solve math problems by fine-tuning them on GPT-4-generated mistake-correction data pairs. Each pair identifies an error in an LLM's reasoning, explains why the mistake occurred, and provides the correct solution. LEMA yields significant performance gains on mathematical reasoning tasks, surpassing the state-of-the-art performance of open-source models; the authors plan to release the code, data, and models publicly.
Nov 6, 2023 • 3min

ArXiv Preprint - The Generative AI Paradox: "What It Can Create, It May Not Understand"

In this episode we discuss The Generative AI Paradox: "What It Can Create, It May Not Understand" by Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, Yejin Choi. The paper examines a paradox in generative AI models: they excel at producing outputs yet struggle to comprehend them. The authors propose the Generative AI Paradox hypothesis, which holds that these models acquire superior generative abilities without correspondingly strong understanding abilities. Comparing humans and models on language and image tasks, they find that while models outperform humans in generation, they consistently lag behind in understanding, and they caution against drawing parallels between AI and human intelligence.
Nov 3, 2023 • 4min

ArXiv Preprint - TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise

In this episode we discuss TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise by Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao, Zhaohui Hou, Zhiyuan Huang, Shaoqing Lu, Ding Liang, Mingjie Zhan. The paper introduces TeacherLM, a series of language models designed to teach other models. The TeacherLM-7.1B model achieved a strong score on MMLU, outperforming models with far more parameters. It also provides data augmentation capabilities and has been used to teach multiple student models.
Nov 2, 2023 • 4min

ArXiv Preprint - MM-VID: Advancing Video Understanding with GPT-4V(ision)

In this episode we discuss MM-VID: Advancing Video Understanding with GPT-4V(ision) by Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang. The paper introduces MM-VID, a system that combines GPT-4V with vision, audio, and speech experts to enhance video understanding. It targets complex tasks such as tracking character storylines across multiple episodes. The paper showcases MM-VID's capabilities through detailed example responses and demonstrations.
Nov 1, 2023 • 4min

ArXiv Preprint - Zephyr: Direct Distillation of LM Alignment

In this episode we discuss Zephyr: Direct Distillation of LM Alignment by Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf. The paper introduces ZEPHYR, a language model focused on aligning with user intent to improve task accuracy. The authors start with distilled supervised fine-tuning (dSFT) on larger models but note that it alone does not align well with natural prompts. To address this, they use preference data from AI Feedback (AIF) and apply distilled direct preference optimization (dDPO) to improve intent alignment. Their approach, requiring only a few hours of training and no human annotation, achieves state-of-the-art performance on chat benchmarks, surpassing the best RLHF-based model, LLaMA2-Chat-70B, on MT-Bench.
Oct 31, 2023 • 4min

ArXiv Preprint - ControlLLM: Augment Language Models with Tools by Searching on Graphs

In this episode we discuss ControlLLM: Augment Language Models with Tools by Searching on Graphs by Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang. The paper introduces ControlLLM, a framework that enhances large language models (LLMs) by letting them invoke multi-modal tools for complex tasks. ControlLLM addresses challenges such as ambiguous prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. It consists of three components: a task decomposer, a Thoughts-on-Graph search paradigm, and an execution engine. Evaluated on tasks involving image, audio, and video processing, the framework outperforms existing methods in accuracy, efficiency, and versatility.
