LW - Research Post: Tasks That Language Models Don't Learn by Bruce W. Lee

Feb 23, 2024

The podcast discusses how large language models struggle to understand sensory aspects of language, presented through the H-Test benchmark. Key findings include plateauing performance with stronger models and minimal impact of example quantity.

Ask episode

Chapters

Transcript

Episode notes

Exploring the Limits of Language Models in Sensory Learning

00:00 • 4min

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Research Post: Tasks That Language Models Don't Learn, published by Bruce W. Lee on February 23, 2024 on LessWrong.
Abstract
We argue that there are certain properties of language that our current large language models (LLMs) don't learn. We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-Test. This benchmark highlights a fundamental gap between human linguistic comprehension, which naturally integrates sensory experiences, and the sensory-deprived processing capabilities of LLMs. In support of our hypothesis, 1.
deliberate reasoning (Chain-of-Thought), 2. few-shot examples, or 3. stronger LLM from the same model family (LLaMA 2 13B -> LLaMA 2 70B) do not trivially bring improvements in H-Test performance.
Therefore, we make a particular connection to the philosophical case of Mary, who learns about the world in a sensory-deprived environment. Our experiments show that some of the strongest proprietary LLMs stay near random chance baseline accuracy of 50%, highlighting the limitations of knowledge acquired in the absence of sensory experience.
Key Findings on H-Test
1. Insignificant Intra-Family Improvements.
A stronger model in the same model family often does not bring meaningful improvement in H-Test performance.
2. Number of Examples Has Minimal Impact.
The number of examples given neither increases nor decreases performance significantly, strongly hinting that the LLM is simply not learning from H-Test few-shot examples.
3. Deliberate Reasoning (CoT) Often Decreases Performance.
If LLMs benefit from such logical, step-by-step semantic reasoning on the H-Test, this can also imply that H-Test is fundamentally solvable by developing stronger language-only models. But CoT decreases performances in general.
4. Training with more orthography-specific language data does not improve H-Test.
We produced 1000 training instances per task in H-Test and fine-tuned gpt-3.5-turbo-0613 ten different times accordingly. After training for three epochs on each task, we evaluated them on H-Test at k = 50 and observed that no significant performance improvement was achieved.
5. Multi-modality does not automatically improve H-Test performance.
At the time of writing, LLaVA V1.6 34B is the strongest open-source multi-modal model available. Despite the addition of visual modality, we observe that simply incorporating visual data into the training does not result in a straightforward improvement in H-Test performance.
6 (Important). But the H-Test is solvable, we just don't know how.
In our paper, we have reported the seemingly unexplainable (jumping) performance improvement on H-Test from GPT-3.5 to GPT-4. This result is important as it shows that H-Test is indeed solvable (by a GPT-4-level system), but not through conventionally discussed language-only modeling techniques.
Acknowledgments and Links
I was inspired to do this research due to examples cases given in @Owain_Evans track application for the Constellation Fellowship. This research is not financially supported by a university lab, and Walnut Research is an independent research organization that runs on personal funding from myself and a few of my friends.
ArXiv: https://arxiv.org/abs/2402.11349
GitHub: https://github.com/brucewlee/H-Test
Twitter: https://x.com/BruceWLee1/status/1760736653747081681?s=20
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org