
692: Lossless LLM Weight Compression: Run Huge Models on a Single GPU
Super Data Science: ML & AI Podcast with Jon Krohn
00:00
Lossless LLM Weight Compression and Fine-Tuning Large Models on a Single GPU
Explore how the SpQR method enables near-lossless LLM weight compression, allowing large models to run on a single GPU without compromising accuracy. The chapter also introduces QLoRA, a method combining quantization with low-rank adaptation to fine-tune open-source large language models on a single GPU.
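To make the low-rank adaptation idea behind QLoRA concrete, here is a minimal, hypothetical NumPy sketch (sizes and names are illustrative, not from the episode): the base weight matrix stays frozen (in QLoRA it would also be quantized), and only two small low-rank factors are trained.

```python
import numpy as np

# Illustrative sketch of low-rank adaptation (LoRA), the adapter idea QLoRA builds on:
# instead of updating a full weight matrix W (d_out x d_in), train two small
# matrices A (r x d_in) and B (d_out x r) with rank r << min(d_out, d_in).
rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 4               # illustrative sizes, not from the episode
W = rng.standard_normal((d_out, d_in))   # frozen (in QLoRA: quantized) base weights
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                 # B starts at zero, so the adapter is a no-op at init

def adapted_forward(x, alpha=8.0):
    """Forward pass with the LoRA update W + (alpha / r) * B @ A applied to input x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# At initialization B == 0, so the adapted output equals the base output.
print(np.allclose(adapted_forward(x), W @ x))

# Parameter savings: a full update trains d_out * d_in values,
# while LoRA trains only r * (d_in + d_out).
print(d_out * d_in, r * (d_in + d_out))
```

Because only `A` and `B` receive gradients, the trainable parameter count drops from 4096 to 512 in this toy example, which is why the combination with 4-bit quantized base weights fits large-model fine-tuning on a single GPU.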
Transcript


