AI Breakdown

ArXiv Preprint - S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Nov 21, 2023
Researchers present S-LoRA, a system for efficiently serving a large number of Low-Rank Adaptation (LoRA) adapters on top of a single base language model, using optimized memory management and computation strategies. They explain unified paging, a memory-management scheme that holds adapter weights and KV-cache tensors in a shared memory pool, and batched inference across heterogeneous adapters that minimizes communication and memory overheads.
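As a rough illustration of the batched-inference idea (a minimal NumPy sketch, not S-LoRA's actual kernels; the adapter names and rank here are made up), requests that use different LoRA adapters can still share one batched base-model matmul, with each request's low-rank delta applied separately:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                      # hidden size, LoRA rank (illustrative values)

W = rng.normal(size=(d, d))      # shared base weight, used by every request
# Two hypothetical adapters, each a low-rank pair (A, B)
adapters = {
    "adapter_0": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "adapter_1": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

x = rng.normal(size=(4, d))                       # batch of 4 requests
assignment = ["adapter_0", "adapter_1", "adapter_0", "adapter_1"]

base = x @ W                                      # one batched base matmul
out = base.copy()
for i, name in enumerate(assignment):             # per-request low-rank delta
    A, B = adapters[name]
    out[i] += x[i] @ A @ B                        # x (A B) adds the LoRA update
```

The point of the sketch is that the expensive dense computation (`x @ W`) is batched once regardless of which adapter each request uses; only the cheap rank-`r` updates differ per request.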