Highlights

  • vLLM is an open-source LLM inference and serving library that delivers up to 24x higher throughput than HuggingFace Transformers and powers @lmsysorg's Vicuna and Chatbot Arena.
  • The core of vLLM is PagedAttention, a novel attention algorithm that brings the classic idea of paging in OS's virtual memory to LLM serving. Without modifying the model, PagedAttention can batch 5x more sequences together, increasing GPU utilization and thus the throughput.
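
The paging idea in the highlight above can be sketched roughly as follows: the KV cache of each sequence is split into fixed-size blocks, and a per-sequence block table maps logical block indices to physical blocks allocated on demand, like pages in an OS's virtual memory. This is a minimal illustrative sketch, not vLLM's actual implementation; all names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are invented for illustration.

```python
# Hypothetical sketch of PagedAttention-style KV-cache paging.
# Names and structure are illustrative, not vLLM's real API.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out physical block ids from a shared free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

class Sequence:
    """Tracks one sequence's logical-to-physical block mapping."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the current block is full,
        # so memory grows on demand instead of being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, token_idx: int) -> tuple[int, int]:
        # Translate a logical token position to (physical block id, offset).
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(20):
    seq.append_token()
print(len(seq.block_table))   # 20 tokens need only 2 blocks of 16 slots
print(seq.physical_slot(17))  # token 17 lives at offset 1 of the second block
```

Because blocks are small and shared from one pool, many sequences can be packed into the same GPU memory with little fragmentation, which is what lets the scheduler batch far more sequences together.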