The Hugging Face Transformers library offers a flexible, unified interface to a vast ecosystem of model architectures. From research to fine-tuning on custom datasets, transformers is the go-to toolkit.
But when it comes to deploying these models at scale, inference speed and efficiency often take center stage. Enter vLLM, a library engineered for high-throughput inference, pulling models from the Hugging Face Hub and optimizing them for production-ready performance.
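As a quick illustration of that workflow, here is a minimal offline-inference sketch. The checkpoint name is only an example; any text-generation model from the Hub should work:

```python
from vllm import LLM, SamplingParams

# vLLM pulls the checkpoint from the Hugging Face Hub and runs it on its own
# optimized engine (paged attention, continuous batching).
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(max_tokens=64, temperature=0.8)

outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
```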
A recent addition to the vLLM codebase enables leveraging transformers as a backend to run models. vLLM will therefore optimize throughput/latency on top of existing transformers architectures. In this post, we’ll explore how vLLM leverages the transformers backend to combine flexibility with efficiency, enabling you to deploy state-of-the-art models faster and smarter.
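Opting into the backend is a single argument, a sketch assuming a recent vLLM release that exposes the `model_impl` option (leaving it at `"auto"` lets vLLM pick an implementation on its own):

```python
from vllm import LLM, SamplingParams

# Same API as before; model_impl="transformers" asks vLLM to build the model
# from its transformers modeling code instead of a native vLLM implementation.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", model_impl="transformers")

outputs = llm.generate(["Tell me a joke."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```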
vLLM’s inference is noticeably faster and more resource-efficient, especially under load. For example, it can handle thousands of requests per second with lower GPU memory usage.
Beyond raw performance, vLLM offers an OpenAI-compatible API, making it a drop-in replacement for external services. Launch a server:
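```bash
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

Then query it with the official OpenAI client pointed at the local endpoint. The model name above is just an example, and the port/API key below assume vLLM's defaults (port 8000, no key required):

```python
from openai import OpenAI

# The server speaks the OpenAI chat/completions protocol, so the standard
# client works once base_url targets the local vLLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
)
print(completion.choices[0].message.content)
```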
The transformers library is optimized for contributions and the addition of new models. Adding a new model to vLLM, on the other hand, is a little more involved.
In an ideal world, we would be able to use a new model in vLLM as soon as it is added to transformers. The transformers backend integration is a step toward that ideal.
Here is the official documentation on how to make your transformers model compatible with vLLM for the integration to kick in. We followed this and made modeling_gpt2.py compatible with the integration! You can follow the changes in this transformers pull request.
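For reference, the sketch below shows the pattern as we understand it from that documentation, not the actual GPT-2 diff: attention routed through transformers' `ALL_ATTENTION_FUNCTIONS` registry, `**kwargs` propagated through the forward pass, and the model class opting in via `_supports_attention_backend`. Class names like `MyAttention` and `MyModel` are placeholders.

```python
from typing import Callable

import torch
from torch import nn
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Plain softmax attention, used when config._attn_implementation == "eager".
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value).transpose(1, 2).contiguous()
    return attn_output, attn_weights


class MyAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.head_dim = config.hidden_size // config.num_attention_heads
        self.scaling = self.head_dim ** -0.5
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        bsz, seq_len, _ = hidden_states.shape
        shape = (bsz, seq_len, -1, self.head_dim)
        query = self.q_proj(hidden_states).view(shape).transpose(1, 2)
        key = self.k_proj(hidden_states).view(shape).transpose(1, 2)
        value = self.v_proj(hidden_states).view(shape).transpose(1, 2)

        # Requirement 1: route attention through the registry keyed by
        # config._attn_implementation, so another backend (sdpa, flash-attn,
        # or vLLM's attention) can be swapped in without touching the model.
        attention_interface: Callable = eager_attention_forward
        if self.config._attn_implementation != "eager":
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        # Requirement 2: propagate **kwargs, since a backend may need extra
        # arguments beyond what eager attention uses.
        attn_output, _ = attention_interface(
            self, query, key, value, attention_mask, scaling=self.scaling, **kwargs
        )
        attn_output = attn_output.reshape(bsz, seq_len, -1)
        return self.o_proj(attn_output)


class MyModel(PreTrainedModel):
    config_class = PretrainedConfig
    # Requirement 3: explicitly opt in so vLLM knows the attention above can be
    # replaced by its own attention backend.
    _supports_attention_backend = True
```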
This backend acts as a bridge, marrying transformers’ plug-and-play flexibility with vLLM’s inference prowess. You get the best of both worlds: rapid prototyping with transformers and optimized deployment with vLLM.