A Guide to Structured Generation Using Constrained Decoding

Metadata

Author: Impromptu Engineer
Full Title: A Guide to Structured Generation Using Constrained Decoding
URL: https://www.aidancooper.co.uk/constrained-decoding/

Highlights

In the context of generative language models, structured generation encompasses a range of techniques that aim to generate outputs with a desired structure. Structured output could take any form depending on the user’s requirements, but the archetypal example would be a JSON object with a specific schema. Another example of structured output could be strings that match a regex pattern, such as an email address, a telephone number, or even just a simple “Y” or “N”. Combining these ideas, a common structured generation goal could be to output a JSON object whose keys conform to a desired schema and whose values are consistent with expected data types, enumerations, and regex patterns. Another variation on structured generation is the use of context-free grammars, which generate outputs that follow sets of rules that ensure validity (e.g., to form a working SQL query). (View Highlight)
Structured generation can be achieved in various ways, including:
1. Prompt design: the simplest approach is to include a description of the desired output structure inside the prompt. This could also involve few-shot prompting, whereby examples of inputs and outputs are included in the prompt.
2. Fine-tuning: a model can be subjected to further training for a specialised task using input-output pairs that demonstrate the desired output structure. This will incline the model to generate similarly-shaped responses during inference.
3. Multi-stage prompting: rather than have the model directly generate structured output, you can have the model respond to a series of prompts, and then assemble the structured output yourself outside of the generative process.
4. Specialised services: OpenAI offers an optional JSON mode that ensures API responses are returned as valid JSON, although it doesn’t provide strong guarantees about the schema and contents of the JSON.

These techniques work with varying degrees of success, depending on the difficulty of the task and the capability of the model. However, there’s a more forceful method that can guarantee precise outputs, even when working with relatively weak models applied to complex tasks: constrained decoding. (View Highlight)
In the context of structured generation, constrained decoding is a technique that manipulates a generative model’s token generation process to constrain its next-token predictions to only tokens that do not violate the required output structure. State of the art constrained decoding skips the parts of the structured output that are boilerplate scaffolding or tokens that can be uniquely determined based on preceding tokens and the constraints of the desired output. Only the parts of the output that strictly require generation are sampled from a restricted set of compatible tokens in the model’s next-token probability distribution. For a full exploration of the mechanics behind constrained decoding, I refer the reader to excellent articles and papers from the teams behind Outlines [1] [2] and SGLang [3]. Constrained decoding is an area of active innovation that continues to benefit from increasingly effective optimisations. (View Highlight)
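The mechanics described above can be sketched with a toy vocabulary. This is a minimal illustration, not how Outlines or SGLang implement it: `VOCAB`, `toy_logits`, and the fixed scores are hypothetical stand-ins for a real tokenizer and model, the constraint is simply a set of valid strings, and the allowed token is picked greedily rather than sampled.

```python
VOCAB = ["Y", "N", "yes", "no", "maybe", "<eos>"]  # toy vocabulary (assumption)
EOS = "<eos>"


def toy_logits(prefix: str) -> list[float]:
    """Hypothetical stand-in for a model forward pass: fixed scores per token."""
    return [0.5, 0.2, 2.0, 1.0, 3.0, 0.1]


def allowed_token_ids(prefix: str, valid_outputs: set[str]) -> list[int]:
    """Ids of tokens that keep the output a prefix of some valid string."""
    ids = []
    for i, tok in enumerate(VOCAB):
        if tok == EOS:
            # Stopping is only allowed once a complete valid output is reached.
            if prefix in valid_outputs:
                ids.append(i)
        elif any(v.startswith(prefix + tok) for v in valid_outputs):
            ids.append(i)
    return ids


def constrained_greedy(valid_outputs: set[str]) -> tuple[str, int]:
    """Return the decoded string and the number of model calls that were needed."""
    prefix, model_calls = "", 0
    while True:
        ids = allowed_token_ids(prefix, valid_outputs)
        if not ids:
            raise ValueError("no token in the vocabulary satisfies the constraint")
        if len(ids) == 1:
            best = ids[0]  # uniquely determined: no model call required
        else:
            logits = toy_logits(prefix)
            model_calls += 1
            # Consider only the allowed ids; every other token is masked out.
            best = max(ids, key=lambda i: logits[i])
        if VOCAB[best] == EOS:
            return prefix, model_calls
        prefix += VOCAB[best]
```

With these toy scores the unconstrained favourite is "maybe", yet `constrained_greedy({"Y", "N"})` returns `("Y", 1)`: one model call chooses between "Y" and "N", and the end-of-sequence token is then forced without consulting the model.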
Additional benefits of constrained decoding

As well as guaranteeing compliant outputs, the mechanics of constrained decoding outlined above can also reduce inference costs and improve throughput by:
1. Increasing token-generation speed. Constrained decoding simplifies the next-token prediction space, accelerating generation — especially when implemented with clever optimisations that allow some token generation steps to be outright skipped.
2. Reducing the number of generated tokens. The throughput of text generation systems is almost always bottlenecked by the speed of token generation, and for many structured generation tasks, much of the output is scaffolding that can bypass generation. For instance, for a rigid JSON schema with fixed field definitions, we can save a lot of time by only generating the values and not the surrounding boilerplate.

There’s even precedent suggesting that constrained decoding can improve task performance. (View Highlight)
Constrained decoding is only compatible with generative language models that make their complete next-token probability distribution available; that is, constrained decoding is only possible for models run locally, not via external APIs. External APIs may offer some structured generation functionality, such as OpenAI’s JSON mode, but at the time of writing, I’m not aware of any that support full-fledged constrained decoding.

There are various frameworks that enable constrained decoding to be leveraged with local models, including: SGLang, Outlines, guidance, and DSPy Assertions. In this article, I elect to use SGLang (which builds on Outlines under the hood) to illustrate examples, although the same concepts apply across all frameworks. These frameworks are generally pretty agnostic towards the local model used, and will usually be compatible with most popular models. In this article, any outputs that accompany the examples have been generated using google/gemma-2b-it.

There are three main ways that a structured output can be defined for constrained decoding: regex, code, and generative constructs. (View Highlight)
Structured output using regex

Regular expressions (regex) are a powerful way to define structured output, as they offer maximum specificity over the format and contents. (View Highlight)
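To illustrate that specificity, here is a small sketch of a regex that pins down both the shape and the contents of a JSON object. The schema (a name, an age from 0 to 129, and a "Y"/"N" flag) is a hypothetical example, and Python's `re` module is used here only to check candidate strings; a constrained decoding framework would instead compile a pattern like this to restrict token generation.

```python
import re

# Hypothetical schema: fixed keys, a bounded integer, and a two-value flag.
PERSON_PATTERN = re.compile(
    r'\{"name": "[A-Za-z ]{1,40}", '
    r'"age": (?:[1-9]?[0-9]|1[0-2][0-9]), '
    r'"subscribed": "(?:Y|N)"\}'
)


def is_valid(output: str) -> bool:
    """True only if the whole string matches the pattern."""
    return PERSON_PATTERN.fullmatch(output) is not None
```

A string such as `{"name": "Ada Lovelace", "age": 36, "subscribed": "Y"}` passes, while an age of 130 or a flag of "maybe" is rejected. Even this three-field pattern is fiddly to read, which hints at the maintenance cost of larger ones.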
The main downside of regex is that it’s tedious to write and maintain, especially in the context of an active codebase with an evolving data model. Another downside to regex in SGLang is that it introduces considerable compilation time when initialising a run, which is not encountered when using the specialised generative constructs. (View Highlight)
Structured output as code

Structured output can also be defined as code using Pydantic models. These can then be dynamically converted into regex for constrained decoding. Using Pydantic models like this is convenient, as it ensures alignment between your application code and the constrained decoding process. To fully streamline your application, you can also use string interpolation to describe the desired output structure in your instruction prompt. This is a major benefit over alternative approaches, where you may need to reimplement your data model in multiple places in multiple ways, risking code drift. (View Highlight)
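The idea of deriving both the regex and the prompt from a single model definition can be sketched as follows. To stay dependency-free, this sketch uses a standard-library dataclass as a stand-in for a Pydantic model; `Ticket`, `TYPE_PATTERNS`, `model_to_regex`, and `describe` are hypothetical names, and the conversion is far simpler than what a real framework performs.

```python
import re
from dataclasses import dataclass, fields


@dataclass
class Ticket:  # hypothetical data model, standing in for a Pydantic model
    title: str
    priority: int
    resolved: bool


# One regex fragment per supported Python type (flat models only).
TYPE_PATTERNS = {
    str: r'"[^"\\]*"',
    int: r"-?(?:0|[1-9][0-9]*)",
    bool: r"(?:true|false)",
}


def model_to_regex(model: type) -> str:
    """Build a regex for a JSON object whose keys follow the model's field order."""
    parts = [f'"{f.name}": {TYPE_PATTERNS[f.type]}' for f in fields(model)]
    return r"\{" + ", ".join(parts) + r"\}"


def describe(model: type) -> str:
    """Human-readable field list, interpolated into the instruction prompt."""
    return ", ".join(f"{f.name} ({f.type.__name__})" for f in fields(model))


TICKET_REGEX = model_to_regex(Ticket)
PROMPT = f"Summarise the bug report as JSON with fields: {describe(Ticket)}."
```

Because `TICKET_REGEX` and `PROMPT` are both generated from `Ticket`, adding or renaming a field updates the constraint and the instructions together, which is the alignment benefit described above.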
Defining structured output using Pydantic models suffers from the same downsides as regex, as this is what they get converted into under the hood. What’s more, the conversion process for complex Pydantic models involving multiple levels of nested models can be unreliable or yield poorly constructed regex, often rendering this approach unviable in practice. (View Highlight)
The third option is to define your structured output as alternating static strings and generative constructs. This has the benefit of splitting out your data model into more manageable components where the generative parts of the task can be defined individually. In SGLang, the task performance of the gen constructs using functionality such as choices can be superior to their regex equivalents due to algorithmic differences in their implementations (more on this in the pitfalls section). The guidance package provides an illustrative example of this pattern. (View Highlight)
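The pattern of alternating static strings and generative constructs can be sketched without any framework. The `Gen` class, `run_template`, and `toy_pick` below are hypothetical and do not mirror the SGLang or guidance APIs; `toy_pick` stands in for a constrained model call and simply returns the first option.

```python
import json
from typing import Callable, Union


class Gen:
    """A slot the model must fill, restricted to a fixed list of choices."""

    def __init__(self, name: str, choices: list[str]):
        self.name = name
        self.choices = choices


Segment = Union[str, Gen]


def run_template(
    segments: list[Segment],
    pick: Callable[[str, list[str]], str],
) -> tuple[str, dict[str, str], int]:
    """Assemble the output, calling the model only for the Gen slots."""
    text, values, model_calls = "", {}, 0
    for seg in segments:
        if isinstance(seg, str):
            text += seg  # static scaffolding is appended verbatim
        else:
            value = pick(text, seg.choices)  # the model sees the text so far
            model_calls += 1
            values[seg.name] = value
            text += value
    return text, values, model_calls


def toy_pick(context: str, choices: list[str]) -> str:
    """Hypothetical stand-in for a constrained model call."""
    return choices[0]


TEMPLATE: list[Segment] = [
    '{"sentiment": "',
    Gen("sentiment", ["positive", "negative"]),
    '", "urgent": ',
    Gen("urgent", ["true", "false"]),
    "}",
]
```

Calling `run_template(TEMPLATE, toy_pick)` yields a string that parses as JSON after only two model calls; every brace, key, and quote comes from the static segments.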
Depending on the framework and implementation, managing the scaffolding around the generative constructs can be syntactically awkward. (View Highlight)
Remember: the model does not know the constraints in advance

Most constrained decoding pitfalls are due to misalignment between what the generative model wants to output and what the model is forced to output. The mechanics of next-token prediction and constrained decoding are such that the model does not know what is coming in advance: it generates predictions only based on the tokens that precede it. This can result in poor performance if the model is forced to output something unnatural. (View Highlight)
The best way to mitigate this is to ensure the model understands the task and expectations by providing sufficient information in the prompt. If working with JSON, it’s also worth considering the ordering of your schema fields, and structuring them logically such that they’re amenable to next-token prediction. This may also include using descriptive field names, and not constraining the model to esoteric field values. Ideally, we would inspect the actual token probabilities to identify cases where the model has been forced to output something with low confidence. It would be great if we could triage outputs where the predictions do not meet certain confidence thresholds. However, I’m not aware of this functionality existing in the frameworks I’m familiar with. (View Highlight)
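The kind of triage wished for above might look like the sketch below. It assumes you can somehow obtain, for each emitted token, the probability the model assigned to it before the constraint mask was applied; the text notes that the frameworks discussed do not expose this, so `low_confidence_steps`, `needs_review`, and the example probabilities are purely hypothetical.

```python
def low_confidence_steps(
    tokens: list[str],
    probs: list[float],
    threshold: float = 0.1,
) -> list[tuple[int, str, float]]:
    """Return (position, token, probability) for tokens below the threshold."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    return [
        (i, tok, p)
        for i, (tok, p) in enumerate(zip(tokens, probs))
        if p < threshold
    ]


def needs_review(tokens: list[str], probs: list[float], threshold: float = 0.1) -> bool:
    """Flag an output for manual triage if any token was forced with low confidence."""
    return bool(low_confidence_steps(tokens, probs, threshold))
```

For example, with tokens `["{", '"label"', ":", '"spam"', "}"]` and probabilities `[0.99, 0.97, 0.99, 0.04, 0.98]`, only the `"spam"` value is flagged, suggesting the model was pushed towards a label it did not favour.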
Another pitfall to be mindful of is how your constrained decoding is being implemented algorithmically. This will vary depending on the framework and constructs you’re using. For instance, in SGLang, constrained decoding for a multiple-choice string is different when configured as regex versus gen’s choices argument. The former uses a greedy next-token prediction, whereas the latter evaluates all options completely. Other libraries may use variations on beam search. (View Highlight)
The greedy algorithm used by the regex implementation is short-sighted, and cannot resist choosing the “Donald” option, despite ultimately landing on an incorrect answer. The choices approach avoids this by evaluating the quality of all options in their entirety, and not just based on the most attractive initial token. (View Highlight)
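The difference between the two strategies can be reproduced with a toy probability table. The city options and the conditional probabilities in `PROBS` are hypothetical, and the scoring below uses a plain sum of log-probabilities; whether and how a given framework normalises by length is not covered here.

```python
import math

# Hypothetical options, pre-split into tokens.
OPTIONS = {
    "San Diego": ("San", " Diego"),
    "Los Angeles": ("Los", " Angeles"),
}

# Hypothetical conditional probabilities P(token | prefix).
PROBS = {
    (): {"San": 0.6, "Los": 0.4},
    ("San",): {" Diego": 0.2},
    ("Los",): {" Angeles": 0.95},
}


def greedy_choice(options: dict, probs: dict) -> str:
    """Pick the most likely allowed token at each step, never looking ahead.

    Assumes no option's token sequence is a prefix of another's.
    """
    prefix = ()
    candidates = dict(options)
    while True:
        for name, tokens in candidates.items():
            if tokens == prefix:
                return name
        allowed = sorted({tokens[len(prefix)] for tokens in candidates.values()})
        best = max(allowed, key=lambda tok: probs[prefix][tok])
        prefix += (best,)
        candidates = {
            n: t for n, t in candidates.items() if t[: len(prefix)] == prefix
        }


def sequence_logprob(tokens: tuple, probs: dict) -> float:
    """Log-probability of a complete option under the toy table."""
    return sum(
        math.log(probs[tuple(tokens[:i])][tok]) for i, tok in enumerate(tokens)
    )


def full_choice(options: dict, probs: dict) -> str:
    """Score every option in its entirety and return the best one."""
    return max(options, key=lambda name: sequence_logprob(options[name], probs))
```

Here `greedy_choice(OPTIONS, PROBS)` returns "San Diego" because "San" is the more attractive first token, even though the full sequence has probability 0.12. `full_choice(OPTIONS, PROBS)` returns "Los Angeles", whose full sequence has probability 0.38.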
How to generate a varying number of structured outputs

A tricky but not uncommon structured generation requirement is to generate zero, one, or many outputs that adhere to constraints, from a single input. Specifying such constraints in regex is sometimes theoretically possible, but usually ill-advised in practice for all but the most simple string patterns. A better approach is to use control flow to enable the model to generate multiple structured outputs in a series of responses. (View Highlight)
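One way to picture the control-flow approach is a loop that alternates a constrained yes/no decision with a constrained item generation. This is a sketch under assumptions: `generate_items` and `make_toy_model` are hypothetical names, and the two callables stand in for constrained model calls that a real framework would provide.

```python
from typing import Callable


def generate_items(
    wants_more: Callable[[list[dict]], str],
    generate_item: Callable[[list[dict]], dict],
    max_items: int = 10,
) -> list[dict]:
    """Collect zero or more structured items.

    `wants_more` is a constrained choice returning "yes" or "no";
    `generate_item` produces one structured output given the items so far.
    `max_items` guards against a model that never answers "no".
    """
    items: list[dict] = []
    while len(items) < max_items and wants_more(items) == "yes":
        items.append(generate_item(items))
    return items


def make_toy_model(mentions: list[str]):
    """Hypothetical stand-in for a model that extracts each mention in turn."""

    def wants_more(items: list[dict]) -> str:
        return "yes" if len(items) < len(mentions) else "no"

    def generate_item(items: list[dict]) -> dict:
        return {"name": mentions[len(items)]}

    return wants_more, generate_item
```

Each individual step stays simple to constrain (a two-way choice, then one fixed-shape object), while the loop handles the variable count, including the case of no outputs at all.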
Post-processing of structured outputs

There are scenarios where you can achieve better task performance by defining intermediary structured outputs for constrained decoding, and then converting these to your final structured output during post-processing. This can be the case when using enumerations that the model handles poorly. (View Highlight)
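As a sketch of the enumeration case, suppose the application needs terse internal codes, but the model is constrained to plain-language labels that are more natural to predict. The `Priority` codes, the label mapping, and `postprocess` are all hypothetical.

```python
from enum import Enum


class Priority(Enum):  # hypothetical final enumeration used by the application
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# Intermediary labels the model is constrained to, mapped to the final codes.
LABEL_TO_PRIORITY = {
    "urgent": Priority.P1,
    "normal": Priority.P2,
    "low": Priority.P3,
}


def postprocess(raw: dict) -> dict:
    """Convert the intermediary structured output into the final one."""
    final = dict(raw)  # copy so the model's raw output is left untouched
    final["priority"] = LABEL_TO_PRIORITY[raw["priority"]].value
    return final
```

The constrained decoding step would then only ever see "urgent", "normal", or "low", and the mapping to `P1`, `P2`, or `P3` happens deterministically afterwards.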
Another case where post-processing can be useful is for JSON schemas that contain optional fields. Rather than use complex regex to specify the optional fields, it’s usually easiest to include all fields in the structured output and have the generative model populate unneeded field values with placeholders. These fields can then be deleted during post-processing. Even for expansive JSON schemas with many optional fields, this is inexpensive with constrained decoding as this JSON scaffolding does not actually get generated.

Constrained decoding is a powerful but underutilised technique for structured generation. It’s a significant advantage that local models have over external APIs, and can enable relatively small and inexpensive models to perform comparably to much larger alternatives. The throughput and latency improvements that constrained decoding provides can also be extremely compelling. In my experience, SGLang has particularly impressive backend optimisations that increase the speed of JSON generation by an order of magnitude versus unconstrained equivalents (whilst also guaranteeing well-formed outputs…!). (View Highlight)
