Discovering Language Model Behaviors with Model-Written Evaluations

Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering.

It is crucial to evaluate LM behaviors extensively, to quickly understand LMs’ potential for novel risks before LMs are deployed.

Prior work creates evaluation datasets manually (Bowman et al., 2015; Rajpurkar et al., 2016, inter alia), which is time-consuming and effortful, limiting the number and diversity of behaviors tested. Other work uses existing data sources to form datasets (Lai et al., 2017, inter alia), but such sources are not always available, especially for novel behaviors.

Here, we show it is possible to generate many diverse evaluations with significantly less human effort by using LMs.