Discovering Language Model Behaviors with Model-Written Evaluations




  • Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering (View Highlight)
  • It is crucial to evaluate LM behaviors extensively, to quickly understand LMs’ potential for novel risks before LMs are deployed. (View Highlight)
  • Prior work creates evaluation datasets manually (Bowman et al., 2015; Rajpurkar et al., 2016, inter alia), which is time-consuming and effortful, limiting the number and diversity of behaviors tested. Other work uses existing data sources to form datasets (Lai et al., 2017, inter alia), but such sources are not always available, especially for novel behaviors. (View Highlight)
  • Here, we show it is possible to generate many diverse evaluations with significantly less human effort by using LMs (View Highlight)