Evaluating and Uncovering Open LLMs

  • The hypothesis I put the most weight on is that open source wins by offering a smaller, better model for each use-case/area. The giant models are the best horizontal product, but very few technology solutions require one model that can solve many tasks, and those that do are inherently lower-margin applications like chat.
  • You’ll want a model to cover at most two or three of the tasks above (some tasks will synergize well). To do so, it will likely be a model fine-tuned on a curated dataset of examples, nothing crazy. I’m sure people have started doing this.
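A minimal sketch of what that curation step might look like, assuming a JSONL prompt/completion format like most instruction-tuning pipelines accept. The task names, example records, and helper functions here are all illustrative, not from the original post:

```python
import json

# Hypothetical curated examples, each tagged with the task it covers.
# In practice you'd collect a few thousand of these; the format is illustrative.
EXAMPLES = [
    {"task": "sql_generation", "instruction": "Write a query for monthly revenue.", "response": "SELECT ..."},
    {"task": "summarization", "instruction": "Summarize this support ticket.", "response": "Customer reports ..."},
    {"task": "code_review", "instruction": "Review this diff.", "response": "The loop bound ..."},
]

def curate(examples, target_tasks, max_tasks=3):
    """Keep only examples for the small set of tasks the model should cover."""
    assert len(target_tasks) <= max_tasks, "keep the model narrow: two or three tasks"
    return [ex for ex in examples if ex["task"] in target_tasks]

def to_jsonl(examples):
    """Serialize to the JSONL prompt/completion shape common fine-tuning tools accept."""
    return "\n".join(
        json.dumps({"prompt": ex["instruction"], "completion": ex["response"]})
        for ex in examples
    )

curated = curate(EXAMPLES, {"sql_generation", "summarization"})
print(to_jsonl(curated))
```

The point is the filtering, not the framework: any standard fine-tuning stack can consume the resulting file, and keeping the task set to two or three is what makes the dataset cheap to curate well.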
  • Model tracking and searching are very likely to grow out of an environment where people realize there are models out there that can likely do their task very well, but they have no good way to find them. The model you’re looking for is probably on HuggingFace, but it will have no model card describing its training process or capabilities.
  • The Open LLM Leaderboard largely serves as the leaderboard for base-model performance and simple instruction-following performance.
  • The leaderboard will also add human and GPT-4 ratings across a secret validation set of tasks.
  • Getting a set of crowd-workers to label responses in the style that you want is extremely hard (take a look at the thoroughness of InstructGPT’s training documentation), so results obtained this way should come with contextualization.
  • A very common line in announcements of new open-source models replicating ChatGPT is that “our model beats ChatGPT in binary comparisons N% of the time.”
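One reason to be skeptical of that line is sample size, which the announcements rarely report. A quick way to see it is to put a confidence interval around the win rate; the sketch below uses a Wilson score interval, and the counts (46 wins out of 80 prompts) are invented for illustration:

```python
import math

def win_rate_interval(wins, total, z=1.96):
    """Win rate with a Wilson score ~95% interval: a quick check on how much
    data actually backs a 'beats ChatGPT N% of the time' claim."""
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return p, center - half, center + half

# 46 wins out of 80 prompts reads as "beats ChatGPT 57% of the time",
# but the interval still includes 50/50, i.e. a coin flip.
p, lo, hi = win_rate_interval(46, 80)
print(f"win rate {p:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")
```

With ten times the comparisons the interval tightens considerably, which is part of why win-rate headlines without the number of prompts (or who judged them) are hard to take at face value.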
  • It’s all deeply related to the idea that you cannot easily imitate proprietary LLMs, because you only have access to inference data and not the human preference data or raw red-teaming data needed to fully extract information from the model.