Data Is Better Together

Data continues to be essential for better models: we see continued evidence from published research, open-source experiments, and the open-source community that better data can lead to better models.
Empowering the community to build and improve datasets collectively will allow people to:
• Contribute to the development of Open Source ML with no ML or programming skills required.
• Create chat datasets for a particular language.
• Develop benchmark datasets for a specific domain.
• Create preference datasets from a diverse range of participants.
• Build datasets for a particular task.
• Build completely new types of datasets collectively as a community.
One of the challenges to many previous efforts to build AI datasets collectively was setting up an efficient annotation task. Argilla is an open-source tool that can help create datasets for LLMs and smaller specialised task-specific models. Hugging Face Spaces is a platform for building and hosting machine learning demos and applications. Recently, Argilla added support for authentication via a Hugging Face account for Argilla instances hosted on Spaces. This means it now takes seconds for users to start contributing to an annotation task.