rw-book-cover

Metadata

Highlights

  • was found that ImageNet had over 100k mislabeled images.
    Real-world datasets are messy.
    They often come with noisy labels, missing values, and outliers that can severely degrade your model’s performance.
    No sophisticated ML algorithms can compensate for poor-quality or mislabeled data. (View Highlight)
  • (View Highlight)
  • (View Highlight)
  • Researchers from MIT developed Cleanlab, which is an open-source library that cleans your data in just a few lines of code. (View Highlight)
  • As shown in the image above, Cleanlab can flag errors in any type of data (text, image, tabular, audio), like:
    • out-of-distribution samples
    • outliers
    • label issues
    • duplicates, etc.
    All it takes is just four lines of code:



    • Import the package.
    • Pass the dataset and specify the label column.
    • Find issues by passing the embedding matrix and the probabilities predicted by the model.
    • Finally, generate the report!
    Done!
    It will generate a report like the one shown above.
    This way, you can easily clean your datasets for training accurate ML models. (View Highlight)