Metadata

Highlights

  • While Pandas is the most popular DataFrame library, it can be very slow:
    • It uses only a single CPU core.
    • Its DataFrames have a large memory footprint.
    • It executes code eagerly, one operation at a time, which leaves no room for query-level optimization.
    FireDucks is a highly optimized, drop-in replacement for Pandas with the same API.
    You just need to change one line of code → import fireducks.pandas as pd
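As a rough illustration of the one-line change, the sketch below swaps the import and leaves the rest of the code untouched. The fallback to plain Pandas and the toy `city`/`sales` data are assumptions added here so the snippet runs on machines without FireDucks; they are not part of the original highlight.

```python
# Hedged sketch: try FireDucks first, fall back to plain Pandas if it is missing.
try:
    import fireducks.pandas as pd  # the one-line change described above
    backend = "fireducks"
except ImportError:
    import pandas as pd  # assumption: fallback so the example still runs
    backend = "pandas"

# Hypothetical toy data; everything below is ordinary Pandas-style code.
df = pd.DataFrame({"city": ["A", "B", "A"], "sales": [10, 20, 30]})
totals = df.groupby("city")["sales"].sum()

print(backend, totals.to_dict())
```

Because only the import differs, the same script can be timed under both backends without further edits.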
  • As you can tell, FireDucks is even faster than cuDF in this case.
    That said, the query in the above experiment loads all columns of the two Parquet files.
    When I optimized it manually by loading only the required columns, the run time dropped to:
    • Pandas: 14 seconds (from 48 seconds)
    • FireDucks: 0.8 seconds (unchanged from 0.8 seconds)
    • cuDF: 0.9 seconds (from 2.6 seconds)
    This indicates that FireDucks' compiler already applies this optimization automatically, whereas in cuDF and Pandas one has to apply it explicitly.