Distributed Computing
Tips and Tricks: Dask
Writing computationally efficient Python code using Dask, a distributed framework for analytics
Why Dask?
Over the last couple of months, I have been working relentlessly to productionize a computationally expensive algorithm. The algorithm ingests millions of rows of unstructured text data and trains a language model before proceeding to the inference stage. One of the bottlenecks in deploying the feature was its inability to train the model on terabytes of data. It was evident that we needed a scalable framework to make it work!
There were a few options available to tackle the problem, including MapReduce and Spark, but I decided to go with Dask for a couple of reasons:
- Dask uses existing Python APIs and data structures, which resulted in a much faster development cycle (see the short sketch after this list).
- The application runs on GCP, and Dask integrates seamlessly with GCP services including BigQuery, Vertex AI, and Dataproc.
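To illustrate the first point, here is a minimal sketch of how a Dask DataFrame mirrors the familiar pandas API; the file pattern and column names are hypothetical placeholders, not part of the actual pipeline:

```python
import dask.dataframe as dd

# Lazily read a collection of CSV files into a Dask DataFrame.
# "data/part-*.csv" and the column names below are placeholders.
df = dd.read_csv("data/part-*.csv")

# Familiar pandas-style operations build a task graph
# instead of executing immediately.
result = df.groupby("category")["value"].mean()

# .compute() triggers the actual parallel execution.
print(result.compute())
```

Because the code reads almost exactly like pandas, an existing single-machine prototype can be scaled out with minimal rewriting.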
The goal of this article is to highlight the most important lessons and caveats we unearthed while working with Dask. So, without further ado, here we go: