If Modin is slower than expected, the first thing to check is NPartitions. Simply put, it’s the number of partitions along each axis (rows and columns). Since most data frames you’d be handling will be large (otherwise, why use Ray?), the default results in far more partitions than multiprocessing.cpu_count(). One way to handle that is described in the usage guide: set NPartitions to int(cpu_count() ** 0.5) so that there are roughly as many partitions as cpu_count().
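Below is a minimal sketch of that adjustment via modin.config.NPartitions.put; the input file data.csv is hypothetical, and the config call has to run before the data frame is created.

```python
import multiprocessing
import modin.config as cfg

# NPartitions applies per axis, so the frame ends up with roughly
# (cpu_count() ** 0.5) ** 2 ~= cpu_count() partitions in total.
cfg.NPartitions.put(int(multiprocessing.cpu_count() ** 0.5))

import modin.pandas as pd

df = pd.read_csv("data.csv")  # "data.csv" is a hypothetical input file
```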
Another thing to consider is disk IO. If your data is loaded into memory from disk and there are many such IOs, performance may not be CPU bound. If so, it makes sense to run the code in parallel using ray.util.multiprocessing.Pool while setting NPartitions to 1. That way, many workers run in parallel while each worker deals with a single data frame, as in the sketch below.
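Here is a minimal sketch of that pattern. The file names, the "value" column, and the process_file function are made up for illustration; NPartitions is set inside the worker so that each worker process picks it up.

```python
from ray.util.multiprocessing import Pool

def process_file(path):
    # Each worker handles one whole file with a single partition;
    # parallelism comes from the Pool, not from Modin's partitioning.
    import modin.config as cfg
    import modin.pandas as pd
    cfg.NPartitions.put(1)
    df = pd.read_csv(path)
    return df["value"].sum()  # "value" is a made-up column name

if __name__ == "__main__":
    files = [f"data_{i}.csv" for i in range(8)]  # hypothetical input files
    pool = Pool()
    print(pool.map(process_file, files))
```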
It’s worth noting that being IO bound may not be apparent in iostat or in top (the ‘wa’ / iowait value). Some internal behavior of Ray seems to delay the IO.
When using Pool(), experiment with different values for the num_cpus parameter of ray.init(), such as 1, 2, 4, and so on. With 32 vCPUs, I observed the best performance when it was 2.
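A rough way to run that experiment, reusing the hypothetical process_file and file list from the previous sketch:

```python
import time
import ray
from ray.util.multiprocessing import Pool

files = [f"data_{i}.csv" for i in range(8)]  # same hypothetical inputs as above

for n in (1, 2, 4, 8):
    ray.init(num_cpus=n)           # cap the CPUs Ray will schedule on
    pool = Pool()                  # pool size defaults to the cluster's CPU count
    start = time.time()
    pool.map(process_file, files)  # process_file from the sketch above
    print(f"num_cpus={n}: {time.time() - start:.1f}s")
    pool.terminate()
    ray.shutdown()                 # tear down so the next ray.init() takes effect
```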
It’s always best to experiment. Code and data are different in each problem. Try different NPartitions and num_cpus values, and both serial and parallel versions.
In some cases, Modin might not be the answer. Since it’s your own problem, you have better knowledge of how things can be optimized, e.g., a merge sort that uses the disk will be faster than an out-of-core Modin dataframe (which uses memory-mapped files) if your access pattern is sequential.
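For reference, a minimal sketch of such an on-disk (external) merge sort for numeric data; the chunk size and temporary-file layout are arbitrary choices for illustration.

```python
import heapq
import os
import tempfile

def external_sort(values, chunk_size=100_000):
    """Sort numbers that may not fit in memory: sort fixed-size chunks in RAM,
    spill each sorted run to disk, then merge the runs with heapq.merge."""
    run_paths = []

    def spill(items):
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for v in sorted(items):
                f.write(f"{v}\n")
        run_paths.append(path)

    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    # Merge phase: each run is read front to back, one line at a time.
    files = [open(p) for p in run_paths]
    try:
        for line in heapq.merge(*files, key=float):
            yield float(line)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Because each run is written once and then read front to back, all disk access stays sequential, which is what makes this faster than a general-purpose out-of-core dataframe for this access pattern.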