Cooking in Parallel in Python
Parallelism refers to using multiple cores or processors to achieve genuine speedup, which is especially helpful for CPU-bound tasks.
There are multiple ways one can achieve parallelism in Python:
- Multiprocessing
- Parallel processing libraries (e.g., Joblib, Dask, Ray, etc.)
Let's continue running a restaurant; we're reusing the same example from the last article.
1. Multiprocessing
Multiprocessing is a parallel processing technique that uses multiple processes to execute tasks. Each process has its own memory space, allowing true parallel execution of tasks on multiple CPU cores. We don't have to worry about the GIL, since we're using subprocesses (each with its own memory space and its own interpreter) instead of threads (which share a memory space, and which the GIL prevents from executing Python bytecode at the same time).
Let's create a `Pool` with 3 processes and map the ordered dishes with our `prepare_dish` function.
Full script:
2. Parallel processing libraries
Python offers various libraries like Joblib, Dask and Ray that simplify parallelism and can handle tasks across multiple cores efficiently (and, with Dask and Ray, even across multiple machines). I'm sure you can find more tools and libraries that aim for the same goal, each with its own pros and cons. In this article we'll only check out Joblib, and cover Dask and Ray in a future article where we'd prepare ingredients in one restaurant and cook them in a different one.
The function `Parallel` here is pretty self-explanatory. The function `delayed` avoids calling the wrapped function immediately; instead it delays the call until it's invoked elsewhere, creating a reference to the function along with its args and kwargs.
Full script:
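As with the multiprocessing script, the original code was lost, so here's a minimal sketch of the same restaurant example with Joblib; the dish names and the `time.sleep` stand-in are placeholders.

```python
import time
from joblib import Parallel, delayed

def prepare_dish(dish):
    """Prepare one dish (time.sleep stands in for real work)."""
    time.sleep(1)
    return f"{dish} is ready"

orders = ["pasta", "pizza", "salad"]
# delayed(prepare_dish)(dish) builds a (function, args, kwargs) reference
# without calling it; Parallel then schedules the calls across 3 workers.
results = Parallel(n_jobs=3)(delayed(prepare_dish)(dish) for dish in orders)
print(results)  # ['pasta is ready', 'pizza is ready', 'salad is ready']
```

By default `Parallel` uses the `loky` process backend; passing `backend="threading"` switches to threads, which is the knob referenced in the comparison table below.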
Pros & Cons
| Feature | Multiprocessing | Parallel Processing Libraries (Joblib) |
|---|---|---|
| Concurrency Model | Multi-process | Multi-process or multi-threaded, depending on the selected backend (`loky` or `threading`) |
| I/O-bound tasks | Overhead for process creation | Less overhead |
| Memory Usage | High (due to multiple processes) | Moderate (less memory overhead than multiprocessing) |
| Complexity | Slightly complex (process management) | Simpler (higher-level APIs, easier to use than multiprocessing) |
Moving forward
When choosing between multiprocessing and parallel processing libraries, consider the nature of your tasks. For CPU-bound tasks, both offer good performance. For I/O-bound tasks, however, parallel processing libraries may provide a more efficient solution due to lower overhead. Always test and benchmark your code to determine which method and which library best suits your requirements. If you're dealing with big data loads or anything that requires high computational capacity, single-machine parallelism is not going to solve your issue; for that we'll look into distributed processing/computing in Python. Till then, here are a few good resources to explore:
- https://docs.python.org/3/library/multiprocessing.html
- https://joblib.readthedocs.io/