I've wanted an excuse to try parallel processing with Python for a few months, and yesterday the FBI provided one. They released a collection of records related to the assassination of Reverend Dr. Martin Luther King Jr. It's easy enough to get a count of the released PDF files from the announcement page.
I was able to quickly read that there were 6,301 files. A brief internet search indicated that the files have not yet been released in any kind of compressed container, like a zip file. I also tested the search box and found that it only searches the PDF file names, not their contents. The immediate next question was: how many bytes of disk space do all the PDFs consume?
I asked ChatGPT o4-mini-high to write a Python script to determine the combined size of all the files. The script was unable to determine the size of each file from an HTTP HEAD request for its URL, so it wound up having to use GET requests to measure the size. With only a single thread running, this was very slow. It took about 15 seconds to pull in the first 100 files (there was a two-second setup time to start the script).
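For reference, here is a minimal sketch of the HEAD-then-GET idea described above. It is not the script ChatGPT produced; the function name `remote_size` and the `opener` parameter are hypothetical names chosen for illustration, and it assumes only the standard library's `urllib`.

```python
import urllib.request


def remote_size(url, opener=urllib.request.urlopen, chunk_size=64 * 1024):
    """Return the size in bytes of the resource at `url`.

    `opener` is a hypothetical hook (defaulting to urllib's urlopen) so the
    function can be exercised without touching the network.
    """
    # Try a HEAD request first: cheap when the server reports Content-Length.
    head = urllib.request.Request(url, method="HEAD")
    try:
        with opener(head) as resp:
            length = resp.headers.get("Content-Length")
            if length is not None:
                return int(length)
    except OSError:
        # URLError and HTTPError are OSError subclasses; fall through to GET.
        pass

    # Fall back to a GET and count the bytes as they stream past,
    # without keeping the whole file in memory.
    total = 0
    with opener(urllib.request.Request(url, method="GET")) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

The fallback is what makes the single-threaded run slow: every file has to be downloaded in full just to be measured.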
In a project a few months ago, I surmised that I could get better results on this kind of operation by running the work in parallel. I asked ChatGPT to rewrite the script to run the same process on four parallel threads by splitting the list of files into quarters and passing one quarter to each thread to be processed identically.
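The split-into-quarters approach can be sketched roughly as follows. This is an illustration under assumptions, not the generated script: `split_evenly` and `total_size` are hypothetical names, and `size_of` stands for any function that returns a byte count for one URL.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, parts):
    """Split `items` into `parts` contiguous chunks whose lengths differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one leftover item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def total_size(urls, size_of, workers=4):
    """Sum `size_of(url)` over all URLs, giving one chunk of the list to each thread."""
    def chunk_total(chunk):
        return sum(size_of(url) for url in chunk)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per chunk, so each worker thread handles one share of the list.
        return sum(pool.map(chunk_total, split_evenly(urls, workers)))
```

Threads suit this job because each one spends most of its time waiting on the network rather than computing. Changing `workers` is all it takes to try 8 or 16 threads.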
Performance improved. It took about 5 seconds to process the first seven files (there was a 3-second setup time).
Because the code was fairly concise, it was easy enough to measure the time for 100 files using 8 and then 16 threads as well.
Eight threads took about 5 seconds (with a 2-second setup).
Sixteen threads took about 2 seconds (with a 3-second setup).
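A comparison like this can be reproduced offline with a small timing harness. The sketch below is an assumption-laden stand-in: `fake_fetch` simply sleeps instead of downloading anything, so its numbers say nothing about the real files, and both function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(_url, delay=0.01):
    """Stand-in for a network request: sleep briefly, then count one file."""
    time.sleep(delay)
    return 1


def time_run(urls, workers, fetch=fake_fetch):
    """Process `urls` with `workers` threads; return (files processed, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = sum(pool.map(fetch, urls))
    return processed, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"file-{i}.pdf" for i in range(100)]  # made-up names
    for workers in (1, 4, 8, 16):
        processed, elapsed = time_run(urls, workers)
        print(f"{workers:>2} threads: {elapsed:.2f}s for {processed} files")
```

Timing the work separately from the script's startup keeps the setup time out of the comparison.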
The processor in my computer only has sixteen logical processors internally, so I stopped at 16.
The little bit of effort to ask ChatGPT to parallelize the process was more than worth it. I also have example code available when I want to parallelize other processes.
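The logical processor count can be read from Python rather than looked up by hand. A small sketch, where `default_workers` and its `cap` parameter are hypothetical names:

```python
import os


def default_workers(cap=16):
    """Pick a thread count: the logical processor count, limited to `cap`."""
    # os.cpu_count() can return None when the count cannot be determined.
    logical = os.cpu_count() or 1
    return min(cap, logical)


if __name__ == "__main__":
    print(f"Logical processors: {os.cpu_count()}, using {default_workers()} threads")
```

For download-bound work like this, the logical processor count is a convenient stopping point rather than a hard limit, since threads waiting on the network use very little CPU.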
Update:
Spot checking the results, I got the following from my Windows directory