Using Generative AI to Count Archive Files

I've wanted an excuse to try parallel processing with Python for a few months, and yesterday the FBI provided one. They released a collection of records related to the assassination of Reverend Dr. Martin Luther King Jr. The announcement page makes it easy to get a count of the released PDF files.

Information about the MLK file release to the National Archives on July 21, 2025. There were 6301 pdf files released.

Information about the files released to the National Archives

I could quickly see that there were 6,301 files. A brief internet search indicated that the files have not yet been released in any kind of compressed container, like a zip file. I also confirmed that the search box searches only the PDF file names, not their contents. The immediate next question was: how many bytes of disk space do all the PDFs consume?

I asked ChatGPT o4-mini-high to write a Python script to determine the combined size of all the files. The script was unable to determine the size of each file from a HEAD request to the file's URL, so it wound up having to use GET requests to measure the size. With only a single thread running, this was very slow. It took about 15 seconds to pull in the first 100 files (after a two-second setup time to start the script).

Screen shot of video of the file capture process. It shows that the 100th file was measured 17 seconds into the process

First 100 files processed in about 15 seconds using one thread in the Python script
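The HEAD-then-GET fallback described above might look something like the following. This is only a minimal sketch, not the script ChatGPT produced: it assumes the standard-library urllib rather than whatever HTTP library the generated script used, and the function names (content_length, count_stream_bytes, file_size) are hypothetical.

```python
import urllib.request


def content_length(headers):
    """Return the Content-Length header as an int, or None if missing or invalid."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def count_stream_bytes(stream, chunk_size=65536):
    """Read a file-like object to the end and return how many bytes it held."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total


def file_size(url, timeout=30):
    """Try a cheap HEAD request first; fall back to downloading with GET."""
    head = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(head, timeout=timeout) as resp:
            size = content_length(resp.headers)
            if size is not None:
                return size
    except OSError:
        pass  # HEAD was refused or failed; fall through to GET
    # No usable Content-Length from HEAD, so download the body and count it.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return count_stream_bytes(resp)
```

The GET path has to pull every byte of every PDF across the network, which is why the single-threaded run is slow.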

In a project a few months ago, I surmised that I could get better results on this kind of operation by running processes in parallel. I asked ChatGPT to rewrite the script to kick off the same process as four parallel threads by dividing the number of files by four and then passing a quarter of the files to each thread to be processed identically.
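The split-into-quarters approach can be sketched roughly as follows. This is an illustration under assumptions, not the generated script itself: it uses concurrent.futures.ThreadPoolExecutor, and the names split_into_chunks, total_size_of_chunk, and total_size_parallel are hypothetical. The measure argument stands in for whatever function returns the size of one file.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices whose lengths differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one leftover item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def total_size_of_chunk(urls, measure):
    """Process one thread's share of the files, one after another."""
    return sum(measure(url) for url in urls)


def total_size_parallel(urls, measure, n_threads=4):
    """Give each thread one chunk of the list and add up the partial totals."""
    chunks = split_into_chunks(urls, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(lambda chunk: total_size_of_chunk(chunk, measure), chunks)
        return sum(partials)
```

With 6,301 files and four threads, this split gives one thread 1,576 files and the other three 1,575 each. Changing n_threads is all it takes to try 8 or 16 threads.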

Performance improved. It took about 5 seconds to process the first 100 files (after a 3-second setup time).

Screen shot of video of the file capture process. It shows that the 100th file was measured 8 seconds into the process

First 100 files processed in about 5 seconds using four threads in the Python script

Because the code was concisely written, it was easy to measure the time for 100 files using 8 and then 16 threads as well.

Eight threads took about 5 seconds (with a 2-second setup).

8 threads

Sixteen threads took about 2 seconds (with a 3-second setup).

16 threads
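To put the four measurements side by side, here is a small sketch that computes the speedup over the single-thread run and a rough projection for all 6,301 files. The per-100-file times are the ones reported above; the projection assumes the rate for the first 100 files holds for the whole set, which was not measured, and the function names are hypothetical.

```python
# Seconds to process the first 100 files, excluding setup, keyed by thread count.
SECONDS_PER_100 = {1: 15, 4: 5, 8: 5, 16: 2}


def speedup(threads, timings=SECONDS_PER_100):
    """How many times faster than the single-thread run."""
    return timings[1] / timings[threads]


def projected_seconds(threads, n_files=6301, timings=SECONDS_PER_100):
    """Rough projection, assuming the first-100-files rate stays constant."""
    return timings[threads] * n_files / 100


if __name__ == "__main__":
    for threads in SECONDS_PER_100:
        print(
            f"{threads:>2} threads: {speedup(threads):.1f}x speedup, "
            f"about {projected_seconds(threads) / 60:.1f} minutes projected"
        )
```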

The processor on my computer has only sixteen logical processors, so I stopped at 16.

The small effort of asking ChatGPT to parallelize the process was more than worth it. I also now have example code on hand for when I want to parallelize other processes.

Update:

Spot checking the results, I got the following from my Windows directory

vs the results from the script

The code seems to be working correctly.
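A spot check like this can also be done in Python instead of reading the total from the Windows directory properties. The sketch below is a hypothetical helper, not part of the original script; it assumes the downloaded PDFs sit directly in one local folder and adds up their sizes on disk for comparison with the script's total.

```python
from pathlib import Path


def local_pdf_bytes(directory):
    """Sum the on-disk sizes, in bytes, of the .pdf files directly inside directory."""
    return sum(
        path.stat().st_size
        for path in Path(directory).glob("*.pdf")
        if path.is_file()
    )
```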

Copasetic Flow
