
Using Generative AI to Count Archive Files

I've wanted an excuse to try parallel processing with Python for a few months, and yesterday the FBI provided one. They released a collection of records related to the assassination of Reverend Dr. Martin Luther King Jr. It's easy enough to get a count of the PDF files released from the announcement page.

Information about the MLK file release to the National Archives on July 21, 2025. There were 6301 pdf files released.
Information about the files released to the National Archives


I was able to quickly read that there were 6,301 files. A brief internet search indicated that the files have not yet been released in any kind of compressed container, like a zip file. I also tested the search box and found that it only searches the PDF file names, not their contents. The immediate next question was: how many bytes of disk space do all the PDFs consume?

I asked ChatGPT o4-mini-high to write a Python script to determine the combined size of all the files. The script was unable to determine the size of each file from a HEAD request to the file's URL, so it wound up having to use GET requests to measure the size. With only a single thread running, this was very slow. It took about 15 seconds to pull in the first 100 files (after a two second setup time to start the script).
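For readers who want to see the idea, here is a minimal sketch of the HEAD-then-GET approach. It is not the script ChatGPT produced; the function name `file_size`, the `opener` parameter, and the chunk size are my own assumptions for illustration. It tries a HEAD request first and, if no Content-Length comes back, streams the file with GET and counts the bytes.

```python
import urllib.request

CHUNK_SIZE = 64 * 1024  # assumed read size; any reasonable value works


def file_size(url, opener=urllib.request.urlopen):
    """Return the size in bytes of the resource at url.

    `opener` is a hypothetical hook (defaults to urlopen) so the
    function can be exercised without touching the network.
    """
    # First try: ask for headers only and read Content-Length.
    head = urllib.request.Request(url, method="HEAD")
    try:
        with opener(head) as resp:
            length = resp.headers.get("Content-Length")
        if length is not None:
            return int(length)
    except OSError:
        # URLError and HTTPError are OSError subclasses; fall through to GET.
        pass

    # Fallback: download the body in chunks and count the bytes.
    total = 0
    get = urllib.request.Request(url, method="GET")
    with opener(get) as resp:
        while True:
            chunk = resp.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total
```

The GET fallback is what makes a single-threaded run slow: every byte of every file has to come across the wire just to be counted.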

Screen shot of video of the file capture process. It shows that the 100th file was measured 17 seconds into the process
First 100 files processed in about 15 seconds using one thread in the Python script

In a project a few months ago, I surmised that I could get better results on this kind of operation by running processes in parallel. I asked ChatGPT to rewrite the script to run the same process in four parallel threads by dividing the list of files into four and passing a quarter of the files to each thread to be processed identically.
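The divide-into-quarters idea can be sketched roughly as follows. This is my own illustration, not the generated script: `split_into_chunks`, `total_size_parallel`, and the `measure` callable (any function that takes a URL and returns a byte count) are hypothetical names.

```python
import threading


def split_into_chunks(items, n):
    """Split items into n contiguous chunks whose lengths differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks = []
    start = 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def total_size_parallel(urls, measure, num_threads=4):
    """Sum measure(url) over urls, giving each thread one chunk of the list."""
    chunks = split_into_chunks(urls, num_threads)
    subtotals = [0] * num_threads  # one slot per thread, so no lock is needed

    def worker(index, chunk):
        subtotals[index] = sum(measure(url) for url in chunk)

    threads = [
        threading.Thread(target=worker, args=(i, chunk))
        for i, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every chunk to finish before adding up
    return sum(subtotals)
```

Each thread writes only to its own slot in `subtotals`, which keeps the sketch free of shared-counter bookkeeping.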

Performance improved. It took about 5 seconds to process the first 100 files (after a 3 second setup time).

Screen shot of video of the file capture process. It shows that the 100th file was measured 8 seconds into the process
First 100 files processed in about 5 seconds using four threads in the Python script

Because the code was fairly concisely written, it was easy to measure the time for 100 files using 8 and then 16 threads as well.

Eight threads took about 5 seconds (with a 2 second setup).

Screen shot of video of the file capture process using 8 threads. It shows that the 100th file was measured 8 seconds into the process
8 threads

Sixteen threads took about 2 seconds (with a 3 second setup).

Screen shot of video of the file capture process using 16 threads. It shows that the 100th file was measured 5 seconds into the process
16 threads

The processor on my computer only has sixteen logical processors, so I stopped at 16.
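A comparison like this one can be sketched with the standard library. Note that this version swaps in `concurrent.futures.ThreadPoolExecutor` rather than hand-dividing the list among threads, so it is not what the generated script did. The names `time_total_size` and `fake_measure`, the example URLs, and the 10 millisecond stand-in delay are all assumptions; the loop stops at `os.cpu_count()`, mirroring the choice to stop at the number of logical processors.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def time_total_size(urls, measure, num_threads):
    """Return (total_bytes, elapsed_seconds) using num_threads worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        total = sum(pool.map(measure, urls))
    return total, time.perf_counter() - start


def fake_measure(url):
    """Stand-in for a real download: sleep briefly, report a made-up size."""
    time.sleep(0.01)
    return 1000


if __name__ == "__main__":
    # Placeholder URLs; nothing is actually fetched in this sketch.
    urls = [f"https://example.invalid/file_{i}.pdf" for i in range(100)]
    logical = os.cpu_count() or 1  # cpu_count() can return None
    for n in (1, 4, 8, 16):
        if n > logical:
            break
        total, elapsed = time_total_size(urls, fake_measure, n)
        print(f"{n:2d} threads: {total} bytes in {elapsed:.2f} s")
```

Swapping `fake_measure` for a real size-measuring function would turn this into the same kind of 100-file timing run described above.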

The little bit of effort to ask ChatGPT to parallelize the process was more than worth it. I also now have example code available for when I want to parallelize other processes.

Update:

Spot checking the results, I got the following from my Windows directory 



vs the results from the script


The code seems to be working correctly.
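A spot check like this can also be automated once a few files are downloaded. The sketch below is an assumption about how one might do it, not part of the original script: `mismatched_sizes` is a hypothetical helper that compares sizes reported by a script against the sizes of the same files on disk.

```python
import os


def mismatched_sizes(directory, reported):
    """Compare reported sizes against files on disk.

    `reported` maps file name -> size in bytes as measured by the script.
    Returns {name: (reported_size, actual_size)} for every file that differs.
    """
    mismatches = {}
    for name, expected in reported.items():
        actual = os.path.getsize(os.path.join(directory, name))
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

An empty dictionary back means every checked file matched.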



