Posts

Showing posts with the label genai

Using Generative AI to Count Archive Files

I've wanted an excuse to try parallel processing with Python for a few months, and yesterday the FBI provided one. They released a collection of records related to the assassination of Reverend Dr. Martin Luther King Jr. It's easy enough to get a count of the PDF files released from the announcement page, which carries information about the files released to the National Archives: I was able to quickly read that there were 6,301 files. A brief internet search indicated that the files have not yet been released in any kind of compressed container, like a zip file. I also tested the search box and found that it only searches the PDF file names, not their contents. The immediate next question was: how many bytes of disk space do all the PDFs consume? I asked ChatGPT o4-mini-high to write a Python script to determine the combined size of all the files. The script was unable to determine the size of each file by sending a HEAD request to each file's URL, so it wound up having to use GET requests to m...
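The HEAD-then-GET approach described above can be sketched roughly as follows. This is not the script ChatGPT produced, just a minimal illustration using only the standard library; the function names (`size_via_head`, `size_via_get`, `file_size`, `total_size`) and the choice of a thread pool with 8 workers are my own assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def size_via_head(url, timeout=30):
    """Return the Content-Length reported for a HEAD request, or None if absent."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None


def size_via_get(url, timeout=30, chunk_size=1 << 16):
    """Download the body in chunks and count the bytes without keeping them."""
    total = 0
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def file_size(url, head=size_via_head, get=size_via_get):
    """Try the cheap HEAD request first; fall back to a full GET if it fails."""
    try:
        size = head(url)
    except OSError:  # urllib's URLError and HTTPError are OSError subclasses
        size = None
    return size if size is not None else get(url)


def total_size(urls, max_workers=8, sizer=file_size):
    """Sum the sizes of all URLs, fetching several at a time with threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(sizer, urls))
```

Threads are a reasonable fit here because the work is network-bound rather than CPU-bound. Calling `total_size(list_of_pdf_urls)` would return the combined size in bytes, where `list_of_pdf_urls` is a hypothetical list you would have to build from the archive's file listing.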

Things I Learned: Google Drive Downloads and Long Windows Paths

While doing some experiments on how someone might back up the Soldersmoke blog, I came across an interesting issue. I'd worked with ChatGPT to create a Python script that stored a post from the Soldersmoke blog in a directory. The name of the directory was the date and time of the blog post concatenated with the blog post title. In most cases, this wound up being a file name with more than 256 characters. I don't remember where I tried my first prototypes a few months ago, but whatever system I was on, it wasn't Windows, because Windows, by default, doesn't support paths longer than about 260 characters. Google Drive, however, does. As a result, when I tried to download the folders to a Windows machine, I got an error message indicating that the zipped directories from Google Drive could not be read. The simple fix, it turned out, was to change the naming scheme of the folders so that they contained only the timestamp of the post down to the second in UTC time. Wit...
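A timestamp-only naming scheme like the one described above might look something like this. It is a minimal sketch, not the script from the post: the function name `folder_name_for_post` and the compact `YYYYMMDDTHHMMSSZ` layout are assumptions, chosen because Windows does not allow colons in file or folder names.

```python
from datetime import datetime, timedelta, timezone


def folder_name_for_post(published):
    """Build a short folder name from a post's timestamp, to the second, in UTC.

    `published` must be a timezone-aware datetime so the conversion to UTC
    is unambiguous.
    """
    if published.tzinfo is None:
        raise ValueError("published must be timezone-aware")
    # No colons or spaces: both are awkward or illegal in Windows names.
    return published.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


# Example: a post published at 08:30:15 in a UTC-7 time zone.
local = datetime(2025, 7, 22, 8, 30, 15, tzinfo=timezone(timedelta(hours=-7)))
print(folder_name_for_post(local))  # prints 20250722T153015Z
```

The resulting name is always 16 characters long, so the folder name itself stays well clear of the Windows path limit no matter how long the post title is.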