Dealing with pesky government debt
In law, we’re frequently ordering and reviewing courtroom audio. In Ontario, those recordings tend to be delivered as PDF files with embedded audio.
Of course, like all government technology, the delivery format is incredibly antiquated and difficult to access, unless you happen to use exactly the same operating systems and programs as the government.
In this case, the audio stream is hidden somewhere in the PDF file, and you get various green ‘play buttons’ that are timestamped to different parts of the proceeding’s audio. Clicking these play buttons tends to work on PCs, and not so much on Macs, just like almost everything else in the frustrating world of legacy legal tech.
I remember struggling with this back when I was still in private practice and starting to daily drive a Chromebook, and now my wife has the same problem on her Mac. Basically, trying to play the embedded audio launches a prompt asking you to ‘download an external program’, which when followed leads you to a dead link on Adobe’s website. Googling the issue revealed many people struggling with it for years, and no solutions.
So I said screw it - I do computer things now - let’s figure it out.
I dove in with some Python and the PyMuPDF library, which is what I usually reach for when manipulating PDFs. I spent some time looking through the annotations for attached audio files, with no luck. Just one ‘Screen’ type annotation.
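For anyone who wants to repeat that first pass, here is a rough sketch of tallying annotation types with PyMuPDF. It is not the exact script used here; the function names `count_annotation_types` and `summarize` are made up for illustration, and the PyMuPDF import sits inside `summarize` only so the counting helper has no hard dependency of its own.

```python
from collections import Counter


def count_annotation_types(doc):
    """Tally annotation type names (e.g. 'Screen', 'FileAttachment') over all pages."""
    counts = Counter()
    for page in doc:
        for annot in page.annots():
            # In PyMuPDF, annot.type is a sequence like (number, name);
            # index 1 holds the readable name.
            counts[annot.type[1]] += 1
    return counts


def summarize(path):
    """Open a PDF with PyMuPDF and return its annotation-type tally."""
    import fitz  # PyMuPDF (hypothetical wrapper, imported lazily)

    with fitz.open(path) as doc:
        return count_annotation_types(doc)
```

Calling `summarize("hearing.pdf")` (a placeholder filename) on a file like the one described should show a lone `Screen` entry and nothing resembling an attached sound file, which is the hint that the audio lives elsewhere in the PDF's object tree.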
Fine - I dug into the settings on my wife’s Adobe Acrobat Reader, and found that the play buttons seem to be AcroForm fields triggering JavaScript. That JavaScript referenced playing an audio clip by launching an internal media player - almost all abstracted away, and so a bit of a dead end.
I realized I needed to figure out how and where the audio file was embedded, so after GitHub surfing I found these tools provided freely by a blessed gentleman by the name of Didier Stevens - what a mensch.
Using his pdf-parser command line tool and piping the gigantic output into a txt file, I was able to Ctrl-F for the audio filename referenced in the JavaScript I located earlier. Plugging that into ChatGPT helped me to decipher the various PDF objects and follow a trail, wherein each object seemed to reference and lead to another object, until I was ultimately able to find the one object of type ‘/EmbeddedFile’ - voilà!
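The same hunt can probably be shortcut from Python. The sketch below is an alternative to the pdf-parser route, not what was actually done here: it walks every object number with PyMuPDF's low-level xref methods and keeps the stream objects whose dictionary says `/Type /EmbeddedFile`. The helper names are hypothetical, and the plain-text filename search is only a best effort, since a PDF can store strings hex-encoded or as UTF-16.

```python
def find_embedded_file_xrefs(doc):
    """Return object numbers of streams declared as /Type /EmbeddedFile."""
    hits = []
    for xref in range(1, doc.xref_length()):  # object 0 is reserved
        if not doc.xref_is_stream(xref):
            continue
        # xref_get_key returns a (kind, value) pair, e.g. ("name", "/EmbeddedFile")
        kind, value = doc.xref_get_key(xref, "Type")
        if kind == "name" and value == "/EmbeddedFile":
            hits.append(xref)
    return hits


def find_objects_mentioning(doc, needle):
    """Return object numbers whose dictionary source contains `needle` as plain text."""
    return [
        xref
        for xref in range(1, doc.xref_length())
        if needle in doc.xref_object(xref)
    ]


def scan_pdf(path, needle):
    """Hypothetical wrapper: open the PDF with PyMuPDF and run both searches."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return find_embedded_file_xrefs(doc), find_objects_mentioning(doc, needle)
```

If `find_embedded_file_xrefs` returns a single number, that is the object to pull the audio from; if it returns several, the filename search helps narrow down which file specification points at which stream.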
Now it was a hop and a skip to use PyMuPDF’s .xref_stream_raw() method to load the data from the PDF object number containing the embedded file, and then write that data out to an audio file. I used .wma, because somewhere along the way I found some evidence that the original format was likely ‘Windows Media Audio’ (of course).
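In code, that last step might look roughly like the sketch below. The function names, the `decode` flag and the output filename are all illustrative assumptions. One thing worth knowing: `xref_stream_raw()` returns the bytes exactly as stored, so if the stream's dictionary lists a `/Filter` such as `/FlateDecode`, the decoded `xref_stream()` is the one to try instead. The small check against the ASF header GUID (ASF being the container that WMA files use) is an extra sanity test added here, not something from the original process.

```python
from pathlib import Path

# First 16 bytes of an ASF container (the ASF Header Object GUID).
ASF_HEADER_GUID = bytes.fromhex("3026b2758e66cf11a6d900aa0062ce6c")


def looks_like_asf(data):
    """True if the bytes start with the ASF header GUID, as WMA files do."""
    return data[:16] == ASF_HEADER_GUID


def dump_stream(doc, xref, out_path, decode=False):
    """Write the stream of object `xref` to `out_path` and return the bytes.

    decode=False uses xref_stream_raw() (bytes as stored in the PDF);
    decode=True uses xref_stream() (bytes after applying the stream's filters).
    """
    data = doc.xref_stream(xref) if decode else doc.xref_stream_raw(xref)
    if data is None:
        raise ValueError(f"object {xref} does not contain a stream")
    Path(out_path).write_bytes(data)
    return data


def extract_audio(pdf_path, xref, out_path="hearing.wma"):
    """Hypothetical wrapper: open the PDF with PyMuPDF, dump one stream, sniff it."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        data = dump_stream(doc, xref, out_path)
    return looks_like_asf(data)
```

If `extract_audio` returns `False`, the file may still be fine; it just means the bytes are not a raw ASF stream, so rerunning `dump_stream` with `decode=True` or trying a different extension would be the next thing to check.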
Boom. Audio located/extracted, and some relief for Mac users struggling with legacy methods of embedding media into PDFs.
What a thing.