Search for a Text in Multiple PDF Files in Parallel

Update 2021-01-13: Added a section on ripgrep-all, which is much faster.

I often need to search for a specific word or term within specification files in PDF format. With only a small number of PDF files, this is easy to do from a decent shell using pdfgrep.

However, with a larger set of PDF files, you may want to do it more efficiently.

pdfgrepp

For this reason I have written a tiny shell script. It is called pdfgrepp (notice the second "p" for "parallel"). I invoke the script with the search string as its only argument: pdfgrepp foobar. The script looks like this:

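#!/bin/sh
## Finds all PDF files (recursively) below the current directory and
## runs "pdfgrep" on each of them, printing file name (-H) and page number (-n).
## -print0/-0 keep odd file names intact; -q protects the search string.
find . -type f -name '*.pdf' -print0 |
  parallel -0 -k -q -n 1 pdfgrep -H -n "${1}" {}
#end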

The script locates all PDF files in the current directory and its subdirectories. Each of those PDF files is then scanned for the given word or term. The scanning is done via GNU parallel, which takes advantage of all of your CPU cores by running several pdfgrep processes at once. Because of the -k option, the results are written to the output in the same order as the files were found. Take a look at the man pages of the tools for further details.

This small script serves me well in my current business life, dealing with tons of specification files.

ripgrep-all

Andreas Voit sent a comment where he noted that he prefers ripgrep-all for the same purpose.

I just ran two quick tests. The first test consisted of searching for a term within 34 PDF files with 2204 pages in total (min: 8 pages; max: 296 pages):

command                 execution time
time pdfgrepp A_5557    3.3 seconds
time rpa A_5557         0.05 seconds

The second test used a larger set of 94 PDF files with 7907 pages in total (min: 8 pages; max: 624 pages):

command                 execution time
time pdfgrepp A_5557    1 minute 25 seconds
time rpa A_5557         42 seconds
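To put the numbers from both tests side by side, the timings can be turned into speed-up factors. This is just a small sketch of the arithmetic; the helper name speedup is made up for this illustration, and the values are the ones measured above with 1 minute 25 seconds written as 85 seconds.

```shell
#!/bin/sh
# speedup SLOW FAST: prints how many times faster FAST is than SLOW.
# Both arguments are run times in seconds. "speedup" is a hypothetical helper.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 3.3 0.05   # first test: 34 PDF files
speedup 85 42      # second test: 94 PDF files, 1 min 25 s = 85 s
```

So ripgrep-all was roughly 66 times faster in the first test and roughly 2 times faster in the second one. These are single quick runs, so treat the factors as rough indications only.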

Running the same ripgrep-all command a second time gave even faster results. Some form of caching therefore seems to work much better with ripgrep-all than with pdfgrep.

I guess this is a good argument to use ripgrep-all instead.