Making Good PDFs from Scanned Documents
This is a quick post to document my personal workflow for producing nice PDF documents from scanned images. Some of my obligations and hobbies produce paper documents with no option for digital copies, so I am left to digitize the information myself. Here’s how I do it.
Scanning
I was lucky enough to obtain an Epson WF-7610 from my local e-waste drop-off. Although I’ve never used it to print, its 1200dpi auto-document-feed (ADF) scanner is very capable. A driver is available on the AUR under epson-inkjet-printer-escpr2. I use SANE’s scanimage to invoke the scanner.
Here are some utility functions that I use in my shell (bash):
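# Set $EPSON_PATH to the SANE device name of the Epson scanner.
set_epson_path() {
    EPSON_PATH=$(scanimage -L | grep -oP "epson2:libusb:\d+:\d+")
}

# Scan every page in the ADF into numbered files (01.jpg, 02.jpg, ...).
# Usage: scan_epson {bw|gray|grey|color} [scanimage options...]
scan_epson() {
    local mode="${1,,}"   # lowercase, so "BW" and "bw" are equivalent
    shift || true         # drop the mode; the rest is passed to scanimage
    local output_format="jpeg"
    local output_extension="jpg"

    # Look up the scanner only if it has not been found yet in this shell.
    if [ -z "${EPSON_PATH}" ]; then
        set_epson_path
    fi

    # Map the short mode name onto the scanner's mode names.
    case "$mode" in
        bw)
            # Black-and-white (Lineart) scans are saved as PNG instead of JPEG.
            output_format="png"
            output_extension="png"
            mode="Lineart"
            ;;
        gray|grey)
            mode="Gray"
            ;;
        color)
            mode="Color"
            ;;
        *)
            echo "Usage: scan_epson {bw|gray|grey|color} [scanimage options...]" >&2
            return 1
            ;;
    esac

    # Scan area: left offset (-l), width (-x), and height (-y).
    # "$@" comes last so that caller-supplied options override the defaults.
    scanimage --batch="%02d.${output_extension}" --batch-start=1 \
        --format="$output_format" -d "$EPSON_PATH" \
        --source "Automatic Document Feeder" --resolution 300 \
        -l 40 -x 220 -y 279 --mode "$mode" "$@"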
}
scanimage uses the last instance of a given argument as its value, so with the scan_epson wrapper I can, for example, override the resolution with scan_epson bw --resolution 900.
I previously used a Brother printer/scanner with ADF. However, I discovered that Brother scanners read pages with an “unscannable area” of 2mm on each edge, which wasn’t acceptable for my purposes.
Photo Manipulation
From the above scan_epson function, you can see that I default to very specific values for the scan area. For my scanner, these values end up scanning standard letter-size paper. Thus I don’t typically have to crop standard pages.
For one-off visual cropping, I use the graphical image viewer nomacs (Image Lounge). For mass cropping, I use ImageMagick’s -crop and -chop options.
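A minimal sketch of what mass cropping can look like, assuming ImageMagick's mogrify is on the PATH (on some ImageMagick 7 installs it is invoked as magick mogrify instead). The helper names mm_to_px, crop_geometry and crop_scans, and the dimensions in the usage line, are made up for illustration and are not the exact commands from my workflow. The helpers convert whole-millimetre measurements into the pixel geometry that -crop expects.

```bash
# Convert whole millimetres to pixels at a given DPI, rounded to the nearest pixel.
mm_to_px() {
    local mm="$1" dpi="$2"
    echo $(( (mm * dpi * 10 + 127) / 254 ))
}

# Build an ImageMagick geometry string WxH+X+Y from millimetre values.
# Arguments: dpi width height left-offset top-offset
crop_geometry() {
    local dpi="$1" width="$2" height="$3" left="$4" top="$5"
    echo "$(mm_to_px "$width" "$dpi")x$(mm_to_px "$height" "$dpi")+$(mm_to_px "$left" "$dpi")+$(mm_to_px "$top" "$dpi")"
}

# Crop the given images in place.
# mogrify overwrites its input files, so run this on copies of the scans.
crop_scans() {
    local geometry="$1"
    shift
    mogrify -crop "$geometry" +repage "$@"
}

# Hypothetical usage: keep a 200mm x 270mm area starting 5mm in from the
# top-left corner of every 300dpi PNG in the current directory.
#   crop_scans "$(crop_geometry 300 200 270 5 5)" *.png
#
# -chop removes a strip instead of keeping a box, for example the bottom 60 rows:
#   mogrify -gravity South -chop 0x60 +repage *.png
```

The +repage after each operation resets the virtual canvas so that later tools see the cropped size rather than the original page geometry.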
I’ve had the pleasure of dealing with various kinds of physical documents, including saddle-stitched booklets, where each physical page holds two logical pages and the outermost physical page carries the first and last logical pages. I have a script that uses ImageMagick to split scanned physical pages into their logical pages in the correct order, but frankly it is not yet in a state to share.
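To illustrate the page ordering problem (this is not the script mentioned above, just a sketch), the hypothetical function booklet_page_order below prints which logical pages sit on the left and right halves of each scanned side. It assumes the booklet is scanned outermost sheet first, with the outside face of each sheet scanned before its inside face, and that every sheet carries four logical pages.

```bash
# Print "scan_index left_page right_page" for each scanned side of a
# saddle-stitched booklet made of $1 sheets (4 logical pages per sheet).
# Assumed scan order: outermost sheet first, outside face before inside face.
booklet_page_order() {
    local sheets="$1"
    local total=$(( sheets * 4 ))
    local i
    for (( i = 1; i <= sheets; i++ )); do
        # Outside face: the higher page number is on the left half.
        echo "$(( 2 * i - 1 )) $(( total - 2 * (i - 1) )) $(( 2 * i - 1 ))"
        # Inside face: the lower page number is on the left half.
        echo "$(( 2 * i )) $(( 2 * i )) $(( total - 2 * i + 1 ))"
    done
}
```

For a two-sheet (eight-page) booklet, booklet_page_order 2 reports that the first scan holds pages 8 and 1, the second holds 2 and 7, the third holds 6 and 3, and the fourth holds 4 and 5. Each scan could then be cut in half, for example with ImageMagick's -crop 50%x100% +repage, and the halves renamed according to this table.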
Making a PDF with OCR
I simply use img2pdf and ocrmypdf like so:
img2pdf *.png --pagesize=letter | ocrmypdf - out.pdf
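Since scan_epson writes JPEG files for gray and color scans and PNG files only for black-and-white ones, a small wrapper can pick up whichever is present. The sketch below is one possible shape for it. The function name make_ocr_pdf is hypothetical, and it assumes each scan batch lives in its own directory and contains only one of the two formats.

```bash
# Build an OCR'd PDF from the scanned pages in the current directory.
# Usage: make_ocr_pdf [output.pdf]
make_ocr_pdf() {
    local out="${1:-out.pdf}"
    local f
    local -a pages=()

    # Collect page images; an unmatched glob stays literal, so test for existence.
    for f in *.png *.jpg; do
        [ -e "$f" ] && pages+=("$f")
    done

    if [ "${#pages[@]}" -eq 0 ]; then
        echo "make_ocr_pdf: no .png or .jpg pages in $PWD" >&2
        return 1
    fi

    # Same pipeline as above, with the collected file list.
    img2pdf "${pages[@]}" --pagesize=letter | ocrmypdf - "$out"
}
```

Because the batch files are zero-padded (01, 02, ...), the shell's glob ordering matches the scan order for up to 99 pages.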
PDF Page Numbering and Bookmarks
This is what I mean by a “good” PDF. Logical page ranges should reflect the printed page numbers, and the table of contents should be fully populated. If you scan documents into PDFs, I beg you to not skip these steps!
I use jpdftweak to modify PDF page range settings and jpdfbookmarks to modify PDF bookmarks (index entries). jpdftweak is capable of modifying bookmarks, but I prefer the viewer in jpdfbookmarks.
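For bookmarks specifically, one command-line route that might work is pdftk, which is not part of my workflow above. This is a sketch under that assumption: pdftk's dump_data output describes each bookmark with BookmarkBegin, BookmarkTitle, BookmarkLevel and BookmarkPageNumber records, and its update_info operation can read the same records back in. The function toc_to_pdftk_bookmarks and the "level|page|title" input format are invented here for illustration.

```bash
# Convert "level|page|title" lines on stdin into pdftk bookmark records.
# "page" is the physical page index in the PDF, starting at 1.
toc_to_pdftk_bookmarks() {
    local level page title
    # The "|| [ -n ... ]" clause handles a final line with no trailing newline.
    while IFS='|' read -r level page title || [ -n "$level" ]; do
        [ -n "$level" ] || continue   # skip blank lines
        printf 'BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %s\nBookmarkPageNumber: %s\n' \
            "$title" "$level" "$page"
    done
}

# Hypothetical usage (toc.txt and the PDF names are placeholders):
#   toc_to_pdftk_bookmarks < toc.txt > bookmarks.txt
#   pdftk out.pdf update_info bookmarks.txt output out-bookmarked.pdf
```

Keeping the table of contents in a plain text file makes it easy to fix a typo and regenerate the bookmarks, although the result is still worth checking in a viewer.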
I would much prefer a simpler command-line alternative to jpdftweak for modifying page range settings. Writing this post may be the nudge I need to look for, or write, such a tool.
Bonus - Debinding
I’ve only had to do this once so far, and it was a bit of an ordeal. You can find plenty of videos on YouTube, but I recommend laying a damp towel over the binding and running an iron over it. Once the adhesive is warm enough, the jacket should come right off, and the adhesive can be removed with a knife while it is still warm. Even so, I had a painstaking time separating individual pages from each other after most of the adhesive was removed.
I strongly recommend practicing on a book you don’t care about before working on the one you intend to scan.