Digitizing All Your Paper Stuff

Updates
- 2015-06-20: Email comments from @pdfkungfoo
- 2015-11-02: Link to German DIY-scanner
- 2016-05-02: German talk about this topic is online
- 2021-10-03: VueScan

Over the previous three years, I digitized almost all of my paper documents, converted them to PDF files, ran OCR so that I can find text via full text search, and threw away the paper. New paper artifacts like official letters have been digitized since then as well. This way, I can limit new paper to an absolute minimum.

By recycling most of the digitized paper, I got rid of a huge pile of paper I had to store, move, and take care of.

However, my biggest motivation was that I had many paper artifacts which were not important enough to store but also too valuable to dump. Furthermore, digital information can be linked, searched for (even found by coincidence), backed up, and so forth. A physical document, in contrast, resides in one single place and is not worth much unless you grab it and look up the information it contains.

Initial Position

There was a wide spectrum of document types I wanted to scan. The following list contains the most important types, sorted by date: Filofax pages, school exercises, school leaving examination, educational books, lecture notes from university, many personal notes of other people and my own, technical books, novels, manuals, and so forth.

You should build three types of heaps: 1) documents you can throw away right now, 2) documents that you scan and then recycle as well, and 3) documents that you scan and still keep on paper afterward.

Reliable, working backup

I hope that you are going to find this obvious: you are going to need a bullet-proof backup system. Having lots of documents only in digital format requires a multi-stage backup system including an off-site backup in case of theft or fire. This system has to be automated as much as possible so that you are really going to do your backups.

There is this 3-2-1 rule: store at least three copies of your data, in at least two different formats, with at least one copy off-site. Three copies are not as hard to get as you might think: when you sync your data between a notebook and a desktop, you get three copies with your first (additional) backup copy. Two different formats are important to overcome the weaknesses of a single storage format, such as self-burned DVDs which get corrupted over time. The off-site copy is just as important: think of a horrendous fire at your place or a theft which results in losing all of your data from one physical space. You might also think about encrypting your backups.
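
The rule can be written down as a small check. This is only a sketch: the class and field names are made up for illustration, and "format" is interpreted here as the storage medium, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data. All field names are illustrative."""
    name: str
    medium: str    # e.g. "internal SSD", "external HDD", "DVD"
    offsite: bool  # True if stored at a different physical location


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule as described above."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.medium for copy in copies}) >= 2
    has_offsite = any(copy.offsite for copy in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical setup: notebook and desktop in sync plus one external disk.
copies = [
    BackupCopy("notebook", "internal SSD", offsite=False),
    BackupCopy("desktop", "internal SSD", offsite=False),
    BackupCopy("disk stored elsewhere", "external HDD", offsite=True),
]
print(satisfies_3_2_1(copies))
```

With only the notebook and the desktop, the check fails on all three counts, which is why the first additional backup copy matters so much.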

Test your backup and your restore process. No one wants backup, but everybody needs restore when something gets lost. So please make sure that your whole backup tool-chain is working, from the actual backup procedure through to successfully restoring a file.
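
One simple way to test a restore is to compare checksums of an original file and its restored copy. The following is a minimal sketch, not tied to any particular backup tool; the function names and the temporary example files are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large PDF files do not fill the memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """Return True if the restored file is byte-identical to the original."""
    return sha256_of(original) == sha256_of(restored)


# Demo with two temporary files standing in for an original and its restore.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.pdf"
    restored = Path(tmp) / "restored.pdf"
    original.write_bytes(b"%PDF-1.4 example content")
    restored.write_bytes(original.read_bytes())
    print(restore_matches(original, restored))
```

In practice you would restore a handful of random files from the backup into a scratch directory and run the comparison against the live copies.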

Research

Digitizing paper documents is nothing I invented - obviously. Digitizing all paper documents of a person is still unusual though.

A well documented project is the MyLifeBits project of Microsoft Research: Webpage, Wikipedia, Video. Several white papers describe various aspects of the project: 2004 and 2006. Gordon Bell and Jim Gemmell also wrote a book which was called "Total Recall". I guess for obvious copyright issues, they had to rename it to "Your Life, Uploaded". Besides many other aspects, it has got many practical tips for realizing your own digitizing project.

Unfortunately, the most important software used in the MyLifeBits project was only described in the papers and never released. It was influenced by Vannevar Bush's Memex (1945). If you are interested in this aspect, you should check out my own open software solution called Memacs which is described in this white paper.

Scanning destructive or non-destructive?

Using a scanner with an Automatic Document Feeder (ADF) requires destructive scanning: since the ADF pulls in each page separately to scan both of its sides in parallel, bindings have to be removed beforehand.

The alternative would be scanning techniques which require manual page turning. Because of the sheer amount of paper, I had to choose destructive scanning. I recycled most scanned documents. So destroying books was not a big issue in most cases.

Target file format

There were more or less three different target file formats to choose from: image files, DjVu, or PDF files.

Most scanned pages could be easily post-processed with Optical character recognition (OCR). Therefore, image files are not that great since I wanted to keep the original information (scanned image) very close to the computer-readable information (result of OCR).

DjVu seems to be a perfect choice. It is an open format, specialized for scanning results. However, it is not widespread and not every computer is able to display DjVu documents. As a matter of fact, I personally have never come across a single DjVu file so far.

A couple of years ago I ripped all of my audio CDs into computer files. Being a strong supporter of open formats, I chose OGG as a container format and Vorbis as the audio compression codec. After this time-consuming ripping process I realized that I had made a mistake. OGG Vorbis is a fine format but not all computers or mobile audio devices could handle my music files. So I had to rip my whole CD collection once more in order to get standard MP3 files.

Therefore, I decided not to use the open DjVu file format as my target file format. I chose ordinary PDF files which can be viewed and processed with a large number of software tools on any platform. OCR tools can add a transparent layer on top of the scanned image which holds the result of the OCR process. Those sandwich documents are indexed by any decent desktop search engine, and search results can be highlighted within the files.

GNU/Linux and OCR (2011-09)

Disclaimer: the following section describes my personal experience with GNU/Linux and OCR back in September 2011. Since then, the mentioned OCR tools might have improved a lot. I did not investigate the current situation; I only describe the situation I was in back in 2011.

My main system for productive working was and is GNU/Linux. Debian GNU/Linux, to be precise. Therefore, I wanted to come up with a work-flow which could be used in my GNU/Linux environment right away.

To my surprise, there was a large number of GNU/Linux OCR tools. I tested only the most recommended ones for basic work-flows. My requirements were that sandwich PDF files should contain the scanned images as well as the recognized text as overlay layers. When I use full text search within a PDF file, I want to get to the search hits.

The most important building block I tested was hOCR. It's an open standard for OCR results, and pretty much every GNU/Linux OCR tool uses hOCR as its intermediate result format. If you've got a scanned image and the matching hOCR file, you can use hocr2pdf to generate the PDF sandwich file. hocr2pdf does not do the recognition itself, so you need other tools for the OCR process.

OCRopus, Tesseract, CuneiForm are open source OCR tools. For processing scanned image files to get a better OCR result, you can use ExactImage.
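
The two-step work-flow (OCR to hOCR, then hOCR plus image to PDF) can be sketched as follows. This is not the exact setup tested in 2011. It assumes that tesseract and hocr2pdf (from ExactImage) are installed, that Tesseract writes its hOCR result to a file ending in .hocr (older releases used .html instead), and that hocr2pdf reads the hOCR data from standard input. The function names are made up for illustration.

```python
import shlex
import subprocess
from pathlib import Path


def sandwich_commands(image, workdir="."):
    """Build the two commands for one scanned page (nothing is executed)."""
    image = Path(image)
    base = Path(workdir) / image.stem
    # Step 1: OCR the image and write the result as hOCR.
    ocr_cmd = ["tesseract", str(image), str(base), "hocr"]
    # Step 2: combine image and hOCR into a PDF; hOCR is fed via stdin.
    pdf_cmd = ["hocr2pdf", "-i", str(image), "-o", f"{base}.pdf"]
    return ocr_cmd, pdf_cmd, f"{base}.hocr"


def make_sandwich(image, workdir="."):
    """Run both steps; requires the external tools to be installed."""
    ocr_cmd, pdf_cmd, hocr_path = sandwich_commands(image, workdir)
    subprocess.run(ocr_cmd, check=True)
    with open(hocr_path, "rb") as hocr:
        subprocess.run(pdf_cmd, stdin=hocr, check=True)


# Only print the commands here so the sketch runs without the tools.
for command in sandwich_commands("page001.png")[:2]:
    print(shlex.join(command))
```

Whether the text layer of the result lines up with the scanned image is exactly the point that has to be checked by hand afterwards.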

Sadly, I was not able to get decent PDF results which satisfied my requirements. The most relevant issue was that the invisible text layer containing the words from the OCR results did not match the layout of the scanned image. The fonts differed in size and therefore any full text search match could not be highlighted properly. I tested various settings of hocr2pdf but any time I generated the PDF, the sandwich layer was either way too small or so huge that it exceeded the page by far. Therefore, no copy and paste was possible in a reasonable manner. Search results looked so weird that I had to read whole pages in order to find the true location of the phrase I was looking for.

Furthermore, I had to face software crashes with core dumps during OCR processing. Certain files could not be processed at all for this reason.

I was very disappointed. The GNU/Linux OCR results varied between bad and catastrophic.

Therefore, I had to look for software work-flows which are part of the software package delivered by the scanner manufacturer.

Hardware

Types of scanning hardware

You can get digital representations of paper in numerous ways. I mention a couple of them sorted by their practical suitability.

For a few pages, a digital camera with at least three megapixels might suit you well. This method requires good light and you have to flatten the pages as much as possible, which might be hard to achieve with books.

Hand document scanners produce better output since they usually have their own scanner light built-in. You have to move the scanner by hand at constant speed over the page. Most hand scanners I know of are not able to scan A4 pages in one pass since they are much narrower. In my opinion, the biggest advantage of this type of scanner is its form factor for mobile usage - probably also powered by batteries.

Flatbed scanners are the most used scanner type these days. When scanning a large number of pages of a book, you have to position the pages on the scanning area precisely. In most cases, you have to do post-processing: de-skewing, cropping, and so forth.

In recent years, multi-function printers got quite common (again). Combined with an ink-jet printer, you get a copy machine for small money. In my opinion, most multi-function printers have either a poor scanner or a poor printer or both - you seldom get two great sub-devices in one multi-function printer.

The professional market is addressed via photocopy machines. These high-priced multi-function devices are made to last and are made for high throughput. As long as they don't include horrible software flaws which turn all scanned documents into unreliable junk, you are probably fine.

And then there is this special class called document scanner with automatic document feeder (ADF). Photocopy machines as well as some multi-function printers have an ADF, too. Document scanners are specialized for one purpose: scanning large amounts of separated paper pages. Most have two separate scanner units in order to scan the front and the back side of a sheet in parallel.

Scanners with ADF units require destructive scanning for all documents that do not consist of already separated pages.

Furthermore, there are other scanners like the Bookeye 4, which are special-purpose scanners for non-destructive, minimal-damage scans of books. You will probably find such devices in public libraries as well. I have not had the pleasure of testing those scanners yet.

HP OfficeJet Pro 8500A Plus

Since I owned neither a printer nor a scanner, I had to decide which device to buy.

Back then, HP launched a new multi-function printer which was praised in reviews for its printing speed and a cost per printed page that was said to be comparable to laser printers. However, scanning units are not tested as thoroughly as printing units. The ADF as well as the printing unit is full duplex: printing and scanning both sides of a sheet without re-inserting it manually is a big win.

Despite my distrust in ink-jet printers, I was tempted by this HP OfficeJet. The cheap printing costs, flatbed scanner with ADF combined with a reasonable price for a device which targets the small office market - my decision was made.

After I received the printer, I was eager to scan my first documents.

For my purpose, I did not need the flatbed scanner. The most important unit of the HP was the ADF unit.

How disappointing.

The ADF unit was very slow and error-prone. Scanning both sides of a page took about twenty seconds, if I remember correctly. And if the page was slightly damaged, chances were high that the ADF would destroy it completely. The HP software for OS X had a poor user experience. In total, my scanning project could not be done with this device.

Fujitsu ScanSnap S1500M

After the disaster with the HP printer, I had to look for another device. I had read a lot of positive opinions on document scanners from Fujitsu when I was doing research before buying the HP device. However, five hundred Euro for a scanner only was too expensive for me back then.

I once again read many reviews and comments about the Fujitsu ScanSnap S1500M. The 'M' is for Mac OS X which describes the differences in the software bundle.

The tech specs are marvelous: twenty pages per minute in simplex mode, forty pages in duplex mode. You can scan paper from business card format up to A3 size (via folding).

The resale price is high, so I could sell it again if it did not suit my requirements or once I had finished my first huge batch of scanning. Accepting the high price tag and still being skeptical, I ordered the document scanner.

What a difference!

Its OS X software bundle is a big win. Being almost perfect, it supports my scanning process very well. I can define independent scanning profiles for various situations. The software automatically de-skews and rotates pages and detects color/gray-scale and paper sizes. The ADF is working perfectly. Paper jams or multiple pages being fed at once happen very seldom. And in case of an ADF error, it is detected every time using multiple sensors. Empty pages are detected and ignored as well, if you want.

The ABBYY FineReader software is doing a great OCR job. The resulting PDF files match my vision.

However, there are also a few negative aspects. While the OCR process is running - which takes approximately one to five seconds per page - the scanner is blocked as well. I noticed that the scanner is using the lossy JPEG format as its internal format. Even when I choose to scan into lossless PNG files, I recognize JPEG artifacts in the result. Fujitsu should have chosen a lossless intermediate format for the raw scanning results.

Overall, the scanner is almost perfect. Scanning huge amounts of pages is fun. If you have got separated pages (destructive scanning), you just have to feed it batches of paper. It works very fast and reliably.
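
To get a feeling for how long the OCR step blocks the scanner, the one to five seconds per page can be multiplied out. This is a back-of-the-envelope sketch; the batch size of 300 pages is a made-up example and the function name is hypothetical.

```python
def ocr_blocking_minutes(pages, min_seconds=1.0, max_seconds=5.0):
    """Return (lowest, highest) number of minutes OCR blocks the scanner."""
    return pages * min_seconds / 60, pages * max_seconds / 60


low, high = ocr_blocking_minutes(300)
print(f"{low:.0f} to {high:.0f} minutes")  # prints "5 to 25 minutes"
```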

I am so not going to re-sell this great device - it's a keeper. :-)

The ScanSnap S1500M is discontinued. Its successors should be fine as well, I guess. And there is even a mobile model, the Fujitsu iX100.

Paper pre-processing

Because of the ADF of the ScanSnap, I have to remove paper clips and staples.

When I want to scan a book, I have to remove the spine of the book. For small books, I can remove the spine with a utility knife and a metal ruler. For thicker books, I used a jigsaw, a buzz saw with a thin blade, or a plane. Each worked out fine.

Current status

So far, I scanned over 40,000 pages. I processed almost all of my old paper documents. Paper from school and university is done as well as many books I read.

What is left are documents which are tricky to scan and analog photographs. I have to admit that the document scanner is not the device of choice for scanning photographs. The ADF produces small stripes which are hardly visible on white paper but are annoying on scanned photographs. I have not found a solution for photographs yet.

Lessons learned

What can I recommend for starting a digitizing project of your own?

Well, first of all: do everything that makes the digitizing process as smooth as possible. Reliable hardware and software which reduce post-processing steps to an absolute minimum make the task way easier.

It is perfectly OK to begin with easy-to-process documents to get a fast feeling of success. However, before that, you should define the requirements your process should be able to handle. Furthermore, prepare a small set of different documents to test your setup: different paper sizes, paper thickness, shapes, and paper conditions.

Once you have digitized a document, mark it as done. I put a small dot on the upper right corner of the first page. This way, I don't lose track in a huge paper chaos on my desk.

Develop a decent file name convention. You can come up with simple names or complicated ones. I started to develop my system which handles date-stamps, time-stamps, file names, tagging, and much more. With my set of helper scripts, my digital life got better and better. You should check this out.
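
As an illustration of what such a convention could look like, here is a minimal sketch that builds file names from a date-stamp, a description, and tags. The exact scheme (ISO date first, tags after " -- ") and the function name are assumptions made for this example and not necessarily the convention of the helper scripts mentioned above.

```python
import re
from datetime import date


def scan_filename(day, description, tags=(), extension="pdf"):
    """Build 'YYYY-MM-DD description -- tag1 tag2.pdf' (hypothetical scheme)."""
    # Drop characters that tend to cause trouble in file names.
    clean = re.sub(r"[^\w\- ]+", "", description).strip()
    clean = re.sub(r"\s+", " ", clean)
    name = f"{day.isoformat()} {clean}"
    if tags:
        name += " -- " + " ".join(sorted(tags))
    return f"{name}.{extension}"


print(scan_filename(date(2015, 4, 1), "Letter from the tax office", ["tax", "official"]))
# prints "2015-04-01 Letter from the tax office -- official tax.pdf"
```

Putting the ISO date first means that a plain alphabetical listing is also sorted chronologically.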

Some links

Here are some random links which contain experiences of other people.

Here is an article about scanning and OCR like a pro with open source tools, in case you are looking for GNU/Linux tips. You will also find unpaper quite handy since it cleans up scans for better OCR results. An Ubuntu forum lists resources for OCR. Here is also an askubuntu page with tips. People have reported getting good results using scanadf from SANE with convert from ImageMagick for post-processing.

You can find a bunch of DIY book scanner projects on YouTube. A Google employee created a DIY scanner which even turns pages using a vacuum cleaner. A German DIY-scanner is described here.

There's an interesting episode of a podcast which gives good tips.

The Fujitsu scanners seem to have a community page.

I spoke about this topic at the Grazer Linuxtage on April 25th, 2015. The German video is available online.

Feel free to ask me anything in the comments or write me an email (see links below).

Final thoughts

Having all paper documents as digital PDF files is great. Besides more free space in the closet, I now can type any phrase in my desktop search and get material from school, university, books I've read years ago, or magazine articles about a topic. This is an awesome thing to have.

Information at your fingertips.

Email Comments

After my presentation of this topic at GLT15 @pdfkungfoo wrote me a list of seven questions I am going to answer:

The following questions appeared during the Q+A session of your GLT15 talk. Some of them came to (my) mind after your session only:
1. What is the typical file size of a scanned PDF page with the scanner settings you use? (Assume text-only content.)
2. What is the scan resolution you picked for your scanner settings?

The scan profiles I use most of the time are:

1page
- Scan to Folder
- Automatic resolution
- Auto color detection
- Simplex Scan (Single sided)
- Allow automatic blank page removal
- Allow automatic image rotation
- File format PDF
- Multipage PDF (whole batch in one PDF)
- Automatic paper size detection
- Check Overlapping via Ultrasonic
- Compression Rate: 4 (between middle 3 and max 5)
2page (similar profile)
- Same as 1page but with Duplex Scan
1page, separated
- One PDF per page
book
- Similar to 2page but with Compression Rate 5
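
The way the profiles build on each other can be written down as a small data structure. This is only a sketch: the field names are made up and have nothing to do with how the ScanSnap software stores its settings, and it assumes that "1page, separated" is otherwise identical to "1page".

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScanProfile:
    """Illustrative model of a scan profile; all field names are hypothetical."""
    name: str
    duplex: bool = False                   # False = simplex (single sided)
    resolution: str = "auto"
    color: str = "auto"
    remove_blank_pages: bool = True
    auto_rotate: bool = True
    file_format: str = "pdf"
    one_pdf_per_page: bool = False         # False = whole batch in one PDF
    paper_size: str = "auto"
    ultrasonic_overlap_check: bool = True
    compression_rate: int = 4              # 3 is the middle setting, 5 the maximum


one_page = ScanProfile(name="1page")
two_page = replace(one_page, name="2page", duplex=True)
one_page_separated = replace(one_page, name="1page, separated", one_pdf_per_page=True)
book = replace(two_page, name="book", compression_rate=5)

for profile in (one_page, two_page, one_page_separated, book):
    print(profile)
```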

Example documents and their file size:

Profile	Document	Paper format	Colour	Pages	Bytes	Bytes/Page
2page	Letter with small logo on first page	A4	B/W	2	540000	270000
1page	Lung status letter with 3 small graphics	A4	B/W	1	171000	171000
1page	Recipe with coloured image	A4	Yes	1	206000	206000
2page	Lecture notes with drawings	A4+	Mixed	296	20674186	69845
book	Advanced Engineering Mathematics 1993	US?	Blue/Black	1436	204167585	142178
book	Operating Systems 3rd edition	US?	B/W	812	111106417	136831
book	Michael Crichton: Airframe (paperback)	US?	Yes[1]	448	56067954	125152
book	Bugs in Writing, 1st edition	US?	mostly B/W	690	74886277	108531

[1] It's black ink on a slightly coloured and structured paper which was scanned with colour.

3. What is the overall space consumed by your 40,000 scanned pages (which obviously do include pages with images etc.) ?

That's not easy to determine any more because I have different filename conventions for books and for other files. When I leave out the books I scanned, I get the following numbers:

What	Value	Notes
number of files	3204	without books
overall size	7535685008	bytes; that's about 7.5 GB (7.0 GiB)
average byte/file	2351962
number of pages	36545
average byte/page	206203
pages/file	11.4
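
The derived values in this table follow directly from the three measured numbers. A short sketch to reproduce them, using only the figures given above:

```python
total_bytes = 7_535_685_008
files = 3204
pages = 36_545

bytes_per_file = round(total_bytes / files)
bytes_per_page = round(total_bytes / pages)
pages_per_file = round(pages / files, 1)

print(bytes_per_file)   # 2351962
print(bytes_per_page)   # 206203
print(pages_per_file)   # 11.4
# Decimal gigabytes versus binary gibibytes:
print(f"{total_bytes / 1e9:.2f} GB = {total_bytes / 2**30:.2f} GiB")  # 7.54 GB = 7.02 GiB
```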

4. Since Abbyy also offers a command line interface for its OCR, which also works on Linux -- did you ever test it? If so, what was your experience?

No, I did not use Abbyy on GNU/Linux or with any command line interface. On OS X, Abbyy is well hidden within the software package of Fujitsu. You can't even use Abbyy on an arbitrary PDF file independently of the scanning process of the Fujitsu.

5. Does your scanner software use JBIG2 compression (even if occasionally, for some of the scans)? Background to this question: - https://twitter.com/pdfkungfoo/status/586637635520700418 - https://twitter.com/pdfkungfoo/status/587228501280952321

I have no idea.

All I know is that the internal format of all scanned pages, before they are processed to PDF (or not), is JPEG - very unfortunately.

6. What is the typical compression factor achieved for the scanned images in your PDFs? (You can check that with `pdfimages -list` -- look at the "ratio" column.)
7. Can you provide the complete outcome of `pdfimages -list` for one ~10-page document you recently scanned, with some text-only pages, some color-image-only-pages and some mixed text/image pages?

I can't determine it for all my PDF files, because pdfimages (from 2007) on my OS X machine holding all PDFs does not provide the option -list. Once I have migrated my main system to GNU/Linux, I'll probably come back to this.

pdfimages version 0.18.4 on my Debian GNU/Linux wheezy machine does not provide -list either. Therefore, I can't give you this for a selection of PDF files. Let's see if I can determine it after upgrading to Debian Jessie.
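
With a pdfimages version that does support -list, the ratio column could be collected with a few lines of Python. This is a sketch under assumptions: it expects a table with a header line, a separator line of dashes, and the ratio as the last column ending in a percent sign. The sample text below is made up to show the assumed layout and is not real output of one of the scanned files.

```python
import subprocess


def parse_ratios(listing):
    """Extract (page, ratio in percent) pairs from `pdfimages -list` text."""
    ratios = []
    in_body = False
    for line in listing.splitlines():
        if line.startswith("---"):
            in_body = True  # rows start after the separator line
            continue
        fields = line.split()
        if not in_body or not fields or not fields[-1].endswith("%"):
            continue
        ratios.append((int(fields[0]), float(fields[-1].rstrip("%"))))
    return ratios


def list_ratios(pdf_path):
    """Run pdfimages on a file; requires a version that knows -list."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        check=True, capture_output=True, text=True,
    )
    return parse_ratios(result.stdout)


# Made-up sample in the assumed layout, for demonstration only.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1653  2338  gray    1   8  jpeg   no         7  0   200   200  203K 5.4%
   2     1 image    1653  2338  rgb     3   8  jpeg   no        12  0   200   200 1358K  12%
"""
print(parse_ratios(SAMPLE))
```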

2021-10-03: VueScan

As of 2021-10, I'm still using the workflow from back then. However, I got rid of Apple products altogether and therefore switched to GNU/Linux, running a Windows VM with the Fujitsu software for scanning.

Hamrick Software contacted me because of my article and they let me test their VueScan Linux software for free. It really looks great and it is definitely a huge improvement over the situation ten years ago when I started this project of mine, looking for the appropriate hardware and software support with GNU/Linux.

VueScan seems to be the way to go on GNU/Linux systems when it comes to automate or semi-automate scanning processes.

It still lacks support for some features of my Fujitsu ScanSnap S1500M, such as recognizing paper jams with 100 percent reliability, auto-selecting paper sizes on a per-page basis, and so forth. They're constantly improving their software and I'm going to watch this process closely. When the support for my ADF scanner reaches a certain degree, I'm going to switch to VueScan fully. For most users without the need for specific features, VueScan might do the trick already. I'd say you should think of spending the money on their product since it's really doing a great job compared to the DIY workflows you would need to create yourself otherwise.
