Digitize a Book With Linux

by aDimWit in Living > Life Hacks

I've tried several methods for digitizing books, and this is the least stressful. The Linux command line handles most of the tedious work for you.

Aside from scanning the entire book (you have to do that yourself), Linux will split the pages, combine them into a single file, and even compress the finished file or convert it to black and white. Before I found this method, I had to do everything in a photo editor, and then combine it all to PDF in MS Word. That was a nightmare.

Most of the problems will be user-related errors like skipping a page while scanning, mis-naming files, or deleting files accidentally. I've performed this method a few times now and have gotten the work time down to one or two hours.

Get a Scanner

There are a large number of scanners available on the mass market, but I'm not entirely sure which of them are compatible with Linux operating systems. Therefore, for the next section only, I had to use Windows to scan all the pages.

I used an Epson CX9400 scanner/fax for this project. The software takes most of the work out of it. You can edit the prefixes of the files; each file is named by progressive numbering; you can choose to save sections of the scanned image; you can change the image resolution, color, etc. I highly recommend finding a scanner with similar functions.

Scan the Book

This will be the most tedious part of the project. You are going to scan the book cover to cover.

The default resolution for many scanners is 75 DPI, which is close to the resolution of most computer monitors, but most of the time I use 300 DPI.
Note: If you plan to run your scanned images through an OCR program to extract the text, it's best to use 300 DPI. This will give you a more accurate text output.
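
To get a feel for what those numbers mean, here is a rough sketch of the arithmetic: pixels = inches x DPI. The function name pixels_for and the letter-size page (8.5 x 11 inches) are my own assumptions for illustration. The shell only does whole-number math, so sizes are given in tenths of an inch and results are rounded down.

```shell
# pixels_for: print the pixel dimensions of a scan (hypothetical helper).
# Arguments: width and height in tenths of an inch, then the DPI.
pixels_for() {
    w_tenths=$1
    h_tenths=$2
    dpi=$3
    # inches x DPI = pixels; divide by 10 because the sizes are in tenths
    echo "$(( w_tenths * dpi / 10 ))x$(( h_tenths * dpi / 10 ))"
}

pixels_for 85 110 75     # letter-size page at 75 DPI  -> 637x825
pixels_for 85 110 300    # same page at 300 DPI        -> 2550x3300
```

A 300 DPI scan has four times as many pixels along each edge as a 75 DPI scan, so about sixteen times as many pixels overall. That is why the 300 DPI files are so much bigger.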

Lay the open book on top of the scanner with the pages to the glass. Make sure it is lying in the right direction. If the book is too big, you can rotate it sideways.

There is usually an edging on the sides of the scanning screen. You can align the pages with these edgings so the scanner makes straight copies.

To get a crisp, clean scan of the book pages, you are going to have to press down on the book. Don't press too hard or you will break the glass.

You are going to repeat these steps at least a hundred times. Once you are done scanning, skim through all the saved files to make sure you didn't make any mistakes (like skipping a couple of pages).
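
If you do this part on the Linux side, a quick check for gaps in the numbering can help. This is only a sketch: it assumes the scans are named scan_001.jpg, scan_002.jpg and so on (a made-up prefix, use whatever you told your scanner software), and the function name check_scans is my own. It will catch a file that was deleted or misnamed, but it cannot tell that you skipped a page while scanning, so you still have to skim the images.

```shell
# check_scans: report which numbered scans are missing (hypothetical helper).
# Argument: the number of the last scan you expect to find.
# Assumes files named scan_001.jpg, scan_002.jpg, ... in the current folder.
check_scans() {
    last=$1
    missing=0
    i=1
    while [ "$i" -le "$last" ]; do
        f=$(printf 'scan_%03d.jpg' "$i")
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    echo "$missing file(s) missing"
}
```

For a book that took 150 scans, you would run "check_scans 150" in the folder that holds the images.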

Split the Page Files

If you had to rotate the book in the last step to get a complete scan, the pages will appear sideways on your computer. Most computers can now rotate these images without opening any photo editing software. Use that to rotate each image to the correct orientation (i.e., left page at left, right page at right, words running left to right, etc.).
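
If you would rather stay in the terminal, ImageMagick's mogrify command can do the same rotation in bulk. This is only a sketch: the scan_ prefix and the function name rotate_scans are assumptions of mine, and whether you need 90 or 270 degrees depends on which way you turned the book. mogrify overwrites the files in place, so try it on copies first.

```shell
# rotate_scans: rotate every scan in the current folder (hypothetical helper).
# Argument: degrees to rotate clockwise (defaults to 90).
# WARNING: mogrify overwrites the original files.
rotate_scans() {
    degrees="${1:-90}"
    for scan in scan_*.jpg; do
        # skip the loop body if no file matched the pattern
        [ -e "$scan" ] || continue
        mogrify -rotate "$degrees" "$scan"
    done
}
```

Running "rotate_scans 270" would turn every matching image three quarter-turns clockwise.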

Since we scanned both pages of the book at once, we will need to split them.

Open the Terminal in Linux. For this section, I am using a Linux Mint 9 operating system.

Type:
convert 'filepath.jpg' -crop 50%x100% +repage outputname_%d.jpg

"50%x100%" splits the image down the center into left and right halves. If you want to split it into top and bottom halves instead, type in "100%x50%".

"filepath.jpg" is the file's path and name. You can drag the file (or multiple files) into the terminal to avoid typing out everything.

"+repage" discards the leftover canvas offset information so each half is saved as a plain, standalone image.

"outputname_%d.jpg" is the name you want for the output files. The "%d" adds progressive numbering to the name, so when the file is processed you will have two files named "outputname_0.jpg" and "outputname_1.jpg".

You can split multiple files by dragging and dropping them into the terminal. But they will be processed all at once, not one at a time, so the job takes a lot of processing power. I recommend splitting 5 scanned images at a time (i.e., splitting them into ten pages).
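
Another way around the processing load is to let the shell feed convert one scan at a time. The sketch below is not what I originally did; it assumes the scans are named with a made-up scan_ prefix and zero-padded numbers (scan_001.jpg, scan_002.jpg, ...), and it writes the halves into a hypothetical pages/ folder. The function name split_scans is also my own.

```shell
# split_scans: split each two-page scan into left and right halves,
# one file at a time (hypothetical helper).
# Assumes scans named scan_001.jpg, scan_002.jpg, ... in the current folder.
split_scans() {
    mkdir -p pages
    for scan in scan_*.jpg; do
        # skip the loop body if no file matched the pattern
        [ -e "$scan" ] || continue
        base="${scan%.jpg}"
        # %d becomes 0 for the left half and 1 for the right half
        convert "$scan" -crop 50%x100% +repage "pages/${base}_%d.jpg"
    done
}
```

After running "split_scans", scan_001.jpg should end up as pages/scan_001_0.jpg and pages/scan_001_1.jpg. Because the scan number is zero-padded, the files sort in reading order.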

Combine to PDF

Once we have all the images split and in order, we are going to combine them all into one PDF file.

When you have to type the file's name, just drag and drop all the files. Remember, they all have to be in exact order for this to work right.

In the terminal type:
convert -adjoin 'filename_0.jpg' 'filename_1.jpg' outputname.pdf

"-adjoin" tells convert to write all of the input images into a single output file, one image per page, instead of merging them into one page.

"filename_0.jpg" is the name of a file you split. Remember, you can drag and drop multiple (preferably all) of the files into the terminal for this step.

Now, you have to wait. You have to combine all the files at once into one PDF file. This will take a long time. I think it took about 20 minutes to process a 300-page book.

I recommend doing this all at once. That is, don't combine the first 100 pages into a PDF and then add another hundred pages to that PDF. This will bring down the quality a lot.

Once the files are done processing, the completed PDF file will be found in the folder the terminal was working in, which is the home folder unless you changed it. Open it and look through it to ensure all the pages are in order.
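
If dragging hundreds of files sounds error-prone, the shell can list them for you. This is a sketch under a couple of assumptions: the split pages sit in a hypothetical pages/ folder, and their names are zero-padded (for example scan_001_0.jpg) so that alphabetical order is the same as page order. With plain numbering, a name ending in _10 sorts before one ending in _2 and the pages come out scrambled. The function name combine_pages is my own.

```shell
# combine_pages: join every image in pages/ into one PDF (hypothetical helper).
# Argument: output file name (defaults to book.pdf).
combine_pages() {
    out="${1:-book.pdf}"
    # The shell expands the pattern in alphabetical order.
    set -- pages/*.jpg
    if [ ! -e "$1" ]; then
        echo "no page images found in pages/" >&2
        return 1
    fi
    # Print the order first so a misnamed file is easy to spot.
    printf '%s\n' "$@"
    convert -adjoin "$@" "$out"
}
```

Running "combine_pages mybook.pdf" prints the page order and then builds the PDF in one pass.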

Note on Copyrights

I'm going to lay it out simply.

In the United States, anything published before 1923 is in the Public Domain.

Anything published after 1964 is almost certainly still Copyright protected.

Between those years, there is a grey area. Some of the stuff is still under Copyright protection while some of it is in the Public Domain.

During that time frame, the authors of these published books had to file a Copyright Renewal to the Copyright Office to ensure their works continued getting protection. If they didn't file a renewal, they lost their protection and the works went into the Public Domain. Because of this renewal procedure, a very large majority of the books published in that time have never received renewed protection. Therefore, there are a lot of books out there in the Public Domain.

To find out for sure which is Copyrighted and which isn't, you have to look at the Copyright records and find out if a renewal was filed for that book. Thankfully, Stanford University has compiled a Copyright Renewal Database. It is entirely searchable and the rate of error is less than 1%. Any book from that period not found in the database is most likely Public Domain.

In the case that the book is Copyrighted, remember that no one but the Copyright holder (e.g., the author) has the right to reproduce it. The only circumstance where it is all right to reproduce such a book is if the Copyright holder gave you direct permission.

Notes on This Method

One of the problems with this method is the fact that the final PDF files are very large. There are two ways to fix this.

Compress the file. I found this tool at Web UpD8 to be extremely helpful and simple.

Another way is to convert the file to Black and White. To do this, you will need to use the convert function in the Terminal.

Type:
convert 'filepath.jpg' -monochrome outputname.jpg

I recommend converting all the split files to B/W and then combining them into a PDF file. If the complete (colored) PDF file is converted to B/W instead, the quality will degrade.
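
Converting the split files one by one gets old fast, so here is a sketch of a loop that does it. It assumes the split pages are in a hypothetical pages/ folder and writes the B/W copies to a separate pages_bw/ folder so the originals stay untouched. The function name to_monochrome is my own.

```shell
# to_monochrome: make a black-and-white copy of every page image
# (hypothetical helper). Originals in pages/ are left alone.
to_monochrome() {
    mkdir -p pages_bw
    for page in pages/*.jpg; do
        # skip the loop body if no file matched the pattern
        [ -e "$page" ] || continue
        convert "$page" -monochrome "pages_bw/$(basename "$page")"
    done
}
```

Once "to_monochrome" finishes, combine the files in pages_bw/ into the PDF instead of the colored ones.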

Finished. What Now?

If the book is for private use, you can read it on your computer or iPod, or Nook, or whatever. But that really doesn't make much sense since you have a physical copy on hand.

If you are sure the book is Public Domain, you can try uploading it to the internet. Increasingly, libraries and universities are digitizing their Public Domain collections for the general public. If you have such a book, and have digitized it, why not upload it to the Internet Archive?