OCRmyPDF is a free and open source OCR (optical character recognition) application for Linux. It is written in Python and released under the GNU General Public License v3.0. It adds an OCR text layer to scanned PDF files, so you can search the text and copy and paste it. In other words, it converts a scanned PDF into a text-searchable PDF. Its features include keeping the original images at their exact resolution in the output and validating both the input and output PDF files. It uses the Tesseract OCR engine to recognize text, and it supports more than 100 languages.
Install OCRmyPDF on Ubuntu
You can install OCRmyPDF on Ubuntu with the commands below. Open your terminal application (Ctrl+Alt+T) and run them.
sudo apt update
sudo apt install ocrmypdf
Enter your Ubuntu user password if prompted.
Install additional language packs:
In the terminal, run this command to list all available Tesseract language packs.
apt-cache search tesseract-ocr
Pick the language you need from the list. For example, to install the Tamil language pack, run this command.
sudo apt install tesseract-ocr-tam
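An installed language pack is only used when you ask for it. OCRmyPDF takes Tesseract language codes through its -l (--language) option, and several codes can be joined with a plus sign. The following is a minimal sketch: join_langs is a hypothetical helper written for this example, and scanned.pdf and output.pdf are placeholder file names.

```shell
# Hypothetical helper: join language codes with "+" (tam eng -> tam+eng).
# The function body runs in a subshell, so changing IFS does not leak out.
join_langs() (
    IFS='+'
    printf '%s\n' "$*"
)

# Run OCR with Tamil and English, but only if ocrmypdf is installed
# and the placeholder input file actually exists.
if command -v ocrmypdf >/dev/null 2>&1 && [ -f scanned.pdf ]; then
    ocrmypdf -l "$(join_langs tam eng)" scanned.pdf output.pdf
fi
```

For a single language the helper is not needed; passing -l tam directly does the same thing.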
Convert Scanned PDF to Text Searchable PDF:
Syntax:
ocrmypdf input.pdf output.pdf
Replace input.pdf with your scanned file name and output.pdf with your new file name.
Example: First go to the folder that contains your scanned PDF. If the file is in your Downloads folder, run this in the terminal.
cd Downloads
Then run
ocrmypdf scanned.pdf newgeneratedfilename.pdf
Here “scanned.pdf” is my PDF file in the Downloads folder. After you run this command, it creates a new file named “newgeneratedfilename.pdf” in the same Downloads folder. Open the new file: its text is now searchable, and you can copy and paste it.
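If a folder holds several scanned files, the same command can be repeated in a loop. This is only a sketch under assumptions: ocr_output_name and ocr_folder are hypothetical helper names, and the _ocr.pdf suffix is an arbitrary naming choice made for this example, not something OCRmyPDF requires.

```shell
# Hypothetical helper: derive the output name (scanned.pdf -> scanned_ocr.pdf).
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: OCR every PDF in a folder (default: current folder).
# The body runs in a subshell so its variables do not leak out.
ocr_folder() (
    dir="${1:-.}"
    for f in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the folder has no PDF files.
        [ -e "$f" ] || continue
        # Skip files that a previous run of this loop already produced.
        case "$f" in *_ocr.pdf) continue ;; esac
        ocrmypdf "$f" "$(ocr_output_name "$f")" || echo "OCR failed for $f" >&2
    done
)
```

After defining both functions, running ocr_folder ~/Downloads would process each PDF in the Downloads folder and write the results next to the originals.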
OCRmyPDF Usage:
For complete usage details, run this command in the terminal.
ocrmypdf -h
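The help output lists many options. As a rough illustration, the sketch below wraps two commonly used combinations in small functions. The function names ocr_straighten and ocr_skip_text are hypothetical and exist only for this example; the flags themselves (--deskew, --rotate-pages, --skip-text) are OCRmyPDF options, and the help output for your installed version is the place to confirm them.

```shell
# Hypothetical wrapper: straighten crooked pages and fix pages that were
# scanned sideways before running OCR.
# $1 = input PDF, $2 = output PDF
ocr_straighten() {
    ocrmypdf --deskew --rotate-pages "$1" "$2"
}

# Hypothetical wrapper: OCR only the pages that have no text yet.
# Useful for mixed files where some pages already contain real text.
# $1 = input PDF, $2 = output PDF
ocr_skip_text() {
    ocrmypdf --skip-text "$1" "$2"
}
```

For example, ocr_straighten scanned.pdf straightened.pdf would produce a searchable copy with the page skew and orientation corrected.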