Wednesday, December 21, 2011

Automatic PDF Form Filler

Filling out forms is one of my least favorite activities; especially for programmers / IT people who are used to doing everything electronically, looking at that form with pen in hand somehow brings time to a crawl. The boxes are always too small, if there are mistakes, you need to reprint, and repeat the whole thing again. By hand.

Here is a collection of Python scripts that will allow you fill out a PDF form automatically. First, you need to convert the PDF to a collection of jpgs, using

python DOC.pdf [target dir]

In [target dir] you should now see DOC-0.jpg, DOC-1.jpg, etc.

Then, you need to identify box locations. For that use

python [target dir]/DOC-0.jpg

This brings up a UI tool; as you click on boxes, the coordinates of those boxes will be written to a [target dir]/DOC-0.jpg.loc file. Make sure you click on the boxes in a logical order, most forms specify a number on the page for each box anyway, use that order for instance.The coordinates are written to the loc file as you click, so once you are done, simply shut off

Now in [target dir], start a new file called DOC-0.jpg.fill

This file will carry the values to be used to fill our your PDF form. Each line in this file should correspond to the line specified in DOC-0.jpg.loc. The line orders must match. You can manually tell to skip pixels in up or down direction by using e.g.

[down=40]bla bla bla

You can also use up, left, right commands. If you need to change the font size, e.g. for size 20 use [font=20].

Once that is done,

python [target dir]/DOC-0.jpg

This will use the loc file, fill file, and generate a final DOC-0.jpg-out.jpg

In this file you will see stuff from fill file placed in proper coordinates.

This tool uses ImageMagick, so make sure you install that first. Also, for the necessary Python libraries on Ubuntu you can use

sudo apt-get install python python-tk idle python-pmw python-imaging python-imaging-tk

An improvement to this code could be using a vision algorithm to automatically detect the location of each box. There is a certain visual pattern to a form -- words are in straight lines, there are big empty spaces in between, and the whole thing is usually surrounded by lines.