Adobe Portable Document Format (PDF) documents can be created with
blank spaces that can be filled in by the user.
The US Internal Revenue Service is a
popular source of forms using this feature. It's very convenient if you
want to use the Adobe Reader and fill out the forms interactively. It's a
bit less convenient if you want to do calculations in a spreadsheet and then
transfer the numbers to the form. One of the reasons I own a computer is
to let it do grunt work like copying numbers from one place to another,
so I decided to figure out how to fill out PDF forms automatically.
The tools exist but it takes a while to figure out how to use them,
so I decided to create this tutorial to capture what I've found out.
Here's the process flow I use. Starting from the right, the pdftk
program takes specially formatted values (the fdf file), merges them with
an existing pdf form, and writes a new pdf file which is the filled-out form.
On the left, I wrote a program, fdf_gen, which handles the
process of getting my data into the right format. It's fairly specific
to my problem, but may be useful as a starting point for someone else.
The pdftk command that fills in the form looks like this:
pdftk f1040.pdf fill_form f1040_kat.fdf output f1040_kat.pdf
where f1040_kat.fdf defines the contents of each field, and
f1040_kat.pdf is the new pdf with the values inserted.
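If you'd rather drive pdftk from a script than type the command by hand, something like the following should work. It's only a minimal sketch: it assumes pdftk is installed and on the PATH, and the function names build_fill_command and fill_form are made up for illustration.

```python
import subprocess


def build_fill_command(form_pdf, fdf_file, output_pdf):
    """Return the pdftk argument list that fills form_pdf from fdf_file."""
    return ["pdftk", form_pdf, "fill_form", fdf_file, "output", output_pdf]


def fill_form(form_pdf, fdf_file, output_pdf):
    """Run pdftk; raises CalledProcessError if pdftk reports a failure."""
    subprocess.run(build_fill_command(form_pdf, fdf_file, output_pdf), check=True)


# Example call (needs pdftk and the input files to exist):
# fill_form("f1040.pdf", "f1040_kat.fdf", "f1040_kat.pdf")
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems with file names that contain spaces.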
The fdf file is plain text written in PDF object syntax (which resembles
PostScript), and looks like this:
%FDF-1.2
1 0 obj<</FDF<< /Fields[
<</T(f1_04(0))/V(Katherine(Kat))>>
<</T(f1_05\(0\))/V(Astrofic)>>
<</T(c1_03(0))/V(a)>>
% And lots more lines like these.
] >> >>
endobj
trailer
<</Root 1 0 R>>
%%EOF
The top 2 lines and bottom 5 lines are standard headers and
trailers, which should never need to be changed. On each line,
the characters in parentheses after the T are the names of
fields in the form, and the value in parentheses after the V
is the value to be written into the field. For example,
f1_04(0) is the "First Name" field.
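Since the header and trailer never change, an fdf file is easy to generate from a table of field names and values. Here is a minimal sketch; the names escape_pdf_string and make_fdf are hypothetical, and it assumes plain ASCII values. It always backslash-escapes parentheses, which is the form used on the f1_05 line of the sample.

```python
def escape_pdf_string(text):
    """Backslash-escape the characters that are special inside a PDF (...) string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def make_fdf(fields):
    """Build the text of an fdf file from a dict of field name -> value."""
    lines = ["%FDF-1.2", "1 0 obj<</FDF<< /Fields["]
    for name, value in fields.items():
        lines.append(
            "<</T(%s)/V(%s)>>" % (escape_pdf_string(name), escape_pdf_string(value))
        )
    lines += ["] >> >>", "endobj", "trailer", "<</Root 1 0 R>>", "%%EOF"]
    return "\n".join(lines) + "\n"


# Example: write a one-field fdf file.
# with open("f1040_kat.fdf", "w") as out:
#     out.write(make_fdf({"f1_05(0)": "Astrofic"}))
```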
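To find out the names of the fields in a form, use pdftk's dump_data_fields operation:
pdftk f1040.pdf dump_data_fields >f1040_fields.txt
The output file contains an entry like this for each field: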
FieldType: Text
FieldName: f1_01(0)
Most fields are of type "Text"; we'll talk about FieldType "Button" next.
The FieldName is just a character string that labels the field. Unfortunately,
there's no relationship between these names and the line numbers or anything
else on the form, so the only good way to figure out what's what is to stuff
a dummy value into the field, and see where it shows up on the form.
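The dummy-value trick can be automated by filling every text field with its own name. The sketch below parses the dump_data_fields output and builds such a table. The function names are hypothetical, and it assumes each field's entry begins with a FieldType line, as in the samples shown here; lines without a colon (such as separator lines) are skipped.

```python
def parse_field_dump(text):
    """Parse dump_data_fields output into a list of dicts, one per field."""
    fields = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no "key: value" pair on this line, e.g. a separator
        key, value = key.strip(), value.strip()
        if key == "FieldType":
            fields.append({"FieldStateOption": []})  # a new field entry starts
        if not fields:
            continue  # ignore anything before the first FieldType line
        if key == "FieldStateOption":
            fields[-1]["FieldStateOption"].append(value)  # may repeat per field
        else:
            fields[-1][key] = value
    return fields


def dummy_values(fields):
    """Map each Text field to its own name, so the name appears on the form."""
    return {
        f["FieldName"]: f["FieldName"]
        for f in fields
        if f.get("FieldType") == "Text" and "FieldName" in f
    }
```

Writing the resulting table out as an fdf file and running it through pdftk labels every text box with its field name. Long names may get cut off in short boxes.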
FieldType: Button
FieldName: c1_03(0)
FieldFlags: 0
FieldJustification: Left
FieldStateOption: Off
FieldStateOption: Yes
FieldStateOption: a
FieldStateOption: b
FieldStateOption: c
FieldStateOption: d
The FieldStateOption lines define the allowed values for the checkboxes.
Most just have options Off (no boxes checked) or Yes (check the box).
In this case, there are 5 possible choices. Naturally, the option
values have absolutely no relationship to anything actually printed on the form, so
we have to try the values until we get the one we want. In the fdf example above,
this field is set to a.
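Trying the values one at a time gets tedious, so it may help to generate one small fdf file per option and look at each result. This is a rough sketch with a made-up function name, option_trials. It writes the option as a parenthesized string, the same way the sample fdf file does, and skips Off since that leaves nothing checked.

```python
def option_trials(field_name, options):
    """Return (option, fdf_text) pairs: one single-field fdf per state option."""
    escaped = (
        field_name.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )
    header = "%FDF-1.2\n1 0 obj<</FDF<< /Fields[\n"
    trailer = "] >> >>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
    trials = []
    for option in options:
        if option == "Off":
            continue  # Off checks no box, so there is nothing to look for
        body = "<</T(%s)/V(%s)>>\n" % (escaped, option)
        trials.append((option, header + body + trailer))
    return trials


# Example: one trial file per option of the c1_03(0) field.
# for option, text in option_trials("c1_03(0)", ["Off", "Yes", "a", "b", "c", "d"]):
#     with open("try_%s.fdf" % option, "w") as out:
#         out.write(text)
```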
These notes describe what I've seen on IRS forms; others may have other quirks.
I wrote the program fdf_gen.c to implement part of the process of creating
an fdf file. It works on some simple
test cases, but hasn't had any extensive validation. In other words, if you're
going to use it for something critical like real tax forms, you really need
to double-check the output to make sure it's doing what you want it to do.
In this case, I generate the fdf file using the command
fdf_gen f1040.flds kat.in kat.fdf
where f1040.flds just assigns a content type and more descriptive name to
each value to be entered, and kat.in contains the input values.
Typical entries in f1040.flds are:
string LblLastName f1_05(0)
string3 LblSSN f1_06(0) f1_07(0) f1_08(0)
dollar_cents L7 f1_44(0) f1_45(0)
where the first item is the type of data, the next item is my descriptive name, and the
rest of the line contains the field or fields the value will be written into.
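To make the layout of these entries concrete, here is a minimal Python sketch that reads them into a lookup table. It is not taken from fdf_gen.c; the function name parse_flds is hypothetical, and it assumes the items on a line are separated by whitespace.

```python
def parse_flds(text):
    """Parse .flds text into {descriptive_name: (data_type, [field names])}."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # need a type, a name, and at least one field
        data_type, name, targets = parts[0], parts[1], parts[2:]
        entries[name] = (data_type, targets)
    return entries
```

For the sample above, the LblSSN entry comes out with type string3 and three target fields.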
File kat.in just contains descriptive names and values:
LblLastName Astrofic
LblSSN -123-45-feed
L7 77.25
My data types are:
string - Value is just a character string.
number - Synonym for string.
button - Synonym for string.
string3 - String is broken into multiple sections, and each section goes
into a different field in the form. The first character is the
section break character. For example, -123-45-feed is split at each - into
123, 45, and feed.
dollar_cents - A numeric value placed into 2 fields. The dollar value goes
into the first field, and the cents value goes into the second.
dollar_cents_paren - Like dollar_cents, except that negative values are
in parentheses. E.g., -123.21 generates (123 and 21), in separate fields.
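The data types above can be sketched in a few lines of Python. This is an illustration of the rules as described, not a port of fdf_gen.c, and the function name expand_value is made up. One loudly labeled assumption: for plain dollar_cents, a negative amount gets a leading minus sign on the dollar part, since the description above doesn't say what happens.

```python
from decimal import Decimal, ROUND_HALF_UP


def expand_value(data_type, raw):
    """Split one input value into the strings that go into its form fields."""
    if data_type in ("string", "number", "button"):
        return [raw]
    if data_type == "string3":
        if not raw:
            return []
        # The first character is the section break, e.g. "-123-45-feed".
        return raw[1:].split(raw[0])
    if data_type in ("dollar_cents", "dollar_cents_paren"):
        # Decimal avoids binary floating-point surprises with money.
        amount = Decimal(raw).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        dollars, cents = divmod(int(abs(amount) * 100), 100)
        dollar_text, cent_text = str(dollars), "%02d" % cents
        if amount < 0:
            if data_type == "dollar_cents_paren":
                return ["(" + dollar_text, cent_text + ")"]
            # Assumption: plain dollar_cents marks negatives with a minus sign.
            return ["-" + dollar_text, cent_text]
        return [dollar_text, cent_text]
    raise ValueError("unknown data type: %s" % data_type)
```

Pairing the result with the field names from the .flds entry, for example with dict(zip(["f1_44(0)", "f1_45(0)"], expand_value("dollar_cents", "77.25"))), gives the name-to-value table that goes into the fdf file.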
This program was originally published in 2008. As of March 2012, Greg Lawson is also working on
this code as part of an open source tax project. You may want to check his repository for
more recent updates. See the links below.
The pdftk program has a website at www.accesspdf.com. It has
the program, mailing lists, and links to purchase a book, PDF Hacks.
I haven't purchased the book, but the program is great, so I assume the book will
be too.