Purpose for Building, and Description of, the Workbench
Briefly: Workbench for easier and faster transcription of information from the 1901 Canadian Census into machine-readable form.
Overall, the product is intended to make it easier and faster to transcribe information accurately from the 1901 Canadian Census forms into machine-readable form.
The 1901 Canadian Census is available as a large collection of MrSID-format images which show entire pages (of 50 lines each, one line per person) of the handwritten census. One of the primary problems that users of these images face is that the software available for viewing MrSID images is not made to be embedded in, or controlled by, other software that might aid in the transcription process. In particular, the full-scale images will not fit within ordinary computer screens, which obliges the person doing a transcription to zoom and pan repeatedly and, at the same time, to manipulate whatever GUI is being used for data entry.
The product will display selected rows or columns of one of these forms, along with a conventional computer input form, to facilitate entry. The basic output of the product will be appropriately defined XML.
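As an illustration only, here is a minimal sketch of how one transcribed census line might be turned into XML using Python's standard library. The element names ("line", "field"), the attribute names and the function name are hypothetical; the Workbench's actual XML vocabulary is not defined here.

```python
import xml.etree.ElementTree as ET


def census_line_to_xml(page_id, line_number, fields):
    """Build one XML element for a transcribed census line.

    Hypothetical layout: a <line> element carrying the page identifier
    and line number, with one <field> child per transcribed column.
    """
    line = ET.Element("line", {"page": page_id, "number": str(line_number)})
    for name, value in fields.items():
        field = ET.SubElement(line, "field", {"name": name})
        field.text = value
    return line


def to_text(element):
    """Serialise an element to a Unicode string."""
    return ET.tostring(element, encoding="unicode")
```

A call such as census_line_to_xml("page-1", 1, {"surname": "Smith"}) would yield a single "line" element with one "field" child; to_text() turns it into a string for saving.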
The product is being written in Python, with a wxPython-based GUI. The product transforms MrSID images to JPEG, so that they can be manipulated using PIL (the Python Imaging Library). The DOS program used to make the transformation from MrSID format is MrSIDDECODE, which is freely available from the company with a proprietary interest in MrSID.
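The conversion step could be driven from Python roughly as follows. This is a sketch under stated assumptions, not the Workbench's own code: the function names are hypothetical, and the "-i" and "-o" switches for the input and output files are assumptions that should be checked against the usage message of the MrSIDDECODE version installed (a switch to select JPEG output may also be needed).

```python
import subprocess


def build_decode_command(sid_path, jpeg_path, executable="mrsiddecode"):
    """Return the argument list for converting one MrSID image.

    ASSUMPTION: "-i" names the input file and "-o" the output file.
    Verify these switches against the decoder's own help text.
    """
    return [executable, "-i", sid_path, "-o", jpeg_path]


def convert(sid_path, jpeg_path):
    """Run the decoder and raise CalledProcessError if it fails."""
    subprocess.run(build_decode_command(sid_path, jpeg_path), check=True)
```

Keeping the command construction separate from the call that runs it makes it easy to adjust the switches for a particular platform or decoder version.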
We intend to support the Windows, Mac OS X and Linux platforms, which is to say, all of those platforms for which MrSIDDECODE is made available. Initial development has been for the Windows platform.
Since most of the code is quite straightforward, and since we have already prototyped the parts needed to process the MrSID format, we foresee no major obstacles to development.
Chief features:
The prototype of the product comprises in excess of 700 lines plus a few small graphic files.
Using the Workbench (as it exists as of 2004-02-03)
NB: The Workbench is not yet operational, mainly because it lacks facilities for saving and retrieving transcribed data.
The Workbench can retrieve MrSID-format census form images from the Canadian federal
government web server on which they are stored. Click on "Retrieve" in the
"File" menu to begin.
Click on the name
of the province or on "The Territories", then indicate the records that you
seek, as shown, and press 'OK'.
If you have successfully identified this series of images before, the Workbench will
display the dialog box shown to the left. If you select 'Yes', the Workbench
will refresh its list from the government server. Normally this is unnecessary.
The Workbench
lists the census form images that it identified corresponding to your search.
Check the ones that you want to download.
If you have already downloaded one of the census images that you have just
identified, the Workbench will tell you so using the dialog box shown to the left.
Normally you would not need to download it again.
At this point the MrSID census form image should be available on your disc. Here are some typical entries in a listing of such files that would appear in a Windows Explorer pane.
Notice that each file is named using the numbering system for the image files
on the government server, as well as with identifying information.
The next step is to
prepare an image for transcription, and then to open it for transcription. Click
on "Open" in the "File" menu to begin.
At this point you
can select one of the files that you have downloaded using the Workbench, or you
can select one that you have downloaded by other means. Here is how the file
open dialog appears.
Because the Workbench displays just a few lines from the census form at a time, it needs alignment information. If you had aligned the current census form image previously, you would see the dialog shown to the left.
If this were the first time that you were opening this particular census form
image the dialog would not appear. Let us press "Yes" so that we can continue
with alignment (as if this were the first time we had met this image).
If you had previously converted the MrSID image for processing, you would see the following dialog.
If this were the first time that you were opening this particular census form
image the dialog would not appear. Let us again press "Yes" so that we can
continue with alignment (as if this were the first time we had met this image).
Whilst the Workbench is using another program, called "mrsiddecode", to convert the image for you, the following window will be displayed.
Notice how progress is indicated as a percentage.
When the MrSIDDECODE window closes, press 'OK' in the dialog shown to the right.
The Workbench now expands to fill the screen (if necessary), and displays the entire census form for alignment, as shown in the graphic to the left.
Please follow the instructions that appear on this screen. Essentially you
need to click on each of the four corners of the area that contains the (up to)
fifty lines of census information. The Workbench uses this information to be
able to identify individual lines in the converted image for display during the
transcription process. Notice especially that it is necessary to download and
align a census form image only once. The Workbench maintains the information
that you have supplied between sessions.
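To illustrate how four corner clicks can locate an individual line, here is a minimal sketch. It assumes that the lines are evenly spaced between the top and bottom edges of the clicked area and that straight-line interpolation along the left and right edges is accurate enough; the function names are hypothetical, and the Workbench's own calculation may differ.

```python
def lerp(p, q, t):
    """Point a fraction t of the way from point p to point q."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)


def line_quad(corners, line_number, total_lines=50):
    """Return the four corners of one census line within the page image.

    corners: (top_left, top_right, bottom_right, bottom_left), each an
             (x, y) pixel position as clicked during alignment.
    line_number: 1-based line on the form (1 to total_lines).
    ASSUMPTION: lines are evenly spaced down the aligned area.
    """
    if not 1 <= line_number <= total_lines:
        raise ValueError("line_number out of range")
    top_left, top_right, bottom_right, bottom_left = corners
    t_top = (line_number - 1) / total_lines
    t_bottom = line_number / total_lines
    return (
        lerp(top_left, bottom_left, t_top),       # upper-left of the line
        lerp(top_right, bottom_right, t_top),     # upper-right
        lerp(top_right, bottom_right, t_bottom),  # lower-right
        lerp(top_left, bottom_left, t_bottom),    # lower-left
    )
```

For an upright form the result is a simple horizontal strip. For a skewed scan it is a slanted quadrilateral, whose bounding box could be used to crop the converted image.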
The above is a portion of the
image that appears when you open a census image. (As a matter of fact, this is a
composite image, to save space on this web page.) In this case line 1 of the
form is presented in the graphic above the computer form. Notice that each field
in the computer form is aligned with the column it represents in the form
above. What you cannot see in this image is that, as you tab through the
computer form, the image above it scrolls so that the item to be entered is kept in
view.
Other notes:
Please note that this is a work-in-progress. Clearly there are lots of ways to introduce efficiencies in this process; however, this is the way the program works as of the first half of February 2004.