Setting up a TEI Work Flow
The "work flow" for your TEI project is just the steps by which you transcribe, encode, and publish your TEI files. The work flow may change as your project matures (for instance, as you add staff or get help from collaborators) but it is worth thinking about what work needs doing, and in what order, before you get started. It's also a very good idea to document your work flow (and keep that documentation up to date as things change), since this documentation can serve as training materials for new staff members, and as a reminder of essential tasks.
The essential steps in any TEI encoding work flow are:
- Transcription: The capture of the actual content of the documents being encoded, as text. Although this step is conceptually distinct from the encoding stage described next, in practice the two are very often combined. There are many features of early printed books and manuscripts which can best be captured in the encoding itself (for instance, shifts of font, deleted text, indentation) and it can be quite awkward to represent these without markup. However, for projects where the transcription has already been done, where the text is being taken from some pre-existing edition, or where the transcription is being done by collaborators who for any reason cannot use XML, separating these two steps may be necessary. If so, bear in mind that the more consistent the transcription is in its representation of features like formatting, deleted material, and so forth, the easier it will be to convert this information into XML. Under ideal circumstances, the conversion can be done automatically, so that's something to aim for. These considerations also apply if the text is being transcribed by a vendor, or if the text capture is being done through an automated OCR process.
- Encoding: In the encoding stage, the text becomes an XML file through the addition of explicit markup (for instance, TEI elements) that represents the document's structure and content. As noted above, this process is very often carried out while the text is being transcribed; an encoder will type the text into an XML editor and apply the appropriate encoding as they go along. For some kinds of documents, there may be features that can be automatically encoded after the text has been fully transcribed (such as personal names or dates). During the encoding process the text is checked at intervals to make sure it is valid against the project's schema. The encoding of the document may be completed in a single pass, or it may be structured as a staged process with an initial, very simple encoding to capture the essential textual features, followed later on by one or more encoding passes focused on capturing more details of content or structure. This sequence of tasks might also map onto levels of staff expertise, with more experienced encoders or subject specialists undertaking the later stages. A phased approach to encoding has the advantage that it produces a visible output sooner (helpful when producing a prototype) and lends itself to multiple rounds of grant funding.
- Encoding review: When the text has been captured and encoded, some correction process is necessary, usually involving a review of the encoding to ensure consistency and correctness. This may be partly a manual process but you can also take advantage of several types of XML tools to help review the encoding. The first and most important is validation; your schema should be a good expression of what your encoding ought to look like, and ensuring your files are valid will rule out many basic encoding errors (like putting headings in the wrong place, or having list items that aren't inside a list). To get the most value from validation, you should use the TEI customization process to make sure your schema does a good job of expressing the encoding you want in your documents (and of excluding encoding you don't want). In addition to validation, you may find it helpful to develop a set of search routines that look for specific kinds of errors and inconsistencies that may be common in your encoding process. These might be wrapped into a small script that can be run on your files as they are completed. Such tools are highly project-specific and as a result typically need to be developed in-house, so they may not be practical for very small projects without programming support.
- Proofreading: Following transcription you will typically also need to perform one or more proofreading passes through the text to check the accuracy of the transcription against the source. Because some of the text's information (and even some of the content) may be represented in the encoding, it's important to provide output for proofreading that conveys all the information necessary. For instance, if you encode typographical errors in the source with the choice element, your proofreading output may need to give the proofreader a way to check both the original error and the corrected reading. If your proofreading output does not display renditional features like italics or indentation, your proofreaders need to know that they shouldn't be checking or correcting these features. Many projects find it very useful to have their proofreaders trained as encoders, so that they have an understanding of the underlying data they're working with. At the same time, this imposes an extra burden of training, and it may be difficult to attract and maintain a large enough pool of fully trained people.
- Entry of corrections: It is feasible to enter corrections as part of the process of encoding review and proofreading, and this may be an attractive approach since it minimizes the need for printouts and simplifies the process somewhat. However, there are some advantages to having errors marked on a printout and then corrected in the XML as a separate step. One of these is that it allows for a proofreading staff who are not familiar with XML, which provides some flexibility in staffing and a diminished training burden. Another advantage is that it leaves a clear record of what was done, especially if the person entering corrections marks each one as made, and it permits the proposed corrections to be reviewed, which can be valuable if the proofreading staff are inexperienced. The WWP has also experimented with a further "checking round" in which a third person checks that each correction was made accurately in the XML; this is expensive but for very high-stakes work it may be worth the cost.
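The "search routines" mentioned under encoding review above might look something like the following. This is only a minimal sketch in Python, using the standard library. The three rules it checks (a `sic` inside `choice` must have a sibling `corr`, every `date` must carry a `when` attribute, and no `head` may be empty) are hypothetical project conventions chosen for illustration, not TEI requirements. The function name `review_encoding` and the sample fragment are likewise invented. Note that the script tests only well-formedness plus these rules; validation against your schema still needs a separate validating tool.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}


def review_encoding(xml_text):
    """Return a list of problem reports for one TEI document (as a string)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # A file that is not well-formed cannot be checked any further.
        return [f"not well-formed: {err}"]

    problems = []

    # Hypothetical project rule: a <sic> inside <choice> needs a sibling <corr>.
    for choice in root.iter(f"{{{TEI_NS}}}choice"):
        has_sic = choice.find("tei:sic", NS) is not None
        has_corr = choice.find("tei:corr", NS) is not None
        if has_sic and not has_corr:
            problems.append("<choice> has <sic> but no <corr>")

    # Hypothetical project rule: every <date> carries a @when attribute.
    for date in root.iter(f"{{{TEI_NS}}}date"):
        if "when" not in date.attrib:
            content = "".join(date.itertext()).strip()
            problems.append(f"<date> without @when: {content!r}")

    # Hypothetical project rule: headings must contain some text.
    for head in root.iter(f"{{{TEI_NS}}}head"):
        if not "".join(head.itertext()).strip():
            problems.append("empty <head>")

    return problems


# An invented fragment (not a complete TEI file) with one error of each kind.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<head></head>
<p>Printed in <date>1652</date> with a <choice><sic>mispelling</sic></choice>.</p>
<p>Reissued <date when="1660">1660</date>,
<choice><sic>teh</sic><corr>the</corr></choice> second time.</p>
</body></text></TEI>"""

if __name__ == "__main__":
    for problem in review_encoding(SAMPLE):
        print(problem)
```

A script of this kind can be run over each file as it is completed, and the list of rules can grow as the project learns which mistakes its encoders tend to make.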
The steps described above cover the essential stages of the work flow. How about the practicalities of setting up such a work flow for your own project? What do you need to have in place to support this kind of work? What preparatory work is required? Here are some key points:
- Develop a well-constrained TEI schema: Working with a set of representative sample documents, develop an inventory of the features you want to capture and the contexts in which they appear. What this will yield is an intellectual model of how you want to represent your documents in digital form. Then you need to create a TEI schema that expresses this intellectual model as closely as possible (allowing for the variances you may discover in other documents in your collection). This schema will help avoid inconsistency in your encoding (ensuring that you encode the same feature in the same way across your project) and will also help guide your encoders as they learn their work, since it will constrain the TEI elements that are available to them as they encode the text. You can make changes to the schema over time, if you discover unforeseen features or variations between texts, but it's helpful to take this initial development process seriously as a foundation for your project.
- Document the entire process: Ideally, before you hire or train encoders, think through what the basic work process will be and write it down; pretend you're explaining it to a family member or someone unfamiliar with the project. Share the description with a colleague and ask them to point out any missing details or unclear points. This documentation is the basis for training your staff and it's an important surrogate when no one is available to answer questions. It should include instructions for what tools and templates to use, where to save things, file-naming conventions, documentation processes, and how to fill out time sheets: in short, all the practical trivia that your staff need to know in order to do things in an organized way. Writing this documentation is also helpful in revealing things you haven't thought about yet. You should also document your encoding practices (probably as a separate document).
- Develop encoding templates: If you're encoding more than one text, it's likely that there will be some standard, repeated features of your encoding (such as the metadata, and perhaps basic structures like front, body, and back) that should be consistently represented. If you're encoding more than a small handful of texts, you'll find it very helpful to create a template that contains these standard features, to use as a starting point when encoding a new text. The template should include any standard project information (such as the details of the fileDesc, projectDesc, and encodingDesc) and should also include placeholders for any information that you want your encoders to fill in (such as the author and title of the text being encoded) with clear instructions for how to fill it in correctly (e.g. the format of the author's name; whether to modernize the spelling of titles). When an encoder starts a new text, they can save a copy of the template, fill in the necessary metadata, and begin encoding. This process helps prevent casual inconsistencies and omissions of important information, and it also streamlines the encoding process by avoiding unnecessary work.
- Set up your computers with a uniform encoding environment: Make sure that if possible your encoders are working on computers that have the same setup: the same software, the same versions of things, the same way of getting access to servers if needed, etc. This can help streamline training by eliminating confusing special cases, and also makes it easier to maintain the working environment (since you can update everything in the same way at the same time). Make sure encoders know what to do: document where they should save files, how they interact with the version control system, etc. There's a great potential for unhappiness and wasted time if files are saved in the wrong place, or if duplicate files are made (and then have to be reconciled). In a subtler way, the overall organization and orderliness of the working environment helps maintain morale and encourages staff to take responsibility for the overall smooth working of the system. Encoders who are confused or uncertain about what to do are more likely to make mistakes, and will feel less strongly invested in the project.
- Think about version control: Version control is the process of managing changes to your files in an organized manner, and (of particular importance) avoiding loss of data. In any project involving more than one person, there is potential for version control problems, if two people edit the same file at the same time, or create incompatible changes to a file, or create multiple, differing versions of a file. In a small project it may be possible to avoid these problems by observing careful conventions (e.g. maintaining all files on a central server, and alerting the group whenever someone is editing a file, to avoid conflicts). However, using a version control tool provides a more reliable protection against conflict and also enables you to return to earlier versions of a file (for instance, to reverse an erroneous change). Version control is an invaluable part of any long-term text encoding project. You may need to get help from your local technical support to do the initial installation and setup, but once installed, version control systems are comparatively easy to use and are well worth the investment of time and training.
- Set up a backup system: Along with version control, good systems for backing up your data are crucial basic infrastructure for a successful encoding project. This is an area where the best should not be the enemy of the good: even if you can't set up an automated daily or weekly backup system right away, you can purchase a cheap high-capacity external hard drive and make it part of your weekly Friday ritual to copy all the project's data onto it before closing down for the weekend. If you can't back up all of your data every time, focus on the data that represents human effort (e.g. encoded files) rather than automated work that could be reproduced. If your files are stored on a server maintained by someone else, make sure you know what the backup systems for that server are. Don't hesitate to complain if the server isn't being backed up regularly. Think about the cost to you of reproducing a day's work (a week's work, a month's work) and use that as a measure of how often you need to back up your data.
- Develop a training plan for new encoding staff: Especially if your project is likely to last for more than a single employment cycle, but even if you'll only be hiring one cohort of encoding staff, it's worth thinking about how you will train your encoders. This is partly a question of what they need to know, but also a question of what is the best way for them to learn it: by encoding a sample text, by reading documentation, by reading the TEI Guidelines, by some formal training course, by asking questions of you or a colleague? As with documentation, having a clear and well-organized training program helps maintain good morale and avoids confusion and wasted time. It also ensures that all of your encoding staff get the same training and know the same things (so that they can help each other). If you regularly hire encoders (e.g. at the start of each academic year), having a training program in place with materials ready to go will save you tremendous amounts of work, and will help you get your new encoders into the project that much quicker. At the WWP, we have found that our student encoders learn best by a combination of training methods: a few initial group sessions to explain basic concepts of text encoding and the TEI, individual hands-on practice with a short sample text (with an immediate review and feedback process), and access to online tutorials and documentation on specific topics so that they can look things up on their own as they work.
- Develop a system for assigning and tracking work: It's very helpful to have a system of some kind to keep track of what tasks need doing, who is doing them, and when they're completed. This helps keep specific tasks from falling through the cracks, and also helps you see how much time and effort the different aspects of the project are taking. At the WWP we have experimented with databases (using FileMaker Pro), wikis, and paper tracking systems. Having a physical chart on a wall or desk is useful, partly because it's easy to see and use; encoders can check off tasks as they're completed and can easily tell what needs doing. A wiki serves a similar function, being easily group-writable, with the added benefit of being remotely accessible (e.g. if students are working from off-campus). Having the information stored more systematically in a database is important from a project management perspective, since it allows for retrospective analysis (how many texts did we complete that year? how long does proofreading typically take? how many texts did so-and-so work on?). The key feature of any system is that it be usable; if it's so complex and elegant that no one understands it, then it won't serve its purpose.
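To make the database option in the last point more concrete, here is a minimal sketch of a tracking table and the kinds of retrospective questions it can answer. It uses Python's built-in `sqlite3` module rather than any particular product. The table layout, the function names, the stage names, and the sample records are all assumptions made up for illustration; a real project would adapt them to its own stages and staff.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open (or create) a tracking database; the default lives only in memory."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               text_id   TEXT NOT NULL,
               stage     TEXT NOT NULL,
               assignee  TEXT NOT NULL,
               started   TEXT NOT NULL,
               completed TEXT
           )"""
    )
    return conn


def assign(conn, text_id, stage, assignee, started):
    """Record that a task was handed to someone on an ISO date (YYYY-MM-DD)."""
    conn.execute(
        "INSERT INTO tasks VALUES (?, ?, ?, ?, NULL)",
        (text_id, stage, assignee, started),
    )


def complete(conn, text_id, stage, completed):
    """Mark a task as finished on the given ISO date."""
    conn.execute(
        "UPDATE tasks SET completed = ? WHERE text_id = ? AND stage = ?",
        (completed, text_id, stage),
    )


def open_tasks(conn):
    """What still needs doing, oldest first."""
    return conn.execute(
        "SELECT text_id, stage, assignee FROM tasks "
        "WHERE completed IS NULL ORDER BY started"
    ).fetchall()


def average_days(conn, stage):
    """How long does this stage typically take, in days?"""
    return conn.execute(
        "SELECT AVG(julianday(completed) - julianday(started)) FROM tasks "
        "WHERE stage = ? AND completed IS NOT NULL",
        (stage,),
    ).fetchone()[0]


def texts_worked_on(conn, assignee):
    """How many distinct texts did this person work on?"""
    return conn.execute(
        "SELECT COUNT(DISTINCT text_id) FROM tasks WHERE assignee = ?",
        (assignee,),
    ).fetchone()[0]


if __name__ == "__main__":
    conn = open_tracker()
    assign(conn, "text-001", "proofreading", "proofreader-b", "2024-01-22")
    complete(conn, "text-001", "proofreading", "2024-01-26")
    assign(conn, "text-002", "encoding", "encoder-a", "2024-02-12")
    print(open_tasks(conn))
    print(average_days(conn, "proofreading"))
```

Even a table this simple keeps the system usable: there is one row per task, and each management question becomes a short query.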