CDL Corpus-Building
Steve Tinney
2007-03-22
This document provides an introduction to
corpus-building facilities used in the Cuneiform Digital
Library.
Why build a corpus of Mesopotamian texts using the CDL tools?
- Text Cataloging
The CDLI catalog provides a global repository of unique
identifiers for inscribed objects:
- you can use this catalog to develop a corpus and eliminate duplication
- you can also use additional catalog data for fields not in the CDLI
catalog
- Data Entry
You can use old data and enter new data easily:
- The heart of the CDL tools is a strictly defined text format which
has an ASCII input version, ATF, and an XML version used by programs (XTF).
- The CDL group has extensive experience in legacy data
conversion and we are willing to help with substantial conversion jobs
by bringing old data into ATF.
- ATF makes easy things easy and difficult things
possible.
- ATF provides a complete solution to
transliteration needs for cuneiform texts in any of the languages
written in cuneiform.
- An online template generator takes some of the drudgery out of
entering new texts.
- Even with many data preparers, results stay consistent thanks to
the simplicity and thorough documentation of ATF.
- Data Consistency
The CDL tools help you get data into a well-defined and highly
consistent format and keep it that way:
- An online service, the web-based ATF checker, identifies hundreds
of different kinds of errors and can also do content validation of
graphemes and more.
- The ATF checker can produce lists of graphemes and words, with
their frequencies, and these lists can be edited to eliminate
inconsistent transliterations. The checker can then use the revised
lists to check the content of your data, ensuring that it stays valid
both in structure and content.
- ATF is backed by a rigorously defined XML document
definition in the international standard RELAX NG schema language.
Both the ATF input mechanism and the XML schemas are fully documented
and available on the web.
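As a rough illustration of the list-based checking described above, the following Python sketch counts the words in a few transliteration lines and reports any that are missing from a control list. It is not the ATF checker. The sample lines, the placeholder identifier P999999, the helper names `word_frequencies` and `unknown_words`, and the simplified rule that a text line starts with a line label ending in a period are all assumptions made for this example.

```python
import re
from collections import Counter

# Simplified assumption: a text line starts with a label such as "1." or "2'."
TEXT_LINE = re.compile(r"^\S+\.\s+(.*)$")

def word_frequencies(lines):
    """Count whitespace-separated words on text lines only."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and lines beginning with &, @, # or $,
        # which this sketch treats as non-text lines.
        if not line.strip() or line[0] in "&@#$":
            continue
        match = TEXT_LINE.match(line)
        if match:
            counts.update(match.group(1).split())
    return counts

def unknown_words(counts, control_list):
    """Return the words that do not appear in the control list."""
    return sorted(word for word in counts if word not in control_list)

# Invented sample data, for illustration only.
sample = [
    "&P999999 = Hypothetical Text 1",
    "@tablet",
    "@obverse",
    "1. lugal e2 mu-du3",
    "2. lugal e2-gal",
    "$ rest broken",
]
control = {"lugal", "e2", "mu-du3"}

freqs = word_frequencies(sample)
print(freqs.most_common())
print(unknown_words(freqs, control))
```

Editing the control list and re-running the check is the same cycle the bullets above describe: the list records the accepted forms, and anything outside it is flagged for review.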
- Data Backup and Version History
The CDLI ATF text repository can look after your data:
- the repository is backed up nightly
- files are available from anywhere to download, work on, and upload again
- a history of changes is maintained; old versions can be retrieved easily
- Data Development
Once texts are entered they can be enhanced in various ways:
- Lemmatization can be added either with interlinear tags or
dynamically during ATF processing.
- Sentence boundaries may be added.
- Full support is provided for different translation styles;
translation units can be lines, groups of lines, or sentences.
- The XML format, XTF, provides a very rich version of
your textual data for programs to work on. Because you type
simple ASCII texts and augment them automatically via lists, the
benefits of XTF are achieved with as little human effort as possible.
- Presentation
The same transliterations and translations can be presented in several ways:
- Online, as HTML; we can even host your project either in its
entirety or in part.
- In print, by using the 'Create RTF'
option on the web-service and importing the results into most modern
word processors and page-layout programs.
- Projects hosted on the CDL web server can easily take advantage of
sophisticated searching and web-display facilities.
- The various lists produced by the ATF web service can be used as
indices of print publications.
- Enhanced Usefulness
Corpora prepared with these tools are reusable and more useful:
- they add to the globally searchable and browsable Cuneiform Digital Library
- they provide more instances, and a greater variety of them, for use in online sign lists and dictionaries
There are several steps in building a corpus; these steps may not
all be necessary, and they can be carried out on a few texts at a time
to build up a corpus incrementally, or all at one time if the basic
corpus is already available as legacy data.
- P-numbers
P-numbers are unique identifiers required by the tools
- send a brief catalog of texts to CDLI staff
- initially such a catalog might only contain identifying information such as museum and publication numbers and a
few additional fields giving the author and date of the primary
publication and the owner of the objects
- information on period, provenience and genre is desirable--and is used by the web-based browse and search tools--but may be added later
- further fields are useful but not absolutely required.
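Before sending a brief catalog, it can help to check it locally. The Python sketch below reads a small CSV catalog and separates rows that lack any identifying number from rows that only lack the fields that may be added later. The column names, the sample rows, and the rule that at least one of museum or publication number must be present are assumptions for this example; the actual field names and requirements should be agreed with CDLI staff.

```python
import csv
import io

# Hypothetical column names, not the CDLI catalog's own field names.
IDENTIFYING = ["museum_no", "publication"]
DESIRABLE = ["period", "provenience", "genre"]

def check_catalog(rows):
    """Return (errors, warnings) for a list of catalog rows (dicts)."""
    errors, warnings = [], []
    for number, row in enumerate(rows, start=1):
        # Assumed rule: each text needs at least one identifying number.
        if not any((row.get(field) or "").strip() for field in IDENTIFYING):
            errors.append(f"row {number}: no museum or publication number")
        missing = [f for f in DESIRABLE if not (row.get(f) or "").strip()]
        if missing:
            warnings.append(f"row {number}: could add later: {', '.join(missing)}")
    return errors, warnings

# Invented sample catalog, for illustration only.
catalog_csv = """museum_no,publication,author,pub_date,owner,period,provenience,genre
X 1,Example Vol. 1 no. 1,A. Author,1990,Example Museum,Ur III,Nippur,Administrative
,,A. Author,1990,Example Museum,,,
"""

rows = list(csv.DictReader(io.StringIO(catalog_csv)))
errors, warnings = check_catalog(rows)
print(errors)
print(warnings)
```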
- Project Definition
Projects are the organizational core of the CDL server
- defining a project on the CDL server is optional; you can work with most of the tools without doing so
- having a project makes it easy for you to store control lists of graphemes and lemmata, and view your data online even while you are developing it.
- a project provides an easy way to present your work online
- e-mail Steve Tinney to arrange this
- Transliterations
Transliterations are the core of a corpus
- convert legacy data to ATF
- add new texts by typing them in ATF
- validate your transliterations using the ATF checker
- you can optionally use your own control lists for data content like allowable grapheme values and more
- Linguistic Annotation
Linguistic annotation makes a corpus more useful
- you can add sentence boundaries to the transliterations by simply typing
+. in the appropriate places
- lemmatization, identifying the word to which each form belongs, is particularly useful
- the CDL tools provide a straightforward procedure for making lists of forms and their lemmata
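The idea of a list of forms and their lemmata can be sketched in a few lines of Python. This is only an illustration of the data structure, not the CDL procedure: the entries, the plain-dictionary format, and the helper names `forms_by_lemma` and `lemmatize` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical form-to-lemma list; entries are illustrative only.
form_lemmata = {
    "lugal": "lugal",
    "lugal-e": "lugal",
    "lugal-la": "lugal",
    "e2": "e",
    "e2-a": "e",
}

def forms_by_lemma(mapping):
    """Group the attested forms under the lemma they belong to."""
    grouped = defaultdict(list)
    for form, lemma in mapping.items():
        grouped[lemma].append(form)
    return {lemma: sorted(forms) for lemma, forms in grouped.items()}

def lemmatize(words, mapping):
    """Pair each word with its lemma, or None if the form is not listed."""
    return [(word, mapping.get(word)) for word in words]

print(forms_by_lemma(form_lemmata))
print(lemmatize(["lugal-e", "mu-du3"], form_lemmata))
```

Forms that come back paired with `None` are the ones still to be added to the list.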
- Translation
Translations can be integrated into the corpus
- Translate the texts or convert legacy translations to ATF.
- Publish the corpus to the web with searching or in print with indexes.
There are several web pages that you can look at for further
information:
- The CDL
pager gives an alternative view of the CDLI catalog and
transliteration data; you can search the transliterations, limit the
search by catalog keywords and navigate to the texts and to the
ePSD.
- The ATF web
page offers a guide to introductory and full documentation on ATF
and has a link to the web service.
- You can try out the ATF template
generator online.
- The main CDL
documentation entry point gives another kind of introduction to this material.
If you are interested in building a corpus with the CDL tools
please e-mail cdli@cdli.ucla.edu or Steve
Tinney (stinney at sas dot upenn dot edu) to discuss the next
steps.
Questions about this document may be directed to
Steve Tinney (stinney at sas dot upenn dot edu).