CDL Corpus-Building

Steve Tinney
2007-03-22

Introduction

This document provides an introduction to corpus-building facilities used in the Cuneiform Digital Library.

Why?

Why build a corpus of Mesopotamian texts using the CDL tools?

Text Cataloging

The CDLI catalog provides a global repository of unique identifiers for inscribed objects:

  • you can use this catalog to develop a corpus and eliminate duplication
  • you can also use additional catalog data for fields not in the CDLI catalog

Data Entry

You can use old data and enter new data easily:

  • The heart of the CDL tools is a strictly defined text format which has an ASCII input version, ATF, and an XML version used by programs (XTF).
  • The CDL group has extensive experience in legacy data conversion and we are willing to help with substantial conversion jobs by bringing old data into ATF.
  • ATF makes easy things easy and difficult things possible.
  • ATF provides a complete solution to transliteration needs for cuneiform texts in any of the languages written in cuneiform.
  • An online template generator takes some of the drudgery out of entering new texts.
  • Even when many people prepare the data, results stay consistent thanks to the simplicity and thorough documentation of ATF.

Data Consistency

The CDL tools help you get data into a well-defined and highly consistent format and keep it that way:

  • An online service, the web-based ATF checker, identifies hundreds of different kinds of errors and can also do content validation of graphemes and more.
  • The ATF checker can produce lists of graphemes and words, with their frequencies, and these lists can be edited to eliminate inconsistent transliterations. The checker can then use the revised lists to check the content of your data, ensuring that it stays valid both in structure and content.
  • ATF is backed by a rigorously defined XML document definition in RELAX NG, an international standard schema language. Both the ATF input mechanism and the XML schemas are fully documented and available on the web.
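
The idea of frequency lists can be illustrated with a minimal sketch. This is not the ATF checker itself, only a toy approximation: it assumes that a text line begins with a line label ending in a period (such as "1."), that lines beginning with &, @, # or $ are not text lines, and that signs within a word are joined by hyphens. The function name `frequency_lists` and the sample text are hypothetical.

```python
import re
from collections import Counter

# Assumption: a text line starts with a label such as "1." or "2'." and a space.
TEXT_LINE = re.compile(r"^\S+\.\s+(.*)$")


def frequency_lists(lines):
    """Return (word_counts, grapheme_counts) for the text lines in `lines`."""
    words = Counter()
    graphemes = Counter()
    for line in lines:
        # Skip blank lines and lines assumed to carry structure or comments.
        if not line.strip() or line[0] in "&@#$":
            continue
        match = TEXT_LINE.match(line)
        if not match:
            continue
        for word in match.group(1).split():
            words[word] += 1
            # Simplification: split words into signs on hyphens only.
            graphemes.update(sign for sign in word.split("-") if sign)
    return words, graphemes


# Hypothetical sample; the identifier and content are placeholders.
sample = [
    "&P000001 = Hypothetical Text 1",
    "@obverse",
    "1. lugal-e e2 mu-du3",
    "2. lugal e2-gal",
]
words, graphemes = frequency_lists(sample)
for sign, count in graphemes.most_common():
    print(sign, count)
```

A list produced this way could be reviewed by hand to spot inconsistent transliterations before the texts are checked again.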

Data Backup and Version History

The CDLI ATF text repository can look after your data:

  • the repository is backed up nightly
  • files are available from anywhere to download, work on, and upload again
  • a history of changes is maintained; old versions can be retrieved easily

Data Development

Once texts are entered they can be enhanced in various ways:

  • Lemmatization can be added either with interlinear tags or dynamically during ATF processing.
  • Sentence boundaries may be added.
  • Full support is provided for different translation styles; translation units can be lines, groups of lines, or sentences.
  • The XML format, XTF, provides a very rich version of your textual data for programs to work on. Because you type simple ASCII texts and augment them automatically via lists, the benefits of XTF are achieved with as little human effort as possible.

Presentation

The same transliterations and translations can be presented in several ways:

  • Online, as HTML; we can even host your project either in its entirety or in part.
  • In print, by using the 'Create RTF' option on the web-service and importing the results into most modern word processors and page-layout programs.
  • Projects hosted on the CDL web server can easily take advantage of sophisticated searching and web-display facilities.
  • The various lists produced by the ATF web service can be used as indexes for print publications.

Enhanced Usefulness

Corpora prepared with these tools are reusable and more useful:

  • they add to the globally searchable and browsable Cuneiform Digital Library
  • they provide more and a greater variety of instances for use in online sign lists and dictionaries

How?

Building a corpus involves several steps. Not all of them are necessary for every project, and they can be carried out on a few texts at a time to build up a corpus incrementally, or all at once if the basic corpus is already available as legacy data.

P-numbers

P-numbers are unique identifiers required by the tools.

  • send a brief catalog of texts to CDLI staff
  • initially such a catalog might only contain identifying information such as museum and publication numbers and a few additional fields giving the author and date of the primary publication and the owner of the objects
  • information on period, provenience and genre is desirable--and is used by the web-based browse and search tools--but may be added later
  • further fields are useful but not absolutely required.
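
As a rough illustration of what a brief catalog might look like before it is sent off, the sketch below checks a few rows and writes them as CSV. The column names, the CSV format, and the sample values are all assumptions made for this example; the actual fields and file format should be agreed with CDLI staff.

```python
import csv
import io

# Hypothetical column names, loosely following the fields listed above.
REQUIRED = ["museum_no", "publication", "author", "pub_date", "owner"]
OPTIONAL = ["period", "provenience", "genre"]


def check_catalog(rows):
    """Report missing required fields and repeated museum numbers."""
    problems = []
    seen = {}
    for index, row in enumerate(rows, start=1):
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                problems.append(f"row {index}: missing {field}")
        key = (row.get("museum_no") or "").strip()
        if key:
            if key in seen:
                problems.append(f"row {index}: duplicates row {seen[key]} ({key})")
            else:
                seen[key] = index
    return problems


def write_catalog(rows):
    """Serialize the rows as CSV text; absent optional fields stay empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED + OPTIONAL, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, not real catalog data.
rows = [
    {"museum_no": "MUS 0001", "publication": "Example Publication 1, 12",
     "author": "A. Author", "pub_date": "1999", "owner": "Example Museum"},
    {"museum_no": "MUS 0002", "publication": "Example Publication 1, 13",
     "author": "A. Author", "pub_date": "1999", "owner": ""},
    {"museum_no": "MUS 0001", "publication": "Example Publication 2, 5",
     "author": "B. Author", "pub_date": "2003", "owner": "Example Museum"},
]
problems = check_catalog(rows)
print("\n".join(problems))
print(write_catalog(rows))
```

Checking for repeated identifying numbers before submission is one simple way to avoid duplicate entries in the corpus.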

Project Definition

Projects are the organizational core of the CDL server.

  • defining a project on the CDL server is optional; you can work with most of the tools without doing so
  • having a project makes it easy for you to store control lists of graphemes and lemmata, and to view your data online even while you are developing it.
  • a project provides an easy way to present your work online
  • e-mail Steve Tinney to arrange this

Transliterations

Transliterations are the core of a corpus.

  • convert legacy data to ATF
  • add new texts by typing them in ATF
  • validate your transliterations using the ATF checker
  • you can optionally use your own control lists for data content like allowable grapheme values and more

Linguistic Annotation

Linguistic annotation makes a corpus more useful.

  • you can add sentence boundaries to the transliterations by simply typing +. in the appropriate places
  • lemmatization, identifying the word to which each form belongs, is particularly useful
  • the CDL tools provide a straightforward procedure for making lists of forms and their lemmata
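
The two kinds of annotation described above can be sketched in a few lines. This is only an illustration, not the CDL procedure: it assumes that the +. marker is typed as a separate, space-delimited token, and it uses a hypothetical list of forms and lemmata (`LEMMATA`) with illustrative values.

```python
# Hypothetical list of forms and the lemmata they belong to.
LEMMATA = {
    "lugal-e": "lugal",
    "lugal": "lugal",
    "e2": "e",
    "mu-du3": "du",
}


def split_sentences(tokens, marker="+."):
    """Group tokens into sentences, treating `marker` as a boundary."""
    sentences, current = [], []
    for token in tokens:
        if token == marker:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:
        sentences.append(current)
    return sentences


def lemmatize(tokens, lemmata):
    """Pair each form with its lemma, or None when the form is not listed."""
    return [(token, lemmata.get(token)) for token in tokens]


def unknown_forms(tokens, lemmata, marker="+."):
    """Return the sorted forms that still need an entry in the list."""
    return sorted({t for t in tokens if t != marker and t not in lemmata})


tokens = "lugal-e e2 mu-du3 +. lugal e2-gal".split()
sentences = split_sentences(tokens)
for sentence in sentences:
    print(lemmatize(sentence, LEMMATA))
print("still to be lemmatized:", unknown_forms(tokens, LEMMATA))
```

Keeping the forms and lemmata in one list means that a correction made once applies to every occurrence of that form in the corpus.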

Translation

Translations can be integrated into the corpus.

  • Translate the texts or convert legacy translations to ATF.
  • Publish the corpus to the web with searching or in print with indexes.

Where?

There are several web pages that you can look at for further information:

What next?

If you are interested in building a corpus with the CDL tools please e-mail cdli@cdli.ucla.edu or Steve Tinney (stinney at sas dot upenn dot edu) to discuss the next steps.

Resources

cbuild.xdf
XDF source for this documentation.

Questions about this document may be directed to Steve Tinney (stinney at sas dot upenn dot edu).