CDL Linguistic Annotation

Steve Tinney and Eleanor Robson
Version of 2009-11-22

Introduction

This document provides an overview of linguistic annotation facilities used in the Cuneiform Digital Library. We focus here on the data-entry view of linguistic annotation, describing the general facilities provided by the ATF document type which support lemmatization, morphological annotation, unit divisions and syntactic trees. The facilities described here are not language-specific; instructions for applying this approach to different languages are maintained in language-specific documents.

ATF Files

The ATF format is a precisely defined ASCII notation which can be translated correctly and without loss of information into an XML equivalent; ATF is fully documented and the following example is provided only as a simple orientation for the reader:

&Q000887 = Gudea 1
@composite
1. {d}ba-u2
2. dumu an-na
3. nin-a-ni

Readers to whom this much ATF is not familiar are referred to the ATF Primer.

Lemmatization

Lemmatization, the simplest and most common kind of annotation, consists of labelling written words, which may be inflected, with the base word (or dictionary headword) of which the written form is an instance. The lemmatized version of the example above might look something like this:

&Q000887 = Gudea 1
@composite
1. {d}ba-u2
#lem: DN

2. dumu an-na
#lem: dumu[child]; An[]DN

3. nin-a-ni
#lem: nin[lady]

This example illustrates several common aspects of lemmatization; though the example is in Sumerian, the principles apply equally to all languages.

STOP HERE. If you have not already done so, you should read the Lemmatization Quick Start document before reading the sometimes more detailed discussions below.

Forms

The word 'form' is used to mean two things. The first meaning is the collection of analytic and linguistic information which belongs to a word. The word may be an instance in a corpus, in which case the form may have fields specific to the corpus context. Or the word may be an entry in a dictionary, in which case the form may have fields specific to the dictionary context (in a dictionary, each head word may have several forms which represent different combinations of spelling, part-of-speech, etc.).

In the following sections, we describe the fields that make up a form; by convention we write the names of these fields in uppercase and using abbreviations which, once learned, make it easy to refer to the fields of a form. The first field discussed utilizes our second sense of form, namely the orthographic form: to reduce the ambiguity of these usages, we abbreviate the orthographic form using the name 'ORTH'.

One part of the lemmatizer's job is to match up corpus-instance forms and dictionary-entry forms. It does this by using the written instance of a word and/or the supplied lemmatizations. It is possible to specify most form fields in the lemmatization line, but that is almost never necessary. Instead, we support supplying hints to the linguistic processing layer; these allow programs to make the correct choices between ambiguous forms or to supplement the information provided in the external lists of forms, which are either embedded in the project's corpus-based dictionary or maintained as separate forms files.

ORTH

We call the written sequence of graphemes which represents a word, including its bound morphemes and inflectional characteristics, an orthographic form, or sometimes simply 'form'. The words in a transliteration are such forms. Put simply, forms are the words which are separated by spaces.

CF

The fundamental portion of the lemmatization is the provision of a conventional form of the word of which the form is an instance; we call this conventional form of the word the Citation Form, and abbreviate it as CF.

In lemmatizations containing square brackets, the CF is given before the opening square bracket (lemmatizations which do not contain square brackets are discussed below under 'POS').

Lemmatizations which consist of CFs are required to have square brackets even if the brackets are empty; unbracketed forms are taken as bare part-of-speech annotations.

In languages which use normalization, it is permissible to give the normalization of the current instance instead of the citation form:

1.     AN{+e}
#lem:  szam^e[heavens]

GW

A Guide Word (abbreviated as GW) is a specific label paired with the CF to identify lemmata uniquely. A GW may be given within the square brackets following a CF, but this is optional and is generally not used in the case of proper nouns (see the example An[]DN above).

SENSE

A meaning, or sense, which a word has in the current context may be used instead of a GW between the brackets. The automated lemmatization starts by inserting the basic GW; this may be corrected by subsequent lemmatization processing, but normally annotators will correct this to specific senses as they work through the texts.

POS

Part-of-speech may be given immediately after the closing square bracket; for some languages the POS may be optional, as in Sumerian where the POS can normally be supplied from the dictionary.

A special case of lemmatization is the provision of a POS alone; this is presently used in CDL files as a place-holder pending more sophisticated approaches to prosopography and other realia. Thus, such lemmatizations as 'n' (number) and PN (personal name) are frequently seen in lemmatized files.

POS tags for proper nouns are not language-specific.

POS Tags for Proper Nouns
AN  Agricultural (locus) Name
CN  Celestial Name
DN  Divine Name
EN  Ethnos Name
FN  Field Name
GN  Geographical Name (lands and other geographical entities without their own tag)
LN  Line Name (ancestral clan)
MN  Month Name
ON  Object Name
PN  Personal Name
RN  Royal Name
SN  Settlement Name
TN  Temple Name
WN  Watercourse Name
YN  Year Name
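
The basic shape of a lemmatization described so far (CF, bracketed GW or SENSE, optional POS, or a bare POS with no brackets) can be illustrated with a minimal Python sketch. The function name and the regular expression are hypothetical and are not part of the ATF processor; the sketch ignores any fields that may follow the POS.

```python
import re

# Proper-noun POS tags from the table above.
PROPER_NOUN_TAGS = {
    "AN", "CN", "DN", "EN", "FN", "GN", "LN", "MN",
    "ON", "PN", "RN", "SN", "TN", "WN", "YN",
}

# CF, then a bracketed GW or SENSE (possibly empty), then an optional
# alphabetic POS tag; anything after the POS is ignored in this sketch.
_BRACKETED = re.compile(
    r"^(?P<cf>[^\[\]]+)\[(?P<gw>[^\[\]]*)\](?P<pos>[A-Za-z]*).*$"
)


def split_lemmatization(lem):
    """Hypothetical helper: return (cf, gw, pos) for one lemmatization."""
    m = _BRACKETED.match(lem)
    if m is None:
        # No square brackets: treated as a bare part-of-speech annotation.
        return (None, None, lem)
    return (m.group("cf"), m.group("gw") or None, m.group("pos") or None)


def is_proper_noun(lem):
    """True when the POS of the lemmatization is a proper-noun tag."""
    return split_lemmatization(lem)[2] in PROPER_NOUN_TAGS
```

For instance, `split_lemmatization("An[]DN")` yields the CF `An`, no GW, and the POS `DN`, while the unbracketed `DN` is read as a POS alone.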

EPOS

The lemmatizer actually uses a double POS: the default POS and the effective POS (EPOS).

Every instance of a lemma has a part-of-speech associated with it which is tied to the current syntactic context: this is the effective POS. Every lemma has a part-of-speech which is assumed to be the effective part-of-speech unless the lemmatizer is informed otherwise: this is the default POS. It is not necessarily the case that the default POS (or, since it is the unmarked case, simply POS) is the statistically most frequent EPOS (though this usually will be true). The nomination of a POS is primarily a matter of practical convenience.

In most lemmatization, even the existence of the EPOS can be ignored--this is why EPOS is not mentioned in the Lemmatization Quick Start. Some words, however, have more than one POS and it is necessary to annotate those cases in which the EPOS is not the POS. Another common case is that some classes of words may be used in certain contexts with an unusual EPOS--verbs may function as nouns, for example. Here, too, the EPOS must be annotated explicitly.

The EPOS is never inherited; every instance in which the EPOS is different from the POS must be annotated explicitly.

The EPOS is signified in ATF by use of the ASCII prime symbol, or right quote ('), immediately before the EPOS tag; when a POS is also given, the prime follows it. Annotation may give both POS and EPOS or EPOS alone:

1. sag9-ga-zu bi2-in-dug4
#lem: sag[good]V'N; dug[speak]

or:

1. sag9-ga-zu bi2-in-dug4
#lem: sag[good]'N; dug[speak]

(Other analyses of the construct above are possible; this is just an example).
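
The relationship between POS and EPOS can be sketched in Python as follows. This is only an illustration under the assumption that the tag text has already been isolated from the rest of the lemmatization; both function names are hypothetical.

```python
def split_pos(tag):
    """Hypothetical helper: split "POS'EPOS" into (pos, epos).

    Either part may be missing: "V'N" gives both, "'N" gives the EPOS
    alone, and "V" (no prime) gives the default POS alone.
    """
    pos, prime, epos = tag.partition("'")
    if not prime:
        return (pos or None, None)
    return (pos or None, epos or None)


def effective_pos(tag, dictionary_pos):
    """Resolve the part-of-speech that applies to this instance.

    An explicit EPOS wins; otherwise the explicit POS; otherwise the
    default POS supplied by the dictionary.
    """
    pos, epos = split_pos(tag)
    return epos or pos or dictionary_pos
```

Because the EPOS is never inherited, `effective_pos("", "V")` falls back to the dictionary's `V`, whereas `effective_pos("'N", "V")` returns `N` for that one instance only.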

BASE

For some languages, part of the written instance gives a particular base form of the word; for example, in Sumerian mu-un-du3 the base is du3. It is normally unnecessary to specify this unless the form being lemmatized uses a base which is new. When given as part of the lemmatization, this must come after the closing square bracket and is introduced by a forward slash (/) character:

1.     mu-un-du6
#lem:  +du[build]/du6

CONT

For Sumerian the continuation is the grapheme which encodes the final consonant of the base as its initial consonant. It is normally unnecessary to specify this unless the form being lemmatized is new. When specified, it must come immediately after the base, separated from it by a plus sign (+):

1.     du-ga
#lem:  dug[speak]/du+ga
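
As a rough illustration of the BASE and CONT syntax, the following Python sketch pulls both fields out of a lemmatization. The function name is hypothetical, and the sketch assumes that no other field follows the base and continuation.

```python
def base_and_cont(lem):
    """Hypothetical helper: return (base, cont) from one lemmatization.

    Only text after the closing square bracket is examined, so a slash
    inside the brackets is not mistaken for a base.
    """
    tail = lem.rpartition("]")[2]
    if "/" not in tail:
        return (None, None)
    spec = tail.split("/", 1)[1]
    # The continuation, when present, follows the base after a plus sign.
    base, _, cont = spec.partition("+")
    return (base, cont or None)
```

Applied to the examples above, `dug[speak]/du+ga` gives the base `du` with continuation `ga`, and `+du[build]/du6` gives the base `du6` with no continuation.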

NORM0

Normalization is the version of the word in which the actual written form is replaced by an analytic form which is a linguistic interpretation of the word represented by the writing. Though the CDL system recognizes several kinds of normalization, which are referred to internally as norm0, norm1, etc., it is only norm0 which is given in the lemmatization; this is the full form with accents and length marks and is always created by the person annotating the form at its first instance (once it has been included in the dictionary the lemmatizer has "learned" it and it no longer needs to be given).

Normalized forms are given somewhere after the closing square bracket and are introduced by a dollar sign ($):

1.     A
#lem: +m^u[water]$m^e

MORPH

Morphology (MORPH) may be annotated either externally, as part of the forms definitions, or inline, as part of the lemmatization string.

A morphology string is a sequence of morphemes or abstract denotations of morphemes, separated by periods. A morphology string may reference the base-form of the word by using a tilde (~). Additional conventions and constraints on the nature of morphology strings are language-specific.

#lem: lines

Separator

The sequence '; ', i.e., semi-colon followed by space, is reserved as the separator between lemmatizations. There must be the same number of lemmatizations in the #lem: line as there are forms in the corresponding line of transliteration; the ATF processor signals an error when it detects mismatches of this kind. Special provision is made for preserving this 1:1 relationship when labelling broken forms or breakage on manuscripts as described below.
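
The 1:1 check between forms and lemmatizations can be sketched in Python. This is not the ATF processor's implementation; the function name is hypothetical, and the sketch assumes a simple line label followed by space-separated forms.

```python
def pair_forms_with_lemmata(translit_line, lem_line):
    """Hypothetical helper: pair each form with its lemmatization.

    Raises ValueError when the counts differ, mirroring the mismatch
    error described in the text.
    """
    if not lem_line.startswith("#lem:"):
        raise ValueError("not a #lem: line")
    # Drop the line label (e.g. "2.") and split the forms on whitespace.
    _label, _, text = translit_line.partition(" ")
    forms = text.split()
    # Lemmatizations are separated by semi-colon plus space.
    body = lem_line[len("#lem:"):].strip()
    lemmata = [part.strip() for part in body.split("; ")]
    if len(forms) != len(lemmata):
        raise ValueError(
            f"{len(forms)} forms but {len(lemmata)} lemmatizations"
        )
    return list(zip(forms, lemmata))
```

With line 2 of the Gudea example, the result pairs `dumu` with `dumu[child]` and `an-na` with `An[]DN`; a #lem: line that omits a separator between two lemmatizations raises an error.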

Ambiguity

Ambiguous forms may have multiple lemmatizations attached to them with the lemmatizations separated by vertical bars:

1. an-na
#lem: DN|an[sky]

The sequences either side of vertical bars are complete lemmatizations in their own right and may therefore have their own POS, morphology, disambiguation and any other characteristics.

Multiplexes

There are several circumstances in which a single orthographic form ("word") actually writes more than one lemma: these include crasis and sandhi writings as well as logograms which are best treated as a single word (perhaps because of word order) but which correspond to more than one word in the target language (e.g., the writing {d}UTU.E3 for Akkadian s,=it szamszi).

In all these cases, the input is analogous to the ambiguous forms described above, but an ampersand (&) is used instead of the vertical bar. Thus, {d}UTU.E3 would be lemmatized as s,=it[exit]&szamszi[sun]. (Note, by the way, that compound phrases are always lemmatized according to their constituents.)
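
A small Python sketch can show how ambiguity and multiplexes might be separated. The function name is hypothetical, and the way the two conventions nest (alternatives first, then the parts of each alternative) is an assumption made for illustration, not something this document specifies.

```python
def alternatives(lem):
    """Hypothetical helper: split on '|' and then on '&'.

    Returns a list of alternatives; each alternative is a list of the
    lemmata that the single written form stands for.
    """
    return [alt.split("&") for alt in lem.split("|")]
```

Here `DN|an[sky]` yields two alternatives of one lemma each, while `s,=it[exit]&szamszi[sun]` yields one alternative made up of two lemmata.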

Uncertainty

Uncertainty in lemmatization is indicated by the use of the conventional lemmatization X (uppercase 'X'). This should be used when the form is in principle open to lemmatization but no lemmatization can be suggested.

Breakage

Breakage in the manuscript is lemmatized with the conventional lemmatization u; such forms are considered unlemmatizable.

Numbers

Numbers are lemmatized with the conventional lemmatization n; a special-purpose processor is planned for higher order annotation and manipulation of numerical data.

N.B. In narrative context, numbers should be lemmatized as words; in administrative context, the n convention should be used.

Miscellanea

The conventional lemmatization M is used where the form is a standalone instance of a morpheme such as occur in certain Mesopotamian lexical lists.

The conventional lemmatization L is used where the form is in a language that is not currently handled by the lemmatization system.
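
The conventional lemmatizations from the last four sections can be gathered into a lookup table, sketched here in Python. The dictionary and function names are hypothetical, and the descriptions paraphrase the text above.

```python
# Conventional single-letter lemmatizations; note that case matters.
CONVENTIONAL = {
    "X": "uncertain: open to lemmatization, but none can be suggested",
    "u": "breakage: unlemmatizable",
    "n": "number (administrative context)",
    "M": "standalone morpheme",
    "L": "language not handled by the lemmatization system",
}


def conventional_meaning(lem):
    """Return the description of a conventional lemmatization, or None."""
    return CONVENTIONAL.get(lem)
```

Any other lemmatization, such as `nin[lady]`, returns `None` from this lookup.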

Forms Files

Forms files give a mapping from forms and CF[GW] pairs to other information relevant to a form such as the morphological analysis. The format of forms files is documented separately.

Inlining

Morphology may be included directly on the lemmatization, following any POS. In such cases, the separator is a hash character (#) with no surrounding spaces, and the morphology string following directly afterwards: du[build]V#mu.na.ni:~. This is mainly needed for syllabic writings.
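
The inline morphology syntax can be illustrated with a minimal Python sketch. Both function names are hypothetical; the sketch assumes that the hash character occurs only as the morphology separator and treats anything between periods as one morpheme, leaving language-specific conventions uninterpreted.

```python
def split_inline_morphology(lem):
    """Hypothetical helper: return (lemmatization, morphemes or None)."""
    head, hash_sign, morph = lem.partition("#")
    if not hash_sign:
        return (head, None)
    # Morphemes are separated by periods.
    return (head, morph.split("."))


def expand_base(morphemes, base):
    """Replace the tilde, which references the base, with the base itself."""
    return [m.replace("~", base) for m in morphemes]
```

For `du[build]V#mu.na.ni:~` the first function returns the morphemes `mu`, `na` and `ni:~`; passing them to `expand_base` with the base `du3` turns the last one into `ni:du3`.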

Disambiguation

Two common forms of manual annotation are disambiguation and augmentation. Disambiguation is necessary when a form gives part of a morpheme but that part could be analyzed in more than one way. The three cases that are recognized in Sumerian are: Locative-Terminative vs. Ergative when the form ends in /e/; Locative vs. Genitive when a nominal form ends in /a/; and Nominalizer vs. Copula when a verbal form ends in /a/.

Augmentation is used when no part of the morpheme is preserved in the writing of the form; it is an easy way of adding unexpressed morphemes such as Sumerian /ak/ and other case-markers. Augmentation is discussed further below.

Ambiguous forms which are susceptible to multiple analyses even within the same CF[GW] can be disambiguated using the syntax \<DISAMBIGUATOR>. The particular disambiguators are language-specific; examples in Sumerian include:

\a = select locative form
\k = select genitive form

\l = select locative-terminative form (default)
\e = select ergative form

\a = nominalizer
\m = copula (am)

For examples see the next version of Gudea 1 below.

Disambiguation can also be given as part of the sense immediately before the closing square bracket in a lemmatization string; these disambiguations refer to choices available in the lexicon. For Sumerian a common lexical disambiguation is the choice between intransitive and transitive in labile verbs or so-called causatives:

/t = select transitive

We specify the unmarked case to be intransitive so that, e.g., gub[stand] needs no further annotation when intransitive; when transitive it should be annotated as gub[stand/t].
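
The two kinds of disambiguation can be told apart by position: backslash disambiguators follow the closing square bracket, while lexical ones sit inside the brackets. A minimal Python sketch, with hypothetical function names and the assumption that a disambiguator is a single lowercase letter:

```python
import re

# A backslash followed by one lowercase letter, e.g. \k or \e.
_DISAMBIGUATOR = re.compile(r"\\([a-z])")


def morphological_disambiguators(lem):
    """Hypothetical helper: list backslash disambiguators after ']'."""
    tail = lem.rpartition("]")[2]
    return _DISAMBIGUATOR.findall(tail)


def is_transitive(lem):
    """True when the sense inside the brackets ends in '/t'."""
    m = re.search(r"\[([^\]]*)\]", lem)
    return bool(m) and m.group(1).endswith("/t")
```

For example, `is_transitive("gub[stand/t]")` is true and `is_transitive("gub[stand]")` is false, while `morphological_disambiguators(r"An[]DN\k")` returns the single disambiguator `k`.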

Augmentation

Augmentation consists of adding to morphological sequences. Augmentation is currently primitive and consists exclusively of the ability to append morphemes at the end of the morphology given in the forms file. This is probably only useful for the common case of adding unexpressed case-markers to Sumerian annotation; for more complex cases, the entire morphology string must be given inline as described under 'Inlining' above.

Augmentation is given after POS and the optional disambiguation, but before any morphology string; it is indicated using the plus sign (+).

Gudea 1 Redux

Given the Disambiguation and Augmentation conventions above, our sample text can now be annotated more completely as follows:

&Q000887 = Gudea 1
@composite
1. {d}ba-u2
#lem: DN

2. dumu an-na
#lem: dumu[child]; An[]DN\k

3. nin-a-ni
#lem: nin[lady]+.*ra

Units

Top-level unit (normally main sentence) boundaries can be annotated within the lemmatization by use of two conventions:

+. = insert unit boundary
-. = suppress unit boundary

The +. convention is relevant to all languages. It must occur either at the very beginning or the very end of the lemmatization string: if it precedes the lemmatization it must be followed by a space; if it follows the lemmatization it must be preceded by a space.

For some languages (e.g., Sumerian) most unit boundaries are correctly identified programmatically; where the program is wrong, the -. convention can be used to suppress a break. The -. convention is subject to the same rules for placement and whitespace as the +. convention.

6. mu-na-du3
#lem: du[build] +.

...

10. e2 mu-na-du3 lugal-e
#lem: e[house]; du[build] -.; lugal[king] +.
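
The placement and whitespace rules for the two unit conventions can be sketched in Python. The function name is hypothetical; the sketch recognizes a marker only at the very beginning or very end of a lemmatization and only when it is set off by a space.

```python
def split_unit_marker(lem):
    """Hypothetical helper: return (marker, lemmatization).

    The marker is "+.", "-." or None when no unit convention is present.
    """
    for marker in ("+.", "-."):
        if lem.startswith(marker + " "):
            return (marker, lem[len(marker) + 1:])
        if lem.endswith(" " + marker):
            return (marker, lem[:-(len(marker) + 1)])
    return (None, lem)
```

Because the space is required, a string such as `nin[lady]+.*ra`, in which the plus sign introduces an augmentation, is left untouched rather than read as a unit boundary.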

Dictionaries

A specific type of dictionary, the Corpus-Based Dictionary XML datatype, is used by CDL annotation to provide control lists of permitted CFs, GWs, Senses and POS information. Documentation of this format is in preparation.

This dictionary is the means of supplying POS information when it is not given explicitly (if given explicitly, the POS in the lemmatization overrides the one given in the dictionary).

The dictionary is also the means of canonicalizing lemmatizations of the form CF[SENSE] since such pairs can be looked up and the corresponding unique CF[GW] identified; this is relevant in the construction of forms files.
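
The two roles of the dictionary described here, supplying a default POS and canonicalizing CF[SENSE] to CF[GW], can be sketched in Python. The glossary below is toy data invented for illustration (the extra senses and POS values are assumptions), it is keyed by CF alone, and it does not reflect the Corpus-Based Dictionary XML format.

```python
# Toy glossary, invented for illustration: CF -> GW, default POS, senses.
GLOSSARY = {
    "dumu": {"gw": "child", "pos": "N", "senses": {"child", "son"}},
    "dug": {"gw": "speak", "pos": "V", "senses": {"speak", "say"}},
}


def canonicalize(cf, sense, pos=None):
    """Hypothetical helper: map CF[SENSE] to (cf, gw, pos).

    An explicit POS overrides the dictionary's default POS.
    """
    entry = GLOSSARY.get(cf)
    if entry is None or sense not in entry["senses"]:
        raise KeyError(f"{cf}[{sense}] is not in the glossary")
    return (cf, entry["gw"], pos or entry["pos"])
```

With this toy data, `canonicalize("dumu", "son")` resolves to the CF[GW] pair dumu[child] with the default POS `N`, and an explicit POS passed as the third argument takes precedence over the default.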

Syntax Trees

Syntax annotation is language-specific, but the principles and notational conventions supported by the ATF processor apply to all languages. In all cases the annotation may come before or after the lemma, but in practice most annotation (except for unit boundaries) comes before the lemma.

Language-specific documents are currently available for the following languages: Akkadian; Aramaic; Greek; Elamite; Sumerian.

+/-[SYMBOL]

The convention described above for unit boundaries is also available for syntax annotation; a plus or minus sign followed by a single non-alphabetic symbol indicates the presence/absence or forcing/suppression of a feature. The relationship between symbols and features is specified in the documentation for specific languages.

([+/-][TEXT])

Syntax annotation may be placed between a pair of parentheses. An optional plus or minus may occur immediately after the open parenthesis. The content of the parentheses is any text but may not contain parentheses.

Generic

In principle, generic syntax annotation can be provided before or after the lemmatization string. Syntax preceding the lemma must begin with an opening parenthesis and be separated from the lemma by a space; that following the lemma must end with a closing parenthesis and must also be separated from the lemma by a space. A syntax string may not contain spaces.

6. mu-na-du3
#lem: (VC-F du[build] ) +.

This feature is not yet supported and will only be supported when a demonstrated need for it is encountered.

Resources

linganno.xdf
XDF source for this documentation.

References

[AKK] Akkadian Language Information

[ARC] Aramaic Language Information

[ELX] Elamite Language Information

[GRC] Greek Language Information

[SUX] Sumerian Language Information


Questions about this document may be directed to Steve Tinney (stinney at sas dot upenn dot edu).