This document provides an overview of linguistic annotation facilities used in the Cuneiform Digital Library. We focus here on the data-entry view of linguistic annotation, describing the general facilities provided by the ATF document type which support lemmatization, morphological annotation, unit divisions and syntactic trees. The facilities described here are not language-specific; instructions for applying this approach to different languages are maintained in language-specific documents.
The ATF format is a precisely defined ASCII notation which can be translated correctly and without loss of information into an XML equivalent; ATF is fully documented and the following example is provided only as a simple orientation for the reader:
&Q000887 = Gudea 1
@composite
1. {d}ba-u2
2. dumu an-na
3. nin-a-ni
Readers to whom this much ATF is not familiar are referred to the ATF Primer.
Lemmatization is the simplest and most common kind of annotation. It consists of labelling written words, which may be inflected, with the base word (or dictionary headword) of which the written form is an instance. The lemmatized version of the example above might look something like this:
&Q000887 = Gudea 1
@composite
1. {d}ba-u2
#lem: DN
2. dumu an-na
#lem: dumu[child]; An[]DN
3. nin-a-ni
#lem: nin[lady]
This example illustrates several common aspects of lemmatization; though the example is in Sumerian, the principles apply equally to all languages.
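The pairing of transliteration lines with their #lem: lines can be sketched in Python as follows. This is only an illustration, not the ATF processor: the function name pair_lem_lines is hypothetical, and the sketch assumes that line labels are plain numbers followed by a period and that each #lem: line immediately follows the line it annotates.

```python
def pair_lem_lines(atf_text):
    """Pair each numbered text line with the #lem: line that follows it."""
    pairs = []
    last_text = None
    for raw in atf_text.splitlines():
        line = raw.strip()
        if line.startswith("#lem:"):
            if last_text is not None:
                pairs.append((last_text, line[len("#lem:"):].strip()))
                last_text = None
        elif line and line[0].isdigit() and ". " in line:
            # Numbered line: drop the "1. " label, keep the transliteration.
            last_text = line.split(". ", 1)[1]
    return pairs


sample = """&Q000887 = Gudea 1
@composite
1. {d}ba-u2
#lem: DN
2. dumu an-na
#lem: dumu[child]; An[]DN
3. nin-a-ni
#lem: nin[lady]
"""

for text, lem in pair_lem_lines(sample):
    print(text, "=>", lem)
```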
STOP HERE. If you have not already done so, you should read the Lemmatization Quick Start document before reading the sometimes more detailed discussions below.
The word 'form' is used to mean two things. The first meaning is the collection of analytic and linguistic information which belongs to a word. The word may be an instance in a corpus, in which case the form may have fields specific to the corpus context. Or the word may be an entry in a dictionary, in which case the form may have fields specific to the dictionary context (in a dictionary, each head word may have several forms which represent different combinations of spelling, part-of-speech etc.).
In the following sections, we describe the fields that make up a form; by convention we write the names of these fields in uppercase and using abbreviations which, once learned, make it easy to refer to the fields of a form. The first field discussed utilizes our second sense of form, namely the orthographic form: to reduce the ambiguity of these usages, we abbreviate the orthographic form using the name 'ORTH'.
One part of the lemmatizer's job is to match up corpus-instance forms and dictionary-entry forms. It does this by using the written instance of a word, the supplied lemmatizations, or both. Although most form fields can be specified in the lemmatization line, this is almost never necessary. Instead, we support supplying hints to the linguistic processing layer; these allow programs to make the correct choices between ambiguous forms or to supplement the information provided in the external lists of forms, which are either embedded in the project's corpus-based dictionary or maintained as separate forms files.
We call the written sequence of graphemes which represents a word, including its bound morphemes and inflectional characteristics, an orthographic form, or sometimes simply 'form'. The words in a transliteration are such forms. Simply, forms are the words which are separated by spaces.
The fundamental portion of the lemmatization is the provision of a conventional form of the word of which the form is an instance; we call this conventional form of the word the Citation Form, and abbreviate it as CF.
In lemmatizations containing square brackets, the CF is given before the opening square bracket (lemmatizations which do not contain square brackets are discussed below under 'POS').
Lemmatizations which consist of CFs are required to have square brackets even if the brackets are empty; unbracketed forms are taken as bare part-of-speech annotations.
In languages which use normalization, it is permissible to give the normalization of the current instance instead of the citation form:
1. AN{+e}
#lem: szam^e[heavens]
A Guide Word (abbreviated as GW)
is a specific label paired with the CF to identify lemmata uniquely.
A GW may be given within the square brackets following a CF but this
is optional and is generally not used in the case of proper nouns (see
the example An[]DN above).
A meaning, or sense, which a word has in the current context may be used instead of a GW between the brackets. The automated lemmatization starts by inserting the basic GW; this may be corrected by subsequent lemmatization processing, but normally annotators will correct this to specific senses as they work through the texts.
Part-of-speech may be given immediately after the closing square bracket; for some languages the POS may be optional, as in Sumerian where the POS can normally be supplied from the dictionary.
A special case of lemmatization is the provision of a POS alone; this is presently used in CDL files as a place-holder pending more sophisticated approaches to prosopography and other realia. Thus, such lemmatizations as 'n' (number) and PN (personal name) are frequently seen in lemmatized files.
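The relationship between CF, GW (or sense) and POS described so far can be sketched as a small Python parser. This is a simplified illustration under stated assumptions, not the lemmatizer's own logic: the name parse_simple_lemma is hypothetical, POS tags are assumed to consist of ASCII letters only, and any further fields after the closing bracket are deliberately rejected rather than interpreted.

```python
import re

# CF, then [GW or sense] (possibly empty), then an optional POS tag.
LEMMA_RE = re.compile(
    r"^(?P<cf>[^\[\]]+)\[(?P<gw>[^\[\]]*)\](?P<pos>[A-Za-z]*)$"
)


def parse_simple_lemma(lem):
    """Split a lemmatization into CF, GW/sense and POS."""
    m = LEMMA_RE.match(lem)
    if m is not None:
        return {
            "cf": m.group("cf"),
            "gw": m.group("gw") or None,
            "pos": m.group("pos") or None,
        }
    if "[" in lem or "]" in lem:
        raise ValueError("fields beyond CF[GW]POS are not handled: " + lem)
    # No square brackets: taken as a bare part-of-speech annotation.
    return {"cf": None, "gw": None, "pos": lem}


print(parse_simple_lemma("dumu[child]"))
print(parse_simple_lemma("An[]DN"))
print(parse_simple_lemma("PN"))
```

A POS of None here means only that none was written in the lemmatization; as noted above, it may still be supplied from the dictionary.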
POS tags for proper nouns are not language-specific.
| POS Tags for Proper Nouns | |
| AN | Agricultural (locus) Name |
| CN | Celestial Name |
| DN | Divine Name |
| EN | Ethnos Name |
| FN | Field Name |
| GN | Geographical Name (lands and other geographical entities without their own tag) |
| LN | Line Name (ancestral clan) |
| MN | Month Name |
| ON | Object Name |
| PN | Personal Name |
| RN | Royal Name |
| SN | Settlement Name |
| TN | Temple Name |
| WN | Watercourse Name |
| YN | Year Name |
The lemmatizer actually uses a double POS: the default POS and the effective POS (EPOS).
Every instance of a lemma has a part-of-speech associated with it which is tied to the current syntactic context: this is the effective POS. Every lemma has a part-of-speech which is assumed to be the effective part-of-speech unless the lemmatizer is informed otherwise: this is the default POS. It is not necessarily the case that the default POS (or, since it is the unmarked case, simply POS) is the statistically most frequent EPOS (though this usually will be true). The nomination of a POS is primarily a matter of practical convenience.
In most lemmatization, even the existence of the EPOS can be ignored, which is why EPOS is not mentioned in the Lemmatization Quick Start. Some words, however, have more than one POS and it is necessary to annotate those cases in which the EPOS is not the POS. Another common case is that some classes of words may be used in certain contexts with an unusual EPOS: verbs may function as nouns, for example. Here, too, the EPOS must be annotated explicitly.
The EPOS is never inherited; every instance in which the EPOS is different from the POS must be annotated explicitly.
The EPOS is signified in ATF by use of the ASCII prime symbol, or
right quote ('), immediately before the effective POS tag. Annotation
may give both POS and EPOS or EPOS alone:
1. sag9-ga-zu bi2-in-dug4
#lem: sag[good]V'N; dug[speak]
or:
1. sag9-ga-zu bi2-in-dug4
#lem: sag[good]'N; dug[speak]
(Other analyses of the construct above are possible; this is just an example).
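The way POS and EPOS interact can be sketched in Python as follows. This is an illustration only: the function name split_pos_epos and the dictionary_pos parameter are hypothetical, and the sketch assumes it is handed just the tag text that follows the closing square bracket (for example V'N or 'N), with tags made of ASCII letters.

```python
import re

# Optional POS, then an optional prime followed by the EPOS.
POS_EPOS_RE = re.compile(r"^(?P<pos>[A-Za-z]*)(?:'(?P<epos>[A-Za-z]+))?$")


def split_pos_epos(tag, dictionary_pos=None):
    """Return (POS, EPOS) for the tag text after the closing bracket."""
    m = POS_EPOS_RE.match(tag)
    if m is None:
        raise ValueError("not a POS/EPOS tag: " + tag)
    # A missing POS falls back to the default POS from the dictionary.
    pos = m.group("pos") or dictionary_pos
    # A missing EPOS means the default POS is also the effective one.
    epos = m.group("epos") or pos
    return pos, epos


print(split_pos_epos("V'N"))
print(split_pos_epos("'N", dictionary_pos="V"))
print(split_pos_epos("", dictionary_pos="V"))
```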
For some languages part of the written instance gives a particular
base form of the word, for example in Sumerian mu-un-du3
the base is du3. It is normally unnecessary to specify
this unless the form being lemmatized uses a base which is new. When
given as part of the lemmatization this must come after the closing
square bracket and is introduced by a forward slash (/)
character:
1. mu-un-du6
#lem: +du[build]/du6
For Sumerian the continuation is the grapheme which encodes the
final consonant of the base as its initial consonant. It is normally
unnecessary to specify this unless the form being lemmatized is new.
When specified, it must come immediately after the base, separated
from it by a plus sign (+):
1. du-ga
#lem: dug[speak]/du+ga
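The base and continuation conventions can be sketched in Python like this. The function name split_base is hypothetical, and the sketch makes a simplifying assumption that the base (with any continuation) is the last field of the lemmatization; it looks for the slash only after the closing square bracket, so a slash inside the brackets is left alone.

```python
def split_base(lem):
    """Return (lemmatization without base, base, continuation)."""
    head, bracket, tail = lem.partition("]")
    if not bracket or "/" not in tail:
        return lem, None, None
    before, _, base_part = tail.partition("/")
    # An optional continuation follows the base after a plus sign.
    base, plus, continuation = base_part.partition("+")
    return head + "]" + before, base, (continuation if plus else None)


print(split_base("+du[build]/du6"))
print(split_base("dug[speak]/du+ga"))
```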
Normalization is the version of the word in which the actual
written form is replaced by an analytic form which is a linguistic
interpretation of the word represented by the writing. Though the CDL
system recognizes several kinds of normalization, which are referred
to internally as norm0, norm1, etc., it is
only norm0 which is given in the lemmatization; this is the full form
with accents and length marks and is always created by the person
annotating the form at its first instance (once it has been included
in the dictionary the lemmatizer has "learned" it and it no longer
needs to be given).
Normalized forms are given somewhere after the closing square
bracket and are introduced by a dollar sign ($):
1. A
#lem: +m^u[water]$m^e
Morphology (MORPH) may be annotated either externally, as part of the forms definitions, or inline, as part of the lemmatization string.
A morphology string is a sequence of morphemes or abstract
denotations of morphemes, separated by periods. A morphology string
may reference the base-form of the word by using a tilde
(~). Additional conventions and constraints on the
nature of morphology strings are language-specific.
The sequence '; ', i.e., semi-colon followed by space,
is reserved as the separator between lemmatizations. There must be
the same number of lemmatizations in the #lem: line as
there are forms in the corresponding line of transliteration; the ATF
processor signals an error when it detects mismatches of this kind.
Special provision is made for preserving this 1:1 relationship when
labelling broken forms or breakage on manuscripts as described
below.
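The 1:1 requirement can be sketched as a simple check in Python. This is not the ATF processor's implementation: the function name check_alignment is hypothetical, and the sketch assumes that forms are separated by whitespace and lemmatizations by the reserved sequence of semi-colon plus space.

```python
def check_alignment(text_line, lem_line):
    """Pair forms with lemmatizations, or raise on a count mismatch."""
    forms = text_line.split()
    lemmata = lem_line.split("; ")
    if len(forms) != len(lemmata):
        raise ValueError(
            "%d forms but %d lemmatizations" % (len(forms), len(lemmata))
        )
    return list(zip(forms, lemmata))


print(check_alignment("dumu an-na", "dumu[child]; An[]DN"))
```

Because the split is made on the two-character separator rather than on every space, a lemmatization that itself contains a space is still counted as a single item.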
Ambiguous forms may have multiple lemmatizations attached to them with the lemmatizations separated by vertical bars:
1. an-na
#lem: DN|an[sky]
The sequences either side of vertical bars are complete lemmatizations in their own right and may therefore have their own POS, morphology, disambiguation and any other characteristics.
There are several circumstances in which a single orthographic
form ("word") actually writes more than one lemma: these include
crasis and sandhi writings as well as logograms which are best treated
as a single word (perhaps because of word order) but which correspond
to more than one word in the target language (e.g., the writing
{d}UTU.E3 for Akkadian s,=it szamszi).
In all these cases, the input is analogous to the ambiguous forms
described above, but the & is used instead of the
vertical bar. Thus, {d}UTU.E3 would be lemmatized as
s,=it[exit]&szamszi[sun]. (Note, by the way, that
compound phrases are always lemmatized according to their
constituents).
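The two separators can be sketched together in Python. The function name split_alternatives is hypothetical. The sketch also assumes, which the text above does not state, that the vertical bar is split first and the ampersand second, and that neither character occurs inside a CF or GW.

```python
def split_alternatives(lem):
    """Return a list of alternatives, each a list of component lemmata."""
    return [alternative.split("&") for alternative in lem.split("|")]


print(split_alternatives("DN|an[sky]"))
print(split_alternatives("s,=it[exit]&szamszi[sun]"))
```

An ordinary lemmatization comes back as one alternative with one component, so the same structure can be used for every form.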
Uncertainty in lemmatization is indicated by the use of the
conventional lemmatization X (uppercase 'X'). This
should be used when the form is in principle open to lemmatization but
no lemmatization can be suggested.
Breakage in the manuscript is lemmatized with the conventional
lemmatization u; such forms are considered
unlemmatizable.
Numbers are lemmatized with the conventional lemmatization
n; a special-purpose processor is planned for higher
order annotation and manipulation of numerical data.
N.B. In narrative context, numbers should be
lemmatized as words; in administrative context, the n
convention should be used.
The conventional lemmatization M is used where the
form is a standalone instance of a morpheme such as occur in certain
Mesopotamian lexical lists.
The conventional lemmatization L is used where the
form is in a language that is not currently handled by the
lemmatization system.
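The conventional lemmatizations described above can be collected in a small Python lookup table. The names CONVENTIONAL and describe_conventional are hypothetical, and the descriptions are paraphrases of the text above rather than labels used by the CDL tools.

```python
# Conventional lemmatizations and what each one signals.
CONVENTIONAL = {
    "X": "uncertain: open to lemmatization, but none can be suggested",
    "u": "breakage in the manuscript: unlemmatizable",
    "n": "number (administrative context)",
    "M": "standalone instance of a morpheme",
    "L": "language not currently handled by the lemmatization system",
}


def describe_conventional(lem):
    """Return the meaning of a conventional lemmatization, else None."""
    return CONVENTIONAL.get(lem)


print(describe_conventional("X"))
print(describe_conventional("dumu[child]"))
```

The lookup is case-sensitive, so an uppercase X and a lowercase u or n are kept distinct, as in the conventions themselves.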
Forms files give a mapping from forms and CF[GW] pairs to other information relevant to a form such as the morphological analysis. The format of forms files is documented separately.
Morphology may be included directly on the lemmatization, following
any POS. In such cases, the separator is a hash character
(#) with no surrounding spaces, and the morphology string
following directly afterwards: du[build]V#mu.na.ni:~. This is mainly needed
for syllabic writings.
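Inline morphology can be separated from the rest of the lemmatization and split into morphemes as sketched below in Python. The function names split_inline_morphology and expand_base are hypothetical. The sketch assumes that the hash character does not occur inside a CF or GW, and it does not interpret the colon that appears in the example string, since such conventions are language-specific.

```python
def split_inline_morphology(lem):
    """Return (lemmatization without morphology, list of morphemes)."""
    head, hash_sign, morphology = lem.partition("#")
    if not hash_sign:
        return lem, []
    # Morphemes in a morphology string are separated by periods.
    return head, morphology.split(".")


def expand_base(morphemes, base):
    """Substitute the base form for each tilde in the morpheme list."""
    return [morpheme.replace("~", base) for morpheme in morphemes]


head, morphemes = split_inline_morphology("du[build]V#mu.na.ni:~")
print(head, morphemes)
print(expand_base(morphemes, "du3"))
```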
Two common forms of annotation carried out manually are disambiguation and augmentation. Disambiguation is necessary when a form gives part of a morpheme but that part could be analyzed in more than one way. The three cases that are recognized in Sumerian are: Locative-Terminative vs. Ergative when the form ends in /e/; Locative vs. Genitive when a nominal form ends in /a/; and Nominalizer vs. Copula when a verbal form ends in /a/.
Augmentation is used when no part of the morpheme is preserved in the writing of the form; it is an easy way of adding unexpressed morphemes such as Sumerian /ak/ and other case-markers. Augmentation is discussed further below.
Ambiguous forms which are susceptible to multiple analyses even
within the same CF[GW] can be disambiguated using the syntax \<DISAMBIGUATOR>. The particular
disambiguators are language-specific; examples in Sumerian
include:
\a = select locative form
\k = select genitive form
\l = select locative-terminative form (default)
\e = select ergative form
\a = nominalizer
\m = copula (am)
For examples see the next version of Gudea 1 below.
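Reading a Sumerian disambiguator can be sketched in Python as follows. The names NOMINAL, VERBAL and read_disambiguator are hypothetical, as is the verbal flag: because \a has two meanings in the list above, the sketch assumes the caller knows whether the form is nominal or verbal, and the grouping of the codes into two tables is an assumption made for this illustration. The disambiguator is taken to be the single character after the backslash.

```python
NOMINAL = {
    "a": "locative",
    "k": "genitive",
    "l": "locative-terminative",
    "e": "ergative",
}
VERBAL = {
    "a": "nominalizer",
    "m": "copula",
}


def read_disambiguator(lem, verbal=False):
    """Return (lemmatization before the backslash, selected analysis)."""
    head, backslash, tail = lem.partition("\\")
    if not backslash:
        return lem, None
    code = tail[:1]
    table = VERBAL if verbal else NOMINAL
    if code not in table:
        raise ValueError("unknown disambiguator: \\" + code)
    return head, table[code]


print(read_disambiguator("An[]DN\\k"))
```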
Disambiguation can also be given as part of the sense immediately before the closing square bracket in a lemmatization string; these disambiguations refer to choices available in the lexicon. For Sumerian a common lexical disambiguation is the choice between intransitive and transitive in labile verbs or so-called causatives:
/t = select transitive
We specify the unmarked case to be intransitive so that, e.g.,
gub[stand] needs no further annotation when intransitive;
when transitive it should be annotated as gub[stand/t].
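The lexical disambiguation convention can be sketched in Python as below. The function names split_lexical_disambiguator and is_transitive are hypothetical, and the sketch assumes that a slash inside the square brackets always introduces a disambiguator, that is, that senses themselves never contain a slash.

```python
def split_lexical_disambiguator(lem):
    """Return (sense, disambiguator) taken from inside the brackets."""
    start = lem.find("[")
    end = lem.find("]", start + 1)
    if start == -1 or end == -1:
        return None, None
    inner = lem[start + 1:end]
    sense, slash, code = inner.rpartition("/")
    if not slash:
        return inner, None
    return sense, code


def is_transitive(lem):
    """Intransitive is the unmarked case; /t selects transitive."""
    _, code = split_lexical_disambiguator(lem)
    return code == "t"


print(split_lexical_disambiguator("gub[stand/t]"))
print(is_transitive("gub[stand]"))
```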
Augmentation consists of adding to morphological sequences. Augmentation is currently primitive and consists exclusively of the ability to append morphemes at the end of the morphology given in the forms file. This is probably only useful for the common case of adding unexpressed case-markers to Sumerian annotation; for more complex cases, the entire morphology string must be given inline as described under 'Inlining' above.
Augmentation is given after POS and the optional disambiguation,
but before any morphology string; it is indicated using the plus sign
(+).
Given the Disambiguation and Augmentation conventions above, our sample text can now be annotated more completely as follows:
&Q000887 = Gudea 1
@composite
1. {d}ba-u2
#lem: DN
2. dumu an-na
#lem: dumu[child]; An[]DN\k
3. nin-a-ni
#lem: nin[lady]+.*ra
Top-level unit (normally main sentence) boundaries can be annotated within the lemmatization by use of two conventions:
+. = insert unit boundary
-. = suppress unit boundary
The +. convention is relevant to all languages. It
must occur either at the very beginning or the very end of the
lemmatization string: if it precedes the lemmatization it must be
followed by a space; if it follows the lemmatization it must be
preceded by a space.
For some languages (e.g., Sumerian) most unit boundaries are
correctly identified programmatically; where the program is wrong, the
-. can be used to suppress a break. The -.
convention is subject to the same rules for placement and whitespace
as +..
6. mu-na-du3
#lem: du[build] +.
...
10. e2 mu-na-du3 lugal-e
#lem: e[house]; du[build] -.; lugal[king] +.
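The placement and whitespace rules for unit-boundary markers can be sketched in Python as follows. The function name split_unit_boundary is hypothetical, and the sketch expects a single lemmatization, that is, one item already separated from its neighbours in the #lem: line.

```python
def split_unit_boundary(lem):
    """Return (lemmatization, marker, position) for one lemmatization."""
    for marker in ("+.", "-."):
        # Before the lemmatization: marker followed by a space.
        if lem.startswith(marker + " "):
            return lem[len(marker) + 1:], marker, "before"
        # After the lemmatization: marker preceded by a space.
        if lem.endswith(" " + marker):
            return lem[:-(len(marker) + 1)], marker, "after"
    return lem, None, None


print(split_unit_boundary("du[build] +."))
print(split_unit_boundary("du[build] -."))
print(split_unit_boundary("+. du[build]"))
```

Because the sketch only accepts a marker that is set off by a space, a string such as nin[lady]+.*ra is returned unchanged.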
A specific type of dictionary, the Corpus-Based Dictionary XML datatype, is used by CDL annotation to provide control lists of permitted CFs, GWs, Senses and POS information. Documentation of this format is in preparation.
This dictionary is the means of supplying POS information when it is not given explicitly (if given explicitly, the POS in the lemmatization overrides the one given in the dictionary).
The dictionary is also the means of canonicalizing lemmatizations
of the form CF[SENSE] since such pairs can be looked up
and the corresponding unique CF[GW] identified; this is
relevant in the construction of forms files.
Syntax annotation is language-specific, but the principles and notational conventions supported by the ATF processor apply to all languages. In all cases the annotation may come before or after the lemma, but in practice most annotation (except for unit boundaries) comes before the lemma.
Language-specific documents are currently available for the following languages: Akkadian; Aramaic; Greek; Elamite; Sumerian.
The convention described above for unit boundaries is also available for syntax annotation; a plus or minus sign followed by a single non-alphabetic symbol indicates the presence/absence or forcing/suppression of a feature. The relationship between symbols and features is specified in the documentation for specific languages.
Syntax annotation may be placed between a pair of parentheses. An optional plus or minus may occur immediately after the open parenthesis. The content of the parentheses is any text but may not contain parentheses.
In principle, generic syntax annotation can be provided before or after the lemmatization string. Syntax preceding the lemma must begin with an opening parenthesis and be separated from the lemma by a space; that following the lemma must end with a closing parenthesis and must also be separated from the lemma by a space. A syntax string may not contain spaces.
6. mu-na-du3
#lem: (VC-F du[build] ) +.
This feature is not yet supported and will only be supported when a demonstrated need for it is encountered.
[AKK] Akkadian Language Information
[ARC] Aramaic Language Information
[ELX] Elamite Language Information
[GRC] Greek Language Information
[SUX] Sumerian Language Information
Questions about this document may be directed to Steve Tinney (stinney at sas dot upenn dot edu).