Grapheme Description Language for embedding in higher-order document types such as text editions and signlists. A formal definition with RNC schema is given interwoven with the ATF conventions for representing each element in the schema.
The term "grapheme" as used in this document refers to a string of
letters, numbers, modifiers and operators used to specify a Sumero-Akkadian
cuneiform sign by name or to render one of the values of such a sign.
While sign names are often glyph-descriptive (e.g., KA×A
meaning sign A written inside sign KA), this
document does not provide a glyph description language. Rather, we
define a Grapheme Description Language.
GDL is not intended to be generated manually; rather, it is the XML result of processing ASCII Transliteration Format (ATF) with the ATF processor. This document includes implementor notes on ATF interwoven with the technical documentation. Unless you are an implementor, or are pathologically curious (or both), you don't need to read this document! Read the tutorial instead. If you are a developer who is new to GDL and ATF it is recommended that you first read the tutorial, and then this document.
An XSL script to convert from GDL back to ATF can be found in the resources section below. The script does not convert the character set from Unicode to ASCII.
In this section we provide a model for constraining the lexical representation of graphemic atoms. This aspect of grapheme description does not constrain the validity of values within a given signiary; that is handled elsewhere.
Atoms are tightly constrained sequences of characters separated into distinct lowercase and uppercase sets to permit finer-grained constraints.
GDL does not support any of the common ASCII approximations of the various non-ASCII characters used in cuneiform transliteration; GDL uses only the specific Unicode codepoints listed below for the representation of these characters. Details and images of the Unicode characters can be found at http://www.unicode.org/charts.
ATF is restricted to ASCII characters and we define simple equivalents for the characters used in cuneiform transliteration which are not in the ASCII character set. The following table gives the ASCII sequences and the Unicode codepoints to which the ATF processor translates them. Certain conventions are not used in CDLI-strict notation; this is indicated in another column.
| ATF | Character | Unicode | CDLI-Strict?¹ |
|---|---|---|---|
| sz | shin | U+0161 | yes |
| SZ | SHIN | U+0160 | yes |
| s, | sadhe | U+1E63 | yes |
| S, | SADHE | U+1E62 | yes |
| t, | teth | U+1E6D | yes |
| T, | TETH | U+1E6C | yes |
| s' | sin | U+015B | yes |
| S' | SIN | U+015A | yes |
| ' | ALEPH | U+02BE | yes |
| 0-9 | digits | U+2080-U+2089 | yes |
| x² | subscript x | U+208A | yes |
| X² | subscript x | U+208A | yes |
| h, | heth | U+1E2B | no |
| H, | HETH | U+1E2A | no |
| j | eng | U+014B | no |
| J | ENG | U+014A | no |

¹ Characters not in the strict repertoire are not permitted in CDLI archival ATF.

² Lowercase x is permitted only in sign values; in sign names, only uppercase X is permitted as a notation for subscript-x. In sign names, lowercase x is an operator.
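The table can be read as a simple substitution scheme. The following Python sketch is an illustration only and is not the ATF processor's implementation: the function name `atf_to_unicode` is hypothetical, it assumes a simple sign value or sign name whose trailing ASCII digits are an index, and it leaves the subscript-x convention out.

```python
import re

# ATF ASCII sequences and their Unicode equivalents, copied from the table above.
# Two-character sequences come first so that "s'" is consumed before a bare "'".
LETTERS = [
    ("sz", "\u0161"), ("SZ", "\u0160"),
    ("s,", "\u1E63"), ("S,", "\u1E62"),
    ("t,", "\u1E6D"), ("T,", "\u1E6C"),
    ("s'", "\u015B"), ("S'", "\u015A"),
    ("h,", "\u1E2B"), ("H,", "\u1E2A"),
    ("j", "\u014B"), ("J", "\u014A"),
    ("'", "\u02BE"),
]
# ASCII digits 0-9 map to U+2080-U+2089.
SUBSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)


def atf_to_unicode(atom):
    """Convert one simple ATF atom (hypothetical helper, see assumptions above)."""
    # Split the atom into its letters and its trailing index digits.
    base, index = re.fullmatch(r"(.*?)([0-9]*)", atom).groups()
    if not base:
        # Digits only: not a sign value or name, so leave it untouched.
        return atom
    for ascii_seq, char in LETTERS:
        base = base.replace(ascii_seq, char)
    return base + index.translate(SUBSCRIPT_DIGITS)
```

Under these assumptions `atf_to_unicode("sza13")` yields `ša₁₃`. Numerical sign names such as N01 and compound signs would need separate handling.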
Characters are combined into atom specifications by grouping them in classes which are used to place lexical constraints on the atoms.
- Lowercase vowels: a e i u
- Uppercase vowels: A E I U
- Lowercase consonants: b d g h k l m n p q r s u w y z
- Uppercase consonants: B D G H K L M N P Q R S U W Y Z
- Initial subscript digits: U+2081-U+2089 (Unicode subscript 1 through 9)
- Continuing subscript digits: U+2080-U+2089 (Unicode subscript 0 through 9)

This yields the following base character sets and definitions (dollar-variables are expanded by a preprocessor to generate the actual RNC schema):
$lV = [aeiu]
$lC = [\x{2BE}bdegh\x{1E2B}i\x{14B}klmnpqrs\x{161}\x{1E63}\x{15B}t\x{1E6D}uwyz]
$uV = [AEIU]
$uC = [\x{2BE}BDEGH\x{1E2A}I\x{14A}KLMNPQRS\x{160}\x{1E62}\x{15A}T\x{1E6C}UWYZ]
$Si = [\x{2081}\x{2082}\x{2083}\x{2084}\x{2085}\x{2086}\x{2087}\x{2088}\x{2089}]
$Sc = [\x{2080}\x{2081}\x{2082}\x{2083}\x{2084}\x{2085}\x{2086}\x{2087}\x{2088}\x{2089}]
$subscript = (${Si}${Sc}?|\x{208A})?
lV = xsd:string {
pattern = "${lV}${subscript}"
}
lVCv = xsd:string {
pattern = "(${lV}${lC})+${lV}?${subscript}"
}
lCVc = xsd:string {
pattern = "(${lC}${lV})+${lC}?${subscript}"
}
lVCCvc = xsd:string {
pattern = "(${lV}${lC}{1,2})+(${lV}${lC}?)${subscript}"
}
lCVCCvc = xsd:string {
pattern = "(suen|kuara|${lC}(${lV}${lC}{1,2})+(${lV}${lC}?))${subscript}"
}
uV = xsd:string {
pattern = "${uV}${subscript}"
}
uVCv = xsd:string {
pattern = "(${uV}${uC})+${uV}?${subscript}"
}
uCVc = xsd:string {
pattern = "(${uC}${uV})+${uC}?${subscript}"
}
uVCCvc = xsd:string {
pattern = "(${uV}${uC}{1,2})+(${uV}${uC}?)${subscript}"
}
uCVCCvc = xsd:string {
pattern = "${uC}(${uV}${uC}{1,2})+(${uV}${uC}?)${subscript}"
}
namespace g = "http://emegir.info/gdl"
grapheme = v | q | s | n | c | gloss | g | nongrapheme | punct | gsurro
form = attribute form { text }
sb = element g:b { s.model }
vb = element g:b { v.model }
punct = element g:p { p.model }
lang = attribute xml:lang { xsd:language }
gsurro =
element g:surro {
delim? , (s|c|n) , groupgroup
}
# Values
#v.model = "x" | lV | lVCv | lCVc | lVCCvc | lCVCCvc
v.model = text
v = element g:v { form? , g.meta , lang? , (v.model | (vb , mods+)) }
dingir = element g:v { g.meta , lang? , ("d") }
mister = element g:v { g.meta , lang? , ("m") }
# Names
#s.model = "N" | "X" | uV | uVCv | uCVc | uVCCvc | uCVCCvc | lst | num
s.model = text
lst = xsd:string {
pattern="(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*"
}
#[ABCD] is a stop-gap until lateuruk numbers are fixed
num = xsd:string { pattern = "N|N\d+[ABCD]?" }
s = element g:s { form? , g.meta , (s.model | (sb , mods+)) }
# Qualified graphemes
q = element g:q { form? , g.meta , (v|s|c) , (s|c|n) }
# Numbers
n.model = r , (v|s|c|q)?
r = element g:r {
xsd:string {
pattern = "[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]" } }
n = element g:n { form? , g.meta , n.model , mods* }
# Modifiers
mods = modifier | allograph | formvar
modifier = element g:m { xsd:string { pattern = "[a-z]|[0-9]{1,3}" } }
allograph= element g:a { xsd:string { pattern = "[a-wyz0-9]+" } }
formvar = element g:f { xsd:string { pattern = "[a-z0-9]+" } }
# Compounds
c.model = (compound , (o.join , compound)+) | unary | binary | ternary | (g , mods+)
c = element g:c { form? , g.meta , c.model , mods* }
g = element g:g { g.meta , c.model , mods* }
compound = single | unary | binary
single = n | s | c | (g,mods*) | q
unary = o.unary , single
binary = single , o.binary , single
ternary = single , o.binary , single , o.binary , single
o.join = element g:o { attribute g:type { "beside" | "joining" | "reordered" } }
o.unary = element g:o { attribute g:type { "repeated" } , xsd:integer }
o.binary =
element g:o {
attribute g:type {
"containing" | "above" | "crossing" | "opposing"
}
}
# Punctuation
p.model =
attribute g:type { "*"|":"|":'"|':"'|":."|"::"|"|"|"/" } ,
g.meta ,
(v|q|s|n|c)?
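The lowercase atom patterns in the schema above can be exercised as ordinary regular expressions. The Python sketch below transcribes them by hand, writing `\uXXXX` escapes in place of the schema's `\x{...}` notation. The helper name `matching_patterns` is hypothetical, and `re.fullmatch` is used because XSD patterns are implicitly anchored at both ends.

```python
import re

# Character classes transcribed from $lV, $lC, $Si and $Sc in the schema above.
L_V = "[aeiu]"
L_C = "[\u02BEbdegh\u1E2Bi\u014Bklmnpqrs\u0161\u1E63\u015Bt\u1E6Duwyz]"
S_I = "[\u2081-\u2089]"
S_C = "[\u2080-\u2089]"
SUBSCRIPT = "(?:" + S_I + S_C + "?|\u208A)?"

# The five lowercase atom patterns, in the order the schema defines them.
LOWER_PATTERNS = {
    "lV": L_V + SUBSCRIPT,
    "lVCv": "(?:" + L_V + L_C + ")+" + L_V + "?" + SUBSCRIPT,
    "lCVc": "(?:" + L_C + L_V + ")+" + L_C + "?" + SUBSCRIPT,
    "lVCCvc": "(?:" + L_V + L_C + "{1,2})+(?:" + L_V + L_C + "?)" + SUBSCRIPT,
    "lCVCCvc": "(?:suen|kuara|" + L_C + "(?:" + L_V + L_C + "{1,2})+(?:"
               + L_V + L_C + "?))" + SUBSCRIPT,
}


def matching_patterns(value):
    """Return the names of every lowercase atom pattern the value satisfies."""
    return [name for name, pattern in LOWER_PATTERNS.items()
            if re.fullmatch(pattern, value)]
```

For example, `matching_patterns("dug\u2084")` reports `lCVc`, while the ASCII form `dug4` matches nothing because GDL atoms carry Unicode subscripts rather than ASCII digits.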
As a design principle, all of the most common GDL elements have single character names. In order to minimize possible confusion with similar names in other vocabularies, it is recommended that GDL elements always be namespace-qualified. To reinforce this point, the definition of the GDL schema does not use a default namespace.
The examples in this document all assume that the prefix
g is bound to the namespace of the GDL schema.
namespace g = "http://emegir.info/gdl"
grapheme = v | q | s | n | c | gloss | g | nongrapheme | punct | gsurro
form = attribute form { text }
sb = element g:b { s.model }
vb = element g:b { v.model }
punct = element g:p { p.model }
lang = attribute xml:lang { xsd:language }
gsurro =
element g:surro {
delim? , (s|c|n) , groupgroup
}
We call the core alphanumeric portion of a sign an atom. This is a single grapheme component which for the purposes of this grapheme description instance is not susceptible to further sub-description.
All sign values are by definition atoms.
Sign names consist of one or more atoms. In the grapheme
A there is a single atom; in the grapheme
KA×A there are two atoms, KA and
A. In another context, that same grapheme might be named
as NAG; this version of the name contains a single atom,
despite the fact that a sign list might describe the sign as
KA×A. In other words, atomicity in grapheme names is
determined by the naming scheme rather than the underlying
construction of the glyph.
Two simple elements are defined for atoms: g:v, for sign
values, and g:s for sign names.
# Values
#v.model = "x" | lV | lVCv | lCVc | lVCCvc | lCVCCvc
v.model = text
v = element g:v { form? , g.meta , lang? , (v.model | (vb , mods+)) }
dingir = element g:v { g.meta , lang? , ("d") }
mister = element g:v { g.meta , lang? , ("m") }
In ATF a sign value is a sequence of lowercase letters (and possibly some non-letters used to indicate non-ASCII characters) followed by optional ASCII digits:
a a2 babbar dug4 s,e2 sza13
# Names
#s.model = "N" | "X" | uV | uVCv | uCVc | uVCCvc | uCVCCvc | lst | num
s.model = text
lst = xsd:string {
pattern="(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*"
}
#[ABCD] is a stop-gap until lateuruk numbers are fixed
num = xsd:string { pattern = "N|N\d+[ABCD]?" }
s = element g:s { form? , g.meta , (s.model | (sb , mods+)) }
In ATF a simple sign name is a sequence of uppercase letters (and possibly some non-letters used to indicate non-ASCII characters) followed by optional ASCII digits:
A BA SZA3 GILIM
A sign name in a transliteration conventionally means either that the sign is clear but its reading is uncertain, or that the sign is being used as a logogram. ATF has some simple rules to mark the difference between these two:
- Use a dollar sign ($) before a sign name to indicate that its reading is uncertain.
- Use a tilde (~) before a sign name to indicate that it is a logogram.

As a result of these rules, $AN always means "the AN-sign is there but I am not sure which reading to choose" and ~AN always means "the AN sign is a logogram here". The meaning of a bare AN can be configured to be either one. By default, in Sumerian language context the meaning of AN is equivalent to $AN. In all other language contexts, the meaning of AN is equivalent to ~AN. This means that typing logograms in Akkadian is as easy as:
sza AN-e
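The default rule for unmarked sign names can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `sign_name_reading` and the labels "uncertain" and "logogram" are hypothetical, any language code beginning with `sux` is treated as a Sumerian context, and the configurable override mentioned above is not modelled.

```python
def sign_name_reading(token, lang):
    """Return (sign name, interpretation) for a sign name in a language context."""
    if token.startswith("$"):
        # Explicit marker: the sign is clear but its reading is uncertain.
        return token[1:], "uncertain"
    if token.startswith("~"):
        # Explicit marker: the sign is used as a logogram.
        return token[1:], "logogram"
    # Unmarked: Sumerian contexts default to $, all other languages to ~.
    is_sumerian = lang.split("-")[0] == "sux"
    return token, "uncertain" if is_sumerian else "logogram"
```

With this sketch, `sign_name_reading("AN", "akk")` gives `("AN", "logogram")`, which is why the Akkadian example above needs no marker.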
Two special classes of sign name are signlists and numerical
sign names. Numerical sign names match the pattern
N<DIGITS>. Signlist names consist of an uppercase
alphabetic prefix and an ASCII digit suffix; the prefix is the name of
the sign list and the suffix is the number of the sign in that list.
Prefixes fall into one of two groups. Generic signlist prefixes
consist of any one or two uppercase letters followed by
SL; hence, CDSL, PSL,
PCSL are all valid signlist prefixes. The second group
is the built-in set of historic sign lists.
| Name | Bibliography |
|---|---|
| ABZ | R. Borger, Assyrisch-babylonische Zeichenliste (AOAT 33; Neukirchen-Vluyn 1978) |
| BAU | E. Burrows, Archaic Texts (UET 2; London 1935) |
| HZL | C. Ruster and E. Neu, Hethitisches Zeichenlexikon (Harrassowitz Verlag 1989) |
| KWU | N. Schneider, Die Keilschriftzeichen der Wirtschaftsurkunden von Ur III (Rome 1935) |
| LAK | A. Deimel, Liste der archaischen Keilschriftzeichen (WVDOG 40; Berlin 1922) |
| MEA | R. Labat, Manuel d'épigraphie akkadienne (6th ed. Paris 1988) |
| MZL | R. Borger, Mesopotamisches Zeichenlexikon (AOAT 305; Ugarit-Verlag 2003) |
| REC | F. Thureau-Dangin, Recherches sur l'origine de l'écriture cunéiforme (Paris 1898) |
| RSP | Y. Rosengarten, Répertoire commenté des signes présargoniques sumériens de Lagash (Paris 1967) |
| ZATU | M. Green and H. J. Nissen, Zeichenliste der Archaischen Texte aus Uruk (ATU 2; Berlin 1987) |
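The two special classes of sign name can be recognized with the `lst` and `num` patterns from the schema. The Python sketch below copies those patterns verbatim; the function name `classify_sign_name` is hypothetical, and `re.fullmatch` stands in for the implicit anchoring of XSD patterns.

```python
import re

# Patterns copied from the lst and num definitions in the schema.
SIGNLIST = re.compile(
    r"(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*")
NUMERIC = re.compile(r"N|N\d+[ABCD]?")


def classify_sign_name(name):
    """Return "num", "lst", or None for any other kind of sign name."""
    if NUMERIC.fullmatch(name):
        return "num"
    if SIGNLIST.fullmatch(name):
        return "lst"
    return None
```

Note that the generic prefix is written `..?SL` in the schema, so the pattern itself accepts any one or two characters before SL, not only uppercase letters.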
Qualified graphemes consist of a sign value followed by a sign name
in parentheses, e.g., pu(BU). (In normalized text the
superficially similar construct is used to indicate the logograms used
for the normalized form, e.g., %akk/n b=elu(EN).)
# Qualified graphemes
q = element g:q { form? , g.meta , (v|s|c) , (s|c|n) }
Signs which have the special subscript x
must be qualified in ATF by placing the sign name in parentheses
immediately after the sign value:
bax(PI) ZAX(LAK384)
Note: in sign values, use lowercase
x; in sign names, use uppercase X.
Numerical graphemes have a special form. Each numerical grapheme consists of at least two parts: the repetition count and the sign value, sign name or compound sign. A special case is made for numerical graphemes by allowing them to have modifiers even if the graphemic base is a sign value.
The repetition count must have one of the following forms:
- digits
- n
- n+digits

While it would in principle be possible to constrain the value
space of GRAPHEME in the schema, we do not do so; instead,
as with non-numerical graphemes, we constrain the lexical space and
require the values of numerical graphemes to be validated elsewhere.
This allows the schema to be open-ended with respect to the
identification of new numerical systems.
# Numbers
n.model = r , (v|s|c|q)?
r = element g:r {
xsd:string {
pattern = "[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]" } }
n = element g:n { form? , g.meta , n.model , mods* }
In ATF a number sign conforms to the pattern:
REPETITION '(' GRAPHEME ')'
where REPETITION is either a number giving the
repetition factor or the letter n or the combination
n+DIGITS (in sign names or compound signs use
N instead of n). The GRAPHEME is a sign value or
sign name, including compound signs.
The following examples illustrate a few basic ATF numerical forms:
1(N01) 4(ban2) 1(asz@c) n(gesz2) n+1(asz)
The notation n(asz) means: some quantity in the
asz system which is not determinable from the traces on
the tablet. The notation n+1(asz) (where '1' could be any
number) means: a quantity in the asz system which is
damaged or lost and which is at least 1 but may be more. ATF does not
use the notation x+1(asz).
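The REPETITION '(' GRAPHEME ')' shape can be checked mechanically. The Python sketch below reuses the pattern of the `g:r` element for the repetition count and is only a lexical illustration: the function name `parse_number` is hypothetical, the GRAPHEME part is not validated, and trailing modifiers are limited to the `g:m` pattern.

```python
import re

# Repetition count, copied from the pattern of g:r in the schema.
REPETITION = re.compile(r"[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]")
# REPETITION '(' GRAPHEME ')' followed by optional @-modifiers.
NUMBER = re.compile(
    r"(?P<rep>[^()]+)\((?P<grapheme>.+)\)(?P<mods>(?:@(?:[0-9]{1,3}|[a-z]))*)")


def parse_number(token):
    """Return (repetition, grapheme, modifiers), or None if not a number."""
    m = NUMBER.fullmatch(token)
    if m is None or REPETITION.fullmatch(m.group("rep")) is None:
        return None
    return m.group("rep"), m.group("grapheme"), m.group("mods")
```

Because the repetition count is checked first, a qualified grapheme such as bax(PI) and the unsupported notation x+1(asz) are both rejected by this sketch.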
Sign names and numerical sign value atoms may be described by
reference to modifications of the base sign, as summarized in the
table below. The lexical representation of modifiers is restricted to
either a single lower case letter or a sequence of one, two or three ASCII
digits. The semantics of these modifiers is indicated in the table,
but is irrelevant from the point of view of the schema. A single GDL
element, g:m, contains the modifier.
Modifiers may not follow a compound sign's terminating pipe character; if an entire compound is to be modified, the compound's content must be grouped and the modifiers suffixed between the closing parenthesis and the closing pipe.
# Modifiers
mods = modifier | allograph | formvar
modifier = element g:m { xsd:string { pattern = "[a-z]|[0-9]{1,3}" } }
In ATF the at-sign (@) precedes each modifier;
multiple modifiers may be given in which case each modifier requires
its own at-sign. The entire sequence of modifiers (and allographs,
described below) belongs to the immediately preceding sign or group.
Sign names and values with modifiers and/or allographs following them
should not be treated as compounds.
| Modifier | ATF | Example | Sign |
|---|---|---|---|
| curved | @c | ASZ@c | |
| flat | @f | 1(N01@f) | |
| gunu (4 extra wedges) | @g | DU@g | |
| sheshig (added še-sign) | @s | DU@s | |
| tenu (slanting) | @t | GAN2@t | |
| nutillu (unfinished) | @n | SAG@n | |
| zidatenu (slanting right) | @z | ASZ@z | |
| kabatenu (slanting left) | @k | ASZ@k | |
| vertically reflected | @r | U@r | |
| horizontally reflected | @h | N07~a@h | |
| rotated | @<DIGITS> | NAGA@180 | |
| variant | @v | 4(ban2)@v | |
Modifiers on numerical graphemes may go inside or outside the closing parenthesis depending on the naming schema for values and sign names used by the style manual or sign list for an individual project.
It is sometimes desirable to distinguish between grapheme instances
which have otherwise been considered the same sign, or which actually
are the same sign, for semantic or glyph-analytic reasons. This is
expressed in GDL by the g:a element whose content is a
sequence of one or more lowercase letters, excluding
x, and ASCII digits. Sign list creators are free to assign
whatever meanings they like to any combinations of these characters;
in PCSL, for example, sequences such as a1a versus
a1b and a2a versus a2b are used
to implement multi-level distinctions between variants of a sign. An
allograph may follow the closing parenthesis of a group within a
compound sign, but may not follow the final vertical bar of the
compound.
The reason for the exclusion of x in the allowable set
of lowercase letters in an allograph is that allowing it would
introduce an ambiguity at the ATF level between x in
allographs and x as a compound operator.
allograph= element g:a { xsd:string { pattern = "[a-wyz0-9]+" } }
In ATF an allograph, or systemic sign variant, is introduced by the
tilde-prefix (~); the sequence of characters following
the tilde is restricted to ASCII digits and lowercase letters,
except for x.
|EN~a| |EN~b| |GA2~a1| |GA2~a2| |GESZTU~axSZE~a@t|
The use of x as an operator in examples like the last
one in the line above is the reason for excluding x from
the characters allowed in allograph sequences.
The special allograph ~v is used instead of
~x to indicate that the form is some variant of the sign
in question but the specific variant is not identified.
The special allograph ~t is used to indicate tokens,
e.g., 1(N01~t).
Note that the allograph mechanism is not the way that unusual sign
forms are notated in ATF; for this the normal exclamation mark
(!) is used. The allograph mechanism is provided to
support systematic subdivision of sign-forms relative to extant sign
lists or sign name descriptions.
formvar = element g:f { xsd:string { pattern = "[a-z0-9]+" } }
Form variant is the GDL name for a minor difference in the construction of a sign which may be of interest when analysing the handwriting in a corpus, but which is not important enough to be displayed or included in the version of the writing used for linguistic analysis.
Form variants are preceded by the backslash character
(\) and consist of lowercase letters and/or digits.
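The three qualifier notations (@ for modifiers, ~ for allographs, backslash for form variants) can be separated from a simple sign with the schema patterns for `g:m`, `g:a` and `g:f`. The Python sketch below is an illustration rather than the ATF processor's logic: the name `split_qualifiers` is hypothetical, and it handles a single sign name or value, not compounds or the @ opposing operator.

```python
import re

# One qualifier: @modifier (g:m), ~allograph (g:a) or \formvar (g:f),
# using the content patterns given in the schema.
QUALIFIER = re.compile(
    r"@(?P<m>[0-9]{1,3}|[a-z])|~(?P<a>[a-wyz0-9]+)|\\(?P<f>[a-z0-9]+)")
ELEMENTS = {"m": "g:m", "a": "g:a", "f": "g:f"}


def split_qualifiers(token):
    """Split a simple sign into (base, [(GDL element name, content), ...])."""
    start = next((i for i, ch in enumerate(token) if ch in "@~\\"), len(token))
    base, pos, found = token[:start], start, []
    while pos < len(token):
        m = QUALIFIER.match(token, pos)
        if m is None:
            raise ValueError("bad qualifier at offset %d in %r" % (pos, token))
        found.append((ELEMENTS[m.lastgroup], m.group(m.lastgroup)))
        pos = m.end()
    return base, found
```

For instance, `split_qualifiers("N07~a@h")` gives the base N07 with one `g:a` and one `g:m`, while `EN~x` is rejected because x is excluded from allographs.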
Compound graphemes are combinations of sign names and operators; the definition is recursive meaning that compound grapheme atoms may be grouped and the group treated as a compound in its own right. Atoms and compounds may both have associated modifier and/or allograph qualifications. We call a single combination of a sign or compound sign and its qualifiers a constituent.
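The possible operator types, as summarized in the ATF table later in this section, are: beside, joining, containing, above, crossing, opposing and repeated.

The beside and joining operators are in
fact joiners which mark boundaries; any number of joiner/compound
pairs may be siblings.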
The containing, above, crossing and
opposing operators all have binary scope: a compound
which contains an operator is constrained to having exactly two
compound children, one before and one after the operator.
The repeated operator is a unary prefix with the
content of the operator giving the repetition count. Compounds
containing this operator may have only one compound child.
The rotated operator is a unary postfix with the
content of the operator giving the number of degrees the sign is
rotated in a clockwise direction. Compounds containing this operator
may have only one compound child.
# Compounds
c.model = (compound , (o.join , compound)+) | unary | binary | ternary | (g , mods+)
c = element g:c { form? , g.meta , c.model , mods* }
g = element g:g { g.meta , c.model , mods* }
compound = single | unary | binary
single = n | s | c | (g,mods*) | q
unary = o.unary , single
binary = single , o.binary , single
ternary = single , o.binary , single , o.binary , single
o.join = element g:o { attribute g:type { "beside" | "joining" | "reordered" } }
o.unary = element g:o { attribute g:type { "repeated" } , xsd:integer }
o.binary =
element g:o {
attribute g:type {
"containing" | "above" | "crossing" | "opposing"
}
}
The difference between a simple sign and a compound sign is that
a compound sign is a sequence of sign names which contains at least
one operator, i.e., a character which represents a relationship
between multiple graphemes. In ATF the set of characters used for
operators is: x % @ & . : +.
In ATF compound graphemes are enclosed at the outer level in
vertical bars ("pipes", |...|):
|KAxA|
Signs are frequently modified or operated on as a group; parentheses are used to group multi-part constituents:
|GA2x(ME.EN)| |(GI&GI)xSZE3|
Note that modifiers and allographs must not be placed after the closing pipe; instead, they must be put inside the pipes, adding grouping characters if necessary:
|GA2~axEN| |GA2xEN~a| |(GA2xEN)~a|
The examples above all mean different things. The first,
|GA2~axEN|, means: "the a-allograph of the sign GA2
containing sign EN". The second, |GA2xEN~a|, means: "GA2
containing the a-allograph of sign EN". The third,
|(GA2xEN)~a|, means: "the a-allograph of the group
consisting of sign GA2 containing sign EN". In example three the bad
form *|GA2xEN|~a would result in a parse error.
Each of the compound operations has its own ATF notation as summarized in the table below:
| GDL | ATF | Example | Sign |
|---|---|---|---|
| beside | . | |DU.DU| | |
| joining | + | |LAGAB+LAGAB| | |
| containing | x | |GA2xAN| | |
| containing/group | x | |GA2x(ME.EN)| | |
| above | & | |DU&DU| | |
| crossing | % | |GI%GI| | |
| opposing | @ | |LU2@LU2| | |
| repeated | 3x | |3xAN| | |
| repeated | 4x | |4xLU2| | |
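The operator table can be turned into a small tokenizer for compound signs. The Python sketch below is not the ATF parser: the name `tokenize_compound` and the token labels are hypothetical, sign names are assumed to be plain uppercase letters and digits, the colon operator (which has no row in the table) is not handled, and no tree is built from the tokens.

```python
import re

# ATF operator characters and their GDL g:type values, from the table above.
OPERATORS = {".": "beside", "+": "joining", "x": "containing",
             "&": "above", "%": "crossing", "@": "opposing"}

# Zero or more @modifiers and ~allographs attached to a sign or group.
QUAL = r"(?:@(?:[0-9]{1,3}|[a-z])|~[a-wyz0-9]+)*"
TOKEN = re.compile("".join([
    r"(?P<repeat>[0-9]+x)",                   # repetition prefix such as 3x
    r"|(?P<sign>[A-Z][A-Z0-9]*", QUAL, r")",  # sign name plus qualifiers
    r"|(?P<open>\()",                         # start of a group
    r"|(?P<close>\)", QUAL, r")",             # end of a group plus qualifiers
    r"|(?P<op>[.+x&%@])",                     # joiners and binary operators
]))


def tokenize_compound(atf):
    """Split a pipe-delimited compound sign into (kind, value) tokens."""
    if len(atf) < 3 or not (atf.startswith("|") and atf.endswith("|")):
        raise ValueError("compound must be enclosed in pipes: %r" % atf)
    body, pos, tokens = atf[1:-1], 0, []
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d" % pos)
        if m.lastgroup == "op":
            tokens.append(("op", OPERATORS[m.group()]))
        elif m.lastgroup == "repeat":
            tokens.append(("repeated", int(m.group()[:-1])))
        else:
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

An at-sign followed by a lowercase letter or digits is read as a modifier, and an at-sign followed by an uppercase sign name is read as the opposing operator. The bad form |GA2xEN|~a is rejected because nothing may follow the closing pipe.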
Several types of cuneiform punctuation are supported in ATF and all
of them must be preceded and followed by a space (in the case of
* and / the punctuation may be immediately
followed by a sign name in parentheses and then the following space).
The recognized punctuation codes are the values of g:type enumerated in the schema below: * : :' :" :. :: | and /.

The code : is the vertical "colon" sign often found in commentaries.
N.B.: If the single colon occurs within a word it
must be transliterated with the grapheme name form P₂.

A punctuation code followed by a sign name in parentheses is written as in /(P2).

The punctuation signs may also be transliterated using the following names:
P1; P2; P3; P4; MZL592~b (as :').
# Punctuation
p.model =
attribute g:type { "*"|":"|":'"|':"'|":."|"::"|"|"|"/" } ,
g.meta ,
(v|q|s|n|c)?
namespace g = "http://emegir.info/gdl"
g.meta =
break? , status.flags? , status.spans? ,
paleography.attr? , linguistic.attr? , proximity.attr? ,
opener? , closer? , hsqb_o?, hsqb_c? , emhyph? ,
varnum? , utf8? , delim? , notemark? ,
attribute xml:id { xsd:ID }? ,
breakStart? , breakEnd? ,
damageStart? , damageEnd? ,
surroStart? , surroEnd? ,
statusStart? , statusEnd? ,
accented?
accented = attribute g:accented { text }
breakStart = attribute g:breakStart { "1" }
breakEnd = attribute g:breakEnd { xsd:IDREF }
damageStart = attribute g:damageStart { "1" }
damageEnd = attribute g:damageEnd { xsd:IDREF }
surroStart = attribute g:surroStart { "1" }
surroEnd = attribute g:surroEnd { xsd:IDREF }
statusStart = attribute g:statusStart { "1" }
statusEnd = attribute g:statusEnd { xsd:IDREF }
notemark = attribute notemark { text }
break = attribute g:break { "damaged" | "missing" }
opener = attribute g:o { text }
closer = attribute g:c { text }
hsqb_o = attribute g:ho { "1" }
hsqb_c = attribute g:hc { "1" }
emhyph = attribute g:em { "1" }
utf8 = attribute g:utf8 { text }
delim = attribute g:delim { text }
varnum = (
attribute g:varo { text }? ,
attribute g:vari { text }? ,
attribute g:varc { text }?
)
status.flags =
attribute g:collated { xsd:boolean } ? ,
attribute g:queried { xsd:boolean } ? ,
attribute g:remarked { xsd:boolean } ?
gloss = det | glo
pos = attribute g:pos { "pre" | "post" | "free" }
det = element g:d { pos , dtyp , delim? , emhyph? , (dingir | mister | word.content*)}
dtyp= attribute g:role { "phonetic" | "semantic" }
glo = element g:gloss { attribute g:type { "lang" | "text" } , delim? , pos , words }
status.spans =
attribute g:status {
"ok" | "erased" | "excised" | "implied" | "maybe" | "supplied"
}
paleography.attr =
attribute g:script { xsd:NCName }
linguistic.attr =
attribute xml:lang { xsd:language } ? ,
# attribute g:rws { "emegir" | "emesal" | "udgalnun" }? ,
(attribute g:role { "sign" | "ideo" | "num" | "syll" }
| (attribute g:role { "logo" } ,
attribute g:logolang { xsd:language }))
proximity.attr =
attribute g:prox { xsd:integer }
nongrapheme =
element g:x {
( attribute g:type { "newline" | "user" }
| ( attribute g:type { "ellipsis" } , status.spans ,
opener? , closer? , break? , notemark?)),
delim? , text? , varnum? ,
attribute xml:id { xsd:ID }? ,
breakStart? , breakEnd? ,
damageStart? , damageEnd? ,
surroStart? , surroEnd? ,
statusStart? , statusEnd?
}
This module defines attributes which are essentially graphemic metadata supplied by the editor of the text. They fall into several groups: properties of the grapheme imputed to derive from the scribe; properties assigned by the editor; physical preservation properties; paleographic properties; and linguistic properties. We describe these principally in the form of the tutorial aimed at end-users and allow the sequence of definitions in the schema to follow the tutorial.
namespace g = "http://emegir.info/gdl"
g.meta =
break? , status.flags? , status.spans? ,
paleography.attr? , linguistic.attr? , proximity.attr? ,
opener? , closer? , hsqb_o?, hsqb_c? , emhyph? ,
varnum? , utf8? , delim? , notemark? ,
attribute xml:id { xsd:ID }? ,
breakStart? , breakEnd? ,
damageStart? , damageEnd? ,
surroStart? , surroEnd? ,
statusStart? , statusEnd? ,
accented?
accented = attribute g:accented { text }
breakStart = attribute g:breakStart { "1" }
breakEnd = attribute g:breakEnd { xsd:IDREF }
damageStart = attribute g:damageStart { "1" }
damageEnd = attribute g:damageEnd { xsd:IDREF }
surroStart = attribute g:surroStart { "1" }
surroEnd = attribute g:surroEnd { xsd:IDREF }
statusStart = attribute g:statusStart { "1" }
statusEnd = attribute g:statusEnd { xsd:IDREF }
notemark = attribute notemark { text }
In ATF there are several general ways of specifying information about graphemes:
- Flags: single characters placed after a grapheme: ! ? * #. Flags may appear only immediately after a grapheme (after the parenthetic part of a qualified grapheme), and are not permitted within compound signs. Flags may be given after the closing pipe of a compound sign.
- Brackets: (...) [...] {...} {(...)} <...> <<...>> <(...)>. Brackets are not permitted within the body of graphemes or within compound signs.
- Shifts: a percent sign (%) and a label which alter the current value of a property. The value remains current until another shifter for the same property is encountered; or until the closing of the nearest enclosing bracket; or until the end of the line. Sample shifts include: %s %akk %1. A table of all of the shifts, properties and values is given later on in this tutorial. Shifts are not permitted within compound signs. Shifts must always be followed by one or more spaces.

Partially broken signs are flagged by putting a hash
(#) after the grapheme. Signs which are completely
missing from the tablet are enclosed in square brackets.
ba# [a]-ba mudx(|ZI&ZI.A|)#
break = attribute g:break { "damaged" | "missing" }
opener = attribute g:o { text }
closer = attribute g:c { text }
hsqb_o = attribute g:ho { "1" }
hsqb_c = attribute g:hc { "1" }
emhyph = attribute g:em { "1" }
utf8 = attribute g:utf8 { text }
delim = attribute g:delim { text }
varnum = (
attribute g:varo { text }? ,
attribute g:vari { text }? ,
attribute g:varc { text }?
)
Collation, uncertainty and remarkability are flagged by
*, ? and ! respectively. If a
grapheme is flagged as remarkable it may indicate a correction or an
unusual form. Corrections are often followed by the actual sign in
parentheses, and this convention is supported but not required in ATF
transliterations.
a* ki? szum! ki!?*(DI)
status.flags =
attribute g:collated { xsd:boolean } ? ,
attribute g:queried { xsd:boolean } ? ,
attribute g:remarked { xsd:boolean } ?
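The mapping from trailing flags to the attributes defined above can be sketched as follows. This Python fragment is illustrative only: the name `split_flags` is hypothetical, the attribute value "true" is an assumption (any xsd:boolean lexical form would satisfy the schema), and a parenthesized sign is treated as a correction only when it directly follows a flag, so that qualified graphemes such as bax(PI) are left intact.

```python
import re

# Flag characters and the status attributes they set (see status.flags).
FLAG_ATTRS = {"*": "g:collated", "?": "g:queried", "!": "g:remarked"}
# Grapheme, then flags, then an optional (SIGN) that directly follows a flag.
FLAGGED = re.compile(
    r"(?P<grapheme>.+?)(?P<flags>[!?*#]*)"
    r"(?P<correction>(?<=[!?*#])\([^()]*\))?")


def split_flags(token):
    """Return (grapheme, attribute dict, correction sign or None)."""
    m = FLAGGED.fullmatch(token)
    if m is None:
        raise ValueError("not a flagged grapheme: %r" % token)
    flags = m.group("flags")
    attrs = {FLAG_ATTRS[f]: "true" for f in flags if f in FLAG_ATTRS}
    if "#" in flags:
        # The hash marks a partially broken sign (break = "damaged").
        attrs["g:break"] = "damaged"
    correction = m.group("correction")
    return m.group("grapheme"), attrs, correction[1:-1] if correction else None
```

Applied to the last example above, `split_flags("ki!?*(DI)")` returns the grapheme ki, all three status flags, and DI as the correction.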
ATF divides glosses into three types:

- Determinatives and phonetic glosses are enclosed in single braces, {...}; semantic determinatives require no special marking, but phonetic glosses and determinatives should be indicated by adding a plus sign (+) immediately after the opening brace, e.g., AN{+e}. Multiple separate determinatives must be enclosed in their own brackets, but a single determinative may consist of more than one sign (as is the case with Early Dynastic pronunciation glosses).
- A second type of gloss is enclosed in double braces, {{...}}.
- A third type of gloss is enclosed in {(...)}.

Glosses must have a space or hyphen on one side or the other. They may have spaces on both sides. Glosses may not touch directly both the preceding and following graphemes; nor may they have hyphens at both ends.
{d}utu larsa{ki} {+u3-mu2}u2-mu11 AN{+e}
du3-am3{{mu-un-<(du3)>}}
{(1(u))} {(%a he-pi2 esz-szu)}
The ATF processor sets type=text when
the gloss is enclosed in {(...)} and
type=lang when the gloss is enclosed in
{{...}}.
The ATF processor sets pos=pre when
the gloss has no space or boundary following it; pos=post
when the gloss has no space or boundary preceding it; and
pos=free when the gloss has spaces on both sides.
gloss = det | glo
pos = attribute g:pos { "pre" | "post" | "free" }
det = element g:d { pos , dtyp , delim? , emhyph? , (dingir | mister | word.content*)}
dtyp= attribute g:role { "phonetic" | "semantic" }
glo = element g:gloss { attribute g:type { "lang" | "text" } , delim? , pos , words }
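The rules for gloss type and position can be sketched in Python. This is a hedged illustration, not the ATF processor: the function name `describe_gloss` and the returned dictionary layout are hypothetical, it assumes that {{...}} yields type lang and {(...)} yields type text, it treats only space and hyphen as boundaries, and it examines the first gloss found in a single word.

```python
import re

# The three gloss notations; the two-character openers are tried first.
GLOSS = re.compile(r"\{\{.*?\}\}|\{\(.*?\)\}|\{.*?\}")
BOUNDARY = " -"


def describe_gloss(word):
    """Describe the first gloss in a word as a dict of element and attributes."""
    m = GLOSS.search(word)
    if m is None:
        raise ValueError("no gloss in %r" % word)
    text = m.group()
    touches_prev = m.start() > 0 and word[m.start() - 1] not in BOUNDARY
    touches_next = m.end() < len(word) and word[m.end()] not in BOUNDARY
    if touches_prev and touches_next:
        raise ValueError("gloss touches graphemes on both sides: %r" % word)
    # pre: attached to what follows; post: attached to what precedes.
    pos = "pre" if touches_next else "post" if touches_prev else "free"
    if text.startswith("{{"):
        return {"element": "g:gloss", "g:type": "lang", "g:pos": pos}
    if text.startswith("{("):
        return {"element": "g:gloss", "g:type": "text", "g:pos": pos}
    role = "phonetic" if text[1] == "+" else "semantic"
    return {"element": "g:d", "g:role": role, "g:pos": pos}
```

For the examples above, {d}utu is described as a semantic determinative with pos pre, and AN{+e} as a phonetic one with pos post.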
The status of one or more graphemes' presence/absence often requires notation. The following bracketings support the common practices in Assyriological transliteration:
[x (x) x]
mu-un-<pa3>-da
mu-un-<<an>>-pa3-da
1) [AFTER BOUNDARY] The graphemes
are implied because the scribe has left a blank space on the tablet;
common in liturgies and some types of administrative texts.
BOUNDARY can be space, hyphen, etc.
2) [AFTER GRAPHEME] The graphemes
are the text meant by a surrogate grapheme such as MIN;
common in lexical texts.
1. {d}suen he2-me-en
2. {d}nanna <(he2-me-en)>
1. a = %a mu-u2
2. illu = %a MIN<(mu-u2)>
Surrogates are defined in the XTF2 schema because
their content model is l.inner.
Note: in all of these cases except the last there must be a space or hyphen before the opening bracket and after the closing bracket.
status.spans =
attribute g:status {
"ok" | "erased" | "excised" | "implied" | "maybe" | "supplied"
}
Programming note: Graphemic
elements which can carry graphemic content (i.e., g:v,
g:s, g:c, g:p,
g:q, g:n, and g:x where the
type is ellipsis) always have a g:status attribute. A processor
can compare this attribute with that of the preceding or following
grapheme to determine when to open or close a bracket. Graphemes
which have no explicit presence-status have
g:status="ok".
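One way to use the g:status attribute as the programming note suggests is to group consecutive graphemes that share a status, so that a bracket is opened where a run begins and closed where it ends. The Python sketch below shows only the grouping step: the name `status_runs` and the (text, status) pairs are hypothetical, the statuses in the usage note are chosen purely for illustration, and the choice of bracket characters for each status is left to the caller.

```python
from itertools import groupby


def status_runs(graphemes):
    """Group consecutive (text, status) pairs into (status, [texts]) runs."""
    return [(status, [text for text, _ in run])
            for status, run in groupby(graphemes, key=lambda g: g[1])]
```

Given `[("mu", "ok"), ("un", "ok"), ("pa3", "supplied"), ("da", "ok")]`, the function returns three runs, and a renderer would bracket only the middle one.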
A simple mechanism for identifying distinct scripts on a single
document is provided by the percent-digit commands which consist of a
percent sign followed by a single digit: %0 %1 %2 %3 %4 %5 %6 %7
%8 %9. The characteristics of the scripts identified by
numbers can be specified in the protocols section at the start of an
ATF file.
By default, the normal sized, normal form script is
%0; this only needs to be specified rarely. By default,
the smaller script used for glosses is %1. Two other
default scripts are %2 = Assyrian and %3 =
Babylonian as a contrastive pair in neo-Assyrian documents.
This example shows how to enter a gloss which is in smaller script:
mu-un-szum2{%1 szu}
paleography.attr =
attribute g:script { xsd:NCName }
Language attributes specify the language which the grapheme is being used to write (normally specified at a higher level: word, line or text), the role of the grapheme, and, in the case of logograms, the source language from which the logogram derives.
Languages can be shifted in several ways. The simplest way is to use explicit shifters:
an = %a sza-me-e
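A shift done in this way is terminated in one of three ways: by another shift for the same property, by the closing of the nearest enclosing bracket, or by the end of the line.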
A second convention which is useful in
mixed-language contexts is to use matched pairs of underscore
characters (_) to shift in and out of the secondary
language (the ATF processor's definitions of primary/secondary
language relationships are usually enough, but you can configure them
with ATF protocols if necessary; when the current language is
Akkadian, the secondary is Sumerian and vice versa):
%a im-me-ra-am _szu ba-an-ti_
Normalized text, e.g., Akkadian transcription, can be indicated by
use of the %n shift, with an analogous %g
indicating graphemic text (though this is in practice rarely
necessary). If %n or %g follow a language
shift they can be separated using /: %akk/n.
Single-letter shorthands are defined for the common requirements of Sumero-Akkadian cuneiform, but by definition all two-letter and three-letter shifters may be language codes. In order to implement effective validation, it is necessary to register in the protocols section the use of language codes which are not in the table below. Several lists of the standardized language codes exist on the web, but GDL follows the ISO 639-3 registry at SIL.
We utilize codes in the range qaa to qtz to handle languages that are either not covered by ISO 639 or are not sufficiently determined to assign to one of the known languages.
| ATF | ISO 639 | Language |
|---|---|---|
| %a | akk | Akkadian |
| %e | sux-es | Emesal Sumerian |
| %g | | Graphemic text |
| %h | hit | Hittite |
| %n | | Normalized text |
| %s | sux | Emegir Sumerian |
| %u | sux-ugn | UD.GAL.NUN Sumerian |
| %x | qcu | Undetermined language in cuneiform |
| Code | Language |
|---|---|
| %akk | Akkadian |
| %arc | Aramaic |
| %elx | Elamite |
| %grc | Ancient Greek |
| %hit | Hittite |
| %peo | Old Persian |
| %qam | Amorite |
| %xcr | Carian |
| %qcu | Undetermined cuneiform |
| %qeb | Eblaite |
| %xhu | Hurrian |
| %xlc | Lycian |
| %xld | Lydian |
| %xlu | Cuneiform Luvian |
| %hlu | Hieroglyphic Luvian |
| %imy | Milyan (Lycian B) |
| %qpc | Proto-Cuneiform |
| %qpe | Proto-Elamite |
| %plq | Palaic |
| %xur | Urartian |
| %sux | Sumerian |
| %uga | Ugaritic |
The ATF processor maps single letter shifters to the explicit
lang/rws values which are expected in the schema. The
logolang attribute is set from the ATF processor's
secondary language when processing logograms.
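The mapping and registration rules above can be sketched in Python as follows. This is not the ATF processor's implementation: the function names are hypothetical, the single-letter values are simply the ISO 639 column of the first table, and the set of known codes is the second table. The private-use check assumes the qaa to qtz range mentioned above.

```python
# Single-letter shifters and their codes, copied from the first table.
SINGLE_LETTER = {
    "a": "akk", "e": "sux-es", "h": "hit",
    "s": "sux", "u": "sux-ugn", "x": "qcu",
}

# %g and %n select graphemic or normalized text rather than a language.
MODES = {"g", "n"}

# Three-letter codes listed in the second table.
KNOWN_CODES = {
    "akk", "arc", "elx", "grc", "hit", "peo", "qam", "xcr", "qcu", "qeb", "xhu",
    "xlc", "xld", "xlu", "hlu", "imy", "qpc", "qpe", "plq", "xur", "sux", "uga",
}

def resolve_shift(shift):
    """Split '%a', '%akk', '%n' or '%akk/n' into (language, mode)."""
    if not shift.startswith("%"):
        raise ValueError("not a shifter: " + shift)
    lang_part, _, mode = shift[1:].partition("/")
    if lang_part in MODES and not mode:
        return (None, lang_part)          # bare %g or %n
    lang = SINGLE_LETTER.get(lang_part, lang_part)
    return (lang, mode or None)

def is_private_use(code):
    """True for codes in the range qaa..qtz."""
    return (len(code) == 3 and code[0] == "q"
            and "a" <= code[1] <= "t" and "a" <= code[2] <= "z")

def needs_registration(code):
    """True if the code is not in the table and must be declared in the protocols."""
    return code not in KNOWN_CODES
```

For instance, resolve_shift("%akk/n") separates the language akk from the normalized-text mode n, and needs_registration flags any code that is absent from the table.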
The role of a grapheme may be annotated on the grapheme element,
but there is no ATF syntax for specifying it: the ideo,
num or syll values of the role
attribute should be determined by linguistic services processors and
added directly to the XTF version of the text.
The surface syntax for logograms is described under Sign Names above.
A normalization may be given after a word containing at least one
logogram by following the word immediately with (=...),
e.g., SAL(=mimma).
linguistic.attr =
attribute xml:lang { xsd:language } ? ,
# attribute g:rws { "emegir" | "emesal" | "udgalnun" }? ,
(attribute g:role { "sign" | "ideo" | "num" | "syll" }
| (attribute g:role { "logo" } ,
attribute g:logolang { xsd:language }))
A general facility for annotating graphemic proximity is provided
with the notation $<zone>, where zone is
an arbitrary region of the surface defined only by the transliterator
and represented by a single digit. Search engines may provide
facilities to find multiple graphemes with the same zone code (in the
same line) and possibly to relate grapheme proximity to the difference
between zone codes. For example, in a$1 e$2 i$3, the
i grapheme may be considered closer to e
than to a (the ordering relationships of zone codes are
likely to be problematic, however). This is an experimental feature
intended for use in exploring the graphotactics of Early Dynastic
texts.
proximity.attr =
attribute g:prox { xsd:integer }
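A search facility of the kind described above might start from something like the following Python sketch. It assumes that each zone-coded grapheme is a space-separated token ending in a dollar sign and a single digit, as in the a$1 e$2 i$3 example; the function names are hypothetical, and using the absolute difference of zone codes as a distance is only the naive reading that the text itself warns may be problematic.

```python
import re
from collections import defaultdict

# grapheme, then '$', then a single zone digit
ZONED = re.compile(r"([^$\s]+)\$([0-9])")

def parse_zones(line):
    """Return (grapheme, zone) pairs; zone is None when no code is given."""
    pairs = []
    for token in line.split():
        m = ZONED.fullmatch(token)
        if m:
            pairs.append((m.group(1), int(m.group(2))))
        else:
            pairs.append((token, None))
    return pairs

def by_zone(pairs):
    """Group graphemes that share a zone code within one line."""
    groups = defaultdict(list)
    for grapheme, zone in pairs:
        if zone is not None:
            groups[zone].append(grapheme)
    return dict(groups)

def zone_distance(zone_a, zone_b):
    """Naive proximity measure: the difference between zone codes."""
    return abs(zone_a - zone_b)
```

With the example line, the distance between the zones of i and e is smaller than that between i and a, which matches the interpretation suggested in the text.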
To indicate that there is any kind of newline within a
transliterated sequence of words or graphemes, use the semi-colon
(;).
To indicate that an unknown number of signs is missing, use an
ellipsis (...).
If it is necessary to indicate any other information which is not
part of the grapheme sequence, the compound brackets
(#...#) may be used. This feature should be used rarely,
if ever.
mu-un-;e3 [...] [(...)]
nongrapheme =
element g:x {
( attribute g:type { "newline" | "user" }
| ( attribute g:type { "ellipsis" } , status.spans ,
opener? , closer? , break? , notemark?)),
delim? , text? , varnum? ,
attribute xml:id { xsd:ID }? ,
breakStart? , breakEnd? ,
damageStart? , damageEnd? ,
surroStart? , surroEnd? ,
statusStart? , statusEnd?
}
namespace g = "http://emegir.info/gdl"
namespace n = "http://emegir.info/norm"
namespace syn = "http://emegir.info/syntax"
word.content = text | group | grapheme | nongrapheme
words = (word | sword.head | sword.cont | nonword | nongrapheme)*
word =
element g:w {
word.attributes,
word.content*
}
sword.head =
element g:w {
attribute headform { text },
attribute contrefs { xsd:IDREFS },
word.attributes,
word.content*
}
sword.cont =
element g:swc {
attribute xml:id { xsd:ID } ,
attribute xml:lang { xsd:language } ,
attribute form { text }? ,
attribute headref { xsd:IDREF },
attribute swc-final { "1" | "0" },
delim? ,
word.content*
}
word.attributes =
attribute xml:id { xsd:ID } ,
attribute xml:lang { xsd:language } ,
attribute form { text }? ,
attribute lemma { text }? ,
attribute guide { text }? ,
attribute sense { text }? ,
attribute pos { text }? ,
attribute morph { text }? ,
attribute base { text }? ,
attribute norm { text }? ,
delim? ,
syntax.attributes*
nonword =
element g:nonw {
attribute xml:id { xsd:ID }? ,
attribute xml:lang { xsd:language }? ,
attribute type { "comment" | "dollar" | "excised" | "punct" | "vari" }? ,
attribute form { text }? ,
attribute lemma { text }? ,
syntax.attributes* ,
break? , status.flags? , status.spans? , opener? , closer? , delim? ,
word.content*
}
group =
element g:gg {
attribute g:type {
"correction" | "alternation" | "group" | "reordering" | "ligature" | "logo"
} ,
g.meta ,
(group | grapheme)+
}
groupgroup =
element g:gg {
attribute g:type { "group" } ,
g.meta ,
(group | grapheme | normword)+
}
syntax.attributes =
(attribute syn:brk-before { text } |
attribute syn:brk-after { text } |
attribute syn:ub-before { text } |
attribute syn:ub-after { text } )
normword =
element n:w {
word.attributes ,
break? , status.flags? , status.spans? , opener? , closer? ,
hsqb_o? , hsqb_c? ,
(text | gloss | nongrapheme)* ,
syntax.attributes*,
breakStart? , breakEnd? ,
damageStart? , damageEnd? ,
statusStart? , statusEnd?
}
For the purposes of transliteration, a "word" is anything between spaces, including isolated and uninterpretable signs.
In GDL, words are sequences of graphemes or grapheme-groups. The following kinds of grapheme-groups are defined:
alternation: graphemes separated by a slash, e.g., KI/DI. An alternation may contain more than one choice,
but always applies to a sequence of single graphemes.
reordering: indicated by the use of the colon (:) as a grapheme joiner in transliterations.
The original order of the signs on the tablet is not indicated within
a word; the structural mechanism Multiplexing must be used instead.
In ATF words are separated by spaces, and graphemes within words
are joined by hyphens. Note that periods (.) are only
permitted inside compound graphemes.
Simple choices in the transliteration of single graphemes may be expressed
by separating the graphemes with a slash (/). More than
one choice may be given, but each sequence of choices only applies to
a single grapheme.
The fact that signs are inscribed on the object in a different
order than they are transliterated may be indicated by joining
graphemes with the colon (:) instead of the hyphen. This
mechanism is a convenient shorthand which is intended to cover cases
of occasional reversal in sign order. It is only available within
words, it is not permitted in compound signs, and it may only be used
with pairs of reversed graphemes. A different, completely general,
mechanism for indicating more complex reorderings is provided under
the concept of Multiplexing and is explained in the description of
document structure.
a-ba mu-un-ba-al-e KI/DI-bi LAGAB-DUL3 mu:un-du3
Note that there is no surface syntax for group generation in ATF; all non-compound groups are generated as necessary by the ATF processor.
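The way groups fall out of the surface syntax can be sketched in Python as follows. This is a simplified illustration, not the ATF processor: parse_word and parse_line are hypothetical names, compound graphemes, brackets and flags are ignored, and the tuples returned are an ad hoc stand-in for g:gg elements with the corresponding g:type.

```python
def parse_word(word):
    """Split one ATF word into graphemes and simple groups.

    Hyphens join graphemes, a slash marks an alternation, and a colon
    marks a reordered pair of graphemes.
    """
    items = []
    for unit in word.split("-"):
        if ":" in unit:
            pair = unit.split(":")
            if len(pair) != 2:
                # The colon shorthand is only defined for pairs of graphemes.
                raise ValueError("reordering must join exactly two graphemes: " + unit)
            items.append(("reordering", pair))
        elif "/" in unit:
            # More than one choice is allowed, e.g. A/B/C.
            items.append(("alternation", unit.split("/")))
        else:
            items.append(("grapheme", unit))
    return items

def parse_line(line):
    """Words are separated by spaces."""
    return [parse_word(word) for word in line.split()]
```

Run on the example line above, KI/DI-bi yields an alternation followed by a plain grapheme, and mu:un-du3 yields a reordering followed by a plain grapheme.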
A simple entry point so that users don't have to include several separate schemas.
include "charset.rnc"
include "grapheme.rnc"
include "graphmeta.rnc"
include "words.rnc"
The test harness uses the following schema to embed GDL in a document element.
namespace gx = "http://emegir.info/gdl-example"
start = gx
include "gdl.rnc"
gx = element gx:gdl { grapheme+ }
Questions about this document may be directed to Steve Tinney (stinney at sas dot upenn dot edu).