GDL: Grapheme Description Language

(http://emegir.info/gdl)

Steve Tinney
Version of 2009-11-19

Introduction

GDL is a Grapheme Description Language intended for embedding in higher-order document types such as text editions and signlists. This document gives a formal definition as an RNC schema, interwoven with the ATF conventions for representing each element in the schema.

Grapheme

The term "grapheme" as used in this document refers to a string of letters, numbers, modifiers and operators used to specify a Sumero-Akkadian cuneiform sign by name or to render one of the values of such a sign. While sign names are often glyph-descriptive (e.g., KA×A meaning sign A written inside sign KA), this document does not provide a glyph description language. Rather, we define a Grapheme Description Language.

GDL & ATF

GDL is not intended to be generated manually; rather, it is the XML result of processing ASCII Transliteration Format (ATF) with the ATF processor. This document includes implementor notes on ATF interwoven with the technical documentation. Unless you are an implementor, or are pathologically curious (or both), you don't need to read this document! Read the tutorial instead. If you are a developer who is new to GDL and ATF it is recommended that you first read the tutorial, and then this document.

An XSL script to convert from GDL back to ATF can be found in the resources section below. The script does not convert the character set from Unicode to ASCII.

charset.rnc

In this section we provide a model for constraining the lexical representation of graphemic atoms. This aspect of grapheme description does not constrain the validity of values within a given signiary; that is handled elsewhere.

Atoms are tightly constrained sequences of characters separated into distinct lowercase and uppercase sets to permit finer-grained constraints.

Characters

GDL does not support any of the common ASCII approximations of the various non-ASCII characters used in cuneiform transliteration; GDL uses only the specific Unicode codepoints listed below for the representation of these characters. Details and images of the Unicode characters can be found at http://www.unicode.org/charts.

Characters are combined into atom specifications by grouping them in classes which are used to place lexical constraints on the atoms.

lV = Permitted lowercase vowels
a e i u
uV = Permitted uppercase vowels
A E I U
lC = Permitted lowercase consonants
b d g h k l m n p q r s t u w y z
U+014B LATIN SMALL LETTER ENG
U+1E2B LATIN SMALL LETTER H WITH BREVE BELOW
U+015B LATIN SMALL LETTER S WITH ACUTE
U+0161 LATIN SMALL LETTER S WITH CARON
U+1E63 LATIN SMALL LETTER S WITH DOT BELOW
U+1E6D LATIN SMALL LETTER T WITH DOT BELOW
U+02BE MODIFIER LETTER RIGHT HALF RING
uC = Permitted uppercase consonants
B D G H K L M N P Q R S T U W Y Z
U+014A LATIN CAPITAL LETTER ENG
U+1E2A LATIN CAPITAL LETTER H WITH BREVE BELOW
U+015A LATIN CAPITAL LETTER S WITH ACUTE
U+0160 LATIN CAPITAL LETTER S WITH CARON
U+1E62 LATIN CAPITAL LETTER S WITH DOT BELOW
U+1E6C LATIN CAPITAL LETTER T WITH DOT BELOW
U+02BE MODIFIER LETTER RIGHT HALF RING
Si = Subscript initial characters
U+2081-U+2089 (Unicode subscript 1 through 9)
Sc = Subscript continuation characters
U+2080-U+2089 (Unicode subscript 0 through 9)
Sx = Subscript x character
U+208A SUBSCRIPT PLUS SIGN

This yields the following base character sets and definitions (dollar-variables are expanded by a preprocessor to generate the actual RNC schema):

$lV = [aeiu]
$lC = [\x{2BE}bdegh\x{1E2B}i\x{14B}klmnpqrs\x{161}\x{1E63}\x{15B}t\x{1E6D}uwyz]
$uV = [AEIU]
$uC = [\x{2BE}BDEGH\x{1E2A}I\x{14A}KLMNPQRS\x{160}\x{1E62}\x{15A}T\x{1E6C}UWYZ]
$Si = [\x{2081}\x{2082}\x{2083}\x{2084}\x{2085}\x{2086}\x{2087}\x{2088}\x{2089}]
$Sc = [\x{2080}\x{2081}\x{2082}\x{2083}\x{2084}\x{2085}\x{2086}\x{2087}\x{2088}\x{2089}]

$subscript = (${Si}${Sc}?|\x{208A})?

lV = xsd:string {
   pattern = "${lV}${subscript}"
}

lVCv = xsd:string {
  pattern = "(${lV}${lC})+${lV}?${subscript}"
}

lCVc = xsd:string {
  pattern = "(${lC}${lV})+${lC}?${subscript}"
}

lVCCvc = xsd:string {
  pattern = "(${lV}${lC}{1,2})+(${lV}${lC}?)${subscript}"
}

lCVCCvc = xsd:string {
  pattern = "(suen|kuara|${lC}(${lV}${lC}{1,2})+(${lV}${lC}?))${subscript}"
}

uV = xsd:string {
   pattern = "${uV}${subscript}"
}

uVCv = xsd:string {
  pattern = "(${uV}${uC})+${uV}?${subscript}"
}

uCVc = xsd:string {
  pattern = "(${uC}${uV})+${uC}?${subscript}"
}

uVCCvc = xsd:string {
  pattern = "(${uV}${uC}{1,2})+(${uV}${uC}?)${subscript}"
}

uCVCCvc = xsd:string {
  pattern = "${uC}(${uV}${uC}{1,2})+(${uV}${uC}?)${subscript}"
}
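
As a rough illustration of how the expanded patterns behave, the following Python sketch transcribes the lowercase classes and the lCVc pattern by hand. The function name is_lcvc is hypothetical and not part of GDL; the actual schema is generated by the preprocessor, not by code like this.

```python
import re

# Character classes transcribed by hand from the $-variable definitions.
lV = "[aeiu]"
lC = "[\u02BEbdegh\u1E2Bi\u014Bklmnpqrs\u0161\u1E63\u015Bt\u1E6Duwyz]"
Si = "[\u2081-\u2089]"   # subscript 1 through 9
Sc = "[\u2080-\u2089]"   # subscript 0 through 9
subscript = f"(?:{Si}{Sc}?|\u208A)?"

# XSD patterns are implicitly anchored at both ends, so fullmatch is used.
LCVC = re.compile(f"(?:{lC}{lV})+{lC}?{subscript}")


def is_lcvc(atom: str) -> bool:
    """Return True if atom satisfies the lCVc lexical pattern."""
    return LCVC.fullmatch(atom) is not None
```

Under this sketch an atom with a Unicode subscript digit is accepted, while the ASCII approximation with a plain digit is rejected, which matches the statement that GDL does not support ASCII approximations.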

grapheme.rnc

namespace g = "http://emegir.info/gdl"

grapheme = v | q | s | n | c | gloss | g | nongrapheme | punct | gsurro
form     = attribute form { text }
sb       = element g:b { s.model }
vb       = element g:b { v.model }
punct    = element g:p { p.model }
lang     = attribute xml:lang { xsd:language }
gsurro   = 
  element g:surro {
    delim? , (s|c|n) , groupgroup
  }

# Values
#v.model  = "x" | lV | lVCv | lCVc | lVCCvc | lCVCCvc
v.model = text
v        = element g:v { form? , g.meta , lang? , (v.model | (vb , mods+)) }
dingir   = element g:v { g.meta , lang? , ("d") }
mister   = element g:v { g.meta , lang? , ("m") }

# Names
#s.model  =  "N" | "X" | uV | uVCv | uCVc | uVCCvc | uCVCCvc | lst | num
s.model  = text

lst    = xsd:string {
  pattern="(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*"
}

#[ABCD] is a stop-gap until lateuruk numbers are fixed
num      = xsd:string { pattern = "N|N\d+[ABCD]?" }

s        = element g:s { form? , g.meta , (s.model | (sb , mods+)) }

# Qualified graphemes
q        = element g:q { form? , g.meta , (v|s|c) , (s|c|n) }

# Numbers
n.model  = r , (v|s|c|q)?

r        = element g:r {
             xsd:string {
	       pattern = "[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]" } }

n        = element g:n { form? , g.meta , n.model , mods* }

# Modifiers
mods     = modifier | allograph | formvar

modifier = element g:m { xsd:string { pattern = "[a-z]|[0-9]{1,3}" } }

allograph= element g:a { xsd:string { pattern = "[a-wyz0-9]+" } }

formvar = element g:f { xsd:string { pattern = "[a-z0-9]+" } }

# Compounds
c.model  = (compound , (o.join , compound)+) | unary | binary | ternary | (g , mods+)

c        = element g:c { form? , g.meta , c.model , mods* }

g        = element g:g { g.meta , c.model , mods* }

compound = single | unary | binary

single   = n | s | c | (g,mods*) | q

unary    = o.unary , single

binary   = single , o.binary , single

ternary   = single , o.binary , single , o.binary , single

o.join   = element g:o { attribute g:type { "beside" | "joining" | "reordered" } }

o.unary  = element g:o { attribute g:type { "repeated" } , xsd:integer }

o.binary =
  element g:o {
    attribute g:type {
      "containing" | "above" | "crossing" | "opposing"
    }
  }

# Punctuation
p.model =
    attribute g:type { "*"|":"|":'"|':"'|":."|"::"|"|"|"/" } , 
    g.meta , 
    (v|q|s|n|c)?

Preamble

As a design principle, all of the most common GDL elements have single character names. In order to minimize possible confusion with similar names in other vocabularies, it is recommended that GDL elements always be namespace-qualified. To reinforce this point, the definition of the GDL schema does not use a default namespace.

The examples in this document all assume that the prefix g is bound to the namespace of the GDL schema.

namespace g = "http://emegir.info/gdl"

grapheme = v | q | s | n | c | gloss | g | nongrapheme | punct | gsurro
form     = attribute form { text }
sb       = element g:b { s.model }
vb       = element g:b { v.model }
punct    = element g:p { p.model }
lang     = attribute xml:lang { xsd:language }
gsurro   = 
  element g:surro {
    delim? , (s|c|n) , groupgroup
  }

Signs

We call the core alphanumeric portion of a sign an atom. This is a single grapheme component which for the purposes of this grapheme description instance is not susceptible to further sub-description.

All sign values are by definition atoms.

Sign names consist of one or more atoms. In the grapheme A there is a single atom; in the grapheme KA×A there are two atoms, KA and A. In another context, that same grapheme might be named as NAG; this version of the name contains a single atom, despite the fact that a sign list might describe the sign as KA×A. In other words, atomicity in grapheme names is determined by the naming scheme rather than the underlying construction of the glyph.

Two simple elements are defined for atoms: g:v, for sign values, and g:s for sign names.

Values

# Values
#v.model  = "x" | lV | lVCv | lCVc | lVCCvc | lCVCCvc
v.model = text
v        = element g:v { form? , g.meta , lang? , (v.model | (vb , mods+)) }
dingir   = element g:v { g.meta , lang? , ("d") }
mister   = element g:v { g.meta , lang? , ("m") }

Names

# Names
#s.model  =  "N" | "X" | uV | uVCv | uCVc | uVCCvc | uCVCCvc | lst | num
s.model  = text

lst    = xsd:string {
  pattern="(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*"
}

#[ABCD] is a stop-gap until lateuruk numbers are fixed
num      = xsd:string { pattern = "N|N\d+[ABCD]?" }

s        = element g:s { form? , g.meta , (s.model | (sb , mods+)) }

Two special classes of sign name are signlists and numerical sign names. Numerical sign names match the pattern N<DIGITS>. Signlist names consist of an uppercase alphabetic prefix and an ASCII digit suffix; the prefix is the name of the sign list and the suffix is the number of the sign in that list. Prefixes fall into one of two groups. Generic signlist prefixes consist of any one or two uppercase letters followed by SL; hence, CDSL, PSL, PCSL are all valid signlist prefixes. The second group is the built-in set of historic sign lists.

Built-in Sign List Names
Name  Bibliography
ABZ   R. Borger, Assyrisch-babylonische Zeichenliste (AOAT 33; Neukirchen-Vluyn 1978)
BAU   E. Burrows, Archaic Texts (UET 2; London 1935)
HZL   C. Ruster and E. Neu, Hethitisches Zeichenlexikon (Harrassowitz Verlag 1989)
KWU   N. Schneider, Die Keilschriftzeichen der Wirtschaftsurkunden von Ur III (Rome 1935)
LAK   A. Deimel, Liste der archaischen Keilschriftzeichen (WVDOG 40; Berlin 1922)
MEA   R. Labat, Manuel d'épigraphie akkadienne (6th ed. Paris 1988)
MZL   R. Borger, Mesopotamisches Zeichenlexikon (AOAT 305; Ugarit-Verlag 2003)
REC   F. Thureau-Dangin, Recherches sur l'origine de l'écriture cunéiforme (Paris 1898)
RSP   Y. Rosengarten, Répertoire commenté des signes présargoniques sumériens de Lagash (Paris 1967)
ZATU  M. Green and H. J. Nissen, Zeichenliste der Archaischen Texte aus Uruk (ATU 2; Berlin 1987)
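
The two special classes of sign name can be checked with the lst and num patterns exactly as they appear in the schema. The sketch below is only an illustration: classify_sign_name is a hypothetical helper, and the example names in the usage note are chosen to exercise the patterns, not taken from any sign list.

```python
import re

# Patterns copied from the lst and num definitions in grapheme.rnc.
LST = re.compile(r"(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*")
NUM = re.compile(r"N|N\d+[ABCD]?")


def classify_sign_name(name: str) -> str:
    """Return "num", "lst" or "other" for a candidate sign name."""
    # XSD patterns are implicitly anchored, hence fullmatch.
    if NUM.fullmatch(name):
        return "num"
    if LST.fullmatch(name):
        return "lst"
    return "other"
```

Note that the generic prefix alternative ..?SL accepts any one or two characters before SL, so the pattern is looser than the prose description of uppercase letters; a string such as 12SL3 also matches.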

Qualified

Qualified graphemes consist of a sign value followed by a sign name in parentheses, e.g., pu(BU). (In normalized text the superficially similar construct is used to indicate the logograms used for the normalized form, e.g., %akk/n b=elu(EN).)

# Qualified graphemes
q        = element g:q { form? , g.meta , (v|s|c) , (s|c|n) }
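
A minimal sketch of how the surface form value(NAME) might be split into the two children of g:q. The regular expression and the helper split_qualified are assumptions made for illustration; they are not the ATF processor's parser.

```python
import re

# value, then a parenthesised qualifier; no whitespace is allowed inside.
QUALIFIED = re.compile(r"(?P<value>[^()\s]+)\((?P<name>[^\s]+)\)")


def split_qualified(token: str):
    """Return (value, name) for a token such as pu(BU), or None."""
    match = QUALIFIED.fullmatch(token)
    if match is None:
        return None
    return match["value"], match["name"]
```

Numerical graphemes such as 1(u) share this surface shape, so a real implementation would have to tell the two apart before building a g:q or g:n element; this sketch does not attempt that.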

Number

Numerical graphemes have a special form. Each numerical grapheme consists of at least two parts: the repetition count and the sign value, sign name or compound sign. A special case is made for numerical graphemes by allowing them to have modifiers even if the graphemic base is a sign value.

The repetition count must have one of the following forms:

digits
This is the normal case.
n
This is a special case for circumstances where the repetition is completely uncertain.
n+digits
This is a special case for circumstances where the repetition is partly uncertain.

While it would in principle be possible to constrain the value space of GRAPHEME in the schema we do not do so; instead, as with non-numerical graphemes, we constrain the lexical space and require the values of numerical graphemes to be validated elsewhere. This allows the schema to be open-ended with respect to the identification of new numerical systems.

# Numbers
n.model  = r , (v|s|c|q)?

r        = element g:r {
             xsd:string {
	       pattern = "[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]" } }

n        = element g:n { form? , g.meta , n.model , mods* }
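
The pattern on g:r can be exercised directly. The following Python sketch copies the pattern from the schema; is_repetition_count is a hypothetical name used only here.

```python
import re

# Pattern copied from the g:r definition.
REPETITION = re.compile(r"[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]")


def is_repetition_count(text: str) -> bool:
    """Return True if text is a lexically valid repetition count."""
    # XSD patterns are implicitly anchored, hence fullmatch.
    return REPETITION.fullmatch(text) is not None
```

Besides the three forms listed above, the fourth alternative of the pattern also admits fraction-like counts such as 1/2, whose denominator is restricted to a single character from n and 1 through 9.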

Modifier

Sign names and numerical sign value atoms may be described by reference to modifications of the base sign, as summarized in the table below. The lexical representation of modifiers is restricted to either a single lower case letter or a sequence of one, two or three ASCII digits. The semantics of these modifiers is indicated in the table, but is irrelevant from the point of view of the schema. A single GDL element, g:m, contains the modifier.

Modifiers may not follow a compound sign's terminating pipe character; if an entire compound is to be modified, the compound's content must be grouped and the modifiers suffixed between the closing parenthesis and the closing pipe.

# Modifiers
mods     = modifier | allograph | formvar

modifier = element g:m { xsd:string { pattern = "[a-z]|[0-9]{1,3}" } }

Allograph

It is sometimes desirable to distinguish between grapheme instances which have otherwise been considered the same sign, or which actually are the same sign, for semantic or glyph-analytic reasons. This is expressed in GDL by the g:a element whose content is a sequence of one or more lowercase letters, excluding x, and ASCII digits. Sign list creators are free to assign whatever meanings they like to any combinations of these characters; in PCSL, for example, sequences such as a1a versus a1b and a2a versus a2b are used to implement multi-level distinctions between variants of a sign. An allograph may follow the closing parenthesis of a group within a compound sign, but may not follow the final vertical bar of the compound.

The reason for the exclusion of x in the allowable set of lowercase letters in an allograph is that allowing it would introduce an ambiguity at the ATF level between x in allographs and x as a compound operator.

allograph= element g:a { xsd:string { pattern = "[a-wyz0-9]+" } }
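
To make the lexical difference between modifiers and allographs concrete, the sketch below tests a string against both patterns as given in the schema. The helper accepted_as is hypothetical.

```python
import re

# Patterns copied from the g:m and g:a definitions.
PATTERNS = {
    "modifier": re.compile(r"[a-z]|[0-9]{1,3}"),
    "allograph": re.compile(r"[a-wyz0-9]+"),
}


def accepted_as(text: str) -> list:
    """Return the names of the patterns that accept text, in a fixed order."""
    return [name for name, rx in PATTERNS.items() if rx.fullmatch(text)]
```

With these patterns a sequence like a1b is an allograph only, a lone x is a modifier only, and any allograph candidate containing x is rejected outright.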

Formvars

formvar = element g:f { xsd:string { pattern = "[a-z0-9]+" } }

Form variants are GDL's term for minor differences in the construction of signs. They may be of interest when analysing the handwriting in a corpus, but are not important enough to be displayed or included in the version of the writing used for linguistic analysis.

Compound

Compound graphemes are combinations of sign names and operators. The definition is recursive: compound grapheme atoms may be grouped and the group treated as a compound in its own right. Atoms and compounds may both have associated modifier and/or allograph qualifications. We call a single combination of a sign or compound sign and its qualifiers a constituent.

The possible operator types are:

beside
Constituents are written sequentially beside each other.
joining
Constituents are written such that they share at least one common wedge.
containing
The constituent preceding the operator contains the constituent following the operator; the containment may be partial.
above
The constituent preceding the operator is written in the upper part of the line, with the following constituent written beneath it in the lower part of the line.
crossing
The constituents cross one another similarly to the diagonals of an X.
opposing
The constituent preceding the operator is opposite the following constituent, which is turned upside down.
repeated
The following constituent is repeated N times.

The beside and joining operators are in fact joiners which mark boundaries; any number of joiner/compound pairs may be siblings.

The containing, above, crossing and opposing operators all have binary scope: a compound which contains an operator is constrained to having exactly two compound children, one before and one after the operator.

The repeated operator is a unary prefix with the content of the operator giving the repetition count. Compounds containing this operator may have only one compound child.

The rotated operator is a unary postfix with the content of the operator giving the number of degrees the sign is rotated in a clockwise direction. Compounds containing this operator may have only one compound child.

# Compounds
c.model  = (compound , (o.join , compound)+) | unary | binary | ternary | (g , mods+)

c        = element g:c { form? , g.meta , c.model , mods* }

g        = element g:g { g.meta , c.model , mods* }

compound = single | unary | binary

single   = n | s | c | (g,mods*) | q

unary    = o.unary , single

binary   = single , o.binary , single

ternary   = single , o.binary , single , o.binary , single

o.join   = element g:o { attribute g:type { "beside" | "joining" | "reordered" } }

o.unary  = element g:o { attribute g:type { "repeated" } , xsd:integer }

o.binary =
  element g:o {
    attribute g:type {
      "containing" | "above" | "crossing" | "opposing"
    }
  }
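
The arity rules of c.model can be sketched as a small checker over a flattened token list. This is a simplification under stated assumptions: "S" is a placeholder for any single constituent, operators are represented by their g:type values, and the (g , mods+) alternative and trailing modifiers are ignored. The function names are hypothetical.

```python
JOIN = {"beside", "joining", "reordered"}
BINARY = {"containing", "above", "crossing", "opposing"}
UNARY = {"repeated"}


def chunk_kind(chunk):
    """Classify a join-free token list; "S" stands for any single constituent."""
    if chunk == ["S"]:
        return "single"
    if len(chunk) == 2 and chunk[0] in UNARY and chunk[1] == "S":
        return "unary"
    if (len(chunk) in (3, 5)
            and all(tok == "S" for tok in chunk[0::2])
            and all(op in BINARY for op in chunk[1::2])):
        return "binary" if len(chunk) == 3 else "ternary"
    return None


def is_valid_c_model(tokens):
    """Check a token list against the simplified c.model."""
    chunks, current = [], []
    for tok in tokens:
        if tok in JOIN:          # joiners mark boundaries between compounds
            chunks.append(current)
            current = []
        else:
            current.append(tok)
    chunks.append(current)
    if len(chunks) == 1:         # no joiner: unary | binary | ternary
        return chunk_kind(chunks[0]) in {"unary", "binary", "ternary"}
    # joined sequence: each member must be single | unary | binary
    return all(chunk_kind(c) in {"single", "unary", "binary"} for c in chunks)
```

For example, a name read as one sign containing another corresponds to ["S", "containing", "S"], which is accepted, whereas a ternary sequence inside a joined compound is rejected because compound only admits single, unary and binary.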

Punctuation

Several types of cuneiform punctuation are supported in ATF and all of them must be preceded and followed by a space (in the case of * and / the punctuation may be immediately followed by a sign name in parentheses and then the following space). The recognized punctuation codes are:

* = Bullet
The "1" used at the start of each line in lexical texts.
*(GRAPHEME)
Generic punctuation; most often used where scribes use signs other than a "1" at the start of the line in lexical texts, but may be used to transliterate arbitrary or unusual kinds of punctuation that are not otherwise covered below.
: = cuneiform vertical colon
The vertical "colon" sign often found in commentaries. N.B.: if the single colon occurs within a word it must be transliterated with the grapheme name form P₂.

:' (colon+right-quote) =
Borger MZL 592 variant b; a variant on the vertical two-wedge colon
:" (colon+double-quote) = cuneiform diagonal colon
The diagonal "colon" sign often found in commentaries. Note that the three different double-wedge colon signs are mnemonically two-dots, two-dots-prime and two-dots-double-prime
:. = cuneiform triple wedge colon
The triple-wedge "colon" sign sometimes found in commentaries.
:: = ??
(A colon convention defined in the SAA style manual, form unspecified.)
/ = word divider
Word divider; if unqualified, this is the single vertical wedge word-divider as used, e.g., in Old Assyrian texts. May be qualified as, e.g., /(P2).

Punctuation Sign Names

The punctuation signs may also be transliterated using the following names: P1 (cuneiform word divider); P2 (cuneiform colon); P3 (cuneiform diagonal colon); P4 (cuneiform triple wedge colon); MZL592~b (as :').

# Punctuation
p.model =
    attribute g:type { "*"|":"|":'"|':"'|":."|"::"|"|"|"/" } , 
    g.meta , 
    (v|q|s|n|c)?
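
The relationship between punctuation codes and the alternative sign names can be written as a lookup table. The pairing below is an assumption inferred by matching the descriptions in the two lists above; only the pairing of :' with MZL592~b is stated explicitly. The helper sign_name_for is hypothetical.

```python
# Values permitted for g:type on g:p, copied from p.model.
G_TYPES = {"*", ":", ":'", ':"', ":.", "::", "|", "/"}

# Assumed pairing of punctuation codes with punctuation sign names.
PUNCT_SIGN_NAMES = {
    "/": "P1",         # cuneiform word divider
    ":": "P2",         # cuneiform colon
    ':"': "P3",        # cuneiform diagonal colon
    ":.": "P4",        # cuneiform triple wedge colon
    ":'": "MZL592~b",
}


def sign_name_for(code: str):
    """Return the sign name for a punctuation code, or None if there is none."""
    if code not in G_TYPES:
        raise ValueError(f"not a recognized punctuation code: {code!r}")
    return PUNCT_SIGN_NAMES.get(code)
```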

graphmeta.rnc

namespace g = "http://emegir.info/gdl"
g.meta = 
  break? , status.flags? , status.spans? , 
  paleography.attr? , linguistic.attr? , proximity.attr? ,
  opener? , closer? , hsqb_o?, hsqb_c? , emhyph? ,
  varnum? , utf8? , delim? , notemark? ,
  attribute xml:id { xsd:ID }? ,
  breakStart? , breakEnd? ,
  damageStart? , damageEnd? ,
  surroStart? , surroEnd? ,
  statusStart? , statusEnd? ,
  accented?

accented = attribute g:accented { text }
breakStart = attribute g:breakStart { "1" }
breakEnd = attribute g:breakEnd { xsd:IDREF }
damageStart = attribute g:damageStart { "1" }
damageEnd = attribute g:damageEnd { xsd:IDREF }
surroStart = attribute g:surroStart { "1" }
surroEnd = attribute g:surroEnd { xsd:IDREF }
statusStart = attribute g:statusStart { "1" }
statusEnd = attribute g:statusEnd { xsd:IDREF }

notemark = attribute notemark { text }

break = attribute g:break  { "damaged" | "missing" }
opener = attribute g:o     { text }
closer = attribute g:c     { text }
hsqb_o = attribute g:ho    { "1" }
hsqb_c = attribute g:hc    { "1" }
emhyph = attribute g:em    { "1" }
utf8   = attribute g:utf8  { text }
delim  = attribute g:delim { text }
varnum = (
  attribute g:varo { text }? , 
  attribute g:vari { text }? ,  
  attribute g:varc { text }?
)

status.flags =
  attribute g:collated { xsd:boolean } ? ,
  attribute g:queried  { xsd:boolean } ? ,
  attribute g:remarked { xsd:boolean } ?

gloss = det | glo
pos = attribute g:pos { "pre" | "post" | "free" }
det = element g:d { pos , dtyp , delim? , emhyph? , (dingir | mister | word.content*)}
dtyp= attribute g:role { "phonetic" | "semantic" }
glo = element g:gloss { attribute g:type { "lang" | "text" } , delim? , pos , words }

status.spans =
  attribute g:status {
    "ok" | "erased" | "excised" | "implied" | "maybe" | "supplied"
  }

paleography.attr =
  attribute g:script      { xsd:NCName }

linguistic.attr =
  attribute xml:lang      { xsd:language } ? ,
#  attribute g:rws         { "emegir" | "emesal" | "udgalnun" }? ,
  (attribute g:role       { "sign" | "ideo" | "num" | "syll" }
  | (attribute g:role     { "logo" } ,
     attribute g:logolang { xsd:language }))

proximity.attr = 
  attribute g:prox { xsd:integer }

nongrapheme = 
  element g:x {
    ( attribute g:type { "newline" | "user" }
    | ( attribute g:type { "ellipsis" } , status.spans ,
        opener? , closer? , break? , notemark?)),
    delim? , text? , varnum? ,
    attribute xml:id { xsd:ID }? ,
    breakStart? , breakEnd? ,
    damageStart? , damageEnd? ,
    surroStart? , surroEnd? ,
    statusStart? , statusEnd?
    }

Preamble

This module defines attributes which are essentially graphemic metadata supplied by the editor of the text. They fall into several groups: properties of the grapheme imputed to derive from the scribe; properties assigned by the editor; physical preservation properties; paleographic properties; and linguistic properties. We describe these principally in the form of the tutorial aimed at end-users and allow the sequence of definitions in the schema to follow the tutorial.

namespace g = "http://emegir.info/gdl"
g.meta = 
  break? , status.flags? , status.spans? , 
  paleography.attr? , linguistic.attr? , proximity.attr? ,
  opener? , closer? , hsqb_o?, hsqb_c? , emhyph? ,
  varnum? , utf8? , delim? , notemark? ,
  attribute xml:id { xsd:ID }? ,
  breakStart? , breakEnd? ,
  damageStart? , damageEnd? ,
  surroStart? , surroEnd? ,
  statusStart? , statusEnd? ,
  accented?

accented = attribute g:accented { text }
breakStart = attribute g:breakStart { "1" }
breakEnd = attribute g:breakEnd { xsd:IDREF }
damageStart = attribute g:damageStart { "1" }
damageEnd = attribute g:damageEnd { xsd:IDREF }
surroStart = attribute g:surroStart { "1" }
surroEnd = attribute g:surroEnd { xsd:IDREF }
statusStart = attribute g:statusStart { "1" }
statusEnd = attribute g:statusEnd { xsd:IDREF }

notemark = attribute notemark { text }

Breakage

break = attribute g:break  { "damaged" | "missing" }
opener = attribute g:o     { text }
closer = attribute g:c     { text }
hsqb_o = attribute g:ho    { "1" }
hsqb_c = attribute g:hc    { "1" }
emhyph = attribute g:em    { "1" }
utf8   = attribute g:utf8  { text }
delim  = attribute g:delim { text }
varnum = (
  attribute g:varo { text }? , 
  attribute g:vari { text }? ,  
  attribute g:varc { text }?
)

Other flags

status.flags =
  attribute g:collated { xsd:boolean } ? ,
  attribute g:queried  { xsd:boolean } ? ,
  attribute g:remarked { xsd:boolean } ?

Glosses

ATF divides glosses into three types:

Determinatives
Determinatives include semantic and phonetic modifiers, which may be single graphemes or several hyphenated graphemes, which are part of the current word. Determinatives are enclosed in single brackets {...}; semantic determinatives require no special marking, but phonetic glosses and determinatives should be indicated by adding a plus sign (+) immediately after the opening brace, e.g., AN{+e}. Multiple separate determinatives must be enclosed in their own brackets, but a single determinative may consist of more than one sign (as is the case with Early Dynastic pronunciation glosses).
Linguistic
Linguistic glosses are defined for the purposes of this specification as glosses which give an alternative to the word(s) in question. Such alternatives are typically either variants or translations. Linguistic glosses are enclosed in the double brackets {{...}}.
Document-oriented
Document-oriented glosses are used for scribal comments on the document including 10-marks, line-count summaries and asides such as he-pi2 ("(text) broken"). Document-oriented glosses are enclosed in the compound brackets {(...)}.

Glosses must have a space or hyphen on one side or the other. They may have spaces on both sides. Glosses may not touch directly both the preceding and following graphemes; nor may they have hyphens at both ends.

{d}utu   larsa{ki}   {+u3-mu2}u2-mu11    AN{+e}

du3-am3{{mu-un-<(du3)>}}

{(1(u))}    {(%a he-pi2 esz-szu)}

The ATF processor sets type=text when the gloss is enclosed in {(...)} and type=lang when the gloss is enclosed in {{...}}.

The ATF processor sets pos=pre when the gloss has no space or boundary following it; pos=post when the gloss has no space or boundary preceding it; and pos=free when the gloss has spaces on both sides.
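
The bracket and position rules can be sketched in Python as follows. This is an illustration under assumptions, not the ATF processor's logic: the function names are hypothetical, the input is a single already-isolated gloss token, and "boundary" is taken to mean a space or other boundary on that side of the gloss.

```python
def classify_gloss(token: str):
    """Return (element, type-or-role) for a bracketed gloss token."""
    if token.startswith("{{") and token.endswith("}}"):
        return ("g:gloss", "lang")      # linguistic gloss
    if token.startswith("{(") and token.endswith(")}"):
        return ("g:gloss", "text")      # document-oriented gloss
    if token.startswith("{+") and token.endswith("}"):
        return ("g:d", "phonetic")      # phonetic determinative or gloss
    if token.startswith("{") and token.endswith("}"):
        return ("g:d", "semantic")      # semantic determinative
    raise ValueError(f"not a gloss token: {token!r}")


def gloss_pos(boundary_before: bool, boundary_after: bool) -> str:
    """Return the g:pos value implied by the boundaries around a gloss."""
    if boundary_before and boundary_after:
        return "free"
    if boundary_before:
        return "pre"    # nothing separates the gloss from what follows
    if boundary_after:
        return "post"   # nothing separates the gloss from what precedes
    raise ValueError("a gloss may not touch graphemes on both sides")
```

In {d}utu the determinative has a boundary before it and none after, giving pre; in larsa{ki} the situation is reversed, giving post.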

gloss = det | glo
pos = attribute g:pos { "pre" | "post" | "free" }
det = element g:d { pos , dtyp , delim? , emhyph? , (dingir | mister | word.content*)}
dtyp= attribute g:role { "phonetic" | "semantic" }
glo = element g:gloss { attribute g:type { "lang" | "text" } , delim? , pos , words }

Presence

status.spans =
  attribute g:status {
    "ok" | "erased" | "excised" | "implied" | "maybe" | "supplied"
  }

Programming note: Graphemic elements which can carry graphemic content (i.e., g:v, g:s, g:c, g:p, g:q, g:n, and g:x where the type is ellipsis) always have a g:status attribute. This can be used to navigate to the preceding/following grapheme which can have bracketing to determine when to open/close bracketing. Graphemes which have no explicit presence-status have g:status="ok".
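
One way to use g:status as described in the programming note is to group consecutive graphemes into runs of equal status; a renderer could then open a bracket at the start of each non-ok run and close it at the end. The sketch below uses Python's standard xml.etree.ElementTree. The sample fragment is invented for illustration and is deliberately minimal (it omits attributes the schema requires on g:w), and status_runs is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

G = "http://emegir.info/gdl"

# Invented, schema-incomplete fragment used only to drive the example.
SAMPLE = (
    '<g:w xmlns:g="http://emegir.info/gdl">'
    '<g:v g:status="ok">a</g:v>'
    '<g:v g:status="supplied">na</g:v>'
    '<g:v g:status="supplied">ku</g:v>'
    '<g:v g:status="ok">um</g:v>'
    '</g:w>'
)


def status_runs(word):
    """Group the child graphemes of word into (status, [texts]) runs."""
    runs = []
    for el in word:
        # Namespaced attributes are keyed as {namespace-uri}localname.
        status = el.get(f"{{{G}}}status", "ok")
        if runs and runs[-1][0] == status:
            runs[-1][1].append(el.text)
        else:
            runs.append((status, [el.text]))
    return runs
```

Calling status_runs(ET.fromstring(SAMPLE)) yields three runs: ok, supplied (two graphemes) and ok.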

Scripts

paleography.attr =
  attribute g:script      { xsd:NCName }

Languages

Language attributes specify the language which the grapheme is being used to write (normally specified at a higher level: word, line or text), the role of the grapheme and, in the case of logograms, the source language from which the logogram derives.

Languages can be shifted in several ways. The simplest way is to use explicit shifters:

an = %a sza-me-e

A shift done in this way is terminated in one of three ways:

  • another shifter of the same type;
  • the end of the line;
  • the end of the enclosing determinative or gloss, if the shift is given within the determinative or gloss.

A second convention which is useful in mixed-language contexts is to use matched pairs of underscore characters (_) to shift in and out of the secondary language (the ATF processor's definitions of primary/secondary language relationships are usually enough, but you can configure them with ATF protocols if necessary; when the current language is Akkadian, the secondary is Sumerian and vice versa):

%a im-me-ra-am _szu ba-an-ti_
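
The underscore convention can be illustrated with a short Python sketch that tags each word of a line with a language. It is a simplification under assumptions: only the %a and %s shifters are handled, the default primary language is arbitrarily taken to be Sumerian, and tag_languages is a hypothetical helper rather than part of the ATF processor.

```python
# Primary/secondary relationship as described for Akkadian and Sumerian.
SECONDARY = {"akk": "sux", "sux": "akk"}
SHIFTERS = {"%a": "akk", "%s": "sux"}


def tag_languages(line: str, default: str = "sux"):
    """Return (language, word) pairs for a whitespace-separated ATF line."""
    primary = default
    in_secondary = False
    tagged = []
    for token in line.split():
        if token in SHIFTERS:            # explicit shifter changes the primary
            primary = SHIFTERS[token]
            continue
        if token.startswith("_"):        # opening underscore: enter secondary
            in_secondary = True
        lang = SECONDARY[primary] if in_secondary else primary
        tagged.append((lang, token.strip("_")))
        if token.endswith("_"):          # closing underscore: leave secondary
            in_secondary = False
    return tagged
```

Applied to the example line above, the first word is tagged akk and the two words between the underscores are tagged sux.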

Normalized text, e.g., Akkadian transcription, can be indicated by use of the %n shift, with an analogous %g indicating graphemic text (though this is in practice rarely necessary). If %n or %g follow a language shift they can be separated using /: %akk/n.

Single-letter shorthands are defined for the common requirements of Sumero-Akkadian cuneiform, but by definition all two-letter and three-letter shifters may be language codes. For validation to be effective, language codes which are not in the table below must be registered in the protocols section. Several lists of the standardized language codes exist on the web, but GDL follows the ISO 639-3 registry at SIL.

We utilize codes in the range qaa to qtz to handle languages that are either not covered by ISO 639 or are not sufficiently determined to assign to one of the known languages.

Single Letter Language/Writing-system Shifts
ATF  ISO 639  Language
%a   akk      Akkadian
%e   sux-es   Emesal Sumerian
%g   -        Graphemic text
%h   hit      Hittite
%n   -        Normalized text
%s   sux      Emegir Sumerian
%u   sux-ugn  UD.GAL.NUN Sumerian
%x   qcu      Undetermined language in cuneiform

Predefined Language Codes
Code  Language
%akk  Akkadian
%arc  Aramaic
%elx  Elamite
%grc  Ancient Greek
%hit  Hittite
%peo  Old Persian
%qam  Amorite
%xcr  Carian
%qcu  Undetermined cuneiform
%qeb  Eblaite
%xhu  Hurrian
%xlc  Lycian
%xld  Lydian
%xlu  Cuneiform Luvian
%hlu  Hieroglyphic Luvian
%imy  Milyan (Lycian B)
%qpc  Proto-Cuneiform
%qpe  Proto-Elamite
%plq  Palaic
%xur  Urartian
%sux  Sumerian
%uga  Ugaritic

The ATF processor maps single letter shifters to the explicit lang/rws values which are expected in the schema. The logolang attribute is set from the ATF processor's secondary language when processing logograms.
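
The mapping from shifters to language codes can be sketched as a lookup plus a little parsing. The table values are taken from the single-letter table above; everything else (the function name parse_shift, the returned tuple shape, and the decision to accept any lowercase two- or three-letter code) is an assumption for illustration and does not reproduce the ATF processor's validation against registered codes.

```python
# Single-letter shifters and their codes, from the table above.
SINGLE_LETTER = {
    "a": "akk", "e": "sux-es", "h": "hit",
    "s": "sux", "u": "sux-ugn", "x": "qcu",
}
MODES = {"g", "n"}   # graphemic / normalized text


def parse_shift(token: str):
    """Return (language, mode) for a shifter such as %a, %n or %akk/n."""
    if not token.startswith("%"):
        raise ValueError(f"not a shifter: {token!r}")
    head, _, mode = token[1:].partition("/")
    if mode and mode not in MODES:
        raise ValueError(f"unknown mode in {token!r}")
    if len(head) == 1:
        if head in MODES and not mode:
            return (None, head)              # %g or %n on its own
        if head in SINGLE_LETTER:
            return (SINGLE_LETTER[head], mode or None)
    elif len(head) in (2, 3) and head.isalpha() and head.islower():
        return (head, mode or None)          # treated as a language code
    raise ValueError(f"unrecognized shifter: {token!r}")
```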

Roles

The role of a grapheme may be annotated on the grapheme element, but there is no ATF syntax for specifying it: the ideo, num or syll values of the role attribute should be determined by linguistic services processors and added directly to the XTF version of the text.

Logograms

The surface syntax for logograms is described under Sign Names above.

A normalization may be given after a word containing at least one logogram by following the word immediately with (=...), e.g., SAL(=mimma).

linguistic.attr =
  attribute xml:lang      { xsd:language } ? ,
#  attribute g:rws         { "emegir" | "emesal" | "udgalnun" }? ,
  (attribute g:role       { "sign" | "ideo" | "num" | "syll" }
  | (attribute g:role     { "logo" } ,
     attribute g:logolang { xsd:language }))

Proximity

proximity.attr = 
  attribute g:prox { xsd:integer }

Intrusions

nongrapheme = 
  element g:x {
    ( attribute g:type { "newline" | "user" }
    | ( attribute g:type { "ellipsis" } , status.spans ,
        opener? , closer? , break? , notemark?)),
    delim? , text? , varnum? ,
    attribute xml:id { xsd:ID }? ,
    breakStart? , breakEnd? ,
    damageStart? , damageEnd? ,
    surroStart? , surroEnd? ,
    statusStart? , statusEnd?
    }

words.rnc

namespace g = "http://emegir.info/gdl"
namespace n = "http://emegir.info/norm"
namespace syn = "http://emegir.info/syntax"

word.content = text | group | grapheme | nongrapheme

words = (word | sword.head | sword.cont | nonword | nongrapheme)*

word = 
  element g:w {
    word.attributes,
    word.content*
  }

sword.head = 
  element g:w {
    attribute headform { text },
    attribute contrefs { xsd:IDREFS },
    word.attributes,
    word.content*
  }

sword.cont = 
  element g:swc {
    attribute xml:id { xsd:ID } ,
    attribute xml:lang { xsd:language } ,
    attribute form  { text }? ,
    attribute headref { xsd:IDREF },
    attribute swc-final { "1" | "0" },
    delim? ,
    word.content*
  }

word.attributes = 
    attribute xml:id { xsd:ID } ,
    attribute xml:lang { xsd:language } ,
    attribute form  { text }? ,
    attribute lemma { text }? ,
    attribute guide { text }? ,
    attribute sense { text }? ,
    attribute pos   { text }? ,
    attribute morph { text }? ,
    attribute base  { text }? ,
    attribute norm  { text }? ,
    delim? ,
    syntax.attributes*

nonword = 
  element g:nonw {
    attribute xml:id { xsd:ID }? ,
    attribute xml:lang { xsd:language }? ,
    attribute type { "comment" | "dollar" | "excised" | "punct" | "vari" }? ,
    attribute form { text }? ,
    attribute lemma { text }? ,
    syntax.attributes* ,
    break? , status.flags? , status.spans? , opener? , closer? , delim? ,
    word.content*
  }

group = 
  element g:gg {
    attribute g:type { 
      "correction" | "alternation" | "group" | "reordering" | "ligature" | "logo"
    } ,
    g.meta ,
    (group | grapheme)+
  }

groupgroup = 
  element g:gg {
    attribute g:type { "group" } ,
    g.meta ,
    (group | grapheme | normword)+
  }

syntax.attributes = 
  (attribute syn:brk-before { text } |
   attribute syn:brk-after  { text } |
   attribute syn:ub-before  { text } |
   attribute syn:ub-after   { text } )

normword = 
  element n:w { 
    word.attributes , 
    break? , status.flags? , status.spans? , opener? , closer? , 
    hsqb_o? , hsqb_c? ,
    (text | gloss | nongrapheme)* ,
    syntax.attributes*,
    breakStart? , breakEnd? ,
    damageStart? , damageEnd? ,
    statusStart? , statusEnd?
  }
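
The sword.head and sword.cont patterns tie a split word together through ID references: contrefs on the head g:w lists the xml:id of each continuation, and headref on each g:swc points back to the head. The following is a minimal sketch, using only the Python standard library, of how a consumer of GDL might check that the two directions agree. The sample fragment, its line wrapper element, the ids, and the function name check_split_words are all invented for illustration and are not part of the GDL tools.

```python
import xml.etree.ElementTree as ET

G_NS = "http://emegir.info/gdl"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: the <line> wrapper and all ids are placeholders.
SAMPLE = """\
<line xmlns:g="http://emegir.info/gdl">
  <g:w xml:id="w1" xml:lang="sux" headform="lugal" contrefs="c1">lu</g:w>
  <g:swc xml:id="c1" xml:lang="sux" headref="w1" swc-final="1">gal</g:swc>
</line>
"""

def check_split_words(root):
    """Return messages for head/continuation links that do not match."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    problems = []
    for head in root.iter(f"{{{G_NS}}}w"):
        head_id = head.get(XML_ID)
        # contrefs is xsd:IDREFS, i.e. a whitespace-separated list of ids.
        for ref in head.get("contrefs", "").split():
            cont = by_id.get(ref)
            if cont is None or cont.tag != f"{{{G_NS}}}swc":
                problems.append(f"{head_id}: contrefs target {ref} is not a g:swc")
            elif cont.get("headref") != head_id:
                problems.append(f"{ref}: headref does not point back to {head_id}")
    return problems

print(check_split_words(ET.fromstring(SAMPLE)))  # prints []
```

A schema validator already guarantees that each IDREF resolves to some ID in the document; a check of this kind adds the constraint that the target is the matching half of the split word.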

Words

For the purposes of transliteration, a "word" is anything between spaces, including isolated and uninterpretable signs.
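
As a rough illustration of this definition, the sketch below splits a transliterated line on whitespace and wraps each piece in a g:w element carrying the two required attributes from word.attributes (xml:id and xml:lang) plus form. It is not the ATF processor: the function name words_from_line and the id scheme are hypothetical, and the word content is left as plain text rather than parsed into graphemes.

```python
import xml.etree.ElementTree as ET

G_NS = "http://emegir.info/gdl"
XML_NS = "http://www.w3.org/XML/1998/namespace"

def words_from_line(line, lang, id_prefix):
    """Split a transliterated line on whitespace and wrap each piece in g:w."""
    words = []
    for n, form in enumerate(line.split(), start=1):
        w = ET.Element(f"{{{G_NS}}}w", {
            f"{{{XML_NS}}}id": f"{id_prefix}.{n}",  # hypothetical id scheme
            f"{{{XML_NS}}}lang": lang,
            "form": form,
        })
        # word.content admits plain text; a real processor would emit graphemes.
        w.text = form
        words.append(w)
    return words

words = words_from_line("lugal-e e2 mu-du3", lang="sux", id_prefix="l1")
print([w.get("form") for w in words])  # ['lugal-e', 'e2', 'mu-du3']
```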

In GDL, words are sequences of graphemes or grapheme-groups. The following kinds of grapheme-groups are defined:

alternation
Simple alternation of the common transliterational form KI/DI. An alternation may contain more than one choice, but always applies to a sequence of single graphemes.
reordering
Reordering of graphemes within a word, commonly expressed in transliteration by using the colon (:) as a grapheme joiner. The original order of the signs on the tablet is not indicated within a word; to record it, the structural mechanism Multiplexing must be used instead.
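
The two group kinds described above can be sketched as g:gg elements whose g:type attribute names the kind. The example below is a simplified illustration only: the function name group_from_token is hypothetical, g:s is used as a stand-in child element on the assumption that the real grapheme patterns are those defined in grapheme.rnc, the g.meta attributes are omitted, and genuine ATF tokens need far more careful parsing than a split on a single joiner character.

```python
import xml.etree.ElementTree as ET

G_NS = "http://emegir.info/gdl"

# Joiner characters as described in the text: "/" for alternation,
# ":" for reordering. Checked in this order; mixed tokens are not handled.
JOINERS = {"/": "alternation", ":": "reordering"}

def group_from_token(token):
    """Build a g:gg element for a token such as 'KI/DI', or return None."""
    for joiner, gtype in JOINERS.items():
        parts = token.split(joiner)
        if len(parts) > 1 and all(parts):
            # The schema declares the type attribute in the g namespace.
            gg = ET.Element(f"{{{G_NS}}}gg", {f"{{{G_NS}}}type": gtype})
            for part in parts:
                # Assumption: g:s stands in for whichever grapheme element applies.
                ET.SubElement(gg, f"{{{G_NS}}}s").text = part
            return gg
    return None

alt = group_from_token("KI/DI")
print(alt.get(f"{{{G_NS}}}type"), [child.text for child in alt])
```

A token with no joiner, such as a plain sign name, yields None and would be handled as an ordinary grapheme.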

gdl.rnc

gdl.rnc is a simple entry point so that users do not have to include the component schemas separately.

include "charset.rnc"
include "grapheme.rnc"
include "graphmeta.rnc"
include "words.rnc"

example.rnc

The test harness uses the following schema to embed GDL in a document element.

namespace gx = "http://emegir.info/gdl-example"
start = gx
include "gdl.rnc"
gx = element gx:gdl { grapheme+ }
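
For orientation, here is a minimal sketch of building a document of the shape this schema describes: a gx:gdl root holding one or more graphemes. The function name wrap_graphemes is hypothetical, and g:s is again only an assumed stand-in for a grapheme element from grapheme.rnc; the sketch mirrors the grapheme+ cardinality but does not validate anything against the schema.

```python
import xml.etree.ElementTree as ET

G_NS = "http://emegir.info/gdl"
GX_NS = "http://emegir.info/gdl-example"

def wrap_graphemes(graphemes):
    """Wrap grapheme elements in gx:gdl; grapheme+ means at least one is needed."""
    graphemes = list(graphemes)
    if not graphemes:
        raise ValueError("gx:gdl requires at least one grapheme")
    root = ET.Element(f"{{{GX_NS}}}gdl")
    root.extend(graphemes)
    return root

# Assumption: g:s stands in for a grapheme element defined in grapheme.rnc.
sign = ET.Element(f"{{{G_NS}}}s")
sign.text = "KA"
doc = wrap_graphemes([sign])
print(len(doc))  # 1
```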

Resources

Graphemes.pm
Perl module CDL::GDL::Graphemes.
charset.rnc
Charset Relax NG Compact Syntax grammar.
charset.rng
Charset Relax NG grammar.
example.rnc
Example Relax NG Compact Syntax grammar.
example.rng
Example Relax NG grammar.
gdl-ASL.xsl
XSL transform from GDL to ASL.
gdl-ATF.xsl
XSL transform from GDL to ATF.
gdl-HTML.xsl
XSL transform from GDL to HTML.
gdl-chunk-renderer.xsl
XSL transform from GDL to chunk-renderer.
gdl.rnc
GDL Relax NG Compact Syntax grammar.
gdl.rng
GDL Relax NG grammar.
gdl.xdf
XDF source for this documentation.
grapheme.rnc
Grapheme Relax NG Compact Syntax grammar.
grapheme.rng
Grapheme Relax NG grammar.
graphmeta.rnc
Graphmeta Relax NG Compact Syntax grammar.
graphmeta.rng
Graphmeta Relax NG grammar.
metadata.rnc
Metadata Relax NG Compact Syntax grammar.
test.plx
Simple test program for GDL.
words.rnc
Words Relax NG Compact Syntax grammar.
words.rng
Words Relax NG grammar.

Links

Top

Tutorial


Questions about this document may be directed to Steve Tinney (stinney at sas dot upenn dot edu).