CDL Projects

Steve Tinney
Version of 2009-11-22

Introduction

The basic unit in the CDL system is the project. This document gives a basic introduction to how projects are organized and how to create and work with them.

Fundamentals

Before we begin, it is useful to explain the fundamental components which are available to all projects.

Catalogue

While it may not be obvious, the most fundamental part of any project is the catalogue. It supplies the text metadata (at the very least the CDLI ID and a human-readable designation), which forms the organizational basis for all other components of the project.

The easiest way to provide a catalogue for a corpus is to derive it dynamically from the CDLI catalogue. However, some projects have special needs, and in those cases it is possible to tailor the catalogue processing software to the required metadata fields and values.

Corpus

Most projects relate in some way to a text corpus. The texts are entered or converted to the ATF format and may have translations. The project management software takes care of turning the ATF sources into the various formats used for web display and other purposes.

Glossaries

The ATF format supports lemmatization, which is the process of adding references to dictionary headwords into the texts. If a corpus is lemmatized, it can be used to generate glossaries directly from the texts with no glossary-editing at all. Normally, however, the glossary and text corpus are used together: the glossary is maintained and may be edited or augmented with bibliography, and the corpus is synchronized with the glossary so that all of the instances of terms are instantly reachable from the glossary articles.

Pager

The pager is the web interface which enables users to interact with the corpus. It understands how to present long lists of results in pages, and also how to assemble metadata, texts and translations into pages displaying individual texts.

The link to the pager display for a project `cams' is:

http://cdl.museum.upenn.edu/cgi-bin/cdlpager?project=cams

Website

A project may use the pager directly as the user interface, or it may have additional pages some of which contain links to the pager. The website may be on the same server as the project data, or it may be located elsewhere.

ANY FILES FOR A PROJECT WEBSITE WHICH ARE LOCATED ON THE CDL SERVER *MUST* BE PLACED IN THE websources/ DIRECTORY.

The link to the website for a project `cams' is:

http://cdl.museum.upenn.edu/cams

The initial installation redirects this URL to the pager.

Static HTML

The effect of a static HTML page for any given text in any given project is achieved via the following CGI call:

http://cdl.museum.upenn.edu/cgi-bin/cdlpager?prod=html&project=PROJECT&item=PQID

Where PROJECT is the project name and PQID is the P- or Q-number of the text. Thus, to retrieve the HTML version of SAA 01 01, with all SAA styling, the call would be:

http://cdl.museum.upenn.edu/cgi-bin/cdlpager?prod=html&project=saa&item=P336297
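
When many such links are needed, the call can be assembled programmatically. The following is a minimal Python sketch, not part of the CDL software; the function name static_html_url is hypothetical, and the only parameters it uses are the three shown in the call above (prod, project, item).

```python
from urllib.parse import urlencode

# Address of the pager CGI script, as given in this document.
PAGER = "http://cdl.museum.upenn.edu/cgi-bin/cdlpager"


def static_html_url(project: str, pqid: str) -> str:
    """Return the pager URL for the static HTML view of one text.

    Hypothetical helper: it only mirrors the URL pattern documented
    above and does not check that the project or text exists.
    """
    # Keep the parameter order used in the documented call.
    query = urlencode({"prod": "html", "project": project, "item": pqid})
    return f"{PAGER}?{query}"


print(static_html_url("saa", "P336297"))
```

For the arguments shown, this prints the SAA 01 01 link given above.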

This form is suitable for referencing in the <object> tag. A typical sample code fragment would look something like this (the example has been formatted to fit the width of the text; delete backslash-newline-space sequences to use this example in your HTML):

 <object type="text/html"
         data="http://cdl.museum.upenn.edu/cgi-bin/cdlpager\
               ?prod=html&project=saa&item=P334164"
         style="height: 1350px; width: 600px; display: block;">
    <p>You are seeing this message because your browser does not
    support the &lt;object&gt; tag.  The transliteration and
    translation of this text are available at <a
    href="http://cdl.museum.upenn.edu/cgi-bin/cdlpager\
          ?prod=html&project=saa&item=P334164"
    class="external" title="Link opens in new
    window">http://cdl.museum.upenn.edu/cgi-bin/cdlpager\
            ?prod=html&project=saa&item=P334164</a><span
 class="externallinktext">
 [http://cdl.museum.upenn.edu/cgi-bin/cdlpager\
  ?prod=html&project=saa&item=P334164]</span></p>
 </object>

An example of how to use this may be found on the Knowledge and Power Highlights page.
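
The line-splitting convention used in the fragment above (a backslash, a newline, then padding spaces) can also be undone mechanically instead of by hand. The following Python sketch is one way to do so; the function name join_split_lines is hypothetical and is not part of the CDL software.

```python
import re


def join_split_lines(text: str) -> str:
    """Delete every backslash-newline-whitespace sequence from text."""
    # A backslash at the end of a line marks a split; the spaces or
    # tabs that open the next line are padding and are removed too.
    return re.sub(r"\\\n[ \t]*", "", text)


split_url = (
    "http://cdl.museum.upenn.edu/cgi-bin/cdlpager\\\n"
    "               ?prod=html&project=saa&item=P334164"
)
print(join_split_lines(split_url))
```

Ordinary line breaks, such as those between attributes, are left untouched because they are not preceded by a backslash.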

Text Lists

The user-callable mechanism for emulating the lists of texts displayed by the pager is the adhoc producer, which has the following paradigmatic form (long lines split as above):

http://cdl.museum.upenn.edu/cgi-bin/cdlpager\
  ?prod=adhoc&caller=PROJECT&input=PQIDS&project=PROJECT

Where PROJECT is the project name and PQIDS is a comma-separated list of P- or Q-ids. To get a pager display of P334278 and P334279 in project SAA you might say (long lines split as above):

http://cdl.museum.upenn.edu/cgi-bin/cdlpager\
 ?prod=adhoc&caller=saa&input=P334278,P334279&project=saa
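
A list URL of this kind can be generated from a list of ids. The Python sketch below follows the paradigmatic form given above (prod, caller, input, project); the function name adhoc_url is hypothetical, and the lower-case project name is an assumption based on the other URLs in this document.

```python
from urllib.parse import urlencode

# Address of the pager CGI script, as given in this document.
PAGER = "http://cdl.museum.upenn.edu/cgi-bin/cdlpager"


def adhoc_url(project: str, pqids: list) -> str:
    """Return an adhoc-producer URL listing the given P- or Q-ids.

    Hypothetical helper: it mirrors the documented URL pattern and
    does not verify that the ids belong to the project.
    """
    params = {
        "prod": "adhoc",
        "caller": project,
        "input": ",".join(pqids),  # ids are comma-separated
        "project": project,
    }
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return f"{PAGER}?{urlencode(params, safe=',')}"


print(adhoc_url("saa", ["P334278", "P334279"]))
```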

Structure

Users

The project organization is intended for use with multi-user systems. At the operating system level, each project is a user with a password and a home directory.

Subprojects

Projects can also own subprojects, which means that regular users on a system can have their own personal projects.

Folders

The files used by a project live in several different folders (aka directories). The most important of these are:

sources/
Contains ATF files, conventionally with a .atf or .txt extension.
lib/
Contains project configuration files and the glossary files.
web/
The live website folder for the project; files should not be edited or placed directly in this folder because it is recreated every time the project is rebuilt.
websources/
Contains web pages and web configuration files which are copied into web/ when the project is rebuilt.
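
As an illustration of this layout, the Python sketch below reports which of the four folders are missing from a project home and lists the files in sources/ that carry the conventional extensions. It is not part of the CDL software; the function names are hypothetical, and matching only .atf and .txt is an assumption drawn from the convention stated above.

```python
from pathlib import Path

# The four folders described above.
FOLDERS = ("sources", "lib", "web", "websources")


def missing_folders(home: Path) -> list:
    """Return the names of the expected folders that do not exist."""
    return [name for name in FOLDERS if not (home / name).is_dir()]


def atf_sources(home: Path) -> list:
    """Return files in sources/ with a .atf or .txt extension, sorted."""
    sources = home / "sources"
    return sorted(p for p in sources.iterdir() if p.suffix in {".atf", ".txt"})


# Harmless demonstration: report what is missing in the current directory.
print(missing_folders(Path.cwd()))
```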

Management

Interfaces

Two interfaces are presently provided for project management tasks: the command-line interface (CLI) and the menu-driven Emacs interface. The latter is documented on a separate page (see Projects and Emacs under More Documents).

cdlproject

Access to the CLI is generally provided via the Secure Shell (SSH) program ssh, either from the command line on the user's computer or from a graphical user interface.

Once logged in as the project-user on the server, most tasks are accomplished via the program cdlproject. Each of the following headings corresponds to an invocation of cdlproject followed by the heading--for example, the heading `rebuild' means that in the CLI you type:

cdlproject rebuild

N.B.: you must be in the home folder/directory when using cdlproject.

catalog
update the catalogue installation.
check
perform various checks; see 'check' for more information.
clean
remove unneeded files; currently just */*~ (i.e., emacs backup files).
harvest
collect new words from the corpus and place them in lib/<LANG>.new; the new material can be reviewed but corrections *MUST* be made in the corpus sources or in lib/<LANG>.glo (the main glossary file).
merge
redo the harvest and then merge the lib/<LANG>.new files with the main glossary files.
rebuild
rebuild the corpus, glossaries and website; this does not do any harvesting or merging.
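
For scripted use, the invocations above can be wrapped so that only the listed headings are accepted and the command always runs from the home directory. The Python sketch below is a hypothetical wrapper, not part of the CDL software; it assumes that cdlproject is on the PATH of the project-user and that its exit status is meaningful, neither of which is stated in this document.

```python
import subprocess
from pathlib import Path
from typing import Optional

# The headings documented above.
SUBCOMMANDS = ("catalog", "check", "clean", "harvest", "merge", "rebuild")


def cdlproject_argv(subcommand: str) -> list:
    """Return the argument list for one documented cdlproject call."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown cdlproject subcommand: {subcommand!r}")
    return ["cdlproject", subcommand]


def run_cdlproject(subcommand: str, home: Optional[Path] = None) -> int:
    """Run cdlproject from the home directory and return its exit status."""
    # cdlproject must be run from the home directory, so cwd is set
    # explicitly rather than inherited from the caller.
    completed = subprocess.run(cdlproject_argv(subcommand), cwd=home or Path.home())
    return completed.returncode
```

Calling run_cdlproject("rebuild") while logged in as the project-user would then correspond to typing cdlproject rebuild in the home directory.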

Procedures

Catalogue

If you are using the CDLI catalogue then no action is required. If you are using your own catalogue, the project must first be correctly configured; the catalogue updates must then be placed in the catalogue folder with the file name(s) the project has been configured to use.

There is a separate page about setting up your own project catalogue.

Corpus

Transliterations should be placed in the sources/ folder. There can be one big file, one file per text, or something in between; the rebuild process uses all the relevant files in sources/.

When new texts are added, simply run:

cdlproject rebuild

to update the website, indexes, etc.

Glossaries

The recommended workflow for glossary building is:

  1. begin with text data which is ATF-clean.
  2. lemmatize the texts; ensure they are ATF-clean with lem-checking, then add them to the 'sources' directory.
  3. run 'cdlproject harvest'.
  4. review lib/*.new and fix sources/*.atf or lib/*.glo as required.
  5. run 'cdlproject merge' (this automatically redoes the harvest).
  6. run 'cdlproject rebuild' if all seems well.
  7. if something goes wrong, you can retrieve the previous *.glo file from the 'backups' directory; note that multiple 'cdlproject merge' commands on the same day overwrite the same backup file.
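
Because merges on the same day overwrite the same backup, it may be worth keeping a separate time-stamped copy of the glossary files before step 5. The Python sketch below shows one way to do this. It is not part of the CDL software: the function name and the glo-snapshots folder are hypothetical, and the only detail taken from this document is that the glossaries are the lib/*.glo files.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_glossaries(home: Path, dest_name: str = "glo-snapshots") -> list:
    """Copy every lib/*.glo file to a time-stamped name and return the copies.

    dest_name is a hypothetical folder, separate from the 'backups'
    directory maintained by cdlproject.
    """
    # Second-level resolution; two snapshots taken within the same
    # second would still share a name.
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    dest = home / dest_name
    dest.mkdir(exist_ok=True)
    copies = []
    for glo in sorted((home / "lib").glob("*.glo")):
        target = dest / f"{glo.stem}-{stamp}{glo.suffix}"
        shutil.copy2(glo, target)  # copy2 also preserves timestamps
        copies.append(target)
    return copies
```

Running snapshot_glossaries(Path.home()) as the project-user before 'cdlproject merge' would leave one dated copy per glossary in the snapshot folder.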

More Documents

Projects and Emacs

Project Catalogs

Project Configuration

Resources

projects.xdf
XDF source for this documentation.

Questions about this document may be directed to Steve Tinney (stinney at sas dot upenn dot edu).