Common grammar manifesto
An article from the Loria Wiki.
We propose a common format for a lexicon which can be exploited by both a linguistic parser and surface realiser, as well as a mechanism for anchoring lexical items to grammatical entries. This version of the manifesto assumes that we are working with TAG grammars.
This document is a summary of a series of meetings between the following LeD and Calligramme people:
- Azim - (LLP) TAG parser in Java
- Guy - (Leopar) IG parsing in OCaml
- Claire and Eric - (GenI) TAG generation in Haskell
- Claire and Yannick - TAG parsing in DyAlog
Lexicon: three layer architecture
We will use a basic three-layered model containing morphological, syntactic and grammatical information.
- morphological lexicon - from inflected form to (lemma,features)
mangeons -> manger, [cat=v pers=1 num=pl mod=ind tense=pres]
- syntactic lexicon - from lemma to (interface, features,semantics)
manger -> [family=n0Vn1 passive=+]
- grammar - entries of the form (description, interface, semantics)
      s:E
       |
 +-----+-----+
 |     |     |
n:X    v    n:Y
[family=n0Vn1, passivisation=-, sempred=P]
P(E) Θ1(E,X) Θ2(E,Y)
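As a rough illustration of the three layers, the sketch below encodes the examples above as plain Python data and chains the first two lookups. The names MORPH_LEXICON, SYNTACTIC_LEXICON, GRAMMAR and analyses are hypothetical, and the tree description is reduced to a placeholder string; none of this is the agreed lexicon format.

```python
# Layer 1: inflected form -> list of (lemma, morphological features).
MORPH_LEXICON = {
    "mangeons": [("manger", {"cat": "v", "pers": "1", "num": "pl",
                             "mod": "ind", "tense": "pres"})],
}

# Layer 2: lemma -> list of interfaces (one entry per tree family).
SYNTACTIC_LEXICON = {
    "manger": [{"family": "n0Vn1", "passive": "+"}],
}

# Layer 3: grammar entries of the form (description, interface, semantics).
GRAMMAR = [
    {
        "description": "s:E(n:X, v, n:Y)",  # placeholder for a real tree
        "interface": {"family": "n0Vn1", "passivisation": "-", "sempred": "P"},
        "semantics": ["P(E)", "Θ1(E,X)", "Θ2(E,Y)"],
    },
]


def analyses(form):
    """Chain layers 1 and 2: inflected form -> (lemma, features, interface)."""
    return [
        (lemma, features, interface)
        for lemma, features in MORPH_LEXICON.get(form, [])
        for interface in SYNTACTIC_LEXICON.get(lemma, [])
    ]


result = analyses("mangeons")
print(result[0][0], result[0][2])
```

Selecting trees from GRAMMAR for each returned interface is the job of the filtering step described next.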
Filtering versus enrichment
The basic policy is to cleanly separate the notions of tree filtering and enrichment:
Filtering is used to select trees; we implement it as unification between the interfaces of the syntactic lexicon and the grammar. Given a lemma, the filtering step selects all trees whose interface unifies successfully with the lemma's interface. The attributes used for filtering are family (at least) and typically FIXME:list of attributes?. Lemmas may appear more than once in a lexicon. For example, to represent a lemma which selects trees from multiple families, we use multiple entries, each selecting a different family.
Note that filtering is done by unification over open feature structures (as opposed to subsumption, and to closed structures). This allows for lexical entries which do not constrain all attributes in the tree interface, as well as trees whose interfaces do not contain all attributes specified by the lexical entry. Note also that, as a consequence, the filtering mechanism can be (ab)used to add information to the tree.
For example, if we do not care (or do not know) whether the lexical item is passive, we do not add this attribute to its interface, and both passive and active trees will be selected. Conversely, we might require a passive form by setting passive=+ in the lexical item's interface; but if passivisation is not a relevant concept for a tree (passive is not even in its interface), unification with that tree's interface should succeed anyway.
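The two cases just described can be sketched with a small unifier over flat, open feature structures. This is a minimal illustration under simplifying assumptions: values are atomic constants (variables are not modelled), and the tree names and the function names unify and select_trees are hypothetical.

```python
def unify(fs1, fs2):
    """Unify two flat, open feature structures given as dicts.

    Returns the merged structure, or None if an attribute present in
    both has conflicting values. An attribute missing from one side is
    unconstrained there, so it never causes failure.
    """
    merged = dict(fs1)
    for attr, value in fs2.items():
        if attr in merged and merged[attr] != value:
            return None
        merged[attr] = value
    return merged


def select_trees(lexical_interface, grammar):
    """Keep the trees whose interface unifies with the lexical interface."""
    return [name for name, interface in grammar
            if unify(lexical_interface, interface) is not None]


# Hypothetical tree names and interfaces, for illustration only.
GRAMMAR = [
    ("active_tree", {"family": "n0Vn1", "passive": "-"}),
    ("passive_tree", {"family": "n0Vn1", "passive": "+"}),
    ("adjective_tree", {"family": "adj"}),  # passive is not in its interface
]

# passive left unspecified: both n0Vn1 trees are selected.
print(select_trees({"family": "n0Vn1"}, GRAMMAR))
# passive=+ required: the active tree is filtered out.
print(select_trees({"family": "n0Vn1", "passive": "+"}, GRAMMAR))
# A tree that never mentions passive still unifies with passive=+.
print(select_trees({"family": "adj", "passive": "+"}, GRAMMAR))
```

Because unify returns the merged structure, the same call also shows how filtering can add information: the result of the last case carries passive=+ even though the tree interface never mentioned it.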
Flat or recursive feature structures?
One thing which we have not yet decided is whether the interface should be a flat feature structure, or whether values may themselves be feature structures. I (Eric) believe that a flat, one-level structure was our tentative decision.
- SelectTAG allows for recursive interfaces.
- GenI only handles one-layer interfaces
- SemFraG seems to only use a one-layer interface.
Enrichment is the process of adding lexical information to trees; we implement it by the application of path equations such as subj.hum=+.
- Deprecated: For convenience, path equations are applied to the anchor by default. That is, the equation pers=1 is equivalent to anchor.pers=1.
- We don't actually seem to be using this anywhere. Can we remove it from the manifesto? -- Kowey, 27 Apr 2006, 19:07 (CEST)
The motivation for using path equations (instead of a global tree fs) is to simplify grammar development by sparing us from having to (1) predict every attribute that must appear in the tree and (2) propagate these attributes by coindexation in the interface. Note, on the other hand, that the interface may also be used for enrichment as well as filtering! See #Semantics
- Enrichment occurs by unification. If there is a unification error, for example because two path equations assign different values to the same variable, then enrichment fails and the tree is not selected.
- If a node name used in a path equation does not appear in the tree, we simply ignore that equation (and print a warning if we're very good). The tree still succeeds.
- Implementation notes:
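The enrichment rules listed above can be sketched as follows. This is only an illustration, not the behaviour of any of the tools named in this document: a tree is reduced to a hypothetical mapping from node names to flat feature dicts, values are atomic constants, and the function names parse_equation and enrich are assumptions.

```python
import warnings


def parse_equation(equation):
    """Split a path equation 'node.attr=value' into (node, attr, value)."""
    path, value = equation.split("=", 1)
    node, attr = path.strip().split(".", 1)
    return node, attr, value.strip()


def enrich(tree, equations):
    """Apply path equations to a tree given as {node name: feature dict}.

    Returns an enriched copy of the tree, or None when two values clash.
    Equations naming a node absent from the tree are skipped with a warning.
    """
    enriched = {node: dict(features) for node, features in tree.items()}
    for equation in equations:
        node, attr, value = parse_equation(equation)
        if node not in enriched:
            warnings.warn(f"no node named {node!r}; ignoring {equation!r}")
            continue
        # setdefault keeps an existing value, so a mismatch means a clash.
        current = enriched[node].setdefault(attr, value)
        if current != value:
            return None  # unification failure: the tree is not selected
    return enriched


tree = {"subj": {"cat": "n"}, "anchor": {"cat": "v"}}
print(enrich(tree, ["subj.hum=+"]))                # hum=+ added to subj
print(enrich(tree, ["subj.hum=+", "subj.hum=-"]))  # clash: None
print(enrich(tree, ["obj.hum=+"]))                 # unknown node: unchanged
```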
Anchors and co-anchors
The enrichment mechanism can be used to set the lexical entries of the anchor and its co-anchors. Nodes could be assigned the names anchor, coanchor1, coanchor2, etc., while the corresponding lexical entries would provide path equations of the form anchor.flex=look, coanchor1.flex=up, coanchor2.flex=up. Note that some trees may also contain hard-coded lexemes that serve a purely grammatical role independent of the lexicon. For example, the family n0Van1 in French would have the preposition "à" hard-coded into its trees.
- we assume that no two nodes in a tree will have the same name.
- the attribute that you use (lex and so forth) is completely up to you, as the process of adding a co-anchor merely consists of adding an attribute to the feature structure.
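A small sketch of how a lexical entry might fill in the anchor and a co-anchor: the tree below, its node names other than anchor and coanchor1, and the function set_lexemes are hypothetical. The path equations anchor.flex=look and coanchor1.flex=up are written as (node, attribute, value) triples, and clash checking is left out to keep the example short.

```python
# Hypothetical verb-particle tree: node name -> features.
# Node names are assumed to be unique within a tree.
TREE = {
    "subj": {"cat": "n"},
    "anchor": {"cat": "v"},
    "coanchor1": {"cat": "p"},
    "obj": {"cat": "n"},
}

# Lexical entry for "look up": anchor.flex=look, coanchor1.flex=up.
LOOK_UP = [("anchor", "flex", "look"), ("coanchor1", "flex", "up")]


def set_lexemes(tree, equations):
    """Return a copy of the tree with each equation's attribute added."""
    result = {node: dict(features) for node, features in tree.items()}
    for node, attribute, value in equations:
        if node in result:  # equations for absent nodes are ignored
            result[node][attribute] = value
    return result


lexicalised = set_lexemes(TREE, LOOK_UP)
print([(node, fs["flex"]) for node, fs in lexicalised.items() if "flex" in fs])
```

Nothing in set_lexemes depends on the attribute being called flex, which is the sense in which the attribute name is up to the grammar writer.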
Semantics
Trees in the grammar are associated with a semantics of the form P(E), Θ1(E,X), Θ2(E,Y). This semantics will be instantiated from the lexicon through enrichment on the tree interface and feature percolation to the semantics. A typical interface would be as follows:
[family=n0Vn1 passivisation=- sempred=P event=E theta1=X theta2=Y]
Note: The semantics for a lexical item will be specified separately from its interface using a more compact representation. Instead of writing the following path equations
interface.sempred=love, interface.theta1=x, interface.theta2=y
we will write something like love(l,x,y). See #Lexicon format for details.
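One possible reading of the compact notation is sketched below. It assumes a convention that this document does not fix: the predicate fills sempred, the first argument fills event, and the remaining arguments fill theta1, theta2, and so on, following the attribute names of the typical interface shown above. The function name literal_to_equations is hypothetical.

```python
import re


def literal_to_equations(literal):
    """Expand a literal such as 'love(l,x,y)' into interface path equations.

    Assumed convention: predicate -> sempred, first argument -> event,
    remaining arguments -> theta1, theta2, ...
    """
    match = re.fullmatch(r"\s*(\w+)\(([^()]*)\)\s*", literal)
    if match is None:
        raise ValueError(f"not a literal: {literal!r}")
    predicate, arg_text = match.groups()
    args = [arg.strip() for arg in arg_text.split(",") if arg.strip()]
    equations = {"interface.sempred": predicate}
    if args:
        equations["interface.event"] = args[0]
    for index, arg in enumerate(args[1:], start=1):
        equations[f"interface.theta{index}"] = arg
    return equations


for path, value in literal_to_equations("love(l,x,y)").items():
    print(f"{path}={value}")
```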
A tree semantics may contain multiple literals. How to handle such semantics is not yet completely clear. The current policy is to assume that lexical items with multiliteral semantics do not exist.
Presumably, we could define attributes that allow lexical entries to instantiate secondary literals if they exist, but what makes things unclear is that the linguist will not always be able to predict when a tree will have multiliteral semantics. For example, the word "expensive" could have the semantics cost(E,X), high(E), but there would be nothing to distinguish it from any other uniliteral adjective. One solution discussed was to allow the attribute sempred to refer to a list of predicates, but it is not completely clear that this is adequate.