An introduction to Chomsky's recursive theory of grammar.
The closed world assumption.
The Chomp commands and suggested conventions.
The Chompy classes.
Chomsky (1957) presented a structured syntactic theory of grammar. The aim of syntactic theories is to model of a system that enables us to to tell which sentences are part of our language and which are not.That is the theory tried to explain how we can tell if a series of worlds is a permissible sentence in our language or not.
However a difficulty with creating such a model is that there are an in finite number of possible grammatical sentence, and the model needs to be able to identify these from the infinite number of non-grammatical sentences that may also be generated. In order to address this problem Chomsky suggested that sentences are comprehended using a recursive procedure. A recursive procedure is one which refers to itself an a repetitive manner. Hence allowing for infinite generatively via a finite set of procedures. Recursion is used widely in computer programming and forms the basis of Chompy, which is in effect a recursive engine.
The easiest way to explain how a recursive model of language works is with an example. If we imagine that there is a very simple language in which all valid sentences have to components A and B which can be written as:
"A valid sentence has an A part and a B part"
Sentence = A + B (rule 1.1)
"A valid B part may be mad of an A part and a further B part"
B = A + B (rule 1.2)
With these two simple rules in the language we can then decide what combinations of 'A' and 'B' make valid sentences. For example we know that a combination of "A B" is a valid sentence since it may be rewritten (or replaced) by "Sentence" according to rule 1:
"A B" can be rewritten as Sentence by rule 1.1
Also we know that "A A B" is a valid sentence since the last two components may be replaced by "B" according to the second rule, to make "A B" and as we already know this is a valid sentence:
"A A B" can be rewritten as "A B" by rule 1.2
"A B" can be rewritten as Sentence by rule 1.1
By repeating this procedure we can tell that the following are all valid sentences according to the grammar of our simple language:
and hence by continuing this series we can see that an infinite number of combinations of the components 'A' and 'B' can be identified as grammatically correct in the simple language. What's more non-grammatical combinations of 'A' and 'B' may also be identified. For example consider "A B A":
"A B A" can NOT be rewritten as a valid sentence by any rule.
"A B A" can be rewritten as "B A" b rule 1.2
"B A" can NOT be rewritten as a valid sentence by any rule.
So any non-grammatical sentence can also be recognised as such since it can not be reduced to a valid sentence. Therefor using a finite set of rules an infinite set of grammatical or non-grammatical sentences may be identified.
If we next consider a slightly more complex set of rules, we can see how this theory may be applied to identifying some grammatical sentential forms in the English language. A common sentential structure found in English sentences is a noun phrase (NP) followed by a verb phrase (VP), which may be written in rule form as:
SENTENCE = NP + VP rule 2.1
Next two simple noun phrase structures in the English language are; a noun phrase may be comprised of a single noun and a noun phrase may be comprised of an article and a noun. Written in rule form that is:
NP = NOUN rule 2.2
NP = ARTICLE + NOUN rule 2.3
While a simple verb phrase may be comprised of a verb followed by a noun phrase, so in rule form:
VP = VERB + NP rule 2.4
Finally then some examples of English nouns are, 'girl', 'vase', 'cup' and 'computer'. An example of a article is 'the' and examples of verbs are, 'threw', 'ran', 'passed'. Written in form they are:
NOUN = girl rule 2.5
NOUN = vase rule 2.6
NOUN = cup rule 2.7
NOUN = computer rule 2.8
ARTICLE = the rule 2.9
VERB = threw rule 2.10
VERB = ran rule 2.11
VERB = passed rule 2.12
While a comprehensive set of grammatical rules for the English language would be far more elaborate and complex than this, these few rule can still be used to identify a large number of grammatical sentences. For example the sentence "the girl ran the computer" may be recognised as grammatically correct as follows:
"the girl ran the computer" may be rewritten as
"ARTICLE NOUN VERB ARTICLE NOUN"
using rules 2.9, 2.5, 2.11, 2.9 and 2.8
"ARTICLE NOUN VERB ARTICLE NOUN" may then be rewritten as
"NP VERB NP" by rule 2.3
"NP VERB NP" may then be rewritten as
"NP VP" by rule 2.4
and finally "NP VP" may then be rewritten as "SENTENCE"
by rule 2.1
While the sentence "the girl computer ran" may not be identified as a valid sentence using the above rules (this task is left as an exercise for the reader).
These re-write rules also allow us to identify some of the underlying structure of sentences. For example the sentences "the girl ran the computer", "the girl threw the cup" and "the vase passed the girl" may all be re-written as "NP VERB NP", which demonstrates that they all derive from the same basic sentential structure. This information may be used by linguists who are interested in relating syntax to semantics (or meaning). Also if a computer program was able to recognise that the sentences "draw a horse", "plot a long thin box" and "print some seagulls", were of the general form "DRAW NP" then a general procedure may be designed to deal with this form of request.
Considering these few simple examples it can be seen how useful Chomsky's theory is. This is why the approach has had such an impact within the areas of linguistics, psycholinguistics, artificial intelligence and user interface design.
2. The closed world assumption.
The closed world assumption is important to bare in mind when using Chompy. As we saw in the last section Chompy works by trying to re-write a source string into the form of the target string by recursively applying the rules in the Chompy world. That is in the above example Chompy tries to re-write the sentence "the girl ran the computer" into the form "SENTENCE" by recursively applying the rules in the world. When Chompy is unable to do this it reports that the source string is not of the form of the target string. For example the sentence "the computer the girl passed" can not be re-written into the form "SENTENCE" given the above rules. Even though we may recognise this sentence as grammatically correct. This is because Chompy works on the closed world assumption, that is it assumes all the forms that exist in the Chompy world rules or can be derived from the Chompy world rules, are all the forms that exist in the Chompy world.
New forms however may be temporarily added to the Chompy world when they are included in a query. For example again using the rules in the previous section, if Chompy is asked:
is the sentence "when the girl ran the computer" of the form "when SENTENCE"
Chompy will return YES although the component 'when' is not part of the Chompy world. This is because the new component is present in both the source and target sentential forms such that the source sentence can be re-written into the form of the target sentence. So Chompy does not really consider the new component as long as the rest of the sentence can be re-written until the source and target sentences are the same.
3. The Chomp commands and suggested conventions.
All command line input used with Chomp is converted to upper case, hence 'about', 'ABOUT' and 'aBoUt' are all recognised by Chompy as 'ABOUT'. Upper case will be used here to distinguish all commands for clarity.
This command displays the online help file, which may also be vied as a text file ChompHelp.txt which can be found in the Chompy package. The online help is displayed using the paused output format. This means that the PAUSE command can be used to set how many lines of the help file are show before you are prompted for a key press. For example:
Which will display the help file 10 lines at a time.
The ABOUT command displays information about the current Chompy world that is being interfaced by Chomp. For example:
Welcome to Chompy :) ====================== Version 1.0 Copyright Neil Blue (1998) Chompy is a natural language parser. The current world contains: 0 rules.
The ADD command is used with one or more arguments to add rules either directly or indirectly (as a world file) to the current Chompy world. Rules in Chompy are composed of 'atoms' and operators. Atoms are the smallest whole units recognised by Chompy. An atom is made up of one or more alphanumeric or underscore '_' characters with no spaces. For example:
There are two operators recognised by Chomp, they are the equivalence operator '=' and the addition operator '+'. They are used to format rules with one atom as the left hand argument and one or more atoms joined with the addition operator forming the right hand argument(s). For example:
TheLeftHandArgument = RightHandArgument1 + RightHandArgument2
:->ADD TheLeftHandArgument = RightHandArgument1 + RightHandArgument2 Rules parsed: 1 OK Rules added: 1 OK :->LIST 0 THELEFTHANDARGUMENT = RIGHTHANDARGUMENT1 + RIGHTHANDARGUMENT2 :->
As we can see here, once rules have been added, they can be viewed using the LIST command.
Due to the recursive nature of Chompy, it does not distinguish between real words (e.g. 'girl', ' 'cup', 'computer', 'ran') and sentence structure descriptive terms (e.g. 'NOUN', 'NP', 'VERB'). Also sentence structure descriptive terms can I some circumstances be used as real words, for example in the sentence "what is a noun". In order to distinguish between sentence structure descriptive and real world terms it is advised the an underscore character is used to distinguish sentence structure descriptive:
|Sentence structure descriptive||Real world terms|
Note: Since all input is converted to upper case by Chomp, case can not be used to distinguish between sentence structure descriptive and real world terms.
As well as adding rules directly, the ADD command may be used to add a world text file. In which case the filename (with no spaces) is the only argument that is supplied with the command. For example:
World text files may have been previously created by Chomp and saved using the SAVE command, or externally using an ASCII text editor. When you create or save a Chompy world, it is advised that you use the '.wld' extension, however this is optional and any valid extension such as '.txt' may be used instead. After any rule(s) have been added to Chomp, a message is displayed to show that the rules have been parsed correctly and have been added to the current world. If an error occurs when parsing the input, then an error message is displayed (with the file name and line number where applicable) and the rules are not added:
:->ADD small.wld Rules parsed: 8 OK Rules added: 8 OK
Instead of adding rules directly to Chomp, world files may be created using an ASCII text editor and then added to the current world. In which case the same rule structure must be followed as when adding rules directly, and each rule or file name should appear on a separate line. If a file name is include in the file, then the rules from the specified file will be inserted into the original file at that point. Filenames may also be nested, however no internal check is made for recursively included files. Also single line comments may be included in a world file (and also in the command line, however there is little point in this). Single line comments begin with a double forward slash '//' and any items following the '//' are ignored. Here is an example world file:
file named basic.wld:
// basic.wld file
// this is a whole line comment
_S = _NP + _VP // this is a single line comment following a rule
_NP = _ART + _NOUN
_VP = _VERB
_ART = the // the only real word in this file
clothes.wld // a world file to include at this point
_S = _NOUN // a very simple rule, this will be the tenth rule
file named clothes.wld which will be included in the basic.wld beginning at the fifth rule:
_VERB = wear
_VERB = tie
_NOUN = tie
_NOUN = socks
_NOUN = jumper
Here we can see how a second file may be included in the first. However you are more likely to add a basic world structure file using one ADD command and a second specific one using a further ADD command:
After removing the clothes.wld reference from basic.wld and making it more reusable. Also note that when rules are added to Chomp, they are numbered beginning with 0 (zero) to the first rule will be rule numbered 0 (zero) and the fifth rule will be rule numbered 4.
Examples of the ADD command:
A simple rule added directly to Chomp
ADD _SENTENCE = _NP + _VP + _NP
A file added which is present in the current directory
A file added with an absolute path name
A file added with a relative path name
The DELETE command on its own can be used to remove all the rules currently in the Chompy world. DELETE may also be used with a valid rule index, to remove one rule. For example:
will clear the whole world, and
will remove rule 2 (the third rule counting from zero). You will always be prompted to confirm before any rules are deleted.
Any of these three commands can be used to exit Chomp after a confirmation prompt.
The IS command is used to send "isa" queries to the Chompy world. The "isa" query asks:
IS <sentence1> a form of <sentence2> ?
The query is made in the form:
IS "<source sentence>" [A] "<target sentence>"
where <source sentence> and <target sentence> comprise a series of valid atoms (if valid atoms are not used then the query will return false). The 'A' is optional and is only included here for clarity. Following the leading 'IS' command, Chomp then looks for the first two occurrences of quote enclosed strings. Hence:
IS "this" "that"
IS the sentence "this" of the form "that"
are both valid queries.
For example using the "small.wld":
Welcome to Chomp :) =================== Chomp is the command line interface for Chompy. For a list of commands type: help :->ADD small.wld Rules parsed: 8 OK Rules added: 8 OK :->IS "the dog ran" A "_S" Yes! The string "THE DOG RAN" is a form of "_S". This query took: 0.06 seconds to complete. :->
The query, IS "the dog ran" A "_S", returns true and the query took 0.06 second to evaluated. The timer can be disabled using the TIMER OFF command.
When Chompy is given an "isa" query it recursively tries to re-write the source sentence into the same form as that of the target sentence. If the source sentence can be re-written in the form of the target sentence then Chompy will return 'true' and Chomp will display this. However if the source sentence can not be re-written in the form of the target sentence then Chompy will return 'false' and Chompy will display this result also. The meaning of 'false' is that Chompy can not re-write the source sentence in the form of the target sentence. When you consider the meaning of a 'false' result, you should consider that Chompy uses the closed world assumption.
In trying to re-write the source string, Chompy recursively attempts to apply each rule from the world to the source string in the order the rules appear in the LIST of rules. For example the above query is resolved in the following way:
|Source String||Rule used|
|the dog ran||none|
|_ART dog ran||_ART = the|
|_ART _NOUN ran||_NOUN = dog|
|_ART _NOUN _VERB||_VERB = ran|
|_NP _VERB||_NP = _ART + _NOUN|
|_NP _VP||_VERB = _VP|
|_S||_S = _NP + _VP|
As noted earlier, the IS command may also include atoms that don't appear in the world, in which case they are temporarily included in the world during the execution of that query only. For example, if Chompy is queried:
IS "when the dog ran" A "when _S"
Chompy will return YES although the component 'when' is not part of the Chompy world. This is because the new component is present in both the source and target sentences such that the source sentence can be re-written into the form of the target sentence. Hence Chompy really ignores the new component as long as the rest of the sentence can be re-written until the source and target sentences are the same.
When the LIST command is used with no arguments it displays an indexed list of all the rules in the Chompy word:
:->LIST 0 _S = _NP + _VP 1 _S = _NP 2 _NP = _ART + _NOUN 3 _NP = _NOUN 4 _ART = THE 5 _NOUN = DOG 6 _VP = _VERB 7 _VERB = RAN
The list may be directed to an output file by including a valid file name (with no spaces) as a single argument:
:->LIST output.txt Output to file: OUTPUT.TXT
The output file can then be vied as a text file.
The PARSE command is used in the same way as the ADD command, without adding any rules to the world. The PARSE command may be used to check that a rule or file is correctly formatted before adding it to the world:
:->PARSE small.wld Rules parsed: 8 OK
The PAUSE command can be used to view or set the current state of paused output. The PAUSE command without any arguments will report the current pause setting:
:->PAUSE Current pause set to: 2 lines.
An integer argument is used with the PAUSE command to change the current pause setting (default is 20). For example:
changes the current pause setting to 10 lines. If pause is set to 0 (zero) then the output is not paused.
The SAVE command used without any arguments will save the current Chompy world using the last <filename> that was specified. SAVE may be used with a single argument that is a valid file name (with no spaces). In which case the current Chompy world is saved using the given file name. For example:
:->SAVE new.wld 'NEW.WLD' saved!
:->SAVE 'NEW.WLD' saved!
The TIMER command can be used to view or set the current state of timed queries. If the timer is on then the length of time a query takes to process will be displayed (see The IS command). The TIMER command without any arguments will report the current timer setting:
:->TIMER Timer is on.
The default setting is ON. TIMER may be used with either the ON of OFF argument to set the current state of the timer:
:->TIMER ON Timer is on.
:->TIMER OFF Timer is off.
Using the TIMER function in Chomp, it is possible to record how long Chompy takes to process a query. The fist "isa" query that is made carry the additional initialisation over heads, so it's time should not be used for comparison:
:->is "the dog ran" a "_S" Yes! The string "THE DOG RAN" is a form of "_S". This query took: 0.11 seconds to complete. :->is "the dog ran" a "_S" Yes! The string "THE DOG RAN" is a form of "_S". This query took: 0.05 seconds to complete.
The rest of this section will assume that Chompy has already been initialised with its first query.
In an everyday context when we are presented with sentences it is often quicker to identify those that are not grammatically correct than those that are. Since in order to find a sentence grammatically correct one needs to verify the entire sentential structure, while for non-grammatical sentences one only needs to find a single non-grammatical future in the sentence. This can be demonstrated using Chomp, by presenting grammatical and non-grammatical sentence queries:
using the small.wld file
:->IS "the pig ran" A "_S" No! The string "THE PIG RAN" is not a form of "_S". This query took: 0.0 seconds to complete. :->IS "the dog ran" A "_S" Yes! The string "THE DOG RAN" is a form of "_S". This query took: 0.05 seconds to complete.
Here we can see how an unrecognised world may allow for the rapid rejection of a sentence which can not be recognised as grammatically correct. Similarly a badly formed sentence, containing only recognised terms may also be rapidly rejected:
:->IS "dog the ran" A "_S" No! The string "DOG THE RAN" is not a form of "_S". This query took: 0.0 seconds to complete.
However if a non-grammatical sentence contains recognised terms and is similar in structure to a valid sentence then it may take longer than usual to be rejected, and may also take longer to reject than an similar valid sentence takes to recognise:
a double sentence structure is used here in order to present a longer non-grammatical query:
:->IS "the dog ran the dog ran" A "_S _S" Yes! The string "THE DOG RAN THE DOG RAN" is a form of "_S _S". This query took: 0.33 seconds to complete. :->IS "the dog ran the the the" A "_S _S" No! The string "THE DOG RAN THE THE THE" is not a form of "_S _S".
This is because Chompy will almost be able to re-write the source sentence in the form of the target sentence, however when it fails late and then tries another combination of re-write rules almost to completion.
5. The Chompy classes.
In the current version of Chompy (version 1.0) the Java classes are supplied in there normal package structure as a resource for other Java applications. that is they have not been compiled into a .JAR or .ZIP archive. The documents relating to these classes are also included in the Chompy package. In order to use the chompy package, the tools package must also be included in the classpath.
6 Future plans.
Chompy is very much in a state of development, and any suggestions are welcome :-).
While the speed difference between the processing of grammatical and non-grammatical sentences may be useful for demonstration purposes, it makes Chompy very difficult to use in any other applications. Future plans therefore include making Chompy more efficient, with the use of some optional features.
The command line interface prevents Chomp from being run as an applet, so future plans also include designing a GUI interface for Chomp.
New queries and the ability to generate grammatical sentences will also be added to future versions as well as the ability to display the re-write rules used in order to verify a given query.
Chomsky, N.(1957). Syntactic structures. Gravenhage: Mouton.
Forrester, M. A. (1996). Psychology of Language. Sage.
Bratko, I. (1990). PROLOG Programming For Artificial Intelligence. Addison-Wesley.
Thanks go out to Eyepoppers for there fine wallpapers.
Chompy is Copyright 1998 (Neil Blue) SoftBlue. All rights reserved.