Minimization Strategies in NeuroTran®

Nenad Končar, Sławomir Pawłowski, Danko Šipka, Vladimir Šipka

Translation Experts Ltd.

 

Abstract

This paper presents the minimization strategies used by NeuroTran®, a software program based on the principles of minimal effort, post-Fordism, and the reusability of lexicographic data. NeuroTran® uses a set of formal Minimal Information Grammar rules to manipulate various linguistic data; their purpose is to reduce both the information needed for the software to function quickly and accurately and the effort required to acquire that information. Additionally, NeuroTran® relies on a set of usage and thematic labels and on corpora linked to the program. Finally, it makes extensive use of artificial neural networks in its translating and parsing tasks.

Introduction

NeuroTran® is a software program developed by Translation Experts Ltd. It is intended to "do things with words" unlike anything that has come before it. It is typically post-Fordist and utilizes its knowledge base in various ways depending upon the specific options selected. It includes a morphological generator and analyzer, a dictionary with lexical list extraction capability, sentence parsing and translation capabilities, and quantitative and qualitative analysis. In its sentence translation mode, the software acts as a bilingual transfer system.

As part of the development of NeuroTran's knowledge base, we developed an adjunct software tool called Dictman (Dictionary Manager), which minimizes the effort required to compile the knowledge base. Since certain minimization strategies rely on transferring features and equivalents from one language to another, one could say that the project as a whole has some properties of a multilingual system.

NeuroTran® operates in a Windows environment, and a CGI Internet version is currently in development. It includes modules for English (in L1 and L2 positions) and for German, Polish, Russian, and Serbo-Croatian, the last either as a whole or in its Bosnian, Croatian, and Serbian variants (in L1 and L2 positions).

This presentation consists of three parts. In the first, we outline the framework of NeuroTran's underlying ideas and rationale; in the second, we discuss the formal rules and strategies which enable NeuroTran® to perform its intended functions; finally, we describe how NeuroTran® and Dictman actually operate and provide a presentation of the software.

Ideas

When faced with the high complexity of their environment, people often resort to heuristics and schemata in an effort to reduce the effort required to cognitively process such data (Fiske and Taylor, 1991; Kahneman, Slovic and Tversky, 1982). We must first recognize the high complexity of the tasks NeuroTran® is intended to perform and then attempt to make it operate in a manner similar to the way in which human beings actually process information. Furthermore, we must at least attempt to utilize the properties of human cognitive processing when preparing the knowledge base for the program.

With a certain degree of simplification, the ratio between the quantity of information needed and the quality of any machine translation project can be presented as a direct relationship: the higher the quality of the output, the greater the quantity of information the system normally requires.

The main strategy in the development of NeuroTran® was therefore to change this ratio by reducing the quantity of information the program requires to perform its task. To accomplish this, we had to reject available generative models -- such as minimalism (Chomsky 1995) and HPSG (Pollard and Sag 1994) -- which tend to increase the quantity of information needed.

NeuroTran® is based upon three crucial ideas. The first is that one needs to minimize, to the greatest extent possible, the information required by the software to function, and then allow it to acquire new information by reading text and communicating directly with the user. The program accomplishes this by using artificial neural networks; that is to say, it starts as a "cognitive miser" and then uses schemata to assimilate new pieces of information while at the same time adapting and changing old ones according to the new information.

The second main idea behind NeuroTran® is that one needs to reduce the effort required of the lexicographer, again by requiring minimum information or minimal input. In other words, the dictionary and program creators function as "cognitive misers" on behalf of the user.

Finally, all NeuroTran data needs to be reusable, so that all knowledge bases and functions are usable in various situations, for various fields of endeavor, and for different languages.

Rules and Strategies

NeuroTran® uses a set of formal rules we have named Minimal Information Grammar (MIG). These rules operate on the basis of a bilingual labeled lexical list (the list of equivalents with their respective grammar and usage labels together with frequency data, etc.) and a representative corpus for both languages in any given pairing. The architecture of MIG is subordinate to the fundamental ideas behind NeuroTran®.

The grammar was named minimal because it reduces the information required for the software to perform its functions at a high level of speed and accuracy. It does this by: (1) balancing information between the rules and the data they operate on; (2) using different classes of rules (constructors, mutators, selectors, etc.) to manipulate existing linguistic material; and (3) using artificial neural networks to provide new information as a direct result of the learning process which occurs whenever the software "reads" a text or communicates with a user. Existing information is thus continuously "recycled". MIG operates with the following classes of rules (a schematic sketch follows the list):

a. constructors - use dictionary labels to construct all possible forms of a word

b. mutators - change already generated forms

c. finders - find the form or word required

d. definers - decide what is what

e. coordinators - determine how one form coincides with others

f. choppers - divide larger units into smaller ones

g. binders - unite smaller units into larger ones

h. transformers - replace one word or form with another, for example, by translating a word in one language into a word in the other

i. counters - keep track of all statistics

j. doubters - detect situations where too many possibilities remain for the program to proceed

k. gamblers - choose the solution that (based upon everything in the database) is the most probable even though other options remain viable

l. teachers - change existing information (rules and figures) after reading different texts and translations

m. chatters - ask the user when they need a piece of information or when the user wants to change something

n. conductors - direct the order in which the rules are to be applied
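
As a schematic illustration of how these classes interact, consider the following C++ sketch. It is entirely our own; the names Rule, WorkingSet and Conductor are hypothetical and do not come from NeuroTran's source. It shows only the organizing principle: every rule examines and transforms a working structure, and a conductor (class n) fixes the firing order.

    #include <vector>

    // Schematic sketch only: the working structure holds the tokens,
    // forms and features currently under analysis.
    struct WorkingSet;

    struct Rule {
        virtual ~Rule() = default;
        virtual bool applies(const WorkingSet& w) const = 0;  // e.g. finders, doubters
        virtual void apply(WorkingSet& w) const = 0;          // e.g. constructors, mutators
    };

    // The conductor corresponds to class (n): it directs the order
    // in which the other rules are applied.
    struct Conductor {
        std::vector<const Rule*> orderedRules;
        void run(WorkingSet& w) const {
            for (const Rule* r : orderedRules)
                if (r->applies(w)) r->apply(w);
        }
    };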

Every rule consists of a head (stating the input of the rule) and a body (providing details of how the output is calculated). This is represented in Diagram 2 using the example of the English to Serbo-Croatian translation transfer rule for number-gender coordination between the nominal head and its adjective modifier:

 

Rule                     Example
<rule head> =>           ENGSCR GRM N[ADJECTIVE|PRONOUN] NOUN =>
<rule body line 1>;      COPY(2>1:NUMBER,GENDER)
<rule body line 2>;
...
<rule body line N>

Diagram 2
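
To make the transfer rule concrete, the following C++ sketch (a simplification of ours; Token and copyFeatures are hypothetical names, not NeuroTran's internals) shows the effect of COPY(2>1:NUMBER,GENDER): the NUMBER and GENDER values of the second matched element (the noun) are copied onto the first (its adjective or pronoun modifier).

    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical sketch: each token carries a feature bundle keyed by name.
    struct Token {
        std::string lemma;
        std::map<std::string, std::string> features;  // e.g. {"NUMBER","PLURAL"}
    };

    // COPY(2>1:NUMBER,GENDER): copy the listed features from the noun
    // (second element of the matched pattern) onto its modifier (first element).
    void copyFeatures(Token& modifier, const Token& noun,
                      const std::vector<std::string>& names) {
        for (const auto& name : names) {
            auto it = noun.features.find(name);
            if (it != noun.features.end())
                modifier.features[name] = it->second;
        }
    }

The target-language generator can then construct the modifier form that agrees in number and gender with its head noun.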

The entry in the labeled list of equivalents has the following structure:

<entry><grammatical labels><frequency data><usage labels>

<equivalent 1><grammatical labels><frequency data><usage labels>

<equivalent 2><grammatical labels><frequency data><usage labels>

...

<equivalent n><grammatical labels><frequency data><usage labels>
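
In C++ terms, this layout might be represented roughly as follows (a sketch under our own naming; LabeledItem and DictionaryEntry are assumptions, not NeuroTran's actual structures):

    #include <string>
    #include <vector>

    // Sketch of one record of the labeled list; field names are ours.
    struct LabeledItem {
        std::string form;           // <entry> or <equivalent n>
        std::string grammarLabels;  // <grammatical labels>, e.g. ",a m;"
        double frequency = 0.0;     // <frequency data>
        std::string usageLabels;    // <usage labels>, e.g. "medical"
    };

    struct DictionaryEntry {
        LabeledItem headword;                  // source-language entry
        std::vector<LabeledItem> equivalents;  // equivalents 1..n
    };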

The text corpus is attached to the program as raw text with an index pointing from each form occurring in the text to its starting and ending bytes.
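
Such an index can be sketched as a map from forms to byte spans (again a reconstruction of ours; ByteSpan, CorpusIndex and firstExample are hypothetical names):

    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using ByteSpan = std::pair<std::size_t, std::size_t>;  // start, end byte
    using CorpusIndex = std::map<std::string, std::vector<ByteSpan>>;

    // Cut the first attested context of a form out of the raw corpus text,
    // e.g. to show an example of actual word usage in the dictionary module.
    std::string firstExample(const std::string& corpus, const CorpusIndex& index,
                             const std::string& form) {
        auto it = index.find(form);
        if (it == index.end() || it->second.empty()) return "";
        ByteSpan span = it->second.front();
        return corpus.substr(span.first, span.second - span.first);
    }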

NeuroTran's minimization strategies fall into two major groups: the first concerns the process of compiling the knowledge bases for the dictionaries, while the second involves the architecture of the program itself and the rules it operates under. These two groups will be explained in turn.

Within the process of compiling the knowledge bases, there are four main minimization strategies:

A. requiring minimal knowledge from the lexicographer

B. automatic labeling and corpus processing

C. automatic extraction

D. automatic transferring of the labels between different language modules

The architecture of the program and its rules use the following minimization strategies:

A. optimal information distribution between the rules and dictionary entries

B. coordination of the different rule classes

C. using neural networks in learning and generalizing

D. using representative corpora as an integral part of the software

Process of Compiling the Knowledge Bases

Our main principle is that most of the information provided by the lexicographer should be easily retrievable from their long-term memory. This in turn minimizes the time required to compile the knowledge bases. For example, in Slavic languages, verbs are provided not with a code specifying their conjugation type but, rather, with concrete forms which are readily available in the lexicographer's LTM. Consequently, instead of labeling the Serbo-Croatian kucnuti as 3 v;, we label it as kucnuti,nem,nu v, which means that the lexicographer does not lose processing time associating this verb with its conjugation type, while the program still provides all inflectional forms.

Another minimization strategy in compiling the knowledge bases for NeuroTran® is the automatic labeling of whole lists based upon a small, representative sample. For example, grammatical labeling for Slavic languages is accomplished by first labeling ten percent of the full list and then running an algorithm on the remaining ninety percent. This algorithm searches for the ideal point at which to correlate a series of characters at the end of a lexical entry with labels. Similar to this is automatic corpus indexation, which enables a representative corpus to be used in both the dictionary module (to provide examples of actual word usage) and the MT module (to control the use of prepositions and articles).

A specific function performed by Dictman is the automatic extraction of entries containing certain features. This greatly minimizes the work required to compile specialized dictionaries from general ones. We can, for example, extract all lexemes labeled "computing" from any general dictionary and then use these extracted entries as a core for creating a specialized computing dictionary for a given language pair. The same method, extracting entries that contain a certain grammatical label or share the same string of characters, is also used to correct existing knowledge bases.

Finally, almost all features can be transferred from the knowledge base of one language to the knowledge bases of all other languages. For example, when using such labels as "medical", "computing", etc., with an English-Russian bilingual knowledge base, one need only label the English side, because Dictman will transfer the label both to the Russian side of the knowledge base and to the English side of any other language knowledge base.

In both the automatization and transfer minimization strategies, the lexicographer must perform the final proofreading and editing, but this is only a small portion of the work which would have been required without these two strategies. Proofing and editing the transferred usage labels consists of changing or modifying those areas in which the two languages or knowledge bases differ.

The Architecture of the Program and its Rules

Using the example of a single Polish noun rule to illustrate the first two minimization strategies within the architecture of the program and its rules: the combination of the constructor (presented in Diagram 3) and the three mutators (presented in Diagram 4) generates the inflection of a whole range of Polish feminine nouns while at the same time accounting for a broad range of both stem and ending alternations. This combination of rules is sufficient for such diverse examples as teczka-GPl teczek (‘portfolio’), noga-GPl nóg (‘leg’), kobieta-GSg kobiety (‘woman’) and apteka-GPl aptek (‘drugstore’). The labels contained in dictionary entries provide only basic information which cannot be inferred from the form of the lexeme. All other information is inferred from the form of the entry and handled by coordinating constructors and mutators.

Constructor rule              Explanation
POL PARA *a,V1,V2 f =>        Head: if an entry matching this pattern is found
NOUN;FEMININE;O1=(1->','-1);  Body: it is a feminine noun whose stem (O1) is the part preceding the first comma with the final character deleted
SINGULAR;
NOM=O1+a;                     the Nominative Singular is formed by adding ‘a’ to the stem
GEN=O1+V1;                    in the Genitive Singular, the vowel after the first comma (V1) is added to the stem
DAT=PAL(O1)+e;                in the Dative Singular, the mutator PAL is applied
ACC=O1+ę;
INS=O1+ą;
LOC=PAL(O1)+e;                in the Locative Singular, the mutator PAL is applied
VOC=O1+o;
PLURAL;
NOM=O1+V2;                    in the Nominative Plural, the vowel after the second comma (V2) is added to the stem
GEN=OU(KEK(O1));              in the Genitive Plural, both mutators OU and KEK are applied to the stem
DAT=O1+om;
ACC=O1+V2;
INS=O1+ami;
LOC=O1+ach;
VOC=O1+V2

Diagram 3

Mutator rule (with explanation below each rule):

POL FUN PAL =>
PAL[O]=LAST[O][(t,d,r,sz,ż,rz,k,g,ch,ł,p,b,w,m,n,s,z,c)=>(ci,dzi,rz,si,zi,zi,c,dz,sz,l,pi,bi,wi,mi,ni,si,zi,ci)]
    Function PAL. If the stem ends in one of the segments before the => sign, the function changes it into the corresponding segment after that sign. Otherwise, nothing happens.

POL FUN OU => OU[O]=[O][(*KoK_)=>(*KóK_)]
    Function OU. If the stem ends in the sequence consonant-‘o’-consonant, the ‘o’ is changed into ‘ó’.

POL FUN KEK => KEK[O]=[O][(*KK_)=>(*KeK_)]
    Function KEK. If the stem ends in a sequence of two consonants, ‘e’ is inserted between these two consonants.

Diagram 4

The basic operating principle is that if the conditions are met for a mutator to be applied, it changes the stem; if not, nothing happens. If we look at this rule applied to the Genitive Plural, we can see that, in the case of the entry teczka,i,i f, the constructor generates the stem teczk; in the Genitive Plural, there are two consonants at the end of the stem, so the mutator KEK inserts ‘e’ between them, yielding teczek. The entry noga,i,i f does not fulfill this criterion, so there is no similar ‘e’ insertion; but the conditions to change ‘o’ into ‘ó’ are present, so the mutator OU is applied and the final Genitive Plural form becomes nóg. Finally, the entry apteka,i,i f fulfills neither criterion, so the stem remains unchanged and the Genitive Plural becomes aptek.
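
The Genitive Plural behavior just described can be reproduced in a few lines of C++ (a deliberately simplified, ASCII-level sketch of the two mutators written by us; the real rules must also cover Polish digraphs and diacritics in stems):

    #include <cstddef>
    #include <iostream>
    #include <string>

    static bool isVowel(char c) {
        return std::string("aeiouy").find(c) != std::string::npos;
    }
    static bool isConsonant(char c) { return !isVowel(c); }

    // KEK: if the stem ends in two consonants, insert 'e' between them.
    std::string KEK(std::string stem) {
        std::size_t n = stem.size();
        if (n >= 2 && isConsonant(stem[n - 2]) && isConsonant(stem[n - 1]))
            stem.insert(n - 1, "e");
        return stem;
    }

    // OU: if the stem ends in consonant-'o'-consonant, change the 'o' to 'ó'.
    std::string OU(std::string stem) {
        std::size_t n = stem.size();
        if (n >= 3 && isConsonant(stem[n - 3]) && stem[n - 2] == 'o' &&
            isConsonant(stem[n - 1]))
            stem.replace(n - 2, 1, "ó");
        return stem;
    }

    int main() {
        // Genitive Plural = OU(KEK(stem)), as in Diagram 3.
        std::cout << OU(KEK("teczk")) << "\n";  // teczek
        std::cout << OU(KEK("nog"))   << "\n";  // nóg
        std::cout << OU(KEK("aptek")) << "\n";  // aptek
    }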

The advantage of using artificial neural networks can best be appreciated in sentence translation. NeuroTran's general sentence translation strategy consists of the following seven steps:

A. Breaking up text into sentences

B. Identifying finite verbs and predicates

C. Breaking up sentences into clauses

D. Identifying subjects

E. Identifying and transferring phrases

F. Performing transformations of translated clauses

G. Binding translated clauses into an output sentence

At any of the above steps, the heuristics and algorithms used by the program can produce several possible solutions; all such possibilities are kept open and transferred to the next step(s) until a clear indicator of their impossibility is encountered. In other words, an artificial neural network is propagated until the conditions are met that allow the nodes containing incorrect solutions to be discarded. At the same time, the program keeps track of its errors (providing a kind of learning mechanism), which allows it to translate with increasingly greater accuracy with each new translation, much like a human being who constantly adapts to his or her environment.
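
Schematically, this propagation can be sketched as follows (our own simplification, not the actual network code): each step expands every surviving hypothesis into its viable continuations, and an impossible hypothesis simply yields no continuation and disappears. In terms of the rule classes above, the doubters detect that several hypotheses survive a step, and the gamblers pick the highest-scoring survivor when a single output is finally required.

    #include <string>
    #include <vector>

    // Schematic sketch: a hypothesis is a partial analysis with a running score.
    struct Hypothesis {
        std::string analysis;
        double score = 1.0;
    };

    // Step: a callable mapping one hypothesis to its viable continuations;
    // an impossible hypothesis returns an empty vector and is thus discarded.
    template <typename Step>
    std::vector<Hypothesis> propagate(const std::vector<Hypothesis>& current,
                                      Step step) {
        std::vector<Hypothesis> next;
        for (const auto& h : current)
            for (const auto& h2 : step(h))
                next.push_back(h2);
        return next;  // every still-viable reading survives to the next step
    }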

The artificial neural network used by NeuroTran® is implemented in C++. Both a doubly-linked list and a tree structure are used in order to fully represent the parse tree that the software builds when analyzing any source language sentence.

Each node in the list contains one minimal translation component, which can be either a single word or an idiomatic phrase. This data structure containing the parse tree is then further manipulated during the translation process so as to produce the parse tree of the target language. Elementary tree manipulation commands are issued in a sequence provided by the grammar translation module, which decides the order of rule application as well as which particular rules actually apply to a given sentence. This sequence is remembered and can be displayed to the user upon completion of a sentence translation. Once the manipulation of the parse tree is complete, the translated sentence can easily be read from the tree by accessing it via the doubly-linked list, which allows quick left-to-right and right-to-left examination of the leaves of the parse tree.
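
The combined structure can be sketched like this (member names are ours, not NeuroTran's):

    #include <string>
    #include <vector>

    // Each node sits both in the parse tree (parent/children) and in a
    // doubly-linked list threading the leaves, so the translated sentence
    // can be examined left-to-right or right-to-left.
    struct ParseNode {
        std::string unit;                  // a single word or an idiomatic phrase
        ParseNode* parent = nullptr;       // tree structure
        std::vector<ParseNode*> children;
        ParseNode* prevLeaf = nullptr;     // doubly-linked list of leaves
        ParseNode* nextLeaf = nullptr;
    };

    // Reading the translated sentence: walk the leaf list from its head.
    std::string readSentence(const ParseNode* firstLeaf) {
        std::string out;
        for (const ParseNode* n = firstLeaf; n; n = n->nextLeaf) {
            if (!out.empty()) out += ' ';
            out += n->unit;
        }
        return out;
    }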

Finally, a type of ad hoc corpus analysis is performed to account for certain phenomena in the translation and thereby avoid complicated sets of rules. For example, to find the English equivalent of the Polish "w pracy" or the Serbo-Croatian "na poslu", each would first be translated literally as "in work" or "on work" (in step E of the sentence translation process); the corpus is then consulted and both translations are changed to "at work." This can be seen in Diagram 5:

Diagram 5
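
The corpus consultation can be sketched as follows (hypothetical names of ours; countOccurrences stands in for a lookup in the corpus index described earlier, and we assume the program generates alternative prepositions as candidate translations):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical corpus lookup: how often a phrase occurs in the
    // target-language corpus attached to the program.
    std::size_t countOccurrences(const std::string& phrase);

    // Keep the candidate translation best attested in the corpus;
    // assumes at least one candidate and falls back to the first one.
    std::string pickByCorpus(const std::vector<std::string>& candidates) {
        std::string best = candidates.front();
        std::size_t bestCount = 0;
        for (const auto& c : candidates) {
            std::size_t n = countOccurrences(c);
            if (n > bestCount) { bestCount = n; best = c; }
        }
        return best;
    }

    // pickByCorpus({"in work", "on work", "at work"}) -> "at work"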

NeuroTran® and Dictman at Work

The minimization strategies described previously are best seen in the ways in which Dictman and NeuroTran® operate. The following two simple procedures explain how Dictman and NeuroTran operate in practice:

A. Automatic Labeling in Dictman

This Dictman option performs automatic labeling of the entire dictionary file in a written textual format. If the beginning of the dictionary file is already labeled, automatic labeling can be conducted using the correlation between the form of the labeled words at the beginning of the file and their corresponding labels. In such a case, Dictman matches the ending of an unlabeled word with already labeled words. If there is a labeled word with the same ending as the word that is being automatically labeled, the grammar label is copied to it. The larger the already labeled "beginning" section of the file, the more accurate automatic labeling will be.

Another feature of this option is the possibility of automatically labeling an entirely unlabeled dictionary file by matching the endings of unlabeled words with the endings listed in the rule dictionary. The matching/checking order, as well as the word endings characteristic of the specific language together with their corresponding labels, are specified in the LABEL-ORDER rule of the rule dictionary. If a matching ending is found in the LABEL-ORDER rule, the corresponding label is copied to the word in the dictionary.

Additionally, the auto-labeling option creates a file containing all word endings found in the text (together with their corresponding grammar labels), based upon the labeling of the remainder of the dictionary or of the whole dictionary.

A partial example of the LABEL-ORDER rule:

SCR LABEL-ORDER =>

ž/,a m;

z/,a m;

šov/,a m;

lov/,a m;

ov/,a,o--;

tiv/,a m;

hiv/,a m;

iv/,a,o--;

ev/,a,o--;

rav/,a m;

klav/,a m;

av/,a,o--;

nu/,ua m;

rut/,a m;

loput/,a,o--;

put/,a +ev m

For every word for which Dictman searches for an appropriate grammar label, it compares the word's ending with those already entered in the LABEL-ORDER rule (the character string before the forward slash ‘/’). At the beginning of the search, the ending consists of the whole word. If there is no matching ending in the LABEL-ORDER rule, the program deletes the first character and searches the LABEL-ORDER rule again. This procedure is repeated until a matching ending is found or until the program fails to find any appropriate ending for the word. If a matching ending is found, the corresponding label (the character string from the forward slash ‘/’ to the end of the line) is copied to the end of the word. For example, for the word arhiv (archive), the program finds the label ",a m" according to the word's specific ending "hiv"; for the word dirljiv (moving, touching), it finds the more general matching ending "iv" and labels it with ",a,o--;".
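
This matching procedure can be sketched in C++ as follows (a reconstruction of ours; LabelRule and autoLabel are hypothetical names):

    #include <string>
    #include <utility>
    #include <vector>

    // One LABEL-ORDER line: the ending before '/' and the label after it,
    // stored in the matching/checking order given in the rule dictionary.
    using LabelRule = std::pair<std::string, std::string>;

    std::string autoLabel(const std::string& word,
                          const std::vector<LabelRule>& labelOrder) {
        // The candidate ending starts as the whole word and loses its first
        // character until a LABEL-ORDER ending matches it exactly.
        for (std::string ending = word; !ending.empty(); ending.erase(0, 1))
            for (const auto& rule : labelOrder)
                if (ending == rule.first) return rule.second;
        return "";  // no matching ending: the word stays unlabeled
    }

    // autoLabel("arhiv", rules)   -> ",a m;"   (matches the ending "hiv")
    // autoLabel("dirljiv", rules) -> ",a,o--;" (falls through to "iv")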

In most cases, this method of automatic labeling has proven to be highly accurate in choosing appropriate labels.

B. NeuroTran® Sentence Translation

NeuroTran® first determines the type of text being translated (e.g. 'technical', 'medical', etc.) as well as which of its specific translation dictionaries would best translate the given text. Thereafter, NeuroTran's sentence translation proceeds by first distinguishing simple clauses from complex ones -- those containing more than one predicate. Once a simple clause is identified (together with any contextual information provided by the previous clause or surrounding text), it is fed to the main translation routine, which parses the sentence into a tree structure overlaid with a doubly-linked list for maximum processing speed and flexibility.

The parsing process involves identifying all noun and verb phrases, the subject, the predicate and the direct or indirect object and their respective attributes. The parse tree is then manipulated by translation rules stored in the grammar dictionary. All rules are external to the program and can, therefore, be changed by a skilled linguist without any computer programming knowledge.

The rules are written in a formal language (akin to Prolog) and are compiled into the grammar dictionary with a compiler provided with the Dictionary Manager software package that accompanies NeuroTran®. Once each simple clause has been translated, the program takes a further step. If these simple clauses make up a more complex sentence, the translations are concatenated and then written to the screen or output file before the software proceeds to translate the next sentence.

References

Chomsky, N. (1995) The Minimalist Program, Cambridge: MIT Press.

Fiske, S.T., Taylor, S.E. (1991) Social Cognition, New York: McGraw-Hill.

Kahneman, D., Slovic, P., Tversky, A. (1982) Judgment under Uncertainty: Heuristics and Biases, New York: Cambridge University Press.

King, M. (ed.) (1987) Machine Translation Today, Edinburgh: Edinburgh University Press.

Newton, J. (ed.) (1992) Computers in Translation: A Practical Appraisal, London: Routledge.

Pollard, C., Sag, I. (1994) Head-Driven Phrase Structure Grammar, Chicago: University of Chicago Press.