How it all got started...

The text of a paper delivered at a scholarly meeting of the Poznań School of English staff members in March 1995.


     
     On Friday, January 27 this year Prof. W. Sobkowiak and I
boarded a Sabena airliner that was to carry us to the venue of a
most interesting scholarly event: a one-day symposium on
"Exploiting computer learner corpora".  Upon registering at the
Hotel de Lauzelle, beautifully set on the edge of a wood, only 15
minutes away from the Erasmus College and the university campus, we
were presented with the information packs and allowed to retire to
the comfort of our twin-bedded rooms (we each got one). However, we
preferred to spend the end of that long, tiring day of travel in
the hotel bar, savouring strong Belgian beer and scanning the other
guests in the hope of identifying fellow participants of our
symposium.
     What had drawn Professor Sobkowiak and myself to the small
university town of Louvain, some twenty miles south-east of
Brussels, was our interest in the new and promising discipline of
corpus linguistics. Although we tend to look at its challenges from
two complementary viewpoints - his primarily that of a phonetician
and phonologist, mine that of a writing teacher - I think we both
share a dream of a day when our University hard disks will be
bursting at the seams with language corpora lending themselves to a
multitude of analyses. I myself also dream of the day when those
analyses, once complete, start to have an impact on my own
teaching: on that ceaseless, hopeless struggle for clarity,
cohesion, coherence, collocability, good word choice, sentence
length and register in the tonnes of "Polglish" essays which my
students submit each week. Would you yourselves not like to see
some of that work done fast, efficiently and under nicer
circumstances? Then involvement with the ICLE might be an answer
for you.
     The heart and soul of the ICLE - International Corpus of
Learner English - project is Prof. Sylviane Granger, who organised
the one-day meeting in Louvain. Most of the information that I will
be passing on to you today is taken from her publications, as well
as from some of the papers delivered by participants of the
memorable Belgian event. Unfortunately, it is too early for me to
offer anything original on the subject.
     The International Corpus of Learner English, ICLE for short,
is a computerised corpus of argumentative essays and, in certain
cases, also of literary papers written in English by advanced adult
foreign learners of the language. It has now been partially
completed (contributions include the French, German, Dutch,
Spanish, Swedish, Finnish, Czech, Japanese and Chinese corpora),
but it is still growing in size. The first comparative analyses
and investigations carried out on the corpus look mainly at some
surface features and processes occurring en route from L1 to L2 as
a result of learning.
     ICLE was the first project of its type. Before October 1990,
when it was launched, only corpora of national varieties of English
from first and second language countries had been amassed and
studied, much to the benefit of Contrastive Analysis, Applied
Linguistics and computer lexicography. Still, it had been noticed,
they failed in one respect: they left the foreign learner of
English and his learning needs largely neglected. Hence, Prof.
Granger and her people at Louvain decided to close that gap. In
affiliation with the so-called ICE project (International Corpus of
English) led by Sidney Greenbaum from University College London,
they founded the ICLE project with the primary aim of investigating
the interlanguage of the foreign language learner.
     By that time - late 1990 - as Prof. J. Aarts pointed out, the
days of large, comprehensive corpora supposedly representative of
the whole of a given language were gone. Instead, smaller but more
specialised corpora had come into favour. Consequently, it would
not have been reasonable for the ICLE team to attempt to collect
material from just any foreign learner. On the contrary, to make
certain that further investigations of the corpus were credible,
strict sampling requirements and procedures had to be imposed.
Naturally, at the outset only papers from French-speaking
(presumably Belgian) students were used; it was only later that
other nations started producing their own sub-corpora. The advanced
level was chosen for the entries - one corresponding roughly to the
third or fourth year of university studies in English. Apart from
purely practical reasons that influenced that decision (the samples
were easier to collect since the project founders worked at the
university and all their students wrote essays), it was also
thought that this particular stage of learner development features
especially interesting lexical, stylistic and discourse errors
(coherence, cohesion), hitherto unaccounted for by traditional
Error Analysis, which focused on the easier morpho-syntactic
problems characteristic of the earlier stages of learning. Another
consideration that spoke in favour of an advanced level corpus was
that it would help pinpoint more of the "non-native", or
"foreign-sounding", type of errors, which are often caused by over-
and/or underrepresentation of words and structures rather than by
outright misuse. An analysis of these would hopefully contribute
interesting material to contrastive studies of discourse.
     As mentioned above, originally only the French mother tongue
was represented. But, as Prof. Granger rightly indicated in one of
her articles:

     "...it soon became apparent that no definitive conclusions
     could be drawn on the basis of one variety of L2 English. In
     order to be able to distinguish those features of L2 English
     that were L1-dependent, i.e. the result of transfer from the
     mother tongue, from those which were common to all learners,
     irrespective of mother tongue, i.e. the cross-linguistic
     invariants, it was essential to enlarge the corpus and include
     learners from different language backgrounds."

That turned out to be the watershed in the history of the ICLE. One
can even say that from that moment the decision to build up a
corpus of Polish-English was predetermined. To become a full member
of the International Corpus of Learner English project, the Polish -
and any other L2 - side has to collect a set of data amounting to
a minimum of 200,000 words, distributed over roughly 400 essays of
approximately 500 words each. The additional requirement is that no
writer can "donate" more than 1,000 words of his/her text. We have
tried to meet all those demands and to forge ahead with assembling
our material as soon as possible. Our plans, hopes and difficulties
will be briefly characterised at the end of this presentation.
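
     The arithmetic behind these requirements (about 400 essays of
about 500 words each, giving 200,000 words, with no writer above
1,000 words) can be checked mechanically once the essays are on
disk. The Python sketch below is only an illustration: the
(writer, text) pairs, the function name and the naive
whitespace-based word count are assumptions made for the example,
not part of the ICLE guidelines.

```python
from collections import defaultdict

MIN_TOTAL_WORDS = 200_000      # minimum sub-corpus size quoted in the text
MAX_WORDS_PER_WRITER = 1_000   # per-writer cap quoted in the text


def check_subcorpus(essays):
    """Check a collection against the two size requirements.

    `essays` is an iterable of (writer_id, text) pairs; this layout is
    hypothetical. Words are counted naively by splitting on whitespace.
    Returns (total_words, writers_over_cap, meets_minimum).
    """
    words_by_writer = defaultdict(int)
    for writer_id, text in essays:
        words_by_writer[writer_id] += len(text.split())
    total = sum(words_by_writer.values())
    over_cap = sorted(
        writer for writer, count in words_by_writer.items()
        if count > MAX_WORDS_PER_WRITER
    )
    return total, over_cap, total >= MIN_TOTAL_WORDS
```

A writer who hands in two essays of 600 and 500 words would be
reported in the second element of the result, since the combined
1,100 words exceed the cap.
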
     It would be wrong to suppose that gathering the required
200,000 words of text under such strictly determined conditions is
easy. Stig Johansson noted that, "A smaller, high quality corpus is
much more valuable than a vast, heterogenous one." It is essential
that the project team foster that high quality throughout by
maintaining comparability of learning levels and backgrounds.
Therefore, apart from the actual essay or paper, each contributing
learner is asked to submit what is called a Learner Profile - a
source of biographical information and a great help in extracting
some customised sub-corpora. You can see a copy of it in your
handouts.
     With all the required amounts of essay text and learner
profiles on hand, the "raw" material of the corpus is said to have
been collected and is ready to be processed. Three or four
consecutive processing stages are distinguished. If they are
dealing with hand-written or typed material, the compilers first
have to encode the texts in machine-readable form, either by keying
them in on the computer keyboard or by scanning them with an
optical character reader. This first stage of encoding may be
considerably reduced, thus saving a lot of time for later editing,
if at least part of the essays and papers are received from the
students already in computerised form. The next stage, the mark-up,
in which the corpus is standardised, as it were, by correcting some
of the simpler errors present in it, could then be initiated almost
immediately.
     The problem of normalisation of errors forced the Louvain team
to go back on their initial decision to mark and correct the great
majority of errors in the submitted texts. For one thing, marking
proved to be highly subjective and thus liable to inconsistency;
for another, the analyst performing the task had to be very
familiar with the programs which were due to operate on the
marked-up text: the tagger - assigning grammatical categories and
features to lexical items - and the parser - noting the syntactic
relationships holding between those items. The TOSCA-ICE Tagger
used in the ICLE project, for example, required some specific
formatting, such as a space before and after every punctuation
mark. As a result of these hitches, normalisation of the encoded
ICLE corpus was reduced to the bare minimum: sentence boundaries
were added, spellings were normalised etc. As regards the simple
morpho-syntactic errors, they were not usually normalised, since it
was expected that at the advanced university level those errors
would not occur particularly frequently and that, consequently, the
risk of the tagger or parser failing to attach the right category
to an item was very limited.
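
     The formatting requirement mentioned above, a space before and
after every punctuation mark, is the kind of chore that is easily
automated. What follows is a minimal Python sketch of the idea and
not the procedure actually used in Louvain; the function name is
invented, and "punctuation" is assumed here to mean any character
that is neither a letter, a digit, an underscore nor white space.

```python
import re

# Any single character that is neither a word character nor whitespace.
_PUNCT = re.compile(r"([^\w\s])")


def space_punctuation(text):
    """Return `text` with one space before and after each punctuation mark."""
    spaced = _PUNCT.sub(r" \1 ", text)
    # Collapse the runs of white space that the substitution may create.
    return re.sub(r"\s+", " ", spaced).strip()
```

Applied to "Yes, we can." the function gives "Yes , we can ." Note
that under this assumption apostrophes and hyphens are split off as
well, which a real pre-processor would probably want to treat
separately.
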
     Mark-up has to be done in accord with a standard set of
symbols. In the ICLE, by virtue of its being part of the ICE
project, the ICE symbols are used. The job of inserting them can be
done with the help of a program called the ICE Markup Assistant,
specially developed at University College London for all
participants in the ICE project. The program semi-automatically
adds some more refined markup, such as editorial comments,
typographical markup etc.
     The examples of original normalisation and normative insertion
given below, quoted after Prof. Granger, show what the ICE markup
system looks like:

1. Original normalisation:
     The diplomas are are equivalent.
     The diplomas <}_> <-_ are <-/ <=_ are <=/ <}/> equivalent.
2. Normative insertion:
     We can that ......
     We can <+_> say <+/> that .......

     I am not familiar with the ICE system of symbols just yet, but
it seems similar to those used in advanced style checking software.
The workings of it will pose fewer problems, I hope, when it comes
to marking up our own Polish corpus.
     The two final stages of processing the ICLE corpus are, then,
tagging and parsing. Having access to a tagged and parsed version
of a corpus has obvious advantages over using a "raw" one. To give
you an example: an analyst investigating learner use of the English
prepositions can retrieve the needed data much faster, since the
search command available on a program handling the annotated corpus
can easily leave out those occurrences of "in", "on", "for" etc. in
which they act as adverbials (as in "come in") rather than
prepositions ("in the garden").
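
     The advantage can be illustrated with a few lines of Python.
This is a sketch under simplifying assumptions: the corpus is
represented as a list of (word, tag) pairs, and the tag names
"PREP" and "ADV" are invented for the example; they are not the
labels of the ICE tag set.

```python
def find_by_tag(tagged_tokens, words, tag):
    """Return the positions of tokens that match one of `words` and carry `tag`."""
    wanted = {w.lower() for w in words}
    return [
        i for i, (word, token_tag) in enumerate(tagged_tokens)
        if word.lower() in wanted and token_tag == tag
    ]


# Hypothetical tagged fragment: "Come in and sit in the garden"
sample = [
    ("Come", "V"), ("in", "ADV"), ("and", "CONJ"), ("sit", "V"),
    ("in", "PREP"), ("the", "ART"), ("garden", "N"),
]
prepositional_hits = find_by_tag(sample, {"in", "on", "for"}, "PREP")
```

Only the second "in" (position 4, counting from zero) is
returned; the adverbial "in" of "come in" is skipped without the
analyst having to read through the hits by hand.
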
     The tagging and parsing of the ICLE are usually performed at
the ICLE headquarters in Louvain, where care is taken to make
certain tagger or parser marks are disambiguated and failures are
corrected. The programs used are the TOSCA tagger (equipped with a
fine-grain system of about 150 tags from the ICE tag set) and
parser, both developed by Professor Jan Aarts, one of the most
respected authorities in corpus research.
     When fitted with all the grammatical and semantic information
and finally disambiguated, the corpus is ready to be tested by
linguists. In fact, all the national sub-corpora collected so far
will be, if they have not already been, published in a lexical
version on a CD-ROM by the Norwegian Computing Center for the
Humanities.

Research and application.

     Strangely enough, so far the only obstacle to the full
development of ICLE-based research has been the lack of an
adequate, comparable native English corpus. As mentioned above, for
any reliable contrastive claims to be made, the corpora compared
have to meet the same criteria. Thus, the required Native Corpus of
English has to feature 500-word argumentative and literature essays
written by University students. It has been quite difficult to
obtain a substantial amount of the argumentative material; to date,
only the literary part of the Native English Corpus has been
assembled. Before the argumentative part follows suit, several
other non-literary computerised corpora are being used by the ICLE
team, all of them producing promising results.
     Opportunities for studies are massive, and so are, I think,
the possible pedagogical and commercial applications following from
that research. The few publications I have read so far plus the
visit to Louvain showed me a wide range of prospects. Let me
introduce but a few off-shoots of the ICLE project:

1. Contrastive Interlanguage Research/Analysis (C.I.A.)

     This can be performed on two planes: on the one hand, as an
extension of Contrastive Syntax, analysts and scholars can compare
and contrast how non-native and native speakers of a language (in
this case English, PK.) behave in comparable linguistic situations;
on the other hand, various non-native varieties of English (French,
German, Chinese, for example) can be collated with each other. The
results of these interlanguage analyses could then be examined in
the light of classic contrastive analysis of the native languages.
In fact, they offer a better chance of success in differentiating
between L1-dependent and universal, developmental errors in
language learning.

a) Classic error analysis.

     This is carried out on a smaller sub-corpus of 30,000 words.
Firstly, the essays have to be corrected manually (ideally, one
would like to rely on editing software here; however, the existing
programs are not dependable in detecting advanced learner semantic,
stylistic and other errors). In the second stage, the analyst looks
at the usage and frequencies of syntactic, lexical and discursive
elements of the text with the aid of retrieval packages offering
frequency and concordancing facilities, such as TACT (used in the
ICLE project; its special feature is that it provides information
on recurring phrases).

Example of findings:

     Concordances of the word 'possibility' in native and non-
     native (French) writing show that although the three types of
     grammatical structures present in the native corpus are also
     represented in the non-native one:

1. ... from the two-fold possibility for joining ...
2. ... possibility of identification ...
3. ... there seems every possibility that the present Queen ...

there is also a fourth, systematic type of occurrence which is
supported neither by the native corpus nor by any English grammar:

*4. ...students have the possibility to leave ...

   (excerpt from Prof. Sylviane Granger's paper "The place of
  computer learner corpora in corpus and applied linguistics")
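
     For those who have not seen a concordancer at work, the
principle behind listings such as the one above can be sketched in
a few lines of Python. This is a bare keyword-in-context routine
written for illustration only; it is not TACT, and the function
name and the default context width are assumptions.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of `keyword`."""
    key = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == key:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Run over the tokens of "students have the possibility to leave"
with a width of 2, it returns a single line: ("have the",
"possibility", "to leave"). Sorting such lines by their right-hand
context is what makes a pattern like "possibility to + verb" stand
out.
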


b) Foreign-soundingness in non-native essays.

     This is at least partially related to under- and
overrepresentation of certain words, expressions and structures in
non-native production as opposed to the native English texts.
Statistical investigations can reveal precisely which and what
types of structures are used too much, too little, or sometimes
simply wrongly.

Examples:

a) the analysis of the occurrence of the top 100 high-frequency
words (taken out of the total vocabulary in all the analysed
corpora) suggests that as we move from the more to the less
frequent vocabulary items, foreign (Finnish, Swedish and French)
learners, in contrast with native English writers, rely on them
increasingly heavily. The table in your handouts indicates this:

     Percentage of the 100 most frequent words out of total
     vocabulary:

                    1    1-10    1-30    1-50    1-70   1-100
     LOCNESS     6.9%   27.3%   39.1%   42.5%   46.2%   50.4%
     JYVASKY     5.3%   24.5%   37.8%   45.1%   50.1%   55.2%
     SICLE       4.9%   24.7%   38.0%   45.2%   50.1%   55.1%
     French1     6.4%   25.9%   38.7%   45.3%   49.8%   54.2%
     French2     5.9%   25.7%   39.2%   46.2%   52.4%   57.3%

LOCNESS - Native speaker corpus (mainly exam papers on French
literature)
JYVASKY - Finnish ICLE
SICLE - Swedish ICLE
French1 - French ICLE 1 (half - exam papers on English literature)
French2 - French ICLE 2

    (taken from the handout to one of the papers at the Louvain
                           symposium)


b) A more sophisticated analysis conducted on annotated corpora
reveals that French and Czech writers of English underuse the
preposition "over" in the sense of "in connection with", as in:
                     An argument over money
while they overuse it in the locative sense, by confusing it with,
for example, "during" (during the week) or "throughout" (throughout
the play).

  (taken from the paper "Preposition Usage in Learner Writing:
          Overuse, Underuse and Misuse" by Guy Monfort)

c) Among other subjects chosen for comparative statistical studies
are discourse connectors (French, Swedish) and discourse features
(Spanish). These studies tend to be meticulous and detailed, and
are sometimes very revealing. If carried out properly, they supply
the researcher with hard data that can almost unambiguously confirm
or disprove original intuitive hypotheses. In the study comparing
the number of words used in a basic clause structure (a so-called
T-Unit) by American and European Spanish writers, the original
assumption was that the latter preferred longer clauses because
they used more function words in them. A count of the function
words proved that hypothesis to be false and led the analysts to
the real source: the use of heavy embedding patterns by the
Spanish. Later the same embedding inclinations were attested in
the samples of English essays written by Spanish students (taken
from Joanne Neff's paper "Markers of Text Continuity").
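
     Figures like those in the high-frequency word table of
example a) can be reproduced for any untagged corpus with very
little machinery. The Python sketch below assumes that the
percentages are cumulative, i.e. that each column gives the share
of all running words accounted for by the 1, 10, 30 ... 100 most
frequent word forms; that reading of the table, like the function
name, is an assumption and is not confirmed by the handout.

```python
from collections import Counter


def coverage(tokens, ranks=(1, 10, 30, 50, 70, 100)):
    """Percentage of all tokens covered by the top-N most frequent word forms."""
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    if total == 0:
        return {rank: 0.0 for rank in ranks}
    # Frequencies in descending order, most frequent word form first.
    ordered = [count for _, count in counts.most_common()]
    return {rank: 100.0 * sum(ordered[:rank]) / total for rank in ranks}
```

For a toy text of ten tokens in which "the" occurs five times and
"of" three times, coverage(tokens, ranks=(1, 2)) gives 50.0 for
rank 1 and 80.0 for ranks 1-2. Comparing such profiles for a
learner corpus and a native one is what reveals the heavier
reliance on common words.
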

2. Extracting "tailor-made" subcorpora.

     Thanks to the computerised system of storage and the learner
profiles, a linguist interested in processes influenced by, or
related to, the writers' background, can easily retrieve and
assemble a sub-corpus, such as, for example, a subcorpus of those
Polish students of English who speak good German, who use a
particular type of dictionary, or who have spent more than one year
in an English-speaking country. Quick access to such quantitative
information does not mean it should be taken at face value;
nevertheless, the power of such reference tools is undeniable.
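
     In programming terms, extracting such a sub-corpus amounts to
filtering the essays on the fields of the learner profile. The
Python sketch below shows the idea; the field names are invented
for the example and do not reproduce the actual ICLE Learner
Profile form.

```python
from dataclasses import dataclass, field


@dataclass
class Essay:
    # Hypothetical profile fields attached to each essay.
    text: str
    mother_tongue: str
    other_languages: set = field(default_factory=set)
    months_abroad: int = 0


def subcorpus(essays, predicate):
    """Return the essays whose learner profile satisfies `predicate`."""
    return [essay for essay in essays if predicate(essay)]


def polish_with_german_and_year_abroad(essay):
    """Polish writers who know German and spent over a year abroad."""
    return (
        essay.mother_tongue == "Polish"
        and "German" in essay.other_languages
        and essay.months_abroad > 12
    )
```

Calling subcorpus(all_essays, polish_with_german_and_year_abroad)
would then return only the matching essays; any other combination
of profile fields can be expressed as a different predicate.
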

3. Devising better style and grammar checking software.

     On the one hand, the ICLE, especially the smaller 30,000-word
manually-corrected sub-corpora, provides an ideal testbed for
improving the existing style and grammar checkers; on the other, it
can be a valuable help in devising a new generation of computer
software packages suited to individual foreign varieties of
English. The task is not very easy, as my preliminary attempts to
customise GRAMMATIK 5 FOR WINDOWS have shown. Still, there
definitely is a need to remove the exclusively native-speaker
bias from the programs in use today. In fact, this may prove to be
the most rewarding of all the research opportunities involving the
ICLE, as publishers, too, have a vested interest in producing
effective software that will meet the specific market demand.

4. Stemming from the results of the Contrastive Interlanguage
Analysis and Error Analysis, new grammars, dictionaries, vocabulary
books and handbooks could be devised that would take the newly
identified learner needs into better account. Prior to that, if
properly disseminated, results of various comparative analyses of
the ICLE and native corpora could influence the day-to-day
performance of our writing teachers. Specific pedagogical hints
would be of great help, for instance, to native English tutors, who
would gain a clearer picture of the potential deficiencies of
written Polglish and could thus anticipate problems when drawing up
their syllabi.


The International Corpus of Learner English in the School of
English

     I am very grateful to Prof. W. Sobkowiak for conveying the
news of the ICLE project to me. A year after the idea of joining
that venture first dawned on me, I am glad to say that we are now
almost ready to start. In this respect, the trip to Belgium and the
chance to see for myself how fruitful corpus analyses are carried
out was a major boost for me. While in Louvain, Prof. Sobkowiak and I
presented the Director of the ICLE project with our "resume", i.e.
a copy of the information booklet about our School plus a report on
the structure of the written English courses run here. We also
reiterated our eagerness to join hands with the other ICLE
collaborators. Hopefully, the first results of our efforts will be
seen in March.
     We are currently looking towards the initial and the most
cumbersome stage of the venture: assembling the data. This is going
to be arranged through a cooperative programme with the third-year
writing teachers, who will mediate between the students -
contributors of argumentative essays - and me - the coordinator of
the scheme. We hope to obtain a substantial portion of the needed
material on diskettes; all the remaining texts are to be submitted
in typewritten form. This is likely to relieve us greatly of
the chore of manual encoding. The first round of the collected
essays - an estimated 70-100 - will be out-of-class assignments of
about 600 words, which is in perfect tune with the conditions laid
out in the writing syllabus. Ideally, at the end of this academic
year, we will have assembled a corpus of some 60,000 words, i.e.
nearly one-third of the target 200,000.
     Impatient as we are, we would like to be able to supply the
missing 140,000 words in the next academic year. Since the
number of participating writers will be, again, around one hundred
(only third-year students can be considered, since next year's
fourth-years will be this year's third-years who are barred from
entering a second time, while the second-year level is not advanced
enough), we shall be trying to establish links with other
universities in Poland that could help us with their supplies. I
hope to be able to inform you about good results of those contacts
before the end of this academic year. 
     If the idea of doing corpus-based linguistic research catches
on in our School, which I personally think it should, then soon the
need to found a new Centre or Department for Corpus Linguistics,
such as the one operating in Louvain, will arise. Once the ICLE
project successfully reaches its end, the Institute will have
acquired the necessary tools and expertise to develop dozens of
other written corpora. Here are some possibilities:

1. A corpus of English re-translations of texts that have first
been translated from English into Polish - together with the corpus
of the English originals. This could be divided into several mini-
corpora, depending on the nature of the texts involved (academic,
literary, scientific, economic etc).
2. A corpus of various English texts and their Polish translations
plus a corpus of original Polish productions in the respective
written styles.
3. A few parallel corpora along the lines of the ICLE (there is a
stipulation that no student can contribute more than 1,000 words
to one ICLE corpus; there has to be a way of dealing with surplus
material) - any analyses carried out on their basis would
supplement and verify the conclusions derived from the study of the
Polish ICLE proper.
4. A system of comparable corpora for monitoring the students'
progress through their subsequent years of study and for
"measuring" the level of advancement at each particular year. If
each academic year even an untagged corpus of, for example,
400-word expository essays was collected for Years I, II, III and
IV separately, the amount of material on which to draw diachronic
and synchronic comparisons would be abundant.
5. A special corpus of examination papers (this would require the
help of a team doing the encoding).
6. A separate corpus of literature papers (in addition to those
included in the ICLE).
7. Separate corpora of student essays for the British and the 
American variety.
8. Specialised subcorpora of examination essays that received a
common grade: 2, 3, 4, 5.
9. Specialised subcorpora of essays done by one group or even one
student only - for example for teachers to measure progress on the
use of certain structures.
10. More ...

     I have deliberately avoided referring to spoken corpora here,
since these lie outside the limited scope of my "jurisdiction". One
thing is undeniable - the same kind of research opportunities as we
saw realised by our colleague, the historical linguist Dr Marcin
Krygier, opportunities associated with unprecedentedly quick and
impartial access to potentially innumerable amounts of data, lie
before each of us. Corpus Linguistics is the sign of our times and
we ought to jump on the bandwagon without delay. I hope that the
success of the ICLE project and our participation in it will set
us in the right direction. Meanwhile, I would welcome any offers of
help and assistance in the process of acquiring the necessary ICLE
resources.
     On 12 or 13 April, during the INFOSYSTEM Fair, joining the
company of a few other respected colleagues from the Institute,
I shall give a short talk at the Seminar "Language and Technology
1995" organised by the European Commission. My audience is
expected to consist of representatives of scholarly, industrial,
commercial and administrative circles of the nation. The occasion
will be a splendid one for spreading the news about our ICLE
project, future plans and potential abilities. I hope to receive
positive responses, and once more wish to thank Prof. W. Sobkowiak
for providing me with that opportunity.

 




Last update: 26 October 1999