Corpus Linguistics

Dr. Gloria Cappelli
A/A 2006/2007 – University of Pisa
What is a CORPUS?
“A corpus is a collection of pieces
of language that are selected
and ordered according to
explicit linguistic criteria in order
to be used as a sample of the language”
(Sinclair 1996)
What is a CORPUS?
“[…] the term corpus as used in
modern linguistics can best be
defined as a collection of sampled
texts, written or spoken, in machine-readable form which may be
annotated with various forms of
linguistic information”
(McEnery, Xiao and Tono 2006)
Key concepts re. Corpora:
• Machine-readable texts
• Authentic texts
• Sampled texts
• Representative of a particular
language or language variety
Is Corpus Linguistics a new
approach to the study of language?
The expression Corpus Linguistics
first appeared in the early 80s.
Corpus-based language study,
however, has a substantial history.
Corpus-based language study
In the “pre-Chomskyan era”:
• Field linguists (Boas)
• Structuralists (Sapir, Newman, Bloomfield,
Pike, etc.)
“Corpora” were a few paper slips
with data. “Shoebox Corpora”:
Corpus-based only in that the
methodology was empirical and
based on observable data.
The 50s: the “protests”
Chomsky (1962) criticised the
(contemporary) corpus methodology
on account of the skewedness of corpora.
Non-representative, time-consuming,
competence vs. performance, I-language vs. E-language
Corpora were marginalized.
The revolutionary 60s
With the advances in computer
technology the exploitation of
massive corpora became feasible.
Brown Corpus
Brown University Standard Corpus of American Present-day English
The 80s: the boom
From the 80s onwards the number
and size of corpora and
corpus-based studies have increased
Corpora have revolutionized
almost all branches of linguistics.
A few remarks…
… allow us to speed up the
processing of data.
… avoid human bias in data
… allow the enrichment of data
with metadata
Intuition vs. Corpus
Intuition should be applied with caution:
• Influence of dialect, sociolect, idiolect…
• No universal agreement on
(degree of) acceptability
• Informants monitor their use of
language (non-spontaneous)
• Introspection is not observable
Intuition vs. Corpus
• Corpus-based approach draws
upon authentic or real texts
• Computer-based analysis can
retrieve differences that
intuition alone cannot
• Reliable quantitative data
Should we dismiss intuition then?
Not at all!
The key to using corpus data
is to find the balance between
the use of corpus data and
the use of one’s own intuition.
Should we dismiss intuition then?
Not all research questions can be
addressed by the corpus-based approach.
Corpus-based approach and
intuition-based approach
Leech (1991:14) writes…
“[…] Neither the corpus linguist
of the 1950s, who rejected
intuition, nor the general
linguist of the 1960s, who
rejected corpus data, was able
to achieve the interaction of
data coverage and the insight
that characterise the many
successful corpus analyses of
recent years”.
Is CL a methodology or a theory?
No universal agreement.
CL is a METHODOLOGY and not
an independent branch of
linguistics such as semantics,
pragmatics, syntax, etc.
CL can be employed to explore
almost any area of linguistic research.
Corpus-based or Corpus-driven approaches?
Corpus-based approaches are used to
“expound, test or exemplify theories
and descriptions that were formulated
before large corpora became available
to inform language study” (Tognini-Bonelli 2001).
Therefore, corpus-based linguists are
not strictly committed to corpus data
and they would discard “inconvenient
evidence” by insulation,
standardisation and instantiation (i.e.
via corpus annotation).
Corpus-based or Corpus-driven approaches?
Corpus-driven linguists are
“strictly committed to the
integrity of the data as a whole”.
Theoretical statements are fully
consistent with, and reflect
directly, the evidence provided
by the corpus.
(Tognini-Bonelli 2001:84-85).
Corpus-based or Corpus-driven approaches?
The distinction is overstated: these are 2 idealized extremes.
Basic differences between the 2 approaches:
Types of corpora used
Attitudes towards theories and intuitions
Focuses of research
Paradigmatic claims
C.B. Approaches
C.D. Approaches
Corpus must be representative and balanced;
Corpus will balance itself when it
grows to be big enough (cumulative representativeness);
Size is not all-important;
Corpus must be very large;
Minimum frequency is used to
exclude non-relevant results;
Corpus evidence is exploited fully,
but this way the number of the
combinations is enormous;
In favour of corpus annotation: CB
approaches generally have existing
theory as a starting point and
correct and revise such theory in
the light of corpus evidence;
Against corpus
annotation (no
preconceived theories)
Distinction between the different
levels of language analysis.
No distinction between
lexis, syntax, pragmatics,
etc. There is only 1 level
of language description:
the functionally complete
unit of meaning or language
We will only refer to
A few key notions in
Corpus Linguistics…
Essential feature of a corpus.
Balance (the range of genres
included in a corpus) and
sampling (how the text chunks
for each genre are selected)
ensure representativeness.
A corpus is representative if…
…the findings based on its contents
can be generalized to the
said language variety (Leech
…its samples include the full
range of variability in a
population (Biber 1993)
It changes over time (Hunston
2002): if a corpus is not regularly
updated, it rapidly becomes
Criteria to select texts for a corpus:
• External criteria (Biber’s situational
perspective): defined situationally, e.g. genres,
registers, text types, etc.
• Internal criteria (Biber’s linguistic perspective):
defined linguistically, taking into account the
distribution of linguistic features. CIRCULAR –
because a corpus is typically designed to study
linguistic distribution, so there is no point in
analysing a corpus where distribution of
linguistic features is predetermined.
2 main types (for the range of text
categories represented):
• General corpora – a basis for an
overall description of a language
(variety); their representativeness depends on
sampling from a broad range of genres.
• Specialized corpora – domain- or
genre-specific corpora; their representativeness can be
measured by the degree of closure or
saturation (lexical features).
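One rough way to picture closure or saturation is to count how many new word types each successive segment of a corpus adds; when the curve flattens, the lexicon of the domain is close to saturated. The sketch below is only an illustration under that reading: the function name `type_growth`, the segment size and the toy text are all assumptions, not part of the original slides.

```python
def type_growth(tokens, segment_size):
    """Return the cumulative number of distinct word types after each segment."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        seen.update(word.lower() for word in segment)
        growth.append(len(seen))
    return growth


# Toy, made-up "specialized" text; a real check would run over the whole corpus.
tokens = ("the pump feeds the reactor the reactor feeds the column "
          "the column feeds the pump the pump feeds the column").split()
growth = type_growth(tokens, segment_size=5)
print(growth)
```

If the last few values stop increasing, later segments are adding no new vocabulary, which is what a high degree of lexical closure would look like in this simplified view.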
The range of text categories included
in the corpus:
The acceptable balance is determined by the
intended uses.
A balanced corpus covers a wide
range of text categories which are
supposed to be representative of the
language (variety) under consideration.
There is no scientific measure for balance.
It is more important for
sample corpora than for
monitor corpora
A corpus is a sample of a given population.
A sample is representative if what
we find for the sample holds for
the general population
Samples are scaled-down
versions of a larger population
Sampling unit: for written text, a s.u.
could be a book, periodical or
Population: the assembly of all
sampling units; it can be defined in
terms of language production,
reception (demographic, sex, age,
etc.) or language as a product
(category, genre of language data).
Sampling frame: the list of sampling units.
Sampling techniques:
• Simple random sampling: all sampling
units within the sampling frame are
numbered and the sample is chosen by
use of a table of random numbers; rare
features may not be accounted for.
• Stratified random sampling: the
population is divided into relatively
homogeneous groups, i.e. the strata, and
each stratum is then sampled at random;
never less representative than simple
random sampling.
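The two techniques can be contrasted in a short sketch. It assumes a sampling frame held as a Python list of (genre, id) pairs; the function names, the fixed seed and the toy frame are illustrative assumptions, and Python's pseudo-random generator stands in for the table of random numbers.

```python
import random


def simple_random_sample(frame, k, seed=0):
    """Draw k sampling units from the whole frame, ignoring any grouping."""
    rng = random.Random(seed)
    return rng.sample(frame, k)


def stratified_random_sample(frame, stratum_of, k_per_stratum, seed=0):
    """Group the frame into strata, then sample each stratum at random."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for name in sorted(strata):
        units = strata[name]
        sample.extend(rng.sample(units, min(k_per_stratum, len(units))))
    return sample


# Hypothetical frame: 90 news texts and only 10 poetry texts.
frame = [("news", i) for i in range(90)] + [("poetry", i) for i in range(10)]
simple = simple_random_sample(frame, 10)
stratified = stratified_random_sample(frame, lambda unit: unit[0], 5)
print(simple)
print(stratified)
```

With simple random sampling the rare genre may be drawn only once or not at all, whereas the stratified version is guaranteed to include it.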
Sample size:
• Full texts = no balance; peculiarity
of individual texts may show through.
• Text chunks are sufficient (e.g.
2000 running words): frequent
linguistic features are stable in their
distribution and hence short text
chunks are sufficient for their study
(Biber 1993). Text initial, middle and
end samples must be balanced.
Proportion and number of samples:
The number of samples across
text categories should be
proportional to their frequencies
and/or weights in the target
population in order for the
resulting corpus to be considered
as representative
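As a worked illustration of proportional sampling, the sketch below distributes a fixed number of samples across text categories in proportion to their size in the target population. The category names and figures are invented, and the largest-remainder rounding is just one possible choice for handling fractions.

```python
def proportional_allocation(category_sizes, total_samples):
    """Allocate samples to categories in proportion to their population size."""
    total = sum(category_sizes.values())
    exact = {cat: total_samples * size / total
             for cat, size in category_sizes.items()}
    allocation = {cat: int(share) for cat, share in exact.items()}
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = total_samples - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - allocation[cat],
                          reverse=True)
    for cat in by_remainder[:leftover]:
        allocation[cat] += 1
    return allocation


# Hypothetical population: number of texts available per category.
population = {"press": 500, "fiction": 300, "academic": 200}
allocation = proportional_allocation(population, 50)
print(allocation)
```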
What matters is the
Research Question!
Claims of corpus representativeness and
balance should be interpreted in relative
terms as there is no objective way to
balance a corpus or to measure its representativeness.
Representativeness is a fluid concept:
the research question that one has in
mind when building a corpus
determines what is an acceptable
balance for the corpus one should use
and whether it is suitably representative.
Data collection
Spoken data must be transcribed
from audio recordings.
Written text must be rendered
machine-readable by keyboarding
or OCR (Optical Character
Recognition) scanning.
Language data so collected form a corpus.
Corpus Mark-up
System of standard codes inserted into
a document stored in electronic form to
provide information about the text itself
and govern formatting, printing
and other processes.
Most widely used mark-up schemes:
• TEI (Text Encoding Initiative)
• CES (Corpus Encoding Standard)
Corpus Mark-up
It is essential in corpus-building because…
…sampled texts are out of context and
mark-up allows contextual information to be recovered
…it provides more information
than the file names alone (re.
text types, sociolinguistic variables,
textual information – structure)
…it adds value to the corpus because it
allows for a broader range of questions
to be addressed
…it allows editorial comments to be inserted
during the corpus building process.
Corpus Mark-up
Extra-textual and textual information
must be kept separate from the
corpus data.
COCOA mark-up scheme
<A WILLIAM SHAKESPEARE>
A = author (attribute name)
WILLIAM SHAKESPEARE = attribute value
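A COCOA reference puts an attribute name and its value between angle brackets, e.g. <A WILLIAM SHAKESPEARE>. The sketch below shows how such references could be separated from the corpus data with a regular expression. The function name is an assumption, and the second code `T` (for title) in the sample is hypothetical and used only to show more than one reference.

```python
import re

# One COCOA reference: <NAME value>, where NAME is a short attribute code.
COCOA_REFERENCE = re.compile(r"<(\w+)\s+([^>]+)>")


def parse_cocoa(text):
    """Split a text into a dict of COCOA attributes and the remaining body."""
    metadata = {}
    for name, value in COCOA_REFERENCE.findall(text):
        metadata[name] = value.strip()
    body = COCOA_REFERENCE.sub("", text).strip()
    return metadata, body


# "T" is a hypothetical attribute code added for the example.
sample = "<A WILLIAM SHAKESPEARE>\n<T HAMLET>\nTo be, or not to be"
metadata, body = parse_cocoa(sample)
print(metadata)
print(body)
```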
TEI Mark-up Scheme
Each individual text is a document consisting
of a header and a body, each in turn composed
of different elements.
Ex. in the header there are 4 main elements:
• A file description <fileDesc>
• An encoding description <encodingDesc>
• A text profile <profileDesc>
• A revision history <revisionDesc>
Tags can be nested, i.e. they can appear
inside other elements.
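The nesting of the header elements can be sketched with Python's standard `xml.etree.ElementTree` module. This is only a skeleton for illustration: it omits the TEI namespace and most of the content a conformant header requires, and the function name and sample title are assumptions.

```python
import xml.etree.ElementTree as ET


def build_tei_skeleton(title):
    """Build a bare header-plus-body skeleton with the 4 main header elements."""
    tei = ET.Element("TEI")
    header = ET.SubElement(tei, "teiHeader")
    # Elements are nested: the title sits inside titleStmt inside fileDesc.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    for name in ("encodingDesc", "profileDesc", "revisionDesc"):
        ET.SubElement(header, name)
    text = ET.SubElement(tei, "text")
    ET.SubElement(text, "body")
    return tei


tei = build_tei_skeleton("Sample text")
header_children = [child.tag for child in tei.find("teiHeader")]
print(header_children)
print(ET.tostring(tei, encoding="unicode"))
```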
TEI Mark-up Scheme
It can be expressed using a number of
different formal languages.
SGML (Standard Generalized
Mark-up Language – used by
the BNC)
XML (Extensible Mark-up Language)
CES Mark-up Scheme
Designed specifically for the encoding
of language corpora.
• Document-wide mark-up (bibliographical
description, encoding description, etc.)
• Gross structural mark-up
(chapter, paragraph, footnotes, etc.; specifies
recommended character sets)
• Mark-up for subparagraph structures
(sentence, quotations, words, abbreviations, etc.)
CES Mark-up Scheme
It specifies a minimal encoding level that
corpora must achieve to be considered
standardized in terms of descriptive
representation as well as general architecture.
3 levels of standardization designed
to achieve the goal of universal
document interchange:
• Metalanguage level
• Syntactic level
• Semantic level
Corpus Annotation
Necessary in order to extract relevant
information from corpora.
“The process of adding […] interpretive,
linguistic information to an electronic
corpus of spoken and/or written
language data”
(Leech 1997)
Annotation vs. Mark-up
Corpus mark-up provides objective,
verifiable information.
Annotation is concerned with
interpretive linguistic information.
The advantages of annotation
1. It makes extracting
information easier and faster,
and enables human
analysts to exploit and
retrieve analyses of which
they are not themselves capable.
The advantages of annotation
2. Annotated corpora are reusable
3. Annotated corpora are
multifunctional: they can be
annotated with a purpose and
be reused with another.
The advantages of annotation
4. Corpus annotation records a
linguistic analysis explicitly.
5. Corpus annotation provides a
standard reference resource, a
stable base of linguistic analyses,
so that successive studies can be
compared and contrasted on a
common basis.
Criticisms of corpus annotation
1. Annotation produces cluttered corpora
2. Annotation imposes an analysis
3. Annotation overvalues corpora,
making them less accessible
4. Is annotation accurate and consistent?
How are corpora annotated?
• Automatic annotation
• Computer-assisted annotation
• Manual annotation
Sinclair (1992): the introduction
of the human element in corpus
annotation reduces consistency.
Types of annotation
Different types of annotation can
be carried out with different
For some types automatic
annotation is very accurate.
Other types require
post-editing, i.e. human
Types of annotation
Corpora can be annotated at
different levels of linguistic
• Phonological level
– Syllable boundaries
(phonetic/phonemic annotation)
– Prosodic features (prosodic annotation)
Types of annotation
• Morphological level
– Prefixes
– Suffixes
– Stems
(morphological annotation)
Types of annotation
• Lexical level
– Part of speech (POS Tagging)
– Lemmas (lemmatization)
– Semantic fields (semantic annotation)
• Syntactic level
– parsing
– treebanking
– bracketing
Types of annotation
• Discourse level
– Anaphoric relations (coreference
annotation)
– Speech acts (pragmatic
annotation)
– Stylistic features such as speech
and thought presentation
(stylistic annotation).
POS Tagging
POS tagging is the most common type of annotation.
Also known as grammatical tagging or
morpho-syntactic annotation.
It provides the basis of further forms
of analysis such as parsing and
semantic annotation.
Many linguistic analyses, e.g. the collocates
of a word, depend heavily on POS tagging.
POS Tagging
It can be performed automatically
with taggers like CLAWS
You can try it for free online.
Examples of tags: NN1 (singular noun), VVZ
(lexical verb in the third person singular of the
simple present tense), VVD (lexical verb in
the simple past form), AJ0
(adjective in the basic form), etc.
POS Tagging
• Word segmentation (tokenization)
– Multiwords (so that, in spite of)
– Mergers (can’t, gonna)
– Variably spelled compounds
(noticeboard, notice-board, notice board)
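The two segmentation problems of multiwords and mergers can be made concrete with a toy tokenizer. It assumes tiny hand-written lookup tables and whitespace splitting only, with no punctuation handling. The split forms `ca` + `n't` and `gon` + `na` follow the tagged BNC sample shown later in these slides; everything else is an illustrative assumption and not how CLAWS itself works.

```python
# Hypothetical lookup tables covering only the examples on the slide.
MERGERS = {"can't": ["ca", "n't"], "gonna": ["gon", "na"]}
MULTIWORDS = [("in", "spite", "of"), ("so", "that")]


def tokenize(text):
    """Split on whitespace, split mergers, then join known multiword units."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(MERGERS.get(word, [word]))
    merged = []
    i = 0
    while i < len(tokens):
        for unit in MULTIWORDS:
            if tuple(tokens[i:i + len(unit)]) == unit:
                merged.append("_".join(unit))  # one token for the whole unit
                i += len(unit)
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = tokenize("I can't go in spite of that")
print(tokens)
```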
Type of annotation that reduces the
inflectional variants of words to their
respective lexemes or lemmas as they
appear in dictionary entries:
Do, does, did, done, doing= DO
Corpus, corpora= CORPUS
Small capital letters are the convention.
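In its simplest form, lemmatization can be pictured as a lookup from word forms to lemmas. The sketch below uses only the forms listed above; the table, the function name and the sample sentence are illustrative assumptions, and real lemmatizers rely on far larger lexicons and on POS information.

```python
from collections import Counter

# Word form -> lemma, limited to the examples given on the slide.
LEMMAS = {
    "do": "DO", "does": "DO", "did": "DO", "done": "DO", "doing": "DO",
    "corpus": "CORPUS", "corpora": "CORPUS",
}


def lemmatize(tokens):
    """Pair each token with its lemma, or None if the form is not in the table."""
    return [(token, LEMMAS.get(token.lower())) for token in tokens]


pairs = lemmatize("She did what corpora do".split())
# Counting lemmas rather than word forms groups the inflectional variants.
lemma_counts = Counter(lemma for _, lemma in pairs if lemma is not None)
print(pairs)
print(lemma_counts)
```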
It is important in vocabulary studies
and lexicography, e.g. in studying
the distribution pattern of
lexemes and improving dictionaries
and computer
It can be automatically performed.
Once a corpus is POS tagged, it
is possible to bring these
morpho-syntactic categories
into higher level syntactic
relationships with one another,
that is, to analyse the
sentences in a corpus into their constituents.
Parsing consists in bracketing.
It can be automated but with a
low precision rate.
(NP a
(ADJP very nice)
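Labelled bracketing of this kind maps naturally onto a nested data structure. The sketch below reads a bracketed string into nested (label, children) tuples. The example phrase and its word-level labels are made up for illustration and do not reproduce the slide's example, and malformed input is not handled.

```python
def parse_brackets(text):
    """Parse '(LABEL child ...)' into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # tokens[pos] is "(" and the constituent label follows it.
        label = tokens[pos + 1]
        children = []
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                children.append(child)
            else:
                children.append(tokens[pos])
                pos += 1
        return (label, children), pos + 1

    tree, _ = read(0)
    return tree


def leaves(node):
    """Collect the words of a constituent from left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


# Hypothetical bracketing; the labels are illustrative only.
tree = parse_brackets("(NP (AT a) (ADJP (AV very) (AJ nice)) (NN day))")
print(tree)
print(leaves(tree))
```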
Semantic annotation
It assigns codes indicating the semantic
features or the semantic fields of the
words in a text. It is knowledge-based
so it needs to be manual most of
the time.
Two types:
– One marks the semantic relationships
between the constituents in a sentence
– One marks the semantic features of
words in a text
Coreference annotation
• Pronouns
• Repetition
• Substitution
• Ellipsis
Computer-assisted at best.
Pragmatic annotation
• Speech/dialogue acts in domain-specific
The most coherent system is DRI
(Discourse Resource Initiative).
3 layers of coding:
– Segmentation (dividing dialogue into
textual units, utterances)
– Functional annotation (dialogue act
annotation)
– Utterance tags (applying utterance tags that
characterize the role of the utterance as a
dialogue act)
Pragmatic annotation
Utterance tags:
– Communicative status (intelligible,
complete, etc.)
– Information level and status (indicating
the semantic content of the utterance and
how it relates to the task in question)
– Forward-looking communicative function
(utterances that may constrain or affect
the discourse, e.g. assert, request,
question and offer)
– Backward-looking communicative
function (utterances that relate to
previous parts of the discourse, e.g.
accept, backchannelling, answer)
Stylistic annotation
It is particularly associated with
stylistic features in literary texts.
An example: the representation
of people’s speech and thoughts,
known as speech and thought
presentation (S&TP)
Other types of tagging
• Error tagging
• Problem-oriented annotation
Types of corpora
• Multilingual
• Monolingual
Multilingual Corpora
• Parallel corpora (source texts
plus translations): Canadian Hansard
• Comparable corpora
(monolingual subcorpora
designed using the same
sampling techniques): Aarhus
corpus of contract law
– Multilingual
– Bilingual
Multilingual Corpora
Important resources for translation and
contrastive studies.
Multilingual corpora…
• …give new insight into the language
• …can be used to study language
specific and universal features
• …illuminate differences between
source texts and translations
• …can be used for a number of practical
applications, in lexicography, language
teaching, translation, etc.
Parallel Corpora
• Bilingual vs. Multilingual
• Unidirectional (from La to Lb or
from Lb to La alone) vs.
Bidirectional (from La to Lb
and from Lb to La) vs.
Multidirectional (from La to Lb,
Lc etc.)
Comparable corpora
A corpus containing components
that are collected using the
same sampling techniques and
similar balance and
representativeness, e.g. the
same proportions of the texts
of the same genres in the same
domains in a range of different
languages in the same
sampling period.
Comparable vs. parallel
The sampling frame is
essential for comparable
corpora but not for parallel
corpora because the texts
are exact translations of
each other.
Corpus Alignment
In order for us to be able to fully
exploit parallel corpora, they need
to be aligned.
Different types of alignment:
• Word-level alignment
• Sentence-level alignment
• Paragraph alignment
General Corpora
• British National Corpus
(100,106,008 words)
• The American National Corpus
Specialized Corpora
• Guangzhou Petroleum English Corpus
(411,612 words of written English
from the petrochemical domain)
• HKUST Computer Science Corpus
(1,000,000 words of written English
sampled from undergraduate
textbooks in computer science)
• CPSA (Corpus of Professional Spoken
American English)
• MICASE (1,700,000 words of English
spoken in the academic domain)
Written Corpora
• BROWN Corpus (written texts, AE
in 1961)
• LOB Corpus (Comparable to
BROWN Corpus, BE, early 1960s)
• FROWN Corpus (AE, early 1990s)
• FLOB Corpus (BE, early 1990s)
Spoken Corpora
• London-Lund Corpus (LLC)
• Lancaster/IBM Spoken English
Corpus (SEC)
• Cambridge and Nottingham
Corpus of Discourse in English
• Santa Barbara Corpus of Spoken
American English (SBCSAE)
• Wellington Corpus of Spoken New
Zealand English (WSC)
Synchronic Corpora
Useful to compare varieties of
English. Texts all date to the same period.
• Brown and LOB
• Frown and FLOB
• International Corpus of English
(ICE) (Texts produced after 1989)
Diachronic Corpora
Texts date to different periods in time.
Ideal to study language change and
• Brown/Frown
• LOB/FLOB
• Helsinki Diachronic Corpus of
English Texts (8th-18th century)
• ARCHER Corpus – A Representative
Corpus of Historical English
Registers (BE and AE, 1650-1990).
Lstr or L2 acquisition/L1 acquired by
• International Corpus of
Learner English – ICLE (LC)
• Cambridge Learner Corpus
Monitor Corpora
Constantly supplemented with fresh
material, they keep increasing in
size, though the proportion of text
types included in the corpus
remains constant.
• Bank of English (BoE)
• Global English Monitor Corpus
The British National Corpus
(BNC) is a 100 million word
collection of samples of written
and spoken language from a
wide range of sources,
designed to represent a wide
cross-section of current British
English, both spoken and written.
The written part of the BNC (90%) includes,
for example, extracts from regional and
national newspapers, specialist periodicals
and journals for all ages and interests,
academic books and popular fiction,
published and unpublished letters and
memoranda, school and university essays,
etc. The spoken part (10%) includes a large
amount of unscripted informal conversation,
recorded by volunteers selected from
different age, region and social classes in a
demographically balanced way, together with
spoken language collected in all kinds of
different contexts, ranging from formal
business or government meetings to radio
shows and phone-ins.
The corpus is encoded according to
the Guidelines of the Text Encoding
Initiative (TEI) to represent both
the output from CLAWS (automatic
part-of-speech tagger) and a
variety of other structural
properties of texts (e.g. headings,
paragraphs, lists etc.). Full
classification, contextual and
bibliographic information is also
included with each text in the form
of a TEI-conformant header.
What sort of corpus is the BNC?
Monolingual: It deals with modern British English, not other
languages used in Britain. However non-British English and
foreign language words do occur in the corpus.
Synchronic: It covers British English of the late twentieth
century, rather than the historical development which
produced it.
General: It includes many different styles and varieties, and is
not limited to any particular subject field, genre or register.
In particular, it contains examples of both spoken and
written language.
Sample: For written sources, samples of 45,000 words are
taken from various parts of single-author texts. Shorter
texts up to a maximum of 45,000 words, or multi-author
texts such as magazines and newspapers, are included in
full. Sampling allows for a wider coverage of texts within the
100 million limit, and avoids over-representing idiosyncratic texts.
BNC and Sketch Engine
Sketch Engine is an excellent user interface for querying the BNC.
Here are some screenshots.
An example text before POS tagging
I've been giving some thought to the
whole idea of writing a book as of late
(I've also been giving some thought to
winning the lottery, and we can all see
where that's got me) and it came to me
while showering the other night that if I
were to ever write a book (which ain't
gonna happen, but let's just say for the
sake of argument) I would bill myself as
the anti-Francis Mayes.
An example of a POS-tagged text
I_PNP 've_VHB been_VBN giving_VVG some_DT0
thought_NN1 to_PRP the_AT0 whole_AJ0 idea_NN1
of_PRF writing_VVG a_AT0 book_NN1 as_PRP21
of_PRP22 late_AJ0 (_( I_PNP 've_VHB also_AV0
been_VBN giving_VVG some_DT0 thought_NN1 to_PRP
winning_VVG the_AT0 lottery_NN1 ,_, and_CJC
we_PNP can_VM0 all_DT0 see_VVI where_AVQ
that_DT0 's_VHZ got_VVN me_PNP )_) and_CJC it_PNP
came_VVD to_PRP me_PNP while_CJS showering_VVG
the_AT0 other_AJ0 night_NN1 that_CJT if_CJS I_PNP
were_VBD to_TO0 ever_AV0 write_VVI a_AT0
book_NN1 (_( which_DTQ ai_UNC n't_XX0 gon_VVG
na_TO0 happen_VVI ,_, but_CJC let_VM021 's_VM022
just_AV0 say_VVI for_PRP the_AT0 sake_NN1 of_PRF
argument_NN1 )_) I_PNP would_VM0 bill_NN1
myself_PNX as_PRP the_AT0 anti-Francis_AJ0
Mayes_NP0 ._.
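Text in this word_TAG format is easy to process once each token is split at its last underscore. The sketch below does this for the first line of the sample above and counts the tags. The function name is an assumption, and the tag values are taken as given by the tagger, errors included.

```python
from collections import Counter


def parse_tagged(text):
    """Turn 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split at the last underscore so the word itself may contain one.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs


tagged = ("I_PNP 've_VHB been_VBN giving_VVG some_DT0 "
          "thought_NN1 to_PRP the_AT0 whole_AJ0 idea_NN1")
pairs = parse_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(tag_counts)
print(nouns)
```

Filtering on the tag rather than the word form is what makes queries such as "all singular nouns" or "thought as a noun only" possible.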