Corpus Linguistics
Introducing
Corpus Linguistics
Dr. Gloria Cappelli
A/A 2006/2007 – University of Pisa
What is a CORPUS?
“A corpus is a collection of pieces
of language that are selected
and ordered according to
explicit linguistic criteria in order
to be used as a sample of the
language”
(Sinclair 1996)
What is a CORPUS?
“[…] the term corpus as used in
modern linguistics can best be
defined as a collection of sampled
texts, written or spoken, in machine-readable form which may be
annotated with various forms of
linguistic information”
(McEnery, Xiao and Tono 2006)
Key concepts re. Corpora:
• Machine-readable texts
• Authentic texts
• Sampled texts
• Representative of a particular
language or language variety
Is Corpus Linguistics a new
approach to the study of language?
The expression Corpus Linguistics
first appeared in the early 80s.
Corpus-based language study,
however, has a substantial
history.
Corpus-based language study
In the “pre-Chomskyan era”:
• Field linguists (Boas)
• Structuralists (Sapir, Newman, Bloomfield,
Pike, etc.)
“Corpora” were a few paper slips
with data: “shoebox corpora”.
Non-representative.
Corpus-based only in that the
methodology was empirical and
based on observable data.
The 50s: the “protests”
Chomsky (1962) criticized the
(contemporary) corpus methodology
on account of the skewedness of
corpora.
Non-representative, time-consuming,
competence vs. performance, I-language vs. E-language
Corpora were marginalized.
The revolutionary 60s
With the advances in computer
technology the exploitation of
massive corpora became feasible.
Brown Corpus
Brown University Standard Corpus of American Present-day English
The 80s: the boom
From the 80s onwards the number
and size of corpora and corpus-based
studies have increased
dramatically.
Corpora have revolutionized
almost all branches of linguistics.
A few remarks…
Computers…
… allow us to speed up the
processing of data.
… avoid human bias in data
analysis
… allow the enrichment of data
with metadata
Intuition vs. Corpus
Intuition should be applied with caution:
• Influence of dialect, sociolect, idiolect…
• No universal agreement on
(degree of) acceptability
• Informants monitor their use of
language (non-spontaneous)
• Introspection is not observable
Intuition vs. Corpus
• Corpus-based approach draws
upon authentic or real texts
• Computer-based analysis can
retrieve differences that
intuition alone cannot
perceive
• Reliable quantitative data
Should we dismiss intuition then?
Not at all!
The key to using corpus data
is to find the balance between
the use of corpus data and
the use of one’s own intuition.
Should we dismiss intuition then?
Not all research questions can be
addressed by the corpus-based
approach.
Corpus-based approach and
intuition-based approach
ARE NOT MUTUALLY EXCLUSIVE
Leech (1991:14) writes…
“[…] Neither the corpus linguist
of the 1950s, who rejected
intuition, nor the general
linguist of the 1960s, who
rejected corpus data, was able
to achieve the interaction of
data coverage and the insight
that characterise the many
successful corpus analyses of
recent years”.
Is CL a methodology or a theory?
No universal agreement.
CL is a METHODOLOGY and not
an independent branch of
linguistics such as semantics,
pragmatics, syntax, etc.
CL can be employed to explore
almost any area of linguistic
research.
Corpus-based or Corpus-driven approaches?
Corpus-based approaches are used to
“expound, test or exemplify theories
and descriptions that were formulated
before large corpora became available
to inform language study” (Tognini-Bonelli
2001:65).
Therefore, corpus-based linguists are
not strictly committed to corpus data
and they would discard “inconvenient
evidence” by insulation,
standardisation and instantiation (i.e.
via corpus annotation).
Corpus-based or Corpus-driven approaches?
Corpus-driven linguists are
“strictly committed to the
integrity of the data as a
whole”.
Theoretical statements are fully
consistent with, and reflect
directly, the evidence provided
by the corpus.
(Tognini-Bonelli 2001:84-85).
Corpus-based or Corpus-driven approaches?
The distinction is overstated: they are 2 idealized
extremes.
4 basic differences between the 2 approaches:
• Types of corpora used
• Attitudes towards theories and intuitions
• Focuses of research
• Paradigmatic claims
C.B. Approaches
• Corpus must be representative and
balanced;
• Size is not all-important;
• Minimum frequency is used to
exclude non-relevant results;
• In favour of corpus annotation: CB
approaches generally have existing
theory as a starting point and
correct and revise such theory in
the light of corpus evidence;
• Distinction between the different
levels of language analysis.
C.D. Approaches
• Corpus will balance itself when it
grows to be big enough (cumulative
representativeness);
• Corpus must be very large;
• Corpus evidence is exploited fully,
but this way the number of
combinations is enormous;
• Against corpus annotation (no
preconceived theories);
• No distinction between lexis,
syntax, pragmatics, etc. There is
only 1 level of language description:
the functionally complete unit of
meaning or language patterning.
We will only refer to
CORPUS-BASED APPROACHES
A few key notions in
Corpus Linguistics…
Representativeness
Essential feature of a corpus.
Balance (the range of genres
included in a corpus) and
sampling (how the text chunks
for each genre are selected)
ensure representativeness.
Representativeness
A corpus is representative if…
…the findings based on its contents
can be generalized to the
said language variety (Leech
1991);
…its samples include the full
range of variability in a
population (Biber 1993)
Representativeness
It changes over time (Hunston
2002): if a corpus is not regularly
updated, it rapidly becomes
unrepresentative.
Representativeness
Criteria to select texts for a corpus:
• External criteria (Biber’s situational
perspective): defined situationally, e.g. genres,
registers, text types, etc.
• Internal criteria (Biber’s linguistic perspective):
defined linguistically, taking into account the
distribution of linguistic features. CIRCULAR –
because a corpus is typically designed to study
linguistic distribution, so there is no point in
analysing a corpus where distribution of
linguistic features is predetermined.
Representativeness
2 main types (for the range of text
categories represented):
• General corpora – a basis for an
overall description of a language
(variety); their r. depends on the
sampling from a broad range of
genres.
• Specialized corpora – domain- or
genre-specific corpora; their r. can be
measured by the degree of closure or
saturation (lexical features).
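One possible way to look at closure or saturation is to track how many new word types appear as more text is added to a specialized corpus. The sketch below is only an illustration under that assumption: the function name, the segment size and the toy token list are invented here, and the slides do not prescribe a specific measure.

```python
def type_growth(tokens, segment_size):
    """Return the cumulative number of distinct word types after each segment."""
    seen = set()
    curve = []
    for start in range(0, len(tokens), segment_size):
        seen.update(tokens[start:start + segment_size])
        curve.append(len(seen))
    return curve


# Toy data, invented for illustration only.
tokens = "the pump feeds the valve the valve feeds the pump the pump stops".split()
curve = type_growth(tokens, segment_size=4)
print(curve)
```

A curve that flattens out as segments are added would suggest that the vocabulary of the domain is approaching closure; a curve that keeps rising suggests that more text is needed.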
Balance
The range of text categories included
in the corpus:
The acceptable b. is determined by the
intended uses.
A balanced corpus covers a wide
range of text categories which are
supposed to be representative of the
language (variety) under
consideration.
Balance
There is no scientific measure for
balance.
It is more important for
sample corpora than for
monitor corpora
Sampling
A corpus is a sample of a given
population
A sample is representative if what
we find for the sample holds for
the general population
Samples are scaled-down
versions of a larger population
Sampling
Sampling unit: for written text, a s.u.
could be a book, periodical or
newspaper.
Population: the assembly of all
sampling units; it can be defined in
terms of language production,
reception (demographic, sex, age,
etc.) or language as a product
(category, genre of language data).
Sampling frame: the list of sampling
units
Sampling
Sampling techniques:
• Simple random sampling: all sampling
units within the sampling frame are
numbered and the sample is chosen by
use of a table of random numbers; rare
features may not be accounted for.
• Stratified random sampling: the
population is divided into relatively
homogeneous groups, i.e. the strata, and
these are then sampled at random;
never less representative than the
former method.
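The two techniques can be sketched in a few lines of Python. This is a minimal illustration, not a corpus-building tool: the sampling frame, the genre labels and the function names are hypothetical, and Python's pseudo-random generator stands in for the table of random numbers.

```python
import random


def simple_random_sample(frame, k, seed=0):
    """Draw k sampling units at random from the whole sampling frame."""
    rng = random.Random(seed)
    return rng.sample(frame, k)


def stratified_random_sample(frame, stratum_of, per_stratum, seed=0):
    """Group units into strata, then sample at random within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for name in sorted(strata):
        units = strata[name]
        sample.extend(rng.sample(units, min(per_stratum, len(units))))
    return sample


# Hypothetical sampling frame: (text id, genre). Poetry is the rare category.
frame = [(f"news{i:02d}", "news") for i in range(18)]
frame += [("poem00", "poetry"), ("poem01", "poetry")]

srs = simple_random_sample(frame, 4)
strat = stratified_random_sample(frame, stratum_of=lambda u: u[1], per_stratum=2)
print(srs)
print(strat)
```

With simple random sampling the rare genre may or may not be drawn at all, whereas the stratified sample always contains units from every stratum.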
Sampling
Sample size:
• Full texts = no balance; peculiarity
of individual texts may show through.
• Text chunks are sufficient (e.g.
2000 running words): frequent
linguistic features are stable in their
distribution and hence short text
chunks are sufficient for their study
(Biber 1993). Text initial, middle and
end samples must be balanced.
Sampling
Proportion and number of samples:
The number of samples across
text categories should be
proportional to their frequencies
and/or weights in the target
population in order for the
resulting corpus to be considered
as representative
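The idea of proportional allocation can be made concrete with a small calculation. The category names and percentage shares below are invented for illustration; in practice the weights would come from whatever description of the target population is available.

```python
def proportional_allocation(weights, total_samples):
    """Split total_samples across categories in proportion to their weights."""
    total_weight = sum(weights.values())
    return {
        category: round(total_samples * weight / total_weight)
        for category, weight in weights.items()
    }


# Hypothetical shares of each text category in the target population (percent).
weights = {"press": 40, "fiction": 35, "academic": 25}
allocation = proportional_allocation(weights, total_samples=500)
print(allocation)
```

Because of rounding, the allocated numbers may not always add up exactly to the requested total; a real design would need a rule for distributing the remainder.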
What matters is the
Research Question!
Claims of corpus representativeness and
balance should be interpreted in relative
terms as there is no objective way to
balance a corpus or to measure its
representativeness.
Representativeness is a fluid concept:
the research question that one has in
mind when building a corpus
determines what is an acceptable
balance for the corpus one should use
and whether it is suitably
representative.
Data collection
Spoken data must be transcribed
from audio recordings.
Written text must be rendered
machine-readable by keyboarding
or OCR (Optical Character
Recognition) scanning.
Language data so collected form a
RAW CORPUS.
Corpus Mark-up
System of standard codes inserted into
a document stored in electronic form to
provide information about the text itself
and govern formatting, printing
and other processes.
Most widely used mark-up schemes:
• TEI (Text Encoding Initiative)
• CES (Corpus Encoding Standard)
Corpus Mark-up
It is essential in corpus-building because…
…sampled texts are out of context and
mark-up allows contextual information to be recovered
…it provides more information
than the file names alone (re.
text types, sociolinguistic variables,
textual information – structure)
…it adds value to the corpus because it
allows for a broader range of questions
to be addressed
…it allows editorial comments to be inserted
during the corpus building process.
Corpus Mark-up
Extra-textual and textual information
must be kept separate from the
corpus data.
Examples:
COCOA mark-up scheme
<A WILLIAM SHAKESPEARE>
A= author, attribute name
WILLIAM SHAKESPEARE= attribute value
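A COCOA reference of this shape is easy to read mechanically. The sketch below assumes the simple form shown above (an attribute name, whitespace, then the value, all inside angle brackets); the function name and the regular expression are illustrative and do not cover every variant of the scheme.

```python
import re

# <NAME VALUE>: one attribute name, whitespace, then the attribute value.
COCOA_TAG = re.compile(r"<(\w+)\s+([^>]*)>")


def parse_cocoa(line):
    """Return (attribute name, attribute value), or None if the line is not a tag."""
    match = COCOA_TAG.fullmatch(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


print(parse_cocoa("<A WILLIAM SHAKESPEARE>"))
```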
TEI Mark-up Scheme
Each individual text is a document consisting
of a header and a body, in turn composed
of different elements.
Ex. in the header there are 4 main elements:
• A file description <fileDesc>
• An encoding description <encodingDesc>
• A text profile <profileDesc>
• A revision history <revisionDesc>
Tags can be nested, i.e. they can appear
inside other elements.
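To make the nesting visible, the skeleton of such a header can be built with Python's standard XML library. This is only a sketch: the four element names come from the list above, while the enclosing teiHeader element and the titleStmt/title pair are taken from the TEI Guidelines rather than from these slides, and the title text is a placeholder. The result is not a complete or valid TEI document.

```python
import xml.etree.ElementTree as ET

# The header element that holds the four main parts.
header = ET.Element("teiHeader")
for name in ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc"):
    ET.SubElement(header, name)

# Nesting: a title statement inside the file description.
title_stmt = ET.SubElement(header.find("fileDesc"), "titleStmt")
ET.SubElement(title_stmt, "title").text = "Sample text"  # placeholder value

xml_string = ET.tostring(header, encoding="unicode")
print(xml_string)
```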
TEI Mark-up Scheme
It can be expressed using a number of
different formal languages.
SGML (Standard Generalized
Mark-up Language – used by
the BNC)
XML (Extensible Mark-up Language)
CES Mark-up Scheme
Designed specifically for the encoding
of language corpora.
• Document-wide mark-up (bibliographical
description, encoding description, etc.)
• Gross structural mark-up (volume, chapter,
paragraph, footnotes, etc.; specifies
recommended character sets)
• Mark-up for subparagraph structures
(sentence, quotations, words, abbreviations, etc.)
CES Mark-up Scheme
It specifies a minimal encoding level that
corpora must achieve to be considered
standardized in terms of descriptive
representation as well as general architecture.
3 levels of standardization designed
to achieve the goal of universal
document interchange:
• Metalanguage level
• Syntactic level
• Semantic level
Corpus Annotation
Necessary in order to extract relevant
information from corpora.
“The process of adding […] interpretive,
linguistic information to an electronic
corpus of spoken and/or written
language data”
(Leech 1997)
Annotation vs. Mark-up
Corpus mark-up provides objective,
verifiable information.
Annotation is concerned with
interpretive linguistic
information.
The advantages of
annotation
1. It makes extracting
information easier, faster
and enables human
analysts to exploit and
retrieve analyses of which
they are not themselves
capable.
The advantages of
annotation
2. Annotated corpora are reusable
resources.
3. Annotated corpora are
multifunctional: they can be
annotated with a purpose and
be reused with another.
The advantages of
annotation
4. Corpus annotation records a
linguistic analysis explicitly.
5. Corpus annotation provides a
standard reference resource, a
stable base of linguistic analyses,
so that successive studies can be
compared and contrasted on a
common basis.
Criticisms of corpus annotation
1. Annotation produces cluttered
corpora
2. Annotation imposes an analysis
3. Annotation overvalues corpora
making them less accessible
4. Is annotation accurate and
consistent?
How are corpora annotated?
• Automatic annotation
• Computer-assisted annotation
• Manual annotation
Sinclair (1992): the introduction
of the human element in corpus
annotation reduces consistency.
Types of annotation
Different types of annotation can
be carried out with different
means.
For some types automatic
annotation is very accurate.
Other types require
post-editing, i.e. human
correction.
Types of annotation
Corpora can be annotated at
different levels of linguistic
analysis.
• Phonological level
– Syllable boundaries
(phonetic/phonemic annotation)
– Prosodic features (prosodic
annotation)
Types of annotation
• Morphological level
– Prefixes
– Suffixes
– Stems
(morphological annotation)
Types of annotation
• Lexical level
– Part of speech (POS Tagging)
– Lemmas (lemmatization)
– Semantic fields (semantic
annotation)
• Syntactic level
– parsing
– treebanking
– bracketing
Types of annotation
• Discourse level
– Anaphoric relations (coreference
annotation)
– Speech acts (pragmatic
annotation)
– Stylistic features such as speech
and thought presentation
(stylistic annotation).
POS Tagging
POS tagging is the most common type of
annotation.
Also known as grammatical tagging or
morpho-syntactic annotation.
It provides the basis of further forms
of analysis such as parsing and
semantic annotation.
Many linguistic analyses, e.g. of the collocates
of a word, depend heavily on POS tagging.
POS Tagging
It can be performed automatically
with taggers like CLAWS
http://www.comp.lancs.ac.uk/ucrel/claws/
You can try it for free online.
Examples of tags: NN1 (singular common
noun), VVZ (-s form of a lexical verb, i.e.
third person singular of the simple
present), VVD (past tense form of a lexical
verb), AJ0 (adjective in the basic form), etc.
POS Tagging
Problems:
• Word segmentation (tokenization)
– Multiwords (so that, in spite of)
– Mergers (can’t, gonna)
– Variably spelled compounds
(noticeboard, notice-board, notice
board)
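One way taggers can handle multiwords is to tag each orthographic word separately and mark that the words belong together. In CLAWS output this appears as two extra digits on the tag, e.g. as_PRP21 of_PRP22 for the two-word preposition "as of". The sketch below regroups such sequences into single units. It rests on the assumption that the first extra digit gives the length of the unit and the second the position of the word within it; the function name and regular expression are illustrative.

```python
import re

# Base tag followed by two digits: size of the unit, then position in the unit.
DITTO = re.compile(r"^(?P<tag>.+?)(?P<size>[2-9])(?P<index>[1-9])$")


def group_multiwords(pairs):
    """Merge consecutive (word, tag) pairs that carry multiword markers."""
    units = []
    for word, tag in pairs:
        match = DITTO.match(tag)
        if match and match.group("index") != "1" and units:
            # Continuation of a multiword unit: attach to the previous unit.
            previous_words, previous_tag = units[-1]
            units[-1] = (previous_words + " " + word, previous_tag)
        elif match:
            # First word of a multiword unit: keep only the base tag.
            units.append((word, match.group("tag")))
        else:
            units.append((word, tag))
    return units


pairs = [("book", "NN1"), ("as", "PRP21"), ("of", "PRP22"), ("late", "AJ0")]
print(group_multiwords(pairs))
```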
Lemmatization
Type of annotation that reduces the
inflectional variants of words to their
respective lexemes or lemmas as they
appear in dictionary entries:
Do, does, did, done, doing= DO
Corpus, corpora= CORPUS
Small capital letters are the convention.
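At its simplest, lemmatization can be pictured as a table look-up from word forms to lemmas. The sketch below uses only the two examples above; the table, the function name and the fallback rule (an unknown form is treated as its own lemma) are assumptions made for illustration, and real lemmatizers rely on much larger lexicons and usually on POS tags as well.

```python
# Word form -> lemma, using the two examples from the slide.
LEMMA_TABLE = {
    "do": "DO", "does": "DO", "did": "DO", "done": "DO", "doing": "DO",
    "corpus": "CORPUS", "corpora": "CORPUS",
}


def lemmatize(tokens):
    """Map each token to its lemma; unknown forms fall back to themselves."""
    return [LEMMA_TABLE.get(token.lower(), token.upper()) for token in tokens]


print(lemmatize(["Did", "corpora", "doing"]))
```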
Lemmatization
It is important in vocabulary studies
and lexicography, e.g. in studying
the distribution pattern of
lexemes and improving dictionaries
and computer
lexicons.
It can be automatically performed.
Parsing
Once a corpus is POS tagged, it
is possible to bring these
morpho-syntactic categories
into higher level syntactic
relationships with one another,
that is, to analyse the
sentences in a corpus into their
constituents.
Parsing consists in bracketing.
It can be automated but with a
low precision rate.
Parsing
Example:
(S
  (NP Mary)
  (VP visited
    (NP a
      (ADJP very nice)
      boy)))
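A bracketed analysis like this one is straightforward to read into a nested data structure. The sketch below takes the analysis of "Mary visited a very nice boy" written on one line, with each constituent label directly after its opening bracket, and turns it into nested Python lists. The function names are illustrative and the reader assumes well-formed input.

```python
def read_brackets(text):
    """Turn a labelled bracketing into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def parse():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError("expected an opening bracket")
        position += 1
        node = [tokens[position]]  # the constituent label, e.g. NP
        position += 1
        while tokens[position] != ")":
            if tokens[position] == "(":
                node.append(parse())
            else:
                node.append(tokens[position])
                position += 1
        position += 1  # skip the closing bracket
        return node

    return parse()


def leaves(node):
    """Collect the words of a tree from left to right."""
    words = []
    for child in node[1:]:
        if isinstance(child, list):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = read_brackets("(S (NP Mary) (VP visited (NP a (ADJP very nice) boy)))")
print(tree)
print(" ".join(leaves(tree)))
```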
Semantic annotation
It assigns codes indicating the semantic
features or the semantic fields of the
words in a text. It is knowledge-based
so it needs to be manual most of
the time.
Two types:
– One marks the semantic relationships
between the constituents in a sentence
– One marks the semantic features of
words in a text
Coreference annotation
• Pronouns
• Repetition
• Substitution
• Ellipsis
Computer-assisted at best.
Pragmatic annotation
• Speech/dialogue acts in domain-specific
dialogue.
The most coherent system is DRI
(Discourse Resource Initiative).
3 layers of coding:
– Segmentation (dividing dialogue into
textual units, utterances)
– Functional annotation (dialogue act
annotation)
– Utterance tags (applying utterance tags that
characterize the role of the utterance as a
dialogue act)
Pragmatic annotation
Utterance tags:
– Communicative status (intelligible,
complete, etc.)
– Information level and status (indicating
the semantic content of the utterance and
how it relates to the task in question)
– Forward-looking communicative function
(utterances that may constrain or affect
the discourse, e.g. assert, request,
question and offer)
– Backward-looking communicative
function (utterances that relate to
previous parts of the discourse, e.g.
accept, backchannelling, answer)
Stylistic annotation
It is particularly associated with
stylistic features in literary texts.
An example: the representation
of people’s speech and thoughts,
known as speech and thought
presentation (S&TP)
Other types of tagging
• Error tagging
• Problem-oriented annotation
Types of corpora
• Multilingual
• Monolingual
Multilingual Corpora
• Parallel corpora (source texts
plus translations): Canadian Hansard
• Comparable corpora
(monolingual subcorpora
designed using the same
sampling techniques): Aarhus
corpus of contract law
– Multilingual
– Bilingual
Multilingual Corpora
Important resources for translation and
contrastive studies.
Multilingual corpora…
• …give new insight into the language
compared
• …can be used to study language
specific and universal features
• …illuminate differences between
source texts and translations
• …can be used for a number of practical
applications, in lexicography, language
teaching, translation, etc.
Parallel Corpora
• Bilingual vs. Multilingual
• Unidirectional (from La to Lb or
from Lb to La alone) vs.
Bidirectional (from La to Lb
and from Lb to La) vs.
Multidirectional (from La to Lb,
Lc etc.)
Comparable corpora
A corpus containing components
that are collected using the
same sampling techniques and
similar balance and
representativeness, e.g. the
same proportions of the texts
of the same genres in the same
domains in a range of different
languages in the same
sampling period.
Comparable vs. parallel
corpora
The sampling frame is
essential for comparable
corpora but not for parallel
corpora because the texts
are exact translations of
each other.
Corpus Alignment
In order for us to be able to fully
exploit parallel corpora, they need
to be aligned.
Different types of alignment:
• Word-level alignment
• Sentence-level alignment
• Paragraph alignment
General Corpora
• British National Corpus
(100,106,008 words)
• The American National
Corpus
• ICE-CUP
Specialized Corpora
• Guangzhou Petroleum English Corpus
(411,612 words of written English
from the petrochemical domain)
• HKUST Computer Science Corpus
(1,000,000 words of written English
sampled from undergraduate
textbooks in computer science)
• CPSA (Corpus of Professional Spoken
American English)
• MICASE (1,700,000 words of English
spoken in the academic domain)
Written Corpora
• BROWN Corpus (written texts, AE
in 1961)
• LOB Corpus (Comparable to
BROWN Corpus, BE, early 1960s)
• FROWN Corpus (AE, early 1990s)
• FLOB Corpus (BE, early 1990s)
Spoken Corpora
• London-Lund Corpus (LLC)
• Lancaster/IBM Spoken English
Corpus (SEC)
• Cambridge and Nottingham
Corpus of Discourse in English
(CANCODE)
• Santa Barbara Corpus of Spoken
American English (SBCSAE)
• Wellington Corpus of Spoken New
Zealand English (WSC)
Synchronic Corpora
Useful to compare varieties of
English. Texts all date to the same
period.
• Brown and Lob
• Frown and Flob
• International Corpus of English
(ICE) (Texts produced after 1989)
• BNC
Diachronic Corpora
Texts date to different periods in time.
Ideal to study language change and
history.
• Brown/Frown
• Lob/Flob
• Helsinki Diachronic Corpus of
English Texts (8th-18th century)
• Archer Corpus – A representative
Corpus of Historical English
Registers (BE and AE, 1650-1990).
Learner/developmental
Corpora
Foreign-language or L2 acquisition (learner corpora, LC) /
L1 acquired by children (developmental corpora, DC)
• CHILDES (DC)
• International Corpus of
Learner English – ICLE (LC)
• Cambridge Learner Corpus
(LC)
Monitor Corpora
Constantly supplemented with fresh
material and keep increasing in
size, though the proportion of text
types included in the corpus
remains constant.
• Bank of English (BoE)
• Global English Monitor Corpus
• AVIATOR
The BNC
The British National Corpus
(BNC) is a 100 million word
collection of samples of written
and spoken language from a
wide range of sources,
designed to represent a wide
cross-section of current British
English, both spoken and
written.
The BNC
The written part of the BNC (90%) includes,
for example, extracts from regional and
national newspapers, specialist periodicals
and journals for all ages and interests,
academic books and popular fiction,
published and unpublished letters and
memoranda, school and university essays,
etc. The spoken part (10%) includes a large
amount of unscripted informal conversation,
recorded by volunteers selected from
different age groups, regions and social classes in a
demographically balanced way, together with
spoken language collected in all kinds of
different contexts, ranging from formal
business or government meetings to radio
shows and phone-ins.
The BNC
The corpus is encoded according to
the Guidelines of the Text Encoding
Initiative (TEI) to represent both
the output from CLAWS (automatic
part-of-speech tagger) and a
variety of other structural
properties of texts (e.g. headings,
paragraphs, lists etc.). Full
classification, contextual and
bibliographic information is also
included with each text in the form
of a TEI-conformant header.
What sort of corpus is the BNC?
Monolingual: It deals with modern British English, not other
languages used in Britain. However, non-British English and
foreign language words do occur in the corpus.
Synchronic: It covers British English of the late twentieth
century, rather than the historical development which
produced it.
General: It includes many different styles and varieties, and is
not limited to any particular subject field, genre or register.
In particular, it contains examples of both spoken and
written language.
Sample: For written sources, samples of 45,000 words are
taken from various parts of single-author texts. Shorter
texts up to a maximum of 45,000 words, or multi-author
texts such as magazines and newspapers, are included in
full. Sampling allows for a wider coverage of texts within the
100 million limit, and avoids over-representing idiosyncratic
texts.
BNC and Sketchengine
Sketch Engine is an excellent user interface to query the BNC.
Here are some screenshots.
An example of a POS-tagged text
I've been giving some thought to the
whole idea of writing a book as of late
(I've also been giving some thought to
winning the lottery, and we can all see
where that's got me) and it came to me
while showering the other night that if I
were to ever write a book (which ain't
gonna happen, but let's just say for the
sake of argument) I would bill myself as
the anti-Francis Mayes.
An example of a POS-tagged text
I_PNP 've_VHB been_VBN giving_VVG some_DT0
thought_NN1 to_PRP the_AT0 whole_AJ0 idea_NN1
of_PRF writing_VVG a_AT0 book_NN1 as_PRP21
of_PRP22 late_AJ0 (_( I_PNP 've_VHB also_AV0
been_VBN giving_VVG some_DT0 thought_NN1 to_PRP
winning_VVG the_AT0 lottery_NN1 ,_, and_CJC
we_PNP can_VM0 all_DT0 see_VVI where_AVQ
that_DT0 's_VHZ got_VVN me_PNP )_) and_CJC it_PNP
came_VVD to_PRP me_PNP while_CJS showering_VVG
the_AT0 other_AJ0 night_NN1 that_CJT if_CJS I_PNP
were_VBD to_TO0 ever_AV0 write_VVI a_AT0
book_NN1 (_( which_DTQ ai_UNC n't_XX0 gon_VVG
na_TO0 happen_VVI ,_, but_CJC let_VM021 's_VM022
just_AV0 say_VVI for_PRP the_AT0 sake_NN1 of_PRF
argument_NN1 )_) I_PNP would_VM0 bill_NN1
myself_PNX as_PRP the_AT0 anti-Francis_AJ0
Mayes_NP0 ._.
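Text in this word_TAG format can be processed with very little code. The sketch below splits the first few tokens of the example into (word, tag) pairs, counts the tags and pulls out the nouns. The function name is illustrative, and selecting nouns by the "NN" prefix is an assumption based on the tags visible in this sample.

```python
from collections import Counter


def split_tagged(text):
    """Split word_TAG items into (word, tag) pairs at the last underscore."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs


sample = (
    "I_PNP 've_VHB been_VBN giving_VVG some_DT0 "
    "thought_NN1 to_PRP the_AT0 whole_AJ0 idea_NN1"
)
pairs = split_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)
nouns = [word for word, tag in pairs if tag.startswith("NN")]

print(tag_counts.most_common(3))
print(nouns)
```

Once the pairs are available, the same pattern extends to questions such as which adjectives precede a given noun, which is where POS tagging starts to pay off for collocation studies.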