Corpus Creation And Analysis

An introduction to corpus linguistics, creating your very own corpus for research and some basic techniques in corpus analysis

Locky Law

on 11 April 2014


Corpus Creation and Analysis
What are the key considerations?
How large? --> representativeness & practicality (time constraints)
e.g. 604,767-word Friends corpus (Quaglio, 2008)
or smaller specialised corpus
How to collect? --> virtual (web) or reality (class, assignment, speech)
Mark-up? --> Not all corpora need it; e.g. parts of speech (POS)
(Reppen, 2010)
Plan before you start! Why?
Word Frequency
Word List
Grammar check
Usage (parts of speech, UK vs US English, etc.)
Keyword Keyness
WordSmith Tools
British National Corpus (BNC)
Corpus of Contemporary American English (COCA)
PolyU Language Bank
PolyU Web Concordancer (English)
Google Ngram Viewer
and many many more...


How to Construct Our Own Corpora?
"Seek and ye shall find" --Matthew 7:7
"One must have a good tool in order to do a good job"
-- Confucius
Realising What Is Real In Languages
by Locky
"Study the past as if you would define the Future" -- Confucius
Corpus Creation
Corpus Types
Can you name some examples?
Corpus Types
According to purpose: General-purpose corpora; Domain-specific (or 'sub-language') corpora
According to text selection procedure: Sample corpora; Full-text corpora
According to open / closed character: Closed/static corpora; Open/dynamic corpora; 'Collections'
According to medium: Written corpora; Spoken corpora; Mixed corpora
According to number of languages / dialects represented: Monolingual corpora; Multilingual or 'parallel' corpora
According to temporal variety: Synchronic corpora; Diachronic corpora
According to type of speaker: Native corpora; Learner corpora
According to annotation: Plain corpora; Annotated corpora
(Lario, unknown)
The Hong Kong Polytechnic University
ENGL 545 Multimedia in English Language Learning
Corpus (Latin plural corpora, English plural corpuses) is Latin for "body"

... I haven't seen her for weeks... John is marrying her this Sunday...
Mark-up Version 1:
... I (n) haven't (-axvb) seen (vb) her (n) for (pr) weeks (n)...
... John (n) is (axvb) marrying (vb) her (n) this (adj) Sunday (n)...
What if later I want to distinguish pronouns from proper nouns?
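One way to think about this question is sketched below in Python. The tag names `pron` and `propn` and the small `FINE_TAGS` lexicon are hypothetical, made up for this illustration and not taken from any published tag set. The sketch suggests why planning matters: fine-grained tags can always be collapsed into the coarse Version 1 tags later, but coarse tags cannot be split automatically.

```python
# Hypothetical fine-grained tag set: pronouns (pron) and proper nouns (propn)
# are kept apart, unlike Mark-up Version 1, which tags both as (n).
FINE_TAGS = {
    "I": "pron", "her": "pron",
    "John": "propn", "Sunday": "propn",
    "weeks": "n",
    "haven't": "-axvb", "is": "axvb",
    "seen": "vb", "marrying": "vb",
    "for": "pr", "this": "adj",
}

# Mapping from the fine tags back to the coarse Version 1 tags.
TO_COARSE = {"pron": "n", "propn": "n"}


def mark_up(sentence, fine=True):
    """Return the sentence with a tag in brackets after every word."""
    tagged = []
    for word in sentence.split():
        tag = FINE_TAGS[word]
        if not fine:
            # Collapsing is a simple lookup; the reverse is not possible.
            tag = TO_COARSE.get(tag, tag)
        tagged.append(f"{word} ({tag})")
    return " ".join(tagged)


print(mark_up("John is marrying her this Sunday"))
print(mark_up("John is marrying her this Sunday", fine=False))
```

With `fine=False` the output matches Mark-up Version 1; with the default, "John" and "her" carry different tags.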
Size Matters!
Spoken corpora tend to be smaller than written ones. (Obviously!!!)
large: spoken corpora > 1 million words
small: written corpora < 5 million words (O’Keeffe et al., 2007)

small corpora contain up to 250,000 words (Flowerdew, 2004)
in Koester (2010)
How Small?
Biber (1990): 1,000 words are enough to produce reliable results
Tribble (1997): small corpus is enough if register is very specialised
Koester (2006): < 34,000 words of workplace discourse
O'Keeffe (2003): 55,000 words of phone-in data in radio discourse
in Evison (2010)
WordSmith Tools:
Main Console -- Advanced Settings:
Show if frequency at least __
WordList --
Simple (Quick):
Create and save your WordList
Preferred (takes longer, but enables computation of clusters):
Make/Add to Index and Save
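For readers without WordSmith Tools, the idea of a word list with a minimum-frequency cut-off can be approximated in a few lines of Python. This is a minimal sketch, not WordSmith's own procedure: the `tokenize` rule (lower-casing, keeping apostrophes inside words) and the `min_freq` parameter are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def word_list(text, min_freq=1):
    """Return (word, frequency) pairs, most frequent first, at or above min_freq."""
    counts = Counter(tokenize(text))
    return [(word, n) for word, n in counts.most_common() if n >= min_freq]


sample = "I haven't seen her for weeks. John is marrying her this Sunday."
print(word_list(sample, min_freq=2))
```

Here `min_freq` plays the role of the "Show if frequency at least __" setting: raising it hides rare words from the list.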
KeyWords -- Choose 2 WordLists (Working Corpus & Reference Corpus)
Make a keyword list now
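Keyness compares how often a word occurs in the working corpus with how often it occurs in the reference corpus. One commonly used statistic is log-likelihood; the sketch below computes it from four numbers. It is an illustration under that assumption and may not match the settings chosen in a given WordSmith installation. The function and parameter names are made up for this example.

```python
import math


def log_likelihood(freq_work, size_work, freq_ref, size_ref):
    """Log-likelihood keyness of one word across two corpora.

    freq_* are the word's raw frequencies; size_* are corpus sizes in words.
    """
    total = size_work + size_ref
    # Frequencies expected if the word were equally common in both corpora.
    expected_work = size_work * (freq_work + freq_ref) / total
    expected_ref = size_ref * (freq_work + freq_ref) / total
    ll = 0.0
    # A zero frequency contributes nothing (and log(0) is undefined).
    if freq_work > 0:
        ll += freq_work * math.log(freq_work / expected_work)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Hypothetical counts: 50 hits in a 1,000-word working corpus,
# 100 hits in a 10,000-word reference corpus.
print(log_likelihood(50, 1_000, 100, 10_000))
```

A score near zero means the word is about equally frequent in both corpora; values above 3.84 correspond to p < 0.05 on one degree of freedom.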

Compute --> Clusters (WordList, KeyWords) & Concordance
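Clusters (recurring word sequences) and concordance lines (a search word shown in its context) can both be sketched in plain Python. The function names and the `width` parameter below are assumptions for illustration only; real tools add sorting, filtering and many other options.

```python
from collections import Counter


def clusters(tokens, n=2):
    """Count every run of n consecutive tokens (an n-word cluster)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, node, width=3):
    """Return key-word-in-context lines: up to `width` tokens either side of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


tokens = "i haven't seen her for weeks john is marrying her this sunday".split()
print(clusters(tokens, 2).most_common(3))
for line in concordance(tokens, "her"):
    print(line)
```

Reading down the bracketed column of the concordance output is what makes patterns to the left and right of the search word easy to spot.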

Download Locky's Sample Corpora

Web Concordancers
(Dis)prove the following claim!
"haven't any" is seen in Chinese learners' English but seldom in its native English counterpart. (What about Hong Kong learners?)

Use BNC to (dis)prove the following claim.
A non-native English grammar teacher K taught them that native English speakers use “no” more often and seldom use “not” in both their written and spoken English, thus learners should avoid using “not” whenever possible.
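Claims like the two above compare frequencies across corpora of different sizes, so raw counts need to be normalised, typically to hits per million words. The sketch below shows the arithmetic; the helper names and the toy sentence are made up for illustration, and real figures would come from the concordancer's hit counts and the corpus size it reports.

```python
def count_phrase(tokens, phrase):
    """Count occurrences of a word or multi-word phrase in a token list."""
    words = phrase.lower().split()
    n = len(words)
    return sum(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


# Toy data only, standing in for a real corpus.
tokens = "i have no idea and i do not know why there is no reply".split()
for item in ("no", "not", "do not"):
    hits = count_phrase(tokens, item)
    print(item, hits, per_million(hits, len(tokens)))
```

Comparing the per-million figures for the same item in a learner corpus and a native-speaker corpus is what allows a claim to be supported or rejected.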

Do we say "highest", "largest" or "biggest" percentage?

I teach math in English and my students always need to describe charts, tables and diagrams. I have trouble explaining which prepositions go with "percent". What are the options? Which ones should I teach first?

What are the differences between 'fast', 'quick', 'speedy' and 'rapid'?

Advanced level
What are the differences between 'sick' and 'ill'?

What are the differences between 'whether' and 'if'?