Corpus Creation And Analysis

An introduction to corpus linguistics, creating your very own corpus for research and some basic techniques in corpus analysis

Locky Law

on 11 April 2014

Transcript of Corpus Creation And Analysis

Corpus Creation and Analysis
What are the key considerations?
How large? --> representativeness & practicality (time constraints)
e.g. 604,767-word Friends corpus (Quaglio, 2008)
or smaller specialised corpus
How to collect? --> virtual (web) or reality (class, assignment, speech)
Mark-up? --> Not all corpora need it; a common example is part-of-speech (POS) tagging
(Reppen, 2010)
Plan before you start! Why?
Word Frequency
Word List
Grammar check
Usage (parts of speech, UK vs US English, etc.)
Keyword Keyness
WordSmith Tools
British National Corpus (BNC)
Corpus of Contemporary American English (COCA)
PolyU Language Bank
PolyU Web Concordancer (English)
Google Ngram Viewer
and many many more...
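As a rough illustration of what a word list with word frequencies involves, here is a minimal Python sketch. It is not how WordSmith Tools or any of the tools above work internally; the tokenisation rule (a word is a run of letters, optionally with internal apostrophes) and the function name `word_list` are assumptions made for the example.

```python
import re
from collections import Counter


def word_list(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Assumed tokenisation: runs of letters, allowing internal apostrophes
    # so that "haven't" stays one token.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


sample = "I haven't seen her for weeks. John is marrying her this Sunday."
freq = word_list(sample)

# Print the word list, most frequent words first.
for word, count in freq.most_common():
    print(word, count)
```

A real corpus would be read from files rather than a string, and the tokenisation rule would need to be decided at the planning stage, since it changes the counts.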


How to Construct Our Own Corpora?
"Seek and ye shall find" --Matthew 7:7
"One must have a good tool in order to do a good job"
-- Confucius
Realising What Is Real In Languages
by Locky
"Study the past as if you would define the Future" -- Confucius
Corpus Creation
Corpus Types
Can you name some examples?
Corpus Types
According to purpose: General-purpose corpora; Domain-specific (or 'sub-language') corpora
According to text selection procedure: Sample corpora; Full-text corpora
According to open / closed character: Closed/static corpora; Open/dynamic corpora; 'Collections'
According to Medium: Written corpora; Spoken corpora; Mixed Corpora
According to number of languages / dialects represented: Monolingual corpora; Multilingual or 'parallel' corpora
According to temporal variety: Synchronic; Diachronic.
According to type of speaker: Native corpora, Learner corpora
According to annotation: Plain corpora; Annotated corpora
(Lario, unknown)
The Hong Kong Polytechnic University
ENGL 545 Multimedia in English Language Learning
Corpus (Latin plural: corpora; English plural: corpuses) is Latin for "body"

... I haven't seen her for weeks... John is marrying her this Sunday...
Mark-up Version 1:
... I (n) haven't (-axvb) seen (vb) her (n) for (pr) weeks (n)...
... John (n) is (axvb) marrying (vb) her (n) this (adj) Sunday (n)...
What if later I want to distinguish pronouns from proper nouns?
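The question above can be made concrete with a small Python sketch. It assumes a hypothetical finer-grained tagset in which "pron" and "propn" replace some uses of the single "n" tag from Version 1; these tag names, and the choice to treat "Sunday" as a proper noun, are assumptions made for illustration only. The point is that fine tags can always be collapsed into coarse ones automatically, but coarse tags cannot be split again without re-annotating.

```python
# Hypothetical finer-grained mark-up of the two example utterances.
# "pron" and "propn" are assumed tag names, not part of Version 1.
tagged = [
    ("I", "pron"), ("haven't", "-axvb"), ("seen", "vb"), ("her", "pron"),
    ("for", "pr"), ("weeks", "n"),
    ("John", "propn"), ("is", "axvb"), ("marrying", "vb"), ("her", "pron"),
    ("this", "adj"), ("Sunday", "propn"),
]

# Mapping from the fine tags back to the coarse Version 1 tag.
COARSE = {"pron": "n", "propn": "n"}


def to_coarse(tagged_tokens):
    """Collapse fine tags into the coarse tagset (always possible)."""
    return [(word, COARSE.get(tag, tag)) for word, tag in tagged_tokens]


def words_with_tag(tagged_tokens, tag):
    """List the words carrying a given tag, in text order."""
    return [word for word, t in tagged_tokens if t == tag]


print(words_with_tag(tagged, "pron"))          # pronouns only
print(words_with_tag(tagged, "propn"))         # proper nouns only
print(words_with_tag(to_coarse(tagged), "n"))  # everything Version 1 called (n)
```

This is one reason to plan the mark-up scheme before starting: choosing the more detailed scheme up front keeps both options open.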
Size Matters!
Spoken corpora tend to be smaller than written ones. (Obviously!!!)
large: spoken corpus > 1 million words
small: written corpora < 5 million words (O’Keeffe et al., 2007)

small corpora contain up to 250,000 words (Flowerdew, 2004)
in Koester (2010)
How Small?
Biber (1990): 1,000 words are enough to produce reliable results
Tribble (1997): small corpus is enough if register is very specialised
Koester (2006): < 34,000 words of workplace discourse
O'Keeffe (2003): 55,000 words of phone-in data in radio discourse
in Evison (2010)
Andersen, G. (2012). Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian. John Benjamins Publishing.
Charles, M. (2012). ‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building. English for Specific Purposes, 31(2), 93-102.
Dahlmeier, D., Ng, H. T., & Wu, S. M. (2013). Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 22-31).
Evison, J. (2010). What are the basics of analysing a corpus? In A. O’Keeffe, & M. McCarthy, The Routledge Handbook of Corpus Linguistics (pp. 122-135). London and New York: Routledge.
Fletcher, W. H. (2014, November 5). Corpus Analysis of the World Wide Web. Retrieved April 7, 2014, from The Encyclopedia of Applied Linguistics - Wiley Online Library: http://onlinelibrary.wiley.com/doi/10.1002/9781405198431.wbeal0254/pdf
Koester, A. (2010). Building small specialised corpora. In A. O’Keeffe, & M. McCarthy, The Routledge Handbook of Corpus Linguistics (pp. 66-79). London and New York: Routledge.
Lario, J. S. (Unknown). MODELOS GRAMATICALES. Corpus Linguistics. Retrieved April 8, 2014, from Universidad de Granada: http://www.ugr.es/~jsantana/modelos/NOTES/02_Types_of_Corpora.pdf
Law, L. L. (2013a, August 27). Usage & Academic Research: Differences In Similar Words -- The Corpus Approach Part 1. Retrieved April 8, 2014, from Locky's English Playground: http://lockyep.blogspot.hk/2013/08/usage-academic-research-differences-in.html
Law, L. L. (2013b, August 27). Usage & Academic Research: Differences In Similar Words -- The Corpus Approach Part 2. Retrieved April 8, 2014, from Locky's English Playground: http://lockyep.blogspot.hk/2013/08/usage-academic-research-differences-in_27.html
Law, L. L. (2013c, September 27). Usage & Academic Research: Describing Percentage -- A Corpus Approach. Retrieved April 8, 2014, from Locky's English Playground: http://lockyep.blogspot.hk/2013/09/usage-academic-research-describing.html
McCreadie, R., Soboroff, I., Lin, J., Macdonald, C., Ounis, I., McCullough, D., et al. (2012). On building a reusable Twitter corpus. SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (pp. 1113-1114). New York: ACM.
Quaglio, P. (2008). Television Dialogue and Natural Conversation: Linguistic Similarities and Differences. In A. Ädel, & R. Reppen, Corpora and Discourse: The Challenges of Different Settings (pp. 189-210). Amsterdam: John Benjamins.
Reppen, R. (2010). Building a corpus: What are the key considerations? In A. O’Keeffe, & M. McCarthy, The Routledge Handbook of Corpus Linguistics (pp. 31-37). London and New York: Routledge.
Zhu, W., Zhang, W., Shi, Q., Chen, F., Li, H., Ma, X., et al. (2002). Corpus building for data-driven TTS systems. Proceedings of 2002 IEEE Workshop (pp. 199-202).

WordSmith Tools:
Main Console -- Advanced Settings:
Show if frequency at least __
WordList --
Simple (Quick):
Create and save your WordList
Preferred (takes longer, but enables computation of clusters):
Make/Add to Index and Save
KeyWords -- Choose 2 WordLists (Working Corpus & Reference Corpus)
Make a keyword list now

Compute --> Clusters (WordList, KeyWords) & Concordance
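To show the idea behind comparing a working corpus with a reference corpus, here is a minimal Python sketch of a keyword list ranked by log-likelihood, one commonly used keyness statistic. It is an illustration under stated assumptions, not WordSmith's own implementation: the function names, the `min_freq` parameter (loosely analogous to a minimum-frequency setting) and the toy counts are all made up for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word occurring a times in a working
    corpus of c words and b times in a reference corpus of d words."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the working corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(working, reference, min_freq=1):
    """Return (word, keyness) pairs for words that are relatively more
    frequent in the working corpus, highest keyness first."""
    c = sum(working.values())
    d = sum(reference.values())
    rows = []
    for word, a in working.items():
        if a < min_freq:
            continue
        b = reference.get(word, 0)
        if a / c > b / d:  # keep positive keywords only
            rows.append((word, log_likelihood(a, b, c, d)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy word lists (invented counts, for illustration only).
working = Counter({"corpus": 10, "the": 50, "word": 5})
reference = Counter({"corpus": 1, "the": 500, "word": 20, "of": 300})

for word, score in keywords(working, reference):
    print(f"{word}\t{score:.2f}")
```

With these toy counts, "corpus" ranks far above "the" even though "the" is much more frequent, which is the difference between frequency and keyness.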

Download Locky's Sample Corpora

Web Concordancers
(Dis) Prove the following claim!
"haven't any" is seen in Chinese learners' English but seldom in its native English counterpart. (What about Hong Kong learners?)

Use BNC to (dis)prove the following claim.
A non-native English grammar teacher K taught them that native English speakers use “no” more often and seldom use “not” in both their written and spoken English, thus learners should avoid using “not” whenever possible.

Do we say "highest", "largest" or "biggest" percentage?

I teach math in English and my students always need to describe charts, tables and diagrams. I have trouble explaining which prepositions go with "percent". What are the options? Which ones should I teach first?

What are the differences between 'fast', 'quick', 'speedy' and 'rapid'?

Advanced level
What are the differences between 'sick' and 'ill'?

What are the differences between 'whether' and 'if'?
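For questions like the ones above, the two basic operations a concordancer offers are keyword-in-context (KWIC) lines and frequencies that can be compared across corpora of different sizes. The Python sketch below illustrates both on a made-up sample; it does not query the BNC or any web concordancer, and the function names `kwic` and `per_million` are assumptions for the example.

```python
import re


def kwic(text, phrase, width=30):
    """Return keyword-in-context lines for each case-insensitive
    whole-word match of phrase in text."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the matches line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


def per_million(hits, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return hits * 1_000_000 / corpus_size


# Invented sample sentences, for illustration only.
sample = ("I haven't any money. We haven't got any bread. "
          "They haven't any idea what to do.")

for line in kwic(sample, "haven't any"):
    print(line)
```

Normalising to a per-million rate matters when comparing, say, a small learner corpus with a large native-speaker corpus, because raw counts alone would favour the larger corpus.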