Corpus Creation And Analysis
What are the key considerations?
How large? --> representativeness & practicality (time constraints)
e.g. the 604,767-word Friends corpus (Quaglio, 2008)
or smaller specialised corpus
How to collect? --> virtual (web) or real-world (class, assignment, speech)
Mark-up? --> Not all corpora need it; a common example is part-of-speech (POS) tagging
Plan before you start! Why?
Usage (parts of speech, UK vs US English, etc.)
British National Corpus (BNC)
Corpus of Contemporary American English (COCA)
PolyU Language Bank
PolyU Web Concordancer (English)
Google Ngram Viewer
and many many more...
How to Construct Our Own Corpora?
"Seek and ye shall find" --Matthew 7:7
"One must have a good tool in order to do a good job"
Realising What Is Real In Languages
"Study the past as if you would define the Future" -- Confucius
Can you name some examples?
According to purpose: General-purpose corpora ; Domain-specific (or 'sub-language') corpora
According to text selection procedure: Sample corpus; Full-text corpora
According to open / closed character: Closed/static corpus; Open/dynamic corpus; 'Collections'
According to Medium: Written corpora; Spoken corpora; Mixed Corpora
According to number of languages / dialects represented: Monolingual corpora; Multilingual or 'parallel' corpora
According to temporal variety: Synchronic; Diachronic.
According to type of speaker: Native corpora, Learner corpora
According to annotation: Plain corpora; Annotated corpora
The Hong Kong Polytechnic University
ENGL 545 Multimedia in English Language Learning
Corpus (plural corpora, English plural corpuses) is Latin for "body"
... I haven't seen her for weeks... John is marrying her this Sunday...
Mark-up Version 1:
... I (n) haven't (-axvb) seen (vb) her (n) for (pr) weeks (n)...
... John (n) is (axvb) marrying (vb) her (n) this (adj) Sunday (n)...
What if later I want to distinguish pronouns from proper nouns?
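One way to avoid this problem is to tag with the finest distinctions you might ever need and collapse them only when displaying. The sketch below is a minimal illustration, not a standard tagset: the labels pron, propn and noun, the mapping table and the choice to treat "Sunday" as a proper noun are all assumptions made for the example.

```python
# Hypothetical fine-grained tags. Pronouns (pron) and proper nouns (propn)
# are stored separately and can be collapsed to the coarse tag (n) on demand.
FINE_TO_COARSE = {
    "pron": "n", "propn": "n", "noun": "n",
    "axvb": "axvb", "-axvb": "-axvb", "vb": "vb", "pr": "pr", "adj": "adj",
}

# The two example sentences, tagged once with the fine-grained scheme.
TAGGED = [
    ("I", "pron"), ("haven't", "-axvb"), ("seen", "vb"),
    ("her", "pron"), ("for", "pr"), ("weeks", "noun"),
    ("John", "propn"), ("is", "axvb"), ("marrying", "vb"),
    ("her", "pron"), ("this", "adj"), ("Sunday", "propn"),
]

def render(tokens, coarse=False):
    """Return the tokens as 'word (tag)' text, optionally with coarse tags."""
    parts = []
    for word, tag in tokens:
        shown = FINE_TO_COARSE[tag] if coarse else tag
        parts.append(f"{word} ({shown})")
    return " ".join(parts)

def words_with_tag(tokens, tag):
    """Return every word carrying exactly the given fine-grained tag."""
    return [word for word, t in tokens if t == tag]

if __name__ == "__main__":
    print(render(TAGGED[:6], coarse=True))   # same as Mark-up Version 1
    print(words_with_tag(TAGGED, "pron"))    # pronouns only
    print(words_with_tag(TAGGED, "propn"))   # proper nouns only
```

Going from fine to coarse is a simple lookup, while going from coarse to fine would mean re-tagging the whole corpus, which is why the scheme should be planned before annotation starts.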
Spoken corpora tend to be smaller than written ones. (Obviously!!!)
large: a spoken corpus of > 1 million words
small: a written corpus of < 5 million words (O’Keeffe et al., 2007)
small corpora contain up to 250,000 words (Flowerdew, 2004)
in Koester (2010)
Biber (1990): 1,000 words are enough to produce reliable results
Tribble (1997): small corpus is enough if register is very specialised
Koester (2006): < 34,000 words of workplace discourse
O'Keeffe (2003): 55,000 words of phone-in data in radio discourse
in Evison (2010)
Andersen, G. (2012). Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian. John Benjamins Publishing.
Charles, M. (2012). ‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building. English for Specific Purposes, 31(2), 93-102.
Dahlmeier, D., Ng, H. T., & Wu, S. M. (2013). Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 22-31).
Evison, J. (2010). What are the basics of analysing a corpus? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 122-135). London and New York: Routledge.
Fletcher, W. H. (2014, November 5). Corpus Analysis of the World Wide Web. Retrieved April 7, 2014, from The Encyclopedia of Applied Linguistics - Wiley Online Library: http://onlinelibrary.wiley.com/doi/10.1002/9781405198431.wbeal0254/pdf
Koester, A. (2010). Building small specialised corpora. In A. O’Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 66-79). London and New York: Routledge.
Lario, J. S. (n.d.). MODELOS GRAMATICALES. Corpus Linguistics. Retrieved April 8, 2014, from Universidad de Granada: http://www.ugr.es/~jsantana/modelos/NOTES/02_Types_of_Corpora.pdf
Law, L. L. (2013a, August 27). Usage & Academic Research: Differences In Similar Words -- The Corpus Approach Part 1. Retrieved April 8, 2014, from Locky's English Playground: http://lockyep.blogspot.hk/2013/08/usage-academic-research-differences-in.html
Law, L. L. (2013b, August 27). Usage & Academic Research: Differences In Similar Words -- The Corpus Approach Part 2. Retrieved April 8, 2014, from Locky's English Playground: http://lockyep.blogspot.hk/2013/08/usage-academic-research-differences-in_27.html
Law, L. L. (2013c, September 27). Usage & Academic Research: Describing Percentage -- A Corpus Approach. Retrieved April 8, 2014, from Locky's English Playground: http://lockyep.blogspot.hk/2013/09/usage-academic-research-describing.html
McCreadie, R., Soboroff, I., Lin, J., Macdonald, C., Ounis, I., McCullough, D., et al. (2012). On building a reusable twitter corpus. SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (pp. 1113-1114). New York: ACM.
Quaglio, P. (2008). Television Dialogue and Natural Conversation: Linguistic Similarities and Differences. In A. Ädel & R. Reppen (Eds.), Corpora and Discourse: The Challenges of Different Settings (pp. 189-210). Amsterdam: John Benjamins.
Reppen, R. (2010). Building a corpus: What are the key considerations? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 31-37). London and New York: Routledge.
Zhu, W., Zhang, W., Shi, Q., Chen, F., Li, H., Ma, X., et al. (2002). Corpus building for data-driven TTS systems. Proceedings of 2002 IEEE Workshop, (pp. 199-202).
Main Console -- Advanced Settings:
Show if frequency at least __
Create and save your WordList
Preferred (takes longer, but enables computation of clusters):
Make/Add to Index and Save
KeyWords -- Choose 2 WordLists (Working Corpus & Reference Corpus)
Make a keyword list now
Compute --> Clusters (WordList, KeyWords) & Concordance
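To make the word list, keyword and cluster steps above less of a black box, here is a minimal sketch of what such computations can look like. It is an illustration only and does not claim to reproduce what the software does internally: the tokenizer, the function names and the use of the log-likelihood statistic for keyness are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into words, keeping forms like haven't."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_list(tokens, min_freq=1):
    """Frequency list restricted to words occurring at least min_freq times."""
    counts = Counter(tokens)
    return {word: freq for word, freq in counts.items() if freq >= min_freq}

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c words and b times in d words."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the working corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(working, reference):
    """Words relatively more frequent in the working corpus, most key first."""
    wc, rc = Counter(working), Counter(reference)
    c, d = len(working), len(reference)
    scores = {}
    for word, a in wc.items():
        b = rc[word]
        if a / c > b / d:        # keep positive keywords only
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def clusters(tokens, n=2):
    """Count every run of n consecutive words (n-grams)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

if __name__ == "__main__":
    working = tokenize("I haven't any tea. I haven't any milk.")
    reference = tokenize("I have no tea and I have no milk today.")
    print(word_list(working, min_freq=2))
    print(keywords(working, reference)[:3])
    print(clusters(working, 2).most_common(2))
```

The two sentences in the usage lines are invented placeholders; with real data, the working and reference token lists would come from the corpus files.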
Download Locky's Sample Corpora
(Dis) Prove the following claim!
"haven't any" is seen in Chinese learners' English but seldom in its native English counterpart. (What about Hong Kong learners?)
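A claim like this is usually tested by counting the phrase in a learner corpus and in a native corpus, then normalising the counts so that corpora of different sizes can be compared. The sketch below shows the idea under stated assumptions: the function names are hypothetical, the sentence in the usage lines is an invented placeholder rather than corpus data, and frequencies are normalised per million words.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into words, keeping forms like haven't."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def phrase_hits(tokens, phrase):
    """Return the start positions of every occurrence of the phrase."""
    target = phrase.lower().split()
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]

def per_million(hits, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return hits * 1_000_000 / corpus_size

def kwic(tokens, phrase, width=3):
    """Key-word-in-context lines: width words either side of each hit."""
    n = len(phrase.split())
    lines = []
    for i in phrase_hits(tokens, phrase):
        left = " ".join(tokens[max(0, i - width):i])
        node = " ".join(tokens[i:i + n])
        right = " ".join(tokens[i + n:i + n + width])
        lines.append(f"{left} [{node}] {right}")
    return lines

if __name__ == "__main__":
    # Placeholder text; replace with the contents of a real corpus file.
    tokens = tokenize("I haven't any money but they haven't seen any")
    hits = phrase_hits(tokens, "haven't any")
    print(len(hits), per_million(len(hits), len(tokens)))
    for line in kwic(tokens, "haven't any", width=2):
        print(line)
```

Running the same count on each corpus and comparing the per-million figures gives a first answer; reading the concordance lines then shows whether the hits are really the construction in question.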
Use BNC to (dis)prove the following claim.
A non-native English grammar teacher K taught them that native English speakers use “no” more often and seldom use “not” in both their written and spoken English, thus learners should avoid using “not” whenever possible.
Do we say "highest", "largest" or "biggest" percentage?
I teach math in English and my students always need to describe charts, tables and diagrams. I have trouble explaining which prepositions go with "percent". What are the options? Which ones should I teach first?
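A quick way to approach this question is to count which prepositions directly follow "percent" in a corpus and teach the most frequent ones first. The following is a rough sketch under assumptions: the preposition set is a short hand-picked list rather than a complete inventory, the function name is hypothetical, and the sentence in the usage lines is invented for illustration.

```python
import re
from collections import Counter

# Assumed, deliberately short list of candidate prepositions.
PREPOSITIONS = {"of", "in", "for", "to", "by", "from", "on", "at"}

def following_prepositions(text, node="percent"):
    """Count prepositions that occur immediately after the node word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for current, nxt in zip(tokens, tokens[1:]):
        if current == node and nxt in PREPOSITIONS:
            counts[nxt] += 1
    return counts.most_common()   # most frequent first

if __name__ == "__main__":
    sample = ("Ten percent of users and five percent of sales "
              "rose by two percent in May.")
    print(following_prepositions(sample))
```

This only looks one word to the right of the node; prepositions that come before it, as in "by 10 percent", would need the same loop with the pair order reversed.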
What are the differences between 'fast', 'quick', 'speedy' and 'rapid'?
What are the differences between 'sick' and 'ill'?
What are the differences between 'whether' and 'if'?