Corpus Linguistics Introduction
Some untagged text
Given what you've learned so far, think up a project that would require or benefit from an annotated corpus.
What is a corpus?
Types of Corpora
What is a corpus?
The Lexical Syllabus
"The main focus of study should be on a) the commonest word
forms in the language b) the central patterns of usage c) the
combinations which they usually form." (Sinclair and Renouf 1988:
Limitations of Corpus Linguistics
1. It won’t tell us whether something is possible or well-formed in a language. E.g. is “he expired of heart disease” acceptable?
2. Any generalisations we make from corpus data can only be deductions – not facts.
3. Corpora give us evidence, but not information or explanations. Why do women say “wash” more than men?
4. Corpora give us language out of context, so there is no visual information, e.g. pictures or fonts. With spoken data, there is
no information on what the speakers look like, or on their behaviour or body language.
Corpora and Language Teaching
Why use a corpus?
1. Large amounts of data tell us about tendencies and what’s normal or typical in real-life language use.
2. Corpora reveal very rare or exceptional cases that we wouldn’t get from looking at single texts or from introspection.
3. Humans make mistakes and are slow.
1. It's usually large.
2. It must be representative.
3. It should be machine-readable.
Latin "corpus" = a body (the source of the English word "corpse")
1. Specialised corpus – e.g.
• genre: the language of newspapers
• time: 2005 to the present day
• place: just texts published in China
2. General corpus – needs to be much larger. E.g. The British National Corpus (BNC) has about 100 million words of spoken and written British English.
3. Multilingual corpus – e.g. English and Spanish. Or American English and Indian English.
4. Parallel corpus – e.g. English and Spanish – exactly the
same texts translated. E.g. the CRATER corpus.
5. Learner corpus – language use created by people learning a particular language. E.g. the International Corpus of
6. Monitor corpus – continually being added to. E.g. the Bank of English.
7. Historical or Diachronic corpus – e.g. Helsinki corpus – 1.5
million words of texts from 700AD to 1700AD.
Your query "wash" returned 2415 matches in 952 different texts (in 97,626,093 words; freq: 24.74 instances per million words)
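The normalised frequency in the query result above can be checked by hand: divide the number of matches by the corpus size and scale to one million words. The sketch below does this in Python; the function name `per_million` is only illustrative and is not part of any corpus tool.

```python
def per_million(matches: int, corpus_size: int) -> float:
    """Normalised frequency: occurrences per one million words."""
    return matches / corpus_size * 1_000_000


# Figures taken from the "wash" query result quoted above.
freq = per_million(2415, 97_626_093)
print(f"{freq:.2f} instances per million words")
```

Normalising in this way is what makes counts comparable across corpora of different sizes.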
Tags for headers and paragraphs
Change quotes to SGML
Tags for punctuation
Tags for word units
Tags for grammatical codes
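As a rough illustration of the tagging steps listed above, the sketch below wraps word units and punctuation in SGML-style elements. The tag names `<s>`, `<w>` and `<c>` are assumptions chosen for the example, not the markup scheme of any particular corpus, and no grammatical codes are added, since that would need a part-of-speech tagger.

```python
import re


def tag_sentence(text: str) -> str:
    """Wrap each word in <w> and each punctuation mark in <c>.

    The tag names are hypothetical, chosen only for illustration.
    """
    # A token is a word (optionally with an internal apostrophe)
    # or a single non-space, non-word character (punctuation).
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    parts = []
    for tok in tokens:
        if re.match(r"\w", tok):
            parts.append(f"<w>{tok}</w>")
        else:
            parts.append(f"<c>{tok}</c>")
    return "<s>" + " ".join(parts) + "</s>"


tagged = tag_sentence("Some untagged text.")
print(tagged)
```

A real annotation pipeline would also escape special characters and attach a grammatical code to each word unit.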
University of Nottingham
Coverage: What to include, what to leave out
maulstick – appears in Oxford Advanced Learner’s Dictionary of
Current English (1974) and Longman Dictionary of Contemporary English (1978). Occurs twice in Oxford English Corpus (1.5 billion words).
E – Essential, about 4,900 terms
I – Improver, 3,300 terms
A – Advanced, 3,700 terms
(Cambridge Advanced Learner’s Dictionary)
thread . . .
. . . ~bare, adj. 1. (of cloth) worn thin; shabby: a ~bare coat. 2. (fig.) much used and therefore uninteresting or valueless; hackneyed:
~bare jokes (sermons, arguments). (Oxford Advanced Learner’s
Dictionary of Current English 1974)
"threadbare clothing, carpet, or cloth is very thin and almost has holes in it because it has been worn or used a lot. Wearing or containing threadbare things, a threadbare family apartment. A threadbare idea
or excuse has been used a lot and is no longer effective."
Macmillan English Dictionary for Advanced Learners 2002
strategies for independence
• 4000-5000 different word types account for up to 95% of written texts
• 1000 words account for 85% of written texts
• 50 high frequency function words account for up to 60% of spoken
language (Nation 1990)
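Coverage figures like these come from counting how many running words (tokens) the most frequent word types account for. The sketch below shows the calculation on a toy sentence; the function name `coverage` and the sample text are made up for illustration, and a real study would use a full corpus and a proper tokeniser.

```python
from collections import Counter


def coverage(tokens: list[str], top_n: int) -> float:
    """Share of all tokens accounted for by the top_n most frequent word types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)


# Toy data: whitespace tokenisation of a made-up sentence.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.lower().split()
print(f"Top 3 word types cover {coverage(tokens, 3):.0%} of the tokens")
```

Even in this tiny sample, a handful of high frequency types covers most of the text, which is the pattern the figures above describe at corpus scale.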
adding individual words often futile
extended and metaphorical meanings
more complex chunks