Edita Niaurienė

by

Edita Niauriene

on 12 October 2014


Transcript of Edita Niaurienė

Edita Niaurienė
17th Sept. 2014
Corpus defined
A collection of texts (utterances) used to conduct some type of linguistic investigation.
Electronic Corpora
data can be manipulated by a computer
Shortcomings of printed corpora
1. physically gathering a printed corpus is time-consuming;
2. only a limited range of documents can be gathered;
3. manual analysis is error-prone.
A collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety.
linguists / translators can access and display the information in a variety of useful ways.
Printed corpora VS electronic corpora:
electronic corpora are bigger in size and make it easier to gather the necessary data
Different types of electronic corpora
1. Monolingual: a corpus that contains texts in a single language
2. Bilingual: a corpus that contains texts in two languages
3. Multilingual: a corpus that contains texts in more than two languages
A bilingual corpus that contains source texts and their translations is sometimes referred to as a bitext, but the more common term is parallel corpus, which can be used to describe both bilingual and multilingual collections.
Parallel corpora
consist of source texts aligned with their translations
Monolingual comparable corpora (MCC; Lithuanian: vienkalbiai lyginamieji tekstynai)
1. A collection of texts originally written in language A.
2. A collection of texts translated into language A from other languages.
Corpus-analysis tools
Most corpus-analysis tools come with several main features that allow users to generate and manipulate:
1. word-frequency lists,
2. concordances,
3. collocations.
Word-frequency lists
- the most basic feature;
- help to discover how many words are in the corpus and how often each appears.

TOKENS - the total number of words in the corpus.
TYPES - the number of different words in the corpus.
e.g. I really like translation because I think that translation is really, really fun. (13 tokens, 9 types)
Word-frequency lists can be sorted in various orders:
1. order of occurrence in the corpus (Figure 1),
2. alphabetical order (Figure 2),
3. order of frequency (Figure 3).
Each list can be arranged in ascending or descending order, so the same list can be arranged in 6 different ways.
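
A minimal sketch of a word-frequency list, using the example sentence from the slides. The `tokenize` helper and its simple "letters and apostrophes" rule are assumptions made for illustration; real corpus-analysis tools apply their own tokenization rules.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

sentence = ("I really like translation because I think that "
            "translation is really, really fun.")
tokens = tokenize(sentence)
freq = Counter(tokens)  # keys keep their order of first occurrence

# The three basic orders named above.
by_occurrence = list(freq.items())
by_alphabet = sorted(freq.items())
by_frequency = sorted(freq.items(), key=lambda item: item[1])

# Each order can be ascending or descending: 3 x 2 = 6 arrangements.
six_lists = []
for ordering in (by_occurrence, by_alphabet, by_frequency):
    six_lists.append(ordering)
    six_lists.append(ordering[::-1])

print(len(tokens), "tokens,", len(freq), "types")  # 13 tokens, 9 types
```

Lowercasing before counting is a design choice: it treats "Really" and "really" as one type, which may or may not be wanted for a given investigation.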

In addition to counting the frequency of words, corpus-analysis tools calculate the ratio of types to tokens.
Some corpus-analysis tools can also count the number of sentences and paragraphs and calculate the average length of words, sentences, and paragraphs in the corpus.
This type of information can help translators assess some of the stylistic features of the texts in the corpus.
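
As a rough illustration of these statistics, the sketch below computes the type/token ratio and average lengths. The function name `corpus_statistics`, the naive sentence splitter, and the second sample sentence are assumptions for the example, not part of the slides.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def corpus_statistics(text):
    """Return basic counts and averages for a non-empty text."""
    tokens = tokenize(text)
    types = set(tokens)
    # Naive sentence split on ., ! and ?; abbreviations would break it.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "tokens": len(tokens),
        "types": len(types),
        "type_token_ratio": len(types) / len(tokens),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
    }

# First sentence is from the slides; the second is made up for the example.
text = ("I really like translation because I think that "
        "translation is really, really fun. Corpora help translators.")
stats = corpus_statistics(text)
for name, value in stats.items():
    print(name, value)
```

A higher type/token ratio suggests more varied vocabulary, but the ratio depends on text length, so it is best compared across samples of similar size.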
Lemmatized lists
- allow more sophisticated types of manipulation than word-frequency lists;
- ability to group related words together to get a combined frequency count for the group of words rather than separate counts for each individual word form.

The term “lemma” is used to describe a word that includes and represents all related forms.
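
A small sketch of how word forms might be grouped under a lemma to give a combined count. The `LEMMAS` table is a hypothetical, hand-made mapping used only for this example; actual tools rely on a lemmatizer or a dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical hand-made lemma table (assumption for illustration only).
LEMMAS = {
    "translates": "translate",
    "translating": "translate",
    "translated": "translate",
}

def lemmatized_counts(tokens):
    """Group word-form counts under their lemma."""
    grouped = defaultdict(Counter)
    for form, count in Counter(tokens).items():
        lemma = LEMMAS.get(form, form)  # unknown forms act as their own lemma
        grouped[lemma][form] = count
    return grouped

tokens = ["translate", "translated", "translating", "translated", "text"]
for lemma, forms in lemmatized_counts(tokens).items():
    total = sum(forms.values())
    others = ", ".join(f"{f} {n}" for f, n in forms.items() if f != lemma)
    # Individual forms are shown in parentheses after the combined count.
    print(f"{lemma} {total}" + (f" ({others})" if others else ""))
```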
One difficulty that may arise when lemmatizing a word list automatically is the case of homographs (same spelling, different part of speech, e.g. “test” can be a noun or a verb).

In order for the computer to be able to distinguish these different forms, it is necessary to have a corpus that is annotated with part-of-speech information.
Stop lists
A type of specialized list that contains any items that a user wants the computer to ignore, e.g. words with a grammatical function: articles, conjunctions, and prepositions.
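
A stop list can be sketched as a set of words that are skipped while counting. The contents of `STOP_LIST` and the sample sentence below are assumptions chosen for the example; a user would normally supply their own list.

```python
from collections import Counter

# Illustrative stop list of function words (an assumption, not a standard list).
STOP_LIST = {"a", "an", "the", "and", "or", "but", "in", "of", "on", "to"}

def frequency_without_stop_words(tokens, stop_list=STOP_LIST):
    """Count tokens, ignoring any item that appears in the stop list."""
    return Counter(t for t in tokens if t.lower() not in stop_list)

tokens = "the translator and the reviser work on the text".split()
counts = frequency_without_stop_words(tokens)
print(counts.most_common())
```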
Concordancers
A tool that allows the user to see all the occurrences of a particular word in its immediate contexts and displays these in an easy-to-read format.

Concordancers operate on:
a) monolingual texts,
b) bilingual texts.
Monolingual concordancers
Full-text search - searching through the entire corpus from beginning to end.

Indexed search - creating an index of all the words in the corpus along with a record of the location of each occurrence (e.g., line number)
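
The difference between the two approaches can be sketched as follows. The function names and the three sample lines are assumptions for illustration; the index here records line numbers, as in the example above.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'"

def words(line):
    """Lowercase a line and strip surrounding punctuation from each word."""
    return [w.strip(PUNCTUATION) for w in line.lower().split()]

def full_text_search(lines, word):
    """Scan every line from beginning to end on each query."""
    return [n for n, line in enumerate(lines, start=1) if word in words(line)]

def build_index(lines):
    """Build the index once: word -> line number of each occurrence."""
    index = defaultdict(list)
    for n, line in enumerate(lines, start=1):
        for w in words(line):
            index[w].append(n)
    return index

# Made-up sample corpus, one text line per entry.
lines = [
    "A corpus is a collection of texts.",
    "Texts are sampled.",
    "The corpus is electronic.",
]
index = build_index(lines)
print(full_text_search(lines, "corpus"))
print(index["corpus"])
```

Building the index costs time up front, but each later lookup is a single dictionary access instead of a pass over the whole corpus.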
KWIC (“key word in context”) display of the concordances:
- the most common result display format
- all occurrences of the search pattern are lined up in the centre of the screen.
Contexts can be sorted:
- order of appearance in the corpus,
- alphabetically (according to the words preceding or following the search pattern)
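
A minimal KWIC sketch, assuming a single-line text and a fixed character window on each side of the search pattern. The function names and the `width` parameter are hypothetical choices for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for every occurrence."""
    pattern = rf"\b{re.escape(keyword)}\b"
    hits = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(), right))
    return hits

def show(hits, width=30):
    """Right-align the left context so the keyword lines up in a column."""
    for left, key, right in hits:
        print(f"{left:>{width}} {key} {right}")

text = ("I really like translation because I think that "
        "translation is really, really fun.")
hits = kwic(text, "translation")  # order of appearance in the corpus

# Alphabetical sort by the words following the search pattern.
by_following = sorted(hits, key=lambda h: h[2].lower())
# Alphabetical sort by the word immediately preceding the search pattern.
by_preceding = sorted(hits, key=lambda h: (h[0].split() or [""])[-1].lower())

show(by_following)
```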
Bilingual concordancers
- used to investigate the contents of a parallel corpus, which contains a collection of source texts (ST) in language A aligned with their translations into language B.
Alignment - sections of the ST are linked up with their corresponding translations.
Alignment can take place at different levels: text, paragraph, sentence, sub-sentence chunk, or even word.
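
A sketch of a bilingual concordance over a sentence-aligned parallel corpus. The English-French pairs in `ALIGNED` are invented for the example, and plain substring matching is an assumed simplification; the search can be run on either side of the alignment.

```python
# Hypothetical sentence-aligned corpus: (source sentence, target sentence).
ALIGNED = [
    ("The corpus is large.", "Le corpus est grand."),
    ("Translators use corpora.", "Les traducteurs utilisent des corpus."),
    ("The text is short.", "Le texte est court."),
]

def bilingual_concordance(pairs, pattern, side="source"):
    """Return aligned pairs whose chosen side contains the pattern."""
    idx = 0 if side == "source" else 1
    needle = pattern.lower()
    return [pair for pair in pairs if needle in pair[idx].lower()]

# Search the source side, then the target side, for the same string.
source_hits = bilingual_concordance(ALIGNED, "corpus", side="source")
target_hits = bilingual_concordance(ALIGNED, "corpus", side="target")
for src, tgt in target_hits:
    print(f"{src}  |  {tgt}")
```

Because each hit comes back with its aligned partner, the translator sees how the pattern was rendered in the other language.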

Bilingual and multilingual corpora:
1. Parallel corpora (Lithuanian: lygiagretieji)
- contain texts in language A alongside their translations into language B, C, etc.
- ST aligned with translations.
2. Comparable corpora (Lithuanian: lyginamieji)
- texts written in different languages but having the same communicative function (e.g. all on the same subject / type of text)
- cannot be aligned, no ST-TT relationship

Parallel corpora: texts, paragraphs and sentences from one language and their translations in another language are connected. This allows you to study how words and phrases are used and translated across various languages.

Characteristics of a corpus: a “large” collection of “electronic” texts gathered according to “explicit criteria”.

MCC (consist of 2 parts):
Advantage (+): useful for studying the nature of translated text
Disadvantage (-): a less useful resource for practising translators
e.g. translate (translates, translating, translated): in a lemmatized list, the individual word forms under one lemma appear in parentheses
Other search patterns:
- case-sensitive searches (e.g., Turkey-turkey)
- wildcard searches (e.g., "print*", "dis?s")
- context searches (another term appears within a specified distance of the search pattern)
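
These three search types can be sketched with Python's standard `fnmatch` module, whose `*` and `?` wildcards behave like the patterns above. The function names, the `distance` parameter (counted in words), and the sample sentence are assumptions for illustration.

```python
from fnmatch import fnmatchcase

def wildcard_search(tokens, pattern, case_sensitive=False):
    """Match tokens against a pattern where * is any run and ? is one character."""
    if case_sensitive:
        return [t for t in tokens if fnmatchcase(t, pattern)]
    return [t for t in tokens if fnmatchcase(t.lower(), pattern.lower())]

def context_search(tokens, term, other, distance=3):
    """Return positions of `term` that have `other` within `distance` words."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for i, tok in enumerate(lowered):
        if tok == term.lower():
            window = lowered[max(0, i - distance):i] + lowered[i + 1:i + 1 + distance]
            if other.lower() in window:
                hits.append(i)
    return hits

# Made-up sample sentence, already split into words.
tokens = ("Turkey exports printers and discs while "
          "a turkey prints nothing on disks").split()

print(wildcard_search(tokens, "print*"))
print(wildcard_search(tokens, "dis?s"))
print(wildcard_search(tokens, "Turkey", case_sensitive=True))
print(context_search(tokens, "turkey", "prints", distance=2))
```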
For best results, ST and TT must have a similar or identical structure.
operate by:
Once a search has been conducted, the results are displayed.
Monolingual and bilingual concordancers retrieve all occurrences of a particular search pattern in its immediate contexts.

Most bilingual concordancers are BIDIRECTIONAL - the search pattern can be entered in either language A or language B.

Many of the search options available in monolingual concordancers are also available in bilingual concordancers (wildcard searches, context searches).