Corpus linguistics - the short-cut to linguistic evidence
Time: April 19-23, 2010, 9-16
Venue: University of Copenhagen, room 22.1.62
Organizer: Gradeast Graduate School in Linguistics
Course description
Over the last few decades, growth in computing power and storage capacity has paved the way for a revival of applied corpus methodology, which has brought innovative elements to nearly all branches of linguistics. Large amounts of authentic linguistic data, combined with computer-based search facilities, have made possible a qualitative improvement in linguistic analysis, both in lexical and grammatical studies and in contrastive and translation studies.
The course will both describe existing corpora and cover methods for building new corpora to address particular research questions.
To tie the many topics of corpus linguistics together, the plan is that the participants design and collect their own corpora and that these corpora - as far as possible - are processed and annotated during the course. Involving the participants actively will ensure that knowledge is acquired efficiently.
Registration: Please fill out this registration form by March 31, 2010. Participants should send a one-page abstract of their project to gradeast@hum.ku.dk before March 31. Information about admittance will be sent out to applicants shortly after March 31.
Credits: 3.8 ECTS. A course certificate will be issued to participants who have attended at least 80% of the course.
Grants: A limited number of International Student Travel Grants are available to participants from Nordic and Baltic countries through the NordLing network and will be awarded on a "first come, first served" basis. Danish students cannot apply for these grants. Students must apply for International Student Travel Grants directly to NordLing. See the NordLing website: https://webmail.hum.ku.dk/exchweb/bin/redir.asp?URL=http://groups.google.com/group/nordling/ (see 'Participants' and 'Grants').
Programme:
April 19
Morning
Presentation of the participants (1 min. each)
Overall introduction to corpus linguistics - what should you take into consideration when you want to support your linguistic research questions with your own collections of texts?
- What is corpus linguistics, what is a corpus, corpus linguistics historically
- Representativeness, balance and sampling. How to compose a corpus for your purpose.
- Examples of how you can search a corpus using a corpus tool.
Literature: Tony McEnery, Richard Xiao and Yukio Tono: Corpus-Based Language Studies, A1 and A2 (pp. 3-21); Bente Maegaard and Hanne Ruus: The Composition and Use of a Text Corpus. In: Zampolli (ed.) Studies in Honour of Roberto Busa S. J., Linguistica Computazionale, Vol. IV-V, Pisa 1981, pp. 103-122.
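As an illustration of what searching a corpus with a corpus tool amounts to, the sketch below produces a small keyword-in-context (KWIC) concordance in Python. It is a minimal example, not the behaviour of any particular tool used in the course: the tokenizer, the function names and the sample sentence are all assumptions made for the illustration.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Invented sample text, used only to show the output format
sample = ("The corpus is small. A corpus should be balanced, "
          "and a corpus should be sampled with care.")
tokens = tokenize(sample)
hits = kwic(tokens, "corpus", window=2)

for left, kw, right in hits:
    # Right-align the left context so the keywords line up in a column
    print(f"{left:>20} [{kw}] {right}")
```

Real corpus tools add lemma and part-of-speech search, sorting and frequency counts on top of this basic idea.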
Afternoon
Introduction to the design of a corpus of language for special purposes, with the DK-CLARIN LSP corpus as a concrete example
- Defining the domain(s), the communicative context and the text types
- Collecting the texts, copyright problems
- Metadata and how to encode them
- How to assess the quality of the collected texts
Literature: Lynne Bowker and Jennifer Pearson: Working with Specialized Language, A practical guide to using corpora, Routledge, London, 2002, pp.25-39, pp.45-54, pp.58-73.
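To make the point about encoding metadata concrete, here is a minimal Python sketch that stores the domain, text type and source of one collected text in a small XML header. The element names (textHeader, domain, textType, source, year) and the example values are hypothetical and chosen only for illustration. They are not the scheme used in the DK-CLARIN LSP corpus.

```python
import xml.etree.ElementTree as ET

def make_header(text_id, domain, text_type, source, year):
    """Build a small XML metadata header; element names are illustrative only."""
    header = ET.Element("textHeader", {"id": text_id})
    fields = [("domain", domain), ("textType", text_type),
              ("source", source), ("year", str(year))]
    for name, value in fields:
        ET.SubElement(header, name).text = value
    return header

# Hypothetical record for one collected text
header = make_header("lsp-0001", "health", "patient leaflet", "example.org", 2009)
xml_string = ET.tostring(header, encoding="unicode")
print(xml_string)
```

Keeping the metadata in a fixed, machine-readable structure makes it possible later to select subcorpora, for example all texts from one domain or one text type.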
April 20
Morning
Existing corpora and corpus tools
- KorpusDK. Composition, sampling, representativeness. Examples of use of KorpusDK
- Searching KorpusDK, the corpus tool
- Description of CQP as a corpus tool. MULINCO as an example (look up word, lemma etc.)
- British National Corpus. Composition, sampling, representativeness
- Searching British National Corpus, the corpus tool
Literature: Tony McEnery, Richard Xiao and Yukio Tono: Corpus-Based Language Studies, A7, A9 (pp. 59-70, 77-79),
Jørg Asmussen: Towards a methodology for corpus-based studies of linguistic change: Contrastive observations and their possible diachronic interpretations in the Korpus 2000 and Korpus 90. General Corpora of Danish. In: Language and Computers, Corpus Linguistics Around the World. Edited by Andrew Wilson, Dawn Archer, Paul Rayson, pp. 33-48(16), Rodopi.
On the British National Corpus, http://www.natcorp.ox.ac.uk/, in particular the Creating the BNC pages.
Afternoon
After the introduction to available general corpora, the primary activity this afternoon will focus on how to develop your own corpus.
- Introduction to WordSmith Tools (Version 5.0)
- Collection of corpora from the internet
Literature: McEnery, Xiao and Tono: Corpus-Based Language Studies. An Advanced Resource Book, Routledge, 2008 (Unit A8, A9) and various relevant parts of the WordSmith Tools manual.
April 21
Morning
Annotating collected corpora with linguistic information adds value in several ways. In this session, one of the most widely used annotation types, part-of-speech (POS) tagging, will be described in detail. In the afternoon, annotation of other types of linguistic and pragmatic information will be the topic.
- Introduction to the basic concepts in POS tagging
- Demonstration of existing POS taggers
Literature: Maegaard & Schøsler (eds.): En snemand på syv måder, Ordklasse-tagging; McEnery, Xiao and Tono: Corpus-Based Language Studies. An Advanced Resource Book, Routledge, 2008 (Unit A4); Manning and Schütze: Foundations of Statistical Natural Language Processing, chapter 10, pp. 341-380.
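One of the basic concepts in POS tagging can be shown with a very small baseline tagger: assign every word the tag it carries most often in hand-tagged training data. The Python sketch below is such a baseline, not one of the taggers demonstrated in the course. The tag names, the toy training sentences and the default tag for unknown words are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn, for every word form, the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag each word with its most frequent training tag; unknown words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Toy training data, invented for illustration
training = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("walk", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("they", "PRON"), ("walk", "VERB"), ("home", "ADV")],
    [("we", "PRON"), ("walk", "VERB"), ("the", "DET"), ("dog", "NOUN")],
]
model = train_baseline(training)
tagged = tag(["The", "cat", "barks"], model)
print(tagged)
```

The ambiguous form "walk" shows the limitation: the baseline always picks the majority tag and ignores context, which is the problem that statistical and rule-based taggers are designed to address.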
Afternoon
Introduction to other types of annotation, such as syntactic, semantic and pragmatic information
- Overview of corpora containing these kinds of annotations (i.a. BNCWeb, IMS Dickens' Corpus, DAD)
- Presentation of formalisms, coding schemes, and tools which can support the annotation process (MATE, DTR)
- Exercises: annotation of data previously collected and extraction of semantic and/or pragmatic information from existing corpora.
Literature: McEnery, Xiao and Tono (2008) Corpus-Based Language Studies. An Advanced Resource Book: paragraphs A4.4.3-6 (pages 36-41), A10.2-5 (pages 80-91), A10.9-10 (pages 103-108), C5 (pages 287-320).
April 22
Morning
Overall introduction to translation theory and contrastive linguistic studies based on parallel (aligned) corpora.
- Bilingual and multilingual corpora; existing corpora; corpus tools for parallel corpora
- Parallel corpora and comparable corpora, their compilation and use
- Contrastive studies
Literature: Tony McEnery, Richard Xiao and Yukio Tono: Corpus-Based Language Studies, A5 (pp. 46-51); Hanne Jansen: Oversættelsesstudier, kontrastiv lingvistik og elektroniske tekstkorpora, chapter from En Snemand på syv måder, Museum Tusculanum 2010; Maeve Olohan: Introducing Corpora in Translation Studies, Routledge 2004, chapters 2-4, pp. 12-44.
Afternoon
Research in translation theory and contrastive linguistics is greatly facilitated when bitexts and comparable documents are aligned in such a way that searching and subsequently displaying the results is straightforward.
- Alignment of texts
- Alignment of sentences and words
Literature: McEnery, Xiao and Tono: Corpus-Based Language Studies. An Advanced Resource Book, Routledge, 2008 (Unit A5), Manning and Schütze: Foundations of Statistical Natural Language Processing, chapter 13, pp 461-492.
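Sentence alignment can be illustrated with a simplified length-based approach: sentences that translate each other tend to have similar lengths, so dynamic programming can pick the cheapest sequence of 1-1, 1-0, 0-1, 2-1 and 1-2 pairings. The Python sketch below follows that idea only loosely. The cost function (difference in character length plus a fixed penalty for anything other than a 1-1 pairing) is an assumption made for this example and is not the probabilistic model used by published aligners. The sentence pair is invented.

```python
def align_sentences(src, tgt, penalty=5):
    """Align two sentence lists; returns a list of (source group, target group) beads."""
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                # Assumed cost: length difference, plus a penalty for non 1-1 beads
                step = abs(len_src - len_tgt) + (0 if (di, dj) == (1, 1) else penalty)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow the back pointers from the end to recover the chosen beads
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return beads[::-1]

# Invented example: the translator merged two English sentences into one Danish one
english = ["The corpus is large.", "It was collected in 2009.",
           "It contains newspaper text."]
danish = ["Korpusset er stort.",
          "Det blev indsamlet i 2009 og indeholder avistekst."]
beads = align_sentences(english, danish)
for source_group, target_group in beads:
    print(source_group, "<->", target_group)
```

Here the cheapest path pairs the first sentences 1-1 and groups the last two English sentences with the single Danish sentence. Word alignment builds on such sentence pairs but needs additional statistical information about word co-occurrence.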
April 23
Morning
In this session, the class will be given a description of the basic statistical concepts that are relevant in connection with using corpus-based approaches.
- How to calculate dispersion of the collected data
- Introduction to test methodology with respect to statistical significance
Literature: McEnery, Xiao and Tono: Corpus-Based Language Studies. An Advanced Resource Book, Routledge, 2008 (Unit A6 and Case study 1 in section C)
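The two topics of this session can be sketched in a few lines of Python. For dispersion, the example uses the deviation of proportions (DP), one of several dispersion measures: it compares the share of a word's occurrences found in each corpus part with the share of the corpus that the part makes up. For significance, it uses the two-term log-likelihood statistic that is widely used to compare the frequency of a word in two corpora. Choosing these two measures is an assumption made for the illustration, and all frequencies below are invented.

```python
import math

def deviation_of_proportions(freqs, sizes):
    """DP: 0 means the word is spread evenly over the parts, values near 1 mean it is clumped."""
    total_freq = sum(freqs)
    total_size = sum(sizes)
    return 0.5 * sum(abs(f / total_freq - s / total_size)
                     for f, s in zip(freqs, sizes))

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-term log-likelihood (G2) for one word observed in corpora A and B."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Invented counts: a word in four corpus parts of 1000 tokens each
dp_even = deviation_of_proportions([10, 10, 10, 10], [1000, 1000, 1000, 1000])
dp_clumped = deviation_of_proportions([40, 0, 0, 0], [1000, 1000, 1000, 1000])

# Invented counts: 150 hits in 10,000 tokens versus 50 hits in 10,000 tokens
g2 = log_likelihood(150, 10000, 50, 10000)
print(dp_even, dp_clumped, round(g2, 2))
```

With one degree of freedom, a G2 value above 3.84 corresponds to p < 0.05, so the difference in the second example would count as significant at that level.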
Afternoon
Searching in corpora
- Searching with CQP (Corpus Query Processor), a tool developed at Stuttgart University to retrieve information from large corpora encoded with the IMS Corpus Workbench (CWB) (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html)
- Exercises: using CQP to retrieve various types of information from corpora organized in CWB (possibly data collected and previously annotated by participants)
Literature: Maegaard & Schøsler eds. En Snemand på syv måder. Article by Lene Offersgaard & Sussi Olsen. Brug af Korpusplatform, and relevant parts in The IMS Corpus Workbench: Corpus Query Processor (CQP) User's Manual (v2.2) 1999: http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPUserManual/PDF/cqpman.pdf
Chapter 3: pages 15-18, 20-23
Chapter 5: pages 34-35
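CQP queries describe sequences of tokens by constraints on their attributes, for example [pos="ADJ"] [lemma="dog"] for an adjective followed by a form of "dog". The Python sketch below imitates that matching idea on a tiny hand-annotated token list. It is an emulation for illustration only, not CQP itself: the attribute names, tag values and data are assumptions, and the attributes actually available depend on how a corpus was encoded in CWB.

```python
import re

def matches(token, constraints):
    """True if every attribute constraint (a regular expression) matches the whole value."""
    return all(re.fullmatch(rx, token.get(attr, "")) for attr, rx in constraints.items())

def search(tokens, pattern):
    """Return start positions where consecutive tokens satisfy the pattern, one constraint set per token."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        if all(matches(tokens[start + k], c) for k, c in enumerate(pattern)):
            hits.append(start)
    return hits

# Tiny invented corpus with word, lemma and part-of-speech attributes
corpus = [
    {"word": "The", "lemma": "the", "pos": "DET"},
    {"word": "old", "lemma": "old", "pos": "ADJ"},
    {"word": "dogs", "lemma": "dog", "pos": "NOUN"},
    {"word": "chased", "lemma": "chase", "pos": "VERB"},
    {"word": "a", "lemma": "a", "pos": "DET"},
    {"word": "cat", "lemma": "cat", "pos": "NOUN"},
]

# Roughly what the CQP query [pos="ADJ"] [lemma="dog"] expresses
adj_dog = search(corpus, [{"pos": "ADJ"}, {"lemma": "dog"}])
# Roughly what [pos="DET"] [pos="NOUN"] expresses
det_noun = search(corpus, [{"pos": "DET"}, {"pos": "NOUN"}])
print(adj_dog, det_noun)
```

CQP itself works on indexed corpora and offers much more, such as repetition operators, structural constraints and frequency breakdowns, as described in the tutorial and manual listed above.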
Teachers
Claus Povlsen, CST, course coordinator
Bente Maegaard, CST
Costanza Navarretta, CST
Sussi Olsen, CST


