S. Beniamine: Quantitative data creation and management in inflectional morphology
Instructor: Sacha Beniamine (University of Surrey)
This course is an introduction to data management for quantitative linguists, applied to inflectional morphology. The core of session 1 applies to the curation of machine-readable data in linguistics in general: What are good data practices? Why are they needed? The next two sessions zoom in on inflected lexicons, or collections of morphological paradigms.
- Session 1: This session will introduce some problems in inflectional morphology and discuss data management in quantitative linguistics. We will examine the main current problems facing data curators and users, some principles which can help solve them, and specific solutions.
- Session 2: The Paralex standard for inflected lexicons.
- Session 3: Hands-on session: Evaluate existing resources and create a small Paralex lexicon (bring your computer).
H. Burnett: xxx
Instructor: Heather Burnett (CNRS/LLF & InldEx EFL)
(to appear)
N. Levshina: A statistical toolkit for linguistic typology
Instructor: Dr. Natalia Levshina (Radboud University, The Netherlands)
This hands-on course introduces key statistical methods for data exploration and hypothesis testing in linguistic typology. The focus is on practical application, with methods illustrated through case studies based on cross-linguistic databases and corpora. Participants will work with provided R scripts and gain experience applying the techniques to real typological data.
- Session 1. Introduction to regression models for testing cross-linguistic correlations and implicational relationships, while accounting for genealogical and spatial dependencies between languages.
- Session 2. Multidimensional scaling for the construction of proximity-based semantic maps, including cluster identification and interpretation of dimensions.
- Session 3. Conditional inference trees and random forests for investigating and comparing constraints on formal variation across languages.
J. Marsault: Documentation and fieldwork issues - the Umóⁿhoⁿ language
Instructor: Julie Marsault (University of Paris 3 & Lacito)
This class will present issues in data gathering and annotation for an under-described and critically endangered language, Umóⁿhoⁿ (also Omaha; Siouan, United States).
- In the first session, I will present the Siouan languages in general and Umóⁿhoⁿ in particular.
- The second session will be dedicated to sharing my fieldwork experience and some of the social and technical difficulties of preserving and documenting the Umóⁿhoⁿ language.
- Finally, we will have a hands-on session annotating Umóⁿhoⁿ sentences.
A. Miletić: Natural Language Processing (NLP) and documentation of less studied languages: focus on Automatic Speech Recognition (ASR)
Instructor: Aleksandra Miletić (CNRS/MoDyCo)
Estimates say that 35-42% of the world’s languages remain undocumented. This is in part due to the high cost of manually processing the collected data, the bottleneck often being the very first step: transcribing audio into text. However, recent developments in NLP have led to impressive improvements across numerous tasks, including automatic speech recognition (ASR). In this course, we will explore the impact these advances have on the documentation of less studied languages. First, we will examine the notion of less resourced languages and the variety of sociolinguistic situations in which they are spoken. Next, we will explore the promises and pitfalls of recent (sometimes massively) multilingual ASR tools (Whisper, Omnilingual ASR) through hands-on experience with model fine-tuning, evaluation, and error analysis. Finally, we will examine the role of the “language-as-data” paradigm in the current NLP landscape and possible alternative ways forward.
P. Muller: Pretrained Language Models for linguistic research
Instructor: Philippe Muller (University of Toulouse, IRIT & GDR TAL)
This class will introduce the fundamentals of recent pretrained language models (PLMs) and explore their relationship with traditional linguistic levels of analysis.
We will examine the language capabilities of PLMs and discuss how they can support linguistic research, e.g. by generating data, automating annotation, or serving as experimental models of language performance. In addition, we will address issues and challenges in using or studying these models, notably multilinguality and representativeness.
The class will feature practical lab exercises, some of which can be informed by participants' use cases.
C. Parisse: Create, annotate, and analyze a multimodal corpus of language interaction
Instructor: Christophe Parisse (CNRS/MoDyCo & HumaNum CORLI consortium)
This course will present the methodology and tools that can be used to create, annotate, and work with a multimodal corpus whose goal is the study of natural, spontaneous conversation recorded in ecological settings.
The course will use examples focusing on the study of language acquisition and situations involving children and parents, but it also applies to any other type of language situation.
- Session 1: Review of recording conditions and data collection procedures. Presentation of data available in repositories such as ORTOLANG that can be used for research purposes. Presentation of tools for transcribing data, either manually or semi-automatically.
- Session 2: The theme of this session will be data annotation and the use of appropriate tools. The aim here is no longer simply to produce a transcription, but to become familiar with the tools and principles that enable data enrichment for the study of, for example, phonology, syntax, pragmatics, gesture, and interaction. This session will present in more detail tools such as ELAN, which are particularly well suited to multimodal research, in connection with other tools depending on the research needs.
- Session 3: This session will focus on data analysis. In particular, it will present methods for extracting and analyzing data from ELAN, using as an example the work carried out within the DINLANG project, which examines dinner conversations between children and parents, with analyses focusing on language and gesture.
C. Pozniak: Applying psycholinguistic methods to linguistics
Instructor: Céline Pozniak (University of Paris 8 & SFL)
This course will introduce you to the principles and methods of psycholinguistics. We will explore how empirical approaches allow us to test hypotheses about language processing, using both offline and online methods.
- Session 1: Introduction. We will explore how empirical evidence shapes psycholinguistic research, from hypothesis testing to experimental design.
- Session 2: Offline methods. We will examine offline methods (e.g., acceptability judgments) and their role in uncovering language structure and processing.
- Session 3: Online methods. We will examine online methods (e.g., eye-tracking) to study real-time language processing.