MorphLaz: a finite-state morphological analyzer for Laz

Author:

Esra Önal

Advisor:

Arzucan Özgür

Advisor:

Balkız Öztürk

This thesis is part of the documentation and revitalization efforts for the endangered Laz language, a member of the South Caucasian language family spoken mainly on the northeastern coastline of Turkey. It introduces the first automatic language analysis tool for Laz, specifically for the Pazar dialect: a rule-based morphological analyzer developed with two-level morphology using finite-state networks. Additional language resources, namely a lexicon and a corpus, were collected to increase the coverage of the analyzer and to evaluate its performance. Morphologically rich languages pose many challenges for natural language processing (NLP) tasks. In any NLP pipeline, whether the goal is a high-level or a low-level system such as lemmatization, part-of-speech tagging, spelling correction or machine translation, the first step is usually some form of morphological analysis of text or speech. Among the different approaches to the computational study of morphology, and given the scarcity of language and computational resources for Laz, I chose a rule-based approach that is widely accepted and used for formalizing morphotactics and morphophonemics, namely two-level morphology with finite-state transducers. The evaluation is based on the naïve coverage of the analyzer over text data and on error analysis. The results show 78.2% coverage over the unique tokens in the Pazar Laz corpus (PLC), 92.1% coverage over the Laz Treebank and 74.3% coverage over the Fındıklı Laz corpus (FLC). Error analysis of the PLC results indicates that most of the word forms that could not be analyzed are due to missing word stems.
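As an illustration of the evaluation measure mentioned above, the following minimal Python sketch computes naïve coverage over unique tokens. It assumes that naïve coverage means the share of unique tokens that receive at least one analysis, regardless of whether that analysis is correct. The function names `naive_coverage` and `toy_analyze`, the lookup table and the placeholder tokens are hypothetical; they stand in for the actual finite-state analyzer and corpora, which are not reproduced here.

```python
from typing import Callable, Iterable, List


def naive_coverage(tokens: Iterable[str],
                   analyze: Callable[[str], List[str]]) -> float:
    """Fraction of unique tokens for which `analyze` returns at least one analysis."""
    unique_tokens = set(tokens)
    if not unique_tokens:
        return 0.0
    covered = sum(1 for token in unique_tokens if analyze(token))
    return covered / len(unique_tokens)


# Hypothetical stand-in for the analyzer: a lookup table over placeholder
# strings (not real Laz word forms or real analyzer output).
TOY_ANALYSES = {
    "aaa": ["aaa+N"],
    "bbb": ["bbb+V"],
}


def toy_analyze(token: str) -> List[str]:
    """Return the list of analyses for a token, or an empty list if none."""
    return TOY_ANALYSES.get(token, [])


# Placeholder corpus with one repeated token; repeats are counted once.
tokens = ["aaa", "bbb", "aaa", "ccc", "ddd"]
coverage = naive_coverage(tokens, toy_analyze)
print(f"Naive coverage over unique tokens: {coverage:.1%}")
```

In this toy run, two of the four unique tokens receive an analysis. Because the measure does not check correctness, it is best read alongside an error analysis of the unanalyzed forms.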