Computational Linguistics Colloquium

Thursday, 17 July 2014, 16:15
Conference Room, Building C7.4

Readability analysis as an experimental sandbox for exploring linguistic complexity

Detmar Meurers
Department of Linguistics
University of Tübingen

The analysis of readability has traditionally relied on surface properties of language, such as average sentence and word lengths and specific word lists. At the same time, there is a long tradition of analyzing the Complexity, Accuracy, and Fluency (CAF) of language produced by language learners in second language acquisition (SLA) research. Reusing SLA measures of learner language complexity to analyze readability, Sowmya Vajjala and I explored which aspects of linguistic modeling can successfully be employed to predict the readability of a native language text. Using various machine learning setups and corpora, we show that a broad range of linguistic properties are highly indicative of the readability of documents, from graded readers to web pages and TV programs targeting different age groups. The readability model using our full linguistic feature set is currently the best non-commercial readability model available for English (and second overall, with the commercial ETS model coming in first), based on performance on the Common Core State Standards data set. The fact that we found readability to be reflected in a wide range of linguistic aspects also has consequences for text simplification, where we are interested in identifying which kind of simplification would be worthwhile for which sentences. To support such research, we show that our text readability models can meaningfully be applied to individual sentences.

If you would like to meet with the speaker, please contact Andrea Horbach.