A Syntactic Flexibility Measure for Learning Multiword Expressions
Colin Bannard

The extraction of multiword units from text corpora has been a topic of research in NLP for more than a decade. This work has focused almost exclusively on the extraction of sequences of words that occur with a higher frequency than would be predicted from the frequencies of the individual words. For some lexicographic tasks, such as terminology extraction, this seems to have proved useful. However, for most NLP tasks, we are interested not in statistically idiosyncratic units but rather in those units that do not behave like regular word combinations in terms of their syntax or their semantics. Extraction techniques based solely on co-occurrence statistics cannot be used to acquire these, as the vast majority of the phrases returned are syntactically and semantically orthodox. This talk will describe an attempt to automatically extract V+NP multiword expressions by looking at how their syntactic behaviour, as observed over a corpus, diverges from what we would expect to see in a free word combination.
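
As a rough illustration of the general idea of scoring a V+NP pair by how far its observed syntactic behaviour diverges from that of a free combination, the sketch below compares two distributions over syntactic variants using KL divergence. This is not the measure presented in the talk: the variant categories, the toy counts, the smoothing constant, the choice of KL divergence, and the function names (`normalize`, `kl_divergence`, `rigidity`) are all assumptions made for illustration only.

```python
import math


def normalize(counts, smoothing=0.5):
    """Turn raw variant counts into a smoothed probability distribution.

    Additive smoothing (an assumed choice) keeps every probability non-zero.
    """
    total = sum(counts.values()) + smoothing * len(counts)
    return {k: (v + smoothing) / total for k, v in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) in bits; p and q must share the same keys, with q non-zero."""
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)


def rigidity(observed_counts, expected_counts):
    """Hypothetical score: divergence of observed variant use from expectation."""
    return kl_divergence(normalize(observed_counts), normalize(expected_counts))


# Invented toy counts, not corpus data. "expected" stands in for what a free
# combination of the same verb and noun might show.
expected = {"canonical": 50, "passive": 15, "modified_noun": 20, "other_determiner": 15}
fixed_like = {"canonical": 95, "passive": 1, "modified_noun": 2, "other_determiner": 2}
free_like = {"canonical": 48, "passive": 16, "modified_noun": 21, "other_determiner": 15}

print("fixed-like:", rigidity(fixed_like, expected))
print("free-like: ", rigidity(free_like, expected))
```

Under these assumptions, a pair that almost never passivizes, takes noun modifiers, or varies its determiner receives a higher score than a pair whose variant distribution is close to the expected one. How the expected distribution is actually estimated is the substantive question, and the sketch leaves it open.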