Learning to Model Text Structure
Regina Barzilay
The natural
language processing community has struggled for years to develop
computational models of text structure. Such models are essential
both for interpretation of human-written text and for evaluation of
machine-generated text.
In this talk, I will present our first steps towards learning to
model text structure. I will describe two models that are induced from
a large collection of unannotated texts. The first model captures the
notion of text cohesion by
considering connectivity patterns characteristic of well-formed
texts. These patterns are inferred from a matrix that combines
distributional and syntactic information about text entities. The
second model captures the content structure of texts within a specific
domain, in terms of the topics the texts address and the order in which
these topics appear. I will present an effective method for learning
content models, utilizing a novel adaptation of algorithms for Hidden
Markov Models. To conclude my talk, I will show how these text models
can be effectively integrated into natural language generation and
summarization systems.
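The cohesion model described above can be sketched as an entity grid: a matrix with one row per entity and one column per sentence, recording each entity's syntactic role, from which role-transition probabilities are read off as connectivity features. The sentences, entities, and role labels below are invented for illustration, and the roles are given by hand; in the actual model they come from a syntactic parser:

```python
from collections import Counter

# Hypothetical pre-annotated input: for each sentence, the grammatical
# role of each entity it mentions (S = subject, O = object, X = other).
# In the real model these roles are produced by a syntactic parser.
sentences = [
    {"Microsoft": "S", "suit": "O"},
    {"Microsoft": "O", "government": "S"},
    {"suit": "S"},
]

entities = sorted({e for sent in sentences for e in sent})

# Entity grid: one row per entity, one column per sentence;
# "-" marks sentences in which the entity does not occur.
grid = {e: [sent.get(e, "-") for sent in sentences] for e in entities}

# Connectivity features: relative frequencies of role transitions
# of length 2 (e.g. S -> O, O -> -) across consecutive sentences.
transitions = Counter()
for roles in grid.values():
    for a, b in zip(roles, roles[1:]):
        transitions[(a, b)] += 1
total = sum(transitions.values())
probs = {t: n / total for t, n in transitions.items()}
```

A well-formed text tends to realize salient entities in prominent roles across adjacent sentences, so transition distributions like these separate coherent orderings from scrambled ones.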
This is joint work with Mirella Lapata and Lillian Lee.
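A content model of this kind can be sketched as a small HMM whose hidden states are topics, whose transitions encode topic order, and whose emissions generate sentences; Viterbi decoding then recovers the most likely topic sequence for a document. All topics, probabilities, and vocabulary below are invented for illustration, and the emissions are simple unigram distributions; the actual models induce their states from unannotated text and use richer per-state language models:

```python
import math

# Hypothetical hand-set parameters for illustration only.
topics = ["location", "damage"]
start = {"location": 0.8, "damage": 0.2}
trans = {
    "location": {"location": 0.3, "damage": 0.7},
    "damage": {"location": 0.1, "damage": 0.9},
}
# Per-topic unigram emission probabilities; unseen words get a small floor.
emit = {
    "location": {"quake": 0.2, "struck": 0.2, "city": 0.6},
    "damage": {"buildings": 0.5, "collapsed": 0.3, "quake": 0.2},
}

def sent_logprob(topic, words):
    """Log probability of a sentence's words under a topic's unigram model."""
    return sum(math.log(emit[topic].get(w, 1e-6)) for w in words)

def viterbi(doc):
    """Most likely topic sequence for a document (a list of token lists)."""
    V = [{t: math.log(start[t]) + sent_logprob(t, doc[0]) for t in topics}]
    back = []
    for words in doc[1:]:
        col, ptr = {}, {}
        for t in topics:
            best = max(topics, key=lambda p: V[-1][p] + math.log(trans[p][t]))
            ptr[t] = best
            col[t] = (V[-1][best] + math.log(trans[best][t])
                      + sent_logprob(t, words))
        V.append(col)
        back.append(ptr)
    # Follow back-pointers from the best final state.
    path = [max(topics, key=lambda t: V[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

doc = [["quake", "struck", "city"], ["buildings", "collapsed"]]
```

Here `viterbi(doc)` assigns the first sentence to the "location" topic and the second to "damage", reflecting both the emission fit and the preferred topic order.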