Evaluating a dialogue system or dialogue model is typically done by considering the values of certain features of dialogue transcripts, such as dialogue length or task completion. However, for complex tasks these features can become less relevant or difficult to compute; tutorial dialogue is one such case. In the absence of learning gain data (as is the case for our corpus), we will use the judgements of human experts who viewed dialogue actions generated by the model as a measure of how good the model is. In this talk I will motivate this approach to evaluation, relate it to similar studies, and present the experimental design, before discussing some of its benefits and possible drawbacks.