Evaluating a dialogue system or dialogue model is typically done by considering the values of certain features of dialogue transcripts, such as dialogue length or task completion. However, for complex tasks these features can become less relevant or difficult to compute; tutorial dialogue is one such case. In the absence of learning gain data (as is the case for our corpus), we will use the judgements of human experts who viewed dialogue actions generated by the model as a measure of how good the model is. In this talk I will motivate this approach to evaluation, relate it to similar studies, and present the experimental design, before discussing some of its benefits and possible drawbacks.