3.7 Comments on the Exercise

Comments on the exercise's results.

3.7.1 Annotation reliability

If you didn't agree with each other, don't worry.

Inter-annotator agreement is a common problem in annotating data. Quite commonly the agreement among different annotators is as low as the agreement that would be expected to occur by chance.

In order to make coding credible and to be able to test empirical hypotheses by comparing them to the data, [CII+97] have argued for the application of a statistical measure known as the Kappa Coefficient. The Kappa Coefficient measures pairwise agreement among coders who make category judgments. In dialogue data annotation it can be applied to the choice coders have to make between the different dialogue acts available in the taxonomy used for the annotation. For dialogue act annotation the Coefficient is defined as follows:

K = (P(A) - P(E)) / (1 - P(E))

where P(A) is the proportion of times that the annotators agree on the dialogue act assigned to an utterance and P(E) is the proportion of times that they are expected to agree by chance.
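As an illustration, the computation can be sketched in a few lines of Python, assuming each coder's annotation is given as a plain list of dialogue act labels, one label per utterance; the labels, the data and the helper name cohen_kappa below are made up for the example.

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """K = (P(A) - P(E)) / (1 - P(E)) for two annotators' label sequences."""
        n = len(labels_a)

        # P(A): observed proportion of utterances the two coders label identically.
        p_agree = sum(a == b for a, b in zip(labels_a, labels_b)) / n

        # P(E): agreement expected by chance, from each coder's label frequencies.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_chance = sum((freq_a[act] / n) * (freq_b[act] / n) for act in freq_a)

        return (p_agree - p_chance) / (1 - p_chance)

    # Hypothetical annotations of the same eight utterances by two coders.
    coder1 = ["inform", "request", "inform", "accept", "inform", "request", "inform", "accept"]
    coder2 = ["inform", "request", "inform", "reject", "inform", "inform", "inform", "accept"]

    print(cohen_kappa(coder1, coder2))   # 0.6

The closer K is to 1, the better the agreement beyond chance; K = 0 means the coders agree no more often than chance alone would predict.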

A suggestion of how reliability can be achieved was first introduced by [Kri80] for the field of content analysis. He suggested three different ways of testing reliability, which are applicable to annotation reliability:

Stability

It measures the agreement between two annotations of the same data produced by the same annotator at two different points in time. There should be no significant difference.

Reproducibility

It measures agreement between different coders. There should be no significant difference.

Accuracy

It measures the difference between a coder's annotation and a standard annotation. The standard is commonly the annotation of an expert, typically the developer of the scheme. In that case the results of the test show whether the developer's instructions to the coders capture the goal of the developer's scheme.

Two points should be made in relation to the interpretation of the test results. First, the amount of agreement attributed to chance depends on the relative frequency of each dialogue act category. Second, the results for a coding category depend on the results of any category it relies upon. That means, for example, that if annotators do not reach significant agreement when coding for dialogue participants' turns, the results of coding for utterances, dialogue acts and insertions will not exhibit significant agreement either.
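To make the first point concrete with some invented figures: if a single dialogue act covers 90% of the utterances and both coders use it with that frequency, chance agreement is already P(E) = 0.9 × 0.9 + 0.1 × 0.1 = 0.82, so even an observed agreement of P(A) = 0.9 only yields K = (0.9 - 0.82) / (1 - 0.82) ≈ 0.44. High raw agreement can therefore still correspond to a rather modest Kappa when the categories are very unevenly distributed.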

3.7.2 Multi-functional utterances

One utterance can perform more than one dialogue act.

This means that the dialogue act taxonomy should allow annotation at different levels. The annotation scheme should not force a choice between these functions. We have seen at least two good examples of how this issue can be handled, namely DAMSL (See Section 3.3.8 and Section 3.3.9) and Verbmobil (See Section 3.3.5, Dialogue acts). They both allow annotation at different levels or for different functions.
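A minimal sketch of what such a multi-level annotation record could look like is given below. It is loosely inspired by DAMSL's separation of forward- and backward-looking functions, but the field names and the example labels are our own assumptions for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class UtteranceAnnotation:
        text: str
        # Independent layers instead of one forced choice between functions.
        forward_functions: list = field(default_factory=list)   # e.g. statement, info-request
        backward_functions: list = field(default_factory=list)  # e.g. answer, accept, reject

    # One utterance annotated on both levels at the same time: it answers a
    # question, rejects a proposal and asserts new information.
    utt = UtteranceAnnotation(
        text="No, the engine is already at Elmira.",
        forward_functions=["statement"],
        backward_functions=["answer", "reject"],
    )
    print(utt.forward_functions, utt.backward_functions)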

3.7.3 Sub-dialogues

A dialogue act taxonomy needs to account for sub-dialogues/insertions that might occur.

The idea of Dialogue Games and Transactions in the HCRC Map Task (See Section 3.3.3) is meant to do exactly that. Two more examples of a similar idea are Dialogue Phases in Verbmobil (See Section 3.3.5) and Argumentation Acts in TRAINS (See Section 3.3.7).
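The nesting itself is easy to picture as a data structure: a game (or transaction) contains moves and possibly embedded games. The sketch below uses move labels in the spirit of the Map Task scheme, but the class names and the concrete dialogue are invented for illustration.

    from dataclasses import dataclass, field
    from typing import List, Union

    @dataclass
    class Move:
        speaker: str
        act: str
        text: str

    @dataclass
    class Game:
        purpose: str
        elements: List[Union[Move, "Game"]] = field(default_factory=list)

    dialogue = Game(purpose="instruct", elements=[
        Move("G", "instruct", "Go round the swamp on the left-hand side."),
        Game(purpose="clarify", elements=[            # inserted sub-dialogue
            Move("F", "query-yn", "To the left of the swamp?"),
            Move("G", "reply-y", "Yes, to the left."),
        ]),
        Move("F", "acknowledge", "Okay."),
    ])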

For planning to take place, adjacency pairs are commonly used as a means of judging which dialogue act is appropriate to produce next. A good example of modeling adjacency pairs is again Dialogue Games (See Section 3.3.3). In Section 3.2 we briefly addressed the issue of how far ahead a system should plan, which is also a relevant decision to be made.
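As a sketch of how adjacency pairs could support this kind of local planning, the mapping below pairs a first pair part with the acts that would count as an appropriate second pair part; the pairs listed and the function name expected_responses are illustrative assumptions, not taken from any particular scheme.

    # First pair part -> acceptable second pair parts.
    ADJACENCY_PAIRS = {
        "greeting": ["greeting"],
        "question": ["answer"],
        "offer": ["accept", "reject"],
        "request": ["accept", "reject"],
    }

    def expected_responses(last_act):
        """Dialogue acts that would complete the pair opened by last_act."""
        return ADJACENCY_PAIRS.get(last_act, [])

    print(expected_responses("offer"))    # ['accept', 'reject']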

3.7.4 An example annotation

None yet.



Dimitra Tsovaltzi, Stephan Walter and Aljoscha Burchardt
Version 1.2.5 (20030212)