Jan 17
======

Viswanathan/etal:2010
---------------------

How might the specific stimulus design (e.g., resynthesized CV
syllables) influence the observed effects, and could the results
differ with more naturalistic speech stimuli? Resynthesized stimuli
may not accurately capture the full range of acoustic properties
present in natural speech, which could lead to differences
in how listeners process and interpret the stimuli.

If I understand correctly, the Direct Realist (gestural) theory is
applicable when the acoustic signal is speech, whereas evidence for
the General Auditory theory can be found in both speech and non-speech
signals. If so, is it possible that both mechanisms work together
during speech perception, but only one of them is observed depending
on whether the auditory signal is speech or non-speech? For example,
with a non-speech auditory signal, perhaps there is simply no
perception of phonetic gestures?

"Fowler and Dekle (1991) tested the generality of this explanation by
using a combination of haptic and auditory information"
-> I'm curious as to what sorts of consonants/vowels can actually be
perceived tactilely. I find it hard to imagine distinguishing any
sounds other than labials/labiodentals this way. I did look up this
study and found it really interesting that they concluded that
subjects who gained more from adding tactile information to auditory
information tended to gain less from adding it to visual information,
and vice versa. I would have expected individuals to be generally good
or bad at incorporating tactile information across conditions.

Figure 1 -> Should the labels for the [ga] points be swapped?

I know this is the whole point of the paper, but it's really odd to
think that we are reliably able to determine place of articulation
from the acoustic signal yet we don't have a consistent answer as to
how we do so.

To produce different syllables along the [da]-[ga] continuum, the
authors first recorded a native speaker and then synthesized different
syllables based on the characteristics of his voice. Can a human
speaker intentionally produce a sound at a given point on a continuum
between two known sounds with reasonable accuracy? For example, is it
possible to train someone to say a sound that is ~40% [da] and ~60%
[ga]?