Auditory-visual (AV) processing binds information from two different sensory modalities, and in day-to-day life it usually goes unnoticed. Binding information across modalities helps the auditory perceptual system reduce noise and enhance the salience of the target stimuli. This noise reduction in turn makes it easier to segregate successive events and to separate the target from background noise. For example, listeners perceive speech better at lower signal-to-noise ratios when visual information supplements the auditory signal. Likewise, when the visual signal is present, listeners detect the auditory signal at much lower intensities. The effect of the visual signal on auditory perception is evidenced by a classic illusion called the McGurk effect [1]. McGurk and MacDonald [1] found that when a face articulating /ga/ is dubbed with a voice saying /ba/, listeners perceive the consonant /da/. The acoustic speech signal was heard as another consonant when dubbed with incongruent visual speech, even though it was recognized well in isolation. Since the percept differs from both the acoustic and the visual components, it is called the fusion effect [1]. Later studies showed that incongruent auditory and visual stimuli can produce percepts other than fusion responses [2, 3]. They can lead to the percept of another speech sound with a similar place of articulation [3] or simply to the percept of the visual component alone [2]. Thus, the McGurk effect has more recently been considered a categorical change in auditory perception induced by incongruent visual speech, resulting in a single percept of hearing something other than what the auditory stimulus conveys [4].
While listening to incongruent stimuli, listeners may give more weight to auditory or visual information depending on the relative salience of the two. It is thought that when auditory information is more reliable than visual information, an auditory-oriented percept is elicited. Similarly, when visual information is more reliable than auditory information, a visually oriented percept is evoked. When both modalities are equally informative, a fusion or a combination percept is elicited. The strength of AV integration can be gauged by how far one modality enhances the other, how far one modality biases the other, or by the creation of strong illusory effects.
Even though the McGurk illusion has been considered a robust effect, the literature shows wide variability across individuals [5-7]. These studies report McGurk effects ranging from zero to one hundred percent. However, many of them pooled responses across all AV consonant combinations irrespective of consonant class or category, i.e., voiced and unvoiced stops, fricatives, or nasals. There is some evidence that the McGurk effect differs across these consonant categories. Colin, et al. [8] compared the fusion and combination responses between voiced and unvoiced consonants and showed that combination responses were significantly more frequent for unvoiced than for voiced consonants, with no difference between the two consonant classes in fusion responses. However, MacDonald and McGurk [9] reported more fusion percepts for voiced than for unvoiced incongruent AV stimuli, with no difference in combination responses between the two. In contrast, Sekiyama and Tokura [10] reported more fusion percepts for unvoiced than for voiced consonants. Comparing the McGurk percept among consonant classes was not, however, the aim of these studies. Moreover, all of them were carried out on Western or Japanese listeners. A considerable amount of evidence suggests that the linguistic structure of a language modulates how frequently McGurk percepts occur [11, 12]. The occurrence of the McGurk illusion has been studied less in Indian languages. Kannada is a Dravidian language spoken in a southern state of India, and its phonetic characteristics differ from those of the languages in which McGurk percepts have been studied so far. Specifically, Kannada is one of the Indian languages that have retroflex stop consonants (/ṭ/ and /ḍ/) and a retroflex nasal consonant (/ṇ/), which are not present in many of the other languages of the world [13]. These inherent differences in phonetic structure might influence the perception of AV syllables differently from other languages in which McGurk perception has been evaluated. Evaluating AV perception across different consonant combinations will allow us to infer the effect of the inherent properties of the acoustic and visual signals on the amount of fusion or McGurk responses. Thus, the primary aim of the study was to compare the amount of McGurk responses across three consonant combinations: 1) /ba/ and /ga/ (voiced stops), 2) /pa/ and /ka/ (unvoiced stops), and 3) /ma/ and /ṇa/ (nasals) in participants who speak Kannada. We hypothesize that the amount of McGurk effect would vary across these consonant combinations because the syllables differ in acoustic properties and visibility, and hence in the weight given to each modality.
Due to the large variability in McGurk responses, some recent studies have used different criteria to decide whether the McGurk effect is present in a participant. For example, Benoit, et al. [5] considered participants with more than 50% McGurk responses, whereas Roa Romero, et al. [14] included participants with more than 15% McGurk responses to understand the neural dynamics of the McGurk effect. Similarly, Venezia, et al. [15] and van Wassenhove, et al. [3] considered more than 25% and 40% McGurk responses, respectively, to evaluate temporal dependencies of the McGurk effect. However, there is no consensus among studies on this criterion, which is selected arbitrarily and varies from study to study. Therefore, we chose to classify the participants statistically using cluster analysis, which is more data-driven and also takes the auditory-oriented responses into account. Although not many studies have used cluster analysis in AV perception, a few speech and language perception studies have used it for classification [16].
In addition, this study assessed whether there is any relationship between unimodal identification of consonants (auditory alone and visual alone) and the fusion responses in Kannada. Earlier studies have demonstrated that the identification accuracy of the unisensory components is reflected in audiovisual speech perception [10, 17, 18]. Therefore, it is important to consider unisensory perception as well. Studies have shown that the perception of fusion responses depends largely on the clarity of the visual component, even when the McGurk stimuli are of the fusion type [19, 20]. Models of AV integration argue that extracting information from the unimodal signals is important and modulates the perception of fusion responses [21, 22]. Although the visual properties of the bilabials are similar across all three consonant combinations, the visual properties of the velars (/ga/ and /ka/) and the retroflex (/ṇa/) differ. In terms of acoustic properties, each syllable has unique characteristics. The retroflex nasal syllable (/ṇa/) is distinctive of Kannada and is absent from most other Indian languages too. Hence, we hypothesize that the inherent acoustic and visual characteristics of each syllable should affect the identification of the unimodal stimuli differentially, and that these effects should in turn be reflected in the amount of McGurk effect perceived. Thus, we evaluated how the unimodal identification of these consonants modulates the amount of McGurk responses for these consonant combinations.
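To make the contrast between fixed inclusion criteria and data-driven grouping concrete, the following is a minimal Python sketch, not the analysis used in this study. It applies the cut-offs reported in the literature cited above (more than 15%, 25%, 40% and 50% McGurk responses) and a plain two-cluster k-means to the same participants. The response percentages, the participant labels, the function names `threshold_groups` and `kmeans_two_groups`, and the choice of k-means with two clusters are all assumptions made for illustration; the specific clustering method is not stated in this section.

```python
from math import dist

# Hypothetical per-participant percentages (NOT data from the study):
# (McGurk responses, auditory-oriented responses)
participants = {
    "P01": (5.0, 90.0),
    "P02": (12.0, 80.0),
    "P03": (20.0, 75.0),
    "P04": (45.0, 50.0),
    "P05": (70.0, 25.0),
    "P06": (85.0, 10.0),
    "P07": (95.0, 5.0),
}


def threshold_groups(data, cutoff):
    """Fixed criterion: True if the McGurk percentage exceeds the cut-off."""
    return {pid: mcgurk > cutoff for pid, (mcgurk, _) in data.items()}


def kmeans_two_groups(data, iterations=50):
    """Plain two-cluster k-means (an assumed method, for illustration only).

    Returns True for participants in the cluster with the higher mean
    McGurk percentage.
    """
    points = list(data.values())
    # Deterministic start: lowest and highest McGurk percentages.
    centers = [min(points), max(points)]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            nearest = 0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members.
        new_centers = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    high = 0 if centers[0][0] > centers[1][0] else 1
    return {
        pid: (0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1) == high
        for pid, p in data.items()
    }


for cutoff in (15, 25, 40, 50):
    n = sum(threshold_groups(participants, cutoff).values())
    print(f"cut-off {cutoff}%: {n} participants classified as perceivers")

n_cluster = sum(kmeans_two_groups(participants).values())
print(f"two-cluster k-means: {n_cluster} participants in the high-McGurk group")
```

With these hypothetical values, the number of participants counted as McGurk perceivers changes with the chosen cut-off, whereas the clustering derives the grouping from the joint distribution of McGurk and auditory-oriented responses.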