[From the sandbox] Microsoft Speech and streaming audio
29.09.2015 18:05

I was curious how well Microsoft Speech can recognize speech. As the audio source I decided to use a stream of police radio traffic from youarelistening.to.
There are two namespaces, System.Speech and Microsoft.Speech. As far as I understand, using Microsoft.Speech requires installing the Microsoft Speech Platform Runtime and the Microsoft Speech Platform SDK, while System.Speech already ships with recent versions of the .NET Framework.

We will use System.Speech, because it supports dictation and Microsoft.Speech does not.
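Dictation depends on which recognizers are installed on the machine, so it can be useful to list them before choosing a culture. The following is a minimal sketch, assuming a Windows machine and a project reference to System.Speech; the class name RecognizerList is made up for illustration.

```csharp
using System;
using System.Speech.Recognition;

static class RecognizerList
{
    static void Main()
    {
        // Each RecognizerInfo describes one installed recognition engine.
        foreach (RecognizerInfo info in SpeechRecognitionEngine.InstalledRecognizers())
        {
            Console.WriteLine("{0} ({1})", info.Name, info.Culture.Name);
        }
    }
}
```

The output depends entirely on the language packs installed, so no particular list should be expected.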

We will also need NAudio, a library for working with audio. It includes the Mp3StreamingDemo sample, which can handle streaming audio, and that is exactly what we need. Create a new project, copy the StreamMp3 method from MP3StreamingPanel together with everything it depends on, and add a reference to NAudio.

In our class we create a StartStreaming method that runs StreamMp3 on a thread pool thread:

        public void StartStreaming()
        {
            playbackState = StreamingPlaybackState.Buffering;
            bufferedWaveProvider = null;
            ThreadPool.QueueUserWorkItem(StreamMp3, "http://relay.broadcastify.com:80/949398448");            
        }

The constructor of our class creates and configures the SpeechRecognitionEngine. We use dictation as the grammar:

        // Read and written from different threads, hence volatile.
        private volatile bool completed = true;
        readonly SpeechRecognitionEngine sre = new SpeechRecognitionEngine();

        public Recognition()
        {
            var grammarBuilder = new GrammarBuilder();
            grammarBuilder.Culture = new CultureInfo("en-GB");
            grammarBuilder.AppendDictation();

            var grammar = new Grammar(grammarBuilder);
            
            grammar.Enabled = true;
            sre.LoadGrammar(grammar);
            sre.BabbleTimeout = TimeSpan.FromHours(1);
            sre.EndSilenceTimeout = TimeSpan.FromSeconds(10);
            sre.InitialSilenceTimeout = TimeSpan.FromSeconds(10);
            sre.SpeechRecognized += sre_SpeechRecognized;
            sre.RecognizeCompleted += sre_RecognizeCompleted;
        }

We copy the data from the buffer into a MemoryStream and pass it to SetInputToAudioStream. The audio format parameters have to be specified correctly here. I could not get SetInputToWaveStream to work.

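        public void Recognize()
        {
            // Copy the buffered PCM data into a byte array.
            var size = bufferedWaveProvider.BufferLength;
            byte[] bytes = new byte[size];
            int read = bufferedWaveProvider.Read(bytes, 0, size);
            using (var ms = new MemoryStream(bytes, 0, read))
            {
                // The format given here must match the decoded stream
                // (sample rate, bit depth, channel count).
                sre.SetInputToAudioStream(ms, new SpeechAudioFormatInfo(
                    bufferedWaveProvider.WaveFormat.SampleRate, AudioBitsPerSample.Sixteen, AudioChannel.Mono));

                // RecognizeAsync returns immediately, so wait for RecognizeCompleted
                // before the using block disposes the stream.
                sre.RecognizeAsync(RecognizeMode.Multiple);
                while (!completed)
                {
                    Thread.Sleep(333);
                }
            }
        }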

        void sre_RecognizeCompleted(object sender, RecognizeCompletedEventArgs e)
        {
            Debug.WriteLine("Finished");
            completed = true;
        }

        private static void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            Console.WriteLine(e.Result.Text);
        }
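The listing above hard-codes 16-bit mono, which only describes the stream correctly if the decoded audio really is mono. As a sketch of how the format could instead be derived from the NAudio WaveFormat, here is a small helper. The names SpeechFormat and FromWaveFormat are hypothetical, and the sketch assumes references to both NAudio and System.Speech and uncompressed PCM input.

```csharp
using System;
using NAudio.Wave;
using System.Speech.AudioFormat;

static class SpeechFormat
{
    // Hypothetical helper: builds a SpeechAudioFormatInfo that mirrors
    // the sample rate, bit depth and channel count of a PCM WaveFormat.
    public static SpeechAudioFormatInfo FromWaveFormat(WaveFormat format)
    {
        if (format.Encoding != WaveFormatEncoding.Pcm)
            throw new NotSupportedException("Only PCM input is handled in this sketch.");

        AudioBitsPerSample bits;
        switch (format.BitsPerSample)
        {
            case 8: bits = AudioBitsPerSample.Eight; break;
            case 16: bits = AudioBitsPerSample.Sixteen; break;
            default: throw new NotSupportedException("Unsupported bit depth.");
        }

        AudioChannel channel;
        switch (format.Channels)
        {
            case 1: channel = AudioChannel.Mono; break;
            case 2: channel = AudioChannel.Stereo; break;
            default: throw new NotSupportedException("Unsupported channel count.");
        }

        return new SpeechAudioFormatInfo(format.SampleRate, bits, channel);
    }
}
```

With such a helper the call could look like sre.SetInputToAudioStream(ms, SpeechFormat.FromWaveFormat(bufferedWaveProvider.WaveFormat)). Whether this changes the recognition quality for this particular stream is untested.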

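I took the completed flag and the Thread.Sleep loop from the Speech API documentation. Without this loop recognition does not happen, most likely because RecognizeAsync returns immediately and the using block would then dispose the MemoryStream before the engine has read it.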
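The polling loop could also be replaced by a blocking wait that is released from the completion handler. The sketch below shows the idea without any speech types, so it runs anywhere. The helper RunAndWait and its callback parameter are hypothetical, and the thread pool work item merely stands in for the recognizer.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class CompletionWait
{
    // Hypothetical helper: starts an asynchronous operation and blocks until it
    // reports completion or the timeout expires. Returns true on completion.
    public static bool RunAndWait(Action<Action> start, TimeSpan timeout)
    {
        var tcs = new TaskCompletionSource<bool>();
        // The callback plays the role of the RecognizeCompleted handler.
        start(() => tcs.TrySetResult(true));
        return tcs.Task.Wait(timeout);
    }

    static void Main()
    {
        bool finished = RunAndWait(
            onCompleted => ThreadPool.QueueUserWorkItem(_ =>
            {
                Thread.Sleep(100); // stand-in for the recognizer's work
                onCompleted();
            }),
            TimeSpan.FromSeconds(5));

        Console.WriteLine(finished ? "Finished" : "Timed out");
    }
}
```

Applied to the class in this article, the handler for RecognizeCompleted would call TrySetResult and Recognize would wait on the task inside the using block. That variant is an assumption and was not part of the original experiment.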

All that remains is to modify the borrowed StreamMp3 method. As soon as the buffer is nearly full, we read the data out of it:

                        if (IsBufferNearlyFull)
                        {
                            Debug.WriteLine("Buffer getting full, taking a break");
                            if (completed)
                            {
                                completed = false;
                                Recognize();
                            }
                            Thread.Sleep(200);
                        }
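IsBufferNearlyFull comes from the borrowed NAudio demo code and is not shown in this article. As a rough illustration of what such a check does, the sketch below treats the buffer as nearly full when less than a quarter of a second of free space remains. That threshold, the class name BufferCheck and the plain integer parameters are assumptions made for the example, not a copy of the demo's source.

```csharp
using System;

static class BufferCheck
{
    // Assumed rule: "nearly full" means less than 1/4 second of free space.
    public static bool IsNearlyFull(int bufferLength, int bufferedBytes, int averageBytesPerSecond)
    {
        int free = bufferLength - bufferedBytes;
        return free < averageBytesPerSecond / 4;
    }

    static void Main()
    {
        // 16-bit mono PCM at 22050 Hz is 44100 bytes per second; use a 20-second buffer.
        int bytesPerSecond = 22050 * 2;
        int bufferLength = bytesPerSecond * 20;

        Console.WriteLine(IsNearlyFull(bufferLength, bufferLength - 5000, bytesPerSecond)); // True
        Console.WriteLine(IsNearlyFull(bufferLength, bufferLength / 2, bytesPerSecond));    // False
    }
}
```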

And now we can run it:

        private static void Main(string[] args)
        {
            var recognition = new Recognition();
            recognition.StartStreaming();
            Console.ReadKey();
        }

Of course, the output is complete gibberish:

Recognition results

the Canadian an engineer by the IRA at the corner Michio politically inclined to
it and
regarded
it can I've had it in feeling
her have her had had
had
her her
her
in any category goalkeeper were: he cheekily nobody will include adequate real eye cannot
any attempt at her home in a holiday party had other E. are currently ensure that lead about one it may take it nine
but lineup plenty alignment in a the manager Graeme
there get them into productive
all the legal
definitely likely telling it like legally were quickly added
a that when flying ability in immediately daylight
building unlucky in Allied
initiative commissioner cutting minister Jan fifteen along the who had failed
the effect it has to lead England manager Clive be clinging
the Italian Italian open
the relational for transplant
partner new-line
there that they are likely to alive plans but new-line
Eddie then entitling and it didn't go
in bed aware that their campaign locally in between thirteen and children ultimate
a enabling info about an agenda implied sugary inundated with an
it entailed million any luckily a
English allowed her
lineker nine editor the twentieth brutality in any that nine treated like at
there are
all she unclear whether he'll but nine point overhauled understanding complain about it because frankly and that
it is essential either a touchline do you play lanarkshire acute illness cover the
the but for their life under scrutiny old lucky enough virtually getting off lightly down to that internal it changed
near the of light relief Latino fondly

Conclusions:

It recognizes the police channel very poorly.
It does work quite fast, though: the buffer does not have time to fill up.
System.Speech has no support for Russian. Microsoft.Speech does, but it has no dictation support.