I want to thank Dr. Frank Lowney from the Digital Innovation Group at Georgia College & State University for this informative guest post.
If you’re interested in captioning your videos, you’ll find this interesting. In this useful, more advanced workflow, Dr. Lowney describes how to use the Enhanced Dictation feature in OS X 10.9 (Mavericks), combined with Audio Hijack and Soundflower, to turn recorded audio into a text file. This can be extremely handy for anyone who needs to create captions for a video but lacks a transcript. Without further ado….
By Dr. Frank Lowney
The pressure is on to make screencasts and other online video more accessible. One important aspect of that challenge is to make video more accessible to persons who are deaf or have difficulty hearing. For video content creators, this means providing a transcript or, better, providing subtitles for the video so that dialogue may be viewed in the same context as the video.
The problem is that many videos are created without a script that is followed closely by the speakers in that video. Indeed, many important videos are created in ad hoc fashion (interviews, panel discussions, conference presentations and the like) where scripts would be totally inappropriate.
Creating text from speech has become essential to meeting these expectations, especially where all one has to work with is the speech in the audio track of a video. Speech to text (STT) is a bit more difficult than text to speech (TTS), which has been in use much longer.
OS X recently introduced Dictation, a speech-to-text feature usable in any application that accepts text input. This is quite an advance over having to purchase a two-hundred-dollar application to accomplish the same end. However, the first iteration of this system required an internet connection so that speech could be uploaded to Apple’s servers, where it would be turned into text. This created delays and made it difficult to dictate substantial bodies of text. Dictation was given a significant boost in OS X 10.9 (Mavericks) with the introduction of Enhanced Dictation, which enables offline use and continuous dictation with live feedback.
Still, this is a system that assumes a live speaker. There is no obvious, easy way to route speech from a recorded file through Apple’s Dictation system to produce usable text.
That’s what this post is all about.
You can, in fact, route the speech in an audio file through Apple’s speech-to-text subsystem and render very usable text output. It isn’t intuitive or Apple-easy but it is something that anyone can accomplish with a bit of determination. Here’s how:
The application at the center of this process is Audio Hijack Pro by Rogue Amoeba ($32 USD). There are two things to set up with this app. The first is to identify the source of the audio. It could be any app that emits audio, but I used QuickTime Player X. Thus, I set that app as the audio source as follows:
This will capture the audio from anything that this app plays. My sample audio is from NPR and contains a dramatic reading by the noted actor Sam Waterston, and it looks like this in QuickTime Player X:
This configuration will grab all the audio from QuickTime Player X as it plays the “NPR Gettysburg Address” audio file. Next, we use Audio Hijack Pro to send that audio to Soundflower (free). To do that, we go to the Effects tab and choose Auxiliary Device Output from the 4FX menu.
The Auxiliary Device Output plug-in enables us to choose the previously installed Soundflower as the recipient of the hijacked audio as follows:
Once installed, Soundflower becomes an input/output option in your Sound preference pane and everywhere else audio sources and destinations can be specified. In other words, it becomes an integral part of the OS X sound system.
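Because the rest of the chain depends on Soundflower being registered as an audio device, it is worth confirming that the system actually sees it. On a Mac, `system_profiler SPAudioDataType` prints every audio device the system knows about. Here is a minimal sketch of a check; the `check_device` helper is hypothetical, added here only for illustration:

```shell
# Hypothetical helper: reads an audio-device listing on stdin and
# reports whether the named device appears in it.
check_device() {
  if grep -qi "$1"; then
    echo "found"
  else
    echo "missing"
  fi
}

# On a Mac, you would feed it the real device listing:
#   system_profiler SPAudioDataType | check_device Soundflower
```

If the check reports “missing,” reinstall Soundflower and restart before going further, since none of the routing below will work without it.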
Finally, we set the Dictation input to be Soundflower as follows:
At this point, any audio played by QuickTime Player X will be routed to Soundflower and will thus become available to any application that accepts text input and has a Start Dictation menu item. In Pages, that looks like this:
The following screencast illustrates this process from start to finish:
Do you have your own solution for this that you’ve been using? Please comment below and share what you’ve learned.