How about a layered tutorial? A layered tutorial begins with a basic set of instructions, like the ones in your example image. Then, once a condition has been met, a more detailed layer is activated. The trigger could be a time delay (for instance, starting spoken hints), a button press, or an incorrect action.
Help button example:
[Attachment: LayeredTutorial-2Layer.jpg]
Pressing the help button could, for example, play a short video of the action being carried out, with spoken narration explaining each step of the task.
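In case it helps to see the idea as logic, here is a rough sketch in Python of how the layers might be escalated. Everything in it is made up for illustration: the class name LayeredTutorial, the Trigger values, the idle_limit parameter and the layer names are assumptions, not part of any particular engine or of your project.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class Trigger(Enum):
    """Hypothetical conditions that activate the next tutorial layer."""
    TIMEOUT = auto()       # the player has been idle for too long
    HELP_BUTTON = auto()   # the player pressed the help button
    WRONG_ACTION = auto()  # the player performed an incorrect action


@dataclass
class LayeredTutorial:
    layers: List[str]          # ordered from most basic to most detailed
    idle_limit: float = 10.0   # assumed seconds of inactivity before escalating
    level: int = 0
    idle_time: float = 0.0
    last_trigger: Optional[Trigger] = None

    def current(self) -> str:
        """Return the layer that should be shown right now."""
        return self.layers[self.level]

    def escalate(self, trigger: Trigger) -> str:
        """Move to the next, more detailed layer, stopping at the last one."""
        if self.level < len(self.layers) - 1:
            self.level += 1
        self.last_trigger = trigger
        self.idle_time = 0.0
        return self.current()

    def tick(self, dt: float) -> str:
        """Advance the idle timer by dt seconds; escalate when the limit is reached."""
        self.idle_time += dt
        if self.idle_time >= self.idle_limit:
            return self.escalate(Trigger.TIMEOUT)
        return self.current()


# Usage: basic instructions first, then spoken hints, then the narrated video.
tutorial = LayeredTutorial(
    layers=["basic instructions", "spoken hints", "narrated video"],
    idle_limit=5.0,
)
tutorial.tick(5.0)                      # idle long enough, so hints start
tutorial.escalate(Trigger.HELP_BUTTON)  # help pressed, so the video plays
```

All three triggers go through the same escalate call, so adding another condition later would only mean adding one more Trigger value and calling escalate from wherever that condition is detected.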