Hm, I see.
I just thought the sound files could be processed so that they contain a kind of "marker" afterwards, one that a program can detect, so that it can start an animation at the point where the marker sits. That way the animation would be synced too, because it would start at an exactly defined point of the spoken text.
But I fear this might be too complex to implement.
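
To make the idea a bit more concrete, here is a rough sketch of what I have in mind. It assumes the markers are kept as a plain list of times next to the sound file rather than inside it, and that the player can report its current playback position. The names `MARKERS` and `markers_between` are made up for illustration, and the times are invented:

```python
from bisect import bisect_right

# Hypothetical marker list: (time in seconds, animation name), sorted by time.
MARKERS = [(0.5, "wave"), (2.0, "nod"), (3.75, "smile")]


def markers_between(markers, prev_pos, cur_pos):
    """Return the names of all markers with prev_pos < time <= cur_pos."""
    times = [t for t, _ in markers]
    lo = bisect_right(times, prev_pos)  # first marker strictly after prev_pos
    hi = bisect_right(times, cur_pos)   # first marker strictly after cur_pos
    return [name for _, name in markers[lo:hi]]
```

The program would call `markers_between` once per frame with the previous and the current playback position and start whatever animations it returns. A marker at exactly 0 seconds would need the first call to use a negative previous position.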

(Oh, my, another idea gone nuts ...)