First, like I said, the part that crashed it I commented out inside the script. Couldn't have put it up for other folks to try otherwise. The rest of it works pretty nice, I think. But if you un-comment the layer scaling section Moho goes belly up right off. I've been trying to give the beta the test-to-destruction workout, but I didn't mean to go that far.
The sample I've been using for testing is your very own Pres. Clinton bit from the tutorials. Okay, they probably had him on a pretty nice mic. But even someone who is recording their own dialog on a crappy headset mic in front of their computer is probably going to reject a take that is so rife with annoying P-pops and sibilance that it's unlistenable. Let alone unprofessional. But in this case, it's also really not that much of a problem. These sounds are, on a small time basis, at least related to the sounds you are trying to catch.
I picked the 2 frequencies I did because they are, for the most part, the fundamental frequencies of bass drum and snare hits. They are both fairly short time frame events, but also well over the time limit of many annoying transients. I whipped up the LayerSound script as a rough analog to some of the audio-based After Effects plugs, which seem to be used more for tracking music.
Most vocal frequencies tend to center around 2.5 kHz. P-pops are down around that 80 cycle tone, and sibilance is most prominent at pretty high registers, say 8k and up. Filtering for the midrange shouldn't be that difficult. Like I said, I'll try to track down a handy-dandy algorithm or 2.
Most audio compression, variations of which you use to control these things, is basically just averaging, but specific and context based. On a vocal, do I see a spike below 100 Hz that lasts for less than .05 seconds? Yup, that's probably a pop, filter it out. Anything on a vocal above 6K that's not Maria Callas? Crap, make it go away. Those are 2 audio benchmarks you can hardly go wrong with. Sibilance is people, but P-pops are a mic artifact with a signature that's hard to miss.
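The pop rule above (a spike below 100 Hz lasting under .05 seconds) can be roughed out in a few lines. This is only a sketch under assumptions, not anything from Moho or the LayerSound script: the function names, the one-pole low-pass, the 0.2 threshold and the 10 ms analysis window are all made up for illustration, and samples are assumed to be floats in the -1.0 to 1.0 range.

```python
import math

def lowpass(samples, sample_rate, cutoff_hz):
    """Crude one-pole low-pass (an assumption; any real filter would do better)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for s in samples:
        y += alpha * (s - y)          # ease toward the input; highs get smoothed away
        out.append(y)
    return out

def find_pops(samples, sample_rate, cutoff_hz=100.0, threshold=0.2,
              max_seconds=0.05, window_seconds=0.01):
    """Return (start_seconds, duration_seconds) for each short low-frequency burst."""
    low = lowpass(samples, sample_rate, cutoff_hz)
    win = max(1, round(sample_rate * window_seconds))
    # One flag per short window: is the low band loud anywhere inside it?
    loud = [max(abs(v) for v in low[i:i + win]) > threshold
            for i in range(0, len(low), win)]
    pops, run_start = [], None
    for idx, flag in enumerate(loud + [False]):   # trailing False closes any open run
        if flag and run_start is None:
            run_start = idx
        elif not flag and run_start is not None:
            duration = (idx - run_start) * window_seconds
            if duration < max_seconds:            # short burst: call it a pop
                pops.append((run_start * window_seconds, duration))
            run_start = None
    return pops
```

A long low note (a held bass tone, say) stays loud for many windows in a row and is left alone; only the short bursts get reported, and those are the spans you would filter out or skip when reading amplitude.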
Like I said in the last post, I don't have any handy problem files. Still think that finding peaks while discarding transients would be a good, "best fit" solution. I'm not really a coder, but I could certainly work up some logically proper pseudocode to do the job if you thought it was worthwhile to send me a look at the relevant source. I know this is your baby (and a big, fat smilin' brat it is), but this is the one area I know I could actually be useful in.
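As a rough idea of what "finding peaks while discarding transients" could look like, here is a minimal sketch. It is one possible approach and not the method Moho uses: each animation frame is cut into short windows, each window gets an RMS level, and the frame's value is the loudest window. The name `robust_frame_peaks` and the 5 ms window are assumptions picked for illustration.

```python
import math

def robust_frame_peaks(samples, sample_rate, fps, window_seconds=0.005):
    """Per-frame level: the loudest short-window RMS found inside each frame."""
    per_frame = sample_rate // fps                      # audio samples per animation frame
    win = max(1, round(sample_rate * window_seconds))   # samples per short window
    peaks = []
    for start in range(0, len(samples), per_frame):
        frame = samples[start:start + per_frame]
        best = 0.0
        for w in range(0, len(frame), win):
            chunk = frame[w:w + win]
            rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
            best = max(best, rms)   # keep the loudest window, not the loudest single sample
        peaks.append(best)
    return peaks
```

A one-sample click barely moves the RMS of its window, so it can't hijack the frame, while a sound that holds for a few milliseconds comes through at close to its real level. A longer window throws away more transients at the cost of smearing fast attacks.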
Forgot about this...
but who's going to be pumping a pure 80 Hz sine wave into Moho?
Your average garden variety Jeep Beat is constructed from a low end percussive sample overlaid with a 40 - 80 Hz sine wave, lasting very vaguely on the order of a tenth of a second (from listening out my window). That's what makes the car go boom. I only patched in a pure tone to confirm what I thought, which was that the way Moho is determining the amplitude of a file had little relation to the actual volume. I got an average "amp" of
.03 with the Clinton sample (max, min of around .02)
.64 with the 80 Hz sine (last 2 are consistent within reason per frame)
.40 with the 4 kHz sine
.03? A maximum of 3% of possible volume on an audio file, properly normalized, that you can hear clear as a bell? Sorry, man, but that is clearly not a proper result. I just ran off a test in the script, with the Clinton sample, and I get per-frame amplitudes running from .011 to 1.0, with entirely proper tracking.

So that's the "I" in the "I did not have" from that sample. The maximum amplitude spikes appear to take up a rather small amount of room in the waveform. There is much more in the way of lower amplitude, "intermediate" waveforms.

This is a closeup, of about 8 milliseconds, within the "I". Each horizontal line represents one audio sample. You can see that averaging by audio sample will directly skew the used amplitude down. You can also see from the previous image how, depending on where the frame divisions fall, you can end up with varying results, sometimes wildly. In any case, a raw averaging is never going to give you a result that means much.
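To put numbers on the averaging problem, here is a small sketch that computes three per-frame measures side by side: the plain average of absolute sample values, the RMS, and the peak. How Moho actually computes its amplitude is not known here, so the plain average is only a stand-in for "raw averaging", and `frame_levels` and `full_scale_sine` are hypothetical helpers.

```python
import math

def frame_levels(samples, sample_rate, fps):
    """Return one (mean_abs, rms, peak) tuple per animation frame."""
    per_frame = sample_rate // fps          # audio samples per animation frame
    levels = []
    for start in range(0, len(samples), per_frame):
        chunk = samples[start:start + per_frame]
        mean_abs = sum(abs(s) for s in chunk) / len(chunk)   # "raw averaging"
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        peak = max(abs(s) for s in chunk)
        levels.append((mean_abs, rms, peak))
    return levels

def full_scale_sine(freq_hz, sample_rate, seconds):
    """Hypothetical test tone with a peak amplitude of 1.0."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

if __name__ == "__main__":
    tone = full_scale_sine(80, 24000, 1.0)
    for mean_abs, rms, peak in frame_levels(tone, 24000, 24)[:3]:
        print(round(mean_abs, 3), round(rms, 3), round(peak, 3))
```

Averaged over whole cycles, the absolute value of a full-scale sine comes to 2/pi, about 0.637, even though the tone peaks at 1.0. That sits right next to the .64 reading for the 80 Hz tone, assuming that tone was at full scale (not confirmed here), which is at least suggestive of a per-sample average. For speech, where the loud spikes take up a small slice of each frame, the same average falls far below the peak.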
Summation -- Pops, hisses and crackles rarely come to be the loudest events in an audio file. I think that worry belongs in the same category as people who won't wear a seatbelt because 2% of people in a crash are hurt worse instead of less if they do.
--Brian
Added-- A loud sound that lasts for 10 frames. Hmmm. An explosion? A 1950's horror flick chick screaming? Perhaps extreme examples, but if the effect runs into the next frame in these cases, is anyone going to notice? I still go with greatest benefit, greatest part of the time. Most audio related apps pride themselves on transient detection in areas like this, bad audio be damned.