Musical Noise Removal of Enhanced Speech
Using Gray Level Values of Spectrogram Plots
K. Anitha Sheela¹, K. Satya Prasad², and M. Madhavi Latha³, Non-members
Enhanced speech produced by the Power Spectral Subtraction method is computationally simple to obtain, but suffers from an annoying residual noise, also called musical noise, owing to its narrowband spectrum and tone-like characteristics. To eliminate this, a modified spectral subtraction technique was proposed in our previous work, based on the cross terms that are set to zero in the conventional spectral subtraction method. That technique effectively suppresses musical noise, but at the cost of speech intelligibility. To overcome this problem, this paper proposes a post-processing technique based on the gray level values of pixels in the spectrogram of the enhanced speech. This method suppresses musical noise without noticeable effect on speech intelligibility. Both subjective and objective performance assessments confirm that our method is effective.
Keywords: Spectral Subtraction, Musical Noise, Spectrogram, Intelligibility.
Speech enhancement using a single microphone has become an active area of research. Among the numerous approaches for speech enhancement, Spectral Subtraction [1], [2] is the most widely used because of its low computational complexity. However, this method produces a residual noise called musical noise. Contrary to its name, musical noise is not pleasing; rather, it is annoying. In fact, the unacceptability of musical noise has motivated the invention of enhancement methods based on considerations different from those of spectral subtraction. Among such methods, those proposed by Lim and Oppenheim [3] and Ephraim and Malah [4], and Martin's [9] method based on minimum statistics, are very promising and have received considerable research attention.
In this paper, we investigate whether musical noise
introduced by Power spectral subtraction can be suppressed without noticeable effect on speech intelligibility. In this connection, it is worthwhile mentioning
Manuscript received on February 1, 2007; revised on March 4, 2007.
1,2,3 The authors are with the DSP Group, Jawaharlal Nehru Technological University, Hyderabad, India. Emails: [email protected], prasad [email protected], and [email protected]
the relevant methods proposed by Boll [2], Whipple [5], and Yamauchi [8]. Basically, these methods
are developed based on the assumption that the spectral components of musical noise usually appear as
isolated peaks in the spectrogram of enhanced speech.
However, in practice, musical noise manifests itself as
not only isolated peaks, but also “short ridges” in
the spectrogram, and therefore will not be effectively
suppressed by these methods. Although it is possible
to suppress those “short ridges” by modifying some
parameters associated with the methods, the intelligibility of speech would usually have to be compromised. We have proposed in our previous work [6] a
new approach to noise estimation for use in Power spectral subtraction. It resulted in better intelligibility with lower computational complexity, but also in increased musical noise. So, in this work we propose to eliminate this musical noise by post-processing the enhanced speech produced by our previous method.
In this work, we first identify the trade-offs among
suppression of unwanted noise, generation of musical noise, and preservation of the intelligibility of desired speech. Subsequently, we propose a post processing method capable of suppressing musical noise
effectively without noticeable effect on speech intelligibility, via exploiting some specific characteristics
of human speech. Finally, we subject our method to tests with speech sentences to which we add white Gaussian noise, cockpit noise, and low-speed car noise (separately), at various Signal-to-Noise Ratios (SNRs). Performance assessments based on spectrogram plots, objective measures, and informal subjective listening tests show
consistently good results.
Spectral subtraction is a method for restoration
of the power or the magnitude spectrum of a signal observed in additive noise, through subtraction
of an estimate of the average noise spectrum from
the noisy signal spectrum. It is the most common of
the subtractive type algorithms, which form a family
of methods based on subtraction of the noise estimate
from the original speech [7,10]. These systems form
a category of algorithms that operate in frequency
domain. The noise spectrum is estimated, and updated, from the periods when the signal is absent and
only the noise is present. The assumption being that
noise is stationary or a slowly varying process, and
that the noise spectrum does not change significantly
between the updating periods. For restoration of the
time-domain signal, an estimate of the instantaneous
magnitude spectrum is combined with the phase of
the noisy signal, and then transformed back to the time domain by applying the IDFT. The phase of the noisy signal is not modified, since not only is the phase harder to estimate than the magnitude spectrum, but it is also believed that, from a perceptual point of view, the phase carries no useful information for noise suppression.
If we assume that y(n), the discrete noise-corrupted input signal, is composed of the clean speech signal x(n) and the uncorrelated additive noise signal d(n), then the noisy signal can be represented as:

y(n) = x(n) + d(n)    (1)
This assumption is based on the fact that d(n) is stationary, whereas speech is not a stationary signal. The processing is carried out on a short-time basis (frame-by-frame); therefore, y(n) is multiplied by a time-limited window w(n). Thus, the windowed signal yw(n) can be represented as:

yw(n) = xw(n) + dw(n)    (2)
|X̂(ω)|² = |Y(ω)|² − E[|D(ω)|²]    (4)
Eq. (4) cannot be guaranteed to be non-negative, as the right side can become negative due to errors in estimating the noise. These negative values can either be made positive by changing the sign (full-wave rectification) or set to zero (half-wave rectification). When there is no speech present in a given frame, the difference between the estimated and original noise is called the noise residual, and it manifests itself as randomly spaced narrow bands of magnitude spikes/peaks (Fig. 1). Once the signal is transformed back into the time domain, these randomly spaced narrow bands of spikes will sound like the sum of tone generators with random frequencies. This phenomenon is known as "musical noise."
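The subtraction-plus-rectification pipeline described above can be sketched as follows; this is a minimal single-frame illustration, and all function and variable names are ours, not from the paper:

```python
import numpy as np

def power_spectral_subtraction(noisy_frame, noise_psd_est):
    """Minimal power spectral subtraction on one windowed frame.

    noisy_frame   : time-domain samples of one analysis frame
    noise_psd_est : estimate of E[|D(w)|^2], one value per FFT bin
    Returns the enhanced time-domain frame.
    """
    Y = np.fft.fft(noisy_frame)
    # Subtract the average noise power spectrum from the noisy power spectrum.
    clean_power = np.abs(Y) ** 2 - noise_psd_est
    # Half-wave rectification: clamp negative estimates to zero.
    clean_power = np.maximum(clean_power, 0.0)
    # Recombine the estimated magnitude with the unmodified noisy phase.
    X_hat = np.sqrt(clean_power) * np.exp(1j * np.angle(Y))
    return np.real(np.fft.ifft(X_hat))
```

With a zero noise estimate the frame passes through unchanged, which is a quick sanity check on the magnitude/phase recombination.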
The spectral subtraction method described above
has the following limitations.
(a) the finite variance of the instantaneous noise
power spectrum
(b) the cross-product terms
(c) the non-linear mapping of spectral estimates that fall below a threshold, since the magnitude cannot be negative in instances where the noise has been overestimated.
In the frequency domain, with their respective Fourier
transforms, the power spectrum of the noisy signal
can be represented as:
|Yw(ω)|² = |Xw(ω)|² + |Dw(ω)|² + Xw(ω)·Dw*(ω) + Xw*(ω)·Dw(ω)    (3)
where Dw ∗ (ω) and Xw ∗ (ω) represent the complex
conjugates of Dw (ω) and Xw (ω) respectively. The
function |Xw (ω)|2 is referred to as the short time
power spectrum of speech. In power spectrum subtraction, the terms |Dw(ω)|², Xw(ω)·Dw*(ω) and Xw*(ω)·Dw(ω) cannot be obtained directly and are approximated as E[|D(ω)|²], E[X*(ω)D(ω)] and E[X(ω)·D*(ω)], where
E[.] denotes the expectation operator. Typically,
E[|D(ω)|2 ] is estimated during the silence periods,
and is denoted by |D̂(ω)|2 . If we assume that d(n)is
zero mean and uncorrelated with x(n), then the terms
E[X ∗ (ω)D(ω)] and E[X(ω).D∗ (ω)] reduce to zero.
Thus, based on the above assumptions, the estimate of the clean speech using the General Spectral Subtraction (GSS) method can be given as in Eq. (4).
Fig.1: Characteristics of musical noise in the frequency domain, shown over a 20 ms window for 35 frames. The X-axis indicates frequency, the Y-axis the amplitude of the enhanced speech (with peaks showing the musical noise), and the Z-axis the frame number in which the analysis is done.
It is practically impossible to minimize musical noise without affecting the speech; hence there is a trade-off between the amount of noise reduction and speech distortion. It is for this reason that perceptually based approaches, instead of completely eliminating the musical noise, mask it by exploiting the masking properties of the auditory system. By accounting for the cross-terms, it is possible to reduce the residual noise in the enhanced speech and thereby provide a better estimate of the clean speech. Unfortunately, we do not have access to the clean speech x(n). Therefore, in an attempt to approximate the cross-terms, the corrupted signal y(n) and the noise estimate are used: Y*(ω)D(ω) and Y(ω)·D*(ω) replace X*(ω)D(ω) and X(ω)·D*(ω) in equation (3).
Y(ω)·D*(ω) + Y*(ω)·D(ω)
= [X(ω) + D(ω)]·D*(ω) + [X*(ω) + D*(ω)]·D(ω)
= X(ω)·D*(ω) + X*(ω)·D(ω) + 2|D(ω)|²    (5)
σyd = (1/N) Σk ||y(k)| − µy| · ||D̂(k)| − µd|

where

µy = (1/N) Σk |y(k)|  and  µd = (1/N) Σk |D̂(k)|

and σy, σd are given by

σy² = (1/N) Σk {|y(k)| − µy}²,  σd² = (1/N) Σk {|D̂(k)| − µd}²
From equation (5) above, it can be seen that by taking the cross product of y(n) and d(n) in the frequency domain, we obtain the desired cross-terms as well as twice the estimated noise spectrum. Based on these observations, a modification to the power spectrum subtraction based methods is proposed in this work, in an attempt to take into account the distortions in spectral subtraction attributed to the cross-terms (Eq. 3). By including a short-time (instantaneous) estimate of the cross product of y(n) and d(n) in the original method of spectral subtraction [1], the new approach for spectral subtraction can be presented as:
|X̂(ω)|² = |Y(ω)|² − α|D̂(ω)|² − δ|Y(ω)|·|D̂(ω)|, if |X̂(ω)|² > 0; otherwise |X̂(ω)|² = β|D̂(ω)|²    (6)

where α is the over-subtraction factor, β is the spectral floor parameter [7], and δ is a multiplying factor incorporated such that 0 ≤ δ ≤ 1, giving the instantaneous cross-correlation between Y(ω) and D̂(ω):

δ = |σyd| / (σy·σd)    (7)

If this correlation is high, the estimated noise is very close to the original noise added to the speech, and vice-versa. Here y(k) = Y(ω) sampled at ω = 2πk/N. If δ is zero, Eq. (6) reduces to the original spectral subtraction equation, as proposed in [7]. The frame-to-frame value of the over-subtraction factor is a function of the segmental SNR, which varies from frame to frame and is given by

SNR(dB) = 10 log10 ( Σ_{k=0}^{N−1} |y(k)|² / Σ_{k=0}^{N−1} |D̂(k)|² )    (8)

Using the SNR from Eq. (8), α can be determined as

α = α0 + 3/4,           SNR < −5 dB
α = α0 − (3/20)·SNR,    −5 dB ≤ SNR ≤ 20 dB
α = 1,                  SNR > 20 dB

where α0 = 4.
The value of the spectral floor parameter β was set to 0.002. By increasing the values of α and δ, the musical noise can be reduced, but at the cost of more speech distortion. So, in order to reduce musical noise without affecting the speech, we propose a post-processing method based on the spectrogram plots of the speech signals.
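The α-update rule and the modified subtraction described above might be sketched as follows. The spectral-floor fallback β|D̂(ω)|² and the saturation of the SNR below −5 dB are our reading of the standard Berouti-style rule where the text is terse; names are ours:

```python
import numpy as np

def over_subtraction_factor(snr_db, alpha0=4.0):
    """Frame-wise over-subtraction factor as a piecewise-linear
    function of the segmental SNR (alpha0 = 4, as in the text)."""
    if snr_db < -5.0:
        snr_db = -5.0          # saturate: alpha = alpha0 + 3/4 below -5 dB
    if snr_db > 20.0:
        return 1.0             # mild subtraction for high-SNR frames
    return alpha0 - (3.0 / 20.0) * snr_db

def modified_subtraction(Y, D_hat, delta, alpha, beta=0.002):
    """Eq. (6)-style rule: subtract alpha*|D|^2 plus the delta-weighted
    cross term; fall back to the spectral floor when negative."""
    diff = (np.abs(Y) ** 2 - alpha * np.abs(D_hat) ** 2
            - delta * np.abs(Y) * np.abs(D_hat))
    floor = beta * np.abs(D_hat) ** 2
    return np.where(diff > 0, diff, floor)
```

Note that the two branches of `over_subtraction_factor` meet continuously at −5 dB (4 + 0.75 = 4.75) and at 20 dB (4 − 3 = 1).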
Fig.2-a: Spectrogram of the (clean) speech sentence "Hedge apples may stain your hands green".
Fig.2-b: Spectrogram of the speech corrupted by Car noise.
Fig.2-c: Spectrogram of the enhanced speech obtained using our GSS method with moderate α.
Fig.2-d: Spectrogram of the enhanced speech obtained using GSS with large α.
Fig.2-e: Spectrogram of the enhanced speech obtained using our post processing method.

Before we begin the discussion of our method, it is beneficial to examine the spectrograms (graphical plots of spectral magnitudes) of typical clean speech,
noisy speech, and enhanced speech. Fig. 2(a) shows
the spectrogram of an 8 kHz (clean) speech signal.
The horizontal axis of the spectrogram denotes time,
vertical axis frequency, and the spectral magnitude is
shown with gray shade (darker shade indicates larger
value). Observe that a large portion of the spectrogram is practically blank (i.e., unshaded) and the
speech energy is concentrated in a few isolated regions. In the figure, the voiced portion of speech is
characterized by dark parallel “stripes” whereas unvoiced portion is characterized by gray patches. Notice that some parallel stripes are horizontal while
some are slanting up or down, indicating a change in
the pitch of the speech signal.
When Car noise amounting to an SNR of 0 dB
is added to the clean speech, the blank region of
the spectrogram as shown in Fig. 2(a) becomes
shaded, and some of the stripes corresponding to
voiced speech disappear [see Fig. 2(b)]. With an
appropriate spectral subtraction, we obtain an enhanced speech with spectrogram as shown in Fig.2(c).
Spectral subtraction has suppressed the noise greatly,
and consequently Fig. 2(c) resembles Fig. 2(a) much
more than Fig. 2(b) does. However, noise suppression is achieved at the price of musical noise which
corresponds to isolated short stripes in the spectrogram.
Musical noise can be suppressed via over-subtraction (i.e., GSS/our previous work using a larger α), but this comes at the expense of speech intelligibility. Indeed, in Fig. 2(d), which shows the spectrogram of the enhanced speech obtained using our proposed method with a large α, we observe a significant reduction of the unwanted short stripes. At the same time, some stripes observed in Fig. 2(c) (the spectrogram of the enhanced speech with a smaller α), corresponding to the desired speech content, are eliminated. [Fig. 2(e) will be referred to later in Section 6.]
In short, it is possible to suppress unwanted noise effectively with our previously proposed method/GSS. However, either the speech quality is compromised by the annoying musical noise, or the speech intelligibility decreases. Consequently, it is a challenge to suppress unwanted noise effectively while maintaining reasonably high speech quality and intelligibility.
Our strategy is to first obtain the spectral magnitudes (and hence the spectrogram) of the enhanced speech via the GSS/previously proposed method, using an appropriate α such that the enhanced speech is of reasonably high intelligibility but with appreciable musical noise. The next step involves suppressing the short stripes in the spectrogram that correspond to musical noise, without noticeable effect on the speech content. The final step computes the enhanced speech via the inverse discrete Fourier transform and overlap-add processing, using the spectral magnitudes obtained from the processed spectrogram and the spectral phases of the noisy speech.
The effectiveness of our method depends greatly
on the ability to identify which regions of the spectrogram correspond to the desired speech signal
and which regions correspond to musical noise, and
the processing to be carried out over these regions.
Therefore, we shall focus on our classification (identification) approach and our specific treatment of the various regions of the spectrogram.
A. Classification

We shall determine which regions in the spectrogram are very likely to correspond to speech, and which regions correspond to either musical noise or speech (of low energy). For convenience, we shall refer to the regions very likely to be speech as Region A, and the others as Region B.
1) Stage 1 : We will exploit the fact that musical
noise can be effectively reduced via over subtraction
[see Fig. 2(d) and the accompanying discussion in
Section 5].
Fig.3: Six blades over 7×7 spectrogram points. The point of concern coincides with (0, 0), the centroid of the blades.
Indeed, we propose first computing the spectrogram of the enhanced speech based on GSS with a large α. We then include in Region A those spectrogram points (r, k) whose spectral magnitudes remain greater than zero:

(r, k) ∈ Region A if |Yr[k]|² − λ|D̂[k]|² > 0    (9)

Clearly, those spectral components of speech that are strong enough will be retained.
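Stage 1 can be sketched as a simple thresholding of the over-subtracted power. Here `lam` plays the role of the large over-subtraction factor, and the frames-by-bins array layout is our choice:

```python
import numpy as np

def stage1_region_a(spec_power, noise_psd, lam=4.0):
    """Stage-1 classification: a spectrogram point (r, k) joins Region A
    when its power survives over-subtraction with a large factor lam.

    spec_power : 2-D array of |Y_r[k]|^2, shape (frames, bins)
    noise_psd  : 1-D array of |D_hat[k]|^2, shape (bins,)
    Returns a boolean mask of Region-A points.
    """
    return spec_power - lam * noise_psd[np.newaxis, :] > 0
```

Points failing this test are not discarded; they are passed to the Stage-2 blade test described next.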
2) Stage 2: The fact that over-subtraction decreases speech intelligibility implies that some spectral components corresponding to speech are attenuated in the over-subtraction process. It is therefore sensible to assume that there are additional spectrogram points that should be classified under Region A. However, the spectral values associated with these points are low, and usually comparable to those of musical noise, meaning that any simple energy-based discrimination would fail. This motivates the development of the following method, which exploits specific characteristics of speech.
First define blades Bi, for i = 1, ..., p, with different orientations over the point of concern [i.e., (0, 0)], as shown in Fig. 3. The length of each blade is decided by the window size defined around the point of concern in the spectrogram. Since we have chosen a 7×7 window, the blade size is 7. If the window size is very small, computational complexity increases; if it is very large, averaging takes place over a large region, resulting in poor accuracy. So we have chosen a blade size of 7 by trial and error. Around the point of concern we have defined blades at an angular spacing of 30°, which results in 6 blades (180/30 = 6). If the angle is very small, the computational complexity increases since the number of blades increases. If the angle is large, many of the points in the window would not come into the vicinity of the blades.
The grid points of B1, ..., B6 as shown in Fig. 3 are {(-3, 3), (-2, 2), (-1, 1), (0, 0), (1, -1), (2, -2), (3, -3)}, {(-3, 1), (-2, 1), (-1, 0), (0, 0), (1, 0), (2, -1), (3, -1)}, {(-3, 0), (-2, 0), (-1, 0), (0, 0), (1, 0), (2, 0), (3, 0)}, {(-3, -1), (-2, -1), (-1, 0), (0, 0), (1, 0), (2, 1), (3, 1)}, {(-3, -3), (-2, -2), (-1, -1), (0, 0), (1, 1), (2, 2), (3, 3)}, and {(0, -3), (0, -2), (0, -1), (0, 0), (0, 1), (0, 2), (0, 3)}, respectively. The width of each blade should be thin enough that the grid points it intersects form a straight line. Of course, for some orientations the width of the blade has to be somewhat larger so as to intersect a significant number of points, and then the points intersected are not strictly in one straight line (see B2 and B4 of Fig. 3).
To determine whether a spectrogram point belongs to Region A, we compute var(Bi), the variance of the values of the spectrogram points which the blade intersects, for each blade Bi, i = 1, ..., p:

var(Bi) = (1/|Bi|) Σ {20 log10(|Ŝr[k]| + 1)}² − [ (1/|Bi|) Σ 20 log10(|Ŝr[k]| + 1) ]²    (11)

where the sums run over the points of Bi and |Bi| denotes the number of points in the blade. We then identify Bmin, the blade whose variance is the minimum among var(Bi) for i = 1, ..., p:

Bmin = arg min_i [var(Bi)]    (12)
The variance associated with Bmin, denoted var(Bmin), offers an indication as to whether the point concerned belongs to Region A. Indeed, for a point belonging to the parallel stripes associated with voiced speech, Bmin will most likely have the same orientation as the stripes, and var(Bmin) will be quite small due to the homogeneity of the spectral magnitude values. For a point within patches associated with unvoiced speech, all the variances will be reasonably low, especially var(Bmin). On the other hand, the variances for a point belonging to short stripes corresponding to musical noise will all be considerably high, because the blade length is longer than the stripe length, and Bmin intersects some points outside the stripes in addition to those within.

Consequently, it is justifiable to assume that a spectrogram point (r, k) belongs to Region A if var(Bmin) is considerably small:

(r, k) ∈ Region A if var(Bmin) < τ    (13)

where τ is an appropriately chosen threshold.
In fact, the histogram of var(Bmin) values will often exhibit two peaks, one occurring at a large var(Bmin) and the other at a small var(Bmin). Analysis of the peaks confirms that the former correlates well with musical noise and the latter with the speech signal. Therefore we recommend setting the threshold τ around the valley between the two peaks. Additional consideration was given to points at the boundaries of stripes/patches associated with speech. In fact, for such a boundary point, almost every one of the blades will have one part protruding out of the stripe/patch of concern, leading to a large var(Bmin). To tackle this problem, we employ additional "left" and "right" blades as shown in Fig. 4 (the additional blades have lengths and orientation angles identical to the original ones shown in Fig. 3). For example, B7 = {(-6, 6), (-5, 5), (-4, 4), (-3, 3), (-2, 2), (-1, 1), (0, 0)} and B13 = {(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)} [note that the point of concern is at (0, 0)]. Now for any one point of concern, one simply obtains Bmin and var(Bmin) for the original, the left and the right blades, exactly as discussed in the previous paragraphs. There will thus be a single lowest var(Bmin) over the three types of blades. The spectrogram points of the blade with this Bmin are most likely to be part of a stripe or patch associated with speech. On the other hand, for a point of concern on a short stripe associated with musical noise, the value of var(Bmin) will remain high, as the lengths of the original, left and right blades are all longer than that of the stripe.
Remark: With the classification carried out in Stage
2, that in Stage 1 seems redundant. While this is generally true, Stage 1 is crucial when there exist sudden
bursts of speech utterances which give rise to intense
stripes with rapidly increasing spectral values. For
these bursts, var(Bmin ) will be high and the point
concerned can be confirmed to fall under Region A
only with the energy discrimination approach that
Stage 1 adopts.
B. Processing Treatment
Region A: After the speech/musical-noise classification process, the spectrogram is divided into two
regions, namely Region A and Region B. For those
points in Region A, we suggest leaving the spectral
values untouched. For those points in Region B, the
following processing will be carried out.
Region B: Region B comprises points associated with either musical noise or speech (of low energy).

Fig.4: (a) Five "left" blades (b) Five "right" blades.

The criterion for processing should be that the spectral values of points corresponding to musical noise are considerably attenuated, while those corresponding to speech are at most slightly altered. With this in mind, we propose replacing the spectral value of the point concerned by the median of the values of those spectrogram points which Bmin intersects, provided that median is not larger than the current spectral value:

|Ŝr[k]| ← median over (r′, k′) ∈ Bmin of |Ŝr′[k′]|, if this median < |Ŝr[k]|    (14)
Of course, over a spectrogram point corresponding to speech, any form of processing is likely to change the spectral value, and simply leaving the value untouched, as we have suggested for Region A, is probably the safer approach. However, the difficulty
here is that we are unsure whether the point corresponds to either speech or musical noise. Fortunately, Bmin would be likely to coincide/overlap with
the stripes/patches associated with speech. Consequently, the median value will not be too different
from the spectral value of the point concerned due to
the uniformity of points within such stripes. On the
other hand, over a short stripe associated with musical noise, the median will take a spectral value considerably smaller than that of the point itself, since
many points that the blade intersects will fall outside
the stripe and will thus have much lower values.
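The Region-B median rule might look like this in code, with blade offsets in the same (dr, dk) form used for Fig. 3; the helper name is ours:

```python
import numpy as np

def process_region_b(mag, r, k, blade):
    """Replace |S_r[k]| by the median of the magnitudes that B_min
    intersects, but only when that median is not larger than the
    current value, so speech points are at most slightly altered.

    mag   : 2-D spectrogram magnitude array, shape (frames, bins)
    blade : list of (dr, dk) offsets of B_min around the point (r, k)
    Returns the processed magnitude for point (r, k).
    """
    vals = [mag[r + dr, k + dk] for dr, dk in blade]
    med = float(np.median(vals))
    return med if med < mag[r, k] else mag[r, k]
```

An isolated peak surrounded by low values is pulled down to the low median, while a point inside a uniform stripe keeps its value, matching the behaviour argued for above.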
C. The Complete Enhancement Procedure
Now we shall present the complete enhancement
procedure we propose. Given a noisy speech signal, it
is first buffered into overlapping frames with a frame
size of 32 ms and an overlap of 24 ms. If the window length is less than 4 ms, a full pitch period will not be included in the window, so pitch information would not show up in the spectrum. If it is greater than 50 ms, the signal characteristics will change over the course of the window. The window size should therefore lie between 4 ms and 50 ms, and hence we have chosen a frame size of 32 ms, which corresponds to 256 samples at an 8 kHz sampling rate. If we used disjoint windows, the envelope shape of the Hamming window might be audible in the reconstructed signal, so we use overlapped windows. The standard overlap is 50%, but we have used 75% overlap (24 ms) for better reconstruction of speech from the overlapped analysis segments.
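The buffering step just described can be sketched as follows: 32 ms Hamming-windowed frames with 75% overlap, i.e., a 64-sample hop at 8 kHz (function name ours):

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=32, overlap_ms=24):
    """Buffer x into overlapping Hamming-windowed frames.

    Defaults give 256-sample frames with a 64-sample hop (75% overlap).
    Returns (frames, hop): a (n_frames, frame_len) array and the hop size.
    """
    frame_len = int(fs * frame_ms / 1000)           # 256 samples at 8 kHz
    hop = frame_len - int(fs * overlap_ms / 1000)   # 64-sample hop
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * win
                       for i in range(n_frames)])
    return frames, hop
```

One second of 8 kHz audio yields 122 frames with these settings, which sets the scale of the per-frame processing that follows.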
Each frame is then multiplied by a Hamming window and transformed to the frequency domain via a fast Fourier transform (FFT). Next, spectral subtraction based on (6) is employed to obtain the enhanced speech. Subsequently, every Sampled Short-time Fourier Transform (SSTFT) magnitude point is subject to classification: it is assigned to Region A (the region very likely to be speech) or Region B (those not in Region A) according to (10). Note that all 16 blades B1, ..., B16, as shown in Figs. 3 and 4, are used in the computation of Bmin as given by (12), which is in turn used for classification via the recipe specified by (13). Points classified under Region A are left untouched. For points classified under Region B, we recompute the SSTFT magnitude via (14). Finally, by combining the SSTFT magnitudes so obtained with the SSTFT phases of the noisy speech Y(ω), applying the inverse FFT and standard overlap-add, we get the desired post-processed enhanced speech.
We shall now assess the performance of our
method. The noisy speech signal is first sampled at 8 kHz, since the maximum speech bandwidth mostly used in telephone or other commercial applications is 4 kHz. Also, different types of noise, namely
computer-generated white Gaussian noise (WGN)
and real cockpit noise, amounting to various values
of SNR (-5, 0, 5 dB) were considered. For performance assessment, we relied on not only spectrogram
plots, but also on objective measures such as segmental SNR (SEGSNR) and inverted linear spectral
distance (ILSD) [11], and informal subjective listening tests. ILSD was employed because it has reasonably high correlation with diagnostic acceptability
measure [11], a widely adopted subjective measure
for overall speech quality and intelligibility. (Interested readers may refer to [11] for more details about
SEGSNR and ILSD.) Note that the ILSD measure takes values between 0% and 100%, with 100% (0%) being best (worst) in overall quality and intelligibility. Note also that we removed silent intervals in the speech signals before computing SEGSNR, since the silent intervals could drastically affect its value.
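Segmental SNR, used here for assessment, might be computed as follows. The per-frame clamping range of [−10, 35] dB is a common convention, not something the paper specifies, and the function name is ours:

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=256, clamp_db=(-10.0, 35.0)):
    """Segmental SNR: frame-wise SNR in dB averaged over frames.

    Each frame's SNR is clamped to a typical [-10, 35] dB range
    (our assumption) so silent/perfect frames do not dominate.
    """
    n = min(len(clean), len(enhanced)) // frame_len * frame_len
    c = clean[:n].reshape(-1, frame_len)
    e = enhanced[:n].reshape(-1, frame_len)
    noise = c - e                       # residual error per frame
    num = np.sum(c ** 2, axis=1)
    den = np.sum(noise ** 2, axis=1) + 1e-12
    snr = 10 * np.log10(num / den + 1e-12)
    return float(np.mean(np.clip(snr, *clamp_db)))
```

A perfect reconstruction saturates at the upper clamp, and an all-zero "enhanced" signal scores near 0 dB, which matches the intuitive reading of the measure.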
We first evaluated the performance through a visual inspection of the spectrograms. The evaluation
was carried out on the speech sentence “Hedge apples may stain your hands green” corrupted by Car
noise. Now recall that Fig.2(c) is the spectrogram of
the enhanced speech obtained using GSS with a moderate α, which exhibits a few isolated short stripes
corresponding to musical noise. Subjecting the spectrogram to our post processing method, we obtained
the spectrogram shown in Fig.2 (e).
A comparison of the two spectrograms shows that
most unwanted short stripes are eliminated while
the parallel stripes and large patches corresponding
to voiced and unvoiced speech respectively remains
practically untouched. Fig.2 (d) shows the spectrogram of the enhanced speech obtained using GSS
with a large α. The value of α is indeed large since
some stripes associated with speech in Fig.2 (a) (the
spectrogram of the clean speech) have disappeared.
Unfortunately, even with such a large α, some musical noise still survives. On the other hand, it is
encouraging to see that with our method, not only
can the musical noise be almost completely removed,
but also that the speech content is better preserved
[see Fig.2(e)]. Indeed, some musical noise observed in
Fig.2(d) is absent in Fig.2(e), while some stripes corresponding to voiced speech are retained in Fig.2(e)
but not in Fig2(d) (see the relevant labels on both
We also performed informal subjective listening tests on four speech sentences. It was clear that the enhanced speech obtained using GSS with our method was much more pleasant than that with GSS alone. Moreover, the intelligibility of the former was comparable to, if not higher than, that of the latter. Next, we computed the objective measures ILSD and SEGSNR using all four sentences. Fig. 5 shows the Seg-SNR of the Boll method, our preprocessing method and the post processing method, for an input Seg-SNR of 0 dB under different types of noise.

Fig.5: Graph of Seg-SNR measurements for an input Seg-SNR of 0 dB for different types of noise. X: Noisy speech; Y: Enhanced speech obtained using GSS with moderate α; Z: Enhanced speech obtained using GSS with the same α and our method in cascade.

Fig.6: Graph of Seg-SNR measurements vs. SNR for WGN and Cockpit Noise at different SNRs.

From the figure we can see that
the post processing method gives better results under any type of noise condition. For street, airport and train noise, the post processing method shows only slight improvement, because the added noise is other speakers' noise, which is difficult to remove using the Spectral Subtraction technique. Figs. 6 and 7 show SEGSNR and ILSD, respectively, for noisy speech, enhanced speech obtained using GSS with a moderate α, as well as enhanced speech obtained using GSS with the same α and our method in cascade. Clearly, our method offers consistent and significant improvements in the presence of WGN, cockpit noise, car noise, etc., at various SNRs.
We have developed a post processing method for
suppressing noise generated by spectral subtraction.
Visual evaluation based on spectrograms, objective assessment based on ILSD and SEGSNR, and informal subjective listening tests all indicated that our method is reasonably effective.

Fig.7: Same as Fig.6, except ILSD instead of Seg-SNR.

However, regarding complexity and processing time, our method is at a disadvantage. The time taken to process approximately 2.8 s of speech using a 25 ms window with 40% overlap (i.e., 10 ms) is approximately 40 s. This delay is too large for real-time applications, but for offline applications such as recorded speech it is quite acceptable. The computational complexity of our algorithm is also higher.
In our future work we will try to reduce this cost by using a difference spectrogram, formed by subtracting the spectrogram obtained with a large λ (over-estimation) from the one obtained with a small λ (under-estimation), so that almost all strong speech regions are removed and we are left with only the weak speech regions and musical noise regions to be processed by our method.

For further work, we also propose to explore the use of our method for removing the unwanted noise inevitably generated by other enhancement processes. Indeed, although our method was applied only in conjunction with spectral subtraction, it may be appropriate to employ it to suppress the enhancement noise associated with other methods such as Wiener filtering [4], signal subspace decomposition and filtering [12], etc. This issue will be addressed in our future work.
J. S. Lim and A. V. Oppenheim, “Enhancement
and bandwidth compression of noisy speech,”
Proc. IEEE, vol. 67, pp. 1586-1604, Dec. 1979.
S. F. Boll, “Suppression of acoustic noise in
speech using spectral subtraction,” IEEE Trans.
Acoust., Speech, Signal Processing, vol. ASSP27, pp. 113-120, Apr. 1979.
[3] J. S. Lim and A. V. Oppenheim, "All-pole modeling of degraded speech," IEEE Trans. Acoust.,
Speech, Signal Processing, vol. ASSP-26, pp. 197-210, June 1978.
[4] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error
short-time spectral amplitude estimator," IEEE
Trans. Acoust., Speech, Signal Processing, vol. 32,
pp. 1109-1121, Dec. 1984.
[5] G. Whipple, “Low residual noise speech enhancement utilizing time-frequency filtering,” in Proc.
ICASSP’94, pp. I-5/I-8.
[6] K. Anitha Sheela and K. Satya Prasad, "A noise
reduction preprocessor for mobile voice communication using perceptually weighted spectral subtraction method," in Proc. OBCOM,
VIT, Vellore, vol. 1, pp. 91-100, Dec. 2006.
[7] M. Berouti, R. Schwartz, and J. Makhoul,
"Enhancement of speech corrupted by acoustic
noise," in Proc. IEEE ICASSP, Washington, DC, pp. 208-211, Apr. 1979.
[8] Yamauchi and T. Shimamura, "Noise estimation
using high frequency regions for power spectral
subtraction," IEICE Trans. Fundam., vol. E85-A, no. 3, pp. 723-727, Mar. 2002.
[9] R. Martin, “Power Spectral Subtraction based
on minimum statistics,” in Proc. EUSIPCO, Sep.
1994, pp. 1182-1185.
[10] "Evaluation of a correlation subtraction method
for enhancing speech degraded by additive white
noise," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-26, no. 5, pp. 471-472, Oct. 1978.
[11] S. R. Quackenbush, T. P. Barnwell, III, and
M. A. Clements, Objective Measures of Speech
Quality. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[12] Y. Ephraim and H. L. Van Trees, "A spectrally-based signal subspace approach for speech enhancement," in Proc. ICASSP'95, 1995.
K. Anitha Sheela received the B.Tech.
degree in Electronics and Communication Engineering from Regional Engineering College, Warangal, India, in
1993 and the M.Tech. degree in Systems and
Signal Processing from Osmania University, India, in 1998.
She worked as a Testing Engineer
for two years from 1993 to 1995, then
as a Lecturer from
1998 to 2000 in various private engineering colleges at Hyderabad, India, and is presently working as an
Assistant Professor in the Electronics and Communication Engineering
Department at Jawaharlal Nehru Technological University, Hyderabad, India. She has presented about 15
papers in various national and international conferences and
journals. Her current research is in speech enhancement and
speaker recognition using neural networks. Besides speech
processing, she also works in the fields of image processing,
pattern recognition, and DSP processors.
Ms. Sheela is a life member of various professional bodies,
including IETE and ISTE.
K. Satya Prasad received the B.Tech.
degree in Electronics and Communication Engineering from JNTU College of
Engineering, Anantapur, Andhra Pradesh, India, in 1977, the M.E. degree in Communication Systems from Guindy College of Engineering, Madras University, India, in 1979, and the Ph.D. degree from the Indian Institute of Technology, Madras.
He started his teaching career as a
Teaching Assistant at Regional Engineering College, Warangal,
in 1979. He joined JNT University as a Lecturer in 1980 and
has served in different constituent colleges, viz. Kakinada, Hyderabad, and Anantapur, in different capacities, viz. Associate
Professor, Professor, Head of the Department, Vice Principal,
and Principal. He has published more than 50 technical papers
in different national and international conferences and journals
and has authored one textbook.
He has successfully guided 3 Ph.D. scholars, and at present
8 scholars are working under him. His areas of research are
signal processing, image processing, speech processing, neural networks, and ad hoc wireless networks.
Dr. Prasad is a Fellow of various professional bodies, including IETE, IE, and ISTE.
M. Madhavi Latha received the B.E.
degree in Electronics and Communication Engineering from Bapatla College of Engineering, Bapatla, India, in
1986, and the M.Tech. and Ph.D. degrees
from the College of Engineering, Jawaharlal Nehru Technological University, Hyderabad, India, in 1993 and 2002,
respectively.
She worked as an Engineer for three
years during 1989 to 1991,
as a Lecturer from 1986 to 1989, and as an Associate Lecturer from 1993 to 1994 in a polytechnic college at Guntur, India, and is presently working as a Professor at
Jawaharlal Nehru Technological University in the Electronics and
Communication Engineering Department, Hyderabad, India.
She has presented around 15 papers in various national and
international conferences and journals. Her current research
and guidance is in image processing. Besides image processing, she also works in the fields of DSP processors, mixed-signal design, and speech processing.
Dr. Latha is a Fellow of IETE and a life member of ISTE.