audiofilemfcc
This module contains the following classes:
- AudioFileMFCC, representing a mono WAVE audio file as a matrix of Mel-frequency cepstral coefficients (MFCCs).

class aeneas.audiofilemfcc.AudioFileMFCC(file_path=None, file_format=None, mfcc_matrix=None, audio_file=None, rconf=None, logger=None)[source]
A monoaural (single channel) WAVE audio file, represented as a NumPy 2D matrix of Mel-frequency cepstral coefficients (MFCCs).
The matrix is “fat”: its number of rows is equal to the number of MFCC coefficients, and its number of columns is equal to the number of window shifts in the audio file. The number of MFCC coefficients and the MFCC window shift can be modified via the MFCC_SIZE and MFCC_WINDOW_SHIFT keys in the rconf object.
If mfcc_matrix is not None, it will be used as the MFCC matrix.
If file_path or audio_file is not None, the MFCCs will be computed upon creation of the object, possibly converting the file to PCM16 Mono WAVE and/or loading the audio data in memory.
The MFCCs for the entire wave are divided into three contiguous intervals (each possibly of zero length):
HEAD   = [:middle_begin[
MIDDLE = [middle_begin:middle_end[
TAIL   = [middle_end:[
The usual NumPy convention of including the left/start index and excluding the right/end index is adopted.
For alignment purposes, only the MIDDLE portion of the wave is taken into account; the HEAD and TAIL intervals are ignored.
This class makes heavy use of NumPy views and in-place operations to avoid creating temporary data or copying data around.
Parameters:
- file_path (string) – the path of the PCM16 mono WAVE file, or None
- file_format (tuple) – the format of the audio file, if known in advance: (codec, channels, rate), or None
- mfcc_matrix (numpy.ndarray) – the MFCC matrix to be set, or None
- audio_file (AudioFile) – an audio file, or None
- rconf (RuntimeConfiguration) – a runtime configuration
- logger (Logger) – the logger object
Raises: ValueError: if file_path, audio_file, and mfcc_matrix are all None
New in version 1.5.0.

all_length
The length, in MFCC coefficients, of the entire audio file, that is, HEAD + MIDDLE + TAIL.
Return type: int

all_mfcc
The MFCCs of the entire audio file, that is, HEAD + MIDDLE + TAIL.
Return type: numpy.ndarray (2D)

audio_length
The length, in seconds, of the audio file.
This value is the actual length of the audio file, computed as number of samples / sample_rate, hence it might differ from len(self.__mfcc) * mfcc_window_shift.
Return type: TimeValue
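The discrepancy noted above can be checked with plain arithmetic; the sample rate and window shift below are illustrative assumptions, not library defaults:

```python
# Assumed, illustrative values: 44100 Hz sample rate, 40 ms window shift.
num_samples = 1_000_000
sample_rate = 44100
mfcc_window_shift = 0.040  # seconds

audio_length = num_samples / sample_rate            # actual length in seconds
num_frames = int(audio_length / mfcc_window_shift)  # whole window shifts
approx_length = num_frames * mfcc_window_shift

# The frame-based figure can undershoot the true length by up to one shift.
assert 0 <= audio_length - approx_length < mfcc_window_shift
```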

head_length
The length, in MFCC coefficients, of the HEAD of the audio file.
Return type: int

inside_nonspeech(index)[source]
If index is contained in a nonspeech interval, return a pair (interval_begin, interval_end) such that interval_begin <= index < interval_end, i.e., interval_end is not included.
Otherwise, return None.
Return type: None or tuple
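The documented contract can be sketched over a plain list of half-open (begin, end) pairs; the helper below is a hypothetical illustration, not the method's actual implementation, which works on the internal mask:

```python
def inside_nonspeech(index, nonspeech_intervals):
    """Return the (interval_begin, interval_end) pair containing index,
    or None. Each pair is half-open: begin <= index < end, end excluded."""
    for begin, end in nonspeech_intervals:
        if begin <= index < end:
            return (begin, end)
    return None

intervals = [(0, 5), (20, 30)]
assert inside_nonspeech(3, intervals) == (0, 5)
assert inside_nonspeech(5, intervals) is None   # right end excluded
assert inside_nonspeech(25, intervals) == (20, 30)
```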

intervals(speech=True, time=True)[source]
Return a list of intervals:
[(b_1, e_1), (b_2, e_2), ..., (b_k, e_k)]
where b_i is the time when the i-th interval begins, and e_i is the time when it ends.
Parameters:
- speech (bool) – if True, return speech intervals, otherwise return nonspeech intervals
- time (bool) – if True, return TimeInterval objects, otherwise return indices (int)
Return type: list of pairs (see above)
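With time=False, the documented output can be reproduced from a boolean speech mask with a small NumPy sketch; the mask and the helper name below are illustrative, not the class internals:

```python
import numpy as np

def runs(mask, speech=True):
    """Return [(b_1, e_1), ...] index pairs over the runs where
    mask == speech, left index included, right index excluded."""
    m = np.asarray(mask, dtype=bool)
    if not speech:
        m = ~m
    # Pad with 0 on both sides so every run produces a begin and an end edge.
    padded = np.concatenate(([0], m.astype(int), [0]))
    edges = np.flatnonzero(np.diff(padded))
    return [(int(b), int(e)) for b, e in zip(edges[0::2], edges[1::2])]

mask = [False, True, True, False, False, True]
assert runs(mask, speech=True) == [(1, 3), (5, 6)]
assert runs(mask, speech=False) == [(0, 1), (3, 5)]
```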

is_reversed
Return True if currently reversed.
Return type: bool

masked_length
Return the number of MFCC speech frames in the FULL wave.
Return type: int

masked_map
Return the map from the MFCC speech frame indices to the MFCC FULL frame indices.
Return type: numpy.ndarray (1D)
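Such a map can be pictured as the indices of the True entries of the speech mask; numpy.flatnonzero yields exactly such a 1D array (the mask below is an illustrative assumption):

```python
import numpy as np

# Illustrative boolean speech mask over the FULL wave (one entry per frame).
speech_mask = np.array([False, True, True, False, True])

# Map from speech-frame indices to FULL-frame indices.
masked_map = np.flatnonzero(speech_mask)
assert masked_map.tolist() == [1, 2, 4]

# The number of speech frames is then simply the size of the map.
assert masked_map.size == 3
```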

masked_mfcc
Return the MFCC speech frames in the FULL wave.
Return type: numpy.ndarray (2D)

masked_middle_length
Return the number of MFCC speech frames in the MIDDLE portion of the wave.
Return type: int

masked_middle_map
Return the map from the MFCC speech frame indices in the MIDDLE portion of the wave to the MFCC FULL frame indices.
Return type: numpy.ndarray (1D)

masked_middle_mfcc
Return the MFCC speech frames in the MIDDLE portion of the wave.
Return type: numpy.ndarray (2D)

middle_begin
Return the index where MIDDLE starts.
Return type: int

middle_begin_seconds
Return the time instant, in seconds, where MIDDLE starts.
Return type: TimeValue

middle_end
Return the index (+1) where MIDDLE ends.
Return type: int

middle_length
The length, in MFCC coefficients, of the middle part of the audio file, that is, without HEAD and TAIL.
Return type: int

middle_map
Return the map from the MFCC frame indices in the MIDDLE portion of the wave to the MFCC FULL frame indices, that is, a numpy.arange(self.middle_begin, self.middle_end).
NOTE: to translate indices of MIDDLE, instead of using fancy indexing with the result of this function, you might want to simply add self.head_length. This function is provided mostly for consistency with the MASKED case.
Return type: numpy.ndarray (1D)
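The equivalence stated in the note follows because middle_begin equals head_length; a minimal sketch with illustrative index values:

```python
import numpy as np

head_length = 10             # illustrative HEAD length, in frames
middle_begin, middle_end = 10, 90

middle_map = np.arange(middle_begin, middle_end)

# Translating MIDDLE-relative indices to FULL indices: fancy indexing
# with middle_map is equivalent to simply adding head_length.
idx = np.array([0, 5, 79])
assert np.array_equal(middle_map[idx], idx + head_length)
```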

middle_mfcc
The MFCCs of the middle part of the audio file, that is, without HEAD and TAIL.
Return type: numpy.ndarray (2D)

reverse()[source]
Reverse the audio file.
The reversal is done efficiently, using NumPy views in place instead of swapping values.
Only the speech and nonspeech intervals are actually recomputed, as Python lists.
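The view-based trick can be sketched with plain NumPy: reversing the frame axis with a negative stride produces a view over the same buffer, so no data is copied (the matrix below is illustrative):

```python
import numpy as np

mfcc = np.arange(13 * 100, dtype=np.float64).reshape(13, 100)

# Reversing the frame axis with a negative-stride view: no data is copied.
reversed_view = mfcc[:, ::-1]
assert np.shares_memory(reversed_view, mfcc)   # a view, not a copy
assert reversed_view[0, 0] == mfcc[0, -1]

# Reversing twice yields the original matrix.
assert np.array_equal(reversed_view[:, ::-1], mfcc)
```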

run_vad(log_energy_threshold=None, min_nonspeech_length=None, extend_before=None, extend_after=None)[source]
Determine which frames contain speech and which do not, and store the resulting boolean mask internally.
Each of the four parameters may be None: in that case, the corresponding RuntimeConfiguration value is applied.
Parameters:
- log_energy_threshold (float) – the minimum log energy threshold to consider a frame as speech
- min_nonspeech_length (int) – the minimum length, in frames, of a nonspeech interval
- extend_before (int) – extend each speech interval by this number of frames to the left (before)
- extend_after (int) – extend each speech interval by this number of frames to the right (after)
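The role of the four parameters can be illustrated with a toy energy-based VAD over a 1D log-energy track; this is a simplified sketch of the idea, not the library's actual algorithm:

```python
import numpy as np

def toy_vad(log_energy, threshold, min_nonspeech_length, extend_before, extend_after):
    """Toy VAD: threshold the log energy, absorb nonspeech runs shorter
    than min_nonspeech_length, then widen each speech interval.
    Returns a boolean speech mask."""
    energy = np.asarray(log_energy, dtype=float)
    mask = energy >= threshold

    # Absorb nonspeech runs shorter than min_nonspeech_length.
    # Pad with 1 so edge pairs delimit the nonspeech runs.
    padded = np.concatenate(([1], mask.astype(int), [1]))
    edges = np.flatnonzero(np.diff(padded))
    for b, e in zip(edges[0::2], edges[1::2]):
        if e - b < min_nonspeech_length:
            mask[b:e] = True

    # Extend each speech interval to the left (before) and right (after).
    padded = np.concatenate(([0], mask.astype(int), [0]))
    edges = np.flatnonzero(np.diff(padded))
    out = mask.copy()
    for b, e in zip(edges[0::2], edges[1::2]):
        out[max(0, b - extend_before):min(len(out), e + extend_after)] = True
    return out

# A 1-frame nonspeech gap is absorbed, then the interval is widened by 1.
speech = toy_vad([0, 0, 5, 5, 0, 5, 5, 0, 0], 1, 2, 1, 1)
assert speech.tolist() == [False, True, True, True, True, True, True, True, False]
```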

set_head_middle_tail(head_length=None, middle_length=None, tail_length=None)[source]
Set the HEAD, MIDDLE, and TAIL explicitly.
If a parameter is None, it will be ignored. If both middle_length and tail_length are specified, only middle_length will be applied.
Parameters:
- head_length (TimeValue) – the length of HEAD, in seconds, or None
- middle_length (TimeValue) – the length of MIDDLE, in seconds, or None
- tail_length (TimeValue) – the length of TAIL, in seconds, or None
Raises: TypeError: if one of the arguments is not None or TimeValue
Raises: ValueError: if one of the arguments is greater than the length of the audio file
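The precedence rule (middle_length wins over tail_length) and the ValueError check can be sketched with a hypothetical helper over plain float seconds; this is not the library's method, only an illustration of the documented behavior:

```python
def resolve_head_middle_tail(total, head_length=None, middle_length=None, tail_length=None):
    """Toy resolution of the HEAD/MIDDLE/TAIL lengths (in seconds).
    If both middle_length and tail_length are given, only middle_length
    is applied, mirroring the documented precedence."""
    for v in (head_length, middle_length, tail_length):
        if v is not None and v > total:
            raise ValueError("argument greater than the length of the audio file")
    head = head_length if head_length is not None else 0.0
    if middle_length is not None:
        middle = middle_length
        tail = total - head - middle
    elif tail_length is not None:
        tail = tail_length
        middle = total - head - tail
    else:
        middle = total - head
        tail = 0.0
    return head, middle, tail

assert resolve_head_middle_tail(10.0, head_length=1.0, middle_length=8.0) == (1.0, 8.0, 1.0)
assert resolve_head_middle_tail(10.0, head_length=2.0, tail_length=3.0) == (2.0, 5.0, 3.0)
# middle_length takes precedence over tail_length:
assert resolve_head_middle_tail(10.0, middle_length=4.0, tail_length=9.0) == (0.0, 4.0, 6.0)
```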

tail_begin
The index, in MFCC coefficients, where the TAIL of the audio file starts.
Return type: int

tail_length
The length, in MFCC coefficients, of the TAIL of the audio file.
Return type: int