[1] local/prepare_data.sh waves_yesno
#!/bin/bash
mkdir -p data/local
local=`pwd`/local
scripts=`pwd`/scripts
export PATH=$PATH:`pwd`/../../../tools/irstlm/bin
echo "Preparing train and test data"
train_base_name=train_yesno
test_base_name=test_yesno
waves_dir=$1
cd data/local
# list the file names of all the input waves
ls -1 ../../$waves_dir > waves_all.list
# read the waves list, separate them into test and train sets
# save them in data/local
../../local/create_yesno_waves_test_train.pl waves_all.list waves.test waves.train
# generate the scp file mapping each utterance id (derived from the file name) to the full wave path
# e.g., 0_0_0_0_1_1_1_1 waves_yesno/0_0_0_0_1_1_1_1.wav
../../local/create_yesno_wav_scp.pl ${waves_dir} waves.test > ${test_base_name}_wav.scp
# same for the training set
../../local/create_yesno_wav_scp.pl ${waves_dir} waves.train > ${train_base_name}_wav.scp
# translate each wave file name into its YES/NO transcription
# e.g., 0_0_0_0_1_1_1_1 NO NO NO NO YES YES YES YES
../../local/create_yesno_txt.pl waves.test > ${test_base_name}.txt
../../local/create_yesno_txt.pl waves.train > ${train_base_name}.txt
# copy over the language model file
cp ../../input/task.arpabo lm_tg.arpa
cd ../..
# This stage was copied from WSJ example
for x in train_yesno test_yesno; do
mkdir -p data/$x
# copy over the scp and text files
cp data/local/${x}_wav.scp data/$x/wav.scp
cp data/local/$x.txt data/$x/text
# map every utterance to the single speaker 'global' (keep column 1, append 'global')
# for instance, 0_0_0_0_1_1_1_1 global
cat data/$x/text | awk '{printf("%s global\n", $1);}' > data/$x/utt2spk
# invert utt2spk into spk2utt: one line per speaker, with that speaker's
# utterances separated by single spaces (see the awk sketch after this script)
# for instance, global 0_0_0_0_1_1_1_1 0_0_0_1_0_0_0_1 0_0_0_1_0_1_1_0 ....
utils/utt2spk_to_spk2utt.pl <data/$x/utt2spk >data/$x/spk2utt
done
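For this single-speaker setup, the utt2spk-to-spk2utt conversion is just a group-by on the speaker column. A rough awk equivalent of what the Perl script does (illustration only; the recipe itself uses utils/utt2spk_to_spk2utt.pl):
# group utterance ids by speaker: "utt spk" lines become "spk utt1 utt2 ..." lines
awk '{utts[$2] = utts[$2] " " $1} END {for (s in utts) print s utts[s]}' \
    data/train_yesno/utt2spk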
[2] local/prepare_dict.sh
#!/bin/bash
mkdir -p data/local/dict
# copy the lexicon entries for the actual words (no silence entry)
cp input/lexicon_nosil.txt data/local/dict/lexicon_words.txt
# copy the full lexicon, including the silence entry
cp input/lexicon.txt data/local/dict/lexicon.txt
# copy the phones, excluding silence
cat input/phones.txt | grep -v SIL > data/local/dict/nonsilence_phones.txt
echo "SIL" > data/local/dict/silence_phones.txt
echo "SIL" > data/local/dict/optional_silence.txt
echo "Dictionary preparation succeeded"
[3] utils/prepare_lang.sh --position-dependent-phones false data/local/dict "<SIL>" data/local/lang data/lang
# This script builds data/lang/ (L.fst, words.txt, phones.txt, topo, ...) from the
# dict directory; it can add word-position-dependent phones, but that is disabled
# here via --position-dependent-phones false.
In short, it reformats the dictionary in data/local/dict into the lang directory used by the rest of the recipe.
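A quick way to see what got generated (the exact file list can vary slightly across Kaldi versions):
ls data/lang
# typically: L.fst  L_disambig.fst  oov.int  oov.txt  phones/  phones.txt  topo  words.txt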
[4] local/prepare_lm.sh
This builds the grammar finite-state transducer, G.fst, from the ARPA LM copied earlier (data/local/lm_tg.arpa). The console output looks like:
Preparing language models for test
arpa2fst -
Processing 1-grams
Connected 0 states without outgoing arcs.
fstisstochastic data/lang_test_tg/G.fst
1.20397 0
Succeeded in formatting data.
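In essence the script converts the ARPA LM into the grammar FST. A minimal sketch of that step, assuming the newer arpa2fst interface (older recipes instead pipe the ARPA file through helper scripts like utils/eps2disambig.pl):
arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_test_tg/words.txt \
    data/local/lm_tg.arpa data/lang_test_tg/G.fst
# report how far the result is from a stochastic FST (the "1.20397 0" line above)
fstisstochastic data/lang_test_tg/G.fst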
[5] Feature extraction
for x in train_yesno test_yesno; do
steps/make_mfcc.sh --nj 1 data/$x exp/make_mfcc/$x mfcc
steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc
done
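After the loop finishes, data/train_yesno/feats.scp and data/train_yesno/cmvn.scp point at the generated archives. A quick optional sanity check:
feat-to-dim scp:data/train_yesno/feats.scp -                # expect 13, the default MFCC dimension
copy-feats scp:data/train_yesno/feats.scp ark,t:- | head    # peek at the raw feature values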
make_mfcc.sh calls the compute-mfcc-feats binary (source under src/featbin).
If you add printouts to compute-mfcc-feats.cc, they show up in the corresponding log file:
$vim exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log
You can measure the timing by adding
// added for timing
#include "base/timer.h"
#include "base/kaldi-common.h"
#include "base/kaldi-utils.h"
to the top of the file and starting a timer around the code of interest.
After recompiling and rerunning ./run.sh from egs/yesno/s5 (here /home/leiming/kaldi-trunk/egs/yesno/s5),
the timing shows up in the same log:
$vim exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log
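A rough sketch of the rebuild-and-inspect cycle, starting from the Kaldi source root (the grep pattern depends on what you actually print):
cd src/featbin && make && cd -       # rebuild the modified feature binary
cd egs/yesno/s5 && ./run.sh          # rerun the recipe
grep "time :" exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log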
Here are some profiling results on an i7-4790K CPU @ 4.00 GHz.
1 frame of MFCC:
extract window time : 1.78814e-05 (s)
FFT time : 5.96046e-06 (s)
power spectrum on mel banks : 2.14577e-06 (s)
apply log() : 9.53674e-07 (s)
dct : 9.53674e-07 (s)
There are 633 frames for utterance 0_0_0_0_1_1_1_1 (8 words),
so the per-frame cost above sums to roughly 0.00002789 s, and 0.00002789 x 633 = 17.7 ms for the whole utterance.
Timing the MFCC computation for this single utterance directly gives 16.6 ms, which is consistent.
That comes down to roughly 2 ms per word of MFCC computation.
For all 31 training utterances: 16.6 x 31 = 514.6 ms.
Regarding CMVN (cepstral mean and variance normalization), it runs in the default single-channel mode.
For all 31 utterances it takes 4.32205 ms.
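Since everything maps to the single speaker 'global', there is exactly one CMVN stats matrix; you can dump it in text form with:
copy-matrix scp:data/train_yesno/cmvn.scp ark,t:- | head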
[6] Training
# Mono training
steps/train_mono.sh --nj 1 --cmd "$train_cmd" --totgauss 400 data/train_yesno data/lang exp/mono0a
It skips feat-to-dim and runs gmm-init-mono, which initializes a monophone GMM with the HMM topology; check its timing in the following log:
$vim exp/mono0a/log/init.log
gmm-init-mono time : 0.00767899 (s)
Next, compile-train-graphs builds the training graphs:
$vim exp/mono0a/log/compile_graphs.1.log
compile train graph time : 0.0409088 (s)
Next, align-equal-compiled produces the initial equally-spaced alignments:
$vim exp/mono0a/log/align.0.1.log
align equal compiled time : 0.031034 (s)
....
I did a quick hack on timing, directly measuring the elapsed time of each stage in the shell script (see the sketch after these numbers):
data preparation time in (ms): 155
feature extraction time in (ms): 1112
train_mono time in (ms): 9448
compile graph in (ms): 31
decode in (ms): 837
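A minimal sketch of that shell-level timing hack (variable names are illustrative; GNU date is assumed for nanosecond resolution):
start_ms=$(($(date +%s%N) / 1000000))
steps/train_mono.sh --nj 1 --cmd "$train_cmd" --totgauss 400 \
    data/train_yesno data/lang exp/mono0a
end_ms=$(($(date +%s%N) / 1000000))
echo "train_mono time in (ms): $((end_ms - start_ms))"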