Monday, January 11, 2016

Dive deep into Example 1: yesno

In run.sh:
[1] local/prepare_data.sh waves_yesno

#!/bin/bash

mkdir -p data/local
local=`pwd`/local
scripts=`pwd`/scripts

export PATH=$PATH:`pwd`/../../../tools/irstlm/bin

echo "Preparing train and test data"

train_base_name=train_yesno
test_base_name=test_yesno
waves_dir=$1

cd data/local

# output the file names of all the input waves
ls -1 ../../$waves_dir > waves_all.list

# read the waves list and separate it into test and train sets,
# saving the lists in data/local
../../local/create_yesno_waves_test_train.pl waves_all.list waves.test waves.train

# generate the scp file mapping each utterance ID to its wav file path
# e.g., 0_0_0_0_1_1_1_1 waves_yesno/0_0_0_0_1_1_1_1.wav
../../local/create_yesno_wav_scp.pl ${waves_dir} waves.test > ${test_base_name}_wav.scp

# same for the training set
../../local/create_yesno_wav_scp.pl ${waves_dir} waves.train > ${train_base_name}_wav.scp

# translate each wav file name into its YES/NO word transcript
# e.g., 0_0_0_0_1_1_1_1 NO NO NO NO YES YES YES YES
../../local/create_yesno_txt.pl waves.test > ${test_base_name}.txt

../../local/create_yesno_txt.pl waves.train > ${train_base_name}.txt

# copy over the language model file
cp ../../input/task.arpabo lm_tg.arpa

cd ../..

# This stage was copied from the WSJ example
for x in train_yesno test_yesno; do
  mkdir -p data/$x
  # copy over the scp and text files
  cp data/local/${x}_wav.scp data/$x/wav.scp
  cp data/local/$x.txt data/$x/text

  # map each utterance ID to the single speaker 'global'
  # for instance, 0_0_0_0_1_1_1_1 global
  cat data/$x/text | awk '{printf("%s global\n", $1);}' > data/$x/utt2spk

  # invert utt2spk (one line per utterance) into spk2utt (one line per speaker),
  # with the utterance IDs separated by single spaces
  # for instance, global 0_0_0_0_1_1_1_1 0_0_0_1_0_0_0_1 0_0_0_1_0_1_1_0 ....
  utils/utt2spk_to_spk2utt.pl <data/$x/utt2spk >data/$x/spk2utt
done
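
After this stage, data/train_yesno and data/test_yesno each contain the four standard Kaldi data files. A quick sanity check (the exact utterance IDs depend on the test/train split; the line formats follow the patterns shown above):

head -1 data/train_yesno/wav.scp   # e.g., 0_0_0_0_1_1_1_1 waves_yesno/0_0_0_0_1_1_1_1.wav
head -1 data/train_yesno/text      # e.g., 0_0_0_0_1_1_1_1 NO NO NO NO YES YES YES YES
head -1 data/train_yesno/utt2spk   # e.g., 0_0_0_0_1_1_1_1 global
head -1 data/train_yesno/spk2utt   # global 0_0_0_0_1_1_1_1 0_0_0_1_0_0_0_1 ...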


[2] local/prepare_dict.sh
#!/bin/bash

mkdir -p data/local/dict

# copy the lexicon entries for the target words (YES/NO)
cp input/lexicon_nosil.txt data/local/dict/lexicon_words.txt

# copy the full lexicon, including silence
cp input/lexicon.txt data/local/dict/lexicon.txt

# copy the phone list, excluding silence
cat input/phones.txt | grep -v SIL > data/local/dict/nonsilence_phones.txt

echo "SIL" > data/local/dict/silence_phones.txt

echo "SIL" > data/local/dict/optional_silence.txt

echo "Dictionary preparation succeeded"


[3] utils/prepare_lang.sh --position-dependent-phones false data/local/dict "<SIL>" data/local/lang data/lang

# This script adds word-position-dependent phones and constructs a host of other
# derived files, that go in data/lang/.

This reformats the dictionary files from data/local/dict into the data/lang directory layout (words.txt, phones.txt, L.fst, topo, etc.).
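
You can list what it produced; the listing below is the standard Kaldi lang-directory layout (the exact file set may vary slightly across Kaldi versions):

ls data/lang
# L.fst  L_disambig.fst  oov.int  oov.txt  phones  phones.txt  topo  words.txt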

[4] local/prepare_lm.sh
Preparing language models for test

The output is a grammar finite-state transducer, G.fst. Console log:

arpa2fst -
Processing 1-grams
Connected 0 states without outgoing arcs.
fstisstochastic data/lang_test_tg/G.fst
1.20397 0
Succeeded in formatting data.
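
Under the hood, prepare_lm.sh calls arpa2fst to compile the ARPA LM into G.fst. A hedged sketch of the core conversion (this follows the current arpa2fst interface; older scripts piped through fstcompile instead):

arpa2fst --disambig-symbol='#0' \
  --read-symbol-table=data/lang_test_tg/words.txt \
  data/local/lm_tg.arpa data/lang_test_tg/G.fst

# the two numbers printed by fstisstochastic summarize how far the arc
# probabilities deviate from summing to one; a backoff LM is usually not
# exactly stochastic, so a nonzero value like 1.20397 is expected
fstisstochastic data/lang_test_tg/G.fst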

[5] Feature extraction
for x in train_yesno test_yesno; do
  steps/make_mfcc.sh --nj 1 data/$x exp/make_mfcc/$x mfcc
  steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc
done

compute-mfcc-feats
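
make_mfcc.sh is a thin wrapper; the actual work happens in this binary. A hedged sketch of the underlying pipeline for the single-job (--nj 1) case (archive names follow the usual script conventions):

compute-mfcc-feats --verbose=2 --config=conf/mfcc.conf \
  scp:data/train_yesno/wav.scp ark:- | \
  copy-feats --compress=true ark:- \
    ark,scp:mfcc/raw_mfcc_train_yesno.1.ark,mfcc/raw_mfcc_train_yesno.1.scp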

If you add printouts to compute-mfcc-feats.cc, you can view the output in the following log file:
 $vim exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log

The source, compute-mfcc-feats.cc, lives in the src/featbin folder.
You can measure timing by adding

// added for timing
#include "base/timer.h"
#include "base/kaldi-common.h"
#include "base/kaldi-utils.h"

to the top of the file, and then starting a timer around the code you want to
measure, e.g.:

kaldi::Timer timer;
// ... code under measurement (e.g., the FFT call) ...
KALDI_LOG << "FFT time : " << timer.Elapsed() << " (s)";

After recompiling the program and rerunning ./run.sh
in /home/leiming/kaldi-trunk/egs/yesno/s5,
you can check the timing in the log:
$vim exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log

Here are some profiling results on an i7-4790K CPU @ 4.00GHz.

1 frame of MFCC:
extract window time:            1.78814e-05 (s)
FFT time:                       5.96046e-06 (s)
power spectrum on mel banks:    2.14577e-06 (s)
apply log():                    9.53674e-07 (s)
DCT:                            9.53674e-07 (s)

We have 633 frames for utterance 0_0_0_0_1_1_1_1 (8 words).
The per-stage times above sum to about 2.789e-05 s per frame, so the total is
roughly 2.789e-05 s x 633 ≈ 17.7 ms.

Timing the MFCC computation for this single utterance directly gives 16.6 ms.

That comes down to about 2 ms per word for MFCC computation.

For all 31 utterances: 16.6 ms x 31 = 514.6 ms.


As for CMVN (cepstral mean and variance normalization), it uses one-channel mode by default.
For the 31 utterances: 4.32205 ms.
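
compute_cmvn_stats.sh itself boils down to one call to the compute-cmvn-stats binary; a hedged sketch (paths follow the conventions above):

compute-cmvn-stats --spk2utt=ark:data/train_yesno/spk2utt \
  scp:data/train_yesno/feats.scp \
  ark,scp:mfcc/cmvn_train_yesno.ark,mfcc/cmvn_train_yesno.scp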


[6]  Training
# Mono training                                                              
steps/train_mono.sh --nj 1 --cmd "$train_cmd"  --totgauss 400  data/train_yesno data/lang exp/mono0a

It skips feat-to-dim and runs gmm-init-mono, which initializes a monophone
GMM with the HMM topology. Check the timing in the following log:
$vim exp/mono0a/log/init.log
gmm-init-mono time : 0.00767899 (s)

Next, compile-train-graphs:
$vim exp/mono0a/log/compile_graphs.1.log
compile train graph time : 0.0409088 (s)

Next, align-equal-compiled:
$vim exp/mono0a/log/align.0.1.log
align equal compiled time : 0.031034 (s)


....
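
The elided steps are the EM training iterations. A hedged outline of what train_mono.sh does on each pass (the binary names are real; arguments are abbreviated with ...):

# realign the data with the current model
gmm-align-compiled ... exp/mono0a/$x.mdl "ark:gunzip -c exp/mono0a/fsts.1.gz|" \
  "$feats" "ark,t:|gzip -c > exp/mono0a/ali.1.gz"
# accumulate GMM statistics from the new alignments
gmm-acc-stats-ali exp/mono0a/$x.mdl "$feats" \
  "ark:gunzip -c exp/mono0a/ali.1.gz|" exp/mono0a/$x.acc
# re-estimate the model, gradually mixing up toward --totgauss 400
gmm-est --mix-up=$numgauss exp/mono0a/$x.mdl exp/mono0a/$x.acc exp/mono0a/$[$x+1].mdl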


I did a quick hack on timing, measuring each stage's wall-clock time directly in the shell script.
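
Something along these lines (a reconstruction, not the exact script; date +%s%N prints nanoseconds with GNU date):

start=$(date +%s%N)
local/prepare_data.sh waves_yesno
end=$(date +%s%N)
echo "data preparation time in (ms): $(( (end - start) / 1000000 ))"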

data preparation time in (ms): 155
feature extraction time in (ms): 1112
train_mono time in (ms): 9448
compile graph in (ms): 31
decode in (ms): 837