Monday, January 11, 2016

Dive deep into Example 1: yesno

In run.sh:
[1] local/prepare_data.sh waves_yesno

#!/bin/bash

mkdir -p data/local
local=`pwd`/local
scripts=`pwd`/scripts

export PATH=$PATH:`pwd`/../../../tools/irstlm/bin

echo "Preparing train and test data"

train_base_name=train_yesno
test_base_name=test_yesno
waves_dir=$1

cd data/local

# output the file names for all the input waves
ls -1 ../../$waves_dir > waves_all.list

# read the waves list, separate them into test and train sets
# save them in data/local
../../local/create_yesno_waves_test_train.pl waves_all.list waves.test waves.train

# generate the scp file that contains the file name (pattern) and full path
# e.g., 0_0_0_0_1_1_1_1 waves_yesno/0_0_0_0_1_1_1_1.wav
../../local/create_yesno_wav_scp.pl ${waves_dir} waves.test > ${test_base_name}_wav.scp

# same for the training set
../../local/create_yesno_wav_scp.pl ${waves_dir} waves.train > ${train_base_name}_wav.scp

# translate the wav file name into yes/no words
# e.g., 0_0_0_0_1_1_1_1 NO NO NO NO YES YES YES YES
../../local/create_yesno_txt.pl waves.test > ${test_base_name}.txt

../../local/create_yesno_txt.pl waves.train > ${train_base_name}.txt

# copy over the language model file
cp ../../input/task.arpabo lm_tg.arpa

cd ../..

# This stage was copied from WSJ example
for x in train_yesno test_yesno; do
  mkdir -p data/$x
  # copy over the script and text file
  cp data/local/${x}_wav.scp data/$x/wav.scp
  cp data/local/$x.txt data/$x/text

  # replace the second column with 'global'
  # for instance, 0_0_0_0_1_1_1_1 global
  cat data/$x/text | awk '{printf("%s global\n", $1);}' > data/$x/utt2spk

# convert the multiple lines (utt2spk) to a single line (spk2utt), using a
# single space to separate the utterance IDs
  # for instance, global 0_0_0_0_1_1_1_1 0_0_0_1_0_0_0_1 0_0_0_1_0_1_1_0 ....
  utils/utt2spk_to_spk2utt.pl <data/$x/utt2spk >data/$x/spk2utt
done


[2] local/prepare_dict.sh
#!/bin/bash                                                                     
                                                                                
mkdir -p data/local/dict                                                        
                                                                                
# cp the symbols for the targeted words                                         
cp input/lexicon_nosil.txt data/local/dict/lexicon_words.txt                    
                                                                                
# cp all symbols for the lexical analysis                                       
cp input/lexicon.txt data/local/dict/lexicon.txt                                
                                                                                
# copy the phones without silence
cat input/phones.txt | grep -v SIL > data/local/dict/nonsilence_phones.txt      
                                                                                
echo "SIL" > data/local/dict/silence_phones.txt                                 
                                                                                
echo "SIL" > data/local/dict/optional_silence.txt                               
                                                                                

echo "Dictionary preparation succeeded"


[3] utils/prepare_lang.sh --position-dependent-phones false data/local/dict "<SIL>" data/local/lang data/lang

# This script adds word-position-dependent phones and constructs a host of other
# derived files, that go in data/lang/.

This reformats the files from the dict directory into the lang directory.
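As a sanity check you can list the output directory; the contents below are what a standard Kaldi lang directory holds (names from the Kaldi docs, not captured from this run):

ls data/lang
# L.fst  L_disambig.fst  oov.int  oov.txt  phones  phones.txt  topo  words.txt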

[4] local/prepare_lm.sh
Preparing language models for test

This builds the grammar finite state transducer, G.fst. The log output:

arpa2fst -
Processing 1-grams
Connected 0 states without outgoing arcs.
fstisstochastic data/lang_test_tg/G.fst
1.20397 0
Succeeded in formatting data.
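You can eyeball the resulting grammar FST with OpenFst's fstprint (a quick inspection; words.txt serves as both the input and output symbol table for G.fst):

fstprint --isymbols=data/lang_test_tg/words.txt --osymbols=data/lang_test_tg/words.txt data/lang_test_tg/G.fst | head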

[5] Feature extraction                                                          
for x in train_yesno test_yesno; do                                          
 steps/make_mfcc.sh --nj 1 data/$x exp/make_mfcc/$x mfcc                      
 steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc                    
done

make_mfcc.sh calls compute-mfcc-feats.

If you add printouts to compute-mfcc-feats.cc, you can view the output in the following log file:
 $vim exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log
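To dump a few of the computed feature vectors in text form, you can use copy-feats from featbin (a sketch; the scp path assumes make_mfcc.sh's default output location):

. ./path.sh
copy-feats scp:data/train_yesno/feats.scp ark,t:- | head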

The code is in the featbin folder. You can measure the timing by adding

// added for timing
#include "base/timer.h"
#include "base/kaldi-common.h"
#include "base/kaldi-utils.h"

to the top of the file and wrapping the code to be measured with Kaldi's Timer (a sketch; the Timer starts on construction and Elapsed() returns seconds):

kaldi::Timer timer;
// ... code being measured, e.g. the MFCC computation ...
KALDI_LOG << "mfcc time : " << timer.Elapsed() << " (s)";

After recompiling the program and executing ./run.sh
in /home/leiming/kaldi-trunk/egs/yesno/s5,
you can check the training log for the timing:
$vim exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log 

Here are some profiling results on an i7-4790K CPU @ 4.00GHz.

1 frame of MFCC:
extract window time:          1.78814e-05 (s)
FFT time:                     5.96046e-06 (s)
power spectrum on mel banks:  2.14577e-06 (s)
apply log():                  9.53674e-07 (s)
DCT:                          9.53674e-07 (s)

We have 633 frames for utterance 0_0_0_0_1_1_1_1 (8 words). Summing the per-frame costs above gives about 0.00002789 s per frame, so that's around 0.00002789 x 633 = 17.7 ms.

Timing the MFCC computation for this single utterance directly gives 16.6 ms, i.e. roughly 2 ms per word.

For all 31 utterances: 16.6 ms x 31 = 514.6 ms.


Regarding CMVN (cepstral mean and variance normalization): it uses single-channel mode by default.
For all 31 utterances it takes 4.32205 ms.


[6]  Training
# Mono training                                                              
steps/train_mono.sh --nj 1 --cmd "$train_cmd"  --totgauss 400  data/train_yesno data/lang exp/mono0a

It skips feat-to-dim and runs gmm-init-mono, which initializes the GMM with the HMM topology. Check the timing in the following log:
$vim exp/mono0a/log/init.log
gmm-init-mono time : 0.00767899 (s)

Next, compile-train-graphs:
vim exp/mono0a/log/compile_graphs.1.log
compile train graph time : 0.0409088 (s)

Next, align-equal-compiled:
vim exp/mono0a/log/align.0.1.log
align equal compiled time : 0.031034 (s)


....
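Once training finishes, gmm-info prints a summary of the final model (a quick check that --totgauss 400 roughly took effect):

gmm-info exp/mono0a/final.mdl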


I did a quick hack on timing, measuring the time directly in the shell script (a sketch follows the numbers below).

data preparation time in (ms): 155
feature extraction time in (ms): 1112
train_mono time in (ms): 9448
compile graph in (ms): 31
decode in (ms): 837
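A minimal sketch of that shell-level timing, assuming you wrap each stage of run.sh the same way:

start=$(date +%s%N)
local/prepare_data.sh waves_yesno
end=$(date +%s%N)
echo "data preparation time in (ms): $(( (end - start) / 1000000 ))"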





Sunday, April 12, 2015

Day 1: Getting familiar with the framework

[1] A little history:
Kaldi began its existence in the 2009 Johns Hopkins University workshop cumbersomely titled "Low Development Cost, High Quality Speech Recognition for New Languages and Domains"

[2] Makefile
You can edit settings for different compilation options (debugging, speed, performance, precision, etc.) in ~/kaldi-trunk/src/kaldi.mk.

[3] Matrix library
The matrix library is heavily used in Kaldi; it is mostly a C++ wrapper for standard BLAS and LAPACK linear algebra routines.

[4] GPU
Kaldi has a CUDA matrix library. In ~/kaldi-trunk/src/cudamatrix, you can search for "#if HAVE_CUDA==1" in the .cc files.
In their implementation, they usually only run specific tasks on the GPU, mainly neural net training.

Kaldi is intended to be run in "exclusive mode"; whether it's process exclusive or thread exclusive doesn't matter. You can find out what mode your GPU is running in as follows:

# nvidia-smi --query | grep 'Compute Mode'
Compute Mode : Exclusive_Thread

You can set the correct mode by typing nvidia-smi -c 1. You might want to do this in a startup script so it happens each time you reboot.

    -c,   --compute-mode=       Set MODE for compute applications:
                                0/DEFAULT, 1/EXCLUSIVE_THREAD,
                                2/PROHIBITED, 3/EXCLUSIVE_PROCESS

[5] online decoding
By "online decoding" we mean decoding where the features are coming in in real time, and you don't want to wait until all the audio is captured before starting the online decoding. (We're not using the phrase "real-time decoding" because "real-time decoding" can also be used to mean decoding whose speed is not slower than real time, even if it is applied in batch mode).

The approach that we took with Kaldi was to focus for the first few years on off-line recognition, in order to reach state of the art performance as quickly as possible. Now we are making more of an effort to support online decoding.

There are two online-decoding setups: the "old" online-decoding setup, in the subdirectories online/ and onlinebin/, and the "new" decoding setup, in online2/ and online2bin/. The "old" online-decoding setup is now deprecated, and may eventually be removed from the trunk (but remain in ^/branches/complete).

[6] Keyword Search
They focus on word-level keyword search for simplicity, but the implementation naturally supports word-level as well as subword-level keyword search: both the LVCSR module and the KWS module are implemented using weighted finite state transducers (WFSTs), and the algorithm works as long as the symbol table properly maps words/subwords to integers.

There is a tutorial on YouTube that covers the basic structure of FSTs (you can start watching at 10:30).


Kaldi maps strings (words, phones) to integers to save RAM.
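For example, the word symbol table produced by prepare_lang.sh stores exactly this mapping. For the yesno recipe it should look roughly like this (the exact integer ids are an assumption):

cat data/lang/words.txt
# <eps> 0
# <SIL> 1
# NO 2
# YES 3
# #0 4
# <s> 5
# </s> 6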



Friday, April 10, 2015

Example 1 : yesno

go to ~/kaldi-trunk/egs/yesno/s5
view run.sh
follow its steps.

Here is the printout.

waves_yesno/
waves_yesno/1_0_0_0_0_0_1_1.wav
waves_yesno/1_1_0_0_1_0_1_0.wav
waves_yesno/1_0_1_1_1_1_0_1.wav
waves_yesno/1_1_1_1_0_1_0_0.wav
waves_yesno/0_0_1_1_1_0_0_0.wav
waves_yesno/0_1_1_1_1_1_1_1.wav
waves_yesno/0_1_0_1_1_1_0_0.wav
waves_yesno/1_0_1_1_1_0_1_0.wav
waves_yesno/1_0_0_1_0_1_1_1.wav
waves_yesno/0_0_1_0_1_0_0_0.wav
waves_yesno/0_1_0_1_1_0_1_0.wav
waves_yesno/0_0_1_1_0_1_1_0.wav
waves_yesno/1_0_0_0_1_0_0_1.wav
waves_yesno/1_1_0_1_1_1_1_0.wav
waves_yesno/0_0_1_1_1_1_0_0.wav
waves_yesno/1_1_0_0_1_1_1_0.wav
waves_yesno/0_0_1_1_0_1_1_1.wav
waves_yesno/1_1_0_1_0_1_1_0.wav
waves_yesno/0_1_0_0_0_1_1_0.wav
waves_yesno/0_0_0_1_0_0_0_1.wav
waves_yesno/0_0_1_0_1_0_1_1.wav
waves_yesno/0_0_1_0_0_0_1_0.wav
waves_yesno/1_1_0_1_1_0_0_1.wav
waves_yesno/0_1_1_1_0_1_0_1.wav
waves_yesno/0_1_1_1_0_0_0_0.wav
waves_yesno/README~
waves_yesno/0_1_0_0_0_1_0_0.wav
waves_yesno/1_0_0_0_0_0_0_1.wav
waves_yesno/1_1_0_1_1_0_1_1.wav
waves_yesno/1_1_0_0_0_0_0_1.wav
waves_yesno/1_0_0_0_0_0_0_0.wav
waves_yesno/0_1_1_1_1_0_1_0.wav
waves_yesno/0_0_1_1_0_1_0_0.wav
waves_yesno/1_1_1_0_0_0_0_1.wav
waves_yesno/1_0_1_0_1_0_0_1.wav
waves_yesno/0_1_0_0_1_0_1_1.wav
waves_yesno/0_0_1_1_1_1_1_0.wav
waves_yesno/1_1_0_0_0_1_1_1.wav
waves_yesno/0_1_1_1_0_0_1_0.wav
waves_yesno/1_1_0_1_0_1_0_0.wav
waves_yesno/1_1_1_1_1_1_1_1.wav
waves_yesno/0_0_1_0_1_0_0_1.wav
waves_yesno/1_1_1_1_0_0_1_0.wav
waves_yesno/0_0_1_1_1_0_0_1.wav
waves_yesno/0_1_0_1_0_0_0_0.wav
waves_yesno/1_1_1_1_1_0_0_0.wav
waves_yesno/README
waves_yesno/0_1_1_0_0_1_1_1.wav
waves_yesno/0_0_1_0_0_1_1_0.wav
waves_yesno/1_1_0_0_1_0_1_1.wav
waves_yesno/1_1_1_0_0_1_0_1.wav
waves_yesno/0_0_1_0_0_1_1_1.wav
waves_yesno/0_0_1_1_0_0_0_1.wav
waves_yesno/1_0_1_1_0_1_1_1.wav
waves_yesno/1_1_1_0_1_0_1_0.wav
waves_yesno/1_1_1_0_1_0_1_1.wav
waves_yesno/0_1_0_0_1_0_1_0.wav
waves_yesno/1_1_1_0_0_1_1_1.wav
waves_yesno/0_1_1_0_0_1_1_0.wav
waves_yesno/0_0_0_1_0_1_1_0.wav
waves_yesno/1_1_1_1_1_1_0_0.wav
waves_yesno/0_0_0_0_1_1_1_1.wav
Preparing train and test data
Dictionary preparation succeeded
Checking data/local/dict/silence_phones.txt ...
--> reading data/local/dict/silence_phones.txt
--> data/local/dict/silence_phones.txt is OK

Checking data/local/dict/optional_silence.txt ...
--> reading data/local/dict/optional_silence.txt
--> data/local/dict/optional_silence.txt is OK

Checking data/local/dict/nonsilence_phones.txt ...
--> reading data/local/dict/nonsilence_phones.txt
--> data/local/dict/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking data/local/dict/lexicon.txt
--> reading data/local/dict/lexicon.txt
--> data/local/dict/lexicon.txt is OK

Checking data/local/dict/extra_questions.txt ...
--> data/local/dict/extra_questions.txt is empty (this is OK)
--> SUCCESS [validating dictionary directory data/local/dict]

**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt
fstaddselfloops 'echo 4 |' 'echo 4 |' 
prepare_lang.sh: validating output directory
Checking data/lang/phones.txt ...
--> data/lang/phones.txt is OK

Checking words.txt: #0 ...
--> data/lang/words.txt has "#0"
--> data/lang/words.txt is OK

Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...
--> silence.txt and nonsilence.txt are disjoint
--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...
--> summation property is OK

Checking data/lang/phones/context_indep.{txt, int, csl} ...
--> 1 entry/entries in data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.int corresponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.csl corresponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.{txt, int, csl} are OK

Checking data/lang/phones/disambig.{txt, int, csl} ...
--> 2 entry/entries in data/lang/phones/disambig.txt
--> data/lang/phones/disambig.int corresponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.csl corresponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.{txt, int, csl} are OK

Checking data/lang/phones/nonsilence.{txt, int, csl} ...
--> 2 entry/entries in data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.int corresponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.csl corresponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.{txt, int, csl} are OK

Checking data/lang/phones/silence.{txt, int, csl} ...
--> 1 entry/entries in data/lang/phones/silence.txt
--> data/lang/phones/silence.int corresponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.csl corresponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.{txt, int, csl} are OK

Checking data/lang/phones/optional_silence.{txt, int, csl} ...
--> 1 entry/entries in data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.int corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.csl corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.{txt, int, csl} are OK

Checking data/lang/phones/roots.{txt, int} ...
--> 3 entry/entries in data/lang/phones/roots.txt
--> data/lang/phones/roots.int corresponds to data/lang/phones/roots.txt
--> data/lang/phones/roots.{txt, int} are OK

Checking data/lang/phones/sets.{txt, int} ...
--> 3 entry/entries in data/lang/phones/sets.txt
--> data/lang/phones/sets.int corresponds to data/lang/phones/sets.txt
--> data/lang/phones/sets.{txt, int} are OK

Checking data/lang/phones/extra_questions.{txt, int} ...
--> WARNING: the optional data/lang/phones/extra_questions.{txt, int} are empty!

Checking optional_silence.txt ...
--> reading data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1
--> data/lang/phones/disambig.txt has "#0" and "#1"
--> data/lang/phones/disambig.txt is OK

Checking topo ...
--> data/lang/topo's nonsilence section is OK
--> data/lang/topo's silence section is OK
--> data/lang/topo is OK

Checking data/lang/oov.{txt, int} ...
--> 1 entry/entries in data/lang/oov.txt
--> data/lang/oov.int corresponds to data/lang/oov.txt
--> data/lang/oov.{txt, int} are OK

--> data/lang/L.fst is olabel sorted
--> data/lang/L_disambig.fst is olabel sorted
--> WARNING (check output above for warnings)
Preparing language models for test
arpa2fst - 
Processing 1-grams
Connected 0 states without outgoing arcs.
fstisstochastic data/lang_test_tg/G.fst 
1.20397 0
Succeeded in formatting data.
steps/make_mfcc.sh --nj 1 data/train_yesno exp/make_mfcc/train_yesno mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker.  This probably a bad idea.
   Search for the word 'bold' in http://kaldi.sourceforge.net/data_prep.html
   for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train_yesno
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
Succeeded creating MFCC features for train_yesno
steps/compute_cmvn_stats.sh data/train_yesno exp/make_mfcc/train_yesno mfcc
Succeeded creating CMVN stats for train_yesno
steps/make_mfcc.sh --nj 1 data/test_yesno exp/make_mfcc/test_yesno mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker.  This probably a bad idea.
   Search for the word 'bold' in http://kaldi.sourceforge.net/data_prep.html
   for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/test_yesno
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
It seems not all of the feature files were successfully processed (29 != 31);
consider using utils/fix_data_dir.sh data/test_yesno
Less than 95% the features were successfully generated.  Probably a serious error.
steps/compute_cmvn_stats.sh data/test_yesno exp/make_mfcc/test_yesno mfcc
Succeeded creating CMVN stats for test_yesno
steps/train_mono.sh --nj 1 --cmd utils/run.pl --totgauss 400 data/train_yesno data/lang exp/mono0a
steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
steps/train_mono.sh: Aligning data equally (pass 0)
steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 2
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 3
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 4
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 5
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 6
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 7
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 8
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 9
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 10
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 11
steps/train_mono.sh: Pass 12
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 13
steps/train_mono.sh: Pass 14
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 15
steps/train_mono.sh: Pass 16
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 17
steps/train_mono.sh: Pass 18
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 19
steps/train_mono.sh: Pass 20
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 21
steps/train_mono.sh: Pass 22
steps/train_mono.sh: Pass 23
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 24
steps/train_mono.sh: Pass 25
steps/train_mono.sh: Pass 26
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 27
steps/train_mono.sh: Pass 28
steps/train_mono.sh: Pass 29
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 30
steps/train_mono.sh: Pass 31
steps/train_mono.sh: Pass 32
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 33
steps/train_mono.sh: Pass 34
steps/train_mono.sh: Pass 35
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 36
steps/train_mono.sh: Pass 37
steps/train_mono.sh: Pass 38
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 39
1 warnings in exp/mono0a/log/update.*.log
Done
fstdeterminizestar --use-log=true 
fstminimizeencoded 
fsttablecompose data/lang_test_tg/L_disambig.fst data/lang_test_tg/G.fst 
fstisstochastic data/lang_test_tg/tmp/LG.fst 
1.20412 0
[info]: LG not stochastic.
fstcomposecontext --context-size=1 --central-position=0 --read-disambig-syms=data/lang_test_tg/phones/disambig.int --write-disambig-syms=data/lang_test_tg/tmp/disambig_ilabels_1_0.int data/lang_test_tg/tmp/ilabels_1_0 
fstisstochastic data/lang_test_tg/tmp/CLG_1_0.fst 
1.20412 0
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/mono0a/graph_tgpr/disambig_tid.int --transition-scale=1.0 data/lang_test_tg/tmp/ilabels_1_0 exp/mono0a/tree exp/mono0a/final.mdl 
fsttablecompose exp/mono0a/graph_tgpr/Ha.fst data/lang_test_tg/tmp/CLG_1_0.fst 
fstminimizeencoded 
fstdeterminizestar --use-log=true 
fstrmsymbols exp/mono0a/graph_tgpr/disambig_tid.int 
fstrmepslocal 
fstisstochastic exp/mono0a/graph_tgpr/HCLGa.fst 
1.20412 -0.000430956
HCLGa is not stochastic
add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono0a/final.mdl 
steps/decode.sh --nj 1 --cmd utils/run.pl exp/mono0a/graph_tgpr data/test_yesno exp/mono0a/decode_test_yesno
** split_data.sh: warning, #lines is (utt2spk,feats.scp) is (31,29); you can 
**  use utils/fix_data_dir.sh data/test_yesno to fix this.
decode.sh: feature type is delta
%WER 0.00 [ 0 / 232, 0 ins, 0 del, 0 sub ] [PARTIAL] exp/mono0a/decode_test_yesno/wer_10



【1】 prepare the data
train_cmd="utils/run.pl"
decode_cmd="utils/run.pl"

# Download speeches
if [ ! -d waves_yesno ]; then
  wget http://www.openslr.org/resources/1/waves_yesno.tar.gz || exit 1;
  tar -xvzf waves_yesno.tar.gz || exit 1;
fi

train_base_name=train_yesno
test_base_name=test_yesno

rm -rf data exp mfcc

# Data preparation
local/prepare_data.sh waves_yesno

Preparing train and test data.

A new directory called "data" was created.
You should see three main types of folders:
    local:   Contains the dictionary for the current data.
    train_*: The data segmented from the corpora for training purposes.
    test_*:  The data segmented from the corpora for testing purposes.


In the prepare_data.sh script,
ls -1 ../../$waves_dir > waves_all.list
This saves the names of all input wave files into waves_all.list.

../../local/create_yesno_waves_test_train.pl waves_all.list waves.test waves.train
This Perl script splits the waves list into a first part (train) and a second part (test).
chomp removes the trailing newline ("\n") from each line.

../../local/create_yesno_wav_scp.pl ${waves_dir} waves.test > ${test_base_name}_wav.scp
../../local/create_yesno_wav_scp.pl ${waves_dir} waves.train > ${train_base_name}_wav.scp
These generate the utterance ID and full wav path for each wave:
0_0_0_0_1_1_1_1 waves_yesno/0_0_0_0_1_1_1_1.wav

../../local/create_yesno_txt.pl waves.test > ${test_base_name}.txt
../../local/create_yesno_txt.pl waves.train > ${train_base_name}.txt
These translate the 1s and 0s in each file name into YES and NO transcripts for the test and training sets.

cp ../../input/task.arpabo lm_tg.arpa
It copies the language model to lm_tg.arpa. The numbers in the ARPA file are log10 probabilities; for instance, "-1 NO" means P(NO) = 10^-1 = 0.1.
\data\
ngram 1=3

\1-grams:
-1 NO
-1 YES
-99 <s>
-1 </s>

\end\

Then, back at the top level of the s5 folder:
for x in train_yesno test_yesno; do
  mkdir -p data/$x
  cp data/local/${x}_wav.scp data/$x/wav.scp
  cp data/local/$x.txt data/$x/text
  cat data/$x/text | awk '{printf("%s global\n", $1);}' > data/$x/utt2spk
  utils/utt2spk_to_spk2utt.pl <data/$x/utt2spk >data/$x/spk2utt
done

This copies the files from data/local into data/ for both train_yesno and test_yesno.
utt2spk maps each utterance ID (the string of 1s and 0s) to its speaker, which is always "global" here:


0_0_0_0_1_1_1_1 global
0_0_0_1_0_0_0_1 global
...

The last line inverts this mapping: spk2utt lists all utterance IDs of a speaker on a single line.
global 0_0_0_0_1_1_1_1 0_0_0_1_0_0_0_1 0_0_0_1_0_1_1_0 0_0_1_0_0_0_1_0 0_0_1_0_0_1_1_0 0_0_1_0_0_1_1_1 0_0_1_0_1_0_0 ........




【2】 generate (prepare) the dictionary
local/prepare_dict.sh
Its output goes under ~/kaldi-trunk/egs/yesno/s5/data/local.

mkdir -p data/local/dict
cp input/lexicon_nosil.txt data/local/dict/lexicon_words.txt
cp input/lexicon.txt data/local/dict/lexicon.txt
This makes a dict directory in the data/local folder and copies the lexicon (with and without the silence word <SIL>) to the lexicon*.txt files.

cat input/phones.txt | grep -v SIL > data/local/dict/nonsilence_phones.txt
echo "SIL" > data/local/dict/silence_phones.txt
echo "SIL" > data/local/dict/optional_silence.txt
These produce the phone lists with and without SIL.

Not all of these files are in "native" Kaldi formats; some of them cannot be read directly by Kaldi's C++ programs and need to be processed with OpenFst tools before Kaldi can use them.

  lexicon.txt : This is the lexicon.
  silence*.txt : These files contain information about which phones are silent and which are not.
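For this recipe the lexicon is tiny; based on the yesno inputs, the files should look like this:

cat data/local/dict/lexicon.txt
# <SIL> SIL
# YES Y
# NO N

cat data/local/dict/nonsilence_phones.txt
# Y
# N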

【3】 language parameters
The next step is to create the raw language files that Kaldi uses. In most cases, these will be text files in integer formats. Make sure that you are back in the s5 directory and execute the following command:

utils/prepare_lang.sh --position-dependent-phones false data/local/dict "<SIL>" data/local/lang data/lang

It first creates lexiconp.txt (the lexicon with pronunciation probabilities; example below) from lexicon.txt under data/local/dict.


<SIL> 1.0   SIL                                                              
YES 1.0 Y                                                                    
NO 1.0  N

It creates phone_map.txt. The following is an example:
  # AA AA_B AA_E AA_I AA_S
  # for (B)egin, (E)nd, (I)nternal and (S)ingleton
  # and in the case of silence
  # SIL SIL SIL_B SIL_E SIL_I SIL_S

There are silence_phones.txt and nonsilence_phones.txt files.

Here is some of the printed output:
Checking data/local/dict/silence_phones.txt
Checking data/local/dict/optional_silence.txt
Checking data/local/dict/lexicon.txt

Checking data/local/dict/extra_questions.txt ...
--> data/local/dict/extra_questions.txt is empty (this is OK)

**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt
prepare_lang.sh: validating output directory
Checking data/lang/phones.txt ...
--> data/lang/phones.txt is OK

Checking data/lang/phones/context_indep.{txt, int, csl} ...
Checking data/lang/phones/disambig.{txt, int, csl} ...
Checking data/lang/phones/nonsilence.{txt, int, csl} ...
Checking data/lang/phones/silence.{txt, int, csl} ...
Checking data/lang/phones/optional_silence.{txt, int, csl} ...
Checking data/lang/phones/roots.{txt, int} ...
Checking data/lang/phones/sets.{txt, int} ...
Checking data/lang/phones/extra_questions.{txt, int} ...
Checking optional_silence.txt ...
Checking disambiguation symbols: #0 and #1
Checking topo ...
Checking data/lang/oov.{txt, int} ...
--> data/lang/L.fst is olabel sorted
--> data/lang/L_disambig.fst is olabel sorted
--> WARNING (check output above for warnings)

phones.txt:  creates the phone symbol table
words.txt:   creates the word symbol table
roots file:  determines how the phones are grouped when building the decision tree
topo:        uses utils/gen_topo.pl to generate the phone topology file; this controls the number of states in the non-silence HMMs and in the silence HMMs
oov:         contains the word that any out-of-vocabulary word is mapped to during training
L.fst / L_disambig.fst: generated by utils/make_lexicon_fst.pl (or utils/make_lexicon_fst_silprob.pl when silence probabilities are used)




This creates new lang folders (a temporary one under data/local and the final data/lang) containing an FST describing the language in question. Look at the script.

It transforms some of the files created in data/ to a more normalized form that is read by Kaldi.
This script creates its output in the data/lang/ directory. The files we mention below will be in that directory.

The first two files this script creates are called words.txt and phones.txt (both in the directory data/lang/).

Look at the files with suffix .csl (in data/lang/phones). These are colon-separated lists of the integer IDs of the non-silence and silence phones, respectively.

Look at phones.txt (in data/lang/). This file is a phone symbol table that also handles the "disambiguation symbols" used in the standard FST recipe. These symbols are conventionally called #1, #2 and so on; see the paper "Speech Recognition with Weighted Finite State Transducers". We also add a symbol #0 which replaces epsilon transitions in the language model; see Disambiguation symbols for more information. How many disambiguation symbols are there? In some recipes the number of disambiguation symbols is the same as the maximum number of words that share the same pronunciation.

The file L.fst is the compiled lexicon in FST format. To see what kind of information is in it, you can (from s5/), do:

fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head


【4】 build the language model
local/prepare_lm.sh
This converts the ARPA language model (lm_tg.arpa, copied earlier into data/local) into the grammar FST data/lang_test_tg/G.fst.
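As the log output earlier shows, the script feeds the ARPA file through arpa2fst. Newer Kaldi versions can do the whole conversion in a single command, roughly like this (the flags are from current arpa2fst; this 2015-era checkout may not support them):

arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang_test_tg/words.txt data/local/lm_tg.arpa data/lang_test_tg/G.fst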

【5】 feature extraction
# Feature extraction
for x in train_yesno test_yesno; do
 steps/make_mfcc.sh --nj 1 data/$x exp/make_mfcc/$x mfcc
 steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc
done

To see the logging output of the program that creates the MFCCs:
 $vim exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log 

【6】 train monophone models

# Mono training
steps/train_mono.sh --nj 1 --cmd "$train_cmd" \
  --totgauss 400 \
  data/train_yesno data/lang exp/mono0a

【7】 create the decode graph

# Graph compilation
utils/mkgraph.sh --mono data/lang_test_tg exp/mono0a exp/mono0a/graph_tgpr


【8】 monophone decoding

# Decoding
steps/decode.sh --nj 1 --cmd "$decode_cmd" \
    exp/mono0a/graph_tgpr data/test_yesno exp/mono0a/decode_test_yesno


To print the final results,
for x in exp/*/decode*; do [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh; done

Here is a sample output.

%WER 0.00 [ 0 / 232, 0 ins, 0 del, 0 sub ] exp/mono0a/decode_test_yesno/wer_10
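To read the actual recognized word sequences rather than just the WER, you can extract the best path from the decoding lattices (a sketch; paths assume the decode directory above and nj=1):

. ./path.sh
lattice-best-path "ark:gunzip -c exp/mono0a/decode_test_yesno/lat.1.gz |" ark,t:- | utils/int2sym.pl -f 2- exp/mono0a/graph_tgpr/words.txt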


--------------------------------------------------------------------------------------------------------------------------
http://kaldi.sourceforge.net/tutorial_running.html

Thursday, April 9, 2015

Useful Materials

Here are four lectures shared by Dan Povey
https://sites.google.com/site/dpovey/kaldi-lectures



Terms

[1] statistical model
a probability distribution that can adequately approximate the true distribution of observed data.

[2] sphere format
speech data are compressed using a compression scheme called "shorten"

[3] n-gram model

[4] likelihood vs. probability




Discussion on the input data source.
[1] http://sourceforge.net/p/kaldi/mailman/kaldi-users/thread/5341091F.80806@cantabResearch.com/


Installation

http://kaldi.sourceforge.net/install.html

$svn co https://svn.code.sf.net/p/kaldi/code/trunk kaldi-trunk

I am using Ubuntu 14.04 64-bit with gcc-4.7.

contact kaldi-developers@lists.sourceforge.net for questions.

sudo apt-get install gcc-4.7 g++-4.7 -y
To configure the gcc version, please check this link.

Install external libraries (inside the tools/ folder)

ATLAS
Turn off CPU throttling:
$/usr/bin/cpufreq-selector -g performance

I came across lots of errors after running the script install_atlas.sh under kaldi-trunk/tools/. 

You can instead install ATLAS or OpenBLAS from the Ubuntu repositories:
$sudo apt-get install libatlas3gf-base libatlas-dev -y
$sudo apt-get install libopenblas-dev -y

If 'libopenblas-dev' has no installation candidate, try the following:
$sudo apt-get install libblas-dev

$sudo apt-get install libtool


Here are some quotes:
1) You don't have to build the ATLAS library; as Dan said, it is sufficient to install whatever package your system distribution provides. ATLAS is sometimes quite hard to build because it tries to optimize everything for the machine on which it's being compiled. It also means that when you have some specific machine configuration, the automatic build fails.
2) Kaldi supports OpenBLAS and/or Intel MKL already. You just have to install either one of those and call the configure script with the correct parameters; configure --help lists these parameters.
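For example, a configure invocation against OpenBLAS might look like this (a sketch; the --openblas-root path is an assumption and depends on where the library is installed, so check ./configure --help first):

cd kaldi-trunk/src
./configure --help                                    # lists the math-library options
./configure --mathlib=OPENBLAS --openblas-root=/usr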

Run the script ~/kaldi-trunk/tools/install_portaudio.sh
(p.s. no sudo is needed; no errors)
You will see the following message:
Libraries have been installed in:
   /home/leiming/kaldi-trunk/tools/portaudio/install/lib
...
On some systems (e.g. Linux) you should run 'ldconfig' now to make the shared object available.  You may also need to modify your LD_LIBRARY_PATH environment variable to include the directory /home/leiming/kaldi-trunk/tools/portaudio/install/lib

Add the lib directory to LD_LIBRARY_PATH in ~/.bashrc (export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/leiming/kaldi-trunk/tools/portaudio/install/lib), then source ~/.bashrc.

Download ATLAS and unzip the package:
wget -T 10 -t 3 http://sourceforge.net/projects/math-atlas/files/Stable/3.10.0/atlas3.10.0.tar.bz2 
tar -xvjf atlas3.10.0.tar.bz2

Remaining inside the tools folder: according to kaldi-trunk/tools/INSTALL, you can simply type make to install the tools.

My PC has an Intel i7, so I use 8 threads for make.
$make -j 8
Since no errors showed up, we are good. 

By default, Kaldi builds against OpenFst-1.3.4. If you want to build against
OpenFst-1.4, edit the Makefile in this folder. Note that this change requires
a relatively new compiler with C++11 support, e.g. gcc >= 4.6, clang >= 3.0.


Install OpenFst 1.4:
download openfst-1.4.1,
$wget http://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.4.1.tar.gz 
$tar -xvzf openfst-1.4.1.tar.gz
Now cd to kaldi-trunk/tools/openfst-1.4.1/src/include/fst and apply the patch:
patch -p0 -N <../../../../extras/openfst-1.4.1.patch

Then, back in the tools folder, point the openfst symlink at the new version:
rm openfst 2>/dev/null
ln -s openfst-1.4.1 openfst


cd openfst-1.4.1/
./configure --prefix=`pwd` --enable-static --disable-shared
make

make install


Here is a suggestion from the installer output:
----------------------------------------------------------------------
Libraries have been installed in:
   /home/leiming/kaldi-trunk/tools/openfst-1.3.4/lib/fst

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

In my case, I edited the ~/.bashrc file and added the following line.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/leiming/kaldi-trunk/tools/openfst-1.3.4/lib/fst

Then, $source ~/.bashrc

Build the source (in the src/ folder)

go to /kaldi-trunk/src,
$./configure

After configuring, you will see the following message:
Configuring ...
Checking OpenFST library in /home/leiming/kaldi-trunk/tools/openfst ...
Checking OpenFst library was patched.
Doing OS specific configurations ...
On Linux: Checking for linear algebra header files ...
Using ATLAS as the linear algebra library.
Successfully configured for Debian/Ubuntu Linux [dynamic libraries] with ATLASLIBS =/usr/lib/libatlas.so.3  /usr/lib/libf77blas.so.3 /usr/lib/libcblas.so.3  /usr/lib/liblapack_atlas.so.3
Using CUDA toolkit /usr/local/cuda (nvcc compiler and runtime libraries)
Static=[false] Speex library not found: You can still build Kaldi without Speex.

The Speex library is used for audio compression when doing online decoding.


If you see the "cuda will not be used!" message, either you don't have CUDA installed or the path is wrong.
In my case, I have several versions of CUDA, so I created a soft link (in /usr/local) to solve this issue:
$sudo ln -s cuda-7.0 cuda

$make depend -j 8
$make -j 8

To automatically use all CPU cores when running make:
make -j$(nproc)


For the new CUDA driver >= 6.5
nvcc fatal   : Unsupported gpu architecture 'compute_13'
make[1]: *** [cu-kernels.o] Error 1

go to src/cudamatrix/ and find in the Makefile:
  CUDA_ARCH=-gencode arch=compute_13,code=sm_13 \

remove "-gencode arch=compute_13,code=sm_13"

then, make -j 8
then,  make -j 8