Kaldi-related: Example 1 : yesno

Go to ~/kaldi-trunk/egs/yesno/s5, view run.sh, and follow its steps.

Here is the printout.

waves_yesno/
waves_yesno/1_0_0_0_0_0_1_1.wav
waves_yesno/1_1_0_0_1_0_1_0.wav
waves_yesno/1_0_1_1_1_1_0_1.wav
waves_yesno/1_1_1_1_0_1_0_0.wav
waves_yesno/0_0_1_1_1_0_0_0.wav
waves_yesno/0_1_1_1_1_1_1_1.wav
waves_yesno/0_1_0_1_1_1_0_0.wav
waves_yesno/1_0_1_1_1_0_1_0.wav
waves_yesno/1_0_0_1_0_1_1_1.wav
waves_yesno/0_0_1_0_1_0_0_0.wav
waves_yesno/0_1_0_1_1_0_1_0.wav
waves_yesno/0_0_1_1_0_1_1_0.wav
waves_yesno/1_0_0_0_1_0_0_1.wav
waves_yesno/1_1_0_1_1_1_1_0.wav
waves_yesno/0_0_1_1_1_1_0_0.wav
waves_yesno/1_1_0_0_1_1_1_0.wav
waves_yesno/0_0_1_1_0_1_1_1.wav
waves_yesno/1_1_0_1_0_1_1_0.wav
waves_yesno/0_1_0_0_0_1_1_0.wav
waves_yesno/0_0_0_1_0_0_0_1.wav
waves_yesno/0_0_1_0_1_0_1_1.wav
waves_yesno/0_0_1_0_0_0_1_0.wav
waves_yesno/1_1_0_1_1_0_0_1.wav
waves_yesno/0_1_1_1_0_1_0_1.wav
waves_yesno/0_1_1_1_0_0_0_0.wav
waves_yesno/README~
waves_yesno/0_1_0_0_0_1_0_0.wav
waves_yesno/1_0_0_0_0_0_0_1.wav
waves_yesno/1_1_0_1_1_0_1_1.wav
waves_yesno/1_1_0_0_0_0_0_1.wav
waves_yesno/1_0_0_0_0_0_0_0.wav
waves_yesno/0_1_1_1_1_0_1_0.wav
waves_yesno/0_0_1_1_0_1_0_0.wav
waves_yesno/1_1_1_0_0_0_0_1.wav
waves_yesno/1_0_1_0_1_0_0_1.wav
waves_yesno/0_1_0_0_1_0_1_1.wav
waves_yesno/0_0_1_1_1_1_1_0.wav
waves_yesno/1_1_0_0_0_1_1_1.wav
waves_yesno/0_1_1_1_0_0_1_0.wav
waves_yesno/1_1_0_1_0_1_0_0.wav
waves_yesno/1_1_1_1_1_1_1_1.wav
waves_yesno/0_0_1_0_1_0_0_1.wav
waves_yesno/1_1_1_1_0_0_1_0.wav
waves_yesno/0_0_1_1_1_0_0_1.wav
waves_yesno/0_1_0_1_0_0_0_0.wav
waves_yesno/1_1_1_1_1_0_0_0.wav
waves_yesno/README
waves_yesno/0_1_1_0_0_1_1_1.wav
waves_yesno/0_0_1_0_0_1_1_0.wav
waves_yesno/1_1_0_0_1_0_1_1.wav
waves_yesno/1_1_1_0_0_1_0_1.wav
waves_yesno/0_0_1_0_0_1_1_1.wav
waves_yesno/0_0_1_1_0_0_0_1.wav
waves_yesno/1_0_1_1_0_1_1_1.wav
waves_yesno/1_1_1_0_1_0_1_0.wav
waves_yesno/1_1_1_0_1_0_1_1.wav
waves_yesno/0_1_0_0_1_0_1_0.wav
waves_yesno/1_1_1_0_0_1_1_1.wav
waves_yesno/0_1_1_0_0_1_1_0.wav
waves_yesno/0_0_0_1_0_1_1_0.wav
waves_yesno/1_1_1_1_1_1_0_0.wav
waves_yesno/0_0_0_0_1_1_1_1.wav
Preparing train and test data
Dictionary preparation succeeded
Checking data/local/dict/silence_phones.txt ...
--> reading data/local/dict/silence_phones.txt
--> data/local/dict/silence_phones.txt is OK

Checking data/local/dict/optional_silence.txt ...
--> reading data/local/dict/optional_silence.txt
--> data/local/dict/optional_silence.txt is OK

Checking data/local/dict/nonsilence_phones.txt ...
--> reading data/local/dict/nonsilence_phones.txt
--> data/local/dict/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking data/local/dict/lexicon.txt
--> reading data/local/dict/lexicon.txt
--> data/local/dict/lexicon.txt is OK

Checking data/local/dict/extra_questions.txt ...
--> data/local/dict/extra_questions.txt is empty (this is OK)
--> SUCCESS [validating dictionary directory data/local/dict]

**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt
fstaddselfloops 'echo 4 |' 'echo 4 |'
prepare_lang.sh: validating output directory
Checking data/lang/phones.txt ...
--> data/lang/phones.txt is OK

Checking words.txt: #0 ...
--> data/lang/words.txt has "#0"
--> data/lang/words.txt is OK

Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...
--> silence.txt and nonsilence.txt are disjoint
--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...
--> summation property is OK

Checking data/lang/phones/context_indep.{txt, int, csl} ...
--> 1 entry/entries in data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.int corresponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.csl corresponds to data/lang/phones/context_indep.txt
--> data/lang/phones/context_indep.{txt, int, csl} are OK

Checking data/lang/phones/disambig.{txt, int, csl} ...
--> 2 entry/entries in data/lang/phones/disambig.txt
--> data/lang/phones/disambig.int corresponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.csl corresponds to data/lang/phones/disambig.txt
--> data/lang/phones/disambig.{txt, int, csl} are OK

Checking data/lang/phones/nonsilence.{txt, int, csl} ...
--> 2 entry/entries in data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.int corresponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.csl corresponds to data/lang/phones/nonsilence.txt
--> data/lang/phones/nonsilence.{txt, int, csl} are OK

Checking data/lang/phones/silence.{txt, int, csl} ...
--> 1 entry/entries in data/lang/phones/silence.txt
--> data/lang/phones/silence.int corresponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.csl corresponds to data/lang/phones/silence.txt
--> data/lang/phones/silence.{txt, int, csl} are OK

Checking data/lang/phones/optional_silence.{txt, int, csl} ...
--> 1 entry/entries in data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.int corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.csl corresponds to data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.{txt, int, csl} are OK

Checking data/lang/phones/roots.{txt, int} ...
--> 3 entry/entries in data/lang/phones/roots.txt
--> data/lang/phones/roots.int corresponds to data/lang/phones/roots.txt
--> data/lang/phones/roots.{txt, int} are OK

Checking data/lang/phones/sets.{txt, int} ...
--> 3 entry/entries in data/lang/phones/sets.txt
--> data/lang/phones/sets.int corresponds to data/lang/phones/sets.txt
--> data/lang/phones/sets.{txt, int} are OK

Checking data/lang/phones/extra_questions.{txt, int} ...
--> WARNING: the optional data/lang/phones/extra_questions.{txt, int} are empty!

Checking optional_silence.txt ...
--> reading data/lang/phones/optional_silence.txt
--> data/lang/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1
--> data/lang/phones/disambig.txt has "#0" and "#1"
--> data/lang/phones/disambig.txt is OK

Checking topo ...
--> data/lang/topo's nonsilence section is OK
--> data/lang/topo's silence section is OK
--> data/lang/topo is OK

Checking data/lang/oov.{txt, int} ...
--> 1 entry/entries in data/lang/oov.txt
--> data/lang/oov.int corresponds to data/lang/oov.txt
--> data/lang/oov.{txt, int} are OK

--> data/lang/L.fst is olabel sorted
--> data/lang/L_disambig.fst is olabel sorted
--> WARNING (check output above for warnings)
Preparing language models for test
arpa2fst -
Processing 1-grams
Connected 0 states without outgoing arcs.
fstisstochastic data/lang_test_tg/G.fst
1.20397 0
Succeeded in formatting data.
steps/make_mfcc.sh --nj 1 data/train_yesno exp/make_mfcc/train_yesno mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word 'bold' in http://kaldi.sourceforge.net/data_prep.html
for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train_yesno
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
Succeeded creating MFCC features for train_yesno
steps/compute_cmvn_stats.sh data/train_yesno exp/make_mfcc/train_yesno mfcc
Succeeded creating CMVN stats for train_yesno
steps/make_mfcc.sh --nj 1 data/test_yesno exp/make_mfcc/test_yesno mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word 'bold' in http://kaldi.sourceforge.net/data_prep.html
for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/test_yesno
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
It seems not all of the feature files were successfully processed (29 != 31);
consider using utils/fix_data_dir.sh data/test_yesno
Less than 95% the features were successfully generated. Probably a serious error.
steps/compute_cmvn_stats.sh data/test_yesno exp/make_mfcc/test_yesno mfcc
Succeeded creating CMVN stats for test_yesno
steps/train_mono.sh --nj 1 --cmd utils/run.pl --totgauss 400 data/train_yesno data/lang exp/mono0a
steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
steps/train_mono.sh: Aligning data equally (pass 0)
steps/train_mono.sh: Pass 1
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 2
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 3
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 4
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 5
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 6
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 7
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 8
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 9
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 10
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 11
steps/train_mono.sh: Pass 12
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 13
steps/train_mono.sh: Pass 14
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 15
steps/train_mono.sh: Pass 16
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 17
steps/train_mono.sh: Pass 18
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 19
steps/train_mono.sh: Pass 20
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 21
steps/train_mono.sh: Pass 22
steps/train_mono.sh: Pass 23
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 24
steps/train_mono.sh: Pass 25
steps/train_mono.sh: Pass 26
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 27
steps/train_mono.sh: Pass 28
steps/train_mono.sh: Pass 29
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 30
steps/train_mono.sh: Pass 31
steps/train_mono.sh: Pass 32
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 33
steps/train_mono.sh: Pass 34
steps/train_mono.sh: Pass 35
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 36
steps/train_mono.sh: Pass 37
steps/train_mono.sh: Pass 38
steps/train_mono.sh: Aligning data
steps/train_mono.sh: Pass 39
1 warnings in exp/mono0a/log/update.*.log
Done
fstdeterminizestar --use-log=true
fstminimizeencoded
fsttablecompose data/lang_test_tg/L_disambig.fst data/lang_test_tg/G.fst
fstisstochastic data/lang_test_tg/tmp/LG.fst
1.20412 0
[info]: LG not stochastic.
fstcomposecontext --context-size=1 --central-position=0 --read-disambig-syms=data/lang_test_tg/phones/disambig.int --write-disambig-syms=data/lang_test_tg/tmp/disambig_ilabels_1_0.int data/lang_test_tg/tmp/ilabels_1_0
fstisstochastic data/lang_test_tg/tmp/CLG_1_0.fst
1.20412 0
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/mono0a/graph_tgpr/disambig_tid.int --transition-scale=1.0 data/lang_test_tg/tmp/ilabels_1_0 exp/mono0a/tree exp/mono0a/final.mdl
fsttablecompose exp/mono0a/graph_tgpr/Ha.fst data/lang_test_tg/tmp/CLG_1_0.fst
fstminimizeencoded
fstdeterminizestar --use-log=true
fstrmsymbols exp/mono0a/graph_tgpr/disambig_tid.int
fstrmepslocal
fstisstochastic exp/mono0a/graph_tgpr/HCLGa.fst
1.20412 -0.000430956
HCLGa is not stochastic
add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono0a/final.mdl
steps/decode.sh --nj 1 --cmd utils/run.pl exp/mono0a/graph_tgpr data/test_yesno exp/mono0a/decode_test_yesno
** split_data.sh: warning, #lines is (utt2spk,feats.scp) is (31,29); you can
** use utils/fix_data_dir.sh data/test_yesno to fix this.
decode.sh: feature type is delta
%WER 0.00 [ 0 / 232, 0 ins, 0 del, 0 sub ] [PARTIAL] exp/mono0a/decode_test_yesno/wer_10

【1】 download the data and prepare the train/test sets

train_cmd="utils/run.pl"
decode_cmd="utils/run.pl"

# Download the audio files (only if they are not already present)
if [ ! -d waves_yesno ]; then
  wget http://www.openslr.org/resources/1/waves_yesno.tar.gz || exit 1;
  tar -xvzf waves_yesno.tar.gz || exit 1;
fi

train_yesno=train_yesno
test_base_name=test_yesno

# Remove the outputs of any previous run
rm -rf data exp mfcc

# Data preparation
local/prepare_data.sh waves_yesno

The script prints "Preparing train and test data".

A new directory called "data" is created. You should see three main types of folders in it:
local: contains the dictionary for the current data.
train_*: the data segmented from the corpus for training purposes.
test_*: the data segmented from the corpus for testing purposes.

In the prepare_data.sh script,
ls -1 ../../$waves_dir > waves_all.list
saves the names of the input audio files, one per line, into waves_all.list.

../../local/create_yesno_waves_test_train.pl waves_all.list waves.test waves.train
This Perl script splits the wave list into two halves: the first half goes to the training list (waves.train) and the second half to the test list (waves.test).
It uses chomp to remove the trailing newline ("\n") from each line it reads.
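The split itself is simple. Below is a shell/awk sketch of the same idea, not the Perl script: it counts the lines of the list and sends lines 1 to n/2 to the training list and the rest to the test list. split_waves is a hypothetical name, and the four file names are only a toy example.

```shell
#!/usr/bin/env bash
# Sketch (not create_yesno_waves_test_train.pl): first half of a list
# file -> train list, second half -> test list.
set -euo pipefail

split_waves() {
  local all=$1 test_list=$2 train_list=$3
  local n
  n=$(wc -l < "$all")   # total number of entries in the list
  # Lines 1..n/2 go to the training list; the remaining lines to the test list.
  awk -v n="$n" -v te="$test_list" -v tr="$train_list" \
    'NR <= n / 2 { print > tr; next } { print > te }' "$all"
}

# Usage example on a toy list of four (hypothetical) file names.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' 0_0_0_0_1_1_1_1.wav 0_0_0_1_0_0_0_1.wav \
  1_0_0_0_0_0_0_0.wav 1_1_1_1_1_1_1_1.wav > "$tmp/waves_all.list"
split_waves "$tmp/waves_all.list" "$tmp/waves.test" "$tmp/waves.train"
cat "$tmp/waves.train"   # the first two names
```

Because the split follows list order, whatever ls -1 puts at the end of the list lands in the test set.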

../../local/create_yesno_wav_scp.pl ${waves_dir} waves.test > ${test_base_name}_wav.scp
../../local/create_yesno_wav_scp.pl ${waves_dir} waves.train > ${train_base_name}_wav.scp
For each wave file, it writes the utterance ID (the file name without .wav) followed by the path to the audio file, e.g.:

0_0_0_0_1_1_1_1 waves_yesno/0_0_0_0_1_1_1_1.wav
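A shell sketch that produces lines in the same format (make_wav_scp is a hypothetical helper, not the Perl script): the utterance ID is the file name with .wav stripped, followed by the path.

```shell
#!/usr/bin/env bash
# Sketch of the wav.scp format: "<utterance-id> <path-to-wav>" per line.
set -euo pipefail

make_wav_scp() {
  local waves_dir=$1 list=$2
  local f
  while IFS= read -r f; do
    # ${f%.wav} removes the .wav suffix to form the utterance ID
    printf '%s %s/%s\n' "${f%.wav}" "$waves_dir" "$f"
  done < "$list"
}

make_wav_scp waves_yesno <(printf '%s\n' 0_0_0_0_1_1_1_1.wav)
# prints: 0_0_0_0_1_1_1_1 waves_yesno/0_0_0_0_1_1_1_1.wav
```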

../../local/create_yesno_txt.pl waves.test > ${test_base_name}.txt
../../local/create_yesno_txt.pl waves.train > ${train_base_name}.txt
It converts each 1 and 0 in the file names into YES and NO, producing the transcripts for the test and training sets.
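A shell/awk sketch of the same conversion (make_text is a hypothetical helper, not the Perl script). Each output line is the utterance ID followed by one word per digit:

```shell
#!/usr/bin/env bash
# Sketch of the transcript ("text") format: 1 -> YES, 0 -> NO.
set -euo pipefail

make_text() {
  # Reads wave file names (e.g. 0_0_0_0_1_1_1_1.wav) on stdin.
  awk '{
    id = $1; sub(/\.wav$/, "", id)      # strip the extension
    n = split(id, bits, "_")            # split the ID into its digits
    line = id
    for (i = 1; i <= n; i++) line = line " " (bits[i] == "1" ? "YES" : "NO")
    print line
  }'
}

echo 0_0_0_0_1_1_1_1.wav | make_text
# prints: 0_0_0_0_1_1_1_1 NO NO NO NO YES YES YES YES
```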

cp ../../input/task.arpabo lm_tg.arpa
It copies the language model, a unigram model in ARPA format, to lm_tg.arpa:

\data\
ngram 1=4

\1-grams:
-1 NO
-1 YES
-99 <s>
-1 </s>

\end\
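ARPA log-probabilities are base 10, so NO, YES and </s> each have probability 0.1, while <s> (-99) has essentially none. The mass sums to 0.3 rather than 1, which appears to be why the first number fstisstochastic prints for data/lang_test_tg/G.fst in the printout is 1.20397, i.e. -ln(0.3). The awk sketch below only does this arithmetic; unigram_cost is a hypothetical name, and its "<log10-prob> <word>" input is copied from the listing above.

```shell
#!/usr/bin/env bash
# Sum the probability of the unigrams that can be emitted (skip <s>)
# and print -ln(total) in natural-log units.
set -euo pipefail

unigram_cost() {
  # stdin: lines of "<log10-prob> <word>" from the \1-grams: section
  awk '$2 != "<s>" { total += 10 ^ $1 } END { printf "%.5f\n", -log(total) }'
}

printf '%s\n' "-1 NO" "-1 YES" "-99 <s>" "-1 </s>" | unigram_cost
# prints: 1.20397
```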

The script then returns to the top level of the s5 folder and runs:

for x in train_yesno test_yesno; do
mkdir -p data/$x
cp data/local/${x}_wav.scp data/$x/wav.scp
cp data/local/$x.txt data/$x/text
cat data/$x/text | awk '{printf("%s global\n", $1);}' > data/$x/utt2spk
utils/utt2spk_to_spk2utt.pl <data/$x/utt2spk >data/$x/spk2utt
done

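This copies wav.scp and text from data/local into data/train_yesno and data/test_yesno.
utt2spk maps each utterance ID (the sequence of 1s and 0s) to a speaker. Since there is no speaker information, every utterance is assigned to a single speaker called "global":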

0_0_0_0_1_1_1_1 global
0_0_0_1_0_0_0_1 global
...

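The last command, utils/utt2spk_to_spk2utt.pl, inverts this mapping into spk2utt: one line per speaker, listing all of that speaker's utterance IDs. With a single speaker, the result is one long line: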

global 0_0_0_0_1_1_1_1 0_0_0_1_0_0_0_1 0_0_0_1_0_1_1_0 0_0_1_0_0_0_1_0 0_0_1_0_0_1_1_0 0_0_1_0_0_1_1_1 0_0_1_0_1_0_0 ........
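The inversion can be sketched in awk as below. This illustrates the spk2utt format only and is not the code of utils/utt2spk_to_spk2utt.pl; utterances are grouped per speaker in order of first appearance.

```shell
#!/usr/bin/env bash
# Sketch: turn "<utt> <spk>" lines into "<spk> <utt1> <utt2> ..." lines.
set -euo pipefail

utt2spk_to_spk2utt() {
  awk '{
    if (!($2 in utts)) order[++n] = $2   # remember speaker order
    utts[$2] = utts[$2] " " $1           # append the utterance ID
  }
  END { for (i = 1; i <= n; i++) print order[i] utts[order[i]] }'
}

printf '%s\n' "0_0_0_0_1_1_1_1 global" "0_0_0_1_0_0_0_1 global" | utt2spk_to_spk2utt
# prints: global 0_0_0_0_1_1_1_1 0_0_0_1_0_0_0_1
```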

【2】 prepare the dictionary

local/prepare_dict.sh

The script lives in ~/kaldi-trunk/egs/yesno/s5/local; it writes its output to ~/kaldi-trunk/egs/yesno/s5/data/local/dict.

mkdir -p data/local/dict
cp input/lexicon_nosil.txt data/local/dict/lexicon_words.txt
cp input/lexicon.txt data/local/dict/lexicon.txt

This makes the dict directory in data/local and copies the lexicon without silence (lexicon_nosil.txt) to lexicon_words.txt, and the lexicon with the silence entry <SIL> to lexicon.txt.

cat input/phones.txt | grep -v SIL > data/local/dict/nonsilence_phones.txt
echo "SIL" > data/local/dict/silence_phones.txt
echo "SIL" > data/local/dict/optional_silence.txt

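This writes every phone in input/phones.txt except SIL to nonsilence_phones.txt, and declares SIL as both the silence phone and the optional silence phone.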
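To see what these commands produce, here they are run against a toy input/phones.txt in a temporary directory. The phone inventory SIL, Y, N is an assumption inferred from the lexiconp.txt example in step 【3】; the real input/phones.txt may list them in a different order. grep reads the file directly, so the extra cat is not needed.

```shell
#!/usr/bin/env bash
# Run the three dictionary commands on a toy input/phones.txt (assumed: SIL, Y, N).
set -euo pipefail

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
mkdir -p input data/local/dict
printf '%s\n' SIL Y N > input/phones.txt

grep -v SIL input/phones.txt > data/local/dict/nonsilence_phones.txt  # Y, N
echo "SIL" > data/local/dict/silence_phones.txt
echo "SIL" > data/local/dict/optional_silence.txt

head data/local/dict/*.txt   # show each file under a "==> name <==" header
```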

Not all of these files are in "native" Kaldi formats: Kaldi's C++ programs cannot read all of them directly, so they need to be processed (by utils/prepare_lang.sh in the next step, using OpenFst tools) before Kaldi can use them.

lexicon.txt: the lexicon, mapping each word to its phones.
*silence*.txt: these files specify which phones are silence phones and which are not.

【3】language parameters
The next step is to create the raw language files that Kaldi uses. In most cases, these will be text files in integer formats. Make sure that you are back in the s5 directory and execute the following command:

utils/prepare_lang.sh --position-dependent-phones false data/local/dict "<SIL>" data/local/lang data/lang

It first creates lexiconp.txt from lexicon.txt under data/local/dict. lexiconp.txt adds a pronunciation probability (here 1.0) after each word, e.g.:

<SIL> 1.0 SIL
YES 1.0 Y
NO 1.0 N
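Every entry here simply gets probability 1.0. A one-line awk sketch of the transformation (not the code in utils/prepare_lang.sh; add_pron_probs is a hypothetical name), assuming lexicon.txt holds "<SIL> SIL", "YES Y" and "NO N" as implied by the example above:

```shell
#!/usr/bin/env bash
# Sketch: insert a pronunciation probability of 1.0 after each word.
set -euo pipefail

add_pron_probs() {
  awk '{ $1 = $1 " 1.0"; print }'   # modifying $1 rebuilds the line with single spaces
}

printf '%s\n' "<SIL> SIL" "YES Y" "NO N" | add_pron_probs
# prints:
# <SIL> 1.0 SIL
# YES 1.0 Y
# NO 1.0 N
```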

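It also creates phone_map.txt, which maps each phone to its word-position-dependent variants. This matters only when position-dependent phones are enabled; the yesno command above disables them with --position-dependent-phones false. The following is an example: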
# AA AA_B AA_E AA_I AA_S
# for (B)egin, (E)nd, (I)nternal and (S)ingleton
# and in the case of silence
# SIL SIL SIL_B SIL_E SIL_I SIL_S
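The pattern in the example is easy to generate. A sketch (phone_map_line is a hypothetical helper, not the code in utils/prepare_lang.sh) that prints each phone followed by its _B/_E/_I/_S variants, with silence phones also keeping an unmarked copy of themselves as in the SIL line:

```shell
#!/usr/bin/env bash
# Sketch: print one phone_map-style line for a phone.
set -euo pipefail

phone_map_line() {
  local phone=$1 is_silence=$2 line suffix
  line=$phone
  if [ "$is_silence" = yes ]; then
    line="$line $phone"            # silence keeps an unmarked copy
  fi
  for suffix in _B _E _I _S; do    # Begin, End, Internal, Singleton
    line="$line $phone$suffix"
  done
  echo "$line"
}

phone_map_line AA no    # AA AA_B AA_E AA_I AA_S
phone_map_line SIL yes  # SIL SIL SIL_B SIL_E SIL_I SIL_S
```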

It then validates the dictionary directory, including silence_phones.txt and nonsilence_phones.txt.

Here is some of the printed output (abridged):

Checking data/local/dict/silence_phones.txt
Checking data/local/dict/optional_silence.txt
Checking data/local/dict/lexicon.txt

Checking data/local/dict/extra_questions.txt ...
--> data/local/dict/extra_questions.txt is empty (this is OK)

**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt
prepare_lang.sh: validating output directory
Checking data/lang/phones.txt ...
--> data/lang/phones.txt is OK

Checking data/lang/phones/context_indep.{txt, int, csl} ...
Checking data/lang/phones/disambig.{txt, int, csl} ...
Checking data/lang/phones/nonsilence.{txt, int, csl} ...

Checking data/lang/phones/silence.{txt, int, csl} ...

Checking data/lang/phones/optional_silence.{txt, int, csl} ...

Checking data/lang/phones/roots.{txt, int} ...

Checking data/lang/phones/sets.{txt, int} ...

Checking data/lang/phones/extra_questions.{txt, int} ...

Checking optional_silence.txt ...

Checking disambiguation symbols: #0 and #1

Checking topo ...

Checking data/lang/oov.{txt, int} ...

--> data/lang/L.fst is olabel sorted

--> data/lang/L_disambig.fst is olabel sorted

--> WARNING (check output above for warnings)

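phones.txt: the phone symbol table.
words.txt: the word symbol table.
roots.txt (in data/lang/phones): the roots of the phonetic decision tree; the validation log above reports 3 entries, matching the three phones SIL, Y and N.
topo: the phone topology file, generated with utils/gen_topo.pl. It controls the number of states in the non-silence HMMs and in the silence HMMs.
oov.txt: the word to which any out-of-vocabulary (OOV) word is mapped during training; here "<SIL>", the second argument passed to prepare_lang.sh.
L.fst / L_disambig.fst: the lexicon FST without and with disambiguation symbols, generated by utils/make_lexicon_fst.pl (or utils/make_lexicon_fst_silprob.pl when silence probabilities are used).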

The command uses data/local/lang as a temporary directory and writes its output to data/lang, including L.fst, an FST describing the lexicon of the language in question. Look at the script.

It transforms some of the files created in data/ to a more normalized form that is read by Kaldi.
This script creates its output in the data/lang/ directory. The files we mention below will be in that directory.

The first two files this script creates are called words.txt and phones.txt (both in the directory data/lang/).

Look at the files with suffix .csl (in data/lang/phones). These are colon-separated lists of integer phone IDs; for example, nonsilence.csl and silence.csl list the IDs of the non-silence and silence phones, respectively.
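A .csl file holds the same integer IDs as the corresponding .int file (one ID per line), joined with colons. A sketch of that conversion with paste; int_to_csl is a hypothetical name and the IDs are toy values, not taken from this run.

```shell
#!/usr/bin/env bash
# Sketch: join one-ID-per-line input into a colon-separated list.
set -euo pipefail

int_to_csl() {
  paste -s -d: -    # -s: merge all stdin lines into one; -d: use ':' between them
}

printf '%s\n' 2 3 | int_to_csl   # prints: 2:3
```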

Look at phones.txt (in data/lang/). This file is a phone symbol table that also handles the "disambiguation symbols" used in the standard FST recipe. These symbols are conventionally called #1, #2 and so on; see the paper "Speech Recognition with Weighted Finite-State Transducers". We also add a symbol #0, which replaces epsilon transitions in the language model; see Disambiguation symbols for more information. How many disambiguation symbols are there? In some recipes the number of disambiguation symbols is the same as the maximum number of words that share the same pronunciation. For yesno, the validation printout above reports 2 entries in data/lang/phones/disambig.txt (#0 and #1).

The file L.fst is the compiled lexicon in FST format. To see what kind of information is in it, run the following from s5/:

fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head

【4】 build the language model
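local/prepare_lm.sh

This converts the ARPA language model (lm_tg.arpa) into the grammar FST data/lang_test_tg/G.fst; see "Preparing language models for test" in the printout above, where arpa2fst and fstisstochastic are run on data/lang_test_tg/G.fst.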

【5】feature extraction

# Feature extraction
for x in train_yesno test_yesno; do
steps/make_mfcc.sh --nj 1 data/$x exp/make_mfcc/$x mfcc
steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc
done

To see the log output of the program that creates the MFCC features:

vim exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log
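The printout warns that only 29 of the 31 test utterances got features (29 != 31), and split_data.sh later reports 31 lines in utt2spk but 29 in feats.scp. The sketch below (missing_feats is a made-up name) lists the utterance IDs that are in utt2spk but not in feats.scp. An unconfirmed guess at the cause: the tar listing contains 60 .wav files plus README and README~, so ls -1 yields 62 names split 31/31, and the two README files, which sort after the digit-named files, end up in the test list even though they are not audio. That would also fit the 232 scored words (29 × 8) and the [PARTIAL] tag on the WER line. As the log suggests, utils/fix_data_dir.sh data/test_yesno removes such entries.

```shell
#!/usr/bin/env bash
# List utterance IDs present in <data-dir>/utt2spk but missing from
# <data-dir>/feats.scp (i.e. no features were written for them).
set -euo pipefail

missing_feats() {
  local dir=$1
  # comm -23 prints lines that appear only in the first sorted input.
  LC_ALL=C comm -23 <(cut -d' ' -f1 "$dir/utt2spk" | LC_ALL=C sort) \
                    <(cut -d' ' -f1 "$dir/feats.scp" | LC_ALL=C sort)
}

# Usage (from s5/, after feature extraction):
#   missing_feats data/test_yesno
```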

【6】 train monophone models

# Mono training
steps/train_mono.sh --nj 1 --cmd "$train_cmd" \
--totgauss 400 \
data/train_yesno data/lang exp/mono0a
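The printout of steps/train_mono.sh shows an equal alignment at pass 0, then passes 1 to 39, with "Aligning data" after only some of them: every pass early on, then less and less often. The sketch below reproduces that schedule. The realign_iters list is read off the printout and is assumed, not checked against the script, to be train_mono.sh's default realignment schedule. Separately, --totgauss 400 sets the target total number of Gaussians, which the script approaches gradually over the passes.

```shell
#!/usr/bin/env bash
# Reproduce the pass/realignment schedule visible in the printout.
set -euo pipefail

# Passes followed by "Aligning data" in the printout.
realign_iters="1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38"
num_passes=39   # the printout ends at "Pass 39"

print_schedule() {
  local x
  echo "Aligning data equally (pass 0)"
  for ((x = 1; x <= num_passes; x++)); do
    echo "Pass $x"
    case " $realign_iters " in        # pad with spaces for whole-number matching
      *" $x "*) echo "Aligning data" ;;
    esac
  done
}

print_schedule | grep -c '^Aligning data$'   # prints: 21
```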

【7】create the decode graph

# Graph compilation
utils/mkgraph.sh --mono data/lang_test_tg exp/mono0a exp/mono0a/graph_tgpr

【8】 monophone decoding

# Decoding
steps/decode.sh --nj 1 --cmd "$decode_cmd" \
exp/mono0a/graph_tgpr data/test_yesno exp/mono0a/decode_test_yesno

To print the final results,

for x in exp/*/decode*; do [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh; done
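utils/best_wer.sh picks the best line out of all the wer_* files; the number in the file name (wer_10) is typically the language-model weight used for that scoring pass. Below is a sketch of the selection idea only, not the script's code (best_wer is reused here as a hypothetical function name). It assumes grep output of the form "<file>:%WER ...", which grep produces when given several files.

```shell
#!/usr/bin/env bash
# Sketch: keep the lowest-WER line and move the file name to the end.
set -euo pipefail

best_wer() {
  awk '{
    split($1, parts, ":%WER")            # parts[1] is the file name
    if (best == "" || $2 + 0 < best + 0) {
      best = $2; file = parts[1]
      $1 = "%WER"; line = $0              # drop the file-name prefix
    }
  }
  END { if (line != "") print line, file }'
}

# Usage (from s5/):
#   grep WER exp/mono0a/decode_test_yesno/wer_* | best_wer
```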

Here is a sample output.

%WER 0.00 [ 0 / 232, 0 ins, 0 del, 0 sub ] exp/mono0a/decode_test_yesno/wer_10
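The WER is the number of word errors (insertions + deletions + substitutions) divided by the number of reference words, times 100. Here N = 232, which is 29 × 8: the 29 test utterances that got features, each with 8 words. A small helper (hypothetical name wer) doing the arithmetic:

```shell
#!/usr/bin/env bash
# WER (%) = 100 * (insertions + deletions + substitutions) / reference words
set -euo pipefail

wer() {
  awk -v i="$1" -v d="$2" -v s="$3" -v n="$4" \
    'BEGIN { printf "%.2f\n", 100 * (i + d + s) / n }'
}

wer 0 0 0 232      # prints: 0.00, as in the sample output
echo $((29 * 8))   # prints: 232
```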

--------------------------------------------------------------------------------------------------------------------------
http://kaldi.sourceforge.net/tutorial_running.html

Friday, April 10, 2015