Command-line programs reference
Three command-line executables are installed along with the ERTK Python library: ertk-cli, ertk-dataset and ertk-util.
Each has a number of subcommands, which are listed with the --help option.
ertk-cli
Experiments
ertk-cli exp
ertk-cli exp [OPTIONS] [CORPUS_INFO]...
Options
- --data_config <data_config>
- --subset <subset>
Subset selection.
- --map_groups <map_groups>
Group name mapping.
- --sel_groups <sel_groups>
Group selection. This is a map from partition to group(s).
- --remove_groups <remove_groups>
Group deletion. This is a map from partition to group(s).
- --clip_seq <clip_seq>
Clip sequences to this length (before pad).
- --pad_seq <pad_seq>
Pad sequences to multiple of this length (after clip).
- --cv_part <cv_part>
Partition for LOGO CV.
- --kfold <kfold>
k when using (group) k-fold cross-validation, or leave-one-out.
- --inner_kfold <inner_kfold>
k for inner k-fold CV (where relevant). If -1 then LOGO is used. If 1 then a random split is used.
- --test_size <test_size>
Test size when kfold=1.
- --inner_part <inner_part>
Which partition to use for group-based inner CV.
- --train <train>
Train data.
- --valid <valid>
Validation data.
- --test <test>
Test data.
- --inner_cv, --noinner_cv
[deprecated] Whether to use inner CV. This is deprecated and only exists for backwards compatibility.
- --clf <clf_type>
Required. Classifier to use.
- --clf_args, --model_args <clf_args_file>
File containing keyword arguments to give to model initialisation.
- --param_grid <param_grid_file>
File with parameter grid data.
- --results <results>
Results directory.
- --logdir <logdir>
TF/PyTorch logs directory.
- --features <features>
Required. Features to load.
- --label <label>
Label annotation to use.
- --learning_rate <learning_rate>
- Default:
0.0001
- --batch_size <batch_size>
- Default:
64
- --epochs <epochs>
- Default:
50
- --balanced, --imbalanced
Balances sample weights.
- --sample_rate <sample_rate>
Sample rate if loading raw audio.
- --n_gpus <n_gpus>
Number of GPUs to use.
- Default:
1
- --reps <reps>
The number of repetitions to do per test.
- Default:
1
- --normalise <normalise>
Normalisation method. ‘online’ means use training data for normalisation.
- Default:
online
- --seq_transform <seq_transform>
Normalisation method for sequences.
- Default:
feature
- --transform <transform>
Transformation class.
- Default:
std
- Options:
std | minmax
- --n_jobs <n_jobs>
Number of parallel executions.
- --train_config <train_config_path>
Path to train config file.
- --verbose <verbose>
Verbosity. -1=nothing, 0=normal output, 1=INFO, 2=DEBUG
Arguments
- CORPUS_INFO
Optional argument(s)
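The --clip_seq and --pad_seq options above are applied in a fixed order: clipping first, then padding to a multiple of the pad length. The following is a minimal sketch of that order on a plain Python list. The function name `clip_and_pad` and the zero padding value are assumptions for illustration only; ERTK's own implementation may differ.

```python
from typing import List, Optional


def clip_and_pad(
    seq: List[float],
    clip_seq: Optional[int] = None,
    pad_seq: Optional[int] = None,
    pad_value: float = 0.0,  # assumption: pad with zeros
) -> List[float]:
    """Clip to at most clip_seq items, then pad to a multiple of pad_seq."""
    out = list(seq)
    if clip_seq is not None:
        # Clipping happens before padding.
        out = out[:clip_seq]
    if pad_seq is not None and pad_seq > 0:
        remainder = len(out) % pad_seq
        if remainder:
            # Extend up to the next multiple of pad_seq.
            out.extend([pad_value] * (pad_seq - remainder))
    return out


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(clip_and_pad(frames, clip_seq=5, pad_seq=4))
```

In this sketch, a sequence of 7 items clipped to 5 and padded to a multiple of 4 ends up with 8 items.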
ertk-cli exp2
ertk-cli exp2 [OPTIONS] CONFIG_PATH [RESTARGS]...
Options
- --verbose <verbose>
Verbosity. -1=nothing, 0=normal output, 1=INFO, 2=DEBUG
Arguments
- CONFIG_PATH
Required argument
- RESTARGS
Optional argument(s)
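The --verbose levels listed above (-1, 0, 1, 2) can be related to Python's standard logging levels. The mapping below is only a sketch: the function name `verbosity_to_level` is hypothetical, and treating 0 ("normal output") as WARNING is an assumption, not something the reference states.

```python
import logging


def verbosity_to_level(verbose: int) -> int:
    """Map a --verbose value to a logging level (illustrative mapping)."""
    if verbose < 0:
        # -1 = nothing: set the threshold above CRITICAL.
        return logging.CRITICAL + 1
    if verbose == 0:
        # 0 = normal output (assumed here to mean warnings and above).
        return logging.WARNING
    if verbose == 1:
        return logging.INFO
    # 2 (or higher) = DEBUG.
    return logging.DEBUG


logging.basicConfig(level=verbosity_to_level(1))
logging.getLogger("example").info("shown at verbosity 1 and above")
```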
Model training and inference
ertk-cli classify
Perform inference on given INPUT features using a pre-trained model. Gives the predicted class and confidence and writes CSV to OUTPUT.
ertk-cli classify [OPTIONS] INPUT OUTPUT
Options
- --model <model>
Required. Pickled model.
- --norm <norm>
Normalisation scheme.
- Default:
speaker
- Options:
speaker | corpus | all
Arguments
- INPUT
Required argument
- OUTPUT
Required argument
ertk-cli train
Trains a model on the given INPUT datasets. INPUT files must be corpus info files.
Optionally pickles the model.
ertk-cli train [OPTIONS] [INPUT]...
Options
- --features <features>
Required. Features to load.
- --clf <clf_type>
Required. Classifier to use.
- --save <save>
Required. Location to save the model.
- --cv <cv>
Cross-validation method.
- Options:
speaker | corpus
- --balanced, --imbalanced
Balances sample weights.
- --normalise <normalise>
Normalisation method.
- Options:
speaker | corpus | online
- --transform <transform>
Transformation class.
- Default:
std
- Options:
std | minmax
- --subset <subset>
Subset selection.
- --map_groups <map_groups>
Group name mapping.
- --sel_groups <sel_groups>
Group selection. This is a map from partition to group(s).
- --clip_seq <clip_seq>
Clip sequences to this length (before pad).
- --pad_seq <pad_seq>
Pad sequences to multiple of this length (after clip).
- --target <target>
Classifier/regressor target.
- --clf_args <clf_args_file>
File containing keyword arguments to give to model initialisation.
- --param_grid <param_grid_file>
File with parameter grid data.
- --verbose
Verbose training.
Arguments
- INPUT
Optional argument(s)
ertk-dataset
Info
ertk-dataset annotation
Calculate statistics from annotations in INPUT(s). If multiple INPUTs are given, additional statistics for the other INPUTs are shown for each level of INPUT.
ertk-dataset annotation [OPTIONS] [INPUT]...
Options
- --plot
Plot histogram/bar chart.
- --files <files>
File with names to include for statistics.
- --dtype <dtype>
Arguments
- INPUT
Optional argument(s)
ertk-dataset info
Print info about a dataset or combination of datasets.
ertk-dataset info [OPTIONS] [CORPUS_INFO]...
Options
- --data_config <data_config>
- --subset <subset>
Subset selection.
- --map_groups <map_groups>
Group name mapping.
- --sel_groups <sel_groups>
Group selection. This is a map from partition to group(s).
- --remove_groups <remove_groups>
Group deletion. This is a map from partition to group(s).
- --clip_seq <clip_seq>
Clip sequences to this length (before pad).
- --pad_seq <pad_seq>
Pad sequences to multiple of this length (after clip).
- --verbose <verbose>
Verbosity. -1=nothing, 0=normal output, 1=INFO, 2=DEBUG
- --output_list <output_list>
Arguments
- CORPUS_INFO
Optional argument(s)
Running processors and feature extractors
ertk-dataset process
Process features or audio files in INPUT, write to OUTPUT.
ertk-dataset process [OPTIONS] INPUT OUTPUT [RESTARGS]...
Options
- --processor, --features <processor>
Required. Processor to apply.
- --list_processors
Show options for registered processors.
- --config <config>
Extractor config file.
- --corpus <corpus>
Corpus name to set.
- --batch_size <batch_size>
Batch size for processing. If batch_size is greater than 1, clips will be batched together.
- --sample_rate <sample_rate>
Resample to this rate for audio input.
- --n_jobs <n_jobs>
Number of parallel jobs to run.
- --verbose <verbose>
Arguments
- INPUT
Required argument
- OUTPUT
Required argument
- RESTARGS
Optional argument(s)
Features
ertk-dataset combine
Combines multiple INPUT features and writes to OUTPUT.
ertk-dataset combine [OPTIONS] [INPUT]... OUTPUT
Options
- --prefix_corpus
Prefix corpus names to instance names.
- --corpus <corpus>
Output corpus name.
Arguments
- INPUT
Optional argument(s)
- OUTPUT
Required argument
ertk-dataset convert
Convert INPUT dataset format to OUTPUT format. Note that no label information is written to OUTPUT.
ertk-dataset convert [OPTIONS] INPUT OUTPUT
Options
- --corpus <corpus>
Corpus attribute to set, if required.
Arguments
- INPUT
Required argument
- OUTPUT
Required argument
ertk-dataset remove_instances
Remove instances from INPUT features that aren’t in the NAMES file, and write to OUTPUT. If OUTPUT is not given, overwrites INPUT in-place.
ertk-dataset remove_instances [OPTIONS] INPUT [OUTPUT]
Options
- --names <names_file>
Arguments
- INPUT
Required argument
- OUTPUT
Optional argument
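The filtering that remove_instances describes amounts to keeping only the instances whose names appear in the names file. A minimal sketch follows, assuming features are held as a mapping from instance name to feature vector; the function name `keep_named` and this in-memory representation are illustrative and not ERTK's on-disk format.

```python
from typing import Dict, Iterable, List


def keep_named(
    features: Dict[str, List[float]], names: Iterable[str]
) -> Dict[str, List[float]]:
    """Return only the instances whose names are in `names`."""
    wanted = set(names)
    # Instances not listed in `names` are dropped; order is preserved.
    return {name: feat for name, feat in features.items() if name in wanted}


features = {"clip_a": [0.1, 0.2], "clip_b": [0.3, 0.4], "clip_c": [0.5, 0.6]}
print(keep_named(features, ["clip_a", "clip_c"]))
```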
ertk-dataset vis
Tools for visualising data.
ertk-dataset vis [OPTIONS] COMMAND [ARGS]...
Commands
- feats
Displays plot of INSTANCE in INPUT.
- xy
Plots INPUT in 2D using one of a number of…
ertk-util
Run parallel CPU or GPU jobs with a simple command:
ertk-util parallel_jobs
Run all commands specified in the INPUT file(s), splitting the work across multiple CPU threads or GPUs such that each command runs solely on whichever thread/GPU is next available.
Each GPU still has its own thread to run processes using CUDA_VISIBLE_DEVICES. Each thread reads from a synchronous queue and runs the command in a shell.
ertk-util parallel_jobs [OPTIONS] [INPUT]...
Options
- --failed <failed>
Required. Where to log failed commands.
- --cpus <cpus>
Number of CPU threads to use.
- --gpus <gpus>
GPU IDs to run on.
Arguments
- INPUT
Optional argument(s)
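The worker model described for parallel_jobs (one thread per CPU slot or GPU, each pulling commands from a shared queue and running them in a shell) can be sketched with the Python standard library. This is an illustration of the general pattern rather than ERTK's code: the function name `run_parallel` and the choice to return failed commands as a list, instead of writing them to the --failed file, are assumptions.

```python
import os
import queue
import subprocess
import threading
from typing import List, Optional


def run_parallel(
    commands: List[str], cpus: int = 1, gpus: Optional[List[int]] = None
) -> List[str]:
    """Run shell commands on worker threads; return the ones that failed."""
    jobs: "queue.Queue[str]" = queue.Queue()
    for cmd in commands:
        jobs.put(cmd)
    failed: List[str] = []
    lock = threading.Lock()

    def worker(gpu_id: Optional[int]) -> None:
        env = dict(os.environ)
        if gpu_id is not None:
            # Restrict this worker's processes to a single GPU.
            env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        while True:
            try:
                cmd = jobs.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            result = subprocess.run(cmd, shell=True, env=env)
            if result.returncode != 0:
                with lock:
                    failed.append(cmd)

    # One thread per GPU if GPUs are given, otherwise one per CPU slot.
    devices = list(gpus) if gpus else [None] * max(cpus, 1)
    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failed


print(run_parallel(["exit 0", "exit 3", "exit 0"], cpus=2))
```

Because each thread takes the next command as soon as it is free, long and short commands interleave without any up-front scheduling.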
Other utilities
ertk-util create_cv_dirs
Create directory structure with group-independent cross-validation folds. Each group has a directory which is partitioned by label.
ertk-util create_cv_dirs [OPTIONS] INPUT OUTPUT
Options
- --label <label_name>
Categorical label to use.
- --partition <partition>
Partition to split on.
Arguments
- INPUT
Required argument
- OUTPUT
Required argument
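One way to picture the layout create_cv_dirs describes is as a group directory containing one subdirectory per label. The sketch below only plans paths and touches no files. The function name `plan_cv_dirs`, the (name, group, label) input tuples and the exact path pattern OUTPUT/group/label/name are all assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Tuple


def plan_cv_dirs(
    instances: Iterable[Tuple[str, str, str]], output: str
) -> Dict[str, List[str]]:
    """Map each group to the planned paths of its instances.

    Each instance is a (name, group, label) tuple and is placed at
    output/<group>/<label>/<name> (hypothetical layout).
    """
    layout: Dict[str, List[str]] = defaultdict(list)
    for name, group, label in instances:
        path = PurePosixPath(output) / group / label / name
        layout[group].append(str(path))
    return dict(layout)


instances = [
    ("a1", "spk1", "happy"),
    ("a2", "spk1", "sad"),
    ("b1", "spk2", "happy"),
]
for group, paths in plan_cv_dirs(instances, "out").items():
    print(group, paths)
```

Holding out one group directory at a time then gives group-independent folds.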
ertk-util grid_to_conf
Creates a new parameters YAML file in the OUTPUT directory for each combination of parameters in the PARAM_GRID file. The names of the files will be formatted according to the --format parameter if given, or else assigned a number starting from 1.
ertk-util grid_to_conf [OPTIONS] PARAM_GRID OUTPUT
Options
- --format <format>
Format string.
Arguments
- PARAM_GRID
Required argument
- OUTPUT
Required argument
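Expanding a parameter grid into one configuration per combination, as grid_to_conf describes, can be sketched as a Cartesian product. This sketch returns a mapping from file name to parameters instead of writing YAML. The function name `expand_grid`, the sorted key order and the use of `str.format` with parameter names for the format string are assumptions; the actual --format syntax may differ.

```python
import itertools
from typing import Any, Dict, List, Optional


def expand_grid(
    param_grid: Dict[str, List[Any]], fmt: Optional[str] = None
) -> Dict[str, Dict[str, Any]]:
    """Return {file name: parameters} for every combination in the grid."""
    keys = sorted(param_grid)  # fixed order so numbering is reproducible
    configs: Dict[str, Dict[str, Any]] = {}
    combos = itertools.product(*(param_grid[k] for k in keys))
    for index, values in enumerate(combos, start=1):
        params = dict(zip(keys, values))
        # Use the format string if given, else number from 1.
        name = fmt.format(**params) if fmt else str(index)
        configs[name + ".yaml"] = params
    return configs


grid = {"lr": [0.1, 0.01], "units": [32, 64]}
for name, params in expand_grid(grid).items():
    print(name, params)
```

A grid with two values for each of two parameters yields four configurations, named 1.yaml to 4.yaml when no format string is given.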
ertk-util names_to_filenames
Convert names in CLIPS to filepaths from FILELIST. Write to stdout.
ertk-util names_to_filenames [OPTIONS] CLIPS FILELIST
Arguments
- CLIPS
Required argument
- FILELIST
Required argument
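The lookup performed by names_to_filenames can be illustrated as matching each clip name against the paths in a file list. The sketch assumes that a clip name equals the file name without its extension; that matching rule and the function name `names_to_paths` are assumptions for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def names_to_paths(names: Iterable[str], filelist: Iterable[str]) -> List[str]:
    """Return the path for each name, matched on the file stem."""
    by_stem = {Path(p).stem: p for p in filelist}
    # Names without a matching file are skipped in this sketch.
    return [by_stem[name] for name in names if name in by_stem]


files = ["/data/wav/a_01.wav", "/data/wav/c_03.wav", "/data/wav/b_02.wav"]
for path in names_to_paths(["a_01", "b_02"], files):
    print(path)
```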
CHAT and ELAN file formats
ertk-util split_chat
Splits CHAT (.cha) files and associated audio into segments.
ertk-util split_chat [OPTIONS] [INPUT]... OUTPUT
Options
- --prefix <prefix>
Arguments
- INPUT
Optional argument(s)
- OUTPUT
Required argument
ertk-util split_elan
Splits ELAN (.eaf) files and associated audio into segments.
ertk-util split_elan [OPTIONS] [INPUT]... OUTPUT
Options
- --prefix <prefix>
- --tier <tier>
Tier from which to extract transcripts.
- --break_on <break_on>
Break the utterance at these annotation values.
Arguments
- INPUT
Optional argument(s)
- OUTPUT
Required argument