ertk.dataset.Dataset

class ertk.dataset.Dataset(corpus_info: PathLike | str, features: str | PathLike | None = None, subset: str = 'default', label: str = 'label')

Bases: object

Class representing a generic dataset, consisting of a set of features and optional partitions and annotations. Has various preprocessing methods.

An annotation is a scalar value for all instances in the dataset.

A partition is a partition of instances into disjoint groups (e.g. speakers). A partition should be complete in that each instance is in exactly one group in the partition. Each partition has a corresponding annotation with the same name, using the group names as categorical annotations.
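
The relationship between a partition, its group names, group indices and group counts can be illustrated with a small standalone sketch. It does not use ertk; `describe_partition` is a hypothetical helper, and the sorted ordering of group names is an assumption made for the example.

```python
from collections import Counter


def describe_partition(instance_groups):
    """Summarise a partition given as {instance name: group name}.

    Hypothetical helper for illustration; not part of ertk.
    """
    # Each instance maps to exactly one group, so the partition is
    # complete and the groups are disjoint by construction.
    group_names = sorted(set(instance_groups.values()))  # assumed ordering
    lookup = {group: i for i, group in enumerate(group_names)}
    # One index per instance, pointing into group_names.
    group_indices = [lookup[group] for group in instance_groups.values()]
    counts = Counter(instance_groups.values())
    group_counts = [counts[group] for group in group_names]
    return group_names, group_indices, group_counts


speaker = {"utt1": "spk_b", "utt2": "spk_a", "utt3": "spk_b"}
names, indices, counts = describe_partition(speaker)
print(names, indices, counts)
```

The three returned values loosely mirror what get_group_names(), get_group_indices() and get_group_counts() are described as returning for a partition.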

Parameters:
corpus_info: PathLike or str

Path to corpus info in YAML format.

features: PathLike or str, optional

Path to features file, or unique name of features in corpus features directory.

subset: str, optional

The subset of instances to use.

label: str, optional

The annotation to use as the target label.

__init__(corpus_info: PathLike | str, features: str | PathLike | None = None, subset: str = 'default', label: str = 'label')

Methods

__init__(corpus_info[, features, subset, label])

annotation_type(annot_name)

clip_arrays(length)

Clips each array to the specified maximum length.

clone()

copy()

frame_arrays(frame_size, frame_shift[, ...])

Create a sequence of frames from the raw signal.

get_annotations(annot_name)

Get a list of annotations, one for each instance currently in the dataset.

get_audio_paths()

get_group_counts(annot_name)

Get group counts for a partition.

get_group_indices(annot_name)

Gets the group indices (i.e. indices into the groups array) for a given partition.

get_group_names(annot_name)

Get the names of groups in a partition.

get_idx_for_names(names)

Gets indices of instances corresponding to names.

get_idx_for_split()

Gets indices of instances corresponding to the selection given by split.

get_ratings(name[, rating_set])

Get per-annotator ratings of a specified column, for this dataset.

init_corpus_info(path)

Initialise corpus metadata from YAML.

map_and_select(map, select, remove)

Convenience function for mapping one or more partitions and then selecting one or more groups.

map_classes(mapping)

Modifies classes based on the given mapping.

map_groups(part_name, mapping)

Map group names in a partition.

normalise(partition[, normaliser])

Transforms the data matrix of this dataset in-place using some (offline) normalisation method.

pad_arrays([pad])

Pads each array to the nearest multiple of pad greater than the array size.

remove_annotation(annot_name)

Removes a set of annotations from the dataset.

remove_classes(*[, drop, keep])

Remove instances with labels not in keep.

remove_groups(part_name, *[, drop, keep])

Remove instances corresponding to groups from the given partition.

remove_instances(*[, drop, keep])

Remove instances from dataset.

remove_ratings(rating_set)

Delete a set of ratings for this dataset.

rename_annotation(old_name, new_name)

Renames an annotation.

transpose_time()

Transpose the time and feature axis of each instance.

update_annotation(annot_name, annotations[, ...])

Add or update an annotation.

update_features(features, **read_kwargs)

Update the features matrix and feature names for this dataset.

update_labels(labels)

update_ratings(rating_set, ratings)

Update a set of ratings.

use_subset([subset])

Use a different subset of the instances.

Attributes

annotations

Full annotation matrix for dataset.

class_counts

Number of instances for each class.

classes

List of unique class labels.

corpus

The corpus this dataset represents.

description

The descriptive name of this corpus.

feature_names

List of feature names.

label_annot

The annotation used as target label.

labels

List of labels for instances.

n_classes

Number of unique classes.

n_features

Number of features.

n_instances

Number of instances in this dataset.

n_speakers

names

List of instance names.

partitions

Partitions in this dataset.

ratings

Full ratings for dataset.

speaker_counts

Number of instances for each speaker.

speaker_indices

Indices into speakers array of corresponding speaker for each instance.

speaker_names

List of unique speakers in this dataset.

speakers

subset

Name of clip subset used.

subsets

Dict from subset name to set of clip names.

x

The data matrix.

y

The class label array; one label per instance.

property annotations: DataFrame

Full annotation matrix for dataset.

property class_counts: ndarray

Number of instances for each class.

property classes: List[str]

List of unique class labels.

clip_arrays(length: int)

Clips each array to the specified maximum length.
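
As a rough illustration of what clipping to a maximum length means, the following sketch truncates plain Python lists. `clip_sequences` is a hypothetical stand-in, not the ertk implementation, and it assumes the first axis of each array is time.

```python
def clip_sequences(sequences, length):
    """Truncate each sequence to at most `length` time steps.

    Hypothetical helper; sequences shorter than `length` are unchanged.
    """
    return [seq[:length] for seq in sequences]


clipped = clip_sequences([[1, 2, 3, 4, 5], [6, 7]], length=3)
print(clipped)
```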

property corpus: str

The corpus this dataset represents.

property description: str

The descriptive name of this corpus.

property feature_names: List[str]

List of feature names.

frame_arrays(frame_size: int, frame_shift: int, max_frames: int | None = None)

Create a sequence of frames from the raw signal.
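
The roles of frame_size, frame_shift and max_frames can be sketched on a one-dimensional signal as follows. This is a minimal sketch, not the ertk implementation: `frame_sequence` is hypothetical, and it assumes trailing samples that do not fill a whole frame are dropped rather than padded.

```python
def frame_sequence(signal, frame_size, frame_shift, max_frames=None):
    """Split a 1-D signal into (possibly overlapping) frames.

    Assumption: samples at the end that do not fill a whole frame are dropped.
    """
    # Start positions of each frame; empty if the signal is too short.
    starts = range(0, len(signal) - frame_size + 1, frame_shift)
    frames = [signal[s:s + frame_size] for s in starts]
    if max_frames is not None:
        # Keep only the first max_frames frames.
        frames = frames[:max_frames]
    return frames


frames = frame_sequence(list(range(10)), frame_size=4, frame_shift=2)
limited = frame_sequence(list(range(10)), frame_size=4, frame_shift=2, max_frames=2)
print(frames)
print(limited)
```

With a shift smaller than the frame size, consecutive frames overlap; here each frame shares two samples with the next.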

get_annotations(annot_name: str) -> ndarray

Get a list of annotations, one for each instance currently in the dataset.

Parameters:
annot_name: str

Annotation name.

Returns:
A pd.Series of values, one for each instance in the dataset, in the same order they appear in names and x.

get_group_counts(annot_name: str) -> ndarray

Get group counts for a partition.

Parameters:
annot_name: str

The partition name.

Returns:
A NumPy array of counts, one for each group in this partition.

get_group_indices(annot_name: str) -> ndarray

Gets the group indices (i.e. indices into the groups array) for a given partition.

Parameters:
annot_name: str

The partition name.

Returns:
A NumPy array of group indices for each instance in the dataset.

get_group_names(annot_name: str) -> List[str]

Get the names of groups in a partition.

Parameters:
annot_name: str

Annotation name.

get_idx_for_names(names: Collection[str]) -> ndarray

Gets indices of instances corresponding to names.

Parameters:
names: collection of str

The names to get indices for.

Returns:
idx: np.ndarray

The indices corresponding to names, in order.
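
A name-to-index lookup of this kind can be sketched without ertk as below. `idx_for_names` is a hypothetical helper, and the sketch assumes "in order" means the order of the requested names.

```python
def idx_for_names(all_names, wanted):
    """Return the position of each wanted name within all_names.

    Hypothetical helper; raises KeyError for names that are not present.
    """
    position = {name: i for i, name in enumerate(all_names)}
    # Preserve the order of `wanted`, not the order of `all_names`.
    return [position[name] for name in wanted]


idx = idx_for_names(["a", "b", "c", "d"], ["c", "a"])
print(idx)
```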

get_idx_for_split(split: str | Dict[str, Collection[str]] | DataSelector, return_complement: Literal[False] = False) -> ndarray
get_idx_for_split(split: str | Dict[str, Collection[str]] | DataSelector, return_complement: Literal[True]) -> Tuple[ndarray, ndarray]

Gets indices of instances corresponding to the selection given by split.

Parameters:
split: str or dict

Either a string containing the subset to select, groups to select, a path to such a config file, or a mapping object containing the groups to select.

return_complement: bool

If True, return a tuple of idx, comp_idx where comp_idx contains the complement indices not in the split.

Returns:
idx: np.ndarray

The corresponding indices.

comp_idx: np.ndarray, optional

If return_complement is True, this is also returned and contains the complement indices.
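
The relationship between idx and comp_idx can be shown with a simplified sketch that selects instances by group. `idx_for_groups` is hypothetical and handles only a single partition given as a list of group names; the real method accepts richer split specifications.

```python
def idx_for_groups(instance_groups, selected, return_complement=False):
    """Indices of instances whose group is in `selected`.

    Hypothetical helper. With return_complement=True, also return the
    indices of all remaining instances.
    """
    idx = [i for i, group in enumerate(instance_groups) if group in selected]
    if not return_complement:
        return idx
    chosen = set(idx)
    comp_idx = [i for i in range(len(instance_groups)) if i not in chosen]
    return idx, comp_idx


speakers = ["s1", "s2", "s1", "s3"]
idx, comp_idx = idx_for_groups(speakers, {"s1"}, return_complement=True)
print(idx, comp_idx)
```

Together the two index lists cover every instance exactly once, which makes the pair convenient for train/test style splits.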

get_ratings(name: str, rating_set: str = 'ratings') -> Series

Get per-annotator ratings of a specified column, for this dataset.

Parameters:
name: str

The name of the rating column to get.

rating_set: str

The name of the rating set to get.

Returns:
pd.Series

Pandas Series with (name, rater) multiindex.

init_corpus_info(path: PathLike | str) -> None

Initialise corpus metadata from YAML.

Parameters:
path: os.PathLike or str

The path to a YAML file containing corpus metadata.

property label_annot: str

The annotation used as target label.

property labels: ndarray

List of labels for instances.

map_and_select(map: Mapping[str, Mapping[str, str]], select: Mapping[str, str | Collection[str]], remove: Mapping[str, str | Collection[str]]) -> None

Convenience function for mapping one or more partitions and then selecting one or more groups.

Parameters:
map: mapping

The groups mapping. May have one or more partitions.

select: mapping

Mapping from partitions to groups to select.

remove: mapping

Mapping from partitions to groups to remove.

map_classes(mapping: Mapping[str, str])

Modifies classes based on the given mapping. Keys not corresponding to classes are ignored. The new classes will be sorted lexicographically.
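
The described behaviour (unknown keys ignored, resulting classes sorted) can be sketched on a plain list of labels. `map_class_labels` is a hypothetical helper and the label values are invented for the example.

```python
def map_class_labels(labels, mapping):
    """Apply a class mapping to labels and return the new sorted class list.

    Hypothetical helper. Labels without an entry in `mapping` are kept;
    mapping keys that match no label have no effect.
    """
    new_labels = [mapping.get(label, label) for label in labels]
    classes = sorted(set(new_labels))  # lexicographic order
    return new_labels, classes


labels = ["happy", "excited", "sad", "angry"]
mapping = {"excited": "happy", "bored": "neutral"}  # "bored" matches nothing
new_labels, classes = map_class_labels(labels, mapping)
print(new_labels, classes)
```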

map_groups(part_name: str, mapping: Mapping[str, str]) -> None

Map group names in a partition.

Parameters:
part_name: str

Name of partition.

mapping: dict

Group name mapping.

property n_classes: int

Number of unique classes.

property n_features: int

Number of features.

property n_instances: int

Number of instances in this dataset.

property names: Index

List of instance names.

normalise(partition: str, normaliser: TransformerMixin = StandardScaler())

Transforms the data matrix of this dataset in-place using some (offline) normalisation method.

Parameters:
normaliser:

The transform to apply. Must implement fit_transform().

partition: str

The partition to apply in a per-group fashion. If “all”, then perform global normalisation on all the data.
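
Per-group normalisation can be illustrated with a standard-library sketch that standardises one feature column separately within each group. `standardise_per_group` is a hypothetical helper; it uses the population standard deviation and maps constant groups to zero, which is an assumption about the normaliser rather than documented ertk behaviour.

```python
from statistics import fmean, pstdev


def standardise_per_group(values, groups):
    """Standardise a feature column separately within each group.

    Hypothetical helper: subtract the group mean and divide by the
    group's population standard deviation.
    """
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    stats = {g: (fmean(vs), pstdev(vs)) for g, vs in by_group.items()}
    result = []
    for value, group in zip(values, groups):
        mean, std = stats[group]
        # Assumption: a constant group is centred but not scaled.
        result.append((value - mean) / std if std > 0 else 0.0)
    return result


values = [1.0, 3.0, 10.0, 30.0]
groups = ["s1", "s1", "s2", "s2"]
normalised = standardise_per_group(values, groups)
print(normalised)
```

Although the two groups have very different raw scales, both end up with zero mean and unit variance, which is the point of normalising per speaker or per group.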

pad_arrays(pad: int = 32)

Pads each array to the nearest multiple of pad greater than the array size. Assumes axis 0 of x is time.
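
The padded length can be worked out with integer arithmetic, as in the sketch below. The helpers are hypothetical, and the sketch assumes that an array whose length is already a multiple of pad is left unchanged and that padding uses zeros.

```python
def padded_length(size, pad=32):
    """Smallest multiple of `pad` that is >= size.

    Assumption: sizes that are already a multiple of `pad` are unchanged.
    """
    return -(-size // pad) * pad  # ceiling division, then scale back up


def pad_sequence(seq, pad=32, fill=0):
    """Append `fill` values (assumed zero) until the length is a multiple of `pad`."""
    return seq + [fill] * (padded_length(len(seq), pad) - len(seq))


print(padded_length(50))
print(pad_sequence([1, 2, 3], pad=4))
```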

property partitions: Set[str]

Partitions in this dataset.

property ratings: Dict[str, DataFrame]

Full ratings for dataset.

remove_annotation(annot_name: str) -> None

Removes a set of annotations from the dataset. This does not remove any instances.

Parameters:
annot_name: str

Annotation name.

remove_classes(*, drop: Collection[str] | None = None, keep: Collection[str] | None = None)

Remove instances with labels not in keep.

remove_groups(part_name: str, *, drop: Collection[str] | None = None, keep: Collection[str] | None = None) -> None

Remove instances corresponding to groups from the given partition.

Parameters:
part_name: str

The partition name.

drop: collection of str

The groups to remove in given partition.

keep: collection of str

The groups to keep in given partition.

remove_instances(*, drop: Collection[str] | None = None, keep: Collection[str] | None = None)

Remove instances from dataset. Recalculate annotations, partitions, etc.

Parameters:
drop: collection of str

Instances to drop.

keep: collection of str

Instances to keep. Exactly one of drop and keep should be given.
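
The "exactly one of drop and keep" convention can be sketched as follows. `select_instances` is a hypothetical helper, and raising ValueError when the rule is violated is an assumption, not documented ertk behaviour.

```python
def select_instances(names, *, drop=None, keep=None):
    """Return the names that remain after applying drop or keep.

    Hypothetical helper. Assumption: passing both or neither is an error.
    """
    if (drop is None) == (keep is None):
        raise ValueError("Exactly one of drop and keep should be given.")
    if keep is not None:
        keep = set(keep)
        return [name for name in names if name in keep]
    drop = set(drop)
    return [name for name in names if name not in drop]


remaining_after_drop = select_instances(["a", "b", "c"], drop=["b"])
remaining_after_keep = select_instances(["a", "b", "c"], keep=["b"])
print(remaining_after_drop, remaining_after_keep)
```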

remove_ratings(rating_set: str) -> None

Delete a set of ratings for this dataset.

Parameters:
rating_set: str

The name of a set of ratings.

rename_annotation(old_name: str, new_name: str) -> None

Renames an annotation. This is useful for example if you want to select a different labelling to use. If an annotation already exists with the given new_name, then it is replaced destructively.

Parameters:
old_name: str

The name of the annotation to rename.

new_name: str

The new name of the annotation.
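
The destructive replacement described above can be pictured with a plain dictionary of annotation columns. This is a sketch only: `rename_key` and the column names are hypothetical, and ertk stores annotations in a DataFrame rather than a dict.

```python
def rename_key(annotations, old_name, new_name):
    """Rename an annotation column in place, replacing any existing target.

    Hypothetical helper operating on {annotation name: list of values}.
    """
    # pop() removes the old entry; assignment overwrites new_name if present.
    annotations[new_name] = annotations.pop(old_name)


annotations = {
    "label": ["happy", "sad"],
    "label_alt": ["positive", "negative"],
}
rename_key(annotations, "label_alt", "label")
print(annotations)
```

After the call only one column remains, holding the values of the former "label_alt"; the original "label" values are gone.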

property speaker_counts: ndarray

Number of instances for each speaker.

property speaker_indices: ndarray

Indices into speakers array of corresponding speaker for each instance.

property speaker_names: List[str]

List of unique speakers in this dataset.

property subset: str

Name of clip subset used.

property subsets: Dict[str, List[str]]

Dict from subset name to set of clip names.

transpose_time()

Transpose the time and feature axis of each instance.
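
For a single instance stored as a two-dimensional array, swapping the time and feature axes amounts to a matrix transpose. The sketch below uses nested lists and a hypothetical `transpose` helper instead of NumPy arrays.

```python
def transpose(matrix):
    """Swap the two axes of a rectangular nested list."""
    return [list(column) for column in zip(*matrix)]


instance = [[1, 2, 3], [4, 5, 6]]  # 2 time steps x 3 features
transposed = transpose(instance)   # 3 features x 2 time steps
print(transposed)
```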

update_annotation(annot_name: str, annotations: PathLike | str | Mapping[str, Any] | Sequence[Any] | Series, dtype: Type | Literal['category'] | None = None) -> None

Add or update an annotation.

Parameters:
annot_name: str

The name of the annotation.

annotations: PathLike, str, mapping, DataFrame, Series or sequence

Annotations to add, similar to update_partition(). If PathLike or str, annotations are read from a CSV. If a dict, should be of the form {instance: annotation}. If a list, should have an annotation for each instance.

dtype: type, optional

The type of annotations for reading from CSV file. If the literal “category” is given the annotations are converted to the Pandas categorical dtype.

update_features(features: PathLike | str, **read_kwargs) -> None

Update the features matrix and feature names for this dataset.

Parameters:
features: os.PathLike or str

Path to a set of features or unique name of features in corpus features dir.

**read_kwargs:

Other arguments to pass to read_features().

update_ratings(rating_set: str, ratings: PathLike | str | Mapping[str, Mapping[str, Any]] | Series) -> None

Update a set of ratings.

Parameters:
rating_set: str

The name for this set of ratings.

ratings: PathLike, str, mapping, DataFrame, Series

The ratings to add. Must have a joint index where

use_subset(subset: str = 'default') -> None

Use a different subset of the instances.

Parameters:
subset: str

Name of subset to use. Default is “default” which uses the default subset specified in corpus_info.

property x: ndarray

The data matrix.

property y: ndarray

The class label array; one label per instance.