ertk.dataset.Dataset

class ertk.dataset.Dataset(corpus_info: PathLike | str, features: str | PathLike | None = None, subset: str = 'default', label: str = 'label')

Bases: object

Class representing a generic dataset, consisting of a set of features and optional partitions and annotations. Has various preprocessing methods.

An annotation assigns a scalar value to each instance in the dataset.

A partition is a partition of instances into disjoint groups (e.g. speakers). A partition should be complete in that each instance is in exactly one group in the partition. Each partition has a corresponding annotation with the same name, using the group names as categorical annotations.
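
The relationship between a partition and its matching annotation can be illustrated with a minimal sketch in plain Python. It does not use ERTK itself; the instance names, the "speaker" partition and the helper `is_complete` are all hypothetical, chosen only to show what "complete" means here.

```python
from collections import Counter

# Hypothetical instance names and a hypothetical "speaker" partition.
names = ["utt1", "utt2", "utt3", "utt4"]
speaker = {"utt1": "spk_a", "utt2": "spk_b", "utt3": "spk_a", "utt4": "spk_b"}


def is_complete(partition, names):
    """Return True if every instance is assigned to exactly one group."""
    # A dict maps each key to a single value, so completeness reduces to
    # checking that the keys cover exactly the set of instance names.
    return set(partition) == set(names)


# The corresponding categorical annotation: one group name per instance,
# in the same order as `names`.
speaker_annotation = [speaker[n] for n in names]
group_counts = Counter(speaker_annotation)
```

Under these assumptions, `speaker_annotation` plays the role of the annotation that shares its name with the partition.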

Parameters:

corpus_info: PathLike or str: Path to corpus info in YAML format.
features: PathLike or str, optional: Path to features file, or unique name of features in corpus features directory.
subset: str, optional: The subset of instances to use.
label: str, optional: Name of the annotation to use as the target label.

__init__(corpus_info: PathLike | str, features: str | PathLike | None = None, subset: str = 'default', label: str = 'label')

Methods

`__init__`(corpus_info[, features, subset, label])
`annotation_type`(annot_name)
`clip_arrays`(length)	Clips each array to the specified maximum length.
`clone`()
`copy`()
`frame_arrays`(frame_size, frame_shift[, ...])	Create a sequence of frames from the raw signal.
`get_annotations`(annot_name)	Get a list of annotations, one for each instance currently in the dataset.
`get_audio_paths`()
`get_group_counts`(annot_name)	Get group counts for a partition.
`get_group_indices`(annot_name)	Gets the group indices (i.e. indices into the groups array) for a given partition.
`get_group_names`(annot_name)	Get the names of groups in a partition.
`get_idx_for_names`(names)	Gets indices of instances corresponding to `names`.
`get_idx_for_split`()	Gets indices of instances corresponding to the selection given by `split`.
`get_ratings`(name[, rating_set])	Get per-annotator ratings of a specified column, for this dataset.
`init_corpus_info`(path)	Initialise corpus metadata from YAML.
`map_and_select`(map, select, remove)	Convenience function for mapping one or more partitions and then selecting one or more groups.
`map_classes`(mapping)	Modifies classes based on the given mapping.
`map_groups`(part_name, mapping)	Map group names in a partition.
`normalise`(partition[, normaliser])	Transforms the data matrix of this dataset in-place using some (offline) normalisation method.
`pad_arrays`([pad])	Pads each array to the nearest multiple of `pad` greater than the array size.
`remove_annotation`(annot_name)	Removes a set of annotations from the dataset.
`remove_classes`(*[, drop, keep])	Remove instances with labels not in `keep`.
`remove_groups`(part_name, *[, drop, keep])	Remove instances corresponding to groups from the given partition.
`remove_instances`(*[, drop, keep])	Remove instances from dataset.
`remove_ratings`(rating_set)	Delete a set of ratings for this dataset.
`rename_annotation`(old_name, new_name)	Renames an annotation.
`transpose_time`()	Transpose the time and feature axis of each instance.
`update_annotation`(annot_name, annotations[, ...])	Add or update an annotation.
`update_features`(features, **read_kwargs)	Update the features matrix and feature names for this dataset.
`update_labels`(labels)
`update_ratings`(rating_set, ratings)	Update a set of ratings.
`use_subset`([subset])	Use a different subset of the instances.

Attributes

`annotations`	Full annotation matrix for dataset.
`class_counts`	Number of instances for each class.
`classes`	List of unique class labels.
`corpus`	The corpus this dataset represents.
`description`	The descriptive name of this corpus.
`feature_names`	List of feature names.
`label_annot`	The annotation used as target label.
`labels`	List of labels for instances.
`n_classes`	Number of unique classes.
`n_features`	Number of features.
`n_instances`	Number of instances in this dataset.
`n_speakers`
`names`	List of instance names.
`partitions`	Partitions in this dataset.
`ratings`	Full ratings for dataset.
`speaker_counts`	Number of instances for each speaker.
`speaker_indices`	Indices into speakers array of corresponding speaker for each instance.
`speaker_names`	List of unique speakers in this dataset.
`speakers`
`subset`	Name of clip subset used.
`subsets`	Dict from subset name to set of clip names.
`x`	The data matrix.
`y`	The class label array; one label per instance.

property annotations: DataFrame: Full annotation matrix for dataset.

property class_counts: ndarray: Number of instances for each class.

property classes: List[str]: List of unique class labels.

clip_arrays(length: int): Clips each array to the specified maximum length.
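
As a rough illustration of what clipping to a maximum length means, here is a sketch using NumPy only. The helper `clip_to_length` is hypothetical and not part of ERTK, and it assumes that axis 0 of each array is time.

```python
import numpy as np


def clip_to_length(arrays, length):
    """Truncate each (time, features) array to at most `length` time steps."""
    # Slicing past the end of an array is safe, so shorter arrays are unchanged.
    return [a[:length] for a in arrays]


arrays = [np.zeros((5, 3)), np.zeros((2, 3))]
clipped = clip_to_length(arrays, 4)
```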

property corpus: str: The corpus this dataset represents.

property description: str: The descriptive name of this corpus.

property feature_names: List[str]: List of feature names.

frame_arrays(frame_size: int, frame_shift: int, max_frames: int | None = None): Create a sequence of frames from the raw signal.
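
The general idea of framing a raw signal can be sketched with NumPy as follows. This is not ERTK's implementation: the helper `frame_signal` is hypothetical, it handles a single 1-D signal, and it assumes that an incomplete trailing frame is dropped and that `max_frames` simply caps the number of frames. The library may behave differently (for example by padding).

```python
import numpy as np


def frame_signal(signal, frame_size, frame_shift, max_frames=None):
    """Split a 1-D signal into overlapping frames of `frame_size` samples."""
    if len(signal) < frame_size:
        # Too short for even one full frame.
        return np.empty((0, frame_size), dtype=signal.dtype)
    n_frames = 1 + (len(signal) - frame_size) // frame_shift
    if max_frames is not None:
        n_frames = min(n_frames, max_frames)
    starts = np.arange(n_frames) * frame_shift
    # Row i holds samples starts[i] .. starts[i] + frame_size - 1.
    return signal[starts[:, None] + np.arange(frame_size)]


signal = np.arange(10)
frames = frame_signal(signal, frame_size=4, frame_shift=2)
```

With these inputs, consecutive frames overlap by `frame_size - frame_shift` samples.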

get_annotations(annot_name: str) → ndarray

Get a list of annotations, one for each instance currently in the dataset.

Parameters:

annot_name: str: Annotation name.

Returns:

A pd.Series of values, one for each instance in the dataset, in the same order they appear in names and x.

get_group_counts(annot_name: str) → ndarray

Get group counts for a partition.

Parameters:

annot_name: str: The partition name.

Returns:

A NumPy array of counts, one for each group in this partition.

get_group_indices(annot_name: str) → ndarray

Gets the group indices (i.e. indices into the groups array) for a given partition.

Parameters:

annot_name: str: The partition name.

Returns:

A NumPy array of group indices for each instance in the dataset.
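
The relationship between group names, group indices and group counts can be sketched with `numpy.unique`. The speaker labels below are made up, and the sorted group order is an assumption of this sketch; ERTK may order groups differently.

```python
import numpy as np

# Hypothetical per-instance group labels for a "speaker" partition.
speakers_per_instance = np.array(["spk_b", "spk_a", "spk_b", "spk_c"])

# group_names:   unique groups (sorted)
# group_indices: for each instance, the index of its group in group_names
# group_counts:  number of instances in each group
group_names, group_indices, group_counts = np.unique(
    speakers_per_instance, return_inverse=True, return_counts=True
)
```

Indexing `group_names` with `group_indices` recovers the original per-instance labels.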

get_group_names(annot_name: str) → List[str]

Get the names of groups in a partition.

Parameters:

annot_name: str: Annotation name.

get_idx_for_names(names: Collection[str]) → ndarray

Gets indices of instances corresponding to names.

Parameters:

names: collection of str: The names to get indices for.

Returns:

idx: np.ndarray: The indices corresponding to names, in order.
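
A name-to-index lookup of this kind can be sketched as below. The helper `idx_for_names` is hypothetical and only illustrates returning indices in the order of the requested names.

```python
import numpy as np


def idx_for_names(all_names, query):
    """Return the position of each queried name, in the order of `query`."""
    position = {name: i for i, name in enumerate(all_names)}
    return np.array([position[name] for name in query], dtype=int)


indices = idx_for_names(["utt1", "utt2", "utt3", "utt4"], ["utt3", "utt1"])
```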

get_idx_for_split(split: str | Dict[str, Collection[str]] | DataSelector, return_complement: Literal[False] = False) → ndarray

get_idx_for_split(split: str | Dict[str, Collection[str]] | DataSelector, return_complement: Literal[True]) → Tuple[ndarray, ndarray]

Gets indices of instances corresponding to the selection given by split.

Parameters:

split: str, dict or DataSelector: Either a string (the name of a subset to select, a specification of groups to select, or a path to a config file containing such a selection), or a mapping object containing the groups to select.
return_complement: bool: If True, return a tuple of idx, comp_idx where comp_idx contains the complement indices not in the split.

Returns:

idx: np.ndarray: The corresponding indices.
comp_idx: np.ndarray, optional: If return_complement is True, this is also returned and contains the complement indices.
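
The meaning of the complement indices can be illustrated with a small NumPy sketch. The helper `split_indices` is hypothetical and selects by group membership in a single partition only; the real method accepts richer selections.

```python
import numpy as np


def split_indices(group_per_instance, selected_groups, return_complement=False):
    """Indices of instances in `selected_groups`, optionally with the rest."""
    mask = np.isin(group_per_instance, list(selected_groups))
    idx = np.flatnonzero(mask)
    if return_complement:
        # The complement holds every index that is not in the split.
        return idx, np.flatnonzero(~mask)
    return idx


speakers = np.array(["spk_a", "spk_b", "spk_a", "spk_c"])
idx, comp_idx = split_indices(speakers, {"spk_a"}, return_complement=True)
```

Together, `idx` and `comp_idx` cover every instance exactly once, which is convenient for train/test style splits.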

get_ratings(name: str, rating_set: str = 'ratings') → Series

Get per-annotator ratings of a specified column, for this dataset.

Parameters:

name: str: The name of the rating column to get.
rating_set: str: The name of the rating set to get.

Returns:

pd.Series: Pandas Series with (name, rater) multiindex.

init_corpus_info(path: PathLike | str) → None

Initialise corpus metadata from YAML.

Parameters:

path: os.PathLike or str: The path to a YAML file containing corpus metadata.

property label_annot: str: The annotation used as target label.

property labels: ndarray: List of labels for instances.

map_and_select(map: Mapping[str, Mapping[str, str]], select: Mapping[str, str | Collection[str]], remove: Mapping[str, str | Collection[str]]) → None

Convenience function for mapping one or more partitions and then selecting one or more groups.

Parameters:

map: mapping: The groups mapping. May have one or more partitions.
select: mapping: Mapping from partitions to groups to select.
remove: mapping: Mapping from partitions to groups to remove.

map_classes(mapping: Mapping[str, str]): Modifies classes based on the given mapping. Keys not corresponding to classes are ignored. The new classes will be sorted lexicographically.
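
The behaviour described above can be sketched in plain Python. The helper `map_labels` and the emotion labels are hypothetical. The sketch assumes that labels without an entry in the mapping are kept unchanged.

```python
def map_labels(labels, mapping):
    """Apply `mapping` to labels and return the new labels and class list."""
    # Labels with no entry in the mapping are left as they are.
    new_labels = [mapping.get(label, label) for label in labels]
    # Classes are the unique labels, sorted lexicographically.
    return new_labels, sorted(set(new_labels))


labels = ["happy", "excited", "sad", "neutral"]
mapping = {"excited": "happy", "bored": "neutral"}  # "bored" is not a class
new_labels, classes = map_labels(labels, mapping)
```

Here the "bored" key has no effect because no instance carries that label.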

map_groups(part_name: str, mapping: Mapping[str, str]) → None

Map group names in a partition.

Parameters:

part_name: str: Name of partition.
mapping: dict: Group name mapping.

property n_classes: int: Number of unique classes.

property n_features: int: Number of features.

property n_instances: int: Number of instances in this dataset.

property names: Index: List of instance names.

normalise(partition: str, normaliser: TransformerMixin = StandardScaler())

Transforms the data matrix of this dataset in-place using some (offline) normalisation method.

Parameters:

partition: str: The partition to apply in a per-group fashion. If “all”, then perform global normalisation on all the data.
normaliser: TransformerMixin: The transform to apply. Must implement fit_transform().
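
Per-group normalisation can be sketched with NumPy alone. This is not ERTK's code: the helper `standardise_per_group` is hypothetical, it uses a hand-written z-score in place of a scikit-learn transformer, and it returns a copy rather than modifying the data in place.

```python
import numpy as np


def standardise_per_group(x, groups):
    """Z-score each feature column separately within each group."""
    x = x.astype(float)  # astype returns a copy, so the input is untouched
    for g in np.unique(groups):
        rows = groups == g
        mean = x[rows].mean(axis=0)
        std = x[rows].std(axis=0)
        std[std == 0] = 1.0  # avoid division by zero for constant features
        x[rows] = (x[rows] - mean) / std
    return x


x = np.array([[1.0, 10.0], [3.0, 10.0], [10.0, 0.0], [20.0, 4.0]])
groups = np.array(["spk_a", "spk_a", "spk_b", "spk_b"])
x_norm = standardise_per_group(x, groups)
```

Passing a single group label for every instance corresponds to global normalisation over all the data.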

pad_arrays(pad: int = 32): Pads each array to the nearest multiple of pad greater than the array size. Assumes axis 0 of x is time.
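
Padding to a multiple of `pad` can be sketched as follows. The helper `pad_to_multiple` is hypothetical. It assumes zero padding at the end of the time axis and that a length which is already a multiple of `pad` is left unchanged; the library may differ on either point.

```python
import numpy as np


def pad_to_multiple(a, pad=32):
    """Zero-pad axis 0 of `a` up to the next multiple of `pad`."""
    length = a.shape[0]
    target = -(-length // pad) * pad  # ceiling division, then scale back up
    # Pad only at the end of axis 0; leave all other axes untouched.
    widths = [(0, target - length)] + [(0, 0)] * (a.ndim - 1)
    return np.pad(a, widths)


padded = pad_to_multiple(np.ones((50, 3)))
```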

property partitions: Set[str]: Partitions in this dataset.

property ratings: Dict[str, DataFrame]: Full ratings for dataset.

remove_annotation(annot_name: str) → None

Removes a set of annotations from the dataset. This does not remove any instances.

Parameters:

annot_name: str: Annotation name.

remove_classes(*, drop: Collection[str] | None = None, keep: Collection[str] | None = None): Remove instances with labels not in keep.

remove_groups(part_name: str, *, drop: Collection[str] | None = None, keep: Collection[str] | None = None) → None

Remove instances corresponding to groups from the given partition.

Parameters:

part_name: str: The partition name.
drop: collection of str: The groups to remove in given partition.
keep: collection of str: The groups to keep in given partition.

remove_instances(*, drop: Collection[str] | None = None, keep: Collection[str] | None = None)

Remove instances from dataset. Recalculate annotations, partitions, etc.

Parameters:

drop: collection of str: Instances to drop.
keep: collection of str: Instances to keep. Exactly one of drop and keep should be given.
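
The "exactly one of drop and keep" convention can be sketched in plain Python. The helper `select_instances` is hypothetical, and raising `ValueError` is an assumption of this sketch rather than documented ERTK behaviour.

```python
def select_instances(names, *, drop=None, keep=None):
    """Return the names that remain after applying `drop` or `keep`."""
    # Reject the call if both or neither of the arguments are given.
    if (drop is None) == (keep is None):
        raise ValueError("Exactly one of drop and keep should be given.")
    if keep is not None:
        keep = set(keep)
        return [n for n in names if n in keep]
    drop = set(drop)
    return [n for n in names if n not in drop]


remaining = select_instances(["utt1", "utt2", "utt3"], drop=["utt2"])
```

The original instance order is preserved regardless of the order in `drop` or `keep`.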

remove_ratings(rating_set: str) → None

Delete a set of ratings for this dataset.

Parameters:

rating_set: str: The name of a set of ratings.

rename_annotation(old_name: str, new_name: str) → None

Renames an annotation. This is useful for example if you want to select a different labelling to use. If an annotation already exists with the given new_name, then it is replaced destructively.

Parameters:

old_name: str: The name of the annotation to rename.
new_name: str: The new name of the annotation.

property speaker_counts: ndarray: Number of instances for each speaker.

property speaker_indices: ndarray: Indices into speakers array of corresponding speaker for each instance.

property speaker_names: List[str]: List of unique speakers in this dataset.

property subset: str: Name of clip subset used.

property subsets: Dict[str, List[str]]: Dict from subset name to set of clip names.

transpose_time(): Transpose the time and feature axis of each instance.

update_annotation(annot_name, annotations[, ...])

Add or update an annotation.

Parameters:

annot_name: str: The name of the annotation.
annotations: PathLike, str, mapping, DataFrame, Series or sequence: Annotations to add, similar to update_partition(). If PathLike or str, annotations are read from a CSV. If a dict, should be of the form {instance: annotation}. If a list, should have an annotation for each instance.
dtype: type, optional: The type of annotations for reading from CSV file. If the literal “category” is given the annotations are converted to the Pandas categorical dtype.
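
The difference between the dict form and the list form of `annotations` can be sketched in plain Python. The helper `align_annotations` is hypothetical. How ERTK treats instances missing from a dict is not stated here, so this sketch simply lets the lookup fail.

```python
def align_annotations(names, annotations):
    """Return one annotation per instance, in the order of `names`."""
    if isinstance(annotations, dict):
        # Dict form: {instance: annotation}, looked up by instance name.
        return [annotations[n] for n in names]
    # Sequence form: already ordered, must have one entry per instance.
    annotations = list(annotations)
    if len(annotations) != len(names):
        raise ValueError("Need one annotation per instance.")
    return annotations


names = ["utt1", "utt2", "utt3"]
from_dict = align_annotations(names, {"utt3": 0.9, "utt1": 0.1, "utt2": 0.5})
from_list = align_annotations(names, [0.1, 0.5, 0.9])
```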

update_features(features: PathLike | str, **read_kwargs) → None

Update the features matrix and feature names for this dataset.

Parameters:

features: os.PathLike or str: Path to a set of features or unique name of features in corpus features dir.
**read_kwargs: Other arguments to pass to read_features().

update_ratings(rating_set: str, ratings: PathLike | str | Mapping[str, Mapping[str, Any]] | Series) → None

Update a set of ratings.

Parameters:

rating_set: str: The name for this set of ratings.
ratings: PathLike, str, mapping, DataFrame, Series: The ratings to add. Must have a joint index where

use_subset(subset: str = 'default') → None

Use a different subset of the instances.

Parameters:

subset: str: Name of subset to use. Default is “default” which uses the default subset specified in corpus_info.

property x: ndarray: The data matrix.

property y: ndarray: The class label array; one label per instance.