dgs.models.dataset.dataset.VideoDataset.transform_crop_resize

static VideoDataset.transform_crop_resize() → torchvision.transforms.v2.Compose

Given a single image with its corresponding bounding boxes and key-points, obtain one cropped image per bounding box, with the key-points converted to the local coordinates of each crop.

This transform expects its input as a dict with the following structure.

>>> structured_input: dict[str, Any] = {
    "image": tv_tensors.Image,
    "box": tv_tensors.BoundingBoxes,
    "keypoints": torch.Tensor,
    "output_size": ImgShape,
    "mode": str,
}
Returns:

A torchvision.transforms.v2.Compose transform that accepts a dict as input.

After the transform has been applied, some values of the dict have different shapes:

image

Now contains the image crops as a tensor of shape [N x C x H x W].

bboxes

Zero, one, or multiple bounding boxes for this image as a tensor of shape [N x 4]. The bounding boxes are converted to the XYWH format.

coordinates

Now contains the joint coordinates of every detection, expressed in the local coordinates of its crop, as a tensor of shape [N x J x 2|3].
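
The following is a minimal pure-Python sketch of the coordinate arithmetic described above, not the library's implementation. It assumes that the incoming box is given as XYXY, that the output size is ordered as (height, width), and that the crop is resized by plain stretching without padding; other resize modes would need a different mapping. The helper names xyxy_to_xywh and to_local_keypoints are hypothetical and exist only for this illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
BoxXYWH = Tuple[float, float, float, float]


def xyxy_to_xywh(box: Sequence[float]) -> BoxXYWH:
    """Convert (x_min, y_min, x_max, y_max) to (x_min, y_min, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


def to_local_keypoints(
    keypoints: Sequence[Point],
    box_xywh: BoxXYWH,
    output_size: Tuple[int, int],
) -> List[Point]:
    """Map global image coordinates into a crop resized to output_size.

    Assumption: output_size is (height, width) and the crop is stretched
    to that size without padding or aspect-ratio preservation.
    """
    left, top, width, height = box_xywh
    out_h, out_w = output_size
    return [
        # Shift to the crop origin, then scale to the resized crop.
        ((x - left) * out_w / width, (y - top) * out_h / height)
        for x, y in keypoints
    ]


# One detection: a 200 x 400 box resized to a 256 x 128 (H x W) crop.
box = xyxy_to_xywh((100.0, 50.0, 300.0, 450.0))
local = to_local_keypoints(
    [(200.0, 250.0), (100.0, 50.0)], box, output_size=(256, 128)
)
print(box)
print(local)
```

Here the key-point at the box centre ends up at the centre of the resized crop, and the key-point on the top-left corner of the box maps to the crop origin.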