Metrics

To evaluate the performance of different trackers, metrics were defined. These metrics either evaluate the tracking of the detected people or they evaluate the detections themselves.

Detection Metrics

  • IoU: intersection over union between two bounding boxes

    \[\text{IoU} &= \frac{\text{Area of Overlap}}{\text{Area of Union}}\]
  • OKS: object keypoint similarity between two sets of keypoints (see e.g. here)

    \[\begin{split}\texttt{OKS} &= \frac{\sum_i \texttt{KS}_i \cdot \delta \left( v_i > 0 \right)}{\sum_i \delta \left( v_i > 0 \right)} = \frac{\texttt{key point similarity}}{\texttt{nof labeled}}\\ \texttt{KS}_i &= \exp{\left( \frac{- d^2_i}{2s^2 k^2_i} \right)}\end{split}\]
  • LocA: location accuracy

    \[\texttt{LocA} &= \frac{1}{|\texttt{TP}|}\sum_{c\in \texttt{TP}}\texttt{IoU}\left( c \right)\]
  • DetA: detection accuracy

    \[\texttt{DetA} &= \frac{|\texttt{TP}|}{|\texttt{TP}| + |\texttt{FP}| + |\texttt{FN}|}\]

Tracking Metrics

The performance of a tracker is evaluated after the whole video sequence was processed and can be evaluated using the following metrics:

Multi-Object Tracking Accuracy

The multi-object tracking accuracy or \(\texttt{MOTA}\) was the gold standard for tracking. But it puts more weight on detection accuracy than on tracking accuracy, thus resulting in imprecise tracking evaluation metrics.

\[\begin{split}\texttt{MOTA} &= 1 - \frac{|\texttt{FN}| + |\texttt{FP}| + |\texttt{IDSW}|}{|\texttt{TP}|} \quad \begin{cases} \texttt{FN}\text{: nof false negatives}\\ \texttt{FP}\text{: nof false positives}\\ \texttt{IDSW}\text{: nof ID switches}\\ \texttt{TP}\text{: nof true positives} \end{cases}\end{split}\]

Within this code, the \(\texttt{MOTA}\) is computed using the evaluation tools of the respective datasets. For the DanceTrack dataset the DanceTrack internal TrackEval package was used. For the PoseTrack21 dataset, the poseval package was used to be able to evaluate more joint-dependent parameters of the PoseTrack21 dataset. All modules are (or at least can be) loaded as submodules while setting up the DGS framework.

Higher-Order Tracking Accuracy

THe higher order tracking accuracy or \(\texttt{HOTA}\) was developed to balance between the detection and tracking accuracy. It can only be computed over the whole video sequence after all predictions were made. A more in depth explanation of the HOTA can be found on autonomousvision.github.io .

\[\begin{split}\texttt{HOTA}_\alpha &= \sqrt{\texttt{DetA}_\alpha \cdot \texttt{AssA}_\alpha}\\ &= \sqrt{\frac{\sum_{c \in \lbrace \texttt{TP} \rbrace} \mathcal{A} \left( c \right)}{|\texttt{TP}| + |\texttt{FN}| + |\texttt{FP}|}}\\ \\ \texttt{Ass-IoU} &= \frac{|\texttt{TPA}|}{|\texttt{TPA}| + |\texttt{FNA}| + |\texttt{FPA}|}\\ \texttt{AssA} &= \frac{1}{|\texttt{TP}|}\sum_{c\in \texttt{TP}}\texttt{Ass-IoU}\left(c\right)\\ \\ \mathcal{A} \left( c \right) &= \frac{|\texttt{TPA}\left(c\right)|}{|\texttt{TPA}\left(c\right)| + |\texttt{FNA}\left(c\right)| + |\texttt{FPA}\left( c \right) |} \quad \begin{cases} \texttt{TPA}\text{: True Positive Associations}\\ \texttt{FNA}\text{: False Negative Associations}\\ \texttt{FPA}\text{: False Positive Associations} \end{cases}\end{split}\]

The overall \(\texttt{HOTA}\) is the integral over the \(\texttt{HOTA}_\alpha\) for all values of \(\alpha\).

\[\begin{split}\texttt{HOTA} &= \int_0^1 \texttt{HOTA}_\alpha \, d\alpha\\\end{split}\]

Within this code, the \(\texttt{HOTA}\) is computed using the evaluation tools of the respective datasets. For the DanceTrack dataset the DanceTrack internal TrackEval package was used. For the PoseTrack21 dataset, the PoseTrack21 Evaluation Kit was used. All modules are (or at least can be) loaded as submodules while setting up the DGS framework.