
Horse-10:

an animal pose estimation benchmark for out-of-domain robustness


This paper was published at the Winter Conference on Applications of Computer Vision (WACV) 2021! [arXiv version][WACV Paper][Suppl].

This paper introduces the new benchmarks and new SOTA EfficientNets for DeepLabCut. Note that a shorter workshop version of our paper was presented at the Uncertainty & Robustness in Deep Learning Workshop at ICML 2020.

If you use this data or are inspired by this work, we ask you to please cite:

@InProceedings{Mathis_2021_WACV,
    author    = {Mathis, Alexander and Biasi, Thomas and Schneider, Steffen and Yuksekgonul, Mert and Rogers, Byron and Bethge, Matthias and Mathis, Mackenzie W.},
    title     = {Pretraining Boosts Out-of-Domain Robustness for Pose Estimation},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {1859-1868}
}

Motivation

Pose estimation is an important tool for measuring behavior, and is thus widely used in technology, medicine, and biology. Due to innovations in both deep learning algorithms and large-scale datasets, human pose estimation has become very powerful. However, typical human pose estimation benchmarks, such as MPII Pose and COCO, contain many different individuals (>10K) in different contexts, but only very few example postures per individual. In real-world applications of pose estimation, users want to estimate the location of user-defined body parts by labeling only a few hundred frames on a small subset of individuals, yet want this to generalize to new individuals. Thus, one naturally asks the following question: assume you have trained an algorithm that performs with high accuracy on a given (individual) animal for the whole repertoire of movement - how well will it generalize to different individuals that have a slightly or dramatically different appearance? Unlike in common human pose estimation benchmarks, here the setting is that datasets have many (annotated) poses per individual (>200) but only a few individuals (1-25).

To allow the field to tackle this challenge, we developed a novel benchmark, called Horse-10, comprising 30 diverse Thoroughbred horses, for which 22 body parts were labeled by an expert in 8,114 frames. The horses have various coat colors, and the "in-the-wild" aspect of the data, collected at various Thoroughbred yearling sales and farms, adds additional complexity.

Moreover, we present Horse-C to contrast the domain shift inherent in the Horse-10 dataset with domain shift induced by common image corruptions.

The data:

  • >8,000 expertly labeled frames across 30 individual Thoroughbred horses (called Horse-30).

The Tasks:

  • Horse-10: Train on a subset of individuals (10) and evaluate on held-out “out-of-domain” horses (20).

  • Horse-C: We apply common image corruptions (Hendrycks et al., 2019) to the Horse-10 dataset. The resulting Horse-C images are corrupted with different forms of digital transforms, blurring filters, point-wise noise, or weather conditions. All corruptions are applied following the evaluation protocol and implementation of Michaelis et al. 2019. In total, we arrive at 75 variants of Horse-C (15 different corruptions at 5 different severities), totaling over 600K frames. A generation sketch follows this list.
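
As an illustration, corrupted variants can be generated with the imagecorruptions package, the implementation accompanying Michaelis et al. 2019 (the input and output paths below are placeholders, not part of the official release):

    import numpy as np
    from PIL import Image
    from imagecorruptions import corrupt, get_corruption_names

    # Placeholder path; any Horse-10 frame loaded as an RGB uint8 array works.
    image = np.asarray(Image.open("horse_frame.png").convert("RGB"))

    # 15 corruption types x 5 severity levels = the 75 Horse-C variants.
    for name in get_corruption_names():
        for severity in range(1, 6):
            corrupted = corrupt(image, corruption_name=name, severity=severity)
            Image.fromarray(corrupted).save(f"{name}_severity{severity}.png")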

Horse-10:

 

Train:

The ground truth training data is provided as 3 splits of 10 horses each. The download provides a project compatible with loading into the DeepLabCut framework, but the ground truth labels/training data can easily be loaded with pandas to accommodate your framework (example loader here).
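
For example, a minimal loading sketch with pandas (the horse folder and scorer names below are placeholders; this assumes the download follows the standard DeepLabCut labeled-data layout):

    import pandas as pd

    # DeepLabCut stores annotations per horse as
    # labeled-data/<horse>/CollectedData_<scorer>.h5 (a .csv is also written);
    # the folder and scorer names here are placeholders.
    df = pd.read_hdf("labeled-data/ExampleHorse/CollectedData_scorer.h5")
    # or: pd.read_csv("labeled-data/ExampleHorse/CollectedData_scorer.csv",
    #                 header=[0, 1, 2], index_col=0)

    # Columns form a MultiIndex of (scorer, bodypart, x/y coordinate).
    bodyparts = df.columns.get_level_values("bodyparts").unique()

    # Reshape into a (num_frames, num_bodyparts, 2) array of (x, y) pixels.
    keypoints = df.to_numpy().reshape(len(df), len(bodyparts), 2)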

Please do NOT train on all three splits simultaneously. You must train on each split independently (as some horses are considered out-of-domain in the other splits for evaluation!). Integrity matters!

The download also includes all of the Horse-30 images and annotations (and is thus ~850 MB).

NOTE: The Horse-10/Horse-30 dataset is (c) by Rogers, Byron and Mathis, Alexander and Mathis, Mackenzie W. Horse-10 is licensed under an Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

Evaluation Metrics:

  • average PCK@0.3, within-domain and out-of-domain (held-out horses): the fraction of machine-predicted keypoints that fall within a given distance of the human-labeled ground truth. We use a matching threshold of 30% of the head segment length (nose to eye for horses), which is computed by taking the median over all annotated images per horse. See the metric sketch after this list.

  • Normalized Error (RMSE): due to the variation in the apparent size of the horses, we report RMSE relative to the nose-to-eye distance, with 1 equaling the full nose-to-eye distance (~18 pixels on average).
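
For reference, here is a minimal sketch of both metrics as described above (the array shapes and NaN handling for unannotated keypoints are our assumptions, not the official evaluation code):

    import numpy as np

    def pck(pred, gt, threshold):
        """Fraction of predicted keypoints within `threshold` pixels of the
        ground truth. pred, gt: (num_images, num_bodyparts, 2) arrays; NaNs
        in gt mark unannotated keypoints and are ignored."""
        dist = np.linalg.norm(pred - gt, axis=-1)
        valid = ~np.isnan(dist)
        return (dist[valid] <= threshold).mean()

    def normalized_rmse(pred, gt, nose_eye_dist):
        """RMSE in units of a horse's (median) nose-to-eye distance."""
        dist = np.linalg.norm(pred - gt, axis=-1)
        valid = ~np.isnan(dist)
        return np.sqrt((dist[valid] ** 2).mean()) / nose_eye_dist

    # PCK@0.3 for one horse: the threshold is 30% of that horse's median
    # nose-to-eye distance, e.g. pck(pred, gt, 0.3 * nose_eye_dist).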

Evaluation:

To evaluate on out-of-domain horses, please run evaluation on each split's held-out test images. If you have any questions, we are happy to help, so please reach out to: alexander.mathis@epfl.ch.

Horse-10 Leaderboard on PapersWithCode:

https://paperswithcode.com/sota/animal-pose-estimation-on-horse-10