Automatically identifying wild animals in camera trap images with deep learning

Norouzzadeh M
Nguyen A
Kosmala M
Swanson A
Parker C
Clune J

Having accurate, detailed, and up-to-date information about wildlife location and behavior across broad geographic areas would revolutionize our ability to study, conserve, and manage species and ecosystems. Currently, such data are mostly gathered manually at great expense, and thus are sparsely and infrequently collected. Here we investigate the ability to automatically, accurately, and inexpensively collect such data, which could transform many fields of biology, ecology, and zoology into "big data" sciences. Motion-sensor cameras called "camera traps" enable pictures of wildlife to be collected inexpensively, unobtrusively, and at high volume. However, identifying the animals, animal attributes, and behaviors in these pictures remains an expensive, time-consuming, manual task often performed by researchers, hired technicians, or crowdsourced teams of human volunteers. In this paper, we demonstrate that such data can be automatically extracted by deep neural networks (also known as deep learning), a cutting-edge type of artificial intelligence. In particular, we use the existing human-labeled images from the Snapshot Serengeti dataset to train deep convolutional neural networks to identify 48 species in 3.2 million images taken in Tanzania's Serengeti National Park. The trained networks automatically identify animals with over 92% accuracy, and we expect that number to improve rapidly in years to come. More importantly, we can choose to have our system classify only the images it is highly confident about, allowing valuable human time to be focused only on challenging images. In this case, our system can automate animal identification for 98.2% of the data while still performing at the same 96.6% accuracy level as crowdsourced teams of human volunteers, saving approximately 8.3 years (at 40 hours per week) of human labeling effort (i.e., over 17,000 hours) on a 3.2-million-image dataset.
Those efficiency gains immediately highlight the importance of using deep neural networks to automate data extraction from camera-trap images. The improvements in accuracy we expect in years to come suggest that this technology could enable the inexpensive, unobtrusive, high-volume, and perhaps even real-time collection of information about vast numbers of animals in the wild.
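
The idea of classifying only high-confidence images and leaving the rest to humans can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the threshold value, and the toy probabilities are all hypothetical, and the network is assumed to output one probability per class for each image. A small helper also converts the stated 8.3 work-years into hours, assuming 52 working weeks per year, which gives a figure consistent with "over 17,000 hours".

```python
def triage_by_confidence(probs, threshold):
    """Return (predicted_class, auto_accept) lists for rows of class probabilities.

    `threshold` is a hypothetical cutoff on the top-class probability; images
    below it would be routed to human labelers instead of being auto-labeled.
    """
    predicted, auto_accept = [], []
    for row in probs:
        top = max(range(len(row)), key=lambda k: row[k])  # index of most likely class
        predicted.append(top)
        auto_accept.append(row[top] >= threshold)
    return predicted, auto_accept


def coverage_and_accuracy(predicted, auto_accept, labels):
    """Fraction of images handled automatically, and accuracy on that subset."""
    kept = [(p, y) for p, a, y in zip(predicted, auto_accept, labels) if a]
    coverage = len(kept) / len(labels)
    accuracy = sum(p == y for p, y in kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


def labeling_hours(work_years, hours_per_week=40, weeks_per_year=52):
    """Convert full-time work-years to hours (52 weeks/year is an assumption)."""
    return work_years * weeks_per_year * hours_per_week


# Toy data invented for illustration (not from the paper): 4 images, 3 classes.
toy_probs = [
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.05, 0.93, 0.02],
    [0.10, 0.15, 0.75],
]
toy_labels = [0, 1, 1, 2]

predicted, auto_accept = triage_by_confidence(toy_probs, threshold=0.9)
coverage, accuracy = coverage_and_accuracy(predicted, auto_accept, toy_labels)
print(f"automated fraction: {coverage:.2f}, accuracy on automated subset: {accuracy:.2f}")
print(f"8.3 work-years is about {labeling_hours(8.3):.0f} hours")
```

Raising the threshold generally lowers the automated fraction and raises accuracy on the automated subset. One plausible way to choose it, not confirmed by this abstract, would be to pick the lowest threshold on held-out labeled images that still matches the target accuracy (here, the 96.6% reported for volunteers).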

Figure 1: Fixed cameras are usually mounted on trees or posts to gather data about animals that pass before them. They are triggered by motion and/or infrared sensors. They can capture pictures of animals automatically, inexpensively, and without disturbing the animals. Images from [Swanson, Alexandra, et al. "Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna." Scientific Data 2 (2015): 150026.]

Pub. Info: 
arXiv 1703.05830