Abstract: |
(author) Over the last decade, interest in the study of crowd phenomena has spread across multiple fields, including computer vision, physics, sociology, simulation and visualization. Crowd studies can be performed at different levels of granularity: a finer micro-analysis, which aims to detect and then track each pedestrian individually, and a coarser macro-analysis, which aims to model the crowd as a whole.
One of the most difficult challenges when working with human crowds is that the usual pedestrian detection methodologies do not scale well to the case where only heads are visible, for reasons such as the absence of background, high visual homogeneity, the small size of the objects, and heavy occlusions. For this reason, most micro-analysis studies based on pedestrian detection and tracking are performed on low- to medium-density crowds, whereas macro-analysis through density estimation and people counting is better suited to high-density crowds, where the exact position of each individual is not necessary. Nevertheless, analyzing specific events involving high-density crowds, in order to monitor the flow and prevent disasters such as stampedes, requires a complete understanding of the scene.
This study deals with pedestrian detection in high-density crowds from a single-camera system, striving to obtain localized detections of all the individuals who are part of an extremely dense crowd. The detections can then be used both to obtain a robust density estimation and to initialize a tracking algorithm. Supervised learning techniques are well suited to difficult problems such as this application. However, two questions arise: which classifier is best adapted to the considered environment, and which data to learn from. We cast the detection problem as a Multiple Classifier System (MCS) composed of two different ensembles of classifiers, the first based on SVM (SVM-ensemble) and the second based on CNN (CNN-ensemble). The two ensembles are combined using Belief Function Theory (BFT), with a fusion method designed to exploit their respective strengths for pixel-wise classification. The SVM-ensemble is composed of several SVM detectors based on different gradient, texture and orientation descriptors, which tackle the problem from different perspectives.
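To make the BFT-based combination more concrete, the following is a minimal sketch of fusing two classifier outputs for a single pixel. It assumes a two-class frame (head, background), classical discounting to move part of each classifier's belief to ignorance, and Dempster's rule for the combination. The function names, the `reliability` parameter and the numeric values are illustrative assumptions; the abstract does not specify how the masses are actually allocated.

```python
from typing import Dict

# A mass function on the frame {head, background}.
# "H" = head, "B" = background, "HB" = the whole frame (ignorance).
Mass = Dict[str, float]


def discounted_mass(p_head: float, reliability: float) -> Mass:
    """Hypothetical helper: turn a calibrated head probability into a mass
    function, assigning the share (1 - reliability) to ignorance."""
    return {
        "H": reliability * p_head,
        "B": reliability * (1.0 - p_head),
        "HB": 1.0 - reliability,
    }


def dempster(m1: Mass, m2: Mass) -> Mass:
    """Combine two mass functions with Dempster's rule (normalized)."""
    # Conflict: one source supports head while the other supports background.
    conflict = m1["H"] * m2["B"] + m1["B"] * m2["H"]
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    norm = 1.0 - conflict
    head = m1["H"] * m2["H"] + m1["H"] * m2["HB"] + m1["HB"] * m2["H"]
    back = m1["B"] * m2["B"] + m1["B"] * m2["HB"] + m1["HB"] * m2["B"]
    both = m1["HB"] * m2["HB"]
    return {"H": head / norm, "B": back / norm, "HB": both / norm}


# Illustrative values only: one pixel scored by an SVM-based and a CNN-based source.
m_svm = discounted_mass(p_head=0.8, reliability=0.9)
m_cnn = discounted_mass(p_head=0.6, reliability=0.7)
fused = dempster(m_svm, m_cnn)
```

The mass left on `"HB"` after fusion is what distinguishes this representation from a plain probability: it measures how much of the evidence remains uncommitted to either class.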
BFT allows us to take into account imprecision in addition to the uncertainty value provided by each classifier. We consider this imprecision to come from possible errors in the calibration procedure and from the heterogeneity of neighboring pixels in the image space, owing to the similar resolution of the target (head) and of the descriptor. However, the scarcity of labeled data for specific dense-crowd contexts makes it difficult to obtain robust training and validation sets. By exploiting belief functions derived directly from the combination of the classifiers, we therefore propose an evidential Query-by-Committee (QBC) active learning algorithm that automatically selects the most informative training samples.
In parallel, we explore deep learning techniques by casting the problem as a segmentation task in the presence of soft labels, with a fully convolutional network architecture designed to recover small objects (heads) through a tailored use of dilated convolutions. To obtain a pixel-wise measure of the reliability of the network's predictions, we create a CNN-ensemble by means of dropout at inference time, and we combine the resulting realizations in the context of BFT.
To conclude, we show that the dense output map given by the MCS can be employed not only for pedestrian detection at the microscopic level, but also to perform macroscopic analysis, bridging the gap between the two levels of granularity. We therefore finally focus on people counting, proposing an evaluation method that can be applied at every scale. This method is more precise in evaluating error and uncertainty (since it disregards possible compensations) and more useful for the modeling community, which could use it to improve and validate local density estimation. |
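The remark about compensations in the counting evaluation can be illustrated with a small sketch. It assumes that the predicted and reference counts are available as 2D maps and that the error at a given scale is the sum of absolute count differences over non-overlapping square cells. The function names and the grid-based formulation are assumptions made for illustration, not the evaluation method proposed in the thesis.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[float]]


def cell_sums(grid: Grid, cell: int) -> List[float]:
    """Sum a 2D map over non-overlapping cell x cell blocks.
    Blocks on the right and bottom edges may be smaller."""
    rows, cols = len(grid), len(grid[0])
    sums = []
    for r0 in range(0, rows, cell):
        for c0 in range(0, cols, cell):
            block = (
                grid[r][c]
                for r in range(r0, min(r0 + cell, rows))
                for c in range(c0, min(c0 + cell, cols))
            )
            sums.append(sum(block))
    return sums


def count_error_at_scale(pred: Grid, truth: Grid, cell: int) -> float:
    """Sum of absolute per-cell count errors at the given cell size."""
    return sum(
        abs(p - t)
        for p, t in zip(cell_sums(pred, cell), cell_sums(truth, cell))
    )


# Toy maps: the same total count (2), but every head is misplaced.
truth = [[1, 0], [0, 1]]
pred = [[0, 1], [1, 0]]
global_error = count_error_at_scale(pred, truth, cell=2)  # whole image
local_error = count_error_at_scale(pred, truth, cell=1)   # per pixel
```

In this toy case the whole-image error is zero because over- and under-estimates cancel out, while the per-pixel error exposes all four misplacements. Evaluating at several cell sizes shows at which scale a density estimate becomes reliable.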