Our model is based on recent scene information. Given the last $n$ observations of a pixel, denoted by $x_i$, $i = 1, \dots, n$, in the $d$-dimensional observation space $\mathbb{R}^d$, which comprises all the sensor data values, it is possible to estimate the probability density function (pdf) of each pixel with respect to all previously observed values:
$$\hat{P}(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i), \qquad K_H(u) = |H|^{-1/2} K(H^{-1/2} u)$$
where K is a multivariate kernel and H is the bandwidth matrix, a symmetric positive definite d × d matrix. The choice of the bandwidth matrix H is the single most important factor affecting the estimation accuracy, since it controls the amount and orientation of the induced smoothing. Diagonal bandwidth matrices allow a different amount of smoothing in each dimension and are the most widespread for computational reasons. The most commonly used kernel function is the Normal function; in our approach, N(0, H) is selected.
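With a diagonal bandwidth matrix $H = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$ and the Normal kernel, the final probability density function can be written as:
$$\hat{P}(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(x_j - x_{ij})^2}{2\sigma_j^2}\right)$$
where $x_j$ and $x_{ij}$ are the $j$th components of $x$ and $x_i$.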
Given this estimate at each pixel, a pixel is considered foreground if its probability is under a certain threshold.
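The per-pixel test described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a diagonal bandwidth matrix and a Normal kernel; the function names `kde_probability` and `is_foreground`, and the way the threshold is passed in, are hypothetical and not taken from the published source code.

```python
import math


def kde_probability(x, samples, sigmas):
    """Kernel density estimate at observation x for one pixel.

    x       -- current observation, a sequence of d channel values
    samples -- the last n observations of the pixel, each of length d
    sigmas  -- per-dimension kernel standard deviations (diagonal H, assumed)
    """
    total = 0.0
    for sample in samples:
        # Product of independent 1-D Normal kernels, one per dimension.
        kernel = 1.0
        for xj, sj, sigma in zip(x, sample, sigmas):
            var = sigma * sigma
            kernel *= math.exp(-((xj - sj) ** 2) / (2.0 * var)) / math.sqrt(
                2.0 * math.pi * var
            )
        total += kernel
    return total / len(samples)


def is_foreground(x, samples, sigmas, threshold):
    """A pixel is labelled foreground when its density falls below threshold."""
    return kde_probability(x, samples, sigmas) < threshold
```

In this sketch every sigma must be strictly positive, and the threshold is a free parameter that the text does not specify.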
Kernel Width Estimation
In order to estimate the kernel bandwidth $\sigma_j^2$ for the $j$th dimension of a given pixel, we compute the median absolute deviation over the data for consecutive values of the pixel. That is, the median, $m_j$, of $|x_i - x_{i+1}|$ over each consecutive pair $(x_i, x_{i+1})$ in the data is calculated independently for each dimension. Since we are measuring deviations between two consecutive values, each pair usually comes from the same local-in-time distribution, and only a few pairs are expected to come from cross distributions. Assuming that this local-in-time distribution is Normal $N(\mu, \sigma_j^2)$, the deviation $(x_i - x_{i+1})$ is Normal $N(0, 2\sigma_j^2)$. So the standard deviation of the first distribution can be estimated as:
$$\sigma_j = \frac{m_j}{0.68\sqrt{2}}$$
More details can be found in the full article.
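As an illustration of the kernel width estimation, the following Python sketch computes one standard deviation per dimension from the median of absolute differences between consecutive samples. It assumes the usual Normal-distribution constant 0.68√2 relating that median to σ; the function name `estimate_sigmas` is hypothetical.

```python
import math
import statistics


def estimate_sigmas(samples):
    """Per-dimension kernel standard deviations for one pixel.

    samples -- time-ordered observations of the pixel, each of length d
    """
    d = len(samples[0])
    sigmas = []
    for j in range(d):
        # Absolute differences between consecutive values in dimension j.
        diffs = [abs(b[j] - a[j]) for a, b in zip(samples, samples[1:])]
        m_j = statistics.median(diffs)
        # If the differences are N(0, 2*sigma^2), their median absolute
        # value is about 0.68 * sqrt(2) * sigma.
        sigmas.append(m_j / (0.68 * math.sqrt(2.0)))
    return sigmas
```

A pixel whose values never change yields a median of zero and therefore a zero bandwidth; a practical implementation would probably need a small lower bound, which is an assumption not covered by the text.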
When depth cannot be measured at a given pixel, the sensor returns a special non-value code to indicate its inability to measure depth. Such pixels appear as holes in the images, with no depth value. In this paper we denote these pixels as Absent Depth Observations (ADO).
The scene model cannot be applied in the standard way, with depth simply treated as a fourth channel alongside RGB, because the sensor’s ADO need special treatment. These ADO can introduce errors into our model, as into any typical background model. A pixel can be ADO throughout the sequence or switch randomly between ADO and a valid value.
There is no general-purpose RGBD dataset that covers all the types of sequences needed to properly evaluate a scene modelling algorithm. Each algorithm is evaluated using its own proposed data and different metrics, which makes it very difficult to perform a unified comparison between methods. For this purpose, we propose a comprehensive dataset that covers the challenges that occur when combining depth and color information.
The GSM dataset includes 7 different sequences and is designed to test each of the main problems in scene modelling when both color and depth information are used. Each sequence starts with 100 training frames and has a hand-labelled foreground ground truth.
To obtain access to the GSM dataset, please fill out the linked form.
Time of Day
This 1231-frame sequence is designed to evaluate smooth illumination changes in the scene. After the training frames, no moving object appears in the scene; subtle illumination changes occur during the sequence. The ground truth is composed of 23 frames that cover the most relevant part of the sequence, where the scene illumination changes.
This 428-frame sequence is used to evaluate the algorithm when color camouflage takes place. After 100 training frames, a person appears and places a folder on a shelf, covering other folders of the same color. The ground truth is composed of 11 frames that cover the most relevant part of the sequence, where there are moving objects and color camouflage occurs.
This 465-frame sequence is used to evaluate the presented algorithm when depth camouflage occurs. After 100 training frames, a person appears and places a folder in an empty space on a shelf, producing new depth values similar to the old ones. The ground truth is composed of 12 frames that cover the most relevant part of the sequence, where there are moving objects and depth camouflage occurs.
This 330-frame sequence was created to evaluate the method’s performance when shadows appear. In this example, a person appears in the scene and moves his hand near a wall, causing a shadow to appear. The ground truth is composed of 11 frames that cover the most relevant part of the sequence, where the moving hand and the shadow are present.
This sequence is used to evaluate sudden illumination changes in the scene. No moving object appears. After the training frames, a lamp is turned on. The Light Switch sequence has 407 frames. The ground truth is composed of 9 frames that cover the most relevant part of the sequence, where the illumination changes occur.
This 300-frame sequence is used to evaluate the method’s performance when there are moving objects in the training stage. In this example, a person is moving during the first 100 frames. This sequence is very challenging because the training stage assumes that depth information is constant across all frames, so it is possible to model wrong distributions that lead to misclassification.
This 200-frame sequence is used to evaluate the method’s performance when an object initially in the background is moved. In this example, a person appears in the scene and moves a chair for 100 frames. The ground truth is composed of 10 frames that cover the most relevant part of the sequence, when the chair starts moving. Unlike other datasets, we decided to create ground truth images for this sequence, because we think that qualitative comparisons between algorithms are very useful for understanding how they work and how they can be improved.
*Due to a typographical error, the results here differ from the published ones; the results shown here are the correct ones.
In order to perform an exhaustive and standard performance evaluation, we computed the performance measures using the framework proposed in the CVPR 2014 CDnet challenge, which implements the following seven measures: recall, specificity, false positive rate (FPR), false negative rate (FNR), percentage of wrong classifications (PWC), F-measure and precision:
- TP : True Positive
- FP : False Positive
- FN : False Negative
- TN : True Negative
- Re (Recall) : TP / (TP + FN)
- Sp (Specificity) : TN / (TN + FP)
- FPR (False Positive Rate) : FP / (FP + TN)
- FNR (False Negative Rate) : FN / (TP + FN)
- PWC (Percentage of Wrong Classifications) : 100 * (FN + FP) / (TP + FN + FP + TN)
- F-Measure : (2 * Precision * Recall) / (Precision + Recall)
- Precision : TP / (TP + FP)
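The seven measures listed above follow directly from the four confusion counts. The Python sketch below implements exactly those formulas; the function name `evaluation_measures` is hypothetical, and returning 0.0 when a denominator is zero is an assumption rather than documented CDnet behaviour.

```python
def evaluation_measures(tp, fp, fn, tn):
    """Compute the seven measures from pixel counts TP, FP, FN and TN."""

    def ratio(num, den):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, tp + fn),
        "pwc": 100.0 * ratio(fn + fp, tp + fn + fp + tn),
        "f_measure": ratio(2.0 * precision * recall, precision + recall),
        "precision": precision,
    }
```

For example, `evaluation_measures(8, 2, 2, 88)` describes 100 pixels of which 4 are misclassified, so the PWC entry is 4.0.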
In order to enable direct comparisons with other algorithms, we publish the GSM results. Note that there are two different sets of results, GSM_UF and GSM_UB, depending on whether undefined pixels are considered foreground (UF) or background (UB).
Source code, written in C++ using the OpenCV library, is available.
Note that this code was developed for scientific purposes only and is designed to test the viability of the GSM algorithm. Inside the project folder there is a README file with the execution instructions.
To obtain access to the GSM source code, please fill out the linked form.
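To make the foreground/background treatment of undefined pixels concrete, the following Python sketch counts TP, FP, FN and TN from a predicted mask that may contain undefined pixels. The label codes, the flat-list mask representation and the function name `confusion_counts` are all hypothetical; the published evaluation may handle these pixels differently.

```python
# Hypothetical label codes for this sketch only.
FOREGROUND, BACKGROUND, UNDEFINED = 1, 0, -1


def confusion_counts(predicted, ground_truth, undefined_as_foreground):
    """Count (TP, FP, FN, TN) over two flat sequences of pixel labels.

    Undefined predictions are mapped to foreground or to background,
    depending on undefined_as_foreground, before being compared.
    """
    fill = FOREGROUND if undefined_as_foreground else BACKGROUND
    tp = fp = fn = tn = 0
    for p, g in zip(predicted, ground_truth):
        if p == UNDEFINED:
            p = fill
        if p == FOREGROUND and g == FOREGROUND:
            tp += 1
        elif p == FOREGROUND:
            fp += 1
        elif g == FOREGROUND:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn
```

Calling the function twice on the same masks, once with `undefined_as_foreground=True` and once with `False`, gives the two sets of counts from which the two sets of measures would be derived.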