I wrote two Matlab scripts to create the aligned database of faces and average face. The first, align.m, produced versions of the input images aligned in the same manner. The second, avgFace.m, took the aligned database of faces and computed the average face.
align.m takes in two parameters: min and max. My program assumes that all images that need to be aligned are .jpg's and in a subdirectory named 'facedb'. It also assumes that these images have consecutive non-negative integers as file names (i.e. 0.jpg, 1.jpg, 2.jpg, etc.). The parameter min represents the file name sans file extension of the first image you want to align. The parameter max represents the file name sans file extension of the last image you want to align. Therefore, the program will align images between and including min.jpg and max.jpg.
The program cycles through each image to align and displays it. The user is prompted to click the center of the subject's eyes. The user must click the left-hand side eye of the subject first and confirm. Then the user must procede to click the right-hand side eye of the subject and confirm. The selected coordinates of the two eyes are saved. The image is warped so the centers of the eyes are 100 pixels apart. Next, the program crops the warped image so that the output image is 300 pixels tall by 200 pixels wide. The cropping is done so that the eyes are midway in the height of the image. I chose to do this because the eyes in a normal human face are roughly at the midpoint of the face heightwise. I chose to crop the pictures to a size of 300x200 since doing so provided a good field of view of the face while including many facial details and eliminating most of the negligable background. Once the image is aligned and cropped, the aligned faces are stored in a subfolder of the main working directory called 'aligned'. Below are some examples of aligned and cropped images from the database of faces.
avgFace.m takes in two parameters: min and max. My program assumes that all aligned images that need to be averaged are 400x400 .jpg images and are in a subdirectory named 'aligned'. It also assumes that these images have consecutive non-negative integers as file names (i.e. 0.jpg, 1.jpg, 2.jpg, etc.). The parameter min represents the file name sans file extension of the first image you want to average. The parameter max represents the file name sans file extension of the last image you want to average. Therefore, the program will average images between and including min.jpg and max.jpg.
First, I created a 3 dimensial matrix of zeros the same size as the cropped aligned images with a depth of 3. The depth of 3 is because the program deals with RGB images. This depth allows for the average intensities of red, green, and blue for data set to be calculated at each pixel. This will produce an average face as an RGB image. To find the average color intensities at each pixel, I looped through each image being averaged and multiplied it by a scaling factor. The scaling factor is 1/n where n is the number of images being averaged. I proceeded to add these scaled down intensities to the running sum of intensities at each pixel. After doing this for all images to be averaged, the image with average color intensities is produced and saved in the 'aligned' subdirectory.
My face detection program was written as a Matlab script, detect.m. detect.m takes 3 parameters: file, tau, and scale.
file: the file name of the target image to detect faces in. The program assumes that the target image is located in the subdirectory named "target."
tau: the minimum threshold of the standardized SSD used to detect pixels that could be the center of faces.
scale: a double representing how much you want to scale the template by when trying to detect faces in the target image.
First, the program reads in the template image in. In my face detector, this is the average face in the 'aligned' subdirectory. This file name is hardcoded in for simplicity's sake. The program then loads the target image in the "target" subdirectory. The program then converts that image into a double matrix and normalizes it by dividing the values by 255. Since the image is represented by a 3 dimensional matrix with values ranging from 0 to 255 for each color intensity, normalizing it makes all the values between 0 and 1. This makes comparing multiple intensities easier and will decrease the range of possible SSD values. The program then loads the individual R, G, and B intensity components of the target image into three separate matrices. The same normalization and division process is applied to the template image. For each color intensity matrix for the target image, I found the SSD between that and its corresponding color intensity matrix of the template matrix.
To calculate the SSD between associated color intensity matrices of the target and template images, I wrote a separate Matlab function called getSSD. getSSD takes two parameters, S and T. S is the target matrix and T is the template matrix. Let T have dimensions smaller than or equal to S. Because of the properties of convolution, To dot multiply the correct elements in the transformed padded matrices to get part of the SSD, I had to rotate the template matrix by 180 degrees. In Question 2, I determined the relationship between SSD and convolution. Therefore, I had to perform a convolution between S and -2*T, perform a convolution between S.*S and a mask of ones of size T, and find the sum of all the elements of T squared. Summing these three components would give the SSD at each pixel as if the template were applied with said pixel at the center of the template.
To speed up the program, I decided to use fast Fourier transforms to compute the convolutions. In order to do that, the matrices needed to be padded with zeros so their dimensions are powers of two. Because the template will be applied around the edges of the target image, I had to make sure that the target was padded with zeros of the same dimensions of the template image in each given direction. The program then found the next highest power of two that could account for this safety buffer and padded each side so the final dimensions of the matrix were powers of two. To do this, I had to divide the total amount to pad by two and pad the matrix with this many zeros in the given directions. After the program pads the target with zeroes, the program pads the template so its dimensions match the dimensions of the padded template.
Before Fourier transforming the padded S matrix, I had to apply an fftshift on it to produce more accurate results. After Fourier transforming the matrices, I element-wise multiplied the necessary transformed matrices to perform convolution in the Fourier domain. I inverse fft'd the results and added them together along with the sum of the elements of T squared to create a matrix of SSD values. This produced a matrix of SSD values where the element in the matrix represented the SSD as if the template was applied to the target image with said element being the center of the template. However, the SSD matrix is padded with zeros in the same manner S was. SSD needed to be depadded and returned.
For each color intensity matrix for the target image, I found the SSD between that and its corresponding color intensity matrix of the template matrix. To create more readible results, I normalized the values of the intensity SSDs so they would be between 0 and 1. I then averaged the three intensity SSD matrices to get the average SSD matrix. These numbers would thus be between 0 and 1. I decided to output this matrix as a gray scale image. In the image, the more black the pixels are, the lower the standardized average SSD at that pixel is. The opposite is true the more white the pixel is. Therefore, regions of blackness correspond to the centers of possible faces.
I looped through all elements in the SSD matrix and compared each element to the tau threshold. If its value was below the threshold, I flagged it as being a possible center of a face. To do this, I created a matrix of all zeros with the same dimensions as the SSD matrix. I changed a pixel's corresponding value in this flag matrix to one if its SSD value was less than tau. I decided to use this flag matrix even though it took up extra memory because it is significantly faster than to store all possible "face" points in an always changing size vector. Once I got all "face points" I performed non-maximal suppression to suppress all non-maximal points in a given neighborhood with dimensions of the template. Once non-maximal "face points" are suppressed, the program draws a box with dimensions of the template around the "face points" given that a "face point" is the detected center of a face. I made various templates and target images using an image editor make sure my detection algorithm was inherently correct.
To do a sanity check on my detection program, I created various templates and target images. I hardcoded in the filenames of the templates. The first template I created was an off-centered black square. For the target image, I copied said square onto a large white background and translated it. As the pictures indicated, the pixel where the lowest SSD was at the center at the moved template on the target image. The same results were seen with a similar template of an off centered orange square. The target image had two orange squares and a blue square. The program was able to detect the orange squares and did not detect the blue square. This proved that the color of the shapes being detected mattered. In addition, this proved that multiple objects could be detected in a target image if they are in separate "neighborhoods." My final test template was an orange triangle. My target image was various colored shapes including the orange triangle on a white background. The program was able to detect the orange triangle in the target image. This showed that the program could detect objects with various other objects in the background. Once I achieved these results, I went on try to detect faces in the unaligned, uncropped photos in the face database.
With the appropriate parameter values, the program was able to detect the subject's face in both a picture in which he was smiling and one in which he was not. The center of the detected faces is approximately between the eyes of the subject, indicating a good match with the template image. The possible "face points" with minimal SSD's are located near the center of the face as seen above. These results along with the previous ones indicate that the face detector detects faces that were used in computing the average face extraordinally well. Next, I decided to test my faces not used in compiling the average face.
My face detector did somewhat well when applied to images of people not in our class and used in constructing the average face. Using the correct scales and thresholds, I was able to detect all four members of the Beatles in 'beatles.jpg' and Charlie Sheen's face in 'sheen.jpg.' The detected face was centered almost between each person's eyes. The minimal SSD/"face points" image indicated the same result. However, the program did not do well at detecting faces in the last two input images. In 'sheen-with-girl.jpg', with a relatively low threshold, the program was able to identify Charlie Sheen's face but not that of his lady friend. In addition, the program returned many false positives, especially in Charlie's suit. Raising the threshold allowed for the girl's face to be detected.
As you can see, the face detector detected many false positives in its search. These false positives seemed to occur in three general areas: lighting hitting the padded wall, the ceiling tiles, and the wood of the desks. This is most likely because these areas are somewhat skin colored. When looking for a scaled down template in a target image, some of details of the template are lost. This loss of detail in the template would result in a more blurred average face with less recognizable facial features. Therefore, when looking for a face this way, the more 'skin colored' and blurred objects will be wrongly identified as a face. There are some false negatives returned as well. It was able to detect many of the students' faces when they are facing forward. This is because we only use one average face template which is directly looking at the camera. This algorithm makes it nearly impossible to detect faces that are not straight on.
The first five eigenfaces as calculated in the program are as follows:
The first eigenface seems to be bright around the skin of the face. This perhaps differentiates skin intensity between pixels. The second eigenface seems to be bright around potential area were glasses are and for hair. This perhaps differentiates glasses or hair type. The third eigenface seems to be bright on the forehead. This perhaps corresponds to whether or not the person has bangs or a bare forehead. The fourth eigenface had bright patches near the lips and by facial lines. This could perhaps differentiate the type of lips people have and any notable facial lines. Finally, the fifth eigenface has bright spots around the eyesockets. This could differentiate the depth of a person's eyes.
Looking at the graph for the singular values, the eigenvalues drop off fairly quickly in an exponential decay pattern. However, these values start to even out around the 30th eigenvalue. This indicates that number of principal components that you would use to form a compact and accurate representation of your face database is about 30. Instead, choose an arbitrarily larger number of principal components. For this case, let that number be 30 since it is much larger than 2. Let v = a_1*e_1 + ... + a_k*e_k where k = 30, a_i is the i-th coefficient times the i-th eigenface. Let v' = a_1*e_1 + ... + a_k*e_k + a_k+1*e_k+1 + ... + a_69*e_69, where v' represents the fully reconstructed image. The difference, or error, between the two can be calculated by doing ||v-v'||^2. Without compression, the amount of data stored is approximately 69*m*n where you have 69 images of dimensions m*n. If m=300 and n=200, you would have to store 4,140,000 pixels. With compression, you have to store 69 30-dimension vectors. In addition, you have 30*m*n pixels for eigenfaces to store. Assume that m=300 and n=200. This would lead to only 69*30+30*300*20 or 1,802,020 pixels. The compression ratio would be 1802020/4140000, or approximately 0.4353.