# Image Compression with Principal Component Analysis

The following images are taken from DesignCrowd.

The first image contains tourists (I call it tour), while the second has no tourists (no_tour).

The third is a new image, different from the two above.

The goals are:

• Use PCA to compress the tour image
• Test the generalizability of the PC loadings obtained from the tour image to another, similar image (the no_tour image) and to a very different image (the new image).

First, let’s load the pictures and look at the dimensions.

library(jpeg)
# `path` is assumed to hold the folder containing the images
tour = readJPEG(paste(path,"paris_tourist.jpg", sep = "/"))
no_tour = readJPEG(paste(path,"paris_no_tourist.jpg", sep = "/"))
new = readJPEG(paste(path,"rome.jpg", sep = "/"))
dim(tour); dim(no_tour); dim(new)
## [1] 459 715   3
## [1] 458 715   3
## [1] 500 750   3

The three pictures do not have the same dimensions, so let’s resize them all to 458 × 715 × 3. The 3 represents the R, G, and B color channels of the images.

library("EBImage")
# resize both images to the common 458 x 715 size
tour <- resize(tour, w = 458, h = 715)
new <- resize(new, w = 458, h = 715)

# extract the pixel array
tour <- as.array(tour)
new <- as.array(new)
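To make the channel layout concrete, here is a hedged illustration on a tiny synthetic array (not one of the real photos): readJPEG-style arrays are height × width × 3, and `[ , , k]` slices out a single color channel as an ordinary matrix.

```r
# 4x5 stand-in "image" with values in [0, 1], like a decoded JPEG
img <- array(runif(4 * 5 * 3), dim = c(4, 5, 3))

red   <- img[, , 1]   # first channel
green <- img[, , 2]
blue  <- img[, , 3]

dim(red)  # 4 5: one channel is a plain matrix, ready for prcomp()
```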

# The tour Image

## PCA

Break the image down into its three color channels, each a matrix, and perform PCA on each one.

rtour = tour[, ,1]
gtour = tour[, ,2]
btour = tour[, ,3]
prtour = prcomp(rtour, center = FALSE)
pgtour = prcomp(gtour, center = FALSE)
pbtour = prcomp(btour, center = FALSE)

tour.pca = list(prtour, pgtour, pbtour)
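Before moving on, it may help to verify what this decomposition gives us. A minimal check on a synthetic matrix (a stand-in for one color channel, not from the actual image): with `center = FALSE`, the scores (`x`) times the transposed rotation recover the data exactly, and keeping only k columns yields a rank-k approximation.

```r
set.seed(1)
m <- matrix(runif(20 * 10), 20, 10)   # stand-in for one color channel
p <- prcomp(m, center = FALSE)

full  <- p$x %*% t(p$rotation)                 # all components: lossless
rank3 <- p$x[, 1:3] %*% t(p$rotation[, 1:3])   # 3 components: lossy

max(abs(full - m))    # effectively 0
max(abs(rank3 - m))   # nonzero truncation error
```

This identity is exactly what the reconstruction step below exploits, just with fewer components than the full set.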

## Scree Plot and Cumulative Variation Plot

Take a look at the scree plot of each PCA.

library(dplyr)
library(magrittr)
library(ggplot2)

# Create a data frame for easier plotting
df = data.frame(scheme = rep(c("R", "G", "B"), each = 458),
                index = rep(1:458, 3),
                var = c(prtour$sdev^2, pgtour$sdev^2, pbtour$sdev^2))

df %<>%
  group_by(scheme) %>%
  mutate(propvar = 100 * var / sum(var)) %>%
  mutate(cumsum = cumsum(propvar)) %>%
  ungroup()

# relevel so the facets appear in R, G, B order
df$scheme = factor(df$scheme, levels = c("R", "G", "B"))

df %>% ggplot(aes(x = index, y = propvar, fill = scheme)) +
  geom_bar(stat = "identity") +
  geom_line() +
  labs(title = "Scree Plot of Principal Components",
       x = "Principal Component", y = "% of Variance") +
  scale_x_continuous(limits = c(0, 30)) +
  facet_wrap(~scheme)

df %>% ggplot(aes(x = index, y = cumsum, fill = scheme)) +
  geom_bar(stat = "identity") +
  geom_line() +
  labs(title = "Cumulative Proportion of Variance Explained by Principal Component",
       x = "Principal Component", y = "Cumulative % of Variance") +
  scale_x_continuous(limits = c(0, 30)) +
  facet_wrap(~scheme)

The first 30 PCs appear sufficient: their cumulative proportion of variance explained is already close to 100% in every channel.
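A small helper (not in the original analysis) makes "enough PCs" precise: the smallest number of components whose cumulative share of variance reaches a target, computed from prcomp's `sdev`. The demo matrix below is synthetic.

```r
pcs_needed <- function(sdev, target = 0.99) {
  prop <- sdev^2 / sum(sdev^2)
  which(cumsum(prop) >= target)[1]
}

# Synthetic demo: one direction dominates, so very few PCs suffice
set.seed(3)
z <- matrix(rnorm(100 * 20, sd = 0.01), 100, 20)
z[, 1] <- z[, 1] + 5                     # one strong component
k <- pcs_needed(prcomp(z, center = FALSE)$sdev)
k  # 1
```

For the real image, one would call e.g. `pcs_needed(prtour$sdev)` on each channel's fit.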

## Image Reconstruction

Let’s reconstruct the first image. I chose to look at the images reconstructed from 2, 30, 200, and 300 principal components.

# This is the number of desired PCs
pcnum = c(2,30,200, 300)

for(i in pcnum){
  pca.img <- sapply(tour.pca, function(j) {
    j$x[, 1:i] %*% t(j$rotation[, 1:i])
  }, simplify = 'array')
  writeJPEG(pca.img, paste(path, "/tour/tour_compressed_", i, '_components.jpg', sep = ''))
}

It appears that 200 PCs are enough to compress the image while maintaining good quality.
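Instead of eyeballing quality, one can quantify it. A hedged sketch on a synthetic channel (the names `chan` and `rank_rmse` are illustrative, not from the original post): the root-mean-square reconstruction error of a rank-k approximation shrinks as k grows.

```r
set.seed(4)
chan <- matrix(runif(40 * 30), 40, 30)   # stand-in for one color channel
p <- prcomp(chan, center = FALSE)

# RMSE between the original channel and its rank-k reconstruction
rank_rmse <- function(p, chan, k) {
  recon <- p$x[, 1:k, drop = FALSE] %*% t(p$rotation[, 1:k, drop = FALSE])
  sqrt(mean((recon - chan)^2))
}

errs <- sapply(c(2, 10, 30), function(k) rank_rmse(p, chan, k))
errs  # strictly decreasing; ~0 at full rank (k = 30)
```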

## Image Size Compression

But how about the sizes of these images?

original <- file.info(paste(path, "paris_tourist.jpg", sep = "/"))$size / 1000
for(i in pcnum){
  filename = paste("tour_compressed_", i, '_components.jpg', sep = '')
  full.path = paste(path, "tour", filename, sep = "/")
  size = file.info(full.path)$size / 1000
  cat(filename, 'size:', size, "\n",
      "Reduction from the original image: ", (original - size) / original * 100, "% \n\n")
}
## tour_compressed_2_components.jpg size: 30.258
##  Reduction from the original image:  53.07237 %
##
## tour_compressed_30_components.jpg size: 55.743
##  Reduction from the original image:  13.54726 %
##
## tour_compressed_200_components.jpg size: 43.244
##  Reduction from the original image:  32.93216 %
##
## tour_compressed_300_components.jpg size: 39.435
##  Reduction from the original image:  38.8396 %

Surprisingly, the image compressed with 30 PCs shows the smallest size reduction, i.e., it is the largest of the four files. Keep in mind that writeJPEG applies its own lossy compression, so file size depends on the JPEG encoder as well as on the number of PCs retained and does not shrink monotonically with fewer PCs.

# The no_tour Image

Now, let’s take a similar image (no_tour) and compress it using the principal component loadings obtained in the previous step.

The compress() function is meant for compressing a new image (i.e., an image that was not used to obtain the loadings in the previous step).

rno_tour = no_tour[, ,1]
gno_tour = no_tour[, ,2]
bno_tour = no_tour[, ,3]

no_tour_rgb = list(rno_tour,gno_tour,bno_tour)

compress <- function(trained_rgb_pca, newrgb, pcnum, dims){
  r_rotate = trained_rgb_pca[[1]]$rotation
  g_rotate = trained_rgb_pca[[2]]$rotation
  b_rotate = trained_rgb_pca[[3]]$rotation

  r = newrgb[[1]]
  g = newrgb[[2]]
  b = newrgb[[3]]

  # project each channel onto the trained loadings, then map back
  pred_r = (r %*% r_rotate)[, 1:pcnum] %*% t(r_rotate[, 1:pcnum])
  pred_g = (g %*% g_rotate)[, 1:pcnum] %*% t(g_rotate[, 1:pcnum])
  pred_b = (b %*% b_rotate)[, 1:pcnum] %*% t(b_rotate[, 1:pcnum])

  pred.pca = list(pred_r, pred_g, pred_b)
  pred.array = array(as.numeric(unlist(pred.pca)), dim = dims)
  return(pred.array)
}

for(i in pcnum){
  pca.img = compress(tour.pca, no_tour_rgb, pcnum = i, dims = c(458, 715, 3))
  writeJPEG(pca.img, paste(path, "/no_tour/no_tour_compressed_", i, '_components.jpg', sep = ''))
}

One can observe that the last image’s quality is the best of the four (since it uses the most PCs), but it is not even close to a high-quality image.
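This generalizability gap can be sketched on synthetic matrices (the names `A`, `B`, and `recon_err` are illustrative): reconstruct B with loadings trained on a similar matrix A versus loadings trained on B itself. By the Eckart–Young theorem, B’s own top-k loadings give the smallest possible rank-k error, so borrowed loadings can only do worse.

```r
set.seed(2)
A <- matrix(runif(50 * 30), 50, 30)
B <- A + matrix(rnorm(50 * 30, sd = 0.1), 50, 30)  # a "similar" matrix
k <- 5

rot_A <- prcomp(A, center = FALSE)$rotation[, 1:k]  # trained on A
rot_B <- prcomp(B, center = FALSE)$rotation[, 1:k]  # trained on B itself

# RMSE of projecting M onto a loading matrix and mapping back
recon_err <- function(M, rot) sqrt(mean(((M %*% rot) %*% t(rot) - M)^2))

recon_err(B, rot_B)  # own loadings: optimal rank-k error
recon_err(B, rot_A)  # borrowed loadings: larger
```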

# The new Image

We now know that PCA-based compression works very well on the image it was trained on, but much less well even on a similar image. It’s natural to ask whether the result will be even worse when a totally unrelated image (such as the new image) is used instead. Let’s check it out.

rnew = new[, ,1]
gnew = new[, ,2]
bnew = new[, ,3]

new_rgb = list(rnew,gnew,bnew)

for (i in pcnum){
  pca.img = compress(tour.pca, new_rgb, pcnum = i, dims = c(458, 715, 3))
  writeJPEG(pca.img, paste(path, "/new/new_compressed_", i, '_components.jpg', sep = ''))
}

While the output does not look exactly like the original image, its main features are clearly visible. This is impressive, because the loadings were learned from just one image (tour). One can also experiment with increasing the number of PCs to get a higher-resolution result.

# Summary

Principal Component Analysis is a great tool for dimension reduction (and thus, in this case, size reduction), extracting important insights, and uncovering the underlying structure of a dataset.