library(rsvddpd)
library(microbenchmark)
library(matrixStats)
library(pcaMethods)
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
#>
#> Attaching package: 'pcaMethods'
#> The following object is masked from 'package:stats':
#>
#> loadings
Singular Value Decomposition (SVD) is a very popular technique which is abundantly used in different applications from Bioinformatics, Image and Signal processing, Textual Analysis, Dimensional Reduction techniques etc.
However, it is often the case that the data matrix, on which SVD is generally applied on, contains outliers which are not in accord with the data generating mechanism. In such a case, usual SVD performs poorly in a sense that the singular values and the left and right singular vectors are found to be very different from the ones that would have been obtained if the data matrix was free of outliers. Hence, the dire need of a robust version of SVD is extremely prevalent, since hardly any data in practice becomes free of any type of outliers.
For illustration, consider the simple 4 × 3 matrix, where the elements go from 1 to 12.
X <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
X
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 4 5 6
#> [3,] 7 8 9
#> [4,] 10 11 12
and the singular value decomposition turns out the singular values as approximately 25, 1.3 and 0.
svd(X)
#> $d
#> [1] 2.546241e+01 1.290662e+00 2.406946e-15
#>
#> $u
#> [,1] [,2] [,3]
#> [1,] -0.1408767 -0.82471435 0.5399964
#> [2,] -0.3439463 -0.42626394 -0.6516661
#> [3,] -0.5470159 -0.02781353 -0.3166568
#> [4,] -0.7500855 0.37063688 0.4283266
#>
#> $v
#> [,1] [,2] [,3]
#> [1,] -0.5045331 0.76077568 -0.4082483
#> [2,] -0.5745157 0.05714052 0.8164966
#> [3,] -0.6444983 -0.64649464 -0.4082483
Now, note what happens when we contaminate a single entry of the matrix by a large outlying value.
X[2, 2] <- 100
svd(X)
#> $d
#> [1] 101.431313 18.313121 1.148165
#>
#> $u
#> [,1] [,2] [,3]
#> [1,] -0.02260136 0.1500488 0.9516017926
#> [2,] -0.98805726 -0.1540849 0.0008289283
#> [3,] -0.08969187 0.5758535 0.1322532569
#> [4,] -0.12323712 0.7887559 -0.2774210109
#>
#> $v
#> [,1] [,2] [,3]
#> [1,] -0.05752705 0.62535728 -0.778215212
#> [2,] -0.99499917 -0.09966888 -0.006539692
#> [3,] -0.08165348 0.77394728 0.627963626
All the singular values are now much different, being 101.4, 18.3 and 1.14. However, in practical cases, where X actually represent a data matrix, this can pose a serious problem.
On the other hand, using rSVDdpd
function from
rsvddpd
package enables us a mitigate the effect of this
outlier.
rSVDdpd(X, alpha = 0.3)
#> $d
#> [1] 25.47298781 1.29012984 0.02812823
#>
#> $u
#> [,1] [,2] [,3]
#> [1,] 0.1408239 -0.82504037 0.4775443
#> [2,] 0.3450091 -0.42538658 -0.8366690
#> [3,] 0.5467884 -0.02772982 0.2395729
#> [4,] 0.7497732 0.37092546 0.1205872
#>
#> $v
#> [,1] [,2] [,3]
#> [1,] 0.5043111 0.7632642 0.4038542
#> [2,] 0.5749973 0.0521016 -0.8164947
#> [3,] 0.6442426 -0.6439823 0.4125994
Since the function does some randomized initialization under the hood, the result might not be exactly same when you run the code again. However, you should get the singular values pretty close to the singular values of the original X before we added the outlier.
Let us take a look at what rSVDdpd
does under the hood.
Before that, singular value decomposition (SVD) of a matrix X is splitting it as;
Xn × p = Un × rDr × rVp × rT Here, r is the rank of the matrix X, D is a diagonal matrix with non-negative real entries, and U, V are orthogonal matrices. Since, we usually observe data matrix X with errors, the model ends up being X = UDVT + ϵ, where ϵ is the errors.
For simplicity, we consider r = 1, i.e. X ≈ λabT, where a, b are vectors of appropriate dimensions. The usual SVD can be viewed as solving the problem ∑i, j(Xij − λaibj)2, with respect to the choices of ai, bj’s and λ. This L2 norm is essentially susceptible to outliers, hence people have generally tried to use L1 norm instead and tried to minimize that.
Here, we use Density Power Divergence (which is popularly used in robust estimation techniques bridging robustness and efficiency) to quantify the norm of the error. In particular, we try to minimize the function,
$$ H = \int \phi\left( \dfrac{x - \lambda a_ib_j}{\sigma} \right)^{(1 + \alpha)}dx - \dfrac{1}{np} \sum_{i=1}^{n} \sum_{j = 1}^{p} \phi\left( \dfrac{X_{ij} - \lambda a_ib_j}{\sigma} \right)^{\alpha} $$ with respect to the unknowns λ, ai, bj and σ2, where ϕ(⋅) is the standard normal density function. However, since the above problem is actually non-convex, but is convex when one of ai’s or bj’s are held fixed, we iterate between situations fixing ai’s and bj’s and finding minimum of the other quantities respectively.
Because of the usage of standard normal density, and exponential
functions, the usual algorithm suffers from underflow and overflow and
the estimates tend to become NAN
or Inf
in
some iterations for reasonably large or reasonably small values in the
data matrix. To deal with this, rSVDdpd
function first
scales all elements of the data matrix to a suitable range, and then
perform the robust SVD algorithm. Finally, the scaling factor can be
adjusted to obtain the original singular values.
rSVDdpd(X * 1e6, alpha = 0.3)
#> $d
#> [1] 25472975.68 1290131.19 28068.98
#>
#> $u
#> [,1] [,2] [,3]
#> [1,] 0.1408240 -0.8250396 0.4775455
#> [2,] 0.3450054 -0.4253890 -0.8366693
#> [3,] 0.5467892 -0.0277303 0.2395711
#> [4,] 0.7497743 0.3709244 0.1205838
#>
#> $v
#> [,1] [,2] [,3]
#> [1,] 0.5043114 0.76325750 0.4038663
#> [2,] 0.5749967 0.05211497 -0.8164942
#> [3,] 0.6442428 -0.64398914 0.4125884
rSVDdpd(X * 1e-6, alpha = 0.3)
#> $d
#> [1] 2.547297e-05 1.290131e-06 2.805950e-08
#>
#> $u
#> [,1] [,2] [,3]
#> [1,] 0.1408240 0.8250395 0.4775457
#> [2,] 0.3450048 0.4253893 -0.8366693
#> [3,] 0.5467893 0.0277304 0.2395708
#> [4,] 0.7497745 -0.3709242 0.1205833
#>
#> $v
#> [,1] [,2] [,3]
#> [1,] 0.5043114 -0.76325651 0.4038682
#> [2,] 0.5749967 -0.05211695 -0.8164941
#> [3,] 0.6442428 0.64399016 0.4125868
As it can be seen, the function rSVDdpd
handles the very
large or very small elements nicely.
Y <- X[, c(3, 1, 2)]
rSVDdpd(Y, alpha = 0.3)
#> $d
#> [1] 25.47298707 1.29012992 0.02812464
#>
#> $u
#> [,1] [,2] [,3]
#> [1,] 0.1408239 -0.82504033 0.4775444
#> [2,] 0.3450089 -0.42538672 -0.8366690
#> [3,] 0.5467884 -0.02772984 0.2395728
#> [4,] 0.7497732 0.37092540 0.1205870
#>
#> $v
#> [,1] [,2] [,3]
#> [1,] 0.6442426 -0.64398276 0.4125987
#> [2,] 0.5043111 0.76326375 0.4038549
#> [3,] 0.5749973 0.05210241 -0.8164946
As expected, the singular values do not change when the columns of the data matrix is permuted, however, the singular vector permutes in the same manner of the permutation of the columns.
An important property of SVD is that the matrix corresponding to the left and right singular vectors are orthogonal matrices. A sanity check of this property can also be verified very easily.
crossprod(rSVDdpd(X, alpha = 0.3)$u)
#> [,1] [,2] [,3]
#> [1,] 1.000000e+00 2.759042e-17 -1.810093e-17
#> [2,] 2.759042e-17 1.000000e+00 2.007640e-17
#> [3,] -1.810093e-17 2.007640e-17 1.000000e+00
As it seems, the off diagonal entries are very small values. This is ensured by introducing a Gram Schimdt Orthogonalization step between successive iterations of the algorithm.
In presence of outliers with large deviation, the performance of
rSVDdpd
is fairly robust to the choice of α, the robustness parameter. With
α = 0, rSVDdpd
corresponds to usual svd
function from base
package. However, with increasing α, the robustness increases,
i.e. even a smaller deviation would not affect the singular values,
while with higher α, the
variance of the estimators generally increase.
To demonstrate the effect of α on time complexity,
microbenchmark
package will be used.
microbenchmark::microbenchmark(svd(X), rSVDdpd(X, alpha = 0), rSVDdpd(X, alpha = 0.25),
rSVDdpd(X, alpha = 0.5), rSVDdpd(X, alpha = 0.75),
rSVDdpd(X, alpha = 1), times = 30)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> svd(X) 22.412 26.099 27.94690 26.9250 28.654 56.706 30
#> rSVDdpd(X, alpha = 0) 65.663 66.554 67.99497 67.9165 68.478 75.711 30
#> rSVDdpd(X, alpha = 0.25) 67.305 70.862 77.11397 72.8860 74.640 175.177 30
#> rSVDdpd(X, alpha = 0.5) 70.081 72.135 74.43430 72.7860 75.140 93.455 30
#> rSVDdpd(X, alpha = 0.75) 71.523 73.036 75.90333 74.5795 77.725 83.967 30
#> rSVDdpd(X, alpha = 1) 69.701 72.986 76.05630 75.0350 77.395 91.420 30
Therefore, the execution time slightly increases with higher α.
To compare performances of usual SVD algorithm with that of
rSVDdpd
, one can use simSVD
function, which is
used to simulate data matrices based on a model and then obtain an
estimate of Bias and MSE of the estimates using a Monte Carlo
approach.
First, we create the true data matrix, with singular vectors taken from coefficients of orthogonal polynomials.
U <- as.matrix(stats::contr.poly(10)[, 1:3])
V <- as.matrix(stats::contr.poly(4)[, 1:3])
trueSVD <- list(d = c(10, 5, 3), u = U, v = V) # true svd of the data matrix
We can now call simSVD
function to see the performance
of usual SVD algorithm under contamination from outlier.
res <- simSVD(trueSVD, svdfun = svd, B = 100, seed = 2021, outlier = TRUE, out_value = 25, tau = 0.9)
res
#> $Bias
#> [1] 28.20376 21.39073 11.20089
#>
#> $MSE
#> [1] 845.4570 527.2264 208.7107
#>
#> $Variance
#> [1] 50.00523 69.66283 83.25073
#>
#> $Left
#> [1] 0.6713628 0.7145263 0.7491816
#>
#> $Right
#> [1] 0.5724405 0.5660693 0.6095304
Following is the performance of robustSvd
function from
pcaMethods
package.
res <- simSVD(trueSVD, svdfun = pcaMethods::robustSvd, B = 100, seed = 2021, outlier = TRUE, out_value = 25, tau = 0.9)
res
#> $Bias
#> [1] 16.85259 13.72513 16.79947
#>
#> $MSE
#> [1] 508.2096 333.2573 402.1919
#>
#> $Variance
#> [1] 224.1996 144.8780 119.9697
#>
#> $Left
#> [1] 0.4666559 0.6445184 0.7104788
#>
#> $Right
#> [1] 0.4084638 0.5299844 0.5151042
Now we compare rSVDdpd
function’s performance with the
other SVD implementations.
res <- simSVD(trueSVD, svdfun = rSVDdpd, B = 100, seed = 2021, outlier = TRUE, out_value = 25, tau = 0.9, alpha = 0.25)
res
#> $Bias
#> [1] 3.47466190 0.05686266 -0.22769722
#>
#> $MSE
#> [1] 104.9849391 0.6843969 0.2378147
#>
#> $Variance
#> [1] 92.9116638 0.6811636 0.1859687
#>
#> $Left
#> [1] 0.08385757 0.09691123 0.10080858
#>
#> $Right
#> [1] 0.05255743 0.07546599 0.08511308
And with α = 0.75, we have;
res <- simSVD(trueSVD, svdfun = rSVDdpd, B = 100, seed = 2021, outlier = TRUE, out_value = 25, tau = 0.9, alpha = 0.75)
res
#> $Bias
#> [1] 0.1903417 -0.1843705 -0.2716241
#>
#> $MSE
#> [1] 0.4822773 0.1357128 0.1671783
#>
#> $Variance
#> [1] 0.44604734 0.10172035 0.09339867
#>
#> $Left
#> [1] 0.01421574 0.03210005 0.04942867
#>
#> $Right
#> [1] 0.006691835 0.020217527 0.029774606
As it can be seen, the bias and MSE are much lesser in
rSVDdpd
algorithm.