## Stochastic portfolio optimisation and out-of-sample testing using R

Author: Andreï Victorovitch Kostyrka, University of Luxembourg.

Last updated: 2018-05-31.

One might ask a question, “What is it like to try to pick an optimal portfolio and invest in stocks?” Well, here I have prepared some R code that does the following thing:

• Downloads stock data from Yahoo Finance from 2014 for a selection of stocks;
• Puts all stocks together across the maximum set of available dates and nicely imputes missing values using Kalman filter with structural modelling;
• Tries all possible combinations or portfolios (as many as possible, using Halton pseudo-random low-discrepancy sequences, doing many iterations and saving the best result from every iteration to save memory);
• Creates amazing colourful plots for subsequent analysis;
• Returns statistics for equally weighted portfolios, max-return under equally weighted risk, min-risk under equally weighted return, and maximum Sharpe ratio (for simplicity, the ratio of returns to risk, assuming there is no risk-free asset);
• Tests these 4 portfolios out of sample for 1 year (“if one started trading one year ago, what would one have obtained”).

Two sets of stocks were chosen: most lucrative U.S. companies and most lucrative Russian companies. This is what one gets if one chooses 13 top U.S. companies:

library(BatchGetSymbols) library(zoo) library(imputeTS) # For imputation library(randtoolbox) # For Sobel sampling library(parallel)   # http://fortune.com/2017/06/07/fortune-500-companies-profit-apple-berkshire-hathaway/ # https://www.swiftpassportservices.com/blog/profitable-companies-russia/   syms <- c("GOOG", "AAPL", "JPM", "BRK-A", "WFC", "MCD", "BAC", "MSFT", "AMZN", "JNJ", "C", "MO", "GE") symnames <- c("Google", "Apple", "J.P. Morgan", "Berkshire Hathaway", "Wells Fargo", "McDonald's", "Bank of America", "Microsoft", "Amazon", "Johnson and Johnson", "Citigroup", "Altria group", "General Electric") cat("Calculating optimal portfolio for US stocks\n")   # syms <- c("SGTZY", "ROSN.ME", "OGZPY", "LUKOY", "SBRCY", "TRNFP.ME", "NILSY", "ALRS.ME", "YNDX") # symnames <- c("SurgutNefteGas", "RosNeft", "GazProm", "LukOil", "SberBank", "TransNeft", # "Norilsk Nickel", "Alrosa", "Yandex") # cat("Calculating optimal portfolio for Russian stocks\n")   first.date <- as.Date("2014-01-01") last.date <- as.Date("2018-08-01") end.ins <- as.Date("2017-08-01")   dat <- BatchGetSymbols(tickers = syms, first.date = first.date, last.date = last.date)   allt <- dat$df.tickers allt <- allt[!is.na(allt$price.adjusted), ] dates <- sort(unique(allt$ref.date)) prices <- data.frame(dates) prices[, 2:(length(syms)+1)] <- NA colnames(prices)[2:(length(syms)+1)] <- syms for (i in syms) { datesi <- allt$ref.date[as.character(allt$ticker)==i] pricesi <- allt$price.adjusted[as.character(allt$ticker)==i] prices[match(datesi, dates), i] <- pricesi prices[, i] <- zoo(prices[, i], order.by = prices$dates) prices[, i] <- na.kalman(prices[, i]) } prices2 <- prices prices[,1] <- NULL   co <- rainbow(length(syms), end=0.75) png("1-relative-prices.png", 700, 400, type="cairo") plot(prices[,1]/mean(prices[,1]), col=co[1], lwd=1.5, main="Prices divided by their in-sample mean", xlab="Time", ylab="Ratio") for (i in 2:length(syms)) { lines(prices[,i]/mean(prices[,i]), col=co[i]) } legend("topleft", syms, col=co, lwd=1) dev.off()   rs <- data.frame(rep(NA, nrow(prices)-1)) for (i in syms) { rs[, i] <- diff(log(prices[, i])) } rs[,1] <- NULL   w0 <- rep(1/length(syms), length(syms)) names(w0) <- syms rmean <- zoo(as.numeric(as.matrix(rs) %*% w0), order.by = time(rs[,1]))   a <- table(as.numeric(format(time(rmean), "%Y"))) daysperyear <- mean(a[1:(length(a)-1)]) # Because the last year is incomplete   rsfull <- rs # Full sample rs <- rs[time(rs[,1]) <= end.ins, ] # Training sample rsnew <- rsfull[time(rsfull[,1]) > end.ins, ] # Test sample   meanrs <- colMeans(rs) covrs <- cov(rs)   musigma <- function(w) { mu <- sum(meanrs * w)*daysperyear s <- sqrt((t(w) %*% covrs %*% w) * daysperyear) return(c(mu=mu, sigma=s, ratio=mu/s)) }   m0 <- musigma(w0)   num.workers <- if (.Platform$OS.type=="windows") 1 else detectCores() MC <- 108000 iters <- 100 cc <- 150 # How many colours cols <- rainbow(cc, end = 0.75) sharpecuts <- seq(-0.1, 1.5, length.out = cc) png("2-portfolios.png", 700, 600, type="cairo") # First iteration cat("Calculating", MC, "portfolios for the first time.\n") ws <- halton(MC, length(syms), init=TRUE) ws <- ws/rowSums(ws) colnames(ws) <- syms res <- mclapply(1:MC, function(i) musigma(ws[i, ]), mc.cores = num.workers) res <- as.data.frame(matrix(unlist(res, use.names = FALSE), nrow=MC, byrow = TRUE)) colnames(res) <- c("mean", "sd", "ratio") prcmu0 <- ecdf(res$mean)(m0[1]) # Which quantile is the mean prcsd0 <- ecdf(res$sd)(m0[2]) # Which quantile is the sd qmplus <- quantile(res$mean, prcmu0+0.005) qmminus <- quantile(res$mean, prcmu0-0.005) qsplus <- quantile(res$sd, prcsd0+0.005) qsminus <- quantile(res$sd, prcsd0-0.005) meanvicinity <- which(res$mean < qmplus & res$mean > qmminus) sdvicinity <- which(res$sd < qsplus & res$sd > qsminus) plot(res[,2:1], pch=16, col=cols[as.numeric(cut(res$ratio, breaks = sharpecuts))], xlim=range(res$sd)+c(-0.03, 0.03), ylim=pmax(range(res$mean)+c(-0.03, 0.03), c(0, -10)), main="Risk and return for various portfolios", xlab="Annualised volatility", ylab="Annualised log-return")   sharpes <- mus <- sds <- array(NA, dim=c(iters+1, length(syms))) sharpes[1, ] <- ws[which.max(res$ratio), ] mus[1, ] <- ws[which(res$mean == max(res$mean[sdvicinity])), ] sds[1, ] <- ws[which(res$sd == min(res$sd[meanvicinity])), ] pairsm <- pairss <- array(NA, dim=c(iters+1, 3)) pairsm[1, ] <- res$mean[c(which.max(res$ratio), which(res$mean == max(res$mean[sdvicinity])), which(res$sd == min(res$sd[meanvicinity])))] pairss[1, ] <- res$sd[c(which.max(res$ratio), which(res$mean == max(res$mean[sdvicinity])), which(res$sd == min(res$sd[meanvicinity])))] for (i in 1:iters) { cat("Calculating", MC, "portfolios, iteration", i, "out of", iters, "\n") # ws <- sobol(MC, length(syms), scrambling = 1, init=FALSE) # There is a bug in randtoolbox: it starts returning numbers outside [0, 1] ws <- halton(MC, length(syms), init=FALSE) ws <- ws/rowSums(ws) colnames(ws) <- syms res <- mclapply(1:MC, function(i) musigma(ws[i, ]), mc.cores = num.workers) res <- as.data.frame(matrix(unlist(res, use.names = FALSE), nrow=MC, byrow = TRUE)) colnames(res) <- c("mean", "sd", "ratio") meanvicinity <- which(res$mean < qmplus & res$mean > qmminus) sdvicinity <- which(res$sd < qsplus & res$sd > qsminus) maxsh <- which.max(res$ratio) maxm <- which(res$mean == max(res$mean[sdvicinity])) minsd <- which(res$sd == min(res$sd[meanvicinity])) sharpes[i+1, ] <- ws[maxsh, ] mus[i+1, ] <- ws[maxm, ] sds[i+1, ] <- ws[minsd, ] pairsm[i+1, ] <- res$mean[c(maxsh, maxm, minsd)] pairss[i+1, ] <- res$sd[c(maxsh, maxm, minsd)] points(res[,2:1], pch=16, col=cols[as.numeric(cut(res\$ratio, breaks = sharpecuts))]) rm(res); gc() }   points(as.numeric(pairss), as.numeric(pairsm))   sharpems <- apply(sharpes, 1, musigma) wmaxsharpe <- sharpes[which.max(sharpems[3,]), ] meanms <- apply(mus, 1, musigma) wmaxmean <- mus[which.max(meanms[1,]), ] # Maximum returns while SD is the same sdms <- apply(sds, 1, musigma) wminsd <- sds[which.min(sdms[2,]), ] # Minimum SD while mean is the same   m1 <- musigma(wmaxsharpe) m2 <- musigma(wmaxmean) m3 <- musigma(wminsd) finalstats <- cbind(rbind(w0, wmaxsharpe, wmaxmean, wminsd), rbind(m0, m1, m2, m3)) rownames(finalstats) <- c("Equal", "MaxSharpe", "MaxReturn", "MinRisk") colnames(finalstats)[(length(syms)+1):(length(syms)+3)] <- c("RInS", "VInS", "SRInS")     points(m0[2], m0[1], cex=3, pch=16) points(c(m1[2], m2[2], m3[2]), c(m1[1], m2[1], m3[1]), cex=2, pch=16) dev.off()   # And now checking these revenues on the test sample r0 <- zoo(as.numeric(as.matrix(rsnew) %*% w0), order.by = time(rsnew[,1])) r1 <- zoo(as.numeric(as.matrix(rsnew) %*% wmaxsharpe), order.by = time(rsnew[,1])) r2 <- zoo(as.numeric(as.matrix(rsnew) %*% wmaxmean), order.by = time(rsnew[,1])) r3 <- zoo(as.numeric(as.matrix(rsnew) %*% wminsd), order.by = time(rsnew[,1]))   png("3-test-sample-returns.png", 700, 400, type="cairo") plot(r0, main="Out-of-sample returns of weighted portfolios", ylab="Return", xlab="Time") lines(r1, col=2) lines(r2, col=3) lines(r3, col=4) legend("topright", c("Equal", "Sharpe", "Max. return", "Min. volatility"), col=1:4, lwd=1) dev.off()   png("4-test-sample.png", 700, 400, type="cairo") plot(exp(cumsum(r0)), ylim=c(0.6, 1.4), main="Portfolio growth over the last year in the test sample", ylab="Relative growth", xlab="Time") abline(h=1) lines(exp(cumsum(r1)), col=2) lines(exp(cumsum(r2)), col=3) lines(exp(cumsum(r3)), col=4) legend("bottomleft", c("Equal", "Sharpe", "Max. return", "Min. volatility"), col=1:4, lwd=1) dev.off()   finals <- cbind(r0, r1, r2, r3) meanf <- colMeans(finals)*daysperyear sdf <- apply(finals, 2, sd)*sqrt(daysperyear) finalsoos <- cbind(meanf, sdf, meanf/sdf) colnames(finalsoos) <- c("ROOS", "VOOS", "SRROS") finalstats <- cbind(finalstats, finalsoos) colnames(finalstats)[1:length(syms)] <- symnames   library(stargazer) stargazer(t(finalstats), summary = FALSE, type="html", out="a.html")   write.csv(prices2, "stocks.csv", row.names = FALSE)

This is how the 13 U.S. stocks behaved.

This is how 10 million random portfolios behaved (black dots indicate 100 best portfolios under the three criteria; big black dot for the equally weighted portfolio).

These are portfolio weights for U.S. companies with their performance.

 Equal MaxSharpe MaxReturn MinRisk Google 0.077 0.011 0.004 0.052 Apple 0.077 0.132 0.128 0.105 J.P. Morgan 0.077 0.133 0.099 0.028 Berkshire Hathaway 0.077 0.019 0.040 0.170 Wells Fargo 0.077 0.003 0.008 0.025 McDonald's 0.077 0.164 0.148 0.192 Bank of America 0.077 0.030 0.034 0.036 Microsoft 0.077 0.078 0.180 0.022 Amazon 0.077 0.088 0.161 0.005 Johnson & Johnson 0.077 0.138 0.064 0.138 Citigroup 0.077 0.003 0.003 0.009 Altria group 0.077 0.198 0.131 0.142 General Electric 0.077 0.003 0.001 0.078 Return in-sample 0.145 0.179 0.190 0.145 Risk in-sample 0.142 0.128 0.141 0.119 Sharpe ratio in-sample 1.023 1.395 1.345 1.217 Return in 2017.07–2018.06 0.135 0.163 0.241 0.053 Risk in 2017.07–2018.06 0.144 0.134 0.146 0.130 Sharpe ratio in 2017.07–2018.06 0.941 1.218 1.657 0.404

This is how these four U.S. portfolios look out of sample.

If one had invested in these four portfolios a year before, this is what one would have harvested by now.

And this is the result for top 9 Russian companies.

This is how the 9 Russian stocks behaved.

This is how 10 million random portfolios behaved (black dots indicate 100 best portfolios under the three criteria; big black dot for the equally weighted portfolio).

Note that some portfolios can actually yield negative returns, and the volatility is generally higher!

These are portfolio weights for Russian companies with their performance.

 Equal MaxSharpe MaxReturn MinRisk SurgutNefteGas 0.111 0.005 0.009 0.175 RosNeft 0.111 0.162 0.031 0.269 GazProm 0.111 0.012 0.068 0.067 LukOil 0.111 0.018 0.083 0.037 SberBank 0.111 0.038 0.031 0.014 TransNeft 0.111 0.248 0.016 0.213 Norilsk Nickel 0.111 0.079 0.381 0.044 Alrosa 0.111 0.413 0.358 0.049 Yandex 0.111 0.026 0.022 0.131 Return in-sample 0.055 0.227 0.172 0.055 Risk in-sample 0.245 0.218 0.245 0.218 Sharpe ratio in-sample 0.223 1.042 0.701 0.250 Return in 2017.07–2018.06 0.202 0.164 0.223 0.164 Risk in 2017.07–2018.06 0.179 0.170 0.215 0.153 Sharpe ratio in 2017.07–2018.06 1.129 0.965 1.039 1.067

This is how these four Russian portfolios look out of sample.

If one had invested in these four portfolios a year before, this is what one would have harvested by now.

So actually if one used this investing strategy based on the information available in mid-2017, by now, depending on the criterion (Sharpe ratio or maximum profitability) they would have earned 14–22% on Russian stocks or 15–17% on American stocks!

TODO: constrained optimisation using the best stochastic solution as a starting point.

[2019-05-31] Update: for high-dimensional problems (hundreds of stocks), many more MC trials are required! One might also want to change the Halton sequence to a random uniform generator (like ws <- matrix(runif(length(syms)*MC), ncol = length(syms))) or a Latin hypercube sampler (lhs).

<...>

Why are you still here? Why are you not trading yet?

## Шарада

Прибавь мне шкаф, но с буквой «л» —
И я халат из кашемира;
Прибавь мне град — и мой удел —
Нести яйцо и мясо миру;
Без них — индейкой пролетел
Век Валтасара после пира.