Stochastic portfolio optimisation and out-of-sample testing using R

Last updated: 2019-05-31.

One might ask a question, “What is it like to try to pick an optimal portfolio and invest in stocks?” Well, here I have prepared some R code that does the following thing:

Downloads stock data from Yahoo Finance from 2014 for a selection of stocks;
Puts all stocks together across the maximum set of available dates and nicely imputes missing values using Kalman filter with structural modelling;
Tries all possible combinations or portfolios (as many as possible, using Halton pseudo-random low-discrepancy sequences, doing many iterations and saving the best result from every iteration to save memory);
Creates amazing colourful plots for subsequent analysis;
Returns statistics for equally weighted portfolios, max-return under equally weighted risk, min-risk under equally weighted return, and maximum Sharpe ratio (for simplicity, the ratio of returns to risk, assuming there is no risk-free asset);
Tests these 4 portfolios out of sample for 1 year (“if one started trading one year ago, what would one have obtained”).

Two sets of stocks were chosen: most lucrative U.S. companies and most lucrative Russian companies. This is what one gets if one chooses 13 top U.S. companies:

library(BatchGetSymbols)
library(zoo)
library(imputeTS) # For imputation
library(randtoolbox) # For Sobel sampling
library(parallel)

# http://fortune.com/2017/06/07/fortune-500-companies-profit-apple-berkshire-hathaway/
# https://www.swiftpassportservices.com/blog/profitable-companies-russia/

syms <- c("GOOG", "AAPL", "JPM", "BRK-A", "WFC", "MCD", "BAC", "MSFT", "AMZN", "JNJ", "C", "MO", "GE")
symnames <- c("Google", "Apple", "J.P. Morgan", "Berkshire Hathaway", "Wells Fargo", "McDonald's",
              "Bank of America", "Microsoft", "Amazon", "Johnson and Johnson", "Citigroup", "Altria group", "General Electric")
cat("Calculating optimal portfolio for US stocks\n")

# syms <- c("SGTZY", "ROSN.ME", "OGZPY", "LUKOY", "SBRCY", "TRNFP.ME", "NILSY", "ALRS.ME", "YNDX")
# symnames <- c("SurgutNefteGas", "RosNeft", "GazProm", "LukOil", "SberBank", "TransNeft",
#               "Norilsk Nickel", "Alrosa", "Yandex")
# cat("Calculating optimal portfolio for Russian stocks\n")

first.date <- as.Date("2014-01-01")
last.date <- as.Date("2018-08-01")
end.ins <- as.Date("2017-08-01")

dat <- BatchGetSymbols(tickers = syms, first.date = first.date, last.date = last.date)

allt <- dat$df.tickers
allt <- allt[!is.na(allt$price.adjusted), ]
dates <- sort(unique(allt$ref.date))
prices <- data.frame(dates)
prices[, 2:(length(syms)+1)] <- NA
colnames(prices)[2:(length(syms)+1)] <- syms
for (i in syms) {
  datesi <- allt$ref.date[as.character(allt$ticker)==i]
  pricesi <- allt$price.adjusted[as.character(allt$ticker)==i]
  prices[match(datesi, dates), i] <- pricesi
  prices[, i] <- zoo(prices[, i], order.by = prices$dates)
  prices[, i] <- na.kalman(prices[, i])
}
prices2 <- prices
prices[,1] <- NULL

co <- rainbow(length(syms), end=0.75)
png("1-relative-prices.png", 700, 400, type="cairo")
plot(prices[,1]/mean(prices[,1]), col=co[1], lwd=1.5, main="Prices divided by their in-sample mean", xlab="Time", ylab="Ratio")
for (i in 2:length(syms)) {
  lines(prices[,i]/mean(prices[,i]), col=co[i])
}
legend("topleft", syms, col=co, lwd=1)
dev.off()

rs <- data.frame(rep(NA, nrow(prices)-1))
for (i in syms) {
  rs[, i] <- diff(log(prices[, i]))
}
rs[,1] <- NULL

w0 <- rep(1/length(syms), length(syms))
names(w0) <- syms
rmean <- zoo(as.numeric(as.matrix(rs) %*% w0), order.by = time(rs[,1]))

a <- table(as.numeric(format(time(rmean), "%Y")))
daysperyear <- mean(a[1:(length(a)-1)]) # Because the last year is incomplete

rsfull <- rs # Full sample
rs <- rs[time(rs[,1]) <= end.ins, ] # Training sample
rsnew <- rsfull[time(rsfull[,1]) > end.ins, ] # Test sample

meanrs <- colMeans(rs)
covrs <- cov(rs)

musigma <- function(w) {
  mu <- sum(meanrs * w)*daysperyear
  s <- sqrt((t(w) %*% covrs %*% w) * daysperyear)
  return(c(mu=mu, sigma=s, ratio=mu/s))
}

m0 <- musigma(w0)

num.workers <- if (.Platform$OS.type=="windows") 1 else detectCores()
MC <- 108000
iters <- 100
cc <- 150 # How many colours
cols <- rainbow(cc, end = 0.75)
sharpecuts <- seq(-0.1, 1.5, length.out = cc)

png("2-portfolios.png", 700, 600, type="cairo")

# First iteration
cat("Calculating", MC, "portfolios for the first time.\n")
ws <- halton(MC, length(syms), init=TRUE)
ws <- ws/rowSums(ws)
colnames(ws) <- syms
res <- mclapply(1:MC, function(i) musigma(ws[i, ]), mc.cores = num.workers)
res <- as.data.frame(matrix(unlist(res, use.names = FALSE), nrow=MC, byrow = TRUE))
colnames(res) <- c("mean", "sd", "ratio")
prcmu0 <- ecdf(res$mean)(m0[1]) # Which quantile is the mean
prcsd0 <- ecdf(res$sd)(m0[2]) # Which quantile is the sd
qmplus <- quantile(res$mean, prcmu0+0.005)
qmminus <- quantile(res$mean, prcmu0-0.005)
qsplus <- quantile(res$sd, prcsd0+0.005)
qsminus <- quantile(res$sd, prcsd0-0.005)
meanvicinity <- which(res$mean < qmplus & res$mean > qmminus)
sdvicinity <- which(res$sd < qsplus & res$sd > qsminus)

plot(res[,2:1], pch=16, col=cols[as.numeric(cut(res$ratio, breaks = sharpecuts))],
     xlim=range(res$sd)+c(-0.03, 0.03), ylim=pmax(range(res$mean)+c(-0.03, 0.03), c(0, -10)),
     main="Risk and return for various portfolios", xlab="Annualised volatility", ylab="Annualised log-return")

sharpes <- mus <- sds <- array(NA, dim=c(iters+1, length(syms)))
sharpes[1, ] <- ws[which.max(res$ratio), ]
mus[1, ] <- ws[which(res$mean == max(res$mean[sdvicinity])), ]
sds[1, ] <- ws[which(res$sd == min(res$sd[meanvicinity])), ]
pairsm <- pairss <- array(NA, dim=c(iters+1, 3))
pairsm[1, ] <- res$mean[c(which.max(res$ratio), which(res$mean == max(res$mean[sdvicinity])), which(res$sd == min(res$sd[meanvicinity])))]
pairss[1, ] <- res$sd[c(which.max(res$ratio), which(res$mean == max(res$mean[sdvicinity])), which(res$sd == min(res$sd[meanvicinity])))]

for (i in 1:iters) {
  cat("Calculating", MC, "portfolios, iteration", i, "out of", iters, "\n")
  # ws <- sobol(MC, length(syms), scrambling = 1, init=FALSE) # There is a bug in randtoolbox: it starts returning numbers outside [0, 1]
  ws <- halton(MC, length(syms), init=FALSE)
  ws <- ws/rowSums(ws)
  colnames(ws) <- syms
  res <- mclapply(1:MC, function(i) musigma(ws[i, ]), mc.cores = num.workers)
  res <- as.data.frame(matrix(unlist(res, use.names = FALSE), nrow=MC, byrow = TRUE))
  colnames(res) <- c("mean", "sd", "ratio")
  meanvicinity <- which(res$mean < qmplus & res$mean > qmminus)
  sdvicinity <- which(res$sd < qsplus & res$sd > qsminus)
  maxsh <- which.max(res$ratio)
  maxm <- which(res$mean == max(res$mean[sdvicinity]))
  minsd <- which(res$sd == min(res$sd[meanvicinity]))
  sharpes[i+1, ] <- ws[maxsh, ]
  mus[i+1, ] <- ws[maxm, ]
  sds[i+1, ] <- ws[minsd, ]
  pairsm[i+1, ] <- res$mean[c(maxsh, maxm, minsd)]
  pairss[i+1, ] <- res$sd[c(maxsh, maxm, minsd)]
  points(res[,2:1], pch=16, col=cols[as.numeric(cut(res$ratio, breaks = sharpecuts))])
  rm(res); gc()
}

points(as.numeric(pairss), as.numeric(pairsm))

sharpems <- apply(sharpes, 1, musigma)
wmaxsharpe <- sharpes[which.max(sharpems[3,]), ]
meanms <- apply(mus, 1, musigma)
wmaxmean <- mus[which.max(meanms[1,]), ] # Maximum returns while SD is the same
sdms <- apply(sds, 1, musigma)
wminsd <- sds[which.min(sdms[2,]), ] # Minimum SD while mean is the same

m1 <- musigma(wmaxsharpe)
m2 <- musigma(wmaxmean)
m3 <- musigma(wminsd)
finalstats <- cbind(rbind(w0, wmaxsharpe, wmaxmean, wminsd), rbind(m0, m1, m2, m3))
rownames(finalstats) <- c("Equal", "MaxSharpe", "MaxReturn", "MinRisk")
colnames(finalstats)[(length(syms)+1):(length(syms)+3)] <- c("RInS", "VInS", "SRInS")

points(m0[2], m0[1], cex=3, pch=16)
points(c(m1[2], m2[2], m3[2]), c(m1[1], m2[1], m3[1]), cex=2, pch=16)
dev.off()

# And now checking these revenues on the test sample
r0 <- zoo(as.numeric(as.matrix(rsnew) %*% w0), order.by = time(rsnew[,1]))
r1 <- zoo(as.numeric(as.matrix(rsnew) %*% wmaxsharpe), order.by = time(rsnew[,1]))
r2 <- zoo(as.numeric(as.matrix(rsnew) %*% wmaxmean), order.by = time(rsnew[,1]))
r3 <- zoo(as.numeric(as.matrix(rsnew) %*% wminsd), order.by = time(rsnew[,1]))

png("3-test-sample-returns.png", 700, 400, type="cairo")
plot(r0, main="Out-of-sample returns of weighted portfolios", ylab="Return", xlab="Time")
lines(r1, col=2)
lines(r2, col=3)
lines(r3, col=4)
legend("topright", c("Equal", "Sharpe", "Max. return", "Min. volatility"), col=1:4, lwd=1)
dev.off()

png("4-test-sample.png", 700, 400, type="cairo")
plot(exp(cumsum(r0)), ylim=c(0.6, 1.4), main="Portfolio growth over the last year in the test sample", ylab="Relative growth", xlab="Time")
abline(h=1)
lines(exp(cumsum(r1)), col=2)
lines(exp(cumsum(r2)), col=3)
lines(exp(cumsum(r3)), col=4)
legend("bottomleft", c("Equal", "Sharpe", "Max. return", "Min. volatility"), col=1:4, lwd=1)
dev.off()

finals <- cbind(r0, r1, r2, r3)
meanf <- colMeans(finals)*daysperyear
sdf <- apply(finals, 2, sd)*sqrt(daysperyear)
finalsoos <- cbind(meanf, sdf, meanf/sdf)
colnames(finalsoos) <- c("ROOS", "VOOS", "SRROS")
finalstats <- cbind(finalstats, finalsoos)
colnames(finalstats)[1:length(syms)] <- symnames

library(stargazer)
stargazer(t(finalstats), summary = FALSE, type="html", out="a.html")

write.csv(prices2, "stocks.csv", row.names = FALSE)

This is how the 13 U.S. stocks behaved.

Relative prices of top U.S. stocks

This is how 10 million random portfolios behaved (black dots indicate 100 best portfolios under the three criteria; big black dot for the equally weighted portfolio).

Return and risk of various U.S. portfolios

These are portfolio weights for U.S. companies with their performance.


	Equal	MaxSharpe	MaxReturn	MinRisk

Google	0.077	0.011	0.004	0.052
Apple	0.077	0.132	0.128	0.105
J.P. Morgan	0.077	0.133	0.099	0.028
Berkshire Hathaway	0.077	0.019	0.040	0.170
Wells Fargo	0.077	0.003	0.008	0.025
McDonald's	0.077	0.164	0.148	0.192
Bank of America	0.077	0.030	0.034	0.036
Microsoft	0.077	0.078	0.180	0.022
Amazon	0.077	0.088	0.161	0.005
Johnson & Johnson	0.077	0.138	0.064	0.138
Citigroup	0.077	0.003	0.003	0.009
Altria group	0.077	0.198	0.131	0.142
General Electric	0.077	0.003	0.001	0.078
Return in-sample	0.145	0.179	0.190	0.145
Risk in-sample	0.142	0.128	0.141	0.119
Sharpe ratio in-sample	1.023	1.395	1.345	1.217
Return in 2017.07–2018.06	0.135	0.163	0.241	0.053
Risk in 2017.07–2018.06	0.144	0.134	0.146	0.130
Sharpe ratio in 2017.07–2018.06	0.941	1.218	1.657	0.404

This is how these four U.S. portfolios look out of sample.

Returns for different optimal U.S. portfolios out-of-sample

If one had invested in these four portfolios a year before, this is what one would have harvested by now.

Cumulative growth of one's investments into U.S. stocks

And this is the result for top 9 Russian companies.

This is how the 9 Russian stocks behaved.

Relative prices of top Russian stocks

This is how 10 million random portfolios behaved (black dots indicate 100 best portfolios under the three criteria; big black dot for the equally weighted portfolio).

Return and risk of various Russian portfolios

Note that some portfolios can actually yield negative returns, and the volatility is generally higher!

These are portfolio weights for Russian companies with their performance.


	Equal	MaxSharpe	MaxReturn	MinRisk

SurgutNefteGas	0.111	0.005	0.009	0.175
RosNeft	0.111	0.162	0.031	0.269
GazProm	0.111	0.012	0.068	0.067
LukOil	0.111	0.018	0.083	0.037
SberBank	0.111	0.038	0.031	0.014
TransNeft	0.111	0.248	0.016	0.213
Norilsk Nickel	0.111	0.079	0.381	0.044
Alrosa	0.111	0.413	0.358	0.049
Yandex	0.111	0.026	0.022	0.131
Return in-sample	0.055	0.227	0.172	0.055
Risk in-sample	0.245	0.218	0.245	0.218
Sharpe ratio in-sample	0.223	1.042	0.701	0.250
Return in 2017.07–2018.06	0.202	0.164	0.223	0.164
Risk in 2017.07–2018.06	0.179	0.170	0.215	0.153
Sharpe ratio in 2017.07–2018.06	1.129	0.965	1.039	1.067

This is how these four Russian portfolios look out of sample.

Returns for different optimal Russian portfolios out-of-sample

If one had invested in these four portfolios a year before, this is what one would have harvested by now.

Cumulative growth of one's investments into Russian stocks

So actually if one used this investing strategy based on the information available in mid-2017, by now, depending on the criterion (Sharpe ratio or maximum profitability) they would have earned 14–22% on Russian stocks or 15–17% on American stocks!

[2019-05-31] Update: for high-dimensional problems (hundreds of stocks), many more MC trials are required! One might also want to change the Halton sequence to a random uniform generator (like ws <- matrix(runif(length(syms)*MC), ncol = length(syms))) or a Latin hypercube sampler (lhs).

TODO: constrained optimisation using the best stochastic solution as a starting point.

Why are you still here? Why are you not trading yet?