About Andreï Kostyrka

Науколюб, грамматический нацист, антитеист. Пишу стихотворения, сочиняю музыку, верстаю книги, занимаюсь эконометрикой и настраиваю фортепиано.

A quick and inaccurate answer to the question: how many people will contract COVID-19 in Russia?

Disclaimer: The current post expresses the personal opinion of the author, which may or may not coincide with the opinion of the WHO experts, governmental workers, health care professionals, or other involved institutions. Please check the official web site of the World Health Organisation for the most recent updates on the situation and professional advice.

In this post, we show how to fit the often-mentioned ‘logistic curve’ to COVID-19 cases across various countries using a couple of econometric techniques, and how to compute scenario forecasts.

There is one question everyone is asking in Russia: ‘When will our country hit the peak?’ This is a tricky question since the word ‘peak’ usually means ‘the day with the most new cases’ and has little to nothing to do with policy implications. The more general question usually sounds like ‘how will the dynamics look in 2 months’. Many articles—even scientific ones—seem to be talking about some ephemeral ‘logistic curve’. However, there is one problem: climbing in is not the same as climbing out, so the dynamics before and after the peak might vary. Tell me, do these two objects look similar?

Logistic curve coronavirus Italy cases

This is why asymmetry must be modelled explicitly. We are going to adopt an approach due to Fernández & Steel (1998), where they generalise symmetrical probability density functions to incorporate the skewness parameter, namely stretching the left part and shrinking the right part by the same factor, and then, normalising to match the first two moments of the original distribution. For the sake of interpretability, we are not going to match the moments since we are not modelling densities; we are just taking the logistic curve, and shifting, scaling, and skewing it in order to match the observed data. Besides that, the logistic curve has exponentially decaying tails, which might not be too realistic due to the commonplace complacency: people think that when there are 10–15 new cases per day, it is over, and begin communicating too closely, and it becomes their (and eveyone’s) undoing. Modelling of tail heaviness (as with Student’s t distribution) could also be an option, but there might be not too many observations to reliably estimate this parameter. The formula for the new-cases-per-day logistic curve will therefore follow the form:
g(x; \mu, \sigma, s, h) \mathrel{\stackrel{\text{def}}{=}}
h \cdot f \left( \frac{x - \mu}{\sigma \cdot s} \right), & \frac{x - \mu}{\sigma} > 0, \\
h \cdot f \left( \frac{s(x - \mu)}{\sigma} \right), & \frac{x - \mu}{\sigma} \le 0.
Here, \(\mu\) is the mode (peak of the function), \(\sigma\) is the spread, \(s\) is the asymmetry coefficient that takes values from \((0, +\infty)\) and shows the relative heaviness of the right tail (\(s > 1\) implies a longer right tail), and finally, \(h\) corresponds to vertical scaling (the number of cases per country matters, so the absolute scale should be inferred from the data, too). Remember that f can be any symmetrical function—here, we are using symmetrical-density-like ones for convenience. Remember that we want a quick answer!

Density logistic curve for corona virus flattening

First of all, we download the March and April data from Wikipedia pages titled ‘2020 coronavirus pandemic in X’. We shall focus our attention on Spain, Germany, Italy, Luxembourg, Sweden, Czech Republic, Turkey, and Russia. You will find the data in the table below; please note that these data are as of the 1st of May, 2020, and are subject to change due to methodology updates or new reports.

Date spa ger ita lux swe cze tur rus
2020-03-01 85 117 1,694 0 7 3 0 0
2020-03-02 121 150 2,036 0 7 3 0 3
2020-03-03 166 188 2,502 0 11 5 0 3
2020-03-04 228 240 3,089 0 16 5 0 3
2020-03-05 282 349 3,858 2 17 8 0 4
2020-03-06 401 534 4,636 3 22 19 0 10
2020-03-07 525 684 5,883 4 29 26 0 14
2020-03-08 674 847 7,375 5 38 32 0 17
2020-03-09 1,231 1,112 9,172 5 54 38 0 20
2020-03-10 1,695 1,460 10,149 7 108 63 0 20
2020-03-11 2,277 1,884 12,462 7 181 94 1 28
2020-03-12 3,146 2,369 15,113 26 238 116 1 34
2020-03-13 5,232 3,062 17,660 34 266 141 5 45
2020-03-14 6,391 3,795 21,157 51 292 189 6 59
2020-03-15 7,988 4,838 24,747 77 339 298 18 63
2020-03-16 9,942 6,012 27,980 81 384 383 47 93
2020-03-17 11,826 7,156 31,506 140 425 450 98 114
2020-03-18 14,769 8,198 35,713 203 511 560 191 147
2020-03-19 18,077 10,999 41,035 335 603 765 359 199
2020-03-20 21,571 13,957 47,021 484 685 889 670 253
2020-03-21 25,496 16,662 53,578 670 767 1,047 947 306
2020-03-22 29,909 18,610 59,138 798 845 1,161 1,236 367
2020-03-23 35,480 22,672 63,927 875 927 1,287 1,529 438
2020-03-24 42,058 27,436 69,176 1,099 1,027 1,472 1,872 495
2020-03-25 50,105 31,554 74,386 1,333 1,125 1,763 2,433 658
2020-03-26 57,786 36,508 80,539 1,453 1,228 2,022 3,629 840
2020-03-27 65,719 42,288 86,498 1,605 1,317 2,395 5,698 1,036
2020-03-28 73,232 48,582 92,472 1,831 1,392 2,657 7,402 1,264
2020-03-29 80,110 52,547 97,689 1,950 1,450 2,817 9,217 1,534
2020-03-30 87,956 57,298 101,739 1,988 1,566 3,001 10,827 1,836
2020-03-31 95,923 61,913 105,792 2,178 1,683 3,308 13,531 2,337
2020-04-01 104,118 67,366 110,574 2,319 1,828 3,589 15,679 2,777
2020-04-02 112,065 73,522 115,242 2,487 1,982 3,858 18,135 3,548
2020-04-03 119,199 79,696 119,827 2,612 2,179 4,190 20,921 4,149
2020-04-04 126,168 85,778 124,632 2,729 2,253 4,472 23,934 4,731
2020-04-05 131,646 91,714 128,948 2,804 2,359 4,587 27,069 5,389
2020-04-06 136,675 95,391 132,547 2,843 2,566 4,822 30,217 6,343
2020-04-07 141,942 99,225 135,586 2,970 2,763 5,017 34,109 7,497
2020-04-08 148,220 103,228 139,422 3,034 2,890 5,312 38,226 8,672
2020-04-09 153,222 108,202 143,626 3,115 3,036 5,569 42,282 10,131
2020-04-10 158,273 113,525 147,577 3,223 3,100 5,732 47,029 11,917
2020-04-11 163,027 117,658 152,271 3,270 3,175 5,902 52,167 13,584
2020-04-12 166,831 120,479 156,363 3,281 3,254 5,991 56,956 15,770
2020-04-13 170,099 123,016 159,516 3,292 3,369 6,059 61,049 18,328
2020-04-14 174,060 125,098 162,488 3,307 3,539 6,141 65,111 21,102
2020-04-15 180,659 127,584 165,155 3,373 3,665 6,301 69,392 24,490
2020-04-16 184,948 130,450 168,941 3,444 3,785 6,433 74,193 27,938
2020-04-17 190,839 133,830 172,434 3,480 3,909 6,549 78,546 32,008
2020-04-18 194,416 137,439 175,925 3,537 3,964 6,654 82,329 36,793
2020-04-19 198,674 139,897 178,972 3,550 4,027 6,746 86,306 42,853
2020-04-20 200,210 141,672 181,228 3,558 4,173 6,900 90,980 47,121
2020-04-21 204,178 143,457 183,957 3,618 4,284 7,033 95,591 52,763
2020-04-22 208,389 145,694 187,327 3,654 4,406 7,132 98,674 57,999
2020-04-23 213,024 148,046 189,973 3,665 4,515 7,187 101,790 62,773
2020-04-24 219,764 150,383 192,994 3,695 4,610 7,273 104,912 68,622
2020-04-25 223,759 152,438 195,351 3,711 4,686 7,352 107,773 74,588
2020-04-26 226,629 154,175 197,675 3,723 4,756 7,404 110,130 80,949
2020-04-27 229,422 155,193 199,414 3,729 4,884 7,445 112,261 87,147
2020-04-28 232,128 156,337 201,505 3,741 4,980 7,504 114,653 93,558
2020-04-29 236,899 157,641 203,591 3,769 5,040 7,579 117,589 99,399
2020-04-30 239,369 159,119 205,463 3,784 5,051 7,682 120,204 106,498

Secondly, we have to choose the estimation method. Basically, for every country, we have pairs of observations \((1, Y_1), (2, Y_2), \ldots, (61, Y_{61})\), and the functional form is assumed to be known up to 4 (logistic) or 5 (Student) parameters. The most obvious way of estimation is minimising some statistic based on the discrepancies between the real data and the model predictions, the easiest case being squared errors and the sum thereof. However, often in practice more robust penalty functions are used, and sometimes, more robust statistics, so here, we shall also try minimising the sum of absolute deviations (like in a quantile non-linear regression) and the median of squared errors (like in a Least-Median-of-Squares regression due to Rousseeuw). We shall be minimising these 3 criteria. Note that the last one is not smooth in parameters, so there might be a continuum of parameter values yielding the global minimum of the criterion.
C_1 = \frac{1}{n} \sum_{t=1}^T (Y_t - g(t; \theta))^2, \quad C_2 = \frac{1}{n} \sum_{t=1}^T |Y_t - g(t; \theta)|, \quad C_3 = \mathrm{median}[(Y_t - g(t; \theta))^2]
Thirdly, since we are modelling parametric curves with four intuitive parameters, we pick some ballpark estimates in order to roughly match the dynamics. The criterion function is highly non-linear in parameters, so we begin by manually fiddling with parameters for each of our countries, and then, constructing a rather generous hypercube around it and trying a stochastic optimisation routine to pick the candidate for the best solution. We further refine it with constrained Nelder—Mead optimiser.

We implement the general curve via a custom function that depends on the data points (days elapsed from the 1st of March), distribution type (we consider logistic and Student, but other ones are possible), and finally, distribution parameters (a vector of length 4 or 5: mean, scale, skewness, height multiplier, and for Student’s t distribution, the number of degrees of freedom).

Remember, we do not refer to this function as to density since we do not know the real scale; it is just a curve, and its unit of measurement is, well, people, so it measures the number of cases. This is how all dynamics look in one graph.

New covid 19 cases as a fraction of maximum

Now, it is time to choose reasonable starting values for the optimiser because they belong to various scales. For numerical stability, especially for gradient-based methods, it is reasonable to have all parameters of the same order of magnitude (say, via rescaling), and it seems like the vertical scale parameter is in the range of tens of thousands, so we divide it by 10,000. For every country, we eyeballed the parameters \((\tilde \mu, \tilde\sigma, \tilde s, \tilde h\ [, \widetilde{df}])\), and then set the following ranges for stochastic search: \((\tilde\mu/2, \tilde\sigma/3, \tilde s/3, \tilde h / 4\ [, 0.1])\) to \((1.5\tilde\mu, 3\tilde\sigma, 3\tilde s, 3\tilde h\ [, 50])\).

Initial values for search of covid 19 dynamics parameters

For optimisation, we are using the hydroPSO library with its excellent parallel capabilities and many tweaks. Since high-dimensional spaces are getting sparser than one can imagine, for a more thorough sweep, we are using a population of size 400 or 600. Convergence is quite quick for the logistic model, and somewhat slower for the Student’s one, so we use 50–100 iterations for the former and 75–125 iterations for the latter (this is a rough first-step preëstimation anyway). One iteration took 0.25–0.5 seconds, so estimating all models took 8 ⋅ (50 + 75 + 100 + 75 + 100 + 125) ⋅ 0.35 ≈ 1500 seconds, or roughly half an hour—a small price to pay for robustness against multiple local optima.

You would think that estimation can continue without a hindrance... right? Wrong!

There is one cloud on the horizon: so far, there has been no peak in Russia... yet! This open a whole new can of worms: strictly speaking, the skewness parameter and the scale parameter are not identified because we observe only the left side of the distribution: one can multiply the skewness parameter by 2 and the scale parameter by 2, and get the same left tail. Relying on the rightmost end of the data is a dangerous affair in this case. It is well-known that estimation of the truncated normal parameters when the mean is outside the truncation range is very unreliable; in our situation, the peak is also outside the range, which makes our problem even worse. This means that for the purposes of estimation, the skewness parameter must be set to that of other countries, so only scale will be estimated. E. g. if Russia were to follow the right-tail decay model of Luxembourg, what would the total number of cases be? In other words, we can make scenario forecasts: Turkish scenario, Italian scenario, average scenario.

The second-step maximisation from the candidate solution was done in a blink of an eye, and the estimates themselves did not change much. However, the results were more volatile for Sweden, one model returned strange results for Turkey, and, as expected, the results for Russia were quite strange—because we have not applied the magic fix yet.

Fitted logistic curve total infected coronavirus cases

Now it is obvious that there is no clear winner among the models, and both logistic and Student’s fit yield similar fits. For the sake of parsimony, we restrict our analysis to logistic fits only and forget about tail heaviness estimation.

Besides that, in this estimation, we are not worrying about statistical inference. Of course, for the NLS estimator, one can compute robust standard errors, but for these data, it is non-trivial to do so for the two other estimators, and the data are dependent, so there is no obvious solution. Remember, we are keeping it simple here.

Now comes the part everyone is eagerly waiting for: how many cases are expected to occur in the upcoming months? Or, say, on which date are we expecting less than one person to be infected? Or even better, when are we expecting less than 1 case in two weeks (that is, 1/14 cases per day), and what will be the toll up until that date—so we can call it the end of the pandemic? For this, we can invoke a very simple computation: increase the time frame to the right by as much as we want, and just find the points where that curve takes values 1 or 1/14. The total expected number of cases will be the cumulative sum up until that date.

Now, we use the logistic models to forecast the number of cases in each country until the day less than 1 case or less than 1/14 cases per day. For every country, we pick the best model (out of three logistic ones) as the one that predicted the closest number of total cases up to the last date (i. e. the best in-sample fit).

Real cases by April, 30 Mean least squares Mean abs. dev. Median least sq.
Spain 239,369 237,304 242,555 203,910
Germany 159,119 158,465 164,505 183,371
Italy 205,463 203,246 204,339 212,175
Luxembourg 3,784 3,749 4,108 5,147
Sweden 5,051 5,076 5,046 5,160
Czech Republic 7,682 7,646 7,710 8,033
Turkey 120,204 121,944 116,104 102,600
Cases by April, 30 Days from March, 1 till 1 new case/day Days from March, 1 till 1 new case/14 days Cases till 1 new case/day Cases till 1 new case/14 days
239,369 189 231 282,401 282,415
159,119 142 172 171,369 171,379
205,463 201 249 241,977 241,993
3,784 83 109 3,838 3,847
5,051 148 196 6,752 6,768
7,682 95 120 8,029 8,037
120,204 140 166 145,922 145,930
Peak (μ) Horiz. scale (σ) Skewness (s) Vert. scale (h)
Spain 25.383 7.859 2.026 2.854
Germany 27.533 7.050 1.612 2.180
Italy 20.519 9.619 1.881 2.103
Luxembourg 20.131 3.999 2.410 0.068
Sweden 32.858 12.033 1.519 0.052
Czech Republic 27.818 6.764 1.400 0.113
Turkey 41.888 7.754 1.277 1.829

But... wait! What about Russia?

We have 7 countries and 7 different scenarios. In terms of parameters, these scenarios are described by the peak day (\(\mu\)), steepness of the curve (\(\sigma\)), asymmetry (\(s\)), overall scale (\(h\)), and for Russia, we cannot know the degree of asymmetry. Compare these two plots.

Density logistic curve yielding different tails

For the same left tail, we get the red right tail that is 4 times heavier than the blue one. This has huge implication for the forecasts as well: the blue curve describes a pandemic that decreases to less than 1 new case per day in 66 days with 37,469 cases, and the red one will take a whopping 232 days with 127,451 infections! This means that the initial values for Russia must be tweaked according to all 7 scenarios for other countries.

After we re-estimate the logistic model for Russia with skewness fixed to 1 (least absolute deviations yielded a slightly better total number of cases, with relative error slightly less than 1%), we can multiply both the scale \(\sigma\) and the skewness \(s\) by other countries’ skewnesses, compute the two sacred days (less than 1 case per 1 day or per 14 days), and obtain the total number of cases.

The estimates for Russia are: \(\hat\mu = 56.1\), \(\hat\sigma / s = 6.33\), \(\hat h = 2.49\). This means that, despite the lack of clear peak by day 61, it must have occurred on day 56, just before the end of data collection.

Days till 1 new case/day Days till 1 new case/14 days Cases till 1 new case/day Cases till 1 new case/14 days
Spain scenario 320 388 402,422 402,445
Germany scenario 223 267 283,730 283,745
Italy scenario 284 343 357,887 357,907
Luxembourg scenario 429 526 536,540 536,573
Sweden scenario 205 243 260,712 260,724
Czech Republic scenario 182 215 233,328 233,339
Turkey scenario 161 188 207,443 207,452
Average scenario (s = 1.73) 249 299 315,371 315,387

Russia coronavirus covid 19 cases forecast daily

Russia coronavirus covid 19 cases forecast cumulative

Before we jump to the conclustion, there are several ‘buts’ worth mentionings:

  1. Inferring the shift of the curve (\(\hat\mu\)) from the left part is very unreliable, so there is a huge confidence band around that estimate. How huge? It is up to the dear reader to compute the standard errors, but no one should even assume normality of the estimator since it is well known that the behaviour of estimators on the boundary of the parameter space is far from asymptotically normal (cf. the failure of maximal likelihood for estimation of extreme order statistics).
  2. The ‘peak’, although defined as the maximum number of cases, in our estimation, corresponds to th peak of the smoother underlying curve, even if individual observations can jump above or below, assuming, among others, that the error is mean zero. If there is systematic under-reporting because there are not enough tests available and the real numbers are higher, this overly simplistic model cannot learn anything useful from the data, and other, more sophisticated methods must be used.
  3. There are many asymptomatic cases that are not registered anywhere that are making up the underwater part of the iceberg, so if that iceberg of unobservedness were to be raised entirely above water, the peak would shift to the right.
  4. The reported daily numbers depend heavily on testing tactics, good weather (this is not ‘sunspot economics’ envisioned by Jevons, this correlates with the number of people gathering outside), presidental speeches, and other hidden variables.

All these caveates hint at the fact that what we obtained is an under-estimate of the real peak, but dealing with the problems outlined above required better spatial data, more information about the population density, geographical distribution—a real job for real epidemilogists.

We conclude the following (if Russia follows the dynamics of some European countries and if one blisfully ignores the caveats above):

  1. The peak is most likely to have already occurred at the very end of April, 2020;
  2. The total number of infected people will be between 200 and 500 hundred thousand Russian citizens;
  3. Under the assumption of average asymmetrical European scenario (where the decrease of cases per day after the peak is roughly 3 times slower than the growth prior to reaching the peak), the total number of cases is estimated to be 315 thousand;
  4. In early November, the number of new cases per day will become less than 1, and by the end of December, the number, there will be fewer than 1 case per 2 weeks.

Only time will tell how accurate these quick estimates are, so you are encouraged to redo the estimation on new data (maybe with more countries) and update the forecasts.

Copy the table above into R under the name ds (data frame) and use the code below to replicate these results.

cs <- c("spa", "ger", "ita", "lux", "swe", "cze", "tur", "rus")
csn <- c("Spain", "Germany", "Italy", "Luxembourg", "Sweden", "Czech Republic", "Turkey", "Russia")
mycurve <- function(x, distrib, pars) {
  pars[4] <- pars[4] * 1e4
  xstd <- (x - pars[1]) / pars[2]
  f <- switch(distrib,
              logistic = ifelse(xstd > 0, dlogis(xstd/pars[3]), dlogis(xstd*pars[3])) * pars[4],
              student = ifelse(xstd > 0, dt(xstd/pars[3], df = pars[5]), dt(xstd*pars[3], df = pars[5])) * pars[4]
png("plot1-logistic-curve-coronavirus-italy.png", 640, 400, type = "cairo", pointsize = 16)
plot(diff(ds$ita), type = "l", xlab = "Days since the 1st of March", ylab = "New cases per day", bty = "n", lwd = 2, main = "Logistic curve for Italy")
lines(1:60, mycurve(1:60, "logistic", c(25, 6, 1, 2.45)), col = "red", lty = 2, lwd = 2)
text(50, 1500, "WTF?!", col = "red", cex = 2)
mycols <- c("#000000", "#008800", "#AA0000")
png("plot2-density-logistic-curve-for-corona-virus-flattening.png", 720, 360, pointsize = 16)
par(mar = c(2, 2, 3, 1), mfrow = c(1, 2))
curve(dlogis(x), -6, 6, n = 1001, main = "Skewed logistic density", bty = "n", lwd = 2, ylim = c(0, dlogis(0)), ylab = "", xlab = "")
curve(mycurve(x, "logistic", c(0, 1, 1.5, 1e-4)), -6, 6, n = 1001, add = TRUE, col = mycols[2], lwd = 2, lty = 2)
curve(mycurve(x, "logistic", c(0, 1, 0.5, 1e-4)), -6, 6, n = 1001, add = TRUE, col = mycols[3], lwd = 2, lty = 2)
legend("topright", c("log.", "log., skew = 1.5", "log., skew = 0.5"), col = mycols, pch = 16, bty = "n")
curve(dt(x, 5), -6, 6, n = 1001, main = "Skewed Student’s density", bty = "n", lwd = 2, ylim = c(0, dt(0, 5)))
curve(mycurve(x, "student", c(0, 1, 1.5, 1e-4, 5)), -6, 6, n = 1001, add = TRUE, col = mycols[2], lwd = 2, lty = 2)
curve(mycurve(x, "student", c(0, 1, 0.5, 1e-4, 5)), -6, 6, n = 1001, add = TRUE, col = mycols[3], lwd = 2, lty = 2)
legend("topright", c("t(5)", "t(5, skew = 1.5)", "t(5, skew = 0.5)"), col = mycols, pch = 16, bty = "n")
ds[, (length(cs)+2):(length(cs)*2+1)] <- sapply(ds[, 2:(length(cs)+1)], function(x) diff(c(NA, x)))
colnames(ds)[(length(cs)+2):(length(cs)*2+1)] <- paste0("d", cs)
ds <- ds[2:nrow(ds), ]
mycols <- rainbow(length(cs), end = 0.8, v = 0.8)
png("plot3-new-covid-19-cases-as-a-fraction-of-maximum.png", 640, 400, pointsize = 16)
plot(NULL, NULL, xlim = c(1, nrow(ds)), ylim = c(0, 1), xlab = "Days since March, 1", ylab = "Cases / Cases at peak", bty = "n", main = "Scaled dynamics of new cases per day")
for (i in (length(cs)+2):(length(cs)*2+1)) lines(ds[, i] / max(ds[, i]), type = "l", lwd = 2, col = mycols[i - length(cs) - 1])
legend("topleft", cs, pch = 16, col = mycols, bty = "n")
start.pars <- list(list(c(27, 8, 1.8, 3), c(26, 11, 2, 2, 5)), # Spain
                   list(c(26, 6, 2, 2.3), c(27, 9, 2, 1.5, 5)), # Germany
                   list(c(20, 9, 2.1, 2.2), c(20, 12, 2.1, 1.5, 5)), # Italy
                   list(c(25, 5, 1.5, 0.08), c(25, 7, 1.5, 0.055, 5)), # Luxembourg
                   list(c(35, 8, 1.2, 0.07), c(35, 12, 1.2, 0.045, 5)), # Sweden
                   list(c(30, 7, 1.3, 0.11), c(30, 11, 1.3, 0.07, 5)), # Czech,
                   list(c(43, 7, 1.3, 1.8), c(43, 11, 1.3, 1.2, 5)), # Turkey
                   list(c(60, 7, 1, 2.8), c(58, 10, 1, 1.8, 5)) # Russia
start.pars <- lapply(start.pars, function(x) {names(x) <- c("logistic", "student"); x})
leg <- function(i) {
  c(parse(text = paste0('mu == ', start.pars[[i]][[1]][1], '*" | "*', start.pars[[i]][[2]][1])),
    parse(text = paste0('sigma == ', start.pars[[i]][[1]][2], '*" | "*', start.pars[[i]][[2]][2])),
    parse(text = paste0('s == ', start.pars[[i]][[1]][3], '*" | "*', start.pars[[i]][[2]][3])),
    parse(text = paste0('h == ', start.pars[[i]][[1]][4], '*" | "*', start.pars[[i]][[2]][4])),
    parse(text = paste0('df(t) == ', start.pars[[i]][[2]][5])))
png("plot4-initial-values-for-search.png", 720, 1200, pointsize = 16)
par(mfrow = c(4, 2))
for (i in 1:length(cs)) {
  plot(ds[, paste0("d", cs[i])], type = "l", lwd = 2, xlab = "Days since March, 1", ylab = "Registered cases", bty = "n", main = paste0(csn[i], " initial values"), ylim = c(0, max(ds[, paste0("d", cs[i])])))
  lines(mycurve(1:nrow(ds), "logistic", start.pars[[i]][[1]]), col = mycols[i], lwd = 2, lty = 2)
  lines(mycurve(1:nrow(ds), "student", start.pars[[i]][[2]]), col = mycols[i], lwd = 2, lty = 3)
  legend("topleft", legend = leg(i), bty = "n")
criterion <- function(pars, observed, distrib, fun = function(x) mean(x^2)) {
  theor <- mycurve(x = 1:length(observed), distrib = distrib, pars = pars)
  return(fun(theor - observed))
rls <- function(x) sqrt(mean(x^2)) # Root mean least square; root is here for the purposes of comparison
nla <- function(x) mean(abs(x))
nms <- function(x) sqrt(median(x^2)) # Root median of squares
opt.res <- vector("list", length(cs))
for (i in 1:length(cs)) {
  sl <- start.pars[[i]][[1]]
  st <- start.pars[[i]][[2]]
  ll <- start.pars[[i]][[1]] / c(2, 3, 3, 4) # Lower bound for logistic
  ul <- start.pars[[i]][[1]] * c(1.5, 3, 3, 3) # Upper bound for logistic
  lt <- c(start.pars[[i]][[2]][1:4] / c(2, 3, 3, 4), 0.1) # Lower bound for Student
  ut <- c(start.pars[[i]][[2]][1:4] * c(1.5, 3, 3, 3), 50) # Upper bound for Student
  contl1 <- list(npart = 400, REPORT = 1, parallel = "parallel", par.nnodes = 6, maxit = 50, plot = TRUE)
  contl2 <- list(npart = 400, REPORT = 1, parallel = "parallel", par.nnodes = 6, maxit = 75, plot = TRUE)
  contl3 <- list(npart = 400, REPORT = 1, parallel = "parallel", par.nnodes = 6, maxit = 100, plot = TRUE)
  contt1 <- list(npart = 600, REPORT = 1, parallel = "parallel", par.nnodes = 6, maxit = 75, plot = TRUE)
  contt2 <- list(npart = 600, REPORT = 1, parallel = "parallel", par.nnodes = 6, maxit = 100, plot = TRUE)
  contt3 <- list(npart = 600, REPORT = 1, parallel = "parallel", par.nnodes = 6, maxit = 125, plot = TRUE)
  set.seed(1); cat(csn[i], "logistic: least squares\n")
  opt.log.nls <- hydroPSO(par = sl, fn = function(x) criterion(x, ds[, paste0("d", cs[i])], "logistic", rls), lower = ll, upper = ul, control = contl1)
  set.seed(1); cat(csn[i], "logistic: least absolute deviations\n")
  opt.log.nla <- hydroPSO(par = sl, fn = function(x) criterion(x, ds[, paste0("d", cs[i])], "logistic", nla), lower = ll, upper = ul, control = contl2)
  set.seed(1); cat(csn[i], "logistic: median of least squares\n")
  opt.log.nms <- hydroPSO(par = sl, fn = function(x) criterion(x, ds[, paste0("d", cs[i])], "logistic", nms), lower = ll, upper = ul, control = contl3)
  set.seed(1); cat(csn[i], "student: least squares\n")
  opt.t.nls <- hydroPSO(par = st, fn = function(x) criterion(x, ds[, paste0("d", cs[i])], "student", rls), lower = lt, upper = ut, control = contt1)
  set.seed(1); cat(csn[i], "student: least absolute deviations\n")
  opt.t.nla <- hydroPSO(par = st, fn = function(x) criterion(x, ds[, paste0("d", cs[i])], "student", nla), lower = lt, upper = ut, control = contt2)
  set.seed(1); cat(csn[i], "student: median of least squares\n")
  opt.t.nms <- hydroPSO(par = st, fn = function(x) criterion(x, ds[, paste0("d", cs[i])], "student", nms), lower = lt, upper = ut, control = contt3)
  opt.res[[i]] <- list(opt.log.nls, opt.log.nla, opt.log.nms, opt.t.nls, opt.t.nla, opt.t.nms)
  save(opt.res, file = "opt.RData", compress = "xz")
ui4 <- diag(4)
ci4 <- c(10, 1, 0.01, 1e-5)
ui5 <- diag(5)
ci5 <- c(10, 1, 0.01, 1e-5, 0.1)
opt.refined <- vector("list", length(cs))
for (i in 1:length(cs)) {
  cat(csn[i], "\n")
  sl1 <- opt.res[[i]][[1]]$par # Replace with start.pars[[i]]$logistic if you did not do the stochastic optimisation
  sl2 <- opt.res[[i]][[2]]$par # Replace with start.pars[[i]]$logistic if you did not do the stochastic optimisation
  sl3 <- opt.res[[i]][[3]]$par # Replace with start.pars[[i]]$logistic if you did not do the stochastic optimisation
  st1 <- opt.res[[i]][[4]]$par # Replace with start.pars[[i]]$student if you did not do the stochastic optimisation
  st2 <- opt.res[[i]][[5]]$par # Replace with start.pars[[i]]$student if you did not do the stochastic optimisation
  st3 <- opt.res[[i]][[6]]$par # Replace with start.pars[[i]]$student if you did not do the stochastic optimisation
  opt.log.nls <- constrOptim(sl1, function(x) criterion(x, ds[, paste0("d", cs[i])], "logistic", rls), grad = NULL, ui = ui4, ci = ci4, outer.iterations = 300)
  opt.log.nla <- constrOptim(sl2, function(x) criterion(x, ds[, paste0("d", cs[i])], "logistic", nla), grad = NULL, ui = ui4, ci = ci4, outer.iterations = 300)
  opt.log.nms <- constrOptim(sl3, function(x) criterion(x, ds[, paste0("d", cs[i])], "logistic", nms), grad = NULL, ui = ui4, ci = ci4, outer.iterations = 300)
  opt.t.nls <- constrOptim(st1, function(x) criterion(x, ds[, paste0("d", cs[i])], "student", rls), grad = NULL, ui = ui5, ci = ci5, outer.iterations = 300)
  opt.t.nla <- constrOptim(st2, function(x) criterion(x, ds[, paste0("d", cs[i])], "student", nla), grad = NULL, ui = ui5, ci = ci5, outer.iterations = 300)
  opt.t.nms <- constrOptim(st2, function(x) criterion(x, ds[, paste0("d", cs[i])], "student", nms), grad = NULL, ui = ui5, ci = ci5, outer.iterations = 300)
  opt.refined[[i]] <- list(opt.log.nls, opt.log.nla, opt.log.nms, opt.t.nls, opt.t.nla, opt.t.nms)
save(opt.res, opt.refined, file = "opt.RData", compress = "xz")
xmax <- 120
xinf <- 1000
mycols <- c("#CC0000", "#0000FFCC")
png("plot5-fitted-logistic-curve-total-infected-coronavirus-cases.png", 720, 1200, type = "cairo", pointsize = 16)
par(mfrow = c(4, 2))
for (i in 1:length(cs)) {
  plot(ds[, paste0("d", cs[i])], xlim = c(1, xmax), type = "l", lwd = 3, xlab = "Days since March, 1", ylab = "Cases", bty = "n", main = paste0("Fitted curves for ", csn[i])) 
  for (j in 1:3) {
    lines(mycurve(1:xmax, distrib = "logistic", pars = opt.refined[[i]][[j]]$par), lwd = 2, lty = j, col = mycols[1])
    lines(mycurve(1:xmax, distrib = "student", pars = opt.refined[[i]][[j + 3]]$par), lwd = 2, lty = j, col = mycols[2])
  legend("topleft", c("Logis.", "Stud."), pch = 16, col = mycols, bty = "n")
  legend("topright", c("Mean(least squares)", "Mean(least abs. dev.)", "Median(least abs. sq.)"), lty = 1:3, lwd = 2, bty = "n")
res.tab <- vector("list", length(cs))
for (i in 1:length(cs)) {
  a <- cbind(c(start.pars[[i]]$logistic, NA), c(opt.refined[[i]][[1]]$par, NA), c(opt.refined[[i]][[2]]$par, NA), c(opt.refined[[i]][[3]]$par, NA),
             start.pars[[i]]$student, opt.refined[[i]][[4]]$par, opt.refined[[i]][[5]]$par, opt.refined[[i]][[6]]$par)
  colnames(a) <- c("Log. init. val.", "Log. LS", "Log. LAD", "Log. LMS", "Stud. init. val.", "Stud. LS", "Stud. LAD", "Stud. LMS")
  row.names(a) <- c("Peak", "Horiz. scale", "Skew", "Vert. scale", "Stud. DoF")
  res.tab[[i]] <- a
  # stargazer(a, out = paste0(csn[i], ".html"), type = "html", summary = FALSE, digits = 2)
find.day <- function(cases, distrib, pars) {
  uniroot(function(x) mycurve(x, distrib, pars) - cases, lower = pars[1], upper = pars[1] + 180, extendInt = "downX", maxiter = 200, tol = 1e-8)$root
get.cases <- function(xmax, distrib, pars) {
  round(sum(mycurve(1:xmax, distrib, pars)))
days.to.one <- days.to.tw <- cases.one <- cases.tw <- cases.insample <- matrix(NA, nrow = length(cs), ncol = 6)
best.mod <- best.log <- numeric(length(cs))
for (i in 1:8) {
  days.to.one[i, ] <- c(sapply(1:3, function(j) find.day(1, "logistic", opt.refined[[i]][[j]]$par)),
                        sapply(4:6, function(j) find.day(1, "student", opt.refined[[i]][[j]]$par)))
  days.to.tw[i, ] <- c(sapply(1:3, function(j) find.day(1/14, "logistic", opt.refined[[i]][[j]]$par)),
                        sapply(4:6, function(j) find.day(1/14, "student", opt.refined[[i]][[j]]$par)))
  cases.one[i, ] <- c(sapply(1:3, function(j) get.cases(ceiling(days.to.one[i, j]), "logistic", opt.refined[[i]][[j]]$par)),
                      sapply(4:6, function(j) get.cases(ceiling(days.to.one[i, j]), "student", opt.refined[[i]][[j]]$par)))
  cases.tw[i, ] <- c(sapply(1:3, function(j) get.cases(ceiling(days.to.tw[i, j]), "logistic", opt.refined[[i]][[j]]$par)),
                      sapply(4:6, function(j) get.cases(ceiling(days.to.tw[i, j]), "student", opt.refined[[i]][[j]]$par)))
  cases.insample[i, ] <- c(sapply(1:3, function(j) get.cases(nrow(ds), "logistic", opt.refined[[i]][[j]]$par)),
                           sapply(4:6, function(j) get.cases(nrow(ds), "student", opt.refined[[i]][[j]]$par)))
  best.mod[i] <- which.min(abs(ds[nrow(ds), cs[i]] - cases.insample[i, ]))
  best.log[i] <- which.min(abs(ds[nrow(ds), cs[i]] - cases.insample[i, 1:3]))
  cat(csn[i], "has the best in-sample fit from model", best.mod[i], "and among logistic models,", best.log[i], "\n")
colnames(days.to.one) <- colnames(days.to.tw) <- c("Log. LS", "Log. LAD", "Log. LMS", "Stud. LS", "Stud. LAD", "Stud. LMS")
row.names(days.to.one) <- row.names(days.to.tw) <- csn
mod.inds <- c("Mean least squares", "Mean abs. dev.", "Median least sq.")
cases.ins <- cbind(as.numeric(ds[nrow(ds), 2:(length(cs))]), cases.insample[1:7, 1:3])
colnames(cases.ins) <- c("Real cases by April, 30", paste0(mod.inds, " forecast"))
row.names(cases.ins) <- csn[1:7]
stargazer(cases.ins, type = "html", out = "insample.html", summary = FALSE)
best.pars <- lapply(1:7, function(i) opt.refined[[i]][[best.log[i]]]$par)
best.pars <- do.call(rbind, best.pars)
row.names(best.pars) <- csn[1:7]
colnames(best.pars) <- c("Peak (mu)", "Horiz. scale (sigma)", "Skewness (s)", "Vert. scale (h)")
stargazer(best.pars, type = "html", out = "bestpars.html", summary = FALSE)
days.to <- ceiling(cbind(sapply(1:7, function(i) days.to.one[i, best.log[i]]), sapply(1:7, function(i) days.to.tw[i, best.log[i]])))
cases <- ceiling(cbind(sapply(1:7, function(i) cases.one[i, best.log[i]]), sapply(1:7, function(i) cases.tw[i, best.log[i]])))
dcases <- cbind(as.numeric(ds[nrow(ds), 2:length(cs)]), days.to, cases)
colnames(dcases) <- c("Cases by April, 30", "Days till 1 new case/day", "Days till 1 new case/14 days", "Cases till 1 new case/day", "Cases till 1 new case/14 days")
row.names(dcases) <- csn[1:7]
stargazer(dcases, type = "html", out = "days-to-one.html", summary = FALSE)
png("plot6-density-logistic-curve-yielding-different-tails.png", 600, 400, pointsize = 16)
par(mar = c(2, 2, 3, 1))
curve(mycurve(x, "logistic", c(10, 3, 2, 1)), 0, 10, n = 1001, col = "blue", lwd = 2, bty = "n", xlab = "n", ylab = "n", main = "Identification problems without the right tail (logistic)", xlim = c(0, 40))
curve(mycurve(x, "logistic", c(10, 6, 4, 1)), 0, 10, n = 1001, add = TRUE, col = "red", lwd = 2)
curve(mycurve(x, "logistic", c(10, 3, 2, 1)), 10, 40, n = 1001, add = TRUE, col = "blue", lwd = 2, lty = 2)
curve(mycurve(x, "logistic", c(10, 6, 4, 1)), 10, 40, n = 1001, add = TRUE, col = "red", lwd = 2, lty = 2)
legend("right", c(expression(sigma == 3 ~ ", " ~ s == 2), expression(sigma == 6 ~ ", " ~ s == 4)), bty = "n", col = c("blue", "red"), lwd = 2)
legend("topleft", c(expression(mu == 10), expression(h == 1)), bty = "n")
d.good <- find.day(1, "logistic", c(10, 3, 2, 1))
d.bad <- find.day(1, "logistic", c(10, 6, 4, 1))
get.cases(ceiling(d.good), "logistic", c(10, 3, 2, 1))
get.cases(ceiling(d.bad), "logistic", c(10, 6, 4, 1))
criterion.log.rest <- function(pars, skew, observed, fun = function(x) mean(x^2)) {
  theor <- mycurve(x = 1:length(observed), distrib = "logistic", pars = c(pars[1:2], skew, pars[3]))
  return(fun(theor - observed))
set.seed(1); cat("Russia logistic: least squares\n")
opt.log.nls <- hydroPSO(par = start.pars[[8]]$logistic, fn = function(x) criterion.log.rest(x, 1, ds[, "drus"], rls), lower = start.pars[[8]]$logistic[c(1, 2, 4)] / 4, upper = start.pars[[8]]$logistic[c(1, 2, 4)] * 4, control = contt3)
set.seed(1); cat("Russia logistic: least absolute deviations\n")
opt.log.nla <- hydroPSO(par = start.pars[[8]]$logistic[c(1, 2, 4)], fn = function(x) criterion.log.rest(x, 1, ds[, "drus"], nla), lower = start.pars[[8]]$logistic[c(1, 2, 4)] / 4, upper = start.pars[[8]]$logistic[c(1, 2, 4)] * 4, control = contt3)
rus.nls <- c(opt.log.nls$par[1:2], 1, opt.log.nls$par[3])
rus.nla <- c(opt.log.nla$par[1:2], 1, opt.log.nla$par[3])
rus.nls.cases <- get.cases(nrow(ds), "logistic", rus.nls)
rus.nla.cases <- get.cases(nrow(ds), "logistic", rus.nla)
c(ds[nrow(ds), "rus"], ds[nrow(ds), "rus"] - rus.nls.cases, ds[nrow(ds), "rus"] - rus.nla.cases)
# Least absolute deviations is better
russia <- matrix(NA, nrow = 8, ncol = 4)
mycols <- rainbow(length(cs), end = 0.8, v = 0.8)
for (i in 1:7) {
  these.pars <- c(rus.nla[1], rus.nla[2] * best.pars[i, 3], best.pars[i, 3], rus.nla[4])
  to1 <- ceiling(find.day(1, "logistic", these.pars))
  to14 <- ceiling(find.day(1/14, "logistic", these.pars))
  c1 <- get.cases(to1, "logistic", these.pars)
  c14 <- get.cases(to14, "logistic", these.pars)
  russia[i, ] <- c(to1, to14, c1, c14)
these.pars <- c(rus.nla[1], rus.nla[2] * mean(best.pars[, 3]), mean(best.pars[, 3]), rus.nla[4])
to1 <- ceiling(find.day(1, "logistic", these.pars))
to14 <- ceiling(find.day(1/14, "logistic", these.pars))
c1 <- get.cases(to1, "logistic", these.pars)
c14 <- get.cases(to14, "logistic", these.pars)
russia[8, ] <- c(to1, to14, c1, c14)
png("plot7-russia-coronavirus-covid-19-cases-forecast-daily.png", 640, 400, pointsize = 14)
plot(ds[, "drus"], xlim = c(1, 305), ylim = c(0, max(ds[, "drus"])), type = "l", lwd = 3, bty = "n", main = "Conditional forecast COVID-19 pandemic in Russia for 2020", xlab = "", ylab = "Daily cases", xaxt = "n")
axis(1, seq(1, 305, length.out = 11), c(month.name[3:12], ""), las = 2)
for (i in 1:7) {
  these.pars <- c(rus.nla[1], rus.nla[2] * best.pars[i, 3], best.pars[i, 3], rus.nla[4])
  lines(nrow(ds):305, mycurve(nrow(ds):305, "logistic", these.pars), lwd = 2, col = mycols[i])
these.pars <- c(rus.nla[1], rus.nla[2] * mean(best.pars[, 3]), mean(best.pars[, 3]), rus.nla[4])
lines(nrow(ds):305, mycurve(nrow(ds):305, "logistic", these.pars), lwd = 3, lty = 2, col = "black")
legend("topright", paste0(c(csn[1:7], "Average"), " scenario"), col = c(mycols[1:7], "black"), lwd = 2, lty = c(rep(1, 7), 2), bty = "n")
png("plot8-russia-coronavirus-covid-19-cases-forecast-cumulative.png", 640, 400, pointsize = 14)
plot(cumsum(ds[, "drus"]), xlim = c(1, 305), ylim = c(0, max(russia)), type = "l", lwd = 3, bty = "n", main = "Forecast dynamics of COVID-19 pandemic in Russia", xlab = "", ylab = "Total cases, thousands", xaxt = "n", yaxt = "n")
axis(1, seq(1, 305, length.out = 11), c(month.name[3:12], ""), las = 2)
axis(2, seq(0, 6e5, 1e5), as.character(seq(0, 6e2, 1e2)), las = 2)
for (i in 1:7) {
  these.pars <- c(rus.nla[1], rus.nla[2] * best.pars[i, 3], best.pars[i, 3], rus.nla[4])
  lines(nrow(ds):305, cumsum(mycurve(nrow(ds):305, "logistic", these.pars)) + ds[nrow(ds)-1, "rus"], lwd = 2, col = mycols[i])
these.pars <- c(rus.nla[1], rus.nla[2] * mean(best.pars[, 3]), mean(best.pars[, 3]), rus.nla[4])
lines(nrow(ds):305, cumsum(mycurve(nrow(ds):305, "logistic", these.pars)) + ds[nrow(ds)-1, "rus"], lwd = 3, lty = 2, col = "black")
legend("topleft", paste0(c(csn[1:7], "Average"), " scenario"), col = c(mycols[1:7], "black"), lwd = 2, lty = c(rep(1, 7), 2), bty = "n")
mean(best.pars[, 3])
colnames(russia) <- colnames(dcases)[2:5]
row.names(russia) <- c(paste0(csn[1:7], " scenario"), paste0("Average scenario (s = ", round(mean(best.pars[, 3]), 2), ")"))
stargazer(russia, type = "html", out = "russia-final.html", summary = FALSE)


Fernández, C., Steel, M.F.J., 1998. On bayesian modeling of fat tails and skewness. Journal of the American Statistical Association 93, 359–371.

Case closed: Using assignment in R

Upon reading a very heavily opinionated article on the use of forward pipes, I could barely contain my righteous indignation—righteous as long as my hatred for the tidyverse and its dumbed-down bloated commands does not subside. That’s a topic for another post (ggplot2 with smoothing? hiding all hyper-parameters under the hood?) in the vein of ‘simple black box with one “Compute” button versus understanding and being able to implement any step of the computation’, but for now, let’s make an equally heavily opinionated counterargument for John Mount’s Win-Vector blog post.

The section that buggers me is ‘Assignment in R’. Let’s not forget that the entire point of the post is defending the right of forward assignment to exist within the context of %>%-like pipes from magrittr, which by itself is a package to be avoided that turns one’s workflow into a bloody bash pipeline which obscures dangerous intermediate type casting and is just slower. But enough of that, I deeply respect Hadley Wickham’s professionalism and aspiration to popularise R; I completely disagree with his approach of turning R into an alien-looking language, though.

Original punctuation preserved.

R‘s preferred assignment operator is “<-“. This is in the popular style guides. <...> This has some advantages, and is the public style. Also “=” is much harder to use inside R’s base::quote method than “<-“, so there are still cases where the semantics of “=” and “<-” are different (though I think they all involve the distinction trying specify argument binding versus assignment while inside a function call’s argument list).

That basically says it. In computer science, there is assignment, there is comparison, and there is argument binding. I strongly suspect that ‘=’ was kept for the purposes of assignment out of user-friendliness considerations.

I have previously written that given the choice I prefer “=” for assignment. It has the advantages that:

• “<-” is has a different meaning to many readers. In R “x<-3” assigns the value 3 to a variable named x, in other popular programming languages (where new R users may be coming from) “x<-3” denotes comparing x to -3.

The word ‘pain’ has a different meaning to an English speaker (an unpleasant physical sensation) and a French speaker (food made of flower, water, and yeast).

First of all, this is not proper coding style. One should not use scriptio continua: this practice had died out by the XIVth century. Mathematical operators should have spaces around them, and TeX, world’s best typesetting system, puts a special emphasis on how many varieties of spaces exists in mathematical typesetting: relational operators, binary operators, named operators etc. For now, let’s just use simple spaces around operators to improve legibility or at least use them to distinguish outermost and innermost operators: x + 2*y + 3*z looks beautiful and legible. x<-3 does not. x <- 3 does. x < -3 does, and has a different meaning.

• “=” is a single character, so it can not be ruined by the insertion of a space. “x< -3” does not assign the value 3 to a variable named x, it compares x to -3. I would not mind so much if “x< -3” was a syntax error (as “x< =3” is), but it is valid code that quietly does something very different than “x<-3“. If you have taught R enough you have experience helping students undo this bug.

I agree that in general, basic operations should not be ruined by spaces, but the same can be said about trigraphs in C preprocessor. Even the simplest comment symbol // breaks in almost all languages if a space is inserted in between (/ / ). Some symbol combinations are not supposed to be broken with spaces, and this is true for almost all programming languages.

• Also “=” can not be broken up by line-splitting.

John, your line-splitting algorithm is broken. Apply soft wrapping to the damaged area or submit a fix to the highlighter you are using.

• “=” is on the keyboard (as “←” was when arrow like assignments were themselves introduced).

On Linux, if one enables ‘Extra typographic characters’ in X options, they will obtain the same (AltGr + 0). [cough-cough APL cough-cough] If R supported Unicode commands, I would wholeheartedly advocate the use of a single-character assignment operator. Why isn’t John complaining about the less-or-equal-to operator <= and suggesting replacing it with ‘≤’, adding the latter onto the keyboard, though? Why not map all digraphs into equivalent single characters? 2 < = 3 throws an error as well!

• “=” is easier to paste into HTML as it does not require escape coding such as “<“.

So is ‘≤’, but somehow John is panning <- exclusively, not <=.

• It is the symbol used in most every other popular current programming language for assignment.

Most other popular current programming languages cannot do in one line of code the things R can. Applied statisticians should not hard-code matrix inversion in C++. There are different language types, different paradigms, and different built-ins. Once I suggested the shortest solution for the Code Golf question about implementing Kolmogorov—Smirnov test. The ‘every other’ argument does not apply here. Most every other popular current programming language has clunkier implementations of R’s one-liners.

• There is an asymmetric cost of mistakes. Typing “=” when you meant “<-” is usually harmless. Typing “<-” in a context where “=” was needed is not caught by R and fairly bad (please see here for details). So if you get out of the habit of using “<-” one type of bug become less likely.

Know thy tools. The cost of using an electric hammer drill on wood is not the same as the cost of using an ordinary drill on hard concrete. There are cases where one character is put by force of habit can break things. E. g., if one, being spoiled by data.table, wants to select the second column of a matrix a <- matrix(1:9, 3); a[2], they will get 2 instead of 4 5 6 yielded by a[,2]. Same goes for multi-dimensional arrays. Yet John does not complain about those kinds of mistakes.

• There is a cognitive benefit in reducing the number of low-value distinctions you need to maintain, especially for beginners. If we think of the mind as having “seven plus or minus two” slots for current information do we really want to waste 11 to 20 percent of our students’ attention on something like this when teaching? The beginner does not need to worry over the differences between value assignment and argument binding at all times. In fact it is a useful generalization to think of argument binding as a safe transient value assignment.

Appealing to outdated psychological theories is not a good argument to begin with. I am sure John and many other people knows far more than 7 programming languages and more than 7 commands in every programming language. This rule is one of those old wives’ tales that needs to die as quickly as possible (‘5-second rule’, ‘hairy palms’ etc.). Someone has already generalised programming to such Turing-complete programming languages as Вrаinfuсk and Whitespace in as few distinct commands as possible, but most people did not like it.

To sum it up, dear John, one should not hammer in nails with a microscope. You did a great distinction between <-, ==, and =, and you should not cherry-pick some digraphs and argue against their use.

Главный праздник Люксембурга — день рождения великого герцога

Национальный праздник Люксембурга

Главный праздник Люксембурга — день рождения великого герцога — приходится на 23 июня каждого года. Удивите своих друзей малоизвестными фактами об этом празднике!

Хотя 23 июня никогда не было днём рождения ни одного правителя Люксембурга, в 1961 году эту дату выбрали в качестве официальной даты рождения августейшей особы, что означает, что все последующие великие герцоги и герцогини обязаны по закону рождаться именно в этот день.

В течение десятилетий многие приезжие даже не знали, что этот праздник существует, поскольку когда они видели или слышали его название — «Groussherzogsgebuertsdag» или «Groussherzoginsgebuertsdag» (если на троне великая герцогиня) — то думали, что либо наборщик уснул за клавиатурой, либо говорящий прокашливается.

По всей стране можно найти люксембуржцев и давно ассимилировавшихся приезжих, с большим энтузиазмом размахивающих государственным флагом. Они делают это не только от переполняющей их гордости и любви к своей приёмной стране, но и для того, чтобы напомнить друг другу, как вообще выглядит их флаг.

В столице празднества начинаются в 16 часов предыдущего дня возле дворца великого герцога со смены караула — запоминающегося события, особенно для солдата, который стоял перед дворцом целый год.

Вечерние мероприятия открывает факельная процессия в центре города, что является не только отсылкой к истории, но и напоминанием о том, что хождение по городу пешком вновь стало единственным разумным способом добраться до места назначения по загруженным дорогам.

После заката запускают роскошный фейерверк, за которым можно наблюдать с моста Адольфа. Будьте готовы, что лучшие обзорные площадки будут уже забиты людьми, поэтому приходите на несколько лет раньше и приобретайте в этом районе квартиру с видом на реку Петрус.

По ночам центр города заполнен веселящимися людьми — толчея такая сильная, что исключительно в эту ночь не только дозволяется, но и поощряется разговор на улице с незнакомцами.

Решение организовать крупномасштабную вечеринку перед национальным праздником было принято не для того, чтобы дать отмечающим возможность отойти от похмелья за подлня, как предполагают некоторые источники, а для того, чтобы утром следующего дня люди, злоупотреблявшие всю ночь, были ещё достаточно пьяны, чтобы улыбаться и бурно приветствовать утренний парад.

Англоязычный источник.

Getting coefficient significance stars in LaTeX tables

In this quick entry, it will be shown how it is possible to format regression output tables in LaTeX and automatically print significance codes (in the form of stars) without any symbols placed a priori.

First of all, one should obtain estimates and standard errors from any statistical package (preferably R, but as long as homicide is considered a crime punishable by incarceration or death, and using SPSS is not, one can use whatever they want—although the author wishes it were the other way about).

Second, one should use regular expressions or any other dark wizardry in order to convert such table into a list of LaTeX control sequences taking two argument each, possible delimited by an ampersand. Here is an example how this can be done by taking raw Excel output (tab-delimited) and running any regex engine on it.


market	0.0552	0.0055	0.00988	0.00149
ret	-1.974	0.0329	-0.767	0.00993
sls	-1.31	0.36	-0.342	0.128
age	-0.0109	0.0246	-0.00381	0.00663
roa	-0.107	0.107	-0.0733	0.0281
Intercept	-0.586	0.138	-0.165	0.036

Latex table before regular expressions

Regular expression:

s/([A-Za-z]+)\t([-0-9.]+)\t([0-9.]+)\t([-0-9.]+)\t([0-9.]+)/\1 & \\coef{\2}{\3} & \\coef{\4}{\5} \\\\/g


market & \coef{0.0552}{0.0055} & \coef{0.00988}{0.00149} \\
ret & \coef{-1.974}{0.0329} & \coef{-0.767}{0.00993} \\
sls & \coef{-1.31}{0.36} & \coef{-0.342}{0.128} \\
age & \coef{-0.0109}{0.0246} & \coef{-0.00381}{0.00663} \\
roa & \coef{-0.107}{0.107} & \coef{-0.0733}{0.0281} \\
Intercept & \coef{-0.586}{0.138} & \coef{-0.165}{0.036} \\

Latex table after regular expressions

Do not be surprised that now there 3 columns, not 5. Suppose one would like to have the standard error placed beneath the coefficient. Then what?

This is the standard preamble.

\documentclass[11pt, a4paper]{article}

Now, it is time to write a command that would typeset a fixed-width box (because all entries in a column have the same width). The basic building block will have 3 arguments: width (optional), estimate (numeric), and standard error (numeric).


Now, the stars. fp package provides facilities to work with fixed-point arithmetic, which is more than enough in our case. Three significance levels—10%, 5%, and 1%—and normal distribution quantiles will be used because this was initially written for applied people in accounting. For asymptotically normal estimators, the critical values can be obtained via the R command qnorm(), e. g. qnorm(1-0.001/2), which should give 3.29 as the critical value for a two-sided t test of significance at 0.1% level.

\FPset\critsss{2.575829}% 0.995 quantile of the standard normal distribution
\FPset\critss{1.959964}% 0.975 quantile of the standard normal distribution
\FPset\crits{1.644854}% 0.95 quantile of the standard normal distribution
\FPset\coefest{#2}% The first argument is the estimate
\FPset\coefse{#3}% The second argument is the standard error
\FPdiv\zstat{\coefest}{\coefse}% This is the observed t-statistic
\FPabs\zabs{\zstat}% This is its absolute value
\FPiflt\zabs{\crits}\def\kostyrkastar{\relax}\else% Minimising the risk of overwriting an existing command

Since hard def’s are used for brevity, in order to minimise the chance that the star-drawing command overwrites any existing one, one should consider using their family name or the name of their pet as a command prefix—above is an example of precautiousness rather than hubris.

LaTeX output table coefficient significance stars based on the t statistic

Look how stars appear when the absolute value of the $t$ statistic increases.
$\beta_0$ \coef{0.00}{1.0} ($t=0$) \par
$\beta_1$ \coef{1.10}{2.2} ($t=0.5$) \par
$\beta_2$ \coef{2.00}{2.0} ($t=1$) \par
$\beta_3$ \coef{-3.00}{2.0} ($t=-1.5$) \par
$\beta_4$ \coef{5.10}{3.0} ($t=1.7$) \par
$\beta_5$ \coef{-3.80}{2.0} ($t = -1.9$) \par
$\beta_6$ \coef{4.92}{2.5} ($t = 1.968$) \par
$\beta_8$ \coef{19.60}{7.0} ($t = 2.8$) \par
$\beta_9$ \coef{-800}{8.0} ($t = -100$)

LaTeX output table coefficient significance stars in two columns

		& \null\hfill NCSKEW\hfill\null &\null\hfill DUVOL\hfill\null \\
		Market value & \coef{0.0552}{0.0055} & \coef{0.00988}{0.00149} \\
		Return & \coef{-1.974}{0.0329} & \coef{-0.767}{0.00993} \\
		Sales  & \coef{-1.310}{0.36} & \coef{-0.342}{0.128} \\
		Age & \coef{-0.0109}{0.0246} & \coef{-0.00381}{0.00663} \\
		ROA & \coef{-0.107}{0.107} & \coef{-0.0733}{0.0281} \\
		(Constant) & \coef{-0.586}{0.138} & \coef{-0.165}{0.036} \\
		Sector dummies &\null\hfill Yes\hfill\null &\null\hfill Yes\hfill\null \\
		Year dummies &\null\hfill Yes\hfill\null &\null\hfill Yes\hfill\null \\
		Observations &\null\hfill 14\,689\hfill\null &\null\hfill 14\,694\hfill\null \\
		\multicolumn{3}{l}{\footnotesize Significance codes: * $p<0.10$, ** $p<0.05$, *** $p<0.01$.} \\[-.5ex]
		\multicolumn{3}{l}{\footnotesize Cluster-robust (by country) standard errors in brackets.} \\
\caption{OLS estimates}

If one is working in particle physics and needs to use sigma levels, one can be more creative with their notation. For example, one might need more markers or vertcial placement. This can be achieved in an instant.


LaTeX output table coefficient significance stars and standard errors underneath

		& \null\hfill LHC\hfill\null & \null\hfill Sein\hfill\null & \null\hfill Pectus\hfill\null & \null\hfill Corps-nichons \hfill\null\\
		Experiment 1 & \coefnuclear{4.9908}{1.3099} & \coefnuclear{7.4037}{1.2957} & \coefnuclear{5.6249}{1.3124} & \coefnuclear{1.2275}{1.2893} \\[2ex]
		Experiment 2 & \coefnuclear{4.3236}{1.2972} & \coefnuclear{10.0386}{1.3176} & \coefnuclear{3.7301}{1.3056} & \coefnuclear{6.7275}{1.2844} \\[2ex]
		Experiment 3 & \coefnuclear{6.1690}{1.2955} & \coefnuclear{7.9955}{1.2917} & \coefnuclear{1.8410}{1.2883} & \coefnuclear{8.7442}{1.3116} \\[2ex]
		Experiment 4 & \coefnuclear{0.6228}{1.3083} & \coefnuclear{3.0899}{1.2977} & \coefnuclear{9.3049}{1.3027} & \coefnuclear{2.4693}{1.2962} \\
		\multicolumn{3}{l}{\footnotesize Significance codes: ${}^* = 1\sigma$, ${}^. = 0.5\sigma$.} \\
	\caption{Trying to uncover Higgs' bosom}

One can change the placement of \hfill and \null to change the alignment. It is, however, a very delicate matter, and good alignment differs from table to table since sometimes, the decimal separator should serve as the pivot. A good (but rather restrictive in real world) idea would be using the same number of digits for all table entries.

Another good example would be using bootstrapped percentile confidence intervals and checking whether zero is inside it or not. One can supply multiple intervals. Making that kind of output is extremely easy if one uses boot package. However, examples of such style in real scientific journals are virtually unseen—because either their typesetters have not read this entry (yet) and are blissfully unaware of this method, or this sort of acrobatics is not deemed æsthetically pleasing or practical.


\coefbootci[4cm]{0.034}{-0.01}{0.09} \par

It is possible to make any formatting aspect adjustable. The last example shows how to make coefficients significant at 5% bold.


LaTeX conditional formatting based on t statistics

Virgin insignificant coefficient with $t=1.5$: \coefbold{4.20}{2.8} \par
Chad significant coefficient with $t=2.5$: \coefbold{6.66}{2.664}

P.S. In no way is this entry published in order to promote hacking p-values or abusing the data and/or methodology in order to get a sweet succulent p-value that is less than 0.05. Researchers are encouraged to avoid such filthy terms as ‘a convincing trend towards significance’ and other ones, and honestly provide the results as is, using asymptotic refinement via bootstrap, or empirical-likelihood-based with proper calibration, or, in small samples with limited-value variables, exact values for non-parametric small-sample statistics.

18 most difficult rules in Russian language

Prepare to take a ride on an emotional roller-coaster across the most challenging aspects of Russian language. Or should I rather correct the previous sentence, ‘descent into madness’. Some of my French and Luxembourgian friends married some of my Russian friends and are currently learning Russian in order to improve communication with their spouses and show devotion. Naïve martyrs, says me; little do they know that the system of conjugation and declension has so many illogicalities and exceptions that it is easier to say that there is absolutely no system to declension and conjugation. Let us begin, shall we?

  1. In Russian, if one is narrating a story and wants to emphasise the suddenness of some action, one can substitute the 3rd person past indicative (singular or plural) with the 2nd person present imperative (singular only).
  2. This works in reverse: one can use 2nd person past perfective to denote imperative!
  3. In fact, when narrating, instead of 1st person past continuous, one can use just present (Russian does not have continuous tenses, neither it has present perfective.) However, incidentally, the present tense can also describe future events.

    Шагаю я намедни по плацу, а у меня возьми и отойди пуговица на шинели. Старшина тут же: «Быстро пришил пуговицу обратно!» А я не могу, я встречаюсь с девушкой через час.
    Verbatim translation: I am marching the other day on the drill ground, and suddenly, you, button, imperativego and imperativefly off. Sergeant: ‘You sewed this button back on!’ But I cannot waste time, I have a date in one hour!
    Proper translation: I was marching the other day on the drill ground, and suddenly, a button decided to fly off. Sergeant: ‘Hey you, imperativesew this button back on!’ But I cannot waste time, I am having a date in one hour!

  4. You know, in some languages, verbs have their own conjugation tables. Enormous tables for every single verb. Like conjugaison.com (from the creators of ‘j’aurais voulu que vous sussiez’ and ‘vous pûtes’). Some verbs do not necessarily make sense in certain forms (no one would say about themselves ‘I foal’, which means, to give birth to a colt—a baby horse). Russian has a verb for giving birth to a horse (жеребиться), but does not have its first-person form (жереблюсь? жеребюсь? ожеребляюсь?). Another example is the modification of ‘to head up’ for a wheat spike. If anyone here is a wheat spike, raise your awns. Some verbs do not have certain forms because they would just sound awful (бдеть, to keep vigil — бжу? бдю? бдею?; шерстить, to irritate someone’s skin with wool — шерщю? шерстю? шерщиваю?). So in Russian, those verb forms are substituted with ersatz expressions (сохраняю бдительность, я свитер, оттого я чешу и колюсь). Those are called insufficient verbs since they do not have certain forms. Do not be surprised if you cannot simply translate ‘I shall win’ using the future perfective form of the verb ‘to win’ because there is none (variants such as победю, побежу, побежду are all wrong)! You already lost when you started learning Russian.
  5. Conversely, some verbs have two forms (such verbs are called superfluous). Of course, with slightly different meanings; otherwise, where would be fun in that? For example, the verb ‘cast’ would have the same infinitive but different indicative forms: ‘Zeus casts lightning bolts’ (Зевс мечет молнии) and ‘the athlete throws the hammer’ (спортсмен метает молот).
  6. In Russian language, there is a special form of past tense, plus quam perfectum, or pluperfect (which, of course, is not listed in conjugation tables in textbooks dictionaries!), that is analogous to the English ‘used to / would’ or French ‘avait fait’. However, it is not past perfect in the traditional sense; it denotes a continuous action for which the speaker would like to evoke memories of the distant past. The simple past tense of the verb ходить (to walk) is ходил, but the pluperfect form is хаживал. In it is more or less analogous to the English ‘would’: ‘Back in the days of yore, my father would visit royal dinner parties and would try precious wine from Duke’s personal reserves. Oh, the yesteryears! Poor, poor father, where is he now! Those corks on the shelf from the bottles of wine he would savour are the last vestiges of the long-faded grandeur!’

    Говорить — говаривал, петь — певал, сидеть — сиживал, есть — едал, ездать — езживал, гостить — гащивал, обедать — обедывал, драть — дирал, двигать — двигивал, махать — махивал, жить — живывал, гонять — ганивал, спать — сып́ал, кутить — кучивал, бегать — бегивал.

  7. Also, one cannot just say, ‘there is a glass on the table’. Things on top of other things must either stand, lie, or sit, and there is no rule, logic or mnemonic to determine which verb to use.

    There is a table. There is a glass on the table. It is standing (стакан стоит). There is a fork on the table. It is lying (вилка лежит). However, if we stick the fork into the table, it will be standing (or sticking out; будучи воткнутой в стол, вилка стоит вертикально). Maybe it is the horizontal / vertical orientation that matters? Let’s add a plate and a pan. They are horizontally oriented, but they are still standing (на столе стоят тарелка и сковородка). Now let’s place the plate inside the pan. Now it is lying in the pan (but it had been standing on the table; тарелка лежит в сковородке). The readiness of an object for use is not an indicator: the fork is lying. Not a cat jumps on the table. The cat can stand, sit or lie (кошка стоит, сидит или лежит). It is trivial to connect lying / standing with the cat’s orientation, but sitting? It is a new property. The cat’s bottom is instrumental in sitting (кошка сидит). Now a bird lands on the table. It is sitting as well (птица сидит). However, its legs are fully supporting its body, and in no way the bird is using its botty department. The bird, however, is never standing. Now we go berserk, kill the bird, and a taxidermist (probably Chuck Testa) stuffs it and makes a completely life-like bird—and now it is standing (чучело птицы стоит)! If you think that only animate object can sit, you are wrong again: a shoe sits on a foot, a dress sits on a person, a car sits in snow, a mosquito sits in amber (ботинок сидит на ноге, платье сидит на человеке, машина сидит в сугробе, комар сидит в янтаре). These objects are inanimate, do not have buttocks, but still, they are sitting! Now go figure out which objects sit, which stand, and which lie.

    The Staffordshire department of the Royal Society of Putting Things on Top of Other Things must have accepted too many Russian members!

  8. In Russian, plural is denoted in three possible ways: singular nominative, singular genitive, and... well, plural. It goes like this:
    One cow, two ofcow, three ofcow, four ofcow, five cows, six cows... eleven cows, fourteen cows, and so it goes on until twenty cows. Then... you have twenty-one cow. Because it ends with one! Then, twenty-two... you guessed right... ofcow! Twenty-four ofcow, twenty-five cows, and so it goes on. Everyone remembers Disney’s ‘101 dalmatian’, right?

  9. A small but nasty one: while the words for 20, 30, 50, 60, 70, 80 and 90 are based on 2, 3, 5, 6, 7, 8 and 9 respectively (like two-ten, three-ten etc.), the word for 40 (сорок) does not have anything related to four. Instead, it is related to.... packed bundles of sable fur, why not! It is not the word forty that Russians used to describe the number of sable pelts in a bundle sufficient to make one coat; it is literally the word ‘cloth to wrap a bundle of sable pelts’ that Russians use for forty! The closest relative of this atrocity is the word сорочка (a shirt made of cloth), which is widely used as a synonym for рубашка.

    Russia Reversal meme about numeral for forty

    Bulgarians did not have sables or martens, so in Bulgarian language (which is re-e-eally close to Russian), they use Russian numerals two-ten, three-ten, four-ten, five-ten etc. No animals harmed.

  10. Worse than that, with compound numerals (like three thousand five hundred and seventy-four), one has to decline every word of the numeral. Sometimes, if a numeral is one word made of two parts (like ‘four-hundred’), you have to decline both. Or maybe not. Of course, if the numeral in question is fourteen, you do not decline the four, because screw fourteen! And two-ten, three-ten and four-ten as well. So it will look like this:

    Я гордый владелец трёх тысяч пятисот семидесяти четырёх коров.
    Я гордый владелец трёх тысяч трёхсот тридцати четырёх коров.
    Я гордый владелец трёх тысяч семисот четырнадцати коров.

    I am a proud owner ofthree ofthousand offiveofhundred ofsevenoften offour ofcows.
    But! I am a proud owner ofthree ofthousand ofthreeofhundred threeoften ofthree ofcows.
    But! I am a proud owner ofthree ofthousand ofsevenofhundred fourofteen ofcows.

    Oh, and did I tell you that you have four more cases to remember for these numerals?

  11. To make things worse, numerals ending with 1 or 2 can have genders in the nominative case. But only masculine or feminine. No neuter.

    Два чайника, две чашки, два пальто.

  12. Speaking of genders, verbs have genders in the past, but not in the present or future.

    Металл кипел, сталь кипела, железо кипело, жидкости кипели.

  13. Russian has different declension rules for animate and inanimate objects. However, a stiff is animate, while a corpse is not. Why? Well, only a human being can be a stiff, but corpses can also be of animal origins... so it makes dead animals less animate than dead humans? I hypothesise that this dates back to Russian folk tales in which stiffs came to life after being sprinkled with the Water of Life™ (they also featured the Water of Death™ that was used to join together the pieces of a fallen hero). However, the more logical reasoning is the following: the word мертвец (stiff) has a tinge of ‘freshness’ (just died), maybe clinical death, someone who had just died, but the word труп (corpse) denotes something that has been dead for a while. One does not exhumate stiffs; only corpses. In official language (forensic reports) one might use the collocation свежий труп (fresh corpse) to indicate that the body has not begun to rot yet.

    Я вижу на земле мертвеца. (Accusative taking the form of genitive, as with animate objects.)
    Я вижу на земле труп. (Accusative taking the form of nominative, as with inanimate objects.)

  14. You thought it was over, huh? Cue fractional numerals! You all know that 8 metres should always be ‘eight metres’, no silly gobbledegook like ‘three ofcow’? Well... if it is 8.6 metres, it has to be spelt ‘eight and six tenths... ofmetre’, because it is the bloody tenths that govern the declension ‘of metre’! However, if one means ‘8½ metres’, it becomes ‘eight and a half metres’. Did I say that while ‘my pie’ is singular, ‘my half a pie’ is plural? Like ‘ton gâteau’ but ‘tes demi-gâteaux’!
  15. Speaking of numerals. Most languages have cardinal numerals and ordinal numerals. However, Russian also has collective animate masculine numerals that sometimes describe groups of men, occupations (but possibly females), and uncountable singular nouns! One can’t say ‘two ofman’. One must say ‘two-group men’. Then, ‘three-group men’, up to ‘nine-group men’, although one can use the simple cardinal plural starting from 5. And one must not use collective numerals for groups larger than 10.
  16. And some countable nouns have absolutely mental plural forms: дно — донья (bottom; almost like doña), брелок — брелоки (keychain, but чулок — чулки, stocking), шило — шилья (awl), профессор — профессора (professor). It gets better with genitive: кочерга — кочерёг (hot poker), кровля — кровель (roofing), усадьба — усадеб (manor), вафля — вафель (waffle), цапля — цапель (heron), кегля — кегель (bowling pin), but кегль — кеглей (typographic unit, from German Kegel).
  17. If you thought it does not get worse, brace yourselves. Look at the word ‘human’. Look at it closely? Do you think it has two forms of plural, cardinal and collective? Well, here you are wrong again: three plural forms! It has collective plural (nobody matters), tally plural (each one matters), and archaic plural (for greater grandiloquence). It is like distinguishing between ‘many people’, ‘a number of individuals’ and ‘humanity’. (One could parry that kind of attack on sanity with an over-the-top number of collective nouns in the post-Norman tongue...)

We are approaching the culmination, or, as Kaikosru Sorabji would put it, ‘the crux’. Combining together muddy numerals and incoherent declension, one gets a new frankenrule:
If the numeral ends with 1, 2, 3 or 4, one uses singular declension rules to denote multiplicity. However, in cases other than nominative, one uses plural for those one ending with 2, 3 or 4. If one thinks of those objects as of a large group performing an action, instead of the plural form of the verb, one can use the neuter singular! However, if the numeral ends with 1, one must use the singular form of the verb of the appropriate gender!

Declension of numerals in Russian

I did not make this up. This is true. Denoting accusative with den, as in German, we get the example to illustrate it:

Сто один далматинец переходил дорогу. Грузовик сбил сто одного далматинца.
Сто два далматинца переходило / переходили дорогу. Грузовик сбил сто двух далматинцев.
Сто пять далматинцев переходило / переходили дорогу. Грузовик сбил сто пять / пятерых далматинцев.

One hundred and one dalmatian was crossing the highway. The lorry ran over hundred and denone dendalmatian.
One hundred and two ofdalmatian was/were crossing the highway. The lorry ran over denhundred and dentwo dendalmatians.
One hundred and five dalmatians was/were crossing the highway. The lorry ran over denhundred and denfive / denfive-group dendalmatians.
(Here, one cannot say ‘сто двоих’, hundred and two-group, because it would mean, ‘hundred and both’!)

I have no words. The only thing I have is an back-of-the-envelope estimate: if kids start learning Russian at the age of 0 and by the age of 18, having graduated from high school, are aware of all these pitfalls, then it would take 18 years of intense learning (like at school, multiple hours per day of reading textbooks in Russian, watching TV, communicating with native speakers etc.) to master Russian. If you are a non-Russian reading this, the only way to comprehend Russian grammar is marrying a Russian and learning the language with your kid. By the time they reach the age of majority, you will be able to speak as fluently as they do.

How to set up a Linux computer and printers at the University of Luxembourg

NB: This is not the official guide. This is a summary of my Linux adventures at the University of Luxembourg. I am not affiliated with the IT department of Uni.LU. For actual help, please create tickets in the service portal; I am not offering advice!

First of all, I would like to thank the IT department of the University of Luxembourg for not upgrading to Windows 10. I lack words with which to express my vitriol and execration for Windows 8, but this is a topic for another post. However, not fully happy with Windows 7, I decided to create a dual-boot system (Windows 7 / Linux). The only thing remaining was partitioning the hard drive. In order to shrink the volume, Windows obliged me to do a Disk Check using CHKDSK. And this is where it went downhill. After a reboot, CHKDISK spent the entire night restoring chunks of god-knows-what, then rebooted the computer, started the Start-Up Recovery (who was asking for those options anyway?!), tried to fix the boot loader, froze and... broke the Windows boot loader. Enough is enough.

Margaret Thatcher Enough Is Enough

First of all, I made sure that Linux has all the necessary built-in drivers for Ethernet and the Internet connection is fine. For that purpose I downloaded Linux Mint 19 64 bit with Cinnamon, wrote it onto a USB stick on my Mint 19 laptop (with USB Image Writer, mintstick), and booted from the USB, and verified that the Internet does not require any extra configuration (VPN, PPPoE etc.) and works out of the box. I made a backup of all my documents, asked the IT department to flush the computer completely, and then, shrunk the C:\ drive by 200 GB (100 GB for /, 96 GB for /home, 4 GB for swap).

I decided to go with Arch Linux because I did not want any unnecessary software or dæmons running in the background. However, I cheated a bit and installed Linux Mint first (partitioning the hard drive with the graphical tool in Mint installer), and then, used /dev/sda5 and /dev/sda6 as / and /home respectively in Arch Linux installer. The only unexpected thing was the necessity to launch dhcpcd manually. After that, I got the core packages, installed Xorg, LightDM, and then, Cinnamon.

Linux printing on Uni.Lu computers

Since the University of Luxembourg is using Windows print servers (damn!), one must use CUPS and SAMBA in order to send jobs. They recommend installing system-config-printer as the GUI and then going through the following steps:

  1. System settingsPrinters or Print SettingsAddNetwork PrinterWindows Printer via SAMBA.
  2. SMB Printer: smb://PRINTSERVER.uni.lux/PRINTERNAME.
    Example: smb://woodstock.uni.lux/CL-ENTE
  3. Authentication details (do not press Verify):
    Username: uni.lux\FIRSTNAME.LASTNAME
    Password: your UNI.LU password
  4. Press Forward, select the printer model* and find the generic PCL6 driver (recommended in general) and press Apply.
  5. Do not print a test page since the proper paper size is still not set.
  6. Right-click on the printer, select Properties, go to Printer options and select A4 as the paper size.

* If the printer model is not in the list of available models, make sure that the driver is installed. If the maker is Canon, you can find Canon DEB/RPM drivers on the official Canon support website; Arch Linux has cndrvcups-lb in AUR, and for a other Linux distributions, you need to get the PPD file for the printer model you are going to install. Many common printer models are supported by the Gutenprint driver collection (cupsys-driver-gutenprint in Debian, gutenprint in Arch).

If, when printing for the first time, the user is prompted for their name, the ID must be in the form of uni.lux\firstname.lastname.

If you are located in Belval, JFK or Limpertsberg, the print server is woodstock. If you are located in Kirchberg or Weicker, the print server is baloo.

Remark: you can have access to Canon printers for all sites located mainly in the corridors using the print server ares. In this case, you can choose: smb://ares.uni.lux/WIN_CANON_BLACK, smb://ares.uni.lux/WIN_CANON_COLOR, or smb://ares.uni.lux/WIN_CANON_DUPLEX. Then, on the Canon printer itself, log on and select the file to print.

If you do not have the system-config-printer package installed in your Linux distribution, make sure that the packages cups and samba iares installed and the CUPS and SAMBA dæmons are running (most likely to be named cups and smbd in Debian-based distros, org.cups.cupsd.service and smb.service in Arch-based etc.). Then, you can add the printer via the CUPS local web interface (the true Linux way!):

  1. Go to
  2. Click Add Printer and, if prompted, enter root as the username and your root password. (If you input the wrong credentials and get the Forbidden error, just reboot your computer.)
  3. Pick Other Network PrintersWindows Printer via SAMBA.
  4. In the Connection field, enter smb://uni.lux\FIRSTNAME.LASTNAME:PASSWORD@PRINTSERVER.uni.lux/PRINTERNAME.
    Example: smb://uni.lux\rupert.hastings:password123@woodstock.uni.lux/CL-ENTE
  5. Optionally, give the printer a user-friendly name and a description.
  6. Select the manufacturer and the model. (If the printer is not in the list, see the footnote above.)
  7. If extra accessories are installed on the printer (puncher, extra paper trays etc.), select them in the next step.

Email client

The University of Luxembourg is using IMAP for incoming mail and SMTP for outcoming. Here is a list of parameters you have to supply to your client in order to be able to receive and send messages. So far, I managed to configure Mozilla Thunderbird, Mutt, and NeoMutt.


  • IMAP server: imaps://imap4-3.uni.lu:993
  • Connection security: SSL/TLS
  • Default folder: INBOX


  • SMTP user: same as IMAP user
  • SMTP password: same as IMAP password
  • SMTP server: smtp-3.uni.lu:587
  • SSL force TLS: yes

Server space on Atlas

  1. Make sure your workgroup is WORKGROUP (if necessary, put workgroup = WORKGROUP in /etc/samba/smb.conf). Next, make sure that gvfs-backends (Debian) gvfs-smb (Arch) is installed.
  2. Open your favourite file browser (Nautilus, Nemo etc.) and go to the following address in the address bar: smb://atlas/users/FIRSTNAME.LASTNAME.
  3. Example: smb://atlas/users/rupert.hastings.
  4. When prompted, enter your credentials. Username: FIRSTNAME.LASTNAME. Domain: uni.lux. Password: your UNI.LU password.

To be perfectly honest, I have not been able to open any folder other than the root (atlas)!

Finally, after looking at the Quick Start Guide provided by the IT staff, I would suggest ignoring some items, like:

  • Microsoft Office 365 and Dreamspark. Who needs it? Just use LibreOffice. Do not use MS Word. Use LibreOffice Writer for short documents and LaTeX for long ones.
  • Eduroam configuration. If you have a wired connection, just use it. Otherwise, go to https://cat.eduroam.org and download the necessary SH script.
  • VPN. https://vpn.uni.lu.

Stochastic portfolio optimisation and out-of-sample testing using R

Author: Andreï Victorovitch Kostyrka, University of Luxembourg.

Last updated: 2019-05-31.

One might ask a question, “What is it like to try to pick an optimal portfolio and invest in stocks?” Well, here I have prepared some R code that does the following thing:

  • Downloads stock data from Yahoo Finance from 2014 for a selection of stocks;
  • Puts all stocks together across the maximum set of available dates and nicely imputes missing values using Kalman filter with structural modelling;
  • Tries all possible combinations or portfolios (as many as possible, using Halton pseudo-random low-discrepancy sequences, doing many iterations and saving the best result from every iteration to save memory);
  • Creates amazing colourful plots for subsequent analysis;
  • Returns statistics for equally weighted portfolios, max-return under equally weighted risk, min-risk under equally weighted return, and maximum Sharpe ratio (for simplicity, the ratio of returns to risk, assuming there is no risk-free asset);
  • Tests these 4 portfolios out of sample for 1 year (“if one started trading one year ago, what would one have obtained”).

Two sets of stocks were chosen: most lucrative U.S. companies and most lucrative Russian companies. This is what one gets if one chooses 13 top U.S. companies:

library(imputeTS) # For imputation
library(randtoolbox) # For Sobel sampling
# http://fortune.com/2017/06/07/fortune-500-companies-profit-apple-berkshire-hathaway/
# https://www.swiftpassportservices.com/blog/profitable-companies-russia/
syms <- c("GOOG", "AAPL", "JPM", "BRK-A", "WFC", "MCD", "BAC", "MSFT", "AMZN", "JNJ", "C", "MO", "GE")
symnames <- c("Google", "Apple", "J.P. Morgan", "Berkshire Hathaway", "Wells Fargo", "McDonald's",
              "Bank of America", "Microsoft", "Amazon", "Johnson and Johnson", "Citigroup", "Altria group", "General Electric")
cat("Calculating optimal portfolio for US stocks\n")
# syms <- c("SGTZY", "ROSN.ME", "OGZPY", "LUKOY", "SBRCY", "TRNFP.ME", "NILSY", "ALRS.ME", "YNDX")
# symnames <- c("SurgutNefteGas", "RosNeft", "GazProm", "LukOil", "SberBank", "TransNeft",
#               "Norilsk Nickel", "Alrosa", "Yandex")
# cat("Calculating optimal portfolio for Russian stocks\n")
first.date <- as.Date("2014-01-01")
last.date <- as.Date("2018-08-01")
end.ins <- as.Date("2017-08-01")
dat <- BatchGetSymbols(tickers = syms, first.date = first.date, last.date = last.date)
allt <- dat$df.tickers
allt <- allt[!is.na(allt$price.adjusted), ]
dates <- sort(unique(allt$ref.date))
prices <- data.frame(dates)
prices[, 2:(length(syms)+1)] <- NA
colnames(prices)[2:(length(syms)+1)] <- syms
for (i in syms) {
  datesi <- allt$ref.date[as.character(allt$ticker)==i]
  pricesi <- allt$price.adjusted[as.character(allt$ticker)==i]
  prices[match(datesi, dates), i] <- pricesi
  prices[, i] <- zoo(prices[, i], order.by = prices$dates)
  prices[, i] <- na.kalman(prices[, i])
prices2 <- prices
prices[,1] <- NULL
co <- rainbow(length(syms), end=0.75)
png("1-relative-prices.png", 700, 400, type="cairo")
plot(prices[,1]/mean(prices[,1]), col=co[1], lwd=1.5, main="Prices divided by their in-sample mean", xlab="Time", ylab="Ratio")
for (i in 2:length(syms)) {
  lines(prices[,i]/mean(prices[,i]), col=co[i])
legend("topleft", syms, col=co, lwd=1)
rs <- data.frame(rep(NA, nrow(prices)-1))
for (i in syms) {
  rs[, i] <- diff(log(prices[, i]))
rs[,1] <- NULL
w0 <- rep(1/length(syms), length(syms))
names(w0) <- syms
rmean <- zoo(as.numeric(as.matrix(rs) %*% w0), order.by = time(rs[,1]))
a <- table(as.numeric(format(time(rmean), "%Y")))
daysperyear <- mean(a[1:(length(a)-1)]) # Because the last year is incomplete
rsfull <- rs # Full sample
rs <- rs[time(rs[,1]) <= end.ins, ] # Training sample
rsnew <- rsfull[time(rsfull[,1]) > end.ins, ] # Test sample
meanrs <- colMeans(rs)
covrs <- cov(rs)
musigma <- function(w) {
  mu <- sum(meanrs * w)*daysperyear
  s <- sqrt((t(w) %*% covrs %*% w) * daysperyear)
  return(c(mu=mu, sigma=s, ratio=mu/s))
m0 <- musigma(w0)
num.workers <- if (.Platform$OS.type=="windows") 1 else detectCores()
MC <- 108000
iters <- 100
cc <- 150 # How many colours
cols <- rainbow(cc, end = 0.75)
sharpecuts <- seq(-0.1, 1.5, length.out = cc)
png("2-portfolios.png", 700, 600, type="cairo")
# First iteration
cat("Calculating", MC, "portfolios for the first time.\n")
ws <- halton(MC, length(syms), init=TRUE)
ws <- ws/rowSums(ws)
colnames(ws) <- syms
res <- mclapply(1:MC, function(i) musigma(ws[i, ]), mc.cores = num.workers)
res <- as.data.frame(matrix(unlist(res, use.names = FALSE), nrow=MC, byrow = TRUE))
colnames(res) <- c("mean", "sd", "ratio")
prcmu0 <- ecdf(res$mean)(m0[1]) # Which quantile is the mean
prcsd0 <- ecdf(res$sd)(m0[2]) # Which quantile is the sd
qmplus <- quantile(res$mean, prcmu0+0.005)
qmminus <- quantile(res$mean, prcmu0-0.005)
qsplus <- quantile(res$sd, prcsd0+0.005)
qsminus <- quantile(res$sd, prcsd0-0.005)
meanvicinity <- which(res$mean < qmplus & res$mean > qmminus)
sdvicinity <- which(res$sd < qsplus & res$sd > qsminus)
plot(res[,2:1], pch=16, col=cols[as.numeric(cut(res$ratio, breaks = sharpecuts))],
     xlim=range(res$sd)+c(-0.03, 0.03), ylim=pmax(range(res$mean)+c(-0.03, 0.03), c(0, -10)),
     main="Risk and return for various portfolios", xlab="Annualised volatility", ylab="Annualised log-return")
sharpes <- mus <- sds <- array(NA, dim=c(iters+1, length(syms)))
sharpes[1, ] <- ws[which.max(res$ratio), ]
mus[1, ] <- ws[which(res$mean == max(res$mean[sdvicinity])), ]
sds[1, ] <- ws[which(res$sd == min(res$sd[meanvicinity])), ]
pairsm <- pairss <- array(NA, dim=c(iters+1, 3))
pairsm[1, ] <- res$mean[c(which.max(res$ratio), which(res$mean == max(res$mean[sdvicinity])), which(res$sd == min(res$sd[meanvicinity])))]
pairss[1, ] <- res$sd[c(which.max(res$ratio), which(res$mean == max(res$mean[sdvicinity])), which(res$sd == min(res$sd[meanvicinity])))]
for (i in 1:iters) {
  cat("Calculating", MC, "portfolios, iteration", i, "out of", iters, "\n")
  # ws <- sobol(MC, length(syms), scrambling = 1, init=FALSE) # There is a bug in randtoolbox: it starts returning numbers outside [0, 1]
  ws <- halton(MC, length(syms), init=FALSE)
  ws <- ws/rowSums(ws)
  colnames(ws) <- syms
  res <- mclapply(1:MC, function(i) musigma(ws[i, ]), mc.cores = num.workers)
  res <- as.data.frame(matrix(unlist(res, use.names = FALSE), nrow=MC, byrow = TRUE))
  colnames(res) <- c("mean", "sd", "ratio")
  meanvicinity <- which(res$mean < qmplus & res$mean > qmminus)
  sdvicinity <- which(res$sd < qsplus & res$sd > qsminus)
  maxsh <- which.max(res$ratio)
  maxm <- which(res$mean == max(res$mean[sdvicinity]))
  minsd <- which(res$sd == min(res$sd[meanvicinity]))
  sharpes[i+1, ] <- ws[maxsh, ]
  mus[i+1, ] <- ws[maxm, ]
  sds[i+1, ] <- ws[minsd, ]
  pairsm[i+1, ] <- res$mean[c(maxsh, maxm, minsd)]
  pairss[i+1, ] <- res$sd[c(maxsh, maxm, minsd)]
  points(res[,2:1], pch=16, col=cols[as.numeric(cut(res$ratio, breaks = sharpecuts))])
  rm(res); gc()
points(as.numeric(pairss), as.numeric(pairsm))
sharpems <- apply(sharpes, 1, musigma)
wmaxsharpe <- sharpes[which.max(sharpems[3,]), ]
meanms <- apply(mus, 1, musigma)
wmaxmean <- mus[which.max(meanms[1,]), ] # Maximum returns while SD is the same
sdms <- apply(sds, 1, musigma)
wminsd <- sds[which.min(sdms[2,]), ] # Minimum SD while mean is the same
m1 <- musigma(wmaxsharpe)
m2 <- musigma(wmaxmean)
m3 <- musigma(wminsd)
finalstats <- cbind(rbind(w0, wmaxsharpe, wmaxmean, wminsd), rbind(m0, m1, m2, m3))
rownames(finalstats) <- c("Equal", "MaxSharpe", "MaxReturn", "MinRisk")
colnames(finalstats)[(length(syms)+1):(length(syms)+3)] <- c("RInS", "VInS", "SRInS")
points(m0[2], m0[1], cex=3, pch=16)
points(c(m1[2], m2[2], m3[2]), c(m1[1], m2[1], m3[1]), cex=2, pch=16)
# And now checking these revenues on the test sample
r0 <- zoo(as.numeric(as.matrix(rsnew) %*% w0), order.by = time(rsnew[,1]))
r1 <- zoo(as.numeric(as.matrix(rsnew) %*% wmaxsharpe), order.by = time(rsnew[,1]))
r2 <- zoo(as.numeric(as.matrix(rsnew) %*% wmaxmean), order.by = time(rsnew[,1]))
r3 <- zoo(as.numeric(as.matrix(rsnew) %*% wminsd), order.by = time(rsnew[,1]))
png("3-test-sample-returns.png", 700, 400, type="cairo")
plot(r0, main="Out-of-sample returns of weighted portfolios", ylab="Return", xlab="Time")
lines(r1, col=2)
lines(r2, col=3)
lines(r3, col=4)
legend("topright", c("Equal", "Sharpe", "Max. return", "Min. volatility"), col=1:4, lwd=1)
png("4-test-sample.png", 700, 400, type="cairo")
plot(exp(cumsum(r0)), ylim=c(0.6, 1.4), main="Portfolio growth over the last year in the test sample", ylab="Relative growth", xlab="Time")
lines(exp(cumsum(r1)), col=2)
lines(exp(cumsum(r2)), col=3)
lines(exp(cumsum(r3)), col=4)
legend("bottomleft", c("Equal", "Sharpe", "Max. return", "Min. volatility"), col=1:4, lwd=1)
finals <- cbind(r0, r1, r2, r3)
meanf <- colMeans(finals)*daysperyear
sdf <- apply(finals, 2, sd)*sqrt(daysperyear)
finalsoos <- cbind(meanf, sdf, meanf/sdf)
colnames(finalsoos) <- c("ROOS", "VOOS", "SRROS")
finalstats <- cbind(finalstats, finalsoos)
colnames(finalstats)[1:length(syms)] <- symnames
stargazer(t(finalstats), summary = FALSE, type="html", out="a.html")
write.csv(prices2, "stocks.csv", row.names = FALSE)

This is how the 13 U.S. stocks behaved.

Relative prices of top U.S. stocks

This is how 10 million random portfolios behaved (black dots indicate 100 best portfolios under the three criteria; big black dot for the equally weighted portfolio).

Return and risk of various U.S. portfolios

These are portfolio weights for U.S. companies with their performance.

Equal MaxSharpe MaxReturn MinRisk
Google 0.077 0.011 0.004 0.052
Apple 0.077 0.132 0.128 0.105
J.P. Morgan 0.077 0.133 0.099 0.028
Berkshire Hathaway 0.077 0.019 0.040 0.170
Wells Fargo 0.077 0.003 0.008 0.025
McDonald's 0.077 0.164 0.148 0.192
Bank of America 0.077 0.030 0.034 0.036
Microsoft 0.077 0.078 0.180 0.022
Amazon 0.077 0.088 0.161 0.005
Johnson & Johnson 0.077 0.138 0.064 0.138
Citigroup 0.077 0.003 0.003 0.009
Altria group 0.077 0.198 0.131 0.142
General Electric 0.077 0.003 0.001 0.078
Return in-sample 0.145 0.179 0.190 0.145
Risk in-sample 0.142 0.128 0.141 0.119
Sharpe ratio in-sample 1.023 1.395 1.345 1.217
Return in 2017.07–2018.06 0.135 0.163 0.241 0.053
Risk in 2017.07–2018.06 0.144 0.134 0.146 0.130
Sharpe ratio in 2017.07–2018.06 0.941 1.218 1.657 0.404

This is how these four U.S. portfolios look out of sample.

Returns for different optimal U.S. portfolios out-of-sample

If one had invested in these four portfolios a year before, this is what one would have harvested by now.

Cumulative growth of one's investments into U.S. stocks

And this is the result for top 9 Russian companies.

This is how the 9 Russian stocks behaved.

Relative prices of top Russian stocks

This is how 10 million random portfolios behaved (black dots indicate 100 best portfolios under the three criteria; big black dot for the equally weighted portfolio).

Return and risk of various Russian portfolios

Note that some portfolios can actually yield negative returns, and the volatility is generally higher!

These are portfolio weights for Russian companies with their performance.

Equal MaxSharpe MaxReturn MinRisk
SurgutNefteGas 0.111 0.005 0.009 0.175
RosNeft 0.111 0.162 0.031 0.269
GazProm 0.111 0.012 0.068 0.067
LukOil 0.111 0.018 0.083 0.037
SberBank 0.111 0.038 0.031 0.014
TransNeft 0.111 0.248 0.016 0.213
Norilsk Nickel 0.111 0.079 0.381 0.044
Alrosa 0.111 0.413 0.358 0.049
Yandex 0.111 0.026 0.022 0.131
Return in-sample 0.055 0.227 0.172 0.055
Risk in-sample 0.245 0.218 0.245 0.218
Sharpe ratio in-sample 0.223 1.042 0.701 0.250
Return in 2017.07–2018.06 0.202 0.164 0.223 0.164
Risk in 2017.07–2018.06 0.179 0.170 0.215 0.153
Sharpe ratio in 2017.07–2018.06 1.129 0.965 1.039 1.067

This is how these four Russian portfolios look out of sample.

Returns for different optimal Russian portfolios out-of-sample

If one had invested in these four portfolios a year before, this is what one would have harvested by now.

Cumulative growth of one's investments into Russian stocks

So actually if one used this investing strategy based on the information available in mid-2017, by now, depending on the criterion (Sharpe ratio or maximum profitability) they would have earned 14–22% on Russian stocks or 15–17% on American stocks!

[2019-05-31] Update: for high-dimensional problems (hundreds of stocks), many more MC trials are required! One might also want to change the Halton sequence to a random uniform generator (like ws <- matrix(runif(length(syms)*MC), ncol = length(syms))) or a Latin hypercube sampler (lhs).

TODO: constrained optimisation using the best stochastic solution as a starting point.


Why are you still here? Why are you not trading yet?

Using Twitter to model voters’ attitude towards presidential candidates: a case for Russia in 2018

NB. Any opinions that might be critical of the political system of Russia are justified by the fact that the author is a citizen of Russia himself, and retains the right to give opinions, or commentary, or judgement.


Twitter has been known to serve as a news publication machine that drives the attention of mass media and shapes the preferences of voters; a good example of this would be the presidential elections in the USA in 2016 (Kapko, 2016). Presidential candidates were using this medium to communicate to voters directly and express their opinions outside presidential debates. It has been speculated that predictions made based on Twitter data can perform better than those obtained by robocalls or by examining the distribution of lawn signs.

Sometimes, as in the case with Donald Trump, using free-wheeling speech style can be really beneficial because it reassures voters that their candidates speaks what (s)he thinks. Candidates' tweets are news per se, and news agencies can easily get the low-hanging fruit by just re-publishing them and giving their own interpretation. Besides that, Twitter provides an opportunity to engage into a direct dialogue between the candidates and voters via the feature of directed tweets. Celebrities and influential people often respond directly to the candidates, which drives the evolution of voters' views. It has been noted that often it is clear who is writing the tweet: the candidate him/herself or his/her staffers.

As good as it may sound, there are obvious drawbacks to Twitter. First and foremost, it seems almost impossible to fit a meaningful argument within the limit of 140 (now 280) characters. As a result, it causes oversimplification of discourse. Another point is debatable: as soon as something is posted on Twitter, it is impossible to fully withdraw the tweet if someone else has noticed it. This can be both a drawback and a strong point: as soon as lewd material is posted, the poster bears full responsibility for it, and experiences all potential backlash.

Twitter as a predictor for election results has been rigorously analysed by researchers in the past. In fact, such analyses were applied to the 2010 Australian elections (Burgess and Bruns, 2012), 2010 Swedish elections (Larsson and Moe, 2012), 2011 Dutch elections (Sang and Bos, 2012), 2012 USA elections (Wang et al., 2012), 2015 Venezuelan elections (Yang, Macdonald and Ounis, 2017), 2016 USA elections (Bessi and Ferrara, 2016) etc. Some of them are using convolutional neural networks, some of them are applying clever tools to avoid misclassification, but some of them rely on a simple count of words from positive and negative dictionaries. We are going to adopt this simple approach in an attempt to analyse Russian data and evaluate the balance of powers (or lack thereof) in Russia.

In this work, I am using big data (70 000 tweets in total) and big data processing tools (Google Translate API) to investigate the relative popularity of candidates in the Russian presidential election of 2018. The methodology will be similar to that of Bhatia (2015).


In this work, I develop a simple method to analyse the sentiment towards election candidates in arbitrary languages using Google Translate. It should be noted that as the moment of writing, Google is offering a free $300 trial for its Cloud API, so the author made use of his free access to the machine translation services. The workflow is as follows:

  1. Using R and twitteR package, obtain a large number (10 000) tweets mentioning the candidate;
  2. Break the tweets into small batches (100 tweets per batch separated with triple paragraphs) and write scripts for translation;
  3. Using Google Cloud Translation API, run the translation scripts and obtain the server response as a JSON object;
  4. Input the JSON object into R, break it into separate tweets again and clean up the text;
  5. Count the positive and negative words in those tweets using dictionaries and analyse the distributions.

Obviously, the main drawback of the simple dictionary match is the lack of context. There are examples of why this might return wrong results. Example 1: “Mr. X is going to put an end to corruption!” This author has a positive attitude towards Mr. X, but due to the presence of a negative word (“corruption”), is going to be counted as negative. Example 2: “@newsagency Yes, of course, Mr. Y is going to improve the situation, yeah, like, thousand times!” This is an example of irony; the attitude of this tweet is negative and sceptical, but the simple count will find a positive match (“improve”) and count this tweet as positive. Example 3: “Mr. Z is not honest!”

To counter these three points, an argument should be made. The error from the first category does not change much in terms of emotional attitude; if candidate X is viewed as a corruption fighter, the writer implies that the situation in the country is grave (hence the need to fight), and this negative score should be counted towards the general level of negativity, not just the negativity associated with candidate X. If candidates Y and Z also claim they are going to fight corruption, then all these negative votes will contribute to a fixed proportion of false negatives (say, n%), and the ratio of negative to positive votes should not change. This implies that ratios, not absolute values should be examined. The error from the second category is so semantically complicated that is it improbable that any advanced methods (including machine learning and neural networks) based on pure text analysis are going to detect irony and correctly interpret it in a negative sense. However, the proportion of such tweets is thought to be low, and an advanced user-based analysis is suggested in this case: if a user is generally opposing candidate Y (or is pessimistic in general), then any tweets with positive score about candidate Y should be re-weighted or discarded as “implausible”. The error from the third category, however, can be remedied with the use of multi-layered neural network or contextual analysis algorithms with look-ahead and look-behind: if the word from the “positive” list is preceded by “not”, “no”, “barely” or followed by “my a**” or other similar modifiers, the match should be counted for the “negative” list, and vice versa.


Searching for tweets

Assumption 1. During the period from 2018-01-01 to 2018-03-18, any tweet containing the surname of a candidate is meant to be about the candidate, not his namesake.

This assumption allows us to identify tweets about the presidential candidates by a simple search using the name of the candidate. The candidates seem to have quite unique surnames, and almost all media covering people with their surnames are covering namely those people.

Assumption 2. There is no bias in Twitter search results; i. e., Twitter is not being in favour of any candidate, or is not filtering negative search results, or is not affected by spam bots who praise candidates.

This assumption is crucial for the unbiasedness of our results; so far, there have been no signs of Twitter censorship in Russia. However, there might be a population bias: people who are using Twitter can turn out to be younger and more progressive, and therefore, more opposing to the current political leader, Putin. It means that whereas most tweets about Putin on Twitter can come from a minority of political opposition, the majority of Putin voters are budget-depending elderly people and beneficiaries who live in faraway regions without internet access or any desire to express their opinions publicly.

api_key <- "..."
api_secret <- "..."
access_token <- "..."
access_token_secret <- "..."
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
find1 <- "Путин"
find2 <- "Бабурин"
find3 <- "Грудинин"
find4 <- "Жириновский"
find5 <- "Собчак"
find6 <- "Сурайкин"
find7 <- "Титов"
find8 <- "Явлинский"
find9 <- "Навальный"
finds <- c(find1, find2, find3, find4, find5, find6, find7, find8, find9)
number <- 10000
tweet1 <- searchTwitter(find1, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
tweet2 <- searchTwitter(find2, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
tweet3 <- searchTwitter(find3, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
tweet4 <- searchTwitter(find4, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
tweet5 <- searchTwitter(find5, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
tweet6 <- searchTwitter(find6, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
tweet7 <- searchTwitter(find7, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
tweet8 <- searchTwitter(find8, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
tweet9 <- searchTwitter(find9, n=number, lang="ru", since="2018-01-01", until="2018-03-18")
save(tweet1, tweet2, tweet3, tweet4, tweet5, tweet6, tweet7, tweet8, tweet9, file="tweets.RData")

Breaking the tweets for translation

A simple snippet of code is suggested below to translate the long and heavy objects on Google Translate. As we know, Google charges API users on a character basis (as of writing, $20 per 1 million characters), with virtually no limit for translation (if the proper quotas are set to be infinite). However, just in order to be sure that the translations are made in time and no information is lost, we broke the messages into blocks of 100 tweets.

byhow <- 100
for (cand in 1:9) {
  texts <- get(paste0("tweet", cand))
  blocks <- ceiling(length(get(paste0("tweet", cand)))/byhow)
  for (i in 1:blocks) {
    sink(paste0("scripts", cand, "/scr", i, ".sh"))
    fr <- byhow*(i-1)+1
    to <- byhow*(i-1)+byhow
    curl -s -X POST -H "Content-Type: application/json" \\
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \\
    --data "{
      \'q\': \'
      if (to > length(texts)) to <- length(texts)
      for (j in fr:to) {
        a <- texts[[j]]$text
        a <- gsub("['\"`]", "", a)
      \'source\': \'ru\',
      \'target\': \'en\',
      \'format\': \'text\'
    }" "https://translation.googleapis.com/language/translate/v2"

Therefore, for each of the nine candidates, it will write out up to 100 script files into a corresponding folder.

Using Google Translate API

Once the script files, titled scr1.sh, ..., scr100.sh are ready, all that remains is run them in series with a time-out (since Google API has a character-per-minute quota).

for i in {1..100}
  echo "Processing tweet file $i of 100"
  ./scr$i.sh > out$i.txt
  sleep 12

This script must be run in every folder (1–9), and the translation results will be stored in files out1.txt, ..., out100.txt (or up to the number of tweets divided by 100 if the candidate was not very popular).

Merging the objects into tweet lists

A simple for loop (for 9 candidates, from 1 to the number of output files) is required to merge the data back.

alltrans <- list()
for (cand in 1:9) {
  lf <- list.files(paste0("./scripts", cand, "/"))
  lf <- lf[grep("out", lf)]
  howmany <- length(lf)
  alltrans[[cand]] <- list()
  for (i in 1:howmany) {
    a <- fromJSON(readLines(paste0("scripts", cand, "/out", i, ".txt")))
    a <- a[[1]]$translations$translatedText
    aa <- strsplit(a, "\n\n\n")[[1]]
    aa <- gsub("@ +", "@", aa)
    aa <- gsub("^\n +", "", aa)
    aa <- aa[aa!=""]
    alltrans[[cand]][[i]] <- aa
    if (! i%%10) print(paste0("Candidate ", cand, ": ", i, " out of ", howmany))
clean <- function(t) {
  t <- gsub('[[:punct:]]', '', t)
  t <- gsub('[[:cntrl:]]', '', t) 
  t <- gsub('\\d+', '', t)
  t <- gsub('[[:digit:]]', '', t)
  t <- gsub('@\\w+', '', t)
  t <- gsub('http\\w+', '', t)
  t <- gsub("^\\s+|\\s+$", "", t)
  t <- gsub('"', '', t)
  t <- strsplit(t,"[ \t]+")
  t <- lapply(t, tolower)
alltrans2 <- lapply(alltrans, unlist)
alltrans3 <- lapply(alltrans2, clean)

As we see, this clean-up included removing punctuation, control characters, URLs, and digits from the tweets, and then splitting them at spaces or tabs. For matching, conversion to lowercase was used.

Analysing sentiments

First and foremost, let us look at the distributions of positive and negative tweet counts for various candidates.

positive <- scan('positive-words.txt', what='character', comment.char=';')
negative <- scan('negative-words.txt', what='character', comment.char=';')
posscore <- function(tweet) {
pos.match <- match(tweet, positive)
pos.match <- !is.na(pos.match)
pos.score <- sum(pos.match)
negscore <- function(tweet) {
neg.match <- match(tweet, negative)
neg.match <- !is.na(neg.match)
neg.score <- sum(neg.match)
posscores <- list()
negscores <- list()
for (i in 1:9) {
posscores[[i]] <- unlist(lapply(alltrans3[[i]], posscore))
negscores[[i]] <- unlist(lapply(alltrans3[[i]], negscore))
poscounts <- lapply(posscores, table)
negcounts <- lapply(negscores, table)
cols <- rainbow(10)
cairo_pdf("counts1.pdf", 7, 5)
plot(NULL, NULL, xlim=c(0, 6.7), ylim=c(0, 7516), main = "Positive tweet counts for candidates", ylab = "Tweets", xlab="Positive words in a tweet")
for (j in 1:9) {
for (i in 1:length(poscounts[[j]])) {
lines(c(i-1 + (j-1)/10, i-1 + (j-1)/10), c(0, poscounts[[j]][i]), lwd=5, col=cols[j])
legend("topright", finds, lwd=5, col=cols)
cairo_pdf("counts2.pdf", 7, 5)
plot(NULL, NULL, xlim=c(0, 6.7), ylim=c(0, 7516), main = "Negative tweet counts for candidates", ylab = "Tweets", xlab="Negative words in a tweet")
for (j in 1:9) {
for (i in 1:length(negcounts[[j]])) {
lines(c(i-1 + (j-1)/10, i-1 + (j-1)/10), c(0, negcounts[[j]][i]), lwd=5, col=cols[j])
legend("topright", finds, lwd=5, col=cols)

Distribution of positive word counts

Distribution of negative word counts

Fig. 1. Distribution of tweets

Since there is no real political competition to Putin (position 1) among the present candidates (positions 2–8), let us pay some attention to his closest political rival who was denied access to the elections, Navalny (position 9). As we can see from Fig. 1, the number of tweets related to Putin and containing 1–2 positive words is smaller than that related to Navalny. At the very same time, paradoxically, the number of tweets related to Putin and containing 1–2 negative words is also smaller than that of Navalny.

totpos <- numeric(9)
totneg <- numeric(9)
for (i in 1:9) {
totpos[i] <- sum(poscounts[[i]] * as.numeric(names(poscounts[[i]])))
totneg[i] <- sum(negcounts[[i]] * as.numeric(names(negcounts[[i]])))
# Ordering candidates not alpabetically but as they are in the database
a <- factor(finds)
finds <- factor(a, levels(a)[c(5, 1, 2, 3, 6, 7, 8, 9, 4)])
ratios <- totpos/totneg
ratiosdf <- data.frame(ratio=ratios, name=finds)
p <- ggplot(data=ratiosdf, aes(x=name, y=ratio, fill=name)) + geom_bar(stat="identity") + ylab("Positive/negative ratio") +
theme_minimal() + coord_flip() + scale_x_discrete(limits = rev(levels(ratiosdf$name))) + theme(legend.position="none", axis.title.y=element_blank())
cairo_pdf("ratios.pdf", 8, 7)

Ratio of positive to negative tweet counts

Fig. 2. Tweet sentiment ratios for candidates

It is unsurprising that that the ratio for Navalny, who is known as a fighter against corruption, is lower than that of Putin, despite the fact that there is a substantial number of opposing bloggers, due to the fact that the word “corruption” is negative per se. In general, only one candidate (Grudinin) has this ratio greater or equal to one. This means that tweets containing the names of political leaders are negative in tone in general.

Now it it time to see which words were the top positive and top negative for each candidate.

poswords <- function(tweets) {
pmatch <- match(t, positive)
posw <- positive[pmatch]
posw <- posw[!is.na(posw)]
negwords <- function(tweets) {
pmatch <- match(t, negative)
negw <- negative[pmatch]
negw <- negw[!is.na(negw)]
pdata <- list()
ndata <- list()
for (j in 1:9) {
pdata[[j]] <- ""
ndata[[j]] <- ""
for (t in alltrans3[[j]]) {
pdata[[j]] <- c(poswords(t), pdata[[j]])
ndata[[j]] <- c(negwords(t), ndata[[j]])
pfreqs <- lapply(pdata, table)
nfreqs <- lapply(ndata, table)
dpfreq <- list()
dnfreq <- list()
for (j in 1:9) {
thresh <- length(alltrans2[[j]])/200
a <- pfreqs[[j]]
a <- a[a>thresh]
b <- nfreqs[[j]]
b <- b[b>thresh]
dpfreq[[j]] <- data.frame(word=as.character(names(a)), freq=as.numeric(a))
dpfreq[[j]]$word <- as.character(dpfreq[[j]]$word)
dnfreq[[j]] <- data.frame(word=as.character(names(b)), freq=as.numeric(b))
dnfreq[[j]]$word <- as.character(dnfreq[[j]]$word)
for (j in 1:9) {
p1 <- ggplot(dpfreq[[j]], aes(word, freq)) + geom_bar(stat="identity", fill="lightblue") + theme_bw() +
geom_text(aes(word, freq, label=freq), size=4) + labs(x="Major Positive Words", y="Frequency of Occurence",
title=paste0("Major Positive Words and Occurence for \n", finds[j], ", n = ", length(alltrans2[[j]]))) + 
p2 <- ggplot(dnfreq[[j]], aes(word, freq)) + geom_bar(stat="identity", fill="pink") + theme_bw() +
geom_text(aes(word, freq, label=freq), size=4) + labs(x="Major Negative Words", y="Frequency of Occurence",
title=paste0("Major Negative Words and Occurence for \n", finds[j], ", n = ", length(alltrans2[[j]]))) + 
cairo_pdf(paste0("posw", j, ".pdf"),  2+nrow(dpfreq[[j]])/8, 5)
cairo_pdf(paste0("negw", j, ".pdf"),  2+nrow(dnfreq[[j]])/8, 5)

Positive words for Putin
Negative words for Putin

Positive words for Baburin
Negative words for Baburin

Positive words for Grudinin
Negative words for Grudinin

Positive words for Zhirinovsky
Negative words for Zhirinovsky

Positive words for Sobchak
Negative words for Sobchak

Positive words for Surajkin
Negative words for Surajkin

Positive words for Titov
Negative words for Titov

Positive words for Yavlinsky
Negative words for Yavlinsky

Positive words for Yavlinsky
Negative words for Navalny

Fig. 3. Most frequent positive and negative words for candidates

We defined “most frequent” words as words whose count exceeded the total number of tweets divided by 200. There is just one small problem: such generic words as “like” and “right”, “well” etc. are on this table. We can remedy that by manually constructing a table of most popular meaningful positive and negative words for candidates (Table 1).

Candidate 1st + word 2nd + word 1st − word 2nd − word
Putin support 275 popular 252 problems 224 protests 153
Baburin important 30 right 27 curses 340 shit 235
Grudinin interesting 416 respect 266 dirty 232 poor 160
Zhirinovsky benefit 225 entertain 206 curses 339 whore 333
Sobchak useful 165 truthful 149 whore 571 beg 410
Surajkin free 86 regard 61 provocation 348 curses 339
Titov works 51 ready 30 shit 284 scandal 237
Yavlinsky liberty 91 great 83 sorry 339 shit 294
Navalny top 232 good 181 strike 355 boycott 352

Table 1. Two most frequent meaningful words for candidates


We can make several conclusions. First, several candidates (Baburin, Surajkin, Titov, Yavlinsky) are so unpopular that they were not mentioned 10 000 times over the course of 2.5 months since the beginning of 2018, which implies that they were added to the list of approved candidates to create the illusion of choice; they cannot compete against the obvious leader of the race. Next, on average, people tweet about politicians when they are discontent with something; it is reflected in the average negative score of tweets, except for Putin and Grudinin. Some politicians (Zhirinovsky, Sobchak) are viewed as politically promiscuous, which is reflected in the word frequency analysis. This all creates the illusion of choice presence, whereas most of the population of Russia does not have access to Twitter, or are not politically engaged, or are dependent on the subsidies issued by the present government, or cannot fully explore the majority of opposing opinions because they are carefully filtered out of the central and official media.


Bessi, Alessandro and Emilio Ferrara (2016). ‘Social bots distort the 2016 US Presidential election online discussion’.

Bhatia, Aankur (2015). Twitter Sentiment Analysis Tutorial. Rpubs.

Burgess, Jean and Axel Bruns (2012). ‘(Not) the Twitter election: the dynamics of the #ausvotes conversation in relation to the Australian media ecology’. Journalism Practice 6.3, pp. 384–402.

Kapko, Matt (2016). Twitter’s impact on 2016 presidential election is unmistakable. CIO.

Larsson, Anders Olof and Hallvard Moe (2012). ‘Studying political microblogging: Twitter users in the 2010 Swedish election campaign’. New Media & Society 14.5, pp. 729–747.

Sang, Erik Tjong Kim and Johan Bos (2012). ‘Predicting the 2011 dutch senate election results with twitter’. Proceedings of the workshop on semantic analysis in social media. Association for Computational Linguistics, pp. 53–60.

Wang, Hao et al. (2012). ‘A system for real-time twitter sentiment analysis of 2012 us presidential election cycle’. Proceedings of the ACL 2012 System Demonstrations. Association for Computational Linguistics, pp. 115–120.

Yang, Xiao, Craig Macdonald and Iadh Ounis (2017). ‘Using word embeddings in twitter election classification’. Information Retrieval Journal, pp. 1–25.

Гибель России

Сто лет назад погибла Россия. Не могу собраться с мыслями, поэтому привожу отрывки идей, приходящих в голову.

Читал «Солнце мёртвых» Шмелёва — слёзы наворачиваются на глаза. Как пел Тальков: «О, генеральская тетрадь, забытой правды возрожденье, как тяжело тебя читать обманутому поколенью». Нет больше той самой России, которая подарила миру блестящих писателей и композиторов, которые сформировали то, что все называют русской культурой. Горько от этого.

Отмечать столетие революции — значит радоваться гибели родной культуры.

Да, у меня крестьянские корни, моими предками были малороссийские крестьяне, поэтому, живи я в Российской империи, наиболее ярким событием всей жизни была бы смерть в 30 лет от тифа. Тем не менее, лучше бы уж я сто раз умер в Российской империи крестьянином, чем один раз родился в современной России. Если Бога нет, то кого мне благодарить за то, что хотя бы не родился в СССР?

Если бы у меня была возможность вызвать дух неизвестного белогвардейца и высказать ему все свои мысли, то я бы произнёс всего лишь два слова: «не уберегли».

А современный русский интеллигент мечтает сделать две вещи: устроить в России революцию и бежать во Францию.

О коммунистах

Во «Власти Народа» передовая: «Настал грозный час — гибнет Россия и Революция. Все на защиту революции, так ещё недавно лучезарно сиявшей миру!» — Когда она сияла, глаза ваши бесстыжие?

Ив. Бунин, «Окаянные дни».

Среди людей интеллигентных не принято высказываться о политике — заведено вежливо молчать. С другой стороны, во-первых, ваш покорный слуга никогда не претендовал на интеллигентность, а во-вторых, если все будут молчать (из врождённой незлобливости или от блестящего дворянского воспитания), то в таком случае, как сказал великий, «мы окрасим себя в те цвета, в которые мы окрасим себя» — будем иметь что имеем.

Сегодня получил весточку из России, однако, как оказалось, не самую приятную. КПРФ распространяет по почтовым ящикам листовки, в которых призывает голосовать за своих кандидатов в муниципальные депутаты в Москве.

Я не удивлён, что в политику лезет всякая сволота. Если кто-то задавался вопросом, почему российская экономика так слаба и неконкурентоспособна на международной арене, ответ лежит прямо у него в почтовом ящике. Кумачовые цареубийцы и сермяжные леворуционеры. Почему агитируют они, а стыдно мне? Как у них вообще хватает наглости в столетнюю годовщину великой русской катастрофы и гибели не только русского государства, но и многовековой русской культуры проповедовать коммунизм и ещё пытаться на нём нажиться?

St petersburg demonstration revolution

Однако по прочтении их предвыборной программы в мою душу закрались сомнения. Если в планах кандидатов не значится разжигать пожар мировой революции, то что это за коммунисты такие? Какое-то жалкое недоразумение, предатели собственной идеологии. Хороший коммунист — мёртвый коммунист!

Что делать, если в майские праздники укусила собака?

Скоро грядут майские праздники (причём по уровню нелепости названия День весны и труда можно приравнять только ко Дню зимы и Петропавловской крепости), и парки наводнят владельцы бульдогов, бультерьеров и прочих бульбазавров. Что делать, если вы мирно тарахтели на своём велосипеде по прогулочной дорожке, а на вас внезапно выскочил рычащий Баргест и укусил вас за самое дорогое, что у вас есть, — за мясистую ляжку?

Так вот, математика даёт ответ на этот непростой вопрос. Надо было заранее решить математическую задачу!

Задача. На улице на прямой пешеходной дорожке стоит хозяин с собакой на натянутом поводке длины \(r\) (поводок натянут параллельно дорожке). На расстоянии \(l < r\) по велосипедной дорожке, параллельной пешеходной дорожке, едет велосипедист со скоростью \(v_2\). Как только велосипедист попадает в радиус досягаемости собаки (радиус поводка) в момент \(t_0\), она бежит к нему по траектории, минимизирующей время от момента попадания велосипедиста в предел досягаемости до укуса, со скоростью \(v_1\).

  • Найдите предельное значение скорости велосипедиста \(v_2(v_1, l, r)\), при котором собака не сможет его укусить.
  • В предположении, что при данных параметрах \(r\), \(l\), \(v_1\), \(v_2\) существует траектория, позволяющая собаке успеть добежать до велосипедиста и укусить его, выведите уравнение этой траектории \(x(t), y(t)\) и рассчитайте время \(t^*\), которое пройдёт с момента \(t_0\) до укуса.

P.S. Наверно, вам давно глаза не резали скверные и наспех сделанные иллюстрации? Так вот нате, вот, пожалуйста вам в шапочку...

О вреде дробления выборки

Предположим, изначально у нас было одно полное облако точек. Мы разделили его на три непересекающихся подмножества (обозначены цветами). В каждом из цветных облаков есть значимая взаимосвязь между переменными \(X\) и \(Y\), которую обнаруживают любые регрессионные методы (приведены три: классический МНК, квантильная регрессия при \(q=0{,}5\) и локально линейная регрессия первой степени).

Ложные значимые эффекты на подвыборках

Однако если взглянуть на изначальное облако из всех точек всех трёх цветов, то эффекта никакого не будет (эта ничтожно слабая и малозначимая зависимость показана толстой чёрной линией). Получается парадокс: если индивид относится к зелёной подвыборке, то в ней влияние \(X\) на \(Y\) есть. То же верно и для всех остальных подвыборок. Везде эффект положительный! Однако одновременно для всех точек никакого эффекта нет.

Зависимая переменная: Y
Вся выборка Синие Зелёные Красные
X 0.013 0.839*** 0.482*** 0.598***
(0.030) (0.024) (0.048) (0.046)
Константа 0.004 0.013 -1.202*** 1.245***
(0.031) (0.019) (0.055) (0.053)
Наблюдений 1000 386 307 307
R2 0.0002 0.756 0.252 0.361
Adjusted R2 -0.001 0.755 0.250 0.359
Residual Std. Error 0.981 (df = 998) 0.373 (df = 384) 0.667 (df = 305) 0.640 (df = 305)
F Statistic 0.198 (df = 1; 998) 1,187.609*** (df = 1; 384) 102.762*** (df = 1; 305) 172.266*** (df = 1; 305)
Note: *p<0.1; **p<0.05; ***p<0.01

Вывод: чем более избирательно исследователь будет составлять выборку для исследования («пожилое коренное население, которое до этого трудилось на работах, не требующих высокой квалификации»), тем больше вероятность, что он что-нибудь да обнаружит... и что эта находка будет ложной.

Код для репликации в R (отчего-то плагин глючит и склеивает всё в строку):

n <- 1000; set.seed(100); x <- rnorm(n); y <- rnorm(n); th <- 0.7; g1 <- (y < x + th) & (y > x - th);
g2 <- (y <= x - th); g3 <- (y >= x + th);
m <- lm(y~x); m1 <- lm(y~x, subset=g1); m2 <- lm(y~x, subset=g2); m3 <- lm(y~x, subset=g3)

Самое кошмарное музыкальное произведение

В Сети есть страница с очень интересным списком произведений, озаглавленная «The Darker Side of Classical Music» («Тёмная сторона классической музыки»). На ней собраны образцы наиболее трагичных произведений, которые вышибают у слушателей слезу или вселяют в них ужас. Некоторые записи в нём спорны (Шостакович?), некоторые не вызывают никакого сомнения (Шестая симфония Чайковского, «Плач по жертвам Хиросимы» Пендерецкого, «Остров мёртвых» Рахманинова). Как и в любом подобном списке, в нём есть фамилии забытых позднеромантических композиторов второго эшелона.

Однако одной записи в нём не хватает. Не хватает упоминания композитора, известного как «Чёрный Балакирев» благодаря разрушительной и мрачной силе, присутствующей в его произведениях. Это Сергей Михайлович Ляпунов, родной брат Александра Ляпунова. Он является автором двух симфоний и двух фортепианных концертов. Каждый из них играет тысячами разных красок: Первый концерт — классическая русская печаль, Второй концерт — новь, свежее дыхание, короткий, но изящный эксперимент с гармонией, Первая симфония — эпическое произведение, весь финал которого представляет собой пиррову победу, радость со слезами на глазах...

Однако у Ляпунова есть одно произведение, о котором почти никто не знает. Его почти никогда не исполняют. Во всём зарубежном интернете нет ни одной статьи, посвящённой ему. Это Вторая симфония — самая острая музыкальная катастрофа, чёрный кошмар, падение в пропасть безумия, олицетворение демона самоубийства и символ гибели России. Написанное в 1917 году, оно содержит максимальную концентрацию трагических чувств в первой части. Если у Скрябина первая часть Второй симфонии — это непрерывная борьба с Природой, напряжённая и предвещающая грозу, то у Ляпунова намного больше личного трагизма: мировая война, революция (после которой композитор эмигрировал в Париж), нежелание принять крах великой империи, приходящейся ему родной страной... Всегда бывший мрачным и угнетающим, во Второй симфонии композитор зашёл за грань разрушительного и разбивающего сердце.

Сегодня исполнился ровно год с того самого момента, как в библиотеке Cité Universitaire была совершена самая катастрофическая ошибка 2015 года — была открыта, прослушана и переслушана симфония Ляпунова №2 си-бимоль минор. Именно 18 апреля все демоны, роившиеся внутри этого сундука, вырвались на свободу. Автор убедительно просит неподготовленного читателя смириться с тем, что полчище ночных бесов будет подталкивает прослушивающего эту симфонию сброситься с моста, вспороть живот, взрезать горло или что-нибудь похуже. Все, кто не боится перейти на чёрную сторону искусства, приглашаются к прослушиванию этой музыкальной гибели целой страны только в высококачественных наушниках или перед полноценной акустической системой. Помощи вам ждать неоткуда, никто к вам не придёт. Если вдруг вы не сможете остановиться, заставьте кого-нибудь заставьте вас поклясться самому/самой себе, что никогда в жизни вы больше не переслушаете этого произведения.