This post requires basic knowledge of matrix calculus, in particular eigen-decomposition. Understanding how the k-means clustering algorithm works is also useful. The R-code to reproduce all the figures is given at the bottom of the post.

**The Laplacian of the adjacency matrix**

Given N nodes of the network and their links, the adjacency matrix A is the N*N matrix whose element A_ij = 1 if node i is connected to node j, and A_ij = 0 otherwise. In an undirected network the adjacency matrix is symmetric, and since no “self-links” are allowed A_ii = 0 for all i.

In a network with community structure the adjacency matrix is block-diagonal: if the first n_1 nodes belong to community 1, the next n_2 nodes belong to community 2, and so on until n_K. Here is an example with K=2:

(If the labels of the nodes are not nicely ordered as above, the block-diagonal structure might not be immediately visible; I will come to this point later.)

The Laplacian L of the network is the N*N matrix that has L_ij = -1 if nodes i and j are connected and L_ij = 0 otherwise. The i-th diagonal element L_ii is equal to the total number of links (the degree) of node i. In other words, L = D - A, where D is the diagonal degree matrix.
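In R, this matrix can be assembled directly as L = D - A, where D is the diagonal matrix of node degrees. A minimal sketch (the 4-node toy network, two disconnected pairs, is my own example, not from the post):

```r
# toy adjacency matrix: nodes 1-2 linked, nodes 3-4 linked
A <- matrix(0, 4, 4)
A[1,2] <- A[2,1] <- 1
A[3,4] <- A[4,3] <- 1

# Laplacian L = D - A, with D the diagonal degree matrix
L <- diag(rowSums(A)) - A

rowSums(L)       # every row sums to zero by construction
eigen(L)$values  # one (near-)zero eigenvalue per connected component
```

With two disconnected pairs, the eigen-decomposition returns exactly two vanishing eigenvalues, which is the property exploited below.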

The Laplacian has a number of useful properties. The one exploited here is that the number of zero eigenvalues of L equals the number of connected components of the network.

Look at the following block-diagonal Laplacian matrix with N=10, K=2 (5 nodes in each community) and its eigen-decomposition:

The eigenvalue spectrum shows that the last two eigenvalues are zero. The corresponding eigenvectors are shown as well (the blue one is shifted upwards by one unit). Their elements are constant inside the blocks.

In the following, we use the eigenvalue spectrum to infer the number of communities, and we use the eigenvectors to infer which nodes belong to which community.

**Inferring the number of communities**

As shown above, the number of blocks of the Laplacian can be inferred from the number of vanishing eigenvalues if the block-diagonal structure is perfect. If there are links between nodes in different communities, the Laplacian is only approximately block-diagonal, since some off-diagonal elements are nonzero. Therefore, the last few eigenvalues will not vanish exactly, but they will be close to zero. The following Laplacian corresponds to the same network as above, but where the link between nodes 1 and 2 has been deleted, and a link between nodes 3 and 9 has been added.

The number of small eigenvalues, and thus the number of communities, can be chosen by “eyeballing” the eigenvalue spectrum for where the cutoff occurs. Or one can automate the selection of K.

I found that the k-means clustering algorithm (R implementation stats::kmeans) can be used to determine the cutoff. The idea is that the eigenvalue spectrum is quite smooth until a sharp cutoff appears before the almost-vanishing eigenvalues. This cutoff divides the eigenvalues into two clusters and we can make the k-means algorithm look for exactly two clusters. For the above eigenvalue spectrum, k-means correctly finds the size of the smaller cluster to be 2:

```
>> ei <- eigen(laplacian)
>> kmeans(ei$values,centers=range(ei$values))
...
Clustering vector:
 [1] 2 2 2 2 2 2 2 2 1 1
...
```

**Inferring the community structure**

I shuffled the labels of the nodes in the figure above but kept their links unchanged. The new Laplacian and its eigen-decomposition now look as follows:

Do you still see the community structure in the Laplacian? I don’t. But one of the eigenvectors can see it.

There are still two very small eigenvalues, indicating two communities, which is the correct number.

The second to last eigenvector (blue) indicates the community structure. Running the k-means clustering algorithm on the elements of this vector yields:

```
>> kmeans(ei$vectors[,9],centers=range(ei$vectors[,9]))
...
Clustering vector:
 [1] 1 1 2 1 2 1 2 2 2 1
...
```

For comparison, the original node labels were

```
>> shuf.ind
 [1]  6  7  1 10  4  8  3  2  5  9
```

That is, nodes 1, 2, 4, 6 and 10 (which correspond to nodes 6-10 under the original labeling) are put into one cluster, and the remaining nodes into the other. By applying the k-means algorithm to the eigenvector elements, we reconstructed the community structure perfectly. I found this method to work for larger values of N as well. If K increases, you have to apply the above procedure iteratively. If more links exist between nodes in different communities, misclassification or overestimation of the number of communities occurs.

The goal of this post was to illustrate a naive method for inferring an unknown network property, namely the community structure. All I know about the network Laplacian and its eigen-decomposition is based on reference [1]. No thorough testing of the above method was performed, and I have no detailed results about misclassification behavior or under-/over-estimation of the number of clusters. Please check [1] and the references therein if you are curious about more advanced methods for community detection.

I find the above method quite visual and understandable. In case the k-means algorithm returns completely unreasonable results, both the counting of the small eigenvalues and the separation of the eigenvector elements could just as well be performed by hand if the number of communities is small. I consider this a useful property which more advanced methods might not have.

If you find errors or something that remains unclear, feel free to leave a comment. This is it, thanks for watching.

**References**

[1] Newman (2004) Detecting community structure in networks, Eur. Phys. J. B 38, DOI:10.1140/epjb/e2004-00124-y

The R code to produce the above figures:

```r
graphics.off()
JPG <- TRUE

#################################
# block-diagonal adjacency matrix
# Figure 1
#################################
if (JPG) {
  jpeg("adjacency-mat.jpg",width=400,height=300,quality=100)
} else {
  x11()
}
par(cex.lab=2,cex.main=2,cex.axis=2)
#construct block diagonal adjacency matrix
v1 <- c(rep(1,5),rep(0,5))
v2 <- c(rep(0,5),rep(1,5))
blomat <- rbind(v1,v1,v1,v1,v1,v2,v2,v2,v2,v2)
diag(blomat) <- 0
layout(t(c(1,1,1,1,2)))
#plot adj. matrix
image(blomat[,10:1],asp=1,axes=F,col=rainbow(2),
      main="adjacency matrix")
#colorbar
par(mar=c(7,2,7,5))
image(matrix(seq(2),ncol=2),axes=F,col=rainbow(2))
axis(side=4,at=seq(0,1,length.out=2),labels=paste(c(0,1)),
     cex.axis=2)
if (JPG) dev.off()

############################
# perfect two block laplacian + eigenspectrum
# Figure 2
############################
#construct laplacian
laymat <- rbind(c(1,1,1,1,1,2),c(3,3,3,4,4,4))
v1 <- c(rep(-1,5),rep(0,5))
v2 <- c(rep(0,5),rep(-1,5))
blomat <- rbind(v1,v1,v1,v1,v1,v2,v2,v2,v2,v2)
diag(blomat) <- 4
if (JPG) {
  jpeg("laplace-perfect.jpg",width=500,height=500,quality=100)
} else {
  x11()
}
par(cex.lab=2,cex.main=2,cex.axis=2)
layout(laymat)
#plot laplacian
image(blomat[,10:1],asp=1,axes=F,col=rainbow(6),
      main="Laplacian")
#colorbar
par(mar=c(7,2,7,5))
image(matrix(seq(3),ncol=3),axes=F,col=rainbow(6)[c(1,2,6)])
axis(side=4,at=seq(0,1,length.out=3),labels=paste(c(-1,0,4)),
     cex.axis=2)
par(mar=c(4,6,4,4))
#eigendecomp.
ei <- eigen(blomat)
#plot eigenvalues
plot(ei$values,main="eigenvalues")
#plot eigenvectors
plot(NULL,xlim=c(1,10),ylim=c(-1,2),main="last two eigenvectors",
     ylab=NA)
lines(ei$vectors[,9]+1,type="o",col="blue")
lines(ei$vectors[,10],type="o",col="orange")
if (JPG) dev.off()

############################
# block-diagonal laplacian with impurities
# Figure 3
############################
#construct the laplacian
v1 <- c(rep(-1,5),rep(0,5))
v2 <- c(rep(0,5),rep(-1,5))
blomat <- rbind(v1,v1,v1,v1,v1,v2,v2,v2,v2,v2)
blomat[1,2] <- blomat[2,1] <- 0
blomat[3,9] <- blomat[9,3] <- -1
diag(blomat) <- 0
diag(blomat) <- -colSums(blomat)
if (JPG) {
  jpeg("laplace-imperfect.jpg",width=500,height=500,quality=100)
} else {
  x11()
}
par(cex.lab=2,cex.main=2,cex.axis=2)
layout(laymat)
#plot the laplacian
image(blomat[,10:1],asp=1,axes=F,col=rainbow(7),
      main="Laplacian")
par(mar=c(7,2,7,5))
#colorbar
image(matrix(seq(7),ncol=7),axes=F,col=rainbow(7))
axis(side=4,at=seq(0,1,length.out=7),labels=paste(seq(-1,5)),
     cex.axis=2)
par(mar=c(4,6,4,4))
#eigendecomposition and k-means clustering
ei <- eigen(blomat)
print(kmeans(ei$values,range(ei$values)))
#plot eigenvalues
plot(ei$values,main="eigenvalues")
#plot eigenvectors
plot(NULL,xlim=c(1,10),ylim=c(-1,2),
     main="last two eigenvectors",ylab=NA)
lines(ei$vectors[,9]+1,type="o",col="blue")
lines(ei$vectors[,10],type="o",col="orange")
if (JPG) dev.off()

###############################
# disordered block-diagonal laplacian with impurities
# Figure 4
###############################
#construct Laplacian
v1 <- c(rep(-1,5),rep(0,5))
v2 <- c(rep(0,5),rep(-1,5))
blomat <- rbind(v1,v1,v1,v1,v1,v2,v2,v2,v2,v2)
blomat[1,2] <- blomat[2,1] <- 0
blomat[3,9] <- blomat[9,3] <- -1
shuf.ind <- c(6, 7, 1, 10, 4, 8, 3, 2, 5, 9)
blomat <- blomat[shuf.ind, shuf.ind]
diag(blomat) <- 0
diag(blomat) <- -rowSums(blomat)
if (JPG) {
  jpeg("laplace-disordered.jpg",width=500,height=500,quality=100)
} else {
  x11()
}
par(cex.lab=2,cex.main=2,cex.axis=2)
layout(laymat)
#plot Laplacian
image(blomat[,10:1],asp=1,axes=F,col=rainbow(7),
      main="Laplacian")
par(mar=c(7,2,7,5))
#plot colorbar
image(matrix(seq(7),ncol=7),axes=F,col=rainbow(7))
axis(side=4,at=seq(0,1,length.out=7),labels=paste(seq(-1,5)),
     cex.axis=2)
par(mar=c(4,6,4,4))
#eigendecomposition
ei <- eigen(blomat)
#plot eigenvalues
plot(ei$values,main="eigenvalues")
#plot eigenvectors
plot(NULL,xlim=c(1,10),ylim=c(-1,2),
     main="last two eigenvectors",ylab=NA)
lines(ei$vectors[,9]+1,type="o",col="blue")
lines(ei$vectors[,10],type="o",col="orange")
if (JPG) dev.off()
```

In this short post I present my first attempt at network analysis: a minimal example to construct and visualize an artificial undirected network with community structure in R. No network libraries are loaded; only basic R functions are used.

The final product of this post is this plot:

I am going to walk you through the code to produce the above plots. The complete code is given at the bottom of the post.

**The starting point**

First of all the network specification:

```r
#network specs: K communities, N nodes, n[1] of them have
#label 1, etc..., p(link inside)=p.in, p(link outside)=p.out
K <- 3
N <- 40
n <- c(rep(floor(N/K),K-1),floor(N/K)+N%%K)
labls <- rep(seq(K),n)
p.in <- .9
p.out <- .1
pairs <- expand.grid(seq(N),seq(N))
uniq.pairs <- pairs[which(pairs[,1]<pairs[,2]),]
uniq.pairs <- lapply(apply(uniq.pairs,1,list),unlist) # I hate R for this line
```

There are `N` nodes and `K` communities. `n[1]` nodes are in community 1, `n[2]` in community 2, etc. The probability `p.in` is the link probability between two nodes in the same community, and `p.out` is the link probability across communities. The variable `labls` contains the community label of each node. The variable `pairs` contains all possible tuples of the set (1,…,N); `uniq.pairs` is a list of 2-element vectors containing only the unique node pairs.

**The p-matrix**

Such a network with community structure is typically modeled by what I would call a p-matrix (the technical term is “stochastic block model”). The entry in the i-th row and j-th column tells me the probability that node i is connected to node j. Since the network is undirected, a link between i and j is equivalent to a link between j and i, hence the p-matrix is symmetric. Furthermore no loops are allowed here, so the diagonal elements of the p-matrix are zero.

Community structure leads to the p-matrix being block-diagonal, with large constant entries `p.in` in the blocks along the diagonal and small or vanishing entries `p.out` outside of these blocks.

```r
#plot the block matrix
plot(NULL,xlim=c(1,N),ylim=c(1,N),asp=1,axes=F,xlab=NA,ylab=NA,
     main="p-matrix")
lapply( uniq.pairs, function(z) {
  colr <- ifelse(labls[z[1]]==labls[z[2]],gray(1-p.in),gray(1-p.out))
  points(z,N-z[c(2,1)],pch=15,col=colr) } )
```
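As an aside, the same p-matrix can also be assembled as an actual matrix object using `outer`. This is a sketch of my own (not part of the original script), reusing the network specs from above:

```r
#network specs as above
N <- 40; K <- 3
n <- c(rep(floor(N/K),K-1),floor(N/K)+N%%K)
labls <- rep(seq(K),n)
p.in <- .9; p.out <- .1

#p.in where the labels agree, p.out elsewhere
pmat <- ifelse(outer(labls,labls,"=="), p.in, p.out)
diag(pmat) <- 0  #no self-links
```

The resulting `pmat` is symmetric, has zeros on the diagonal, and could be plotted directly with `image()` instead of the point-by-point loop above.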

**The adjacency list**

Based on the network specifications, an adjacency list is constructed. The adjacency list is the list of links in the network. A link between nodes i and j is realized with probability `p.in` if the nodes are in the same community, and with probability `p.out` otherwise.

```r
#sample an adjacency list
adj.list <- lapply(uniq.pairs,function(z) {
  p <- ifelse( labls[z[1]]==labls[z[2]], p.in, p.out)
  ifelse(runif(1)<p,1,0)*z })
adj.list <- adj.list[ which(lapply(adj.list,sum)>0) ]
```

One way of visualizing the network is to plot the adjacency matrix which has dimension N*N and has a one at index (i,j) if nodes i and j are connected and a zero otherwise:

```r
#plot the adjacency matrix
plot(NULL,xlim=c(1,N),ylim=c(1,N),asp=1,axes=F,xlab=NA,ylab=NA,
     main="adjacency matrix")
lapply( adj.list, function(z) { points(z,N-z[c(2,1)],pch=15) } )
```

**Visualizing the links**

Another way of visualizing the network is to plot the individual nodes and connect them with lines according to their links:

```r
#plot the network
plot(NULL,xlim=c(0,K),ylim=c(0,1),axes=F,xlab=NA,ylab=NA,
     main="network")
coords <- matrix(runif(2*N),ncol=2)+cbind(labls-1,rep(0,N))
lapply( adj.list, function(z) {
  x <- coords[ z,1 ]
  y <- coords[ z,2 ]
  lines(x,y,col="#55555544") } )
colrs <- rainbow(K)
points( coords, pch=15, col=colrs[ labls ])
```

**So far so good**

This is it. As I said, it was the first step.

Next I want to look at ways to guess the node labels (the community structure) only based on the network structure, so stay tuned.

Thanks for watching.

———————-

The complete code:

```r
######################################################
# construct and plot a stochastic block model, and a
# corresponding network with community structure
######################################################

################################################################
#network specs: K communities, N nodes, n[1] of them have
#label 1, etc..., p(link inside)=p.in, p(link outside)=p.out
################################################################
N <- 40
K <- 3
n <- c(rep(floor(N/K),K-1),floor(N/K)+N%%K)
labls <- rep(seq(K),n)
p.in <- .9
p.out <- .1

##############################################################
#sample an adjacency list
# pairs (uniq.pairs) ... list of possible (unique) node pairs
# adj.list ... list of links between nodes
##############################################################
pairs <- expand.grid(seq(N),seq(N))
uniq.pairs <- pairs[which(pairs[,1]<pairs[,2]),]
uniq.pairs <- lapply(apply(uniq.pairs,1,list),unlist) # I hate R for this line
adj.list <- lapply(uniq.pairs,function(z) {
  p <- ifelse( labls[z[1]]==labls[z[2]], p.in, p.out)
  ifelse(runif(1)<p,1,0)*z })
adj.list <- adj.list[ which(lapply(adj.list,sum)>0) ]

#everything in one plot
jpeg("network.jpg",width=500,height=500,quality=100)
par(mar=c(2,2,4,2),cex.main=1.5)
layout(t(matrix(c(1,2,3,3),nrow=2)))
#plot the block matrix
plot(NULL,xlim=c(1,N),ylim=c(1,N),asp=1,axes=F,xlab=NA,ylab=NA,
     main="p-matrix")
lapply( uniq.pairs, function(z) {
  colr <- ifelse(labls[z[1]]==labls[z[2]],gray(1-p.in),gray(1-p.out))
  points(z,N-z[c(2,1)],pch=15,col=colr) } )
#plot the adjacency matrix
plot(NULL,xlim=c(1,N),ylim=c(1,N),asp=1,axes=F,xlab=NA,ylab=NA,
     main="adjacency matrix")
lapply( adj.list, function(z) { points(z,N-z[c(2,1)],pch=15) } )
#plot the network
plot(NULL,xlim=c(0,K),ylim=c(0,1),axes=F,xlab=NA,ylab=NA,
     main="network")
coords <- matrix(runif(2*N),ncol=2)+cbind(labls-1,rep(0,N))
lapply( adj.list, function(z) {
  x <- coords[ z,1 ]
  y <- coords[ z,2 ]
  lines(x,y,col="#55555544") } )
colrs <- rainbow(K)
points( coords, pch=15, col=colrs[ labls ])
dev.off()
```

My minimal bibliography manager needs to do two things:

- add a new entry to my bibliography (bibtex), and
- search through the articles.

Here’s my minimal approach to implementing these two features:

**1. Bibtex**

Crossref provides a simple solution to obtain the bibtex entry based on the doi number. To get the bibtex entry for the article with, say, doi number 10.1901/jaba.1974.7-497a, simply issue the command

```shell
>> curl -LH "Accept: text/bibliography; style=bibtex" "http://dx.doi.org/10.1901/jaba.1974.7-497a"
```

```
@article{Upper_1974, title={The unsuccessful self-treatment of a case of “writer’s block”1}, volume={7}, url={http://dx.doi.org/10.1901/jaba.1974.7-497a}, DOI={10.1901/jaba.1974.7-497a}, number={3}, journal={Journal of Applied Behavior Analysis}, publisher={Society for the Experimental Analysis of Behavior}, author={Upper, Dennis}, year={1974}, pages={497-497}}
```

To make the output look nice, we apply some sed magic:

```shell
>> curl -LH "Accept: text/bibliography; style=bibtex" "http://dx.doi.org/10.1901/jaba.1974.7-497a" | sed "s/, /,\n/;s/},/},\n/g;s/\(.*\)}}/\1}\n}\n/"
```

```
@article{Upper_1974,
title={The unsuccessful self-treatment of a case of “writer’s block”1},
volume={7},
url={http://dx.doi.org/10.1901/jaba.1974.7-497a},
DOI={10.1901/jaba.1974.7-497a},
number={3},
journal={Journal of Applied Behavior Analysis},
publisher={Society for the Experimental Analysis of Behavior},
author={Upper, Dennis},
year={1974},
pages={497-497}
}
```

The above output can be piped to a separate file. I created a directory called `.txtfiles` inside my `articles` directory. For each file `article.pdf`, there is a corresponding `.txtfiles/article.pdf.txt` in this directory. These txt-files contain the bibtex information generated above, and the output produced in the next section.

**2. Making PDF files searchable**

My implementation of pdf full text search involves an application of `pdftotext` to all the articles in the directory. This definitely produces some overhead, about 50Kb per 10 pages, but I accept this. The following sed-sequence removes a lot of the single-letter, special-character and empty-line junk produced by running `pdftotext` on an article that contains lots of figures and equations:

```shell
FILE=article.pdf
pdftotext $FILE - \
 | sed "s/[^a-zA-Z ]//g"\
 | sed "s/ /  /g"\
 | sed "s/ . //g"\
 | sed "s/  / /g"\
 | sed "s/^. //"\
 | sed "s/ .$//"\
 | sed "/^\s*.\{0,1\}\s*$/d"\
 >> .txtfiles/$FILE.txt
```

Line-by-line translation:

- transform the file article.pdf to text and write it to standard output
- remove all non-letters
- replace single space by double space
- remove all single-character words
- replace double space by single space
- remove single characters at the beginning of a line
- remove single characters at the end of a line
- remove all lines that consist of at most a single character and spaces
- and append the output to `.txtfiles/article.pdf.txt`

That’s it for now. The file `.txtfiles/article.pdf.txt` now contains all the information I need about this article.

**Outlook**

Using the two functions 1.) and 2.) you can hack your own minimal bib manager. After the directory `.txtfiles` has been filled, you can invoke full text searches using `grep`, for example

```shell
grep -H 'author=.*Upper' .txtfiles/*
```

or have all the bibtex entries returned using `sed`, like this

```shell
sed -s '/^$/,$d' .txtfiles/*
```

which assumes that the bibtex entry is at the top of the txt-file and that the first blank line of the file appears right after the bibtex entry.
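To illustrate what that `sed` call does, here is a self-contained demo on a mock txt-file (the file name and contents are made up for illustration):

```shell
# create a mock txt-file: bibtex entry, blank line, then full text
mkdir -p .txtfiles
printf '@article{Upper_1974,\nyear={1974}\n}\n\nfull text of the article follows...\n' \
  > .txtfiles/article.pdf.txt

# delete everything from the first blank line to the end,
# leaving only the bibtex entry at the top
sed -s '/^$/,$d' .txtfiles/article.pdf.txt
```

The `-s` flag treats each file as a separate stream, so the address range `/^$/,$` restarts for every file when the command is run on `.txtfiles/*`.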

Of course there is a lot of room for improvement in the pdftotext conversion. If you are only interested in keywords, this link might be of interest to you. You might also want to implement your own search script if you don’t like invoking `sed` and `grep` manually. I am going to post a shell script soon which does some of these things in a more automated way.

The correlation coefficient between two variables has an expectation value of zero if the two variables are statistically independent of each other. However, when calculated from a finite sample of the two variables, the correlation coefficient might be spuriously different from zero. The magnitude of such spurious correlation coefficients increases if the variables are autocorrelated, as shown by this plot:

R code:

```r
N <- 150
K <- 1000
cc1 <- vector()
cc2 <- vector()
cc3 <- vector()
#calculate K corr. coeffs for each process
for (k in seq(K)) {
  #white noise
  x <- rnorm(N)
  y <- rnorm(N)
  cc1 <- rbind(cc1,cor(x,y))
  #AR 1
  x <- arima.sim(model=list(order=c(1,0,0),ar=0.5),n=N)
  y <- arima.sim(model=list(order=c(1,0,0),ar=0.5),n=N)
  cc2 <- rbind(cc2,cor(x,y))
  #random walk
  x <- cumsum(rnorm(N))
  y <- cumsum(rnorm(N))
  cc3 <- rbind(cc3,cor(x,y))
}
#plot corr coeff histograms
jpeg("cc-hist.jpg",width=400,height=700,quality=100)
par(mar=c(5,5,5,2),cex.axis=2,cex.lab=2,cex.main=2)
layout(seq(3))
hist(cc1,breaks=seq(-1,1,.1),main="white noise",xlab="")
hist(cc2,breaks=seq(-1,1,.1),main="AR(1)",xlab="")
hist(cc3,breaks=seq(-1,1,.1),main="random walk",xlab="corr. coeff.")
dev.off()
```

For the above plot, correlation coefficients were calculated between independent realizations of white noise, of first-order autoregressive (AR(1)) processes with coefficient 0.5, and of random walks. The length of each time series is 150, and 1000 correlation coefficients were calculated for each process.

Here are examples of the three processes:

For white noise, the histogram over the observed correlation coefficients is sharply peaked around zero. In most cases, the correlation coefficient is very close to zero; its absolute value rarely exceeds 0.25. For the AR(1) process, the histogram looks similar, but coefficients larger than 0.1 are observed more often than for white noise. In the case of random walks, the picture changes completely. You are about as likely to observe a coefficient between 0 and 0.1 as one between 0.7 and 0.8, even though the processes are completely independent of each other. The probability of observing spurious correlation increases if the variables are autocorrelated.
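The sharp peak for white noise is consistent with a standard result: for two independent series of length N, the sample correlation coefficient has a standard deviation of roughly 1/sqrt(N), so for N = 150 values beyond 0.25 (about three standard deviations) are rare. A quick numerical check (my own addition, not from the original post):

```r
# sampling standard deviation of cor() for independent white noise
set.seed(1)
N <- 150
cc <- replicate(1000, cor(rnorm(N), rnorm(N)))

sd(cc)     # close to the theoretical value below
1/sqrt(N)  # roughly 0.082
```

For autocorrelated series, the effective number of independent observations is smaller than N, which widens the histogram; for random walks it collapses entirely.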

The lasso is a popular statistical method for constructing prediction models for a target y based on a linear superposition of inputs x_1, …, x_M. That is, the lasso is a method to find “good” coefficients beta_1, …, beta_M for the model

y = beta_0 + beta_1 x_1 + … + beta_M x_M + error

One of the great strengths of the lasso is that it can detect inputs which contain no information about the target y; the lasso sets the corresponding beta_j to exactly zero. This property makes the lasso a workhorse of statistical modeling because it can produce parsimonious models if many potential predictors are available. In R, the package `glmnet` contains an efficient implementation of the lasso. If the variable `Y` is the target and the variable `X` is the matrix of inputs, which satisfies `length(Y)=dim(X)[1]`, a lasso model can be fit by

```r
lasso.model <- cv.glmnet(x=X,y=Y)
```

and the coefficient vector `beta` is extracted by

```r
beta <- coef(lasso.model,s=lasso.model$lambda.min)
```

If there is no relation between `Y` and the columns of `X`, all elements of `beta` except the first one should be equal to zero. However, spurious correlation may lead to some of the coefficients being different from zero, thus suggesting that some of the inputs in `X` do contain explanatory power with respect to `Y`. The lasso (at least its implementation in `glmnet`) is sensitive to such spurious correlations, as the following histograms show.

R code:

```r
library(glmnet)
library(multicore)
N <- 150
M <- 50
K <- 500
#function that returns number of nonzero coefficients
#for three lasso models (length N, number of regressors M),
#where x and y are white noise, AR1 processes, or random walks
#x and y are always unrelated
nlasso <- function(dummy) {
  #white noise target, white noise inputs
  x <- matrix(rnorm(N*M),ncol=M)
  y <- rnorm(N)
  m <- cv.glmnet(x=x,y=y)
  coefz <- coef(m,s=m$lambda.min)[-1]
  ncoef1 <- length(which(coefz!=0))
  #AR1 target, AR1 inputs
  x <- lapply(seq(50),function(z)
    arima.sim(model=list(order=c(1,0,0),ar=.5),n=N))
  x <- apply(as.matrix(x),1,unlist)
  y <- as.vector( arima.sim(model=list(order=c(1,0,0),ar=.5),n=N) )
  m <- cv.glmnet(x=x,y=y)
  coefz <- coef(m,s=m$lambda.min)[-1]
  ncoef2 <- length(which(coefz!=0))
  #random walk target, random walk inputs
  x <- matrix(rnorm(N*M),ncol=M)
  x <- apply(x,2,cumsum)
  y <- cumsum(rnorm(N))
  m <- cv.glmnet(x=x,y=y)
  coefz <- coef(m,s=m$lambda.min)[-1]
  ncoef3 <- length(which(coefz!=0))
  #return
  c(ncoef1,ncoef2,ncoef3)
}
#get the numbers K times and calculate histograms
ncoef <- mclapply(seq(K),nlasso,mc.cores=8)
ncoef <- t(apply(as.matrix(ncoef),1,unlist))
h1 <- hist(ncoef[,1],breaks=seq(-.5,M+.5),plot=F)
h2 <- hist(ncoef[,2],breaks=seq(-.5,M+.5),plot=F)
h3 <- hist(ncoef[,3],breaks=seq(-.5,M+.5),plot=F)
#plot
jpeg("ncoef-hist.jpg",width=400,height=700,quality=100)
par(mar=c(5,5,5,2),cex.axis=2,cex.lab=2,cex.main=2)
layout(seq(3))
plot(seq(M+1),h1$counts,main="white noise",xlab="",ylab="Frequency",type="s")
plot(seq(M+1),h2$counts,main="AR(1)",xlab="",ylab="Frequency",type="s")
plot(seq(M+1),h3$counts,main="random walk",
     xlab="number of nonzero lasso coeffs.",ylab="Frequency",type="s")
dev.off()
```

The histograms give an idea of how many of the 50 possible coefficients are non-zero. Ideally, all coefficients should vanish because by construction there is no relation between inputs and target. In the case of independent white noise, the lasso almost always sets them all to zero; only on rare occasions does the number of nonzero coefficients exceed 5. In the case of an AR(1) target and AR(1) inputs, coefficient vectors `beta` with more than 20 nonzero coefficients occur quite frequently. This is even worse in the case of random walks: the smallest number of nonzero coefficients observed in the 500 experiments is 18! In some cases all 50 coefficients are nonzero, that is, the lasso assigns predictive skill to all 50 inputs, even though none of them has any relation to the target.

The effect will become weaker as the length of the time series increases. Furthermore, there might be implementations of the lasso that are able to take such spurious correlations into account. The bottom line is that in prediction problems autocorrelation can be useful but if it is ignored it can (and will) lead to dubious statistical models.

Now you analyze some more financial time series using the same technique and find similar behavior. Power laws all over the place. You get even more excited. In the paper about the analysis (which you submit to a physics journal) you may throw in buzzwords like “scale-free”, “critical phase transition”, and “universality”. You can also add to your CV that you contributed to the understanding of market dynamics.

An analysis of this sort was published in PNAS 108(19) [1]. The authors look at “trend switching in financial markets”. They analyze the behavior of financial time series between “turning points” and find power laws everywhere.

But then somebody else [2] applied the same analysis to artificial data, simple Brownian motion, and discovered power laws as well. What does that mean? Is this effect not a bit too “universal” if it also occurs in integrated noise? As it turned out, the observed power laws were an artefact of the statistical analysis. They have nothing to do with critical phase transitions or scale-freeness of financial market dynamics.

I reproduced the above mentioned analysis in R using the following script (feel free to use it to discover your own power laws):

```r
#produce time series
N <- 1e6
x <- cumsum(rnorm(N+1))
dt <- 5
loc.max <- rep(0,N)
#find local maxima of order dt
for (i in seq(dt+1,N-dt)) {
  w <- x[(i-dt):(i+dt)]
  if ( order(w,decreasing=T)[1]==dt+1 ) loc.max[i] <- 1
}
#if w is the time difference between two local
#maxima l1 and l2, save all possible snippets
#l1...(w)...l2...(w)... of length 2*w+1 from x
loc.max.pos <- which(loc.max==1)
loc.max.N <- length(loc.max.pos)
profiles <- list()
for (j in seq(loc.max.N-5)) { #skip last 5 to keep indices below N
  w <- x[loc.max.pos[j]:(2*loc.max.pos[j+1]-loc.max.pos[j])]
  w <- w - min(w) #"normalize"
  profiles[[j]] <- w
}
#transform all profiles so they have the same length len.max
#and average over them
len.max <- 1000
profile.avg <- rep(0,len.max)
for (p in profiles) {
  p.len <- length(p)
  rep.vec <- rep( floor(len.max/p.len), p.len )
  rep.rest <- len.max - sum(rep.vec)
  rep.indplus1 <- sample(seq(p.len),rep.rest)
  rep.vec[rep.indplus1] <- rep.vec[rep.indplus1] + 1
  # now rep(p,rep.vec) returns a vector that "looks like" p
  # but has length len.max
  profile.avg <- profile.avg + rep(p,rep.vec)
}
profile.avg <- profile.avg/length(profiles)
#plot averaged profile
jpeg("avgProfile.jpg",width=400,height=400,quality=100)
t <- seq(0,2,length.out=len.max)
plot(t,profile.avg,type="l",ylab="x")
dev.off()
#plot in loglog-axes
jpeg("powerlaw.jpg",width=400,height=400,quality=100)
plot(NULL,xlim=c(-2.2,-0.5),ylim=c(.5,0.8),
     xlab=expression(log[10]*"|"*t-1*"|"),ylab=expression(log[10]*"x"))
ind.p <- seq(0.5*(len.max+1),len.max)
points(log10(abs(t[ind.p]-1)),log10(profile.avg[ind.p]),
       col="orange",lwd=3,pch=15)
ind.m <- seq(0.5*(len.max+1),1)
lines(log10(abs(t[ind.m]-1)),log10(profile.avg[ind.m]),col="blue",lwd=2)
#fit straight line over range of interest
fit.ind.p <- ind.p[12:150]
beta <- lm( log10(profile.avg[fit.ind.p])~log10(abs(t[fit.ind.p]-1)) )$coefficients
#... and plot it
lines(log10(abs(t[fit.ind.p]-1)),
      beta[1]+.02+beta[2]*log10(abs(t[fit.ind.p]-1)),lwd=2,lty=2)
text(x=-1,y=.75,labels=paste("beta=",round(beta[2],digits=4)))
legend(x=-2.1,y=.65,c("right slope","left slope"),
       lty=c(1,1),lwd=c(2,2),pch=c(15,-1),col=c("orange","blue"))
dev.off()
```

In the script, I analyzed a time series of Brownian motion of length 1 million. A point x[i] in the time series is labelled a local maximum if it is the largest value in the window x[(i-dt):(i+dt)], with dt=5. Then the original time series is cut into snippets: each snippet starts at a local maximum and has length 2w+1, where w is the distance to the next local maximum in the time series. Of course, the different snippets have varying lengths. In order to make them comparable, they are artificially stretched into series of equal length and shifted such that their minimum is equal to zero. The average over all snippets is the quantity of interest, plotted here:

The original idea was that this profile can be interpreted as characterizing the average behavior of the financial time series between and after turning points (local maxima). The slopes to the left and to the right of the central peak follow power laws, as shown by fitting a straight line in a log-log plot:

It bothers me a bit that, unlike what is reported in [1] and [2], in my analysis the coefficients corresponding to the two sides of the peak come out the same. But the authors were not very specific about how the “stretching” of the individual snippets was actually performed. It seems my way of doing it (the `rep.vec` construction in the code above) made the results even more “universal”.

The reply which uncovered the mistake ultimately got rejected by PNAS, with the explanation that it did not add significantly to the field. A funny part of the story is the letter [3] that one of the authors of the rejected reply sent to the editor of PNAS. He complains about the bad state of science if simpler explanations of theories get rejected for being too boring. The most remarkable sentence is this:

… a fundamental error can remain published as “truth” in PNAS without the normal debate that should be the domain of real science. … In other words, we can add “shit” to the field but we cannot correct and remove “shit” from the field

It probably helps to be a famous professor to get away with writing a letter like this to the editor of PNAS.

I think publishing an analysis whose interpretation is not warranted by the data is nothing to be ashamed of. As can be nicely seen, the scientific process works: the error is found and reported. Future researchers are (ideally) warned and hopefully won’t make the same mistake.

However, rejecting the one who uncovers the error and offers a simpler, but less exciting explanation for the phenomenon, on grounds of not contributing anything new, is indeed something to be ashamed of. Shame on you, PNAS.

Another lesson learnt from this story: if you have performed any statistical analysis that is more complex than calculating a mean and a standard deviation, you should perform the same analysis on noise to make sure that whatever effect you observe is indeed a unique feature of your data and not an artefact of the analysis.

References:

[1] Preis et al (2011) “Switching processes in financial markets” PNAS doi: 10.1073/pnas.1019484108

[2] Filimonov & Sornette (2011) “Spurious trend switching phenomena in financial markets”, arxiv.org/abs/1112.3868

[3] D. Sornette (2011) Letter to the editor of PNAS

I was once told that the reason that such a shape was so commonly used for aeroplane wings was merely that then one could study it mathematically by just employing the Zhoukowski transformation. I hope that this is not true!

(R. Penrose, “The Road to Reality”, p.150)

Penrose here talks about a complex holomorphic mapping also known as the aerofoil transformation.

What you need is a circle in the complex plane, described by the complex function z(θ) = c + ρ e^(iθ), that passes through the point z = -1. Then the transformation

w = (z + 1/z)/2

transforms this circle into the cross section of an airplane wing.

Let’s look at this in R. Fortunately, R has some complex number arithmetic already built in. A suitable circle is constructed by

z <- complex(mod=2,argument=seq(0,2*pi,len=100))+(sqrt(0.5)+1i)

and the plot below (Fig. 8.15 in “Road to Reality”) is produced by

par(mfrow=c(1,2), mar=rep(3,4))
plot(z, type="l", asp=1, main="z-plane")
plot((z+1/z)/2, type="l", main="w-plane", asp=0.5)

Maybe at some point I will also understand what Penrose means by

… the (idealized) airflow around [the wing] can be directly obtained from that around a ‘wing’ of circular cross-section

Is it true that a vector field on the surface of the circle directly transforms into the wind field around the wing? How do you transform a vector field by the above equation? Does the transformed wind field explain why the plane flies? Would be nice indeed, but this is another post.

The thing is, it does make a difference. If you stick, you have a probability of 1/3 of winning the car, but if you change, your chances go up to 2/3. The idea is that the showmaster conveys information about where the car is by opening a wrong door. There are mathematical arguments as to why you should change. If you find that unintuitive, you are in good company. Why should you be smarter after the showmaster has opened a wrong door than before? An intuitive explanation of why the showmaster conveys information is this:

Imagine there were not 3 but 1000 doors to begin with. There is still one car but now there are 999 goats. You get to choose a single door initially. Now the showmaster does not open 1 door but 998 doors, all with goats, and you are again left with your initial choice and one more door. Do you stick to your choice now or do you change to the door the showmaster did not open? He clearly gave you a hint by not opening that particular door.
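If the arguments still leave you unconvinced, you can simply simulate the game. A quick sketch in R; note that changing wins exactly when the initial pick was wrong, so the showmaster's door choice need not be modelled explicitly:

```r
set.seed(42)                          # arbitrary seed
n <- 1e5                              # number of simulated games
car  <- sample(3, n, replace = TRUE)  # door hiding the car
pick <- sample(3, n, replace = TRUE)  # contestant's initial choice

# sticking wins iff the initial pick was right; changing wins iff it
# was wrong, because the showmaster removes the only other goat door
p_stick  <- mean(pick == car)
p_change <- mean(pick != car)

round(c(stick = p_stick, change = p_change), 2)
```

The two estimates come out close to 1/3 and 2/3, as claimed.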

If you were standing at arm’s length from someone and each of you had one percent more electrons than protons, the repelling force would be incredible. How great? Enough to lift the Empire State Building? No! To lift Mount Everest? No! The repulsion would be enough to lift a “weight” equal to that of the entire Earth!

Really? Where is that back of the envelope? Alright here we go:

Coulomb’s law: F = k q_1 q_2 / r^2,

where k ≈ 9 × 10^9 N m^2/C^2, q_1 = q_2 = e ≈ 1.6 × 10^-19 C for two electrons, and r = 1 m, since the distance is “arm’s length”. So we have F_1 = k e^2 / r^2 ≈ 2.3 × 10^-28 N if the two persons had one excess electron each. A single electron in person A’s body feels the force of each of the N individual excess electrons in person B’s body, that is N F_1. The total force on person A is the sum of the forces of person B on each of its N excess electrons, that is F = N^2 F_1.

I model a person as a piece of sugar (180 g, one mole of glucose). A single molecule has 96 electrons, so one percent excess is about one electron per molecule. Each such molecule weighs about 180 atomic mass units, which is about 3 × 10^-25 kg. The person is thus made up of about N ≈ 6 × 10^23 sugar molecules, and one percent excess electrons amounts to a charge of Q = N e ≈ 10^5 C. So we have F = N^2 F_1 = k Q^2 / r^2 ≈ 8 × 10^19 N.

The Earth is an iron sphere with radius R ≈ 6400 km and density ρ ≈ 8000 kg/m^3. So it weighs about M = (4/3) π R^3 ρ ≈ 9 × 10^24 kg, which is the equivalent of a gravitational force of Mg ≈ 9 × 10^25 N, which is more than the repelling force between the two charged sugar cubes.

We are off by 6 orders of magnitude. We might get them right by modeling the person as a wet sugar cube or something: a thousand times more mass means a thousand times more excess charge, and thus a million times more force. Anyway, the analogy is nice, even though we might only be able to lift a fraction of the Moon after all.
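The whole back of the envelope also fits into a few lines of R. The 180 g sugar person and the rounded constants are of course assumptions of this sketch:

```r
k <- 8.99e9      # Coulomb constant, N m^2 / C^2
e <- 1.602e-19   # elementary charge, C
u <- 1.66e-27    # atomic mass unit, kg
g <- 9.81        # gravitational acceleration, m / s^2
r <- 1           # "arm's length", m

m_person <- 0.18              # person modelled as 180 g of sugar (assumption)
N <- m_person / (180 * u)     # number of molecules = number of excess electrons
F_rep <- k * (N * e)^2 / r^2  # repulsion between the two charged "persons"

R_earth <- 6.4e6              # radius of the Earth, m
rho     <- 7900               # density of iron, kg / m^3
F_earth <- 4/3 * pi * R_earth^3 * rho * g  # "weight" of an iron Earth

c(F_rep = F_rep, F_earth = F_earth, ratio = F_earth / F_rep)
```

The ratio comes out at about a million, the 6 orders of magnitude mentioned above.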

The current state of the rotor is completely described by the current angle of rotation θ and the current angular momentum L. We shall be interested only in the state of the rotor right after a kick. All quantities evaluated right after the kick are indicated by a prime.

If the rotor rotates at an angular velocity ω, the angle of rotation changes between two kicks by ωT, where T is the time between kicks. Only the component of the kick in the direction of motion is transferred. Therefore the angular momentum increases by κ sin(θ′) as a result of the kick, where κ is the kick strength. We have

θ′ = θ + ωT,
L′ = L + κ sin(θ′),

where ω = L/I is the angular velocity after the previous kick and I is the moment of inertia. We can make the above equations dimensionless (unitless) by dividing the angular momentum by a typical angular momentum of the system. This typical angular momentum is given by L₀ = Iω₀, where ω₀ = 1/T is the typical velocity of the system. We divide the L-equation by L₀ and get

l′ = l + K sin(θ′).

The dimensionless angular momentum l = L/L₀ is the rotor’s angular momentum measured in units of the typical angular momentum L₀. We have summarized some parameters into the constant K = κT/I.

Now, the (already dimensionless) angle equation can be manipulated a little. We notice that ωT = (L/I)T, which is the same as L/(Iω₀) = L/L₀, which is simply the dimensionless angular momentum l after the previous kick. So we now have

θ′ = θ + l,
l′ = l + K sin(θ′),

where we can, without loss of generality, apply mod 2π to the angle equation, and thus (since the sine is 2π-periodic) also to the angular momentum equation.

This is the famous Chirikov standard map which describes the behavior of the kicked rotor right after the kicks. It maps a phase space point before the kick to its corresponding point after the kick.

What follows is an R-script that chooses 1000 random points on the square [0, 2π] × [0, 2π] and uses each point as the starting point to draw 1000 iterations of the standard map:

smap <- function(t, l) {
  K <- 1
  for (i in seq(1e3)) {
    t[i+1] <- (t[i] + l[i]) %% (2*pi)
    l[i+1] <- (l[i] + K*sin(t[i+1])) %% (2*pi)
  }
  cbind(t, l)
}

par(mar=rep(0,4))
plot(NULL, xlim=c(0,2*pi), ylim=c(0,2*pi), axes=FALSE, xlab="", ylab="")
for (i in seq(1000)) {
  t <- c(runif(1)*2*pi, runif(1)*2*pi)
  m <- smap(t[1], t[2])
  points(m[,1], m[,2], col=rgb(runif(1),runif(1),runif(1)), pch=".", cex=2)
}

And this is the output:
