--- title: "Tutorial on the snha package" shorttitle: "snha package tutorial" author: Detlef Groth, University of Potsdam, Germany date: 2023-09-04 abstract: > The *snha* package provides easy to use R functions to apply the St. Nicolas House Analysis to your data. The algorithm traces associations chains between interactiving variables. The algorithm was described recently by Groth et. al. (2019) @Groth2019 and more detailed by Hermanussen et. al. (2021) @Hermanussen2021. In this package vignette the basic workflow for analyzing your and raw data and as well for analysing precomputed correlation matrices is demonstrated. bibliography: bibliography.bib csl: nature-biotechnology.csl output: function () { rmarkdown::html_vignette(toc=TRUE,css="style.css") } vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Tutorial on the snha package} %\usepackage[utf8]{inputenc} --- ## Introduction The package `snha` explores interacting variables by searching association chains where correlation coefficients between variables drop in a regular order between a set of variables. The package can be used by calling the function `snha` with your data, where the columns must be your variables. The return value is an object of class `snha` which can be visualized using a plot function. The details of the analysis can be inspected by looking at the internal variables of this object. Below follows a minimal analysis for the `birthwt` data from the _MASS_ R package. The variables are: * _age_ - mother's age in years. * _lwt_ - mother's weight in pounds at last menstrual period. * _race_ - mother's race (1 = white, 2 = black, 3 = other). * _smoke_ - smoking status during pregnancy (0 = no, 1 = yes). * _ptl_ - number of previous premature labours. * _ht_ - history of hypertension (0 = no, 1 = yes). * _ui_ - presence of uterine irritability (0 = no, 1 = yes). * _ftv_ - number of physician visits during the first trimester. * _bwt_ - birth weight of child in grams. Let's start with the data preparation. For illustrative purposes we add a random data vector as well: ```{r label=ddataprep} set.seed(125) ### retrieve the data library(MASS) data(birthwt) birthwt$low=NULL ### remove column for the low indicator ### which is 1 i a child has low birtwt rnd=round(rnorm(nrow(birthwt),mean=10,sd=2),2) ### rnd just contains random data birthwt=cbind(birthwt,rnd=rnd) # adding it head(birthwt) ``` OK, we are ready to go: We added the random data column `rnd` and removed the redundant column `low` which indicated low birth weight, for this we have the `bwt` column in the data set. So we do not need a redundant variable. Let's now first start for illustrative purposes with a PCA and then with our SNHA method where we use Spearman correlation as it is more robust against outliers than Pearson correlation, we set the p-value threshold, _alpha_ to 0.1 as the algorithm is very resistant against the detection of spurious correlations. ```{r label=plotsnha, fig.width=10,fig.height=5,fig.cap="Birth weight data variable interactions"} opar=par(mfrow=c(1,2),mai=c(0.8,0.8,0.1,0.2)) library(snha) ### retrieve some data pca=prcomp(t(scale(birthwt))) summary(pca) plot(pca$x[,1:2],xlab='PC1', ylab='PC2',pch=19,cex=5,col='salmon') text(pca$x[,1:2],colnames(birthwt)) text(-5,-10,"PCA",cex=2) as=snha(birthwt,method="spearman",alpha=0.1) par(mai=c(0.8,0.2,0.1,0.2)) plot(as,layout="sam",vertex.size=7,lwd=3,edge.width=3) text(-1.5,-1.8,"SNHA",cex=2) box() par(opar) ``` In the PCA plot on the left, the most import variables, having high values in the first component, are on the left and right borders of the plot, unimportant variables are in the center, negatively associated deeply interacting variables such as birthweight (bwt) and premature labours (ptl) are on opposite sides of the plot. These characteristics of the PCA plot make it hart to follow the variable relations. In contrast the variables in the SNHA graph on the right show immediately logical interactions, the birth weight is positively associated to mothers last weight, and negatively to smoking, premature labours and uterine irritability, white people smoke more and white mothers visit more often physicians ... The older the mother the more visits at physicians and hypertension is positively associated with weight of the mother. What are the R-square values, the prediction power for every node based on linear models and what are the connections between the variables stored in the adjacency matrix 'theta': ```{r} round(snha_rsquare(as),2) as$theta ``` It can be seen that the overall strength of the association is very small, largest r-square value is 0.18 for smoke, 0.13 for birthweight (bwt) but still the analysis show reasonable results without having the necessity of finding some optimal threshold. ## Decathlon data Here is an other example where we analyze the relationship between the different decathlon disciplines with athletes taking part in the 1988 Olympics and which had results above 7000 points. We perform the St. Nicolas House Analysis and later check the average R-square values for the each node. ```{r label=dec,fig.width=9,fig.height=6,fig.cap="SNHA - Decathlon Data 1988"} ### data loading data(decathlon88) head(decathlon88) A=snha(decathlon88,method="spearman",alpha=0.1) cols=rep("salmon",10) cols[names(A$data) %in% c("jave","shot","disc","pole")]="skyblue" plot(A,layout="sam",vertex.color=cols,vertex.size=8,cex=1.1,edge.width=5) snha_rsquare(A) mn=mean(snha_rsquare(A)) title(paste("R-square = ",round(mn,2))) ``` As you can see the variables nicely separates between disciplines related to the upper part of the body (blue) and disciplines where the legs do most of the work (salmon). The mostly hated 1500m run is negatively associated to the throwing disciplines. The running distances are building a chain _100-400-1500m_ as expected and the jump disciplines are close to each other high-jump (high), hurdles (X110), long jump (long) and pole. As you can see the variables are just in their logical order. ```{r} round(A$sigma,2) round(A$p.value,3) ``` For illustrative purposes create a graph with the same layout but with edges showing all significant correlations. ```{r label=dec2,fig.width=9,fig.height=6,fig.cap="Decathlon Data 1988 (p-value Graph)"} B = A$theta B[]=0 B[A$p.value<0.05]=1 diag(B)=0 plot.snha(B,layout='sam',vertex.color=cols,vertex.size=8,cex=1.1,edge.width=5) ``` As you can see the major relationships are the same, but there are a few more edges which did however not enhance the overall data structure. In case of really interacting variables it would be as well difficult to distinguish between direct and indirect associations, as the latter can be as well very easily become significant if the primary interaction is highly significant. ## Swiss dataset example Let's finish with an other data set, the `swiss` data which are available in every R installation. Here we try out both the correlation methods, Spearman and Pearson correlation. We use the function `snha_layout` to determine a layout matrix which we will then reuse for both plots. ```{r label=plot,fig.width=10,fig.height=5,fig.cap="Swiss data variable associations"} library(snha) data(swiss) head(swiss,4) ### shorter names useful for display later in the graph colnames(swiss)=abbreviate(colnames(swiss)) head(swiss,4) opar=par(mfrow=c(1,2)) ### options(warn=-1) as=snha(swiss,method="pearson") ### store layout for reuse in two graphs lay = snha_layout(as,mode="sam") plot(as,layout=lay,vertex.size=8,main="Pearson") as=snha(swiss,method="spearman") plot(as,layout=lay,vertex.size=8,main="Spearman") par(opar) ``` Here is the resulting adjacency matrix: ```{r label=theta,results='asis'} knitr::kable(as$theta) ``` As you can see the structure remains the same, but Pearson correlation shows more edges, we should check if the data are normally distributed. Again, without playing around with some parameters or thresholds we get immediately the general associations between the data. Let's just check if the data are normally distributed and then conclude if we should use Spearman correlation for non-normally distributed data or Pearson correlation for normally distributed data: ```{r, result="asis"} ### prepare a test returning only p-values mtest = function (x) { return(shapiro.test(x)$p.value) } df=data.frame(orig=round(apply(swiss,2,mtest),3)) df=cbind(df,log2=round(apply(log2(swiss),2,mtest),3)) knitr::kable(df) ``` As you can see, both with the original data and as well with the log-normalized data the Shapiro-Wilk test has a few significant entries, so we reject the Null-hypothesis that these data are coming from a normal distribution. So for our example using the `swiss` data we should very likely prefer using the Spearman correlation. ## Plotting The plotting of the graph can be changed in various ways, for details see `?plot.snha`. Here I just give a few examples. As the graph is generated based on the underlying pairwise correlations, it might be useful to display the pairwise correlation either in a correlation plot or by adding the correlations values on the edges of the graph. Here an example where we do first a correlation plot and then a plot of the SNHA graph overlaying the edges with the correlation values. ```{r label=corrplot,fig.width=10,fig.height=5,fig.cap="Correlation and Network plot with correlation values on the edges"} opar=par(mfrow=c(1,2),mai=rep(0.2,4)) sw=snha(swiss,method="spearman",alpha=0.1) plot(sw,type="corrplot") plot(as,edge.text=round(as$sigma,2),edge.pch=15,layout='sam') par(opar) ``` ## Log-Likelihood The edge quality can be judged either by the log-likelihood ratio for the individual chains or by bootstrapping where we look how often a certain chain was found if we do re-samplings with our data set. Let's first calculate the log-likelihoods for the different chains which were found. We can see the underlying chains either directly using the internal object chains or by using the `snha_get_chains` method which returns a data frame: ```{r label=chains} snha_get_chains(as) ``` The `m` in front of a chain name indicated that the chain was found by investigating the variable to be in the middle of a chain, the `a` indicated that the chain was at the beginning of the investigated chain. For details on the algorithm have a look at Hermanussen et. al. (2021[@Hermanussen2021]). The log-likelihood for these chains can be calculated using the function `snha_ll` like this: ```{r label=ll} snha_ll(as) ``` The relevant p-values are in the last column, if the p-value is higher than 0.05 we can assume that the chain is sufficient to capture the dependency between the variables of the chain. Here for the chain 2 and 3 this is the case, whereas for the first chain the p-value is very low indicating that the chain is not sufficient to capture the variable dependencies. One reason might be that we used Spearman correlation to create the graph whereas log-likelihood assumes linear dependencies. ## Bootstrapping Another approach to evaluate the quality of chains and edges is bootstrapping. We sample several times items from the data set with replacement and we redo thereafter the analysis with each of the samples. Edges which appear only very rarely are less likely to be of importance and significance. Let's use an example: ```{r label=boot,fig.width=9,fig.height=4,fig.cap="Boostrap Example"} opar=par(mfrow=c(1,2),mai=c(0.1,0.1,0.7,0.1)) as.boot=snha(swiss,method="spearman",prob=TRUE) lay=snha_layout(as.boot,method="sam") plot(as,layout=lay,vertex.size=6,main="Single Run") plot(as.boot,layout=lay,vertex.size=6,main="Bootstrap Run") par(opar) ``` Solid lines shown in the graph above indicate that edges where found in more than 75 percent of all re-samplings, broken lines indicate edges appearing in more than 50% of all re-samplings and dotted lines in 25-50% of all re-samplings. As you can see the bootstrap method does find a few more edges than the single run variation of the `snha` method. If you network is not too large it is usually recommended to use bootstrapping to get more insights into the edge quality and to get as well edges if the network is more dense and has a lot of highly connected nodes. ## Creating your own data In order to test the algorithm there is as well in the package a function which allows you to generate data for directed and undirected graphs, either using the given adjacency matrix as precision matrix or using a Monte Carlo simulation as described by Novine et. al (2021 @Novine21). Here an example: ```{r werner} W=matrix(0,nrow=6,ncol=6,dimnames=list(LETTERS[1:6],LETTERS[1:6])) W[1:2,3]=1 W[3,4]=1 W[4,5:6]=1 W[5,6]=1 W ``` For such an adjacency matrix we can create data like this: ```{r} data=snha_graph2data(W) dim(data) round(cor(t(data)),2) ``` As you can see the correlations follow the given graph, we can as well plot these for better illustration: ```{r label=wplot,fig.width=8,fig.height=3,out.width=900,fig.cap="True graph, correlations and predicted graph (left to right)"} opar=par(mfrow=c(1,3),mai=rep(0.2,4)) plot.snha(W) plot.snha(cor(t(data)),type="cor") plot.snha(snha(t(data))) par(opar) ``` ## Comparing two analysis Sometimes you will create two different graphs from the same set of data and then you would like to visualizes these graphs and compare differences and simlarities. For instance you would like to analyse only a subset of the data or use an other statistical approach for the SNHA.. In this case it is important to keep the same layout for both plotted graphs. Here an example on how to do this: ```{r label=compareplot,fig.width=9,fig.height=3,out.width=900,fig.cap="Comparing graphs with the same layout"} as1=snha(swiss,method="pearson") as2=snha(swiss,method="spearman") ### get a layout lay=snha_layout(as1,mode="sam") par(mfrow=c(1,3),mai=rep(0.2,4)) plot(as1,layout=lay) plot(as2,layout=lay) ### highlight edges in common theta=as1$theta+as2$theta theta theta[theta<2]=0 theta[theta==2]=1 plot.snha(theta,layout=lay) ``` The last graph then contains only edges which belong to both graphs. Similarly you could highlight edges which are in graph 1 but not in graph 2 etc. That way you can explore more easily the differences between both graphs. To get some overall measures for the similarity you could do something this: ```{r} table(as1$theta[upper.tri(as1$theta)],as2$theta[upper.tri(as2$theta)]) ### so 11 out of 15 possible edges/no-edges are the same. cor(as1$theta[upper.tri(as1$theta)],as2$theta[upper.tri(as2$theta)]) ``` But that general values for the full graph does not take the more specific things, edges into consideration which you might be interested. ## Installation As long as the package is not yet on the CRAN repository the package can be usually installed using the submitted `tar.gz` archive with the following commands: ``` library(tcltk) pkgname=tclvalue(tkgetOpenFile( filetypes="{{Tar.gz files} {*.tar.gz}} {{All files} {*.*}}")) if (pkgname != "") { install.packages(pkgname,repos=NULL) } ``` It is as well possible to install the latest version directly from the Github repository like this: ``` library(remotes) remotes::install_github("https://github.com/mittelmark/snha") ``` Thereafter you can check the installation like this: ``` library(snha) citation("snha") ``` ## Background Details and Concept Analyzing multivariate data is often done using visualization of pairwise correlations, using principal component analysis or multidimensional scaling as typical methods in this area. The `snha` package provides an alternative approach, by uncovering ordered sequences of correlation coefficients which can be reversed [@Groth2019] [@Hermanussen2021]. Existing chains are translated into edges between the variables, here taken as nodes of a graph. The graph can be then visualized and the major relations between the variables are visible. The basic assumption of the method is the assumption that correlations coefficients between two variables, where one variable directly influences the other, are larger than those of secondary associations. So for instance if we assume that a variable _A_ influences a variable _B_, and _B_ influences _C_, it can be assumed, that _r(AB)_ > _r(AC)_ and that in the opposite direction _r(CB)_ > _r(CA)_. The algorithm provided in the `snha` package uncovers such association chains where the order of correlation coefficient can be reversed. The advantage of the method is that there is only a very limited requirement for choosing thresholds for instance for the p-value or for the correlation coefficient. The reason is that the existence of such association chains with the correct ordering of three or more nodes is much less likely to exists by accident then significant pairwise correlations. In the following we will first illustrate the concept on a simple hypothetical association chain and thereafter you might again study the real world examples at the beginning of this vignette with more understanding. ## Simple association chain Let's assume we have a simple association chain where a variable _A_ is influencing a variable _B_, _B_ is influencing a variable _C_ and _C_ is influencing variable _D_ like this: ```{r label=start,fig.width=6,fig.height=1.7,fig.cap="An association chain"} opar=par(mai=c(0.1,0.1,0.1,0.0)) plot(1,xlab="",ylab="",axes=FALSE,type="n",xlim=c(0.5,4.5),ylim=c(0.8,1.2)) arrows(1:3,rep(1,3),1:3+0.8,rep(1,3),lwd=3,length=0.1) points(1:4,rep(1,4),pch=19,col="salmon",cex=6) text(1:4,1,LETTERS[1:4],cex=2) par(opar) ``` In this situation we can assume that, despite of the omnipresent noise in such situation, the correlations of directly interacting variables is higher in comparison to variables only connected only via other variables. Let's assume for simplicity reasons, that the correlation between directly connected variables drops down from r=0.7 to around r=0.5 for secondary connected variables and r=0.3 for tertiary connected variables. So a possible correlation matrix could look like this: ```{r result='asis'} C=matrix(c(1,0.7,0.5,0.3, 0.7,1,0.7,0.5, 0.5,0.7,1,0.7, 0.3,0.5,0.7,1), nrow=4,byrow=TRUE) rownames(C)=colnames(C)=LETTERS[1:4] knitr::kable(C) ``` Let's now add a little bit of noise and visualize the pairwise correlations using the plot function of the `snha` package. ```{r label=corplot,fig.width=6,fig.height=3,fig.cap="Visualization of correlation matrix sigma and the adjacency matrix theta"} set.seed(123) opar=par(mfrow=c(1,2),mai=c(0.1,0.1,0.1,0.1)) C=C+rnorm(length(C),mean=0,sd=0.1) C[lower.tri(C)]=t(C)[lower.tri(C)] diag(C)=1 as=snha(C) round(as$sigma,3) plot(as,type="corplot") as$theta plot(as) par(opar) ``` As we can see, the correlations are now slightly altered. A simple *r* threshold mechanism, for instance taking only correlations larger than 0.5 into consideration would as well have false positive edges like between the nodes B and D. The function `snha` takes as input either a correlation matrix or a data matrix or data.frame and tries to find such association chains. The association chain is stored in the internal object theta and can be visualized using the default plot command. ## Summary Here are the functions to be used by the normal user of the package: * _snha_ - create a snha graph object * _plot_ - plot a snha graph object * _as.list_ - create a list out of a snha graph object, ready to write for instance into an Excel file * *snha_get_chains* - get the actual chains which were found and which build the graph * *snha_graph2data* - generate for a given adjacency matrix some data * *snha_rsquare* - get r-square values for the nodes based on linear model to have a qualitative measure for the graph prediction. The snha graph object contains a few internal variables which might be of interest for the user: * _alpha_ - the chosen p-value threshold * _chains_ - the found association chains * _data_ - the input data * _method_ - the correlation method * _p-values_ the pairwise p-values * _probabilities_ - in case of bootstrapping the proportion how often a chain was found * _theta_ - the adjacency matrix for the nodes / variables ## Build information The package was build using `r R.version.string` on `r R.version$platform` using snha package `r packageVersion('snha')`. ```{r} print(sessionInfo()) ``` ## References