Scan-Statistics-Project-4Y-.../Comparaison_of_methods.rmd

490 lines
15 KiB
Plaintext
Raw Normal View History

2022-03-29 07:17:34 +00:00
---
title: "Comparaison of methods"
output: pdf_document
---
# Scan statistique - Méthode de Monte Carlo et calcul de p-value
2022-04-10 20:58:57 +00:00
## Import libraries
```{r}
library("localScore")
library("latex2exp")
library("Rcpp")
library("caret")
2022-05-10 08:05:55 +00:00
library("ROCR")
library("pROC")
2022-04-10 20:58:57 +00:00
```
## 1. Proposition for simulations under $\mathcal{H}_1$
In this part, we propose a method that simulates a Poisson process under the hypothesis $\mathcal{H}_1$. The idea is to simulate a sample under $\mathcal{H}_0$, and add randomly a subsequence under the alternative hypothesis in this sequence.
2022-04-19 09:13:31 +00:00
The function `PoissonProcess` creates a sequence of Poisson process of a parameter lambda
2022-03-29 07:17:34 +00:00
```{r}
PoissonProcess <- function(lambda,T) {
return(sort(runif(rpois(1,lambda*T),0,T)))
}
2022-04-19 09:13:31 +00:00
```
2022-03-29 07:17:34 +00:00
2022-04-19 09:13:31 +00:00
The following function creates a sequence under H0 and add a sequence under H1.
```{r}
SimulationH1 <- function(lambda0, lambda1,T,tau){
ppH0=PoissonProcess(lambda0,T)
ppH1.segt=PoissonProcess(lambda1,tau)
dbt=runif(1,0,T-tau)
ppH0bis=PoissonProcess(lambda0,T)
ppH1.repo=dbt+ppH1.segt
ppH0_avant=ppH0bis[which(ppH0bis<ppH1.repo[1])]
ppH0_apres=ppH0bis[which(ppH0bis>ppH1.repo[length(ppH1.repo)])]
ppH1=c(ppH0_avant,ppH1.repo,ppH0_apres)
return (ppH1)
}
2022-04-05 07:17:58 +00:00
```
2022-03-29 07:17:34 +00:00
2022-04-19 09:13:31 +00:00
`TimeBetweenEvent` compute Time Between Event for a `pp` interval.
2022-04-05 07:17:58 +00:00
```{r}
TimeBetweenEvent <- function(pp){
n=length(pp)
tbe=pp[2:n]-pp[1:n1-1]
tbe=c(0,tbe)
return (tbe)
}
2022-03-29 07:17:34 +00:00
DataFrame <- function(pp,tbe){
list=data.frame(ProcessusPoisson=pp, TimeBetweenEvent=tbe)
}
```
## 2. Simulation of the sequences under $\mathcal{H}_0$ via a Monte Carlo Method
2022-04-19 09:13:31 +00:00
In this part, we will try to simulate, using a Monte Carlo method, a set of $10^5$ independant samples, under the assumption that $\lambda=\lambda_0$, hence, that we are under the null hypothesis $\mathcal{H}_0$.
The function `ScanStat` compute the scan statistic for a sequence, given some parameters $T$ and $\tau$. This function returns the value of the scan stat, and the index of the sequence where it happens
```{r}
ScanStat <- function(pp, T, tau){
n=length(pp)
stop=n-length(which(pp>(T-tau)))
ScanStat=0
for (i in (1:stop)) {
x=which((pp>=pp[i])&(pp<=(pp[i]+tau)))
scan=length(x)
2022-04-19 09:13:31 +00:00
if (scan>ScanStat) {ScanStat=scan
max=i}
}
2022-04-19 09:13:31 +00:00
return (c(max,ScanStat))
}
```
2022-03-29 07:17:34 +00:00
We test the scan statistic method for different values of $\lambda_0$. The method of scan statistic we implemented will allow us to have access to the scan test statistic and where it happens in the sequence.
2022-04-19 09:13:31 +00:00
This function `EmpDistrib` compute the empirical distribution using a Monte Carlo estimator for the scan statistic method. It returns a Data Frame, containing the value of the scan, its probability and the value of its cumulative distribution function.
```{r}
EmpDistrib <- function(lambda, n_sample,T,tau){
pp=PoissonProcess(lambda,T)
scan=c(ScanStat(pp,T, tau)[2])
index=c(ScanStat(pp,T, tau)[1])
for (i in 2:(n_sample)){
pp=PoissonProcess(lambda,T)
scan=rbind(scan,ScanStat(pp,T, tau)[2])
index=rbind(index,ScanStat(pp,T, tau)[1])
}
2022-05-09 21:11:16 +00:00
scan=unlist(scan)
min_scan=min(scan)-1
max_scan=max(scan)
table1=table(factor(scan, levels = min_scan:max_scan))
EmpDis=data.frame(cdf=cumsum(table1)/sum(table1), proba=table1/sum(table1), index_scan=min_scan:max_scan)
EmpDis<-EmpDis[,-2]
return(EmpDis)
}
```
2022-04-19 09:13:31 +00:00
This function plot the cumulative distribution function associated to an empirical distribution function
```{r}
Plot_CDF <- function(lambda,n_sample,T,tau){
Emp=EmpDistrib(lambda,n_sample,T,tau)
title=TeX(paste(r'(Cumulative distribution function for $\lambda=$)', lambda))
plot(Emp$index_scan, Emp$cdf,type="s",xlab="Number of occurrences",ylab="Probability", main=title, col="red")
return(Emp)
}
```
2022-04-19 06:10:45 +00:00
### 2.1. Test of $\mathcal{H}_0: \lambda=\lambda_0$ against $\mathcal{H}_0: \lambda=\lambda_1$, where $\lambda_1 > \lambda_0$
In this part, we will test different values for $\lambda_0$ and $\lambda_1$, and compute the probability of occurrence of a certain scan statistic.
2022-03-29 07:17:34 +00:00
```{r}
#Empiricial distribution under H0
n_sample=10**4
lambda0=3
T=10
tau=1
ppH0=PoissonProcess(lambda0,T)
CDF=Plot_CDF(lambda0,n_sample,T,tau)
```
2022-03-29 07:17:34 +00:00
2022-04-19 09:13:31 +00:00
Compute $p$-value for scan statistic of `ppH1` with `Emp`:
```{r}
2022-04-19 09:13:31 +00:00
PValue <- function(Emp,ppH, T, tau){
SS = ScanStat(ppH,T,tau)
scanH = SS[2]
index_scanH = SS[1]
index = Emp$index_scan
n=length(index)
if (scanH< min(Emp$index_scan)){
return (c(scanH,1,index_scanH))
} else{
if(min(Emp$index_scan)<scanH && scanH<=max(Emp$index_scan)){
return(c(scanH,1-Emp$cdf[scanH-min(Emp$index_scan)],index_scanH))
} else{return (c(scanH,0,index_scanH))}}
}
2022-03-29 07:17:34 +00:00
```
### 2.2. Simulation under $\mathcal{H}_0$ and computation of p-values
On simule des séquences sous $\mathcal{H}_0$, que l'on stocke. On calcule la valeur de la scan stat et de la p-value, que l'on stocke aussi. On a une séquence de p-valeur des scans et une séquence de score local.
```{r}
NbSeqH0=10000
2022-04-05 07:17:58 +00:00
NbSeqH1=NbSeqH0
DataH0=vector("list")
DataH1=vector("list")
2022-04-19 09:13:31 +00:00
lambda0=2
lambda1=5
T=10
tau=1
#Creation of a sequence that contains the sequence simulated under the null hypothesis
for (i in 1:NbSeqH0) {
ppi=PoissonProcess(lambda0,T)
DataH0[[i]]=ppi
}
#Creation of a sequence that contains the sequence simulated under the alternative hypothesis
2022-04-05 07:17:58 +00:00
seqH1begin=c()
for (i in 1:NbSeqH1) {
pphi=SimulationH1(lambda0, lambda1,T,tau)
DataH1[[i]]=pphi
}
#Computation of the time between events
TimeBetweenEventList <- function(list,n_list){
TBE=vector("list",length=n_list)
for (i in (1:n_list)) {
ppi=list[[i]]
ni=length(ppi)
tbei=ppi[2:ni]-ppi[1:ni-1]
TBE[[i]]=tbei
}
return (TBE)
}
tbe0=TimeBetweenEventList(DataH0,NbSeqH0)
```
2022-04-19 09:13:31 +00:00
We compute the p-value associated to all 10000 sequences, and stock them in a vector.
```{r}
#We start by computing the empirical distribution for lambda0
2022-04-12 16:23:44 +00:00
Emp = EmpDistrib(lambda0,n_sample,T,tau)
scan = c()
pvalue = c()
index_scan = c()
#Then, we stock the p-value and the
for (i in 1:NbSeqH0){
2022-04-12 16:23:44 +00:00
ppi = DataH0[[i]]
result = PValue(Emp,ppi,T,tau)
scan = c(scan,result[1])
pvalue = c(pvalue,result[2])
index_scan = c(index_scan,result[3])
}
2022-04-12 16:23:44 +00:00
2022-04-19 09:13:31 +00:00
ScS_H0=data.frame(num=(1:NbSeqH0), scan_stat=scan, pvalue_scan=pvalue,class=c(pvalue<0.05)*1)
ScS_H0
sum(ScS_H0$class[which(ScS_H0$class=='1')])/NbSeqH0
```
2022-04-05 07:17:58 +00:00
```{r}
#We start by computing the empirical distribution for lambda0
scan=c()
pvalue=c()
index_scan=c()
#Then, we stock the p-value and the
for (i in 1:NbSeqH1){
ppi=DataH1[[i]]
result=PValue(Emp,DataH1[[i]],T,tau)
scan=c(scan,result[1])
pvalue=c(pvalue,result[2])
index_scan=c(index_scan,result[3])
}
2022-05-09 21:26:15 +00:00
ScS_H1 = data.frame(num=1:NbSeqH1, scan_stat=scan, pvalue_scan=pvalue, class=as.numeric(pvalue<0.05))
ScS_H1$begin_scan=index_scan
2022-04-19 09:13:31 +00:00
ScS_H1
sum(ScS_H1$class[which(ScS_H1$class=='1')])/NbSeqH1
2022-04-05 07:17:58 +00:00
```
2022-04-19 09:13:31 +00:00
`ScanStatMC` compute local score for `Emp`:
2022-04-12 16:23:44 +00:00
```{r}
ScanStatMC <- function(NbSeq, T, tau, Emp, pp0){
scan=c()
pvalue=c()
index_scan=c()
for (i in 1:NbSeq){
ppi=pp0[[i]]
result=PValue(Emp,ppi,T,tau)
scan=c(scan,result[1])
pvalue=c(pvalue,result[2])
index_scan=c(index_scan,result[3])
}
2022-05-09 21:11:16 +00:00
ScS_H0=data.frame(num=(1:NbSeq), scan_stat=scan, pvalue_scan=pvalue,class=as.numeric(pvalue<0.05))
2022-04-12 16:23:44 +00:00
return(ScS_H0)
}
```
2022-04-05 07:17:58 +00:00
## 3. Local score
2022-04-19 06:10:45 +00:00
### 3.1. Distribution of scores via Monte Carlo
2022-04-19 09:13:31 +00:00
`ComputeE` compute `E` coefficient:
2022-04-05 09:20:46 +00:00
```{r}
ComputeE <- function(lambda0, lambda1){
E = 1
maxXk = floor(E*(log(lambda1/lambda0)))
while (maxXk < 3) {
E = E+1
maxXk = floor(E*(log(lambda1/lambda0)))
}
return (E)
}
```
2022-04-19 09:13:31 +00:00
`ScoreDistribEmpiric` compute score for empiric distribution:
2022-04-05 09:20:46 +00:00
```{r}
ScoreDistribEmpiric <- function(lambda0, lambda1, n_sample, T){
2022-04-05 09:20:46 +00:00
E = ComputeE(lambda0, lambda1)
2022-04-12 15:58:13 +00:00
Score = c()
for (i in 1:n_sample){
ppH0 = PoissonProcess(lambda0,T)
n1 = length(ppH0)
tbe0 = ppH0[2:n1]-ppH0[1:n1-1]
X = floor(E*(log(lambda1/lambda0)+(lambda0-lambda1)*tbe0))
Score=c(Score,X)
2022-04-12 07:29:10 +00:00
}
min_X = min(Score)
max_X = max(Score)
2022-04-05 09:20:46 +00:00
P_X = table(factor(Score, levels = min_X:max_X))/sum(table(Score))
df = data.frame("Score_X" = min(Score):max(Score), "P_X" = P_X)
df <- df[,-2]
2022-04-05 09:20:46 +00:00
return (df)
}
2022-04-12 15:26:00 +00:00
```
2022-04-19 09:29:43 +00:00
```{r}
2022-04-19 09:24:55 +00:00
2022-04-19 09:29:43 +00:00
lambda0=5
lambda1=7
distrib_mc=ScoreDistribEmpiric(lambda0,lambda1,10000,T)
score_moyen=mean(distrib_mc[,1])
print(score_moyen)
score_max=max(distrib_mc[,1])
print(score_max)
score_min=min(distrib_mc[,1])
print(score_min)
amplitude=abs(score_max-score_min)
print(amplitude)
E=ComputeE(lambda0, lambda1)
print(E)
barplot(distrib_mc[,2])
```
2022-04-19 09:24:55 +00:00
2022-04-12 15:26:00 +00:00
```{r}
ScoreDistribTheo <- function(lambda0, lambda1, T){
2022-04-12 15:26:00 +00:00
E = ComputeE(lambda0, lambda1)
2022-04-12 15:58:13 +00:00
score_max = floor(E*log(lambda1/lambda0))
2022-04-19 09:24:55 +00:00
2022-04-12 15:26:00 +00:00
## score_min compute
2022-04-12 15:58:13 +00:00
score_min_c = floor(E*log(lambda1/lambda0)+E*(lambda0-lambda1)*T)
2022-04-19 09:24:55 +00:00
2022-04-12 15:26:00 +00:00
2022-04-12 15:58:13 +00:00
l = seq(score_min_c,score_max,1)
borne_inf = (l-E*log(lambda1/lambda0))/(E*(lambda0-lambda1))
borne_sup = (l+1-E*log(lambda1/lambda0))/(E*(lambda0-lambda1))
proba.l = pexp(rate=lambda0,borne_inf)-pexp(rate=lambda0,borne_sup)
S = sum(proba.l)
new.proba.s = proba.l/S
df = data.frame("Score_X" = l, "P_X" = new.proba.s)
2022-04-12 15:26:00 +00:00
2022-04-12 15:58:13 +00:00
return (df)
2022-04-05 09:20:46 +00:00
}
```
2022-04-18 14:37:55 +00:00
```{r}
2022-05-06 07:27:44 +00:00
T=10
2022-04-18 14:37:55 +00:00
distrib_score_mc=ScoreDistribEmpiric(2,3,10000,T)
2022-05-09 21:26:15 +00:00
distrib_score_theo=ScoreDistribTheo(2,3,T)
2022-04-18 14:37:55 +00:00
2022-04-19 09:24:55 +00:00
distrib_score_mc
distrib_score_theo
2022-04-18 14:37:55 +00:00
distrib_score_mc = ScoreDistribEmpiric(2,3,10000,T)
distrib_score_theo = ScoreDistribTheo(2,3,T)
plot_graph_distrib_score <- function(distrib_score_theo, distrib_score_mc){
# length(distrib_score_mc[,2])
# length(distrib_score_theo[,2])
#diff_distrib_score=abs(distrib_score_mc[,2]-distrib_score_theo[,2])
#par(mfrow = c(1,2))
barplot(distrib_score_mc[,2],col="blue",axes=F)
mtext("Distribution of scores via Monte Carlo",side=1,line=2.5,col="blue")
axis(2, ylim=c(0,10))
par(new = T)
barplot(distrib_score_theo[,2],col="red",axes=F)
mtext("Distribution of scores using the theoretical method",side=1,line=4,col="red")
}
2022-04-18 14:37:55 +00:00
2022-05-06 07:27:44 +00:00
plot_graph_distrib_score(distrib_score_theo, distrib_score_mc)
2022-04-18 14:37:55 +00:00
```
2022-04-19 06:10:45 +00:00
### 3.2. Local score calculation
2022-03-29 07:17:34 +00:00
```{r}
2022-04-12 16:28:04 +00:00
LocalScoreMC <- function(lambda0, lambda1, NbSeq, T, X_seq, P_X, tbe0){
E = ComputeE(lambda0, lambda1)
pvalue = c()
X = c()
min_X = min(X_seq)
max_X = max(X_seq)
NbSeq.NonNulles = 0
for (i in 1:NbSeq){
x = floor(E*log(dexp(tbe0[[i]], rate = lambda1)/dexp(tbe0[[i]], rate = lambda0)))
if (length(x)!=0){
X = c(X,x)
LS = localScoreC(x)$localScore[1]
daudin_result = daudin(localScore = LS, score_probabilities = P_X, sequence_length = length(x), sequence_min = min_X, sequence_max = max_X)
options(warn = -1) # Disable warnings print
pvalue = c(pvalue, daudin_result)
NbSeq.NonNulles = NbSeq.NonNulles + 1
}
}
LS_H0=data.frame(num=1:NbSeq.NonNulles, pvalue_scan=pvalue, class=(pvalue<0.05)*1)
return(LS_H0)
}
```
2022-03-29 07:17:34 +00:00
2022-04-12 16:23:44 +00:00
## 4. Experience plan for comparaison
```{r}
2022-05-10 08:05:55 +00:00
CompareMethods <- function(lambda0, lambda1, NbSeq, T, tau){
if (lambda0 < lambda1){
cat("For T = ", T, ", Nb = ", NbSeq, ", lambda0 = ", lambda0, " and lambda1 = ", lambda1, ":\n", sep = "")
tbe0 = vector("list",length=NbSeq)
2022-05-10 08:05:55 +00:00
pp0 = vector("list", length = NbSeq)
pp1 = vector("list", length = NbSeq)
tbe1 = vector("list", length = NbSeq)
for (i in (1:NbSeq)) {
#Simulation for sequences under H0
ppi = PoissonProcess(lambda0,T)
ni=length(ppi)
pp0[[i]] = ppi
tbei = ppi[2:ni]-ppi[1:ni-1]
tbe0[[i]] = tbei
#Simulation for sequences under H1
ppj1 = SimulationH1(lambda0, lambda1, T, 3)
2022-05-10 08:05:55 +00:00
nj = length(ppj1)
pp1[[i]] = ppj1
tbej = ppj1[2:nj]-ppj1[1:nj-1]
tbe1[[i]] = tbej
}
#cat("- Empiric version:\n")
Score = ScoreDistribEmpiric(lambda0, lambda1, 10**4, T)
2022-05-10 08:05:55 +00:00
LS_H0 = LocalScoreMC(lambda0, lambda1, NbSeq, T, Score$Score_X, Score$P_X, tbe0)
LS_H1 = LocalScoreMC(lambda0, lambda1, NbSeq, T, Score$Score_X, Score$P_X, tbe1)
LS_obtained = c(LS_H0$class, LS_H1$class)
options(warn = -1)
Emp = EmpDistrib(lambda0,10**4,T,tau)
2022-05-10 08:05:55 +00:00
SS_H0 = ScanStatMC(NbSeq, T, tau, Emp, pp0)
SS_H1 = ScanStatMC(NbSeq, T, tau, Emp, pp1)
SS_obtained = c(SS_H0$class, SS_H1$class)
cat("--- Confusion matrix for scan statistic method --- \n")
theoretical_results_SS = c(rep(0,length(SS_H0$num)), rep(1,length(SS_H1$num)))
print(confusionMatrix(as.factor(SS_obtained), as.factor(theoretical_results_SS),
dnn = c("Prediction", "Reference"))$table)
roc_SS = roc(theoretical_results_SS, SS_obtained)
areaSS = auc(roc_SS)
cat("Area under the ROC curve for SS = ", areaSS, "\n")
2022-05-10 08:05:55 +00:00
cat("--- Confusion matrix for local score method --- \n")
theoretical_results_LS = c(rep(0,length(LS_H0$num)), rep(1,length(LS_H1$num)))
print(confusionMatrix(as.factor(LS_obtained), as.factor(theoretical_results_LS),
dnn = c("Prediction", "Reference"))$table)
title_ROC = TeX(paste(r'(ROC curve for $H_0: \lambda_0=$)', lambda0,
r'(against $H_1: \lambda_0=$)', lambda1))
pred.SS = prediction(theoretical_results_SS,SS_obtained)
pred.LS = prediction(theoretical_results_LS,LS_obtained)
perf.SS = performance(pred.SS,"tpr", "fpr")
perf.LS = performance(pred.LS,"tpr", "fpr")
par(new=T)
roc_LS = roc(theoretical_results_LS, LS_obtained)
areaLS = auc(roc_LS)
cat("Area under the ROC curve for LS = ", areaLS, "\n")
2022-05-10 08:05:55 +00:00
cat("-----------------------------------\n")
options(warn = -1)
2022-05-10 08:05:55 +00:00
result <- c('performance.SS'= perf.SS,'performance.LS'= perf.LS)
return(result)
}
}
```
```{r}
NbSeq = 10**4
2022-05-10 08:05:55 +00:00
T = 10
tau = 2
2022-05-10 09:06:27 +00:00
list_of_lambda = list()
list_of_lambda[[1]] = c(1, 3)
list_of_lambda[[2]] = c(2, 6)
list_of_lambda[[3]] = c(4, 7)
list_of_lambda[[4]] = c(2, 9)
2022-05-10 08:05:55 +00:00
2022-05-10 09:06:27 +00:00
i = 1
legend_list = c()
2022-05-10 08:05:55 +00:00
2022-05-10 09:06:27 +00:00
for (Lambda in list_of_lambda){
lambda0 = Lambda[1]
lambda1 = Lambda[2]
result = CompareMethods(lambda0, lambda1, NbSeq, T, tau)
title_ROC = TeX(paste(r'(ROC curve for several values of $\lambda_0$ and $\lambda_1$)'))
perfSS = result[1]
perfLS = result[2]
plot(perfSS$performance.SS, lty=1, col=i, lwd = 2)
par(new=T)
plot(perfLS$performance.LS, lty=2, col=i,lwd = 2)
legend_list=c(legend_list, paste(c("lambda0 = ", lambda0, ", lambda1 = ", lambda1), collapse = ""))
i=i+1
}
legend(0.5, 0.3, legend=legend_list, col=1:length(list_of_lambda), lty=1, cex=0.9,lwd=4, box.lty=0)
```