---
title: "Evaluation of the performance of the new statistical tool"
output: pdf_document
---
\tableofcontents
In this part, we generate datasets and study the premises of the scan statistic methods.
To do so, we will use two types of data: either a dataset we create ourselves, or a dataset we studied last semester in Statistical Modeling, called _SAheart_: a compilation of 462 observations of 10 variables used to explain the occurrence of coronary heart disease.
\section{1. Study of a random dataset with a given probability distribution}
In this section, we simulate a random sample from a known probability distribution. To do so, we use the built-in R generator for the $\mathcal{U}[0,1]$ distribution and the inverse transform (Monte Carlo) method to create the sample. The idea is to see whether such a randomly generated dataset shows any irregularities.
\subsection{1.1. Simulation of random samples from a continuous probability distribution}
We first create a sample with a known distribution, for instance an exponential distribution whose parameter $\lambda$ is known. We take $\displaystyle\lambda=\frac{1}{1500}$.
For instance, let's consider that the created sample $X$ represents the lifetimes of $n=100$ bulbs. Hence, if we create a dataset following the $\mathcal{E}(\lambda)$ law, the average lifetime of a bulb is $$\mathbb{E}[X]=\frac{1}{\lambda}=1500.$$
To create this sample, we use the inverse transform method. Indeed, if $U\sim \mathcal{U}[0,1]$ and $u$ denotes a realization of $U$, then $$x=-\frac{\ln(1-u)}{\lambda}$$ is a realization of the $\mathcal{E}(\lambda)$ distribution.
```{r}
n = 100
lamb = 1/1500
u = runif(n, 0, 1)      # n realizations of U[0,1]
x = -log(1 - u)/lamb    # inverse transform: x is a realization of Exp(lamb)
```
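As a quick sanity check (a sketch added here, not part of the original analysis), the empirical mean of the simulated sample should be close to the theoretical expectation $1/\lambda=1500$:
```{r}
# Sanity check: the sample mean should be close to E[X] = 1/lambda = 1500
mean(x)
```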
This method automatically generates an $n$-sample (here $n=100$). We can verify, using the Kolmogorov-Smirnov goodness-of-fit test, that the created sample follows an exponential distribution, up to an error level $\alpha$. The alternative `two.sided` tests:
$$\mathcal{H}_0: F=F_X \quad \text{vs.} \quad \mathcal{H}_1: F\ne F_X $$
where $F$ denotes the cumulative distribution function of the exponential law $\mathcal{E}(\lambda)$, and $F_X$ the cumulative distribution function of the sample $X$.
```{r}
ks.test(x,pexp,1/1500, alternative="two.sided")
```
Here, since the p-value is larger than $\alpha=5\%$, we do not reject the null hypothesis: at confidence level $1-\alpha$, the generated dataset is compatible with an exponential distribution of parameter $\lambda$.
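To illustrate the meaning of the level $\alpha$, the following sketch (not part of the original study) estimates the rejection rate of the test over many clean simulated samples; under $\mathcal{H}_0$ it should be close to $\alpha=5\%$:
```{r}
# Monte Carlo check (illustrative): rejection rate of the KS test under H0,
# expected to be close to alpha = 5%
reject = replicate(1000, ks.test(-log(1 - runif(n))/lamb, pexp, 1/1500)$p.value < 0.05)
mean(reject)
```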
\subsection{1.2. Insertion of random irregularities from another probability distribution}
In order to test the scan statistic method, we randomly replace $m=30$ values in the sample. Since the $n$-sample is built by the inverse transform method, all its values follow the exponential law, and so does any subset of the sample. \
Inserting values drawn from another probability distribution lets us test whether significant changes in the data can be detected, and will illustrate the scan statistic method.
```{r}
m = 30                    # number of values to replace
index = sample(1:n, m)    # positions of the replaced values (without duplicates)
lamb2 = 1/500
y = x
for (i in 1:m) {
  u = runif(1, 0, 1)
  y[index[i]] = -log(1 - u)/lamb2   # inverse transform for Exp(lamb2)
}
```
If we now apply the Kolmogorov-Smirnov goodness-of-fit test, we obtain the following result:
```{r}
result = ks.test(y, pexp, 1/1500, alternative = "two.sided")
result
print(result$statistic)
```
We can see that after randomly changing 30 out of 100 values, the Kolmogorov-Smirnov test rejects the null hypothesis, i.e. the hypothesis that the sample follows an exponential distribution of parameter $\displaystyle\lambda=\frac{1}{1500}$.
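The deviation can also be seen graphically; the following sketch (added for illustration) compares the empirical cumulative distribution function of the contaminated sample `y` with the theoretical exponential one:
```{r}
# Illustrative plot: empirical CDF of the contaminated sample vs. the
# theoretical CDF of Exp(1/1500)
plot(ecdf(y), main = "Empirical CDF of y vs. theoretical Exp(1/1500)")
curve(pexp(x, rate = 1/1500), add = TRUE, col = "red")
```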
\subsection{1.3. First implementation of a scan statistic method}
We can now implement a scan statistic method based on the Kolmogorov-Smirnov test statistic. The idea is to find the range of elements of the sample with the largest test statistic, that is, the range whose distribution is the least compatible with an exponential distribution of parameter $\displaystyle \lambda=\frac{1}{1500}$.
We first implement this scan statistic method with a fixed window of size $10$; the window then slides through the whole dataset until it reaches the end.
```{r}
n = 100
size = 10
maxstat = 0            # largest KS statistic found so far
rangemax = 1:size
start = 1
end = size
for (i in 1:(n - size + 1)) {   # slide the window over every possible position
  result = ks.test(y[start:end], pexp, 1/1500, alternative = "two.sided")
  compared = result$statistic
  if (compared > maxstat) {
    maxstat = compared
    rangemax = start:end
  }
  start = start + 1
  end = end + 1
}
print(maxstat)
print(rangemax)
```
This code is a first implementation of a scan statistic on a dataset of known distribution with some random contamination. Considering a fixed-size window, the window with the largest Kolmogorov-Smirnov test statistic is given by `rangemax`. Hence, if the sample corresponds to serially produced bulbs, `rangemax` designates the batch we would inspect first.
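To avoid duplicating the loop for each window size, the sliding scan can be wrapped in a small helper function (a sketch; `scan_ks` is not part of the original code):
```{r}
# Helper (hypothetical): sliding-window Kolmogorov-Smirnov scan.
# Returns the largest test statistic and the corresponding window.
scan_ks <- function(sample, size, rate = 1/1500) {
  best <- 0
  bestrange <- 1:size
  for (start in 1:(length(sample) - size + 1)) {
    window <- start:(start + size - 1)
    stat <- ks.test(sample[window], pexp, rate)$statistic
    if (stat > best) {
      best <- stat
      bestrange <- window
    }
  }
  list(statistic = best, range = bestrange)
}
```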
We can change the window size by re-running the same code with a different value of `size` (equivalently, by calling the helper sketched above):
```{r}
size = 5
maxstat = 0
rangemax = 1:size
start = 1
end = size
for (i in 1:(n - size + 1)) {
  result = ks.test(y[start:end], pexp, 1/1500, alternative = "two.sided")
  compared = result$statistic
  if (compared > maxstat) {
    maxstat = compared
    rangemax = start:end
  }
  start = start + 1
  end = end + 1
}
print(maxstat)
print(rangemax)
```
With `size=5`, we find that the subset with maximal test statistic is a subset of the one found for `size=10`. In terms of p-value:
```{r}
print(ks.test(y[rangemax],pexp,1/1500, alternative="two.sided")$p.value)
```
At level $\alpha=5\%$, we reject the null hypothesis.
If we study the original `x` sample, we observe:
```{r}
size = 10
maxstat = 0
rangemax = 1:size
start = 1
end = size
for (i in 1:(n - size + 1)) {
  result = ks.test(x[start:end], pexp, 1/1500, alternative = "two.sided")
  compared = result$statistic
  if (compared > maxstat) {
    maxstat = compared
    rangemax = start:end
  }
  start = start + 1
  end = end + 1
}
print(maxstat)
print(rangemax)
print(ks.test(x[rangemax], pexp, 1/1500, alternative = "two.sided")$p.value)
```
Hence we can see that, even in the dataset `x` without any irregularities, there are windows where the scan statistic already detects a problem. The range `rangemax` found here need not be the same as the one found on the contaminated data.
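This is expected: the scan considers many overlapping windows, so even under $\mathcal{H}_0$ the maximal windowed statistic is stochastically larger than a single-window statistic. A calibration sketch (not part of the original code) estimates its null distribution by simulation:
```{r}
# Calibration sketch (illustrative): null distribution of the maximal
# windowed KS statistic over clean Exp(1/1500) samples of size n
maxstat <- replicate(200, {
  clean <- rexp(n, rate = 1/1500)
  stats <- sapply(1:(n - 10 + 1), function(s)
    ks.test(clean[s:(s + 9)], pexp, 1/1500)$statistic)
  max(stats)
})
quantile(maxstat, 0.95)   # empirical 95% threshold for the scan
```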
Now we use the uniform distribution in another way, to simulate a Poisson process: conditionally on the number of events $N(T)$, the event times of a Poisson process on $[0,T]$ are distributed as sorted i.i.d. $\mathcal{U}[0,T]$ variables.
```{r}
PoissonProcess <- function(lambda, T) {
  # conditionally on N(T) ~ Poisson(lambda*T), the event times are sorted U[0,T]
  return(sort(runif(rpois(1, lambda*T), 0, T)))
}
lambda0 = 2; lambda1 = 3; Ti = 10   # example parameters (assumed; not set in the original chunk)
pp1 = PoissonProcess(lambda0, Ti)
print(pp1)
plot(c(0, pp1), 0:length(pp1), type = "s", xlab = "time t", ylab = "number of events by time t")
pp2 = PoissonProcess(lambda1, Ti)
print(pp2)
plot(c(0, pp2), 0:length(pp2), type = "s", xlab = "time t", ylab = "number of events by time t")
# times between events
tbe1 = diff(pp1)
tbe1
tbe2 = diff(pp2)
tbe2
ks.test(tbe1, pexp, lambda0, alternative = "two.sided")
ks.test(tbe2, pexp, lambda1, alternative = "two.sided")
```
For a Poisson process, the times between events follow an exponential distribution; the Kolmogorov-Smirnov test lets us check this property on the simulated sequences.
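As a quick sanity check (a sketch added here), the empirical intensity and the mean time between events can be compared with their theoretical values:
```{r}
# Sanity checks (illustrative): for a Poisson process of intensity lambda0,
# length(pp1)/Ti should be close to lambda0 and mean(tbe1) close to 1/lambda0
length(pp1) / Ti
mean(tbe1)
```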
\section{2. Proposed simulation under $\mathcal{H}_1$}
We reuse the code above to build a dataset:
```{r}
PoissonProcess <- function(lambda, T) {
  return(sort(runif(rpois(1, lambda*T), 0, T)))
}
# Step 1: simulate a Poisson process under H0
ppH0 = PoissonProcess(lambda0, Ti)
# Step 2: create a segment under H1
tau = 3 # length of the modified interval; necessarily tau < Ti
ppH1.segt = PoissonProcess(lambda1, tau)
# Step 3: insert the segment into the H0 sequence
dbt = runif(1, 0, Ti - tau) # time at which the modified segment starts
ppH0bis = PoissonProcess(lambda0, Ti)
ppH1.repo = dbt + ppH1.segt # reposition the H1 observations in time
ppH0_avant = ppH0bis[which(ppH0bis < ppH1.repo[1])]                 # H0 points before the segment
ppH0_apres = ppH0bis[which(ppH0bis > ppH1.repo[length(ppH1.repo)])] # H0 points after the segment
ppH1 = c(ppH0_avant, ppH1.repo, ppH0_apres)
# times between events
n1 = length(ppH1)
tbe1 = c(0, ppH1[2:n1] - ppH1[1:(n1-1)])
n0 = length(ppH0)
tbe0 = c(0, ppH0[2:n0] - ppH0[1:(n0-1)])
list1 = data.frame(ProcessusPoissonH1 = ppH1,
                   TimeBetweenEventH1 = tbe1)
list0 = data.frame(ProcessusPoissonH0 = ppH0,
                   TimeBetweenEventH0 = tbe0)
list0
list1
```
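To visualize the contamination, the following sketch (added for illustration) plots the counting path of the $\mathcal{H}_1$ process and marks the inserted segment:
```{r}
# Illustrative plot: counting process under H1; dashed lines mark the
# inserted segment [dbt, dbt + tau]
plot(c(0, ppH1), 0:length(ppH1), type = "s",
     xlab = "time t", ylab = "number of events by time t")
abline(v = c(dbt, dbt + tau), lty = 2, col = "red")
```
We can now compute a scan statistic on such a process: the maximal number of events observed in a sliding window of length $\tau$.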
```{r}
lambda=2
Tps=10
tau=1
pp=PoissonProcess(lambda,Tps)
pp
n=length(pp)
n
which(pp>(Tps-tau))
stop=n-length(which(pp>(Tps-tau)))
stop
ScanStat = 0
for (i in (1:stop)) {
  cat('\n i=', i)
  cat('\t ppi=', pp[i])
  x = which((pp >= pp[i]) & (pp <= (pp[i] + tau)))  # events in [pp[i], pp[i]+tau]
  cat('\t which=', x)
  scan = length(x)                      # number of events in the window
  cat(' scan=', scan)
  if (scan > ScanStat) ScanStat = scan  # keep the maximum over all windows
}
ScanStat
```
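Since the event times are sorted, the same scan statistic can be computed without the explicit loop; the following equivalent vectorized sketch uses `findInterval`:
```{r}
# Vectorized equivalent (sketch): for each event i, the number of events in
# [pp[i], pp[i]+tau] is findInterval(pp[i]+tau, pp) - i + 1, since pp is sorted
counts = findInterval(pp[1:stop] + tau, pp) - (1:stop) + 1
max(counts)
```
Next, we study the local score of a sequence of integer log-likelihood-ratio scores: the local score is $SL=\max_j W_j$, where $W$ is the Lindley process $W_j=\max(0, W_{j-1}+s_j)$, $W_0=0$.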
```{r}
E = 10
mu1 = 1
for (mu0 in 1:10) {
  repe = 10^3                      # number of simulated sequences
  n = 200                          # length of each sequence
  SL_vect = vector(length = repe)  # local score of each sequence
  for (j in 1:repe) {
    w.E = 0   # initialization of W (Lindley process) for sequence j
    SL = 0    # initialization of the local score for sequence j
    for (i in 1:n) {
      a = rexp(1, mu0)  # observation simulated under H0 (could also be read from a data file)
      s.E = floor(E * log(dexp(a, mu1) / dexp(a, mu0)))  # integer LLR score of observation a
      w.E = max(0, w.E + s.E)  # Lindley recursion: W_i = max(0, W_{i-1} + s_i)
      if (w.E > SL) SL = w.E   # local score: SL = max_i W_i
    }
    SL_vect[j] = SL
  }
  p <- hist(SL_vect)
  print(p)
}
SL_vect
```
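The empirical distribution of the local score under $\mathcal{H}_0$ can then be used to calibrate a decision threshold (an illustrative sketch):
```{r}
# Illustrative: the empirical 95% quantile of the simulated local scores
# can serve as a rejection threshold at level 5%
quantile(SL_vect, 0.95)
```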
```{r}
library(stringr)
E = 10
lambda0 = 1
lambda1 = 2
# find tmax as a function of lambda0 such that P(Exp(lambda0) >= tmax) < 10^-9
tmax = 0
while (exp(-lambda0*tmax) > 10^(-9)) {
  tmax = tmax + 1
}
tmax
t = seq(0, tmax, 0.1)
x = floor(E*log(lambda1/lambda0) + (lambda0-lambda1)*t)
head(x)
range(x)
s = seq(-15, 6, 0.1)
## truncate the tail of the x (negative x)
A = 1/(lambda0-lambda1)
B = A*log(lambda1/lambda0)
proba.s = lambda0*exp(-lambda0*(A*s - B))
proba.s = proba.s/sum(proba.s)
barplot(proba.s)
## find 1/lambda as a function of the sequence interval
## is it the difference or the ratio between the lambdas that matters most?
for (lambda0 in c(5, 50, 500, 5000, 50000)) {
  for (lambda1 in c(2, 20, 200, 2000, 20000)) {
    probaS_vect = vector(length = length(s))
    for (j in 1:length(x)) {
      A = 1/(lambda0-lambda1)
      B = A*log(lambda1/lambda0)
      probaS = floor(E*lambda0*exp(-lambda0*(A*x[j] - B)))
      probaS_vect[j] = probaS
    }
    probaS_vect = probaS_vect/sum(probaS_vect)
    print(probaS_vect)
    p <- barplot(probaS_vect, main = str_c("lambda0=", as.character(lambda0), "and", "lambda1=", as.character(lambda1), sep = " "))
    p
  }
}
```
```{r}
# Checking the theoretical distribution of the scores
T = 10
E = 1
lambda0 = 2
lambda1 = 3
## Choice of E
maxXk = floor(E*(log(lambda1/lambda0)))
while (maxXk < 3) {
  E = E + 1
  maxXk = floor(E*(log(lambda1/lambda0)))
}
E
## Maximal score
score_max = floor(E*log(lambda1/lambda0))
score_max
## Minimal score
score_min_c = floor(E*log(lambda1/lambda0) + E*(lambda0-lambda1)*T)
## Score distribution
l = seq(score_min_c, score_max, 1)
borne_inf = (l - E*log(lambda1/lambda0))/(E*(lambda0-lambda1))
borne_sup = (l + 1 - E*log(lambda1/lambda0))/(E*(lambda0-lambda1))
proba.l = pexp(borne_inf, rate = lambda0) - pexp(borne_sup, rate = lambda0)
proba.l
S = sum(proba.l)
new.proba.s = proba.l/S
new.proba.s
barplot(new.proba.s)
```
```{r}
library(stringr)
# rm(list = ls())   # clearing the environment would erase objects (ppH1, ...) used by later chunks
# Number of digits displayed
options("digits" = 15)
# Working directory (adapt as needed)
workpath = "."
E = 1
lambda0 = 2
lambda1 = 3
T = 10                  # time horizon, as in the previous chunk (assumed)
probaseuilmin = 10^-9   # threshold such that P(Exp(lambda0) >= max score) < 10^-9
# Checking the theoretical distribution of the scores
maxXk = floor(E*(log(lambda1/lambda0)))
while (maxXk < 3) {
  E = E + 1
  maxXk = floor(E*(log(lambda1/lambda0)))
}
E
score_max = floor(E*log(lambda1/lambda0))
score_max
l = -50
borne_inf = (l - E*log(lambda1/lambda0))/(E*(lambda0-lambda1))
borne_sup = (l + 1 - E*log(lambda1/lambda0))/(E*(lambda0-lambda1))
pexp(borne_inf, rate = lambda0) - pexp(borne_sup, rate = lambda0)
l = seq(-100, score_max, 1)
x = floor(E*log(lambda1/lambda0) + E*(lambda0-lambda1)*l)
proba.s = (lambda1/lambda0)*(exp(lambda0/(lambda0-lambda1)*(1-l/E))*exp(-lambda0/(E*lambda0-lambda1)) - 1)
cbind(x, proba.s)
tail(proba.s)
boxplot(proba.s)
sum(proba.s)
score_min_c = floor(E*log(lambda1/lambda0) + E*(lambda0-lambda1)*T)   # horizon T, as in the previous chunk
somme = (lambda1/lambda0)*(exp(-lambda0/(E*(lambda0-lambda1))) - 1)* exp(lambda0/(lambda0-lambda1))*exp(lambda0/(E*(lambda0-lambda1))*(1-score_min_c))/(1-exp(lambda0/(E*(lambda0-lambda1))))
proba.s = proba.s/(1 - somme)
proba.s
barplot(proba.s)
s = l   # score grid associated with proba.s (assumed; `s` was otherwise undefined here)
if (lambda0 < lambda1) {
  # Truncation at a minimal score
  minXk = as.numeric(s[max(which(proba.s < probaseuilmin))]) # class grouping the probabilities < probaseuilmin
  names(proba.s) = as.character(s)
  proba.s[as.character(minXk)] = sum(proba.s[which(proba.s < probaseuilmin)]) # probability of the class SCORE < minXk
  score = s
  score = score[which(score >= minXk)]
  proba.s = proba.s[which(s >= minXk)] # keep only the scores above minXk
  subtitle = paste("lambda0=", lambda0, ";lambda1=", lambda1, ";E=", E, sep = "")
  barplot(proba.s, col = "steelblue", xlab = "Score", ylab = "Probability",
          main = paste("Probability of each score\nGeometric law: ", subtitle, sep = ""))
  print(score)
}
probaseuilmax = 10^-5
if (lambda1 < lambda0) {
  # Truncation at a maximal score
  maxXk = as.numeric(s[max(which(proba.s > probaseuilmax))]) # class grouping the probabilities > probaseuilmax
  names(proba.s) = as.character(s)
  proba.s[as.character(maxXk)] = sum(proba.s[which(proba.s > probaseuilmax)]) # probability of the class SCORE > maxXk
  score = s
  score = score[which(score <= maxXk)]
  proba.s = proba.s[which(s <= maxXk)] # keep only the scores below maxXk
  subtitle = paste("lambda0=", lambda0, ";lambda1=", lambda1, ";E=", E, sep = "")
  barplot(proba.s, col = "steelblue", xlab = "Score", ylab = "Probability",
          main = paste("Probability of each score\nGeometric law: ", subtitle, sep = ""))
  print(score)
}
min(score)
max(score)
mean(score)
# Save the distribution to a file
fichier_proba = paste(workpath, "/LLR_Theorique_lambda0", gsub('\\.', '', as.character(lambda0)), "_lambda1", gsub('\\.', '', as.character(lambda1)), "_E", E, "_CL.txt", sep = "")
write.table(proba.s, file = fichier_proba)
# Read the file back
#prob.X = read.table(fichier_proba)
```
```{r}
PoissonProcess <- function(lambda, T) {
  return(sort(runif(rpois(1, lambda*T), 0, T)))
}
lambda0 = 2
lambda1 = 3
Ti = 100000
pp1 = PoissonProcess(lambda1, Ti)
head(pp1)
n1 = length(pp1)
tbe1 = diff(pp1)   # times between events
ks.test(tbe1, "pexp", rate = lambda1)
x = log(lambda1/lambda0) + (lambda0-lambda1)*tbe1 # do not apply floor nor E here (E equals 1 anyway)
hist(x)
summary(x)
# Maximum of the scores
E = 1
# theory (to be worked out): max.s = log(lambda1/lambda0)
maxXk = floor(E*(log(lambda1/lambda0)))
maxXk
while (maxXk < 3) {
  E = E + 1
  maxXk = floor(E*(log(lambda1/lambda0)))
}
E
x = floor(E*(log(lambda1/lambda0) + (lambda0-lambda1)*tbe1))
dist.emp.scores = table(x)/sum(table(x))
dist.emp.scores
hist(x)
range(x)
x.verif = seq(range(x)[1], range(x)[2], 1)
#dist.theo.scores = lambda0*exp(-lambda0*(A*x.verif - B))
#dist.theo.scores
barplot(dist.emp.scores)
```
We now import a dataset of rainfall in France, measured every 3 hours (SYNOP data).
```{r}
Rain_Dataset = read.csv("data/synop.202202.csv", sep = ";")
summary(Rain_Dataset)
Rain_Dataset_Red = Rain_Dataset[, c('date', 'rr3')]   # keep only the date and the 3-hour rainfall
Rain_Dataset_Red[, 'rr3'] = as.numeric(Rain_Dataset_Red[, 'rr3'])
summary(Rain_Dataset_Red)
head(Rain_Dataset_Red)
l = Rain_Dataset_Red[, 1]   # observation dates
head(l)
```
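Before scanning, it helps to look at the series; a quick illustrative sketch plots the 3-hour rainfall values:
```{r}
# Illustrative plot of the 3-hour rainfall measurements
plot(Rain_Dataset_Red[, 'rr3'], type = 'h',
     xlab = 'observation index', ylab = 'rainfall over 3 hours')
```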
We can now compute the scan statistic (SST) of the simulated $\mathcal{H}_1$ process, over windows of length $\tau$:
```{r}
T = max(ppH1)   # horizon of the simulated H1 process (the original used T = 1)
tau = .25
sst = -1
for (i in seq(0, (T - tau)/tau)) {
  t = i*tau
  # number of events of ppH1 in the window [t, t + tau]
  partial_sst = sum(ppH1 >= t & ppH1 <= t + tau)
  if (sst < partial_sst) {
    sst = partial_sst
  }
}
sst
```
```{r}
# Scan statistic on the observation dates of the rainfall dataset.
# The dates are converted to hours elapsed since the first observation
# (the timestamp format YYYYMMDDHHMMSS is assumed, as in the SYNOP files).
d = as.POSIXct(as.character(Rain_Dataset_Red[, 'date']), format = "%Y%m%d%H%M%S", tz = "UTC")
pp = sort(as.numeric(difftime(d, min(d), units = "hours")))
n = length(pp)
n
Tps = max(pp)   # horizon of the observations (the original used Tps = 10)
tau = 24        # window length in hours (assumed value)
stop = n - length(which(pp > (Tps - tau)))
ScanStat = 0
for (i in (1:stop)) {
  x = which((pp >= pp[i]) & (pp <= (pp[i] + tau)))  # observations within the window
  scan = length(x)
  if (scan > ScanStat) ScanStat = scan
}
ScanStat
```
Finally, we compute the local score of the score sequence with the `localScore` package, and its p-value via the `daudin` function.
```{r}
library("localScore")
E = 10
X = floor(E*log(dexp(tbe1, rate = lambda1)/dexp(tbe1, rate = lambda0)))
max_X = max(X)
min_X = min(X)
P_X = table(factor(X, levels = min_X:max_X))/length(X)
LS=localScoreC(X)$localScore[1]
result = daudin(localScore = LS, score_probabilities = P_X, sequence_length = length(x), sequence_min = min_X, sequence_max = max_X)
result
```