The tournament is held on a 400 m skating rink. To quote Wikipedia: 'All races are run counter-clockwise. In all individual competition forms, only two skaters are allowed to race at once, and must remain in their respective lanes. Skaters must change lanes every lap. The skater changing from the outside lane to the inside has right-of-way.' It follows that a 10 km race consists of 25 laps, and hence there are also 25 lap times.
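The arithmetic is simple enough to check in R (the language used throughout this post):

```r
# a 10 km race on a 400 m track
laps <- 10000 / 400
laps                           # 25
length(seq(400, 10000, 400))   # 25 split distances, one per lap
```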

#### Source data

Data for the 10 km have been downloaded from schaatsstatistieken (a pdf file, available in Dutch and English language versions) at the beginning of December. It contains lap times for races where the final time was under 13'30 at that point in calendar time. Since then some additional results have been added; surely the Dutch championships this weekend will add more. The data were transformed to a .txt file in Calibre. Unfortunately the result has rather randomly placed line breaks and no delimiters except the space, so it took me quite a few lines of code to read the data. It was not possible to just use the counts of spaces, as these vary with skaters' names and event locations. I put this script in the appendix, since it breaks up the text too much.

#### Plotting data

The plot shows all lap times in the data. The pattern is fairly simple: the first lap is slow, while speed is built up. The intermediate laps are skated at a fairly constant speed. In the last laps the speed increases or decreases.

# plot data
long <- reshape(scating4,varying=list(colnames(times),colnames(rtimes)),
    v.names=c('total','round'),direction='long',times=seq(400,10000,400),
    timevar='Distance')
xyplot(round ~ Distance,groups=nums,data=long,type='l')

The second plot shows a selection of details: all results from the year 2000. The plot shows typical patterns. Frank Dittrich skated at an ever-increasing speed. Gianni Romme in Heerenveen slowed down slightly over the laps. Bob de Jong had nicely flat lap times.

long$year <- substr(long$date,1,4)
xyplot(round ~ Distance,groups=paste(name,' (',location,')',
        sep=''),data=long[long$year=='2000',],
    type='l',ylab='lap time',main='Year 2000',
    auto.key=list(space='bottom',lines=TRUE,points=FALSE))

#### Predictive model

To model these correlated data it seems natural to use PLS. The question then is: at which distance is the final time predicted well? At 2 km it is not so good. A Root Mean Squared Error of Prediction (RMSEP) of 7.5 at 4 components in the model translates into a fair chance of an error of more than 7 seconds; see also the observed-versus-predicted plot.

pls1 <- plsr(m10000 ~ rm400 + rm800 + rm1200 + rm1600 + rm2000,
    data=scating4,validation='CV',ncomp=5)
RMSEP(pls1)
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           9.879    7.766    7.569    7.492    7.469    7.469
adjCV        9.879    7.764    7.566    7.490    7.466    7.465
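To put an RMSEP of 7.5 in perspective: if the prediction errors are roughly normally distributed (an assumption made here purely for illustration), the chance of missing by more than 7 seconds is substantial:

```r
# P(|prediction error| > 7 s) when errors are approximately N(0, RMSEP^2)
rmsep <- 7.5
2 * pnorm(-7 / rmsep)  # about 0.35
```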

formattl <- function(x)
    paste(x %/% 60,"'",
        sprintf('%02i',round(x %% 60))
        ,sep='')
obspredplot <- function(y,main) {
    plot(x=scating4$m10000,y=y,axes=FALSE,
        xlab='Observed time',ylab='Predicted time',
        main=main,
        xlim=c(759,811),ylim=c(759,821))
    axis(1,at=seq(760,810,10),labels=formattl(seq(760,810,10)))
    axis(2,at=seq(760,820,10),labels=formattl(seq(760,820,10)))
    abline(a=0,b=1,col='blue')
    return(invisible(0))
}
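As a quick check of the label helper (redefined here so the snippet stands alone), formattl turns seconds into the usual minutes'seconds notation used on the axes:

```r
# convert seconds to m'ss notation, e.g. for axis labels
formattl <- function(x)
    paste(x %/% 60, "'", sprintf('%02i', round(x %% 60)), sep = '')
formattl(c(760, 811))  # "12'40" "13'31"
```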

obspredplot(y=as.numeric(predict(pls1,ncomp=2,type='response')),'Data up to 2 km')

As more data are added, the predictions get better. With data up to 8 km the RMSEP is around 2.5, so a good prediction can be made.

pls2 <- plsr(m10000 ~ rm400 + rm800 + rm1200 + rm1600 + rm2000
    + rm2400 + rm2800 + rm3200 + rm3600 + rm4000,data=scating4,
    validation='CV',ncomp=9)
RMSEP(pls2)
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    6.808    6.484    6.354    6.329    6.320    6.315
adjCV        9.879    6.807    6.482    6.351    6.323    6.312    6.307
       7 comps  8 comps  9 comps
CV       6.319    6.319    6.319
adjCV    6.311    6.311    6.311

pls3 <- plsr(m10000 ~ rm400 + rm800 + rm1200 + rm1600 + rm2000
    + rm2400 + rm2800 + rm3200 + rm3600 + rm4000
    + rm4400 + rm4800 + rm5200 + rm5600 + rm6000,data=scating4,
    validation='CV',ncomp=14)
RMSEP(pls3)
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    5.219    4.645    4.507    4.474    4.492    4.525
adjCV        9.879    5.217    4.643    4.505    4.469    4.485    4.514
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       4.536    4.538    4.539     4.540     4.540     4.540     4.540
adjCV    4.525    4.526    4.527     4.528     4.528     4.528     4.528
       14 comps
CV        4.540
adjCV     4.528

pls4 <- plsr(m10000 ~ rm400 + rm800 + rm1200 + rm1600 + rm2000
    + rm2400 + rm2800 + rm3200 + rm3600 + rm4000
    + rm4400 + rm4800 + rm5200 + rm5600 + rm6000
    + rm6400 + rm6800 + rm7200 + rm7600 + rm8000,data=scating4,
    validation='CV',ncomp=14)
RMSEP(pls4)
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    2.806    2.558    2.411    2.366    2.351    2.372
adjCV        9.879    2.805    2.557    2.409    2.362    2.347    2.366
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       2.385    2.386    2.387     2.389     2.390     2.390     2.391
adjCV    2.377    2.378    2.380     2.381     2.382     2.383     2.383
       14 comps
CV        2.391
adjCV     2.383

obspredplot(as.numeric(predict(pls4,ncomp=2,type='response')),'Data up to 8 km')

Further calculations show that it is not the amount of data that determines the prediction quality. Using only the times at 2, 4, 6 and 8 km gives good predictions.

pls4b <- plsr(m10000 ~ rm2000 + rm4000 + rm6000 + rm8000,data=scating4,
    validation='CV',ncomp=4)
RMSEP(pls4b)
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV           9.879    2.979    2.903    2.891    2.892
adjCV        9.879    2.978    2.902    2.889    2.889
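The progression can be summarised in one line (a sketch: it assumes the models fitted above and the array layout of the value returned by the pls package's RMSEP):

```r
# best cross-validated RMSEP per model: prediction error shrinks as laps are added
models <- list(`2 km` = pls1, `4 km` = pls2, `6 km` = pls3,
               `8 km` = pls4, `2/4/6/8 km` = pls4b)
sapply(models, function(m) min(RMSEP(m)$val['CV', 1, ]))
```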

In hindsight I could have found this much more simply. Just looking at the correlations shows the highest values for the laps just under 6 and 8 km.

plot(y=cor(cbind(rtimes,times[,25]),use='complete.obs')[26,1:25],x=seq(400,10000,400),
    ylab='Correlation with 10 km time',xlab='Lap (distance)',
    main='Correlation between lap time and 10 km time')
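The same correlations can be used to pick out the single most predictive lap directly (this assumes the times and rtimes matrices built in the appendix):

```r
# distance of the lap whose time correlates most with the final time
co <- cor(cbind(rtimes, times[, 25]), use = 'complete.obs')[26, 1:25]
seq(400, 10000, 400)[which.max(co)]
```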

#### Appendix: reading and transforming data

library(lattice)
library(pls)
library(randomForest)
r1 <- readLines("Rondetijden 10000 meter mannen onder 13.30,00 - SchaatsStatistieken.nl.txt",encoding='UTF-8')
# removing unneeded input
nc <- sapply(1:length(r1),function(x) nchar(r1[x]))
r2 <- r1[nc!=0]
r2 <- r2[r2!="datum"]
r2 <- r2[r2!="# naam"]
r2 <- r2[-grep("pagina",r2)]
r2 <- r2[-grep("SchaatsStatistieken",r2)]
r2 <- r2[-grep("^400m",r2)]
r2 <- r2[-grep("^nat plaats",r2)]
r4 <- sapply(1:length(r2),function(x)
    gsub("800m 1200m 1600m 2000m 2400m 2800m 3200m 3600m 4000m 4400m 4800m 5200m 5600m 6000m 6400m 6800m 7200m 7600m 8000m 8400m 8800m 9200m 9600m 10000m ",
        "", r2[x]))
# helper: number of characters per element
# (note: this masks base::lengths(), present in R >= 3.2.0, which does something else)
lengths <- function(x) {
    nc <- sapply(1:length(x),function(xx) nchar(x[xx]))
    nc
}
# appending lines - each line should start with a number
while (length(grep('^[0-9]{1,3} ',r4,invert=TRUE))>0) {
    le <- grep('^[0-9]{1,3} ',r4,invert=TRUE)[1]
    r4[le-1] <- paste(r4[le-1],r4[le])
    r4 <- r4[-le]
}
r5 <- r4[-grep('- - - -',r4)]
# extracting columns
num <- regexpr(" ",r5)
scating1 <- data.frame(nums = as.numeric(substr(r5,1,num-1)),
    strings = substring(r5,num+1))
# country is three uppercase letters, name before that
num <- regexpr("[A-Z]{3}",scating1$strings)
scating1$name <- substr(scating1$strings,1,num-1)
scating1$country <- substr(scating1$strings,num,num+2)
scating1$strings <- substring(scating1$strings,num+4)
#head(scating1)
# date starts with 4 digits characterising the year
# location before the date
num <- regexpr("[0-9]{4}",scating1$strings)
#table(num)
scating1$location <- substr(scating1$strings,1,num-1)
scating1$date <- substr(scating1$strings,num,num+10)
scating1$strings <- substring(scating1$strings,num+11)
#head(scating1)
# swap notation of , and . (decimal comma to decimal point)
scating1$strings <- gsub('.','#',scating1$strings,fixed=TRUE)
scating1$strings <- gsub(',','.',scating1$strings,fixed=TRUE)
#head(scating1)
distance <- paste('m',seq(400,10000,400),sep='')
scating1$strings <- paste(scating1$strings,' ')
times <- matrix(0.0,nrow=nrow(scating1),
    ncol=25,dimnames=list(1:nrow(scating1),distance))
for (i in 1:25) {
    num <- regexpr(" ",scating1$strings)
    time <- substr(scating1$strings,1,num-1)
    time[time=='-'] <- '0#0'
    ntime <- 60*as.numeric(substr(time,1,regexpr("#",time)-1))+
        as.numeric(substr(time,regexpr("#",time)+1,lengths(time)))
    ntime[time=='0#0'] <- NA
    times[,i] <- ntime
    scating1$strings <- substr(scating1$strings,num+1,
        lengths(scating1$strings))
}
scating2 <- cbind(scating1[,!(names(scating1)=='strings')],times)
rtimes <- times
colnames(rtimes) <- paste('r',colnames(times),sep='')
for (i in 2:25) rtimes[,i] <- times[,i] - times[,i-1]
scating3 <- cbind(scating2,rtimes)
scating4 <- scating3[complete.cases(scating3),]
# next two corrections have been applied in the source data (website)
scating4 <- scating4[scating4$m400>20,]    # wrong data in 1st lap
scating4 <- scating4[scating4$nums != 336,] # wrong data at 6400 m
