Estimating time for training

Training time becomes an issue, especially when you're training a complicated model like an SVM on a large dataset. When that happens, it's useful to estimate when the training will finish. I've written some simple R code that does this efficiently (in terms of time, though not accuracy): time the training on subsets of increasing size, fit a polynomial to the measured times, and extrapolate to the full dataset.

> # Load your data into 'train'.
> 
> # I'll use svm.
> library(e1071)
> # doMC provides a parallel backend for foreach.
> library(doMC)
> 
> # Register the backend; pass cores=N to set the number of workers explicitly.
> registerDoMC()
> 
> # Time SVM training for data sizes 1000, 2000, ..., 10000,
> # running the timings in parallel.
> results <- foreach(i=seq(1000, 10000, 1000)) %dopar% {
+   print(i)
+   start <- proc.time()
+   model <- svm(label ~ ., data=train[1:i,], kernel='linear')
+   end <- proc.time()
+   # (end - start)[3] is the elapsed (wall-clock) time in seconds.
+   return(list(size=i, time=(end - start)[3]))
+ }
[1] 1000
[1] 2000
[1] 3000
[1] 4000
[1] 5000
[1] 6000
[1] 7000
[1] 8000
[1] 9000
[1] 10000
> results
[[1]]
[[1]]$size
[1] 1000

[[1]]$time
elapsed 
   1.33 


[[2]]
[[2]]$size
[1] 2000

[[2]]$time
elapsed 
  5.845 

... omitted ...
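By the way, doMC relies on Unix-style process forking, so it isn't a good fit on Windows. doParallel is a cross-platform alternative that registers the same way; a minimal sketch, not run in this session:

# Cross-platform backend instead of doMC (sketch; not run in this session).
library(doParallel)
registerDoParallel(cores = parallel::detectCores())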

> # Convert the list into a data frame.
> time_taken <- as.data.frame(matrix(unlist(results), ncol=2, byrow=T))
> colnames(time_taken) <- c("size", "time")
> time_taken
    size    time
1   1000   1.330
2   2000   5.845
3   3000  13.934
4   4000  26.413
5   5000  45.508
6   6000  67.102
7   7000  96.151
8   8000 135.744
9   9000 183.990
10 10000 229.421
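As a quick sanity check on how training time scales, the empirical exponent can be estimated from the same measurements with a log-log fit; for the numbers above the slope comes out a little over 2, consistent with the quadratic term that dominates the fit below. A minimal sketch (output not shown):

# Estimate k in time = c * size^k via a log-log regression
# (sketch; output not shown).
exponent_fit <- lm(log(time) ~ log(size), data=time_taken)
coef(exponent_fit)["log(size)"]  # empirical scaling exponent k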
> # Let's see if time = a + b*size + c*size^2 + d*size^3 + e*size^4 + error.
> # (poly() fits orthogonal polynomials, so the estimates below aren't a..e
> # directly, but each degree's significance can still be read off.)
> m <- lm(time ~ poly(size, 4), data=time_taken)
> summary(m)

Call:
lm(formula = time ~ poly(size, 4), data = time_taken)

Residuals:
      1       2       3       4       5       6       7       8       9      10 
 0.5405 -0.9368 -0.5183  0.3327  2.1665 -0.2056 -2.2956 -0.8783  2.8949 -1.1001 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     80.5438     0.6621 121.654 7.12e-10 ***
poly(size, 4)1 227.8825     2.0937 108.844 1.24e-09 ***
poly(size, 4)2  69.3148     2.0937  33.107 4.73e-07 ***
poly(size, 4)3   4.3338     2.0937   2.070   0.0932 .  
poly(size, 4)4  -3.2294     2.0937  -1.542   0.1836    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.094 on 5 degrees of freedom
Multiple R-squared: 0.9996,	Adjusted R-squared: 0.9993 
F-statistic:  3237 on 4 and 5 DF,  p-value: 1.024e-08 

> # Starting from the highest-order term, drop terms whose p-values exceed
> # 0.05 until a significant one is reached: size^4 (p = 0.1836) and size^3
> # (p = 0.0932) are not significant, but size^2 (p = 4.73e-07) is. So the
> # model becomes time = a + b * size + c * size^2 + error.
>
> # Note that even if the p-value of the size^1 term were larger than 0.05,
> # we would keep it, since the higher-order size^2 term is significant.
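It's worth refitting the reduced model to check that both remaining terms stay significant; a minimal sketch (summary output not reproduced here):

# Refit with only the linear and quadratic terms and inspect the
# p-values (sketch; output not shown).
m2 <- lm(time ~ poly(size, 2), data=time_taken)
summary(m2)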
>
> # Predict the time to train on the full dataset.
> predict(lm(time ~ poly(size, 2), data=time_taken), data.frame(size=nrow(train)))
       1 
91304.46 
> # That's about 1.06 days.
> 91304.46 / 60 / 60 / 24
[1] 1.056765

Here’s the graph:
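A plot like it can be regenerated from time_taken with base R graphics; a minimal sketch:

# Scatter of measured times with the fitted quadratic curve overlaid
# (sketch; assumes time_taken from the session above).
fit <- lm(time ~ poly(size, 2), data=time_taken)
grid <- data.frame(size=seq(1000, 10000, length.out=200))
plot(time_taken$size, time_taken$time,
     xlab="training set size", ylab="elapsed time (seconds)",
     main="SVM training time vs. training set size")
lines(grid$size, predict(fit, grid))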