Classifying text with R and RTextTools

Recently I have been playing with text classification in R. In this example we will provide R with a number of text documents, each document belonging to a category or “class”. We will use a number of machine learning algorithms to train different models. Finally we will use those models to classify new text.

Before jumping into the code, there is something you need to be aware of: RTextTools is a nice library, but it has not been updated in a while and, as a result, many calls to it will fail. To fix that, I forked it and applied the required fixes. You will need to do four things:

  1. Prepare your system to compile R libraries

    sudo apt-get install -y r-cran-boot r-cran-class r-cran-cluster r-cran-codetools r-cran-foreign r-cran-kernsmooth r-cran-lattice r-cran-mass r-cran-matrix r-cran-mgcv r-cran-nlme r-cran-nnet r-cran-rpart r-cran-spatial r-cran-survival  r-base r-recommended r-base-dev

  2. Also, install a few R libraries

    install.packages(c("SparseM", "randomForest", "tm", "ipred", "maxent", "glmnet"))

  3. Clone the RTextTools that I patched from my git repo
  4. Compile and install the code, following the instructions I provide in this post (a quick sketch of this step follows below).
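For steps 3 and 4, here is a minimal sketch of how I do it from within R (the directory and file names are just illustrative; adapt them to wherever you cloned the fork):

# Sketch: after cloning the fork (e.g. into ./RTextTools) and building it from the shell with
# "R CMD build RTextTools", install the resulting tarball from within R
tarball <- list.files(pattern = "^RTextTools_.*\\.tar\\.gz$")[1]
install.packages(tarball, repos = NULL, type = "source")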

I know, I know, it takes a while and you probably just want to run the code, but trust me, it is worth it.

Remember that this example, along with the training data, is available in my git repository.

Now, let's present our R code

library(RTextTools)

# Load the labeled documents. The file name below is just a placeholder: use the
# training data CSV from my git repository (it contains two columns, Text and Class).
raw <- read.csv("train_data.csv", stringsAsFactors=FALSE)

colnames(raw) # Print the column names, we have two: Text and Class

# Let's split the data: 70% for training/testing, 30% for cross validation
base_data <- raw[1:as.integer(nrow(raw)*0.7), ]
cross_validation_data <- raw[(as.integer(nrow(raw)*0.7)+1):nrow(raw), ]


# Now, it is time to create a document-term matrix. The create_matrix function
# will also take care of removing numbers, making everything lower case and some
# other preprocessing, in order to make the text as standard as possible
doc_matrix <- create_matrix(base_data$Text, language="english", removeNumbers=TRUE,
                            stemWords=TRUE, removeSparseTerms = 0.998)


# Finally, let's create the 'container'. In RTextTools it is the type of object that
# lets all the functions operate through a common interface. The relevant arguments here are
#      trainSize=Of the matrix given, specifies which part is for training
#      testSize=Of the matrix given, specifies which part is for testing
#      virgin=Specifies whether or not we have the correct answers.
container <- create_container(doc_matrix, base_data$Class, trainSize=1:as.integer(nrow(base_data)*0.7), 
                              testSize=(as.integer(nrow(base_data)*0.7)+1):nrow(base_data), virgin=FALSE)

# Algorithms to use, note that 'RF' and 'TREE' are a bit slower, that's why I am not 
# using them
algs <- c('SVM'
          ,'GLMNET'
          ,'MAXENT'
          #, 'RF'
          #, 'TREE'
          )

# Run the machine learning itself!
models <- train_models(container, algs)

# Perform predictions
classify <- classify_models(container, models)

# Get a quick analysis on the predictions
analytics <- create_analytics(container, classify)
summary(analytics)

# This is one of the MOST important parts of the code: now we are going to prepare the matrix for
# our cross validation data. Note that there is a new parameter here:
#     originalMatrix=doc_matrix ==> This is the previous matrix that we created. We NEED
#     this parameter here because what we essentially train on is the words, and we
#     need to have the same words (the same vocabulary) in the train, test and cross
#     validation sets. The originalMatrix parameter takes care of that.
doc_matrix_cv2 <- create_matrix(cross_validation_data$Text, 
                               originalMatrix=doc_matrix,
                               language="english", 
                               removeNumbers=TRUE,
                               stemWords=TRUE, 
                               removeSparseTerms=0.998
                               )


# Now again, let's create a container and perform predictions on it
container_cv <- create_container(doc_matrix_cv2, 
                                 cross_validation_data$Class,
                                 trainSize=NULL,
                                 testSize=1:(nrow(cross_validation_data)),
                                 virgin=TRUE)



classify_cv <- classify_models(container_cv, models)
analytics_cv <- create_analytics(container_cv, classify_cv)

predictions_raw <- analytics_cv@document_summary

# Create a simple dataframe for predictions
predictions <- data.frame(real_label=cross_validation_data$Class, 
                          predicted_label=predictions_raw$CONSENSUS_CODE,
                          algorithms_agree=predictions_raw$CONSENSUS_AGREE)

min_algs_agreeing <- 2 # We will require at least two algorithms to agree on the class before accepting it


# Update our predictions dataframe: if at least X algorithms agree on the result we keep
# the prediction; if not, we set the prediction to "N/A"
predictions$predicted_label <- ifelse(predictions$algorithms_agree >= min_algs_agreeing, predictions$predicted_label, "N/A")


# Let's print out some final conclusions
print(paste("Total number of documents to classify", nrow(cross_validation_data)))
print(paste("Total number of documents classified", nrow(subset(predictions, predicted_label!="N/A"))))
print(paste("Total number of correctly classified documents", nrow(subset(predictions, predicted_label == real_label))))
print(paste("Total number of incorrectly classified documents", nrow(subset(predictions, predicted_label != real_label))))
print(paste("Total number of documents which were not classified", nrow(subset(predictions, predicted_label == "N/A"))))


I have tried to comment the code as it goes, so it should be easy to follow. There are, however, several areas worth mentioning.

  1. If you get an error like this

    Error in if (attr(weighting, "Acronym") == "tf-idf") weight <- 1e-09 :
    argument is of length zero
    Calls: create_matrix
    Execution halted

    It means you installed the RTextTools package from CRAN; instead, you need to install the fork available in my repository and then follow the instructions on how to manage R libraries.

  2. The code takes a while to run (a few minutes on my computer, an Intel Core i7 with 8GB of memory), so be patient.

Finally, the results obtained are as follows

[1] "Total number of documents to classify 1646"
[1] "Total number of documents classified 1593"
[1] "Total number of correctly classified documents 1503"
[1] "Total number of incorrectly classified documents 143"
[1] "Total number of documents which were not classified 53"

Of course this is a simple dataset; if you explore it a bit, you will notice that it has already been cleaned and treated. The reason is that I prefer this post to focus on the usage of RTextTools as a library rather than on cleaning the input.

That said, be aware that cleaning the input and preparing the dataset is possibly the most important part of a machine learning task.
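As a generic illustration of what that preparation can look like (not the exact steps used on this dataset), the tm package that RTextTools relies on offers the usual cleaning helpers:

library(tm)

# Illustrative cleaning sketch, not the preprocessing applied to the bundled dataset
corpus <- VCorpus(VectorSource(c("Some RAW text, with Numbers 123 and punctuation!")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)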

Happy coding.

A very simple neural network in R

R is a great tool for machine learning and data analysis. Today I am going to show you how to train a very simple neural network and use it to perform predictions.

In the past I have posted about other machine learning techniques such as linear regression. The problem with such techniques is that they are not as powerful as others; still, it is important to understand linear regression before jumping into more complex methods.

So, without further discussion, let's jump into the code!

library(neuralnet)
set.seed(42)
raw_data <- read.csv("sampleData1.csv")
head(raw_data)
train_data <- raw_data[1:75, ]
test_data <- raw_data[76:100, ]
 
model <- neuralnet(price ~ x + y, data=train_data, linear.output = TRUE)
plot(model)
 
# In order to perform predictions, we need to ensure we pass to the compute 
# function ONLY the columns that we use for the training, in this case, that 
# means the first and second columns only (they are the ones representing x and y)
predicted <- compute(model, test_data[, c(1,2)])
 
results <- data.frame(real=test_data$price, 
                      predictions=predicted$net.result,
                      error=test_data$price-predicted$net.result)
 
print(results)
print(paste("Neuronal network error is ", sum(abs(results$error))))


And now let's go step by step.

First of all let me clarify: this code is as simple as it can be, and the data is also very simple and already prepared to be used. What does this mean? It means that the data is clean and scaled, something that, unfortunately, does not occur often in real life; however, for this example it is very convenient, as it will allow us to focus on the neural network code only.

Note that this example along with the data can be found at my git repository.

library(neuralnet)
set.seed(42)

Here we are simply loading the neural network library and setting a fixed random seed; this is important to get repeatable results.

raw_data <- read.csv("sampleData1.csv")
head(raw_data)

Now we are loading the data and having a quick look at it; the result will be

> head(raw_data)
     x    y  price
1 0.44 0.68 511.14
2 0.99 0.23 717.10
3 0.84 0.29 607.91
4 0.28 0.45 270.40
5 0.07 0.83 289.88
6 0.66 0.80 830.85


As you can see our data set is very simple: 3 columns, x, y and price. Of course we will predict the price based on the values of x and y. Note that x and y are already scaled.
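If your own x and y were not already in a comparable range, a common approach (a sketch only; this dataset does not need it) is to standardise the inputs before training:

# Sketch: standardise the predictor columns (mean 0, sd 1) before training
scaled_inputs <- scale(raw_data[, c("x", "y")])
scaled_data   <- data.frame(scaled_inputs, price = raw_data$price)
head(scaled_data)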

train_data <- raw_data[1:75, ]
test_data <- raw_data[76:100, ]

Now it is time to split our data into train and test sets. Our dataset contains a total of 100 rows; we will use 75% to train and 25% to test.
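By the way, if your own data were ordered in some way (for example by price), taking the first 75 rows would bias the split; a common alternative is a random split, sketched below (not needed for this sample file):

# Sketch: random 75/25 split instead of taking the first 75 rows
set.seed(42)
train_idx  <- sample(nrow(raw_data), size = round(0.75 * nrow(raw_data)))
train_data <- raw_data[train_idx, ]
test_data  <- raw_data[-train_idx, ]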

model <- neuralnet(price ~ x + y, data=train_data)
plot(model)

We have now trained our neural network model and plotted it. When training it we specify that we want to predict the price column based on the values of columns x and y, using the train_data set.

After that we get a graphical representation of the neural network:

[Figure: plot of the trained neural network produced by plot(model)]

Pretty exciting, huh? I know, it is a very simple neural network, but given the small number of lines of code, I would say it is not bad: you have your input layer, your hidden layer and your activation function.

# In order to perform predictions, we need to ensure we pass to the compute 
# function ONLY the columns that we use for the training, in this case, that 
# means the first and second columns only (they are the ones representing x and y)
predicted <- compute(model, test_data[, c(1,2)])

Now it is time to get some predictions, but let me elaborate a bit here. The neuralnet package requires two things to predict: the first, of course, is the neural network itself; the second is the data. However, this data must contain ONLY the columns that we base our predictions on, in this case columns x and y, whereas our test_data contains columns x, y and price. In order to handle this, we use the expression

test_data[, c(1,2)]

Which simply means: give me all the rows of test_data and only columns 1 and 2.
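A slightly more robust equivalent (a small sketch; it does not change the results) is to select the predictor columns by name, so the code keeps working even if the column order ever changes:

# Sketch: select the predictor columns by name instead of by position
predicted <- compute(model, test_data[, c("x", "y")])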

results <- data.frame(real=test_data$price, 
                      predictions=predicted$net.result,
                      error=test_data$price-predicted$net.result)

Finally we just put all of our predictions into a dataframe that will contain 3 columns: the real price, the predicted price and the error in the prediction (the difference between the real price and the predicted one).

> print(results)
       real  predictions         error
76   572.31  581.1867039  -8.876703897
77   957.61  960.9513985  -3.341398527
78   518.29  504.1513515  14.138648483
79  1143.49 1161.4094913 -17.919491292
80  1211.31 1231.4814350 -20.171434994
81   784.74  781.6509211   3.089078856
82   283.70  278.5575509   5.142449086
83   684.38  676.9823336   7.397666406
84   719.46  710.3790828   9.080917151
85   292.23  305.1587486 -12.928748608
86   775.68  771.3494280   4.330572004
87   130.77  139.1130193  -8.343019287
88   801.60  783.6016326  17.998367404
89   323.55  336.4136176 -12.863617624
90   726.90  733.5795764  -6.679576423
91   661.12  668.8686304  -7.748630352
92   771.11  784.9909764 -13.880976412
93  1016.14 1024.4260261  -8.286026106
94   237.69  229.2574825   8.432517505
95   325.89  317.1122786   8.777721357
96   636.22  639.4993132  -3.279313202
97   272.12  262.1913101   9.928689931
98   696.65  694.1006299   2.549370094
99   434.53  436.8392736  -2.309273619
100  593.86  586.9172289   6.942771103

Finally, we print the predictions, the real prices and the error. As you can see, the neural network did quite well in this case (remember, we have used a very clean dataset).
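If you prefer a single number to eyeballing the table, a couple of simple summary metrics can be computed from the results data frame (just a sketch):

# Sketch: summarise the prediction errors stored in the 'results' data frame above
mae  <- mean(abs(results$error))                          # mean absolute error
mape <- mean(abs(results$error) / results$real) * 100     # mean absolute percentage error
print(paste("Mean absolute error:", round(mae, 2)))
print(paste("Mean absolute percentage error:", round(mape, 2), "%"))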

I hope this helps you with the basics of neural networks in R.

Happy coding!


Managing R libraries

One of the nicest features of R is how easy it is to install dependencies.

Simply run

install.packages("somelibrary")

And then R will contact CRAN to search for and download the desired library. But what if you want to uninstall a library? Or what if you want to get a list of all your libraries? The library() function will help you with the latter. Run it from the R prompt.

> library()

Packages in library ‘/home/moriano/R/x86_64-pc-linux-gnu-library/3.3’:

assertthat              Easy pre and post assertions.
BH                      Boost C++ Header Files
bitops                  Bitwise Operations
caTools                 Tools: moving window statistics, GIF, Base64,
ROC AUC, etc.
DBI                     R Database Interface
dplyr                   A Grammar of Data Manipulation
e1071                   Misc Functions of the Department of Statistics,
Probability Theory Group (Formerly: E1071), TU
Wien
foreach                 Provides Foreach Looping Construct for R
glmnet                  Lasso and Elastic-Net Regularized Generalized
Linear Models
tm                      Text Mining Package
tree                    Classification and Regression Trees

Packages in library ‘/usr/lib/R/site-library’:

Amelia                  Amelia II: A Program for Missing Data
digest                  Create cryptographic hash digests of R objects
evaluate                Parsing and evaluation tools that provide more
details than the default.

Packages in library ‘/usr/lib/R/library’:

base                    The R Base Package
boot                    Bootstrap Functions (originally by Angelo Canty
for S)
class                   Functions for Classification
cluster                 Cluster Analysis Extended Rousseeuw et al.
codetools               Code Analysis Tools for R

Now let's try to remove a library, for example “bitops”. In order to do so, you will need to run the following command from your shell (NOT inside the R prompt, just in bash)

$ R CMD REMOVE library_name

For example

$ R CMD REMOVE bitops
Removing from library ‘/home/moriano/R/x86_64-pc-linux-gnu-library/3.3’

Now if you run “library()” again you will see that the “bitops” package does not appear anymore.
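By the way, if you prefer to stay inside the R prompt, the same listing and removal can be done with plain R functions (a quick sketch):

# Sketch: list and remove packages without leaving R
rownames(installed.packages())   # names of every installed package
remove.packages("bitops")        # equivalent to R CMD REMOVE bitops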

To reinstall the package we can of course just run

install.packages("bitops")

But instead of that, we will install the package manually; furthermore, we will build the package ourselves and install it. This process is surprisingly simple. First let's go to the bitops page on CRAN, then download the source files.

$ wget https://cran.r-project.org/src/contrib/bitops_1.0-6.tar.gz
$ tar -xvvf bitops_1.0-6.tar.gz

Now all that is left is to build the package using R

$ R CMD build bitops
* checking for file ‘bitops/DESCRIPTION’ … OK
* preparing ‘bitops’:
* checking DESCRIPTION meta-information … OK
* cleaning src
* checking whether ‘INDEX’ is up-to-date … NO
* use ‘--force’ to remove the existing ‘INDEX’
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* building ‘bitops_1.0-6.tar.gz’

As you will see, the output of the build is a tar.gz file; that file is similar to a .deb or .rpm package file in Linux, and contains everything required to get the package installed in the R system, so let's install it.

$ R CMD INSTALL bitops_1.0-6.tar.gz
* installing to library ‘/home/moriano/R/x86_64-pc-linux-gnu-library/3.3’
* installing *source* package ‘bitops’ …
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -c bit-ops.c -o bit-ops.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG      -fpic  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -c cksum.c -o cksum.o
gcc -std=gnu99 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o bitops.so bit-ops.o cksum.o -L/usr/lib/R/lib -lR
installing to /home/moriano/R/x86_64-pc-linux-gnu-library/3.3/bitops/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (bitops)

And that is it: you have learned how to list, remove, build and install packages in R. Normally you will not need to do this, as CRAN has a very comprehensive package list, but sometimes you might need a package that is not available on CRAN, or maybe the version on CRAN is not completely up to date.
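And if you ever want to install a local source tarball like this one without leaving R, install.packages can do that too (a quick sketch, using the tarball we just built):

# Sketch: install a local source tarball from within R (same effect as R CMD INSTALL)
install.packages("bitops_1.0-6.tar.gz", repos = NULL, type = "source")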

Happy coding.


Polynomial linear regression with R

Linear regression is the most basic technique in machine learning, but it is still quite useful and surprisingly easy to use with R.

In this scenario we are going to use linear regression for predicting prices. Our dataset is basically a csv file with 100 examples and 3 columns: x, y and price; of course, price is what we want to predict.

Without further detail, here is the code

all_data <- read.csv("sampleData1.csv")
# Quick exploration of the dataset
print(paste("Number of rows", nrow(all_data)))
head(all_data)
 
# First, let's split the set: 70% for training, 30% for testing
train_data <- all_data[1:70, ]
test_data <- all_data[71:100, ]
 
# Train a simple model
model <- lm(price ~ x + y, train_data)
 
# Predict and add two extra columns 
predictions_linear <- predict(model, test_data)
 
results <- data.frame(price = test_data$price, 
                      predictions_linear = predictions_linear, 
                      error_linear = test_data$price - predictions_linear)
 
# Lets train now a model using a polynomial level of 2
model_square <- lm(price ~ poly(x + y, 2), train_data)
predictions_square <- predict(model_square, test_data)
results$predictions_square <- predictions_square
results$error_square <- test_data$price - predictions_square
 
# And now lets try polynomial of level 3
model_cube <- lm(price ~ poly(x + y, 3), train_data)
predictions_cube <- predict(model_cube, test_data)
results$predictions_cube <- predictions_cube
results$error_cube <- test_data$price - predictions_cube
 
# Finally, lets compute the total error for each of the cases, the total error
# Is equal to the sum of the absolute values of all the errors. After that
# the lower sum will give us the better approach.
 
error_linear <- sum(abs(results$error_linear))
error_square <- sum(abs(results$error_square))
error_cube <- sum(abs(results$error_cube))
 
print(paste("Linear error is ", error_linear))
print(paste("Square error is", error_square))
print(paste("Cube error is ", error_cube))


Note that the full example (including the csv file) is located in my git repository.

As always, I will try to go line by line explaining what is happening

all_data <- read.csv("sampleData1.csv")
# Quick exploration of the dataset
print(paste("Number of rows", nrow(all_data)))
head(all_data)

Here we are simply loading our csv file and figuring out how many rows it has. We will see that it has a total of 100 rows: the csv file itself has 101 lines, but the first line just contains the column names and read.csv treats it as the header.

# First, let's split the set: 70% for training, 30% for testing
train_data <- all_data[1:70, ]
test_data <- all_data[71:100, ]

Now this is machine learning 101🙂 let's split our data into training data (70%) and testing data (30%). This is very important: our training algorithm will only use the training data and will never see the testing data; we will use the testing data to see how well our algorithm is able to predict the prices.

# Train a simple model
model <- lm(price ~ x + y, train_data)

# Predict and add two extra columns 
predictions_linear <- predict(model, test_data)

With those two simple lines we have trained our model and used it to predict new prices. As always, remember that the expression “price ~ x + y” simply means “I want to predict the column ‘price’ based on columns ‘x’ and ‘y’”.

results <- data.frame(price = test_data$price, 
                      predictions_linear = predictions_linear, 
                      error_linear = test_data$price - predictions_linear)

Here I am just creating a dataframe (if you are new to R, repeat with me: dataframe = table) with 3 columns: the price (that is, the REAL price), my prediction using the linear approach and my error, which is equal to the real price minus the prediction.

# Lets train now a model using a polynomial level of 2
model_square <- lm(price ~ poly(x + y, 2), train_data)
predictions_square <- predict(model_square, test_data)
results$predictions_square <- predictions_square
results$error_square <- test_data$price - predictions_square

Here I am doing something a bit more interesting: I am still using a linear model, however I am feeding it polynomial terms instead of the plain variables. More specifically, the line

poly(x + y, 2)

produces a degree-2 polynomial of the combined value x + y, i.e. terms based on (x + y) and (x + y)^2 (poly uses orthogonal polynomials by default); note that it does not generate separate x^2 and y^2 terms for each variable. The great thing about the poly function is that you can easily use higher degrees. After we have trained our model using the squared terms, the rest is the same: predicting new results is trivial thanks to the “predict” function, and after that we simply add two new columns to our dataframe: predictions_square and error_square.
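If you want to check exactly which columns each formula generates, or give each variable its own polynomial terms, here is a quick sketch (purely illustrative; the results in this post were produced with the poly(x + y, n) form above):

# Sketch: inspect the columns a formula actually generates
head(model.matrix(price ~ poly(x + y, 2), train_data))          # terms of (x + y) up to degree 2
head(model.matrix(price ~ poly(x, 2) + poly(y, 2), train_data)) # separate terms for x and for y
# poly(x, y, degree = 2) would additionally include the x*y interaction term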

# And now let's try a polynomial of level 3
model_cube <- lm(price ~ poly(x + y, 3), train_data)
predictions_cube <- predict(model_cube, test_data)
results$predictions_cube <- predictions_cube
results$error_cube <- test_data$price - predictions_cube

This code does exactly the same as the previous one, but uses a degree-3 polynomial rather than a degree-2 one.

# Finally, lets compute the total error for each of the cases, the total error
# Is equal to the sum of the absolute values of all the errors. After that
# the lower sum will give us the better approach.

error_linear <- sum(abs(results$error_linear))
error_square <- sum(abs(results$error_square))
error_cube <- sum(abs(results$error_cube))

print(paste("Linear error is ", error_linear))
print(paste("Square error is", error_square))
print(paste("Cube error is ", error_cube))

Finally we simply use our existing data to figure out how well our predictions are doing. Before we jump to conclusions, let me show you what our predictions data frame looks like:

> head(results[,c(1,2,4,6)], n=30)
      price predictions_linear predictions_square predictions_cube
71    98.47          -61.95071           112.4797         107.1003
72   819.63          852.17016           764.9323         754.3423
73   174.44          125.00726           197.0873         206.7146
74   483.13          578.61854           402.5327         405.6330
75   534.24          606.16416           484.1695         482.9430
76   572.31          663.34225           685.3131         676.2093
77   957.61          945.44089           927.6515         917.3632
78   518.29          588.84889           532.6702         529.0892
79  1143.49         1069.59361          1184.8944        1184.6149
80  1211.31         1109.86428          1251.2045        1255.4081
81   784.74          821.07992           714.5863         704.8163
82   283.70          317.59670           312.2590         319.9665
83   684.38          741.79241           666.1870         657.5915
84   719.46          768.50069           704.7507         695.1893
85   292.23          352.22724           247.0298         256.9060
86   775.68          812.73250           675.7111         666.8554
87   130.77           55.72887           130.4234         131.8579
88   801.60          822.95852           734.4911         724.3470
89   323.55          394.79302           272.8459         282.0804
90   726.90          783.93736           628.8693         621.4259
91   661.12          737.62303           754.7074         744.2499
92   771.11          827.13656           871.4650         860.5522
93  1016.14          985.71156           985.7849         976.7296
94   237.69          237.68011           242.1002         252.0532
95   325.89          372.05452           330.3185         337.1851
96   636.22          708.19448           532.6702         529.0892
97   272.12          291.51318           272.8459         282.0804
98   696.65          754.10096           628.8693         621.4259
99   434.53          520.41216           516.1918         513.3850
100  593.86          664.37918           541.0262         537.0639


It is quite exciting to see that with so little data (we only used 70 training examples) we can already get a not-so-bad prediction.

Finally, let's print out our total errors to determine which function (linear, square, or cube) has done best.

[1] "Number of rows 100"
[1] "Linear error is  1771.68054888136"
[1] "Square error is 1493.32560142226"
[1] "Cube error is  1577.4449039908"


And this is all. Note that this article considers neither normalization nor regularization; however, it should show enough to leave you wanting more. I strongly suggest you try R if you are interested in machine learning.


The simplest machine learning code in R

Ok, so you are trying to learn R and you want an equivalent of “Hello world” but for machine learning. Most of the tutorials explain how to create software to perform digit recognition; although that is very interesting (way more than what I am going to show you here), it is definitely a “Hello world” type of program.

What we will do here is a simple linear regression. So let's dive into the R code.

plot(cars)
model <- lm(dist ~ speed, data=cars)
abline(model)
new_data <- data.frame(speed = 21)
prediction <- predict(model, new_data)
print(prediction)


Well, this will output two things. First, a chart:

[Figure: scatter plot of the cars dataset with the fitted regression line]

And it will also output a value

            1
65.00149

Now, this might not look very exciting, right? Wrong! It is extremely exciting: that line you see there is the BEST line (among non-polynomial approaches; more about that in future posts) for predicting the distance given a speed. Notice that there are INFINITE possible lines, but we managed to find the best one in a FINITE number of steps.

Lets analyse the code now.

  1. plot(cars) : The variable “cars” is simply a dataset that comes with R. It contains 50 rows and two columns, speed and distance, recording the distance (in feet) needed to stop a car given its speed (in mph): the faster it goes, the longer the stopping distance. Of course the plot function will plot all the values in a nice chart.
  2. model <- lm(dist ~ speed, data=cars) : This is a very cool line, and probably the reason why you are here; essentially this is the linear regression algorithm in a single line. Let me split it for you:
    1. lm() is a function to generate a linear model (i.e. to train a linear regression); it uses the squared error as its cost function (if that sounds weird to you, I suggest studying the machine learning course by Andrew Ng, it is free).
    2. dist ~ speed is what is called a “formula” in R; it essentially means “I want to predict ‘dist’; for that, let's use the values we have in ‘speed’”.
    3. data=cars feeds the linear regression with the data it will be trained on.
    4. Oh! I almost forgot: “<-” is the way you assign variables in R.
  3. abline(model) : This simply draws our model on the existing plot; essentially this is the line that puts our beautiful prediction line on the chart.
  4. new_data <- data.frame(speed = 21) : Here I am creating a dataframe variable (if you are new to R, just repeat with me: a dataframe is simply a table, period). It will contain a single column with a single value; in this case the column will be “speed” and the value 21. Note that in our second line, we trained the algorithm to predict a distance given a speed.
  5. prediction <- predict(model, new_data) : Now things get a bit interesting: given my model and my new data, we predict which ‘dist’ value corresponds to our given ‘speed’ value.
  6. print(prediction) : and here we simply print the result.

Now, just think for a moment how cool this is: with a few lines of code we have managed to get a relatively good result. If you look at the chart, predicting 65 as the stopping distance for a speed of 21 seems reasonable. Of course this is not perfect, but given that we have written only 6 lines of code, it is not bad at all.
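If you are curious where that 65 comes from, you can inspect the fitted coefficients yourself (a quick sketch; the numbers below are approximate):

# Sketch: the fitted line is dist = intercept + slope * speed
coef(model)
# For the cars data this gives roughly an intercept of -17.58 and a slope of 3.93,
# so for speed = 21 the prediction is about -17.58 + 3.93 * 21, which is roughly 65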

And this is all for now; I hope this helps you dive into the R language and machine learning. Note that this example does not use more than one variable to predict a given value (doing so is trivial), nor does it use polynomial features (which, again, is trivial). I have simply tried to provide the simplest machine learning and R code possible, and this is it🙂 .

Why I decided to learn R

I have spent all my professional life coding in general-purpose languages such as Java or Python, and I love those languages; each one has different pros and cons (I like the robustness of Java and the simplicity of Python). General-purpose languages are great: as their name indicates, you can use them for pretty much anything.

So then why the hell did I learn R? Well, the main reason was my interest in machine learning. During 2015 I did the machine learning course on Coursera by Andrew Ng; first of all I have to say thanks to coursera.org and Andrew Ng, the course is fantastic and really puts you in a good position to develop yourself as a machine learning developer.

Ok, fine, but then why on earth R? After all, you can do it with Java, right? Well, the answer is yes, but just because you can do something does not mean it is the right thing to do: sure, you could code machine learning algorithms in Java (there are also libraries for it), and in fact that is how I started, since after all I feel more comfortable in Java; however, the vast majority of the documentation I found on the Internet kept mentioning either Python + scikit-learn or R.

And I decided to go down the R path. Why R instead of Python? Well, the main reason is that I am curious, and because I already knew Python I wanted to try something new; and boy, R is definitely nothing like Python or Java.

R is NOT a general-purpose language; it is designed mainly for data analysis and statistics, which means that many features do not work the way you would expect coming from other languages. On the other hand, it has a fantastic set of functions for data analysis: for example, it is very easy to operate with csv files and load them into what is called a data frame (a data frame is essentially a table where you can easily manipulate rows and columns), and it is also trivial to get statistical information from a data frame, for example 90th percentiles, standard deviations, correlations, means... Plotting is another great feature that comes with R out of the box, and trust me it helps, especially when you are dealing with data you are not familiar with.
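To give a flavour of that, here is a tiny sketch using one of R's built-in data frames (mtcars, which has nothing to do with the examples above):

# Sketch: quick statistics on a built-in data frame
data(mtcars)
quantile(mtcars$mpg, 0.9)    # 90th percentile
sd(mtcars$mpg)               # standard deviation
cor(mtcars$mpg, mtcars$wt)   # correlation between two columns
mean(mtcars$mpg)             # mean
summary(mtcars)              # quick overview of every column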

Going into the machine learning area, R seems to be pretty much a de-facto standard (although Python is growing!), and the most common algorithms are available as mature libraries (regressions, random forests, support vector machines… you name it), which makes it really easy to focus on the data rather than on the code.

As a con, I would say that R has a bit of a steep learning curve (not as much as people say, but it still feels a bit weird at the beginning); I normally joke that R is the vi of programming languages (but hey, you still know how to use vi, right?).

So if you are into machine learning, even if you are a newcomer (as I am, by the way), do not be afraid to jump into R; it will feel weird at first, but then it gets very, very simple.