Title: | Data Mining and R Programming for Beginners |
---|---|
Description: | Contains functions to simplify the use of data mining methods (classification, regression, clustering, etc.), for students and beginners in R programming. Various R packages are used and wrappers are built around the main functions, to standardize the use of data mining methods (input/output): it brings a certain loss of flexibility, but also a gain of simplicity. The package name came from the French "Fouille de Données en Master 2 Informatique Décisionnelle". |
Authors: | Alexandre Blansché [aut, cre] |
Maintainer: | Alexandre Blansché <[email protected]> |
License: | GPL-3 |
Version: | 0.9.9 |
Built: | 2024-11-03 04:52:53 UTC |
Source: | https://github.com/cran/fdm2id |
Longitude and latitude of 500 car accidents during the year 2014 (source: www.data.gov.uk).
accident2014
The dataset has 500 instances described by 2 variables (coordinates).
Ensemble learning, through the AdaBoost algorithm.
ADABOOST( x, y, learningmethod, nsamples = 100, fuzzy = FALSE, tune = FALSE, seed = NULL, ... )
x |
The dataset (description/predictors), a |
y |
The target (class labels or numeric values), a |
learningmethod |
The boosted method. |
nsamples |
The number of samplings. |
fuzzy |
Indicates whether or not fuzzy classification should be used. |
tune |
If true, the function returns parameters instead of a classification model. |
seed |
A specified seed for random number generation. |
... |
Other specific parameters for the learning method. |
The classification model.
## Not run:
require (datasets)
data (iris)
ADABOOST (iris [, -5], iris [, 5], NB)
## End(Not run)
This dataset has been extracted from the WHO database and depicts alcohol consumption habits in the 27 European countries (in 2010).
alcohol
The dataset has 27 instances described by 4 variables. The variables are the average amounts of alcohol of different types consumed per year per inhabitant.
This function builds a classification model using the association rules method APRIORI.
APRIORI( train, labels, supp = 0.05, conf = 0.8, prune = FALSE, tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
supp |
The minimal support of an item set (numeric value). |
conf |
The minimal confidence of an item set (numeric value). |
prune |
A logical indicating whether to prune redundant rules or not (default: |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model, as an object of class apriori.
predict.apriori, apriori-class, apriori
require ("datasets")
data (iris)
d = discretizeDF (iris, default = list (method = "interval", breaks = 3, labels = c ("small", "medium", "large")))
APRIORI (d [, -5], d [, 5], supp = .1, conf = .9, prune = TRUE)
This class contains the classification model obtained by the APRIORI association rules method.
rules
The set of rules obtained by APRIORI.
transactions
The training set as a transaction object.
train
The training set (description). A matrix or data.frame.
labels
Class labels of the training set. Either a factor or an integer vector.
supp
The minimal support of an item set (numeric value).
conf
The minimal confidence of an item set (numeric value).
APRIORI, predict.apriori, print.apriori, summary.apriori, apriori
This function is a data augmentation technique. It duplicates rows and adds Gaussian noise to the duplicates.
augmentation(dataset, target, n = 5, sigma = 0.1, seed = NULL)
dataset |
The dataset to be augmented ( |
target |
The column index of the target variable (class label or response variable). |
n |
The scaling factor (as an integer value). |
sigma |
The baseline variance for the noise generation. |
seed |
A specified seed for random number generation. |
An augmented dataset.
require (datasets)
data (iris)
d = augmentation (iris, 5)
summary (iris)
summary (d)
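The idea of duplicating rows and perturbing the copies can be sketched in base R. This is a hypothetical stand-in, not the package's implementation: the name augment_sketch is illustrative, and it assumes that n is the final size multiplier and that the noise standard deviation of each numeric column is sigma times that column's standard deviation.

```r
# Illustrative sketch only; the scaling of the noise is an assumption.
augment_sketch <- function(dataset, target, n = 5, sigma = 0.1, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Only numeric predictors receive noise; the target column is left untouched.
  numeric_cols <- setdiff(which(sapply(dataset, is.numeric)), target)
  # Duplicate every row (n - 1) times so the result has n times as many rows.
  copies <- dataset[rep(seq_len(nrow(dataset)), times = n - 1), , drop = FALSE]
  for (j in numeric_cols) {
    noise_sd <- sigma * sd(dataset[[j]])
    copies[[j]] <- copies[[j]] + rnorm(nrow(copies), mean = 0, sd = noise_sd)
  }
  out <- rbind(dataset, copies)
  rownames(out) <- NULL
  out
}

d <- augment_sketch(iris, target = 5, n = 5, seed = 1)
dim(d)
```

The original rows are kept unchanged at the top of the result, so the class balance is preserved.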
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.
autompg
The dataset has 392 instances described by 8 variables. The first seven variables are numeric. The last variable is qualitative (car origin).
https://archive.ics.uci.edu/ml/datasets/auto+mpg
Ensemble learning, through the bagging algorithm.
BAGGING( x, y, learningmethod, nsamples = 100, bag.size = nrow(x), seed = NULL, ... )
x |
The dataset (description/predictors), a |
y |
The target (class labels or numeric values), a |
learningmethod |
The learning method to be bagged. |
nsamples |
The number of samplings. |
bag.size |
The size of the samples. |
seed |
A specified seed for random number generation. |
... |
Other specific parameters for the learning method. |
The classification model.
## Not run:
require (datasets)
data (iris)
BAGGING (iris [, -5], iris [, 5], NB)
## End(Not run)
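To make the roles of nsamples and bag.size concrete, here is a minimal base R sketch of bagging: draw bootstrap samples, fit one model per sample, and combine predictions by majority vote. All names (bagging_sketch, centroid_learn, centroid_predict, vote) are hypothetical, and a nearest-centroid classifier stands in for the learning method; the package's own implementation may differ.

```r
# Hypothetical base learner: one centroid per class.
centroid_learn <- function(x, y) {
  y <- droplevels(as.factor(y))
  t(sapply(split(as.data.frame(x), y), colMeans))  # rows = classes
}

# Predict the class whose centroid is closest (squared Euclidean distance).
centroid_predict <- function(centers, newdata) {
  nd <- as.matrix(newdata)
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(nd, 2, centers[k, ])^2)
  })
  rownames(centers)[max.col(-d2, ties.method = "first")]
}

# Fit nsamples models, each on a bootstrap sample of size bag.size.
bagging_sketch <- function(x, y, learn, nsamples = 100,
                           bag.size = nrow(x), seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  lapply(seq_len(nsamples), function(i) {
    idx <- sample(nrow(x), size = bag.size, replace = TRUE)
    learn(x[idx, , drop = FALSE], y[idx])
  })
}

# Majority vote over the individual predictions.
vote <- function(models, predict_one, newdata) {
  votes <- sapply(models, function(m) predict_one(m, newdata))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

models <- bagging_sketch(iris[, -5], iris[, 5], centroid_learn,
                         nsamples = 25, seed = 1)
pred <- vote(models, centroid_predict, iris[, -5])
mean(pred == iris[, 5])  # training accuracy of the bagged model
```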
Data were collected on the genus of flea beetle Chaetocnema, which contains three species: concinna, heikertingeri, and heptapotamica. Measurements were made on the width and angle of the aedeagus of each beetle. The goal of the original study was to form a classification rule to distinguish the three species.
beetles
The dataset has 74 instances described by 3 variables. The variables are as follows:
Width
The maximal width of the aedeagus in the forepart (in microns).
Angle
The front angle of the aedeagus (1 unit = 7.5 degrees).
Shot.put
Species of flea beetle from the genus Chaetocnema.
Lubischew, A.A. (1962) On the use of discriminant functions in taxonomy. Biometrics, 18, 455-477.
Tutorial data set (vector).
birth
The dataset is a named vector of nine values (birth years).
This class contains the classification model obtained by an ensemble method (ADABOOST or BAGGING).
models
List of models.
x
The learning set.
y
The target values.
ADABOOST, BAGGING, predict.boosting
Produce a box-and-whisker plot for clustering results.
boxclus(d, clusters, legendpos = "topleft", ...)
d |
The dataset ( |
clusters |
Cluster labels of the training set ( |
legendpos |
Position of the legend |
... |
Other parameters. |
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
boxclus (iris [, -5], km$cluster)
Longitude, latitude and population of 18 major cities in Great Britain.
britpop
The dataset has 18 instances described by 3 variables.
Performs Correspondence Analysis (CA) including supplementary row and/or column points.
CA( d, ncp = 5, row.sup = NULL, col.sup = NULL, quanti.sup = NULL, quali.sup = NULL, row.w = NULL )
d |
A data frame or a table with n rows and p columns, i.e. a contingency table. |
ncp |
The number of dimensions kept in the results (by default 5). |
row.sup |
A vector indicating the indexes of the supplementary rows. |
col.sup |
A vector indicating the indexes of the supplementary columns. |
quanti.sup |
A vector indicating the indexes of the supplementary continuous variables. |
quali.sup |
A vector indicating the indexes of the categorical supplementary variables. |
row.w |
Optional row weights (by default, a vector of 1 for uniform row weights); the weights are given only for the active individuals. |
The CA on the dataset.
CA, MCA, PCA, plot.factorial, factorial-class
data (children, package = "FactoMineR")
CA (children, row.sup = 15:18, col.sup = 6:8)
This function builds a classification model using CART.
CART( train, labels, minsplit = 1, maxdepth = log2(length(labels)), cp = NULL, tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
minsplit |
The minimum leaf size during the learning. |
maxdepth |
Set the maximum depth of any node of the final tree, with the root node counted as depth 0. |
cp |
The complexity parameter of the tree. Cross-validation is used to determine optimal cp if NULL. |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
cartdepth, cartinfo, cartleafs, cartnodes, cartplot, rpart
require (datasets)
data (iris)
CART (iris [, -5], iris [, 5])
Return the depth of a decision tree.
cartdepth(model)
model |
The decision tree. |
The depth.
CART, cartinfo, cartleafs, cartnodes, cartplot
require (datasets)
data (iris)
model = CART (iris [, -5], iris [, 5])
cartdepth (model)
Return various information on a CART model.
cartinfo(model)
model |
The decision tree. |
Various information organized into a vector.
CART, cartdepth, cartleafs, cartnodes, cartplot
require (datasets)
data (iris)
model = CART (iris [, -5], iris [, 5])
cartinfo (model)
Return the number of leaves of a decision tree.
cartleafs(model)
model |
The decision tree. |
The number of leaves.
CART, cartdepth, cartinfo, cartnodes, cartplot
require (datasets)
data (iris)
model = CART (iris [, -5], iris [, 5])
cartleafs (model)
Return the number of nodes of a decision tree.
cartnodes(model)
model |
The decision tree. |
The number of nodes.
CART, cartdepth, cartinfo, cartleafs, cartplot
require (datasets)
data (iris)
model = CART (iris [, -5], iris [, 5])
cartnodes (model)
Plot a decision tree obtained by CART.
cartplot(model, ...)
model |
The decision tree. |
... |
Other parameters. |
CART, cartdepth, cartinfo, cartleafs, cartnodes
require (datasets)
data (iris)
model = CART (iris [, -5], iris [, 5])
cartplot (model)
This function builds a classification model using Canonical Discriminant Analysis.
CDA(train, labels, tune = FALSE, ...)
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model, as an object of class cda.
plot.cda, predict.cda, cda-class
require (datasets)
data (iris)
CDA (iris [, -5], iris [, 5])
This class contains the classification model obtained by the CDA method.
proj
The projection of the dataset into the canonical base. A data.frame.
transform
The transformation matrix between. A matrix.
centers
Coordinates of the class centers. A matrix.
within
The intra-class covariance matrix. A matrix.
eig
The eigenvalues. A matrix.
dim
The number of dimensions of the canonical base (numeric value).
nb.classes
The number of classes (numeric value).
train
The training set (description). A data.frame.
labels
Class labels of the training set. Either a factor or an integer vector.
model
The prediction model.
Close the graphics device driver.
closegraphics()
exportgraphics, toggleexport, dev.off
## Not run:
data (iris)
exportgraphics ("export.pdf")
plotdata (iris [, -5], iris [, 5])
closegraphics()
## End(Not run)
Comparison of two sets of clusters
compare(clus, gt, eval = "accuracy", comp = c("max", "pairwise", "cluster"))
clus |
The extracted clusters. |
gt |
The real clusters. |
eval |
The evaluation criterion. |
comp |
Indicates whether a "max" or a "pairwise" evaluation should be used, or the evaluation for each individual "cluster". |
A numeric value indicating how much the two sets of clusters are similar.
compare.accuracy, compare.jaccard, compare.kappa, intern, stability
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
compare (km$cluster, iris [, 5])
## Not run:
compare (km$cluster, iris [, 5], eval = c ("accuracy", "kappa"), comp = "pairwise")
## End(Not run)
Comparison of two sets of clusters, using accuracy
compare.accuracy(clus, gt, comp = c("max", "pairwise", "cluster"))
clus |
The extracted clusters. |
gt |
The real clusters. |
comp |
Indicates whether a "max" or a "pairwise" evaluation should be used, or the evaluation for each individual "cluster". |
A numeric value indicating how much the two sets of clusters are similar.
compare.jaccard, compare.kappa, compare
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
compare.accuracy (km$cluster, iris [, 5])
Comparison of two sets of clusters, using Jaccard index
compare.jaccard(clus, gt, comp = c("max", "pairwise", "cluster"))
clus |
The extracted clusters. |
gt |
The real clusters. |
comp |
Indicates whether a "max" or a "pairwise" evaluation should be used, or the evaluation for each individual "cluster". |
A numeric value indicating how much the two sets of clusters are similar.
compare.accuracy, compare.kappa, compare
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
compare.jaccard (km$cluster, iris [, 5])
Comparison of two sets of clusters, using kappa
compare.kappa(clus, gt, comp = c("max", "pairwise", "cluster"))
clus |
The extracted clusters. |
gt |
The real clusters. |
comp |
Indicates whether a "max" or a "pairwise" evaluation should be used, or the evaluation for each individual "cluster". |
A numeric value indicating how much the two sets of clusters are similar.
compare.accuracy, compare.jaccard, compare
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
compare.kappa (km$cluster, iris [, 5])
Plot a confusion matrix.
confusion(predictions, gt, norm = TRUE, graph = TRUE)
predictions |
The prediction. |
gt |
The ground truth. |
norm |
Whether or not the confusion matrix is normalized |
graph |
Whether or not a graphic is displayed. |
The confusion matrix.
evaluation, performance, splitdata
require ("datasets")
data (iris)
d = splitdata (iris, 5)
model = NB (d$train.x, d$train.y)
pred = predict (model, d$test.x)
confusion (pred, d$test.y)
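What a normalized confusion matrix contains can be illustrated in base R. The function name confusion_sketch is hypothetical, and the normalization shown (each ground-truth column sums to 1) is an assumption; the package may normalize differently.

```r
# Illustrative sketch: cross-tabulate predictions against the ground truth.
confusion_sketch <- function(predictions, gt, norm = TRUE) {
  # Use a common set of levels so the matrix is always square.
  lev <- union(levels(factor(gt)), levels(factor(predictions)))
  cm <- table(predicted = factor(predictions, levels = lev),
              actual = factor(gt, levels = lev))
  # Assumed normalization: proportions within each actual class.
  if (norm) cm <- prop.table(cm, margin = 2)
  cm
}

gt   <- c("a", "a", "b", "b")
pred <- c("a", "b", "b", "b")
cm <- confusion_sketch(pred, gt)
cm
```

With this convention the diagonal holds the per-class recall.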
This data set contains measurements from quantitative NIR spectroscopy. The example studied arises from an experiment done to test the feasibility of NIR spectroscopy to measure the composition of biscuit dough pieces (formed but unbaked biscuits). Two similar sample sets were made up, with the standard recipe varied to provide a large range for each of the four constituents under investigation: fat, sucrose, dry flour, and water. The calculated percentages of these four ingredients represent the 4 responses. There are 40 samples in the calibration or training set (with sample 23 being an outlier). There are a further 32 samples in the separate prediction or validation set (with example 21 considered as an outlier). An NIR reflectance spectrum is available for each dough piece. The spectral data consist of 700 points measured from 1100 to 2498 nanometers (nm) in steps of 2 nm.
cookies cookies.desc.train cookies.desc.test cookies.y.train cookies.y.test
The cookies.desc.* datasets contain the 700 columns that correspond to the NIR reflectance spectrum. The cookies.y.* datasets contain four columns that correspond to the four constituents: fat, sucrose, dry flour, and water. The cookies.*.train datasets contain 40 rows that correspond to the calibration data. The cookies.*.test datasets contain 32 rows that correspond to the prediction data.
P. J. Brown and T. Fearn and M. Vannucci (2001) "Bayesian wavelet regression on curves with applications to a spectroscopic calibration problem", Journal of the American Statistical Association, 96(454), pp. 398-408.
Plot the Cook's distance of a linear regression model.
cookplot(model, index = NULL, labels = NULL)
model |
The model to be plotted. |
index |
The index of the variable used for the x-axis. |
labels |
The labels of the instances. |
require (datasets)
data (trees)
model = LINREG (trees [, -3], trees [, 3])
cookplot (model)
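The quantity being plotted can be computed directly with base R. This sketch assumes that an ordinary least-squares fit with stats::lm is an acceptable stand-in for LINREG; the 4/n cutoff is a common rule of thumb, not something the package documents.

```r
# Fit the same regression as in the example above, using base R only.
model <- lm(Volume ~ Girth + Height, data = trees)

# Cook's distance measures how much the fit changes when one row is removed.
cook <- cooks.distance(model)

# Rule-of-thumb cutoff (an assumption): flag observations above 4 / n.
threshold <- 4 / nrow(trees)
influential <- which(cook > threshold)

head(sort(cook, decreasing = TRUE))
# plot(cook, type = "h"); abline(h = threshold, lty = 2)  # optional display
```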
This function plots Cost Curves of several classification predictions.
cost.curves(predictions, gt, methods.names = NULL)
predictions |
The predictions of a classification model ( |
gt |
Actual labels of the dataset ( |
methods.names |
The name of the compared methods ( |
The evaluation of the predictions (numeric value).
require (datasets)
data (iris)
d = iris
levels (d [, 5]) = c ("+", "+", "-") # Building a two classes dataset
model.nb = NB (d [, -5], d [, 5])
model.lda = LDA (d [, -5], d [, 5])
pred.nb = predict (model.nb, d [, -5])
pred.lda = predict (model.lda, d [, -5])
cost.curves (cbind (pred.nb, pred.lda), d [, 5], c ("NB", "LDA"))
This is a fake dataset simulating a bank database about loan clients.
credit
The dataset has 66 instances described by 11 qualitative variables.
Generate a random dataset shaped like a square divided by a custom function
data.diag( n = 200, min = 0, max = 1, f = function(x) x, levels = NULL, graph = TRUE, seed = NULL )
n |
Number of observations in the dataset. |
min |
Minimum value on each variable. |
max |
Maximum value on each variable. |
f |
The function that separates the classes. |
levels |
Name of each class. |
graph |
A logical indicating whether or not a graphic should be plotted. |
seed |
A specified seed for random number generation. |
A randomly generated dataset.
data.parabol, data.target1, data.target2, data.twomoons, data.xor
data.diag ()
Generate a random multidimensional Gaussian mixture.
data.gauss( n = 1000, k = 2, prob = rep(1/k, k), mu = cbind(rep(0, k), seq(from = 0, by = 3, length.out = k)), cov = rep(list(matrix(c(6, 0.9, 0.9, 0.3), ncol = 2, nrow = 2)), k), levels = NULL, graph = TRUE, seed = NULL )
n |
Number of observations. |
k |
The number of classes. |
prob |
The a priori probability of each class. |
mu |
The means of the gaussian distributions. |
cov |
The covariance of the gaussian distributions. |
levels |
Name of each class. |
graph |
A logical indicating whether or not a graphic should be plotted. |
seed |
A specified seed for random number generation. |
A randomly generated dataset.
data.diag, data.parabol, data.target2, data.twomoons, data.xor
data.gauss ()
Generate a random dataset shaped like a parabola and a Gaussian distribution.
data.parabol( n = c(500, 100), xlim = c(-3, 3), center = c(0, 4), coeff = 0.5, sigma = c(0.5, 0.5), levels = NULL, graph = TRUE, seed = NULL )
n |
Number of observations in each class. |
xlim |
Minimum and maximum on the x axis. |
center |
Coordinates of the center of the gaussian distribution. |
coeff |
Coefficient of the parabola. |
sigma |
Variance in each class. |
levels |
Name of each class. |
graph |
A logical indicating whether or not a graphic should be plotted. |
seed |
A specified seed for random number generation. |
A randomly generated dataset.
data.diag, data.target1, data.target2, data.twomoons, data.xor
data.parabol ()
Generate a random dataset shaped like a target.
data.target1( r = 1:3, n = 200, sigma = 0.1, levels = NULL, graph = TRUE, seed = NULL )
r |
Radius of each class. |
n |
Number of observations in each class. |
sigma |
Variance in each class. |
levels |
Name of each class. |
graph |
A logical indicating whether or not a graphic should be plotted. |
seed |
A specified seed for random number generation. |
A randomly generated dataset.
data.diag, data.parabol, data.target2, data.twomoons, data.xor
data.target1 ()
Generate a random dataset shaped like a target.
data.target2( minr = c(0, 2), maxr = minr + 1, initn = 1000, levels = NULL, graph = TRUE, seed = NULL )
minr |
Minimum radius of each class. |
maxr |
Maximum radius of each class. |
initn |
Number of observations at the beginning of the generation process. |
levels |
Name of each class. |
graph |
A logical indicating whether or not a graphic should be plotted. |
seed |
A specified seed for random number generation. |
A randomly generated dataset.
data.diag, data.parabol, data.target1, data.twomoons, data.xor
data.target2 ()
Generate a random dataset shaped like two moons.
data.twomoons( r = 1, n = 200, sigma = 0.1, levels = NULL, graph = TRUE, seed = NULL )
r |
Radius of each class. |
n |
Number of observations in each class. |
sigma |
Variance in each class. |
levels |
Name of each class. |
graph |
A logical indicating whether or not a graphic should be plotted. |
seed |
A specified seed for random number generation. |
A randomly generated dataset.
data.diag, data.parabol, data.target1, data.target2, data.xor
data.twomoons ()
Generate "XOR" dataset.
data.xor( n = 100, ndim = 2, sigma = 0.25, levels = NULL, graph = TRUE, seed = NULL )
n |
Number of observations in each cluster. |
ndim |
The number of dimensions (2^ndim clusters are formed, grouped into two classes). |
sigma |
The variance. |
levels |
Name of each class. |
graph |
A logical indicating whether or not a graphic should be plotted. |
seed |
A specified seed for random number generation. |
A randomly generated dataset.
data.diag, data.gauss, data.parabol, data.target2, data.twomoons
data.xor ()
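The structure described for ndim (2^ndim clusters grouped into two classes) can be sketched in base R. The function xor_sketch is hypothetical: it assumes the clusters sit at the corners of the unit hypercube, that the class is the parity of the corner coordinates, and that sigma is used as a standard deviation. The package may place or scale the clusters differently.

```r
# Illustrative sketch of an XOR-style dataset; not the package's generator.
xor_sketch <- function(n = 100, ndim = 2, sigma = 0.25, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # One cluster center per corner of the unit hypercube: 2^ndim corners.
  corners <- as.matrix(expand.grid(rep(list(c(0, 1)), ndim)))
  centers <- corners[rep(seq_len(nrow(corners)), each = n), , drop = FALSE]
  # Add Gaussian noise around each center.
  x <- centers + matrix(rnorm(length(centers), sd = sigma), nrow = nrow(centers))
  dimnames(x) <- list(NULL, paste0("x", seq_len(ndim)))
  # Class = parity of the corner coordinates (the XOR rule when ndim = 2).
  data.frame(x, class = factor(rowSums(centers) %% 2))
}

d <- xor_sketch(n = 100, ndim = 2, seed = 42)
table(d$class)
```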
Synthetic dataset.
data1
240 observations described by 4 variables and grouped into 16 classes.
Alexandre Blansché [email protected]
Synthetic dataset.
data2
500 observations described by 10 variables and grouped into 3 classes.
Alexandre Blansché [email protected]
Synthetic dataset.
data3
300 observations described by 3 variables and grouped into 3 classes.
Alexandre Blansché [email protected]
This class contains a dataset divided into four parts: the training set and test set, description and class labels.
train.x
The training set (description), as a data.frame or a matrix.
train.y
The training set (target), as a vector or a factor.
test.x
The test set (description), as a data.frame or a matrix.
test.y
The test set (target), as a vector or a factor.
This class contains the model obtained by the DBSCAN method.
cluster
A vector of integers indicating the cluster to which each point is allocated.
eps
Reachability distance (parameter).
MinPts
Reachability minimum no. of points (parameter).
isseed
A logical vector indicating whether a point is a seed (not border, not noise).
data
The dataset that has been used to fit the model (as a matrix).
Run the DBSCAN algorithm for clustering.
DBSCAN(d, minpts, epsilonDist, ...)
d |
The dataset ( |
minpts |
Reachability minimum no. of points. |
epsilonDist |
Reachability distance. |
... |
Other parameters. |
A clustering model obtained by DBSCAN.
dbscan, dbs-class, distplot, predict.dbs
require (datasets)
data (iris)
DBSCAN (iris [, -5], minpts = 5, epsilonDist = 1)
The dataset contains results from two athletics competitions: the 2004 Olympic Games in Athens and the 2004 Decastar.
decathlon
The dataset has 41 instances described by 13 variables. The variables are as follows:
100m
In seconds.
Long.jump
In meters.
Shot.put
In meters.
High.jump
In meters.
400m
In seconds.
110m.h
In seconds.
Discus.throw
In meters.
Pole.vault
In meters.
Javelin.throw
In meters.
1500m
In seconds.
Rank
The rank at the competition.
Points
The number of points obtained by the athlete.
Competition
Olympics or Decastar.
https://husson.github.io/data.html
Plot the distance to the k nearest neighbours of each object in decreasing order. Mostly used to determine the eps parameter for the dbscan function.
distplot(k, d, h = -1)
k |
The |
d |
The dataset ( |
h |
The y-coordinate at which a horizontal line should be drawn. |
require (datasets)
data (iris)
distplot (5, iris [, -5], h = .65)
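The curve behind this plot can be computed in base R. The helper knn_dist is hypothetical and assumes Euclidean distance and that the plotted value is the distance to the k-th nearest neighbour; the package may compute the curve differently.

```r
# Distance from each observation to its k-th nearest neighbour.
knn_dist <- function(d, k) {
  dm <- as.matrix(dist(d))  # full pairwise Euclidean distance matrix
  # Position 1 of each sorted row is the point itself (distance 0).
  apply(dm, 1, function(row) sort(row)[k + 1])
}

kd <- sort(knn_dist(iris[, -5], k = 5), decreasing = TRUE)
head(kd)
# plot(kd, type = "l"); abline(h = 0.65, lty = 2)  # optional display
```

A sharp bend ("knee") in the sorted curve is the usual visual cue for choosing eps.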
Run the EM algorithm for clustering.
EM(d, clusters, model = "VVV", ...)
d |
The dataset ( |
clusters |
Either an integer (the number of clusters) or a ( |
model |
A character string indicating the model. The help file for |
... |
Other parameters. |
A clustering model obtained by EM.
require (datasets)
data (iris)
EM (iris [, -5], 3) # Default initialization
km = KMEANS (iris [, -5], k = 3)
EM (iris [, -5], km$cluster) # Initialization with another clustering method
This class contains the model obtained by the EM method.
modelName
A character string indicating the model. The help file for mclustModelNames describes the available models.
prior
Specification of a conjugate prior on the means and variances.
n
The number of observations in the dataset.
d
The number of variables in the dataset.
G
The number of components of the mixture.
z
A matrix whose [i,k]th entry is the conditional probability of the ith observation belonging to the kth component of the mixture.
parameters
A named list giving the parameters of the model.
control
A list of control parameters for EM.
loglik
The log likelihood for the data in the mixture model.
cluster
A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
Measuring the height of a tree is not an easy task. Is it possible to estimate the height as a function of the circumference of the trunk?
eucalyptus
The dataset has 1429 instances (eucalyptus trees) with 2 measurements: the height and the circumference.
http://www.cmap.polytechnique.fr/~lepennec/fr/teaching/
Evaluate the predictions of a classification or a regression model.
evaluation( predictions, gt, eval = ifelse(is.factor(gt), "accuracy", "r2"), ... )
predictions |
The predictions of a classification model ( |
gt |
The ground truth of the dataset ( |
eval |
The evaluation method. |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
confusion, evaluation.accuracy, evaluation.fmeasure, evaluation.fowlkesmallows, evaluation.goodness, evaluation.jaccard, evaluation.kappa, evaluation.precision, evaluation.recall, evaluation.msep, evaluation.r2, performance
require (datasets)
data (iris)
d = splitdata (iris, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
# Default evaluation for classification
evaluation (pred.nb, d$test.y)
# Evaluation with two criteria
evaluation (pred.nb, d$test.y, eval = c ("accuracy", "kappa"))
data (trees)
d = splitdata (trees, 3)
model.linreg = LINREG (d$train.x, d$train.y)
pred.linreg = predict (model.linreg, d$test.x)
# Default evaluation for regression
evaluation (pred.linreg, d$test.y)
Evaluate the predictions of a classification model according to accuracy.
evaluation.accuracy(predictions, gt, ...)
predictions |
The predictions of a classification model ( |
gt |
The ground truth ( |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
evaluation.fmeasure, evaluation.fowlkesmallows, evaluation.goodness, evaluation.jaccard, evaluation.kappa, evaluation.precision, evaluation.recall, evaluation
require (datasets)
data (iris)
d = splitdata (iris, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
evaluation.accuracy (pred.nb, d$test.y)
Evaluate the predictions of a regression model according to the adjusted R2.
evaluation.adjr2(predictions, gt, nrow = length(predictions), ncol, ...)
predictions |
The predictions of a regression model ( |
gt |
The ground truth ( |
nrow |
Number of observations. |
ncol |
Number of variables |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
require (datasets)
data (trees)
d = splitdata (trees, 3)
model.linreg = LINREG (d$train.x, d$train.y)
pred.linreg = predict (model.linreg, d$test.x)
evaluation.r2 (pred.linreg, d$test.y)
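The usual textbook definition of the adjusted R2 penalizes R2 by the number of predictors p relative to the number of observations n. The sketch below applies that formula in base R and checks it against `summary.lm`. The helper `adjr2_sketch` is hypothetical, and treating its `p` as the equivalent of the `ncol` argument above is an assumption, not something the package documentation states.

```r
# Hypothetical helper (not part of fdm2id): textbook adjusted R2.
# n = number of observations, p = number of predictors (assumed meaning).
adjr2_sketch = function(predictions, gt, n = length(predictions), p) {
  r2 = 1 - sum((gt - predictions)^2) / sum((gt - mean(gt))^2)
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Ordinary least squares fit on the trees dataset (two predictors).
fit   = lm(Volume ~ Girth + Height, data = datasets::trees)
adjr2 = adjr2_sketch(fitted(fit), datasets::trees$Volume, p = 2)
print(adjr2)
```

On training data with an intercept, this value agrees with `summary(fit)$adj.r.squared`.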
Evaluate the predictions of a classification model according to the F-measure index.
evaluation.fmeasure(predictions, gt, beta = 1, positive = levels(gt)[1], ...)
predictions |
The predictions of a classification model ( |
gt |
The ground truth ( |
beta |
The weight given to precision. |
positive |
The label of the positive class. |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
evaluation.accuracy, evaluation.fowlkesmallows, evaluation.goodness, evaluation.jaccard, evaluation.kappa, evaluation.precision, evaluation.recall, evaluation
require (datasets)
data (iris)
d = iris
levels (d [, 5]) = c ("+", "+", "-") # Building a two classes dataset
d = splitdata (d, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
evaluation.fmeasure (pred.nb, d$test.y)
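For a two-class problem, precision, recall and the F-measure can all be derived from the counts of true positives, false positives and false negatives. The following base-R sketch uses the standard F-beta formula. The helper `prf_sketch` is hypothetical, and how fdm2id interprets `beta` internally is not confirmed here.

```r
# Hypothetical helper (not part of fdm2id): precision, recall and F-beta
# for a binary problem, computed from confusion counts.
prf_sketch = function(predictions, gt, positive = levels(gt)[1], beta = 1) {
  tp = sum(predictions == positive & gt == positive)  # true positives
  fp = sum(predictions == positive & gt != positive)  # false positives
  fn = sum(predictions != positive & gt == positive)  # false negatives
  precision = tp / (tp + fp)
  recall    = tp / (tp + fn)
  f = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
  c(precision = precision, recall = recall, fmeasure = f)
}

gt   = factor(c("+", "+", "+", "-", "-", "-"), levels = c("+", "-"))
pred = factor(c("+", "+", "-", "+", "-", "-"), levels = c("+", "-"))
prf  = prf_sketch(pred, gt)  # tp = 2, fp = 1, fn = 1
print(prf)
```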
Evaluate the predictions of a classification model according to the Fowlkes–Mallows index.
evaluation.fowlkesmallows(predictions, gt, positive = levels(gt)[1], ...)
predictions |
The predictions of a classification model ( |
gt |
The ground truth ( |
positive |
The label of the positive class. |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
evaluation.accuracy, evaluation.fmeasure, evaluation.goodness, evaluation.jaccard, evaluation.kappa, evaluation.precision, evaluation.recall, evaluation
require (datasets)
data (iris)
d = iris
levels (d [, 5]) = c ("+", "+", "-") # Building a two classes dataset
d = splitdata (d, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
evaluation.fowlkesmallows (pred.nb, d$test.y)
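In its binary classification form, the Fowlkes–Mallows index is the geometric mean of precision and recall. A minimal base-R sketch under that definition follows. The helper `fm_sketch` and the toy data are hypothetical, and the package may compute the index differently.

```r
# Hypothetical helper (not part of fdm2id): binary Fowlkes-Mallows index,
# i.e. sqrt(precision * recall).
fm_sketch = function(predictions, gt, positive = levels(gt)[1]) {
  tp = sum(predictions == positive & gt == positive)
  fp = sum(predictions == positive & gt != positive)
  fn = sum(predictions != positive & gt == positive)
  sqrt((tp / (tp + fp)) * (tp / (tp + fn)))
}

gt   = factor(c("+", "+", "+", "-", "-"), levels = c("+", "-"))
pred = factor(c("+", "+", "+", "+", "-"), levels = c("+", "-"))
fm   = fm_sketch(pred, gt)  # precision = 3/4, recall = 1
print(fm)
```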
Evaluate the predictions of a classification model according to the Goodness index.
evaluation.goodness(predictions, gt, beta = 1, positive = levels(gt)[1], ...)
predictions |
The predictions of a classification model ( |
gt |
The ground truth ( |
beta |
The weight given to precision. |
positive |
The label of the positive class. |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
evaluation.accuracy, evaluation.fmeasure, evaluation.fowlkesmallows, evaluation.jaccard, evaluation.kappa, evaluation.precision, evaluation.recall, evaluation
require (datasets)
data (iris)
d = iris
levels (d [, 5]) = c ("+", "+", "-") # Building a two classes dataset
d = splitdata (d, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
evaluation.goodness (pred.nb, d$test.y)
Evaluate the predictions of a classification model according to the Jaccard index.
evaluation.jaccard(predictions, gt, positive = levels(gt)[1], ...)
predictions |
The predictions of a classification model ( |
gt |
The ground truth ( |
positive |
The label of the positive class. |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
evaluation.accuracy, evaluation.fmeasure, evaluation.fowlkesmallows, evaluation.goodness, evaluation.kappa, evaluation.precision, evaluation.recall, evaluation
require (datasets)
data (iris)
d = iris
levels (d [, 5]) = c ("+", "+", "-") # Building a two classes dataset
d = splitdata (d, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
evaluation.jaccard (pred.nb, d$test.y)
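For a binary problem, the Jaccard index of the positive class is commonly defined as TP / (TP + FP + FN): the overlap between predicted and actual positives divided by their union. The base-R sketch below follows that definition. The helper `jaccard_sketch` is hypothetical and is not taken from the package source.

```r
# Hypothetical helper (not part of fdm2id): Jaccard index of the positive class.
jaccard_sketch = function(predictions, gt, positive = levels(gt)[1]) {
  tp = sum(predictions == positive & gt == positive)
  fp = sum(predictions == positive & gt != positive)
  fn = sum(predictions != positive & gt == positive)
  tp / (tp + fp + fn)
}

gt   = factor(c("+", "+", "+", "-", "-"), levels = c("+", "-"))
pred = factor(c("+", "+", "+", "+", "-"), levels = c("+", "-"))
jac  = jaccard_sketch(pred, gt)  # tp = 3, fp = 1, fn = 0
print(jac)
```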
Evaluate the predictions of a classification model according to kappa.
evaluation.kappa(predictions, gt, ...)
predictions |
The predictions of a classification model ( |
gt |
The ground truth ( |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
evaluation.accuracy, evaluation.fmeasure, evaluation.fowlkesmallows, evaluation.goodness, evaluation.jaccard, evaluation.kappa, evaluation.precision, evaluation.recall, evaluation
require (datasets)
data (iris)
d = splitdata (iris, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
evaluation.kappa (pred.nb, d$test.y)
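Assuming "kappa" refers to Cohen's kappa, the statistic compares the observed agreement with the agreement expected by chance given the marginal class frequencies. The following base-R sketch computes it from a confusion table. The helper `kappa_sketch` is hypothetical and the package's implementation may differ.

```r
# Hypothetical helper (not part of fdm2id): Cohen's kappa from a confusion table.
kappa_sketch = function(predictions, gt) {
  lev = union(levels(factor(gt)), levels(factor(predictions)))
  tab = table(factor(predictions, levels = lev), factor(gt, levels = lev))
  n   = sum(tab)
  po  = sum(diag(tab)) / n                       # observed agreement
  pe  = sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

gt   = factor(c("a", "a", "b", "b"))
pred = factor(c("a", "a", "b", "a"), levels = c("a", "b"))
k    = kappa_sketch(pred, gt)  # po = 0.75, pe = 0.5
print(k)
```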
Evaluate the predictions of a regression model according to the mean squared error of prediction (MSEP).
evaluation.msep(predictions, gt, ...)
predictions |
The predictions of a regression model ( |
gt |
The ground truth ( |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
require (datasets)
data (trees)
d = splitdata (trees, 3)
model.lin = LINREG (d$train.x, d$train.y)
pred.lin = predict (model.lin, d$test.x)
evaluation.msep (pred.lin, d$test.y)
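The mean squared error of prediction is the average squared difference between predicted and observed values. A one-line base-R sketch follows. The helper `msep_sketch` and the toy vectors are hypothetical and independent of the package.

```r
# Hypothetical helper (not part of fdm2id): mean squared error of prediction.
msep_sketch = function(predictions, gt) mean((gt - predictions)^2)

gt   = c(1, 2, 5)
pred = c(1, 2, 3)
msep = msep_sketch(pred, gt)  # squared errors are 0, 0 and 4
print(msep)
```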
Evaluate the predictions of a classification model according to precision. Works only for two-class problems.
evaluation.precision(predictions, gt, positive = levels(gt)[1], ...)
predictions |
The predictions of a classification model ( |
gt |
The ground truth ( |
positive |
The label of the positive class. |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
evaluation.accuracy, evaluation.fmeasure, evaluation.fowlkesmallows, evaluation.goodness, evaluation.jaccard, evaluation.kappa, evaluation.recall, evaluation
require (datasets)
data (iris)
d = iris
levels (d [, 5]) = c ("+", "+", "-") # Building a two classes dataset
d = splitdata (d, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
evaluation.precision (pred.nb, d$test.y)
Evaluate the predictions of a regression model according to R2.
evaluation.r2(predictions, gt, ...)
predictions |
The predictions of a regression model ( |
gt |
The ground truth ( |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
require (datasets)
data (trees)
d = splitdata (trees, 3)
model.linreg = LINREG (d$train.x, d$train.y)
pred.linreg = predict (model.linreg, d$test.x)
evaluation.r2 (pred.linreg, d$test.y)
Evaluate the predictions of a classification model according to recall. Works only for two-class problems.
evaluation.recall(predictions, gt, positive = levels(gt)[1], ...)
predictions |
The predictions of a classification model ( |
gt |
The ground truth ( |
positive |
The label of the positive class. |
... |
Other parameters. |
The evaluation of the predictions (numeric value).
evaluation.accuracy, evaluation.fmeasure, evaluation.fowlkesmallows, evaluation.goodness, evaluation.jaccard, evaluation.kappa, evaluation.precision, evaluation
require (datasets)
data (iris)
d = iris
levels (d [, 5]) = c ("+", "+", "-") # Building a two classes dataset
d = splitdata (d, 5)
model.nb = NB (d$train.x, d$train.y)
pred.nb = predict (model.nb, d$test.x)
evaluation.recall (pred.nb, d$test.y)
Starts the graphics device driver
exportgraphics(file, type = tail(strsplit(file, split = "\\.")[[1]], 1), ...)
file |
A character string giving the name of the file. |
type |
The type of graphics device. |
... |
Other parameters. |
closegraphics, toggleexport, Devices
## Not run:
data (iris)
exportgraphics ("export.pdf")
plotdata (iris [, -5], iris [, 5])
closegraphics()
## End(Not run)
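The default value of `type` in the usage above is derived from the file extension. The snippet below evaluates that same expression in isolation to show what it yields; the second file name is a made-up example.

```r
# Same expression as the default of `type`: text after the last dot.
file = "export.pdf"
type = utils::tail(strsplit(file, split = "\\.")[[1]], 1)
print(type)

# With several dots, only the last component is kept (hypothetical file name).
type2 = utils::tail(strsplit("figures/plot.v2.png", split = "\\.")[[1]], 1)
print(type2)
```

A file name without any dot would return the whole name, so passing `type` explicitly seems safer in that case.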
Toggle graphic exports on and off
exportgraphics.off()
exportgraphics.on()
toggleexport(export = NULL)
toggleexport.off()
toggleexport.on()
export |
If |
## Not run:
data (iris)
toggleexport (FALSE)
exportgraphics ("export.pdf")
plotdata (iris [, -5], iris [, 5])
closegraphics()
toggleexport (TRUE)
exportgraphics ("export.pdf")
plotdata (iris [, -5], iris [, 5])
closegraphics()
## End(Not run)
This class contains the classification model obtained by the CDA method.
CA, MCA, PCA, plot.factorial
Apply a classification method after a subset of features has been selected.
FEATURESELECTION( train, labels, algorithm = c("ranking", "forward", "backward", "exhaustive"), unieval = if (algorithm[1] == "ranking") c("fisher", "fstat", "relief", "inertiaratio") else NULL, uninb = NULL, unithreshold = NULL, multieval = if (algorithm[1] == "ranking") NULL else c("cfs", "fstat", "inertiaratio", "wrapper"), wrapmethod = NULL, mainmethod = wrapmethod, tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
algorithm |
The feature selection algorithm. |
unieval |
The (univariate) evaluation criterion. |
uninb |
The number of selected features (univariate evaluation). |
unithreshold |
The threshold for selecting features (univariate evaluation). |
multieval |
The (multivariate) evaluation criterion. |
wrapmethod |
The classification method used for the wrapper evaluation. |
mainmethod |
The final method used for data classification. If a wrapper evaluation is used, the same classification method should be used. |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
selectfeatures, predict.selection, selection-class
## Not run:
require (datasets)
data (iris)
FEATURESELECTION (iris [, -5], iris [, 5], uninb = 2, mainmethod = LDA)
## End(Not run)
This function facilitates the selection of a subset from a set of rules.
filter.rules( rules, pattern = NULL, left = pattern, right = pattern, removeMatches = FALSE )
rules |
A set of rules. |
pattern |
A pattern to match (antecedent and consequent): a character string. |
left |
A pattern to match (antecedent only): a character string. |
right |
A pattern to match (consequent only): a character string. |
removeMatches |
A logical indicating whether to remove matching rules ( |
The filtered set of rules.
require ("arules")
data ("Adult")
r = apriori (Adult)
filter.rules (r, right = "marital-status=")
subset (r, subset = rhs %pin% "marital-status=")
Most frequent words of the corpus.
frequentwords( corpus, nb, mincount = 5, minphrasecount = NULL, ngram = 1, lang = "en", stopwords = lang )
corpus |
The corpus of documents (a vector of characters) or the vocabulary of the documents (result of function |
nb |
The number of words to be returned. |
mincount |
Minimum count for a word to be considered frequent. |
minphrasecount |
Minimum count for a collocation (phrase) to be considered frequent. |
ngram |
Maximum size of n-grams. |
lang |
The language of the documents (NULL if no stemming). |
stopwords |
Stopwords, or the language of the documents. NULL if stop words should not be removed. |
The most frequent words of the corpus.
## Not run:
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
frequentwords (text, 100)
vocab = getvocab (text)
frequentwords (vocab, 100)
## End(Not run)
This function removes all redundant rules, keeping only the most general ones.
general.rules(r)
r |
A set of rules. |
A set of rules, without redundancy.
require ("arules")
data ("Adult")
r = apriori (Adult)
inspect (general.rules (r))
Extract words and phrases from a corpus of documents.
getvocab( corpus, mincount = 5, minphrasecount = NULL, ngram = 1, lang = "en", stopwords = lang, ... )
corpus |
The corpus of documents (a vector of characters). |
mincount |
Minimum count for a word to be considered frequent. |
minphrasecount |
Minimum count for a collocation (phrase) to be considered frequent. |
ngram |
Maximum size of n-grams. |
lang |
The language of the documents (NULL if no stemming). |
stopwords |
Stopwords, or the language of the documents. NULL if stop words should not be removed. |
... |
Other parameters. |
The vocabulary used in the corpus of documents.
plotzipf, stopwords, create_vocabulary
## Not run:
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
vocab1 = getvocab (text) # With stemming
nrow (vocab1)
vocab2 = getvocab (text, lang = NULL) # Without stemming
nrow (vocab2)
## End(Not run)
This function builds a classification model using Gradient Boosting.
GRADIENTBOOSTING( train, labels, ntree = 500, learningrate = 0.3, tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
ntree |
The number of trees in the forest. |
learningrate |
The learning rate (between 0 and 1). |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
## Not run:
require (datasets)
data (iris)
GRADIENTBOOSTING (iris [, -5], iris [, 5])
## End(Not run)
Run the HCA method for clustering.
HCA(d, method = c("ward", "single"), k = NULL, ...)
d |
The dataset ( |
method |
Character string defining the clustering method. |
k |
The number of clusters. |
... |
Other parameters. |
The cluster hierarchy (hca object).
require (datasets)
data (iris)
HCA (iris [, -5], method = "ward", k = 3)
Evaluate a clustering algorithm according to internal criteria.
intern(clus, d, eval = "intraclass", type = c("global", "cluster"))
clus |
The extracted clusters. |
d |
The dataset. |
eval |
The evaluation criteria. |
type |
Indicates whether a "global" or a "cluster"-wise evaluation should be used. |
The evaluation of the clustering.
compare, stability, intern.dunn, intern.interclass, intern.intraclass
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
intern (km$clus, iris [, -5])
intern (km$clus, iris [, -5], type = "cluster")
intern (km$clus, iris [, -5], eval = c ("intraclass", "interclass"))
intern (km$clus, iris [, -5], eval = c ("intraclass", "interclass"), type = "cluster")
Evaluate a clustering algorithm according to Dunn's index.
intern.dunn(clus, d, type = c("global"))
clus |
The extracted clusters. |
d |
The dataset. |
type |
Indicates whether a "global" or a "cluster"-wise evaluation should be used. |
The evaluation of the clustering.
intern, intern.interclass, intern.intraclass
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
intern.dunn (km$clus, iris [, -5])
Evaluate a clustering algorithm according to interclass inertia.
intern.interclass(clus, d, type = c("global", "cluster"))
clus |
The extracted clusters. |
d |
The dataset. |
type |
Indicates whether a "global" or a "cluster"-wise evaluation should be used. |
The evaluation of the clustering.
intern, intern.dunn, intern.intraclass
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
intern.interclass (km$clus, iris [, -5])
Evaluate a clustering algorithm according to intraclass inertia.
intern.intraclass(clus, d, type = c("global", "cluster"))
clus |
The extracted clusters. |
d |
The dataset. |
type |
Indicates whether a "global" or a "cluster"-wise evaluation should be used. |
The evaluation of the clustering.
intern, intern.dunn, intern.interclass
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
intern.intraclass (km$clus, iris [, -5])
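Intraclass and interclass inertia are usually understood as the within-cluster and between-cluster sums of squared distances to the centroids. The sketch below computes both with base R and `stats::kmeans` rather than with fdm2id. Whether the package normalizes these quantities (for instance by the number of observations) is not stated here, so the raw sums of squares are an assumption.

```r
d = datasets::iris[, -5]
set.seed(1)
km = stats::kmeans(d, centers = 3, nstart = 10)

# Intraclass inertia per cluster: squared distances to the cluster centroid.
within = sapply(split(d, km$cluster),
                function(g) sum(scale(g, scale = FALSE)^2))

# Total inertia around the global centroid; the remainder is interclass inertia.
total   = sum(scale(d, scale = FALSE)^2)
between = total - sum(within)

print(within)
print(between)
```

These values match the `withinss` and `betweenss` components returned by `stats::kmeans`.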
This is a dataset from the UCI repository. The radar data was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. See the paper for more details. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass through the ionosphere. Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this database are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal. One attribute with constant value has been removed.
ionosphere
The dataset has 351 instances described by 34 variables. The last variable is the class.
https://archive.ics.uci.edu/ml/datasets/ionosphere
Apply the Kaiser rule to determine the appropriate number of PCA axes.
kaiser(pca)
pca |
The PCA result (object of class |
require (datasets)
data (iris)
pca = PCA (iris, quali.sup = 5)
kaiser (pca)
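The Kaiser rule is commonly stated as follows: for a PCA on standardized variables, keep the axes whose eigenvalue exceeds 1, which is the average eigenvalue. The sketch below applies that rule with base R on the correlation matrix, independently of the package's `PCA` object. How `kaiser` itself extracts the eigenvalues is not shown in this documentation.

```r
# Eigenvalues of the correlation matrix = variances of the standardized PCA axes.
x   = datasets::iris[, -5]
eig = eigen(cor(x))$values

# Kaiser rule: keep axes with an eigenvalue above 1.
n_axes = sum(eig > 1)
print(round(eig, 3))
print(n_axes)
```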
This function builds a kernel regression model.
KERREG(x, y, bandwidth = 1, tune = FALSE, ...)
x |
Predictor |
y |
Response |
bandwidth |
The bandwidth parameter. |
tune |
If true, the function returns parameters instead of a regression model. |
... |
Other parameters. |
The regression model, as an object of class model-class.
require (datasets)
data (trees)
KERREG (trees [, -3], trees [, 3])
Run K-means for clustering.
KMEANS( d, k = 9, criterion = c("none", "pseudo-F"), graph = FALSE, nstart = 10, ... )
d |
The dataset ( |
k |
The number of clusters. |
criterion |
The criterion for cluster number selection. If |
graph |
A logical indicating whether or not a graphic should be plotted (cluster number selection). |
nstart |
Define how many random sets should be chosen. |
... |
Other parameters. |
The clustering (kmeans object).
require (datasets)
data (iris)
KMEANS (iris [, -5], k = 3)
KMEANS (iris [, -5], criterion = "pseudo-F") # With automatic detection of the number of clusters
Estimate the optimal number of clusters for the K-means clustering method.
kmeans.getk( d, max = 9, criterion = "pseudo-F", graph = TRUE, nstart = 10, seed = NULL )
d |
The dataset ( |
max |
The maximum number of clusters. Values from 2 to |
criterion |
The criterion to be optimized. |
graph |
A logical indicating whether or not a graphic should be plotted. |
nstart |
The number of random sets chosen for |
seed |
A specified seed for random number generation. |
The optimal number of clusters for the K-means clustering method according to the chosen criterion.
require (datasets)
data (iris)
kmeans.getk (iris [, -5])
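The "pseudo-F" criterion is assumed here to be the Calinski-Harabasz statistic, the ratio of between-cluster to within-cluster variance, each divided by its degrees of freedom. The sketch below scans k from 2 to 9 with `stats::kmeans` and keeps the k with the highest value. It is a base-R illustration under that assumption, not the package's code.

```r
d  = datasets::iris[, -5]
n  = nrow(d)
ks = 2:9

set.seed(42)
# Pseudo-F for each candidate k: (B / (k - 1)) / (W / (n - k)).
pseudo_f = sapply(ks, function(k) {
  km = stats::kmeans(d, centers = k, nstart = 10)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
})

best_k = ks[which.max(pseudo_f)]
print(data.frame(k = ks, pseudo_f = pseudo_f))
print(best_k)
```

Because K-means starts from random centers, fixing the seed (as the `seed` argument above suggests) keeps the selected k reproducible.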
This function builds a classification model using the k-Nearest Neighbors (k-NN) method.
KNN(train, labels, k = 1:10, tune = FALSE, ...)
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
k |
The k parameter. |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
require (datasets)
data (iris)
KNN (iris [, -5], iris [, 5])
This class contains the classification model obtained by the k-NN method.
train
The training set (description). A data.frame.
labels
Class labels of the training set. Either a factor or an integer vector.
k
The k parameter.
This function builds a classification model using Linear Discriminant Analysis.
LDA(train, labels, tune = FALSE, ...)
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
require (datasets)
data (iris)
LDA (iris [, -5], iris [, 5])
Plot the leverage points of a linear regression model.
leverageplot(model, index = NULL, labels = NULL)
model |
The model to be plotted. |
index |
The index of the variable used for the x-axis. |
labels |
The labels of the instances. |
require (datasets)
data (trees)
model = LINREG (trees [, -3], trees [, 3])
leverageplot (model)
This function builds a linear regression model. Standard least squares, variable selection and factorial methods are available.
LINREG( x, y, quali = c("none", "intercept", "slope", "both"), reg = c("linear", "subset", "ridge", "lasso", "elastic", "pcr", "plsr"), regeval = c("r2", "bic", "adjr2", "cp", "msep"), scale = TRUE, lambda = 10^seq(-5, 5, length.out = 101), alpha = 0.5, graph = TRUE, tune = FALSE, ... )
x |
Predictor |
y |
Response |
quali |
Indicates how to use the qualitative variables. |
reg |
The algorithm. |
regeval |
The evaluation criterion for subset selection. |
scale |
If true, PCR and PLS use the scaled dataset. |
lambda |
The lambda parameter of Ridge, Lasso and Elastic net regression. |
alpha |
The elasticnet mixing parameter. |
graph |
A logical indicating whether or not graphics should be plotted (ridge, LASSO and elastic net). |
tune |
If true, the function returns parameters instead of a regression model. |
... |
Other parameters. |
The regression model, as an object of class model-class.
lm, regsubsets, mvr, glmnet
## Not run:
require (datasets)
# With one independent variable
data (cars)
LINREG (cars [, -2], cars [, 2])
# With two independent variables
data (trees)
LINREG (trees [, -3], trees [, 3])
# With non numeric variables
data (ToothGrowth)
LINREG (ToothGrowth [, -1], ToothGrowth [, 1], quali = "intercept") # Different intercept
LINREG (ToothGrowth [, -1], ToothGrowth [, 1], quali = "slope") # Different slope
LINREG (ToothGrowth [, -1], ToothGrowth [, 1], quali = "both") # Complete model
# With multiple numeric variables
data (mtcars)
LINREG (mtcars [, -1], mtcars [, 1])
LINREG (mtcars [, -1], mtcars [, 1], reg = "subset", regeval = "adjr2")
LINREG (mtcars [, -1], mtcars [, 1], reg = "ridge")
LINREG (mtcars [, -1], mtcars [, 1], reg = "lasso")
LINREG (mtcars [, -1], mtcars [, 1], reg = "elastic")
LINREG (mtcars [, -1], mtcars [, 1], reg = "pcr")
LINREG (mtcars [, -1], mtcars [, 1], reg = "plsr")
## End(Not run)
Synthetic dataset.
linsep
Class A contains 50 observations and class B contains 500 observations.
There are two numeric variables: X and Y.
Alexandre Blansché [email protected]
(Down)Load a text file (and extract it if it is in a zip file).
loadtext( file = file.choose(), dir = "~/", collapse = TRUE, sep = NULL, categories = NULL )
file |
The path or URL of the text file. |
dir |
The (temporary) directory, where the file is downloaded. The file is deleted at the end of this function. |
collapse |
Indicates whether the lines of each document should be collapsed together. |
sep |
Separator between text fields. |
categories |
Columns that should be considered as categorical data. |
The text contained in the downloaded file.
## Not run:
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
## End(Not run)
This function builds a classification model using Logistic Regression.
LR(train, labels, tune = FALSE, ...)
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
require (datasets)
data (iris)
LR (iris [, -5], iris [, 5])
Performs Multiple Correspondence Analysis (MCA) with supplementary individuals, supplementary quantitative variables and supplementary categorical variables. Also performs Specific Multiple Correspondence Analysis with supplementary categories and supplementary categorical variables. Missing values are treated as an additional level; rare categories can be ventilated.
MCA( d, ncp = 5, ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL, row.w = NULL )
d |
A data frame or a table with n rows and p columns, i.e. a contingency table. |
ncp |
The number of dimensions kept in the results (by default 5). |
ind.sup |
A vector indicating the indexes of the supplementary individuals. |
quanti.sup |
A vector indicating the indexes of the quantitative supplementary variables. |
quali.sup |
A vector indicating the indexes of the categorical supplementary variables. |
row.w |
Optional row weights (by default, a vector of 1 for uniform row weights); the weights are given only for the active individuals. |
The MCA on the dataset.
MCA, CA, PCA, plot.factorial, factorial-class
data (tea, package = "FactoMineR")
MCA (tea, quanti.sup = 19, quali.sup = 20:36)
Run MeanShift for clustering.
MEANSHIFT( d, mskernel = "NORMAL", bandwidth = rep(1, ncol(d)), alpha = 0, iterations = 10, epsilon = 1e-08, epsilonCluster = 1e-04, ... )
d |
The dataset ( |
mskernel |
A string indicating the kernel associated with the kernel density estimate that the mean shift is optimizing over. |
bandwidth |
Used in the kernel density estimate for steepest ascent classification. |
alpha |
A scalar tuning parameter for normal kernels. |
iterations |
The number of iterations to perform mean shift. |
epsilon |
A scalar used to determine when to terminate the iteration of an individual query point. |
epsilonCluster |
A scalar used to determine the minimum distance between distinct clusters. |
... |
Other parameters. |
The clustering (meanshift object).
## Not run:
require (datasets)
data (iris)
MEANSHIFT (iris [, -5], bandwidth = .75)
## End(Not run)
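To make the roles of bandwidth, iterations and epsilon more concrete, here is a minimal base R sketch of mean shift with a Gaussian kernel. It is an illustration only, not the implementation behind MEANSHIFT: the helper name mean_shift_sketch is hypothetical, and it assumes a single scalar bandwidth, whereas the function above takes one value per column.

```r
# Minimal mean-shift sketch with a Gaussian kernel (base R only).
# Assumption: one scalar bandwidth shared by all columns.
mean_shift_sketch <- function(d, bandwidth = 1, iterations = 10, epsilon = 1e-8) {
  x <- as.matrix(d)
  modes <- x
  for (i in seq_len(nrow(x))) {
    p <- x[i, ]
    for (it in seq_len(iterations)) {
      # Squared distances from the current position to every observation
      d2 <- rowSums(sweep(x, 2, p)^2)
      # Gaussian kernel weights
      w <- exp(-d2 / (2 * bandwidth^2))
      # Move to the weighted mean of the observations
      p_new <- colSums(x * w) / sum(w)
      moved <- sqrt(sum((p_new - p)^2))
      p <- p_new
      # Stop early once the point has (almost) stopped moving
      if (moved < epsilon) break
    }
    modes[i, ] <- p
  }
  modes
}

# Two well-separated toy groups: each point drifts towards its own group
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.1), ncol = 2))
modes <- mean_shift_sketch(toy, bandwidth = 0.5, iterations = 50)
```

Points whose final positions lie closer together than some threshold (the role played by epsilonCluster above) would then be merged into one cluster.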
This class contains the model obtained by the MEANSHIFT method.
cluster
A vector of integers indicating the cluster to which each point is allocated.
value
A vector or matrix containing the location of the classified local maxima in the support.
data
The learning set.
kernel
A string indicating the kernel associated with the kernel density estimate that the mean shift is optimizing over.
bandwidth
Used in the kernel density estimate for steepest ascent classification.
alpha
A scalar tuning parameter for normal kernels.
iterations
The number of iterations to perform mean shift.
epsilon
A scalar used to determine when to terminate the iteration of an individual query point.
epsilonCluster
A scalar used to determine the minimum distance between distinct clusters.
This function builds a classification model using Multilayer Perceptron.
MLP( train, labels, hidden = ifelse(is.vector(train), 2:(1 + nlevels(labels)), 2:(ncol(train) + nlevels(labels))), decay = 10^(-3:-1), methodparameters = NULL, tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
hidden |
The size of the hidden layer (if a vector, cross-validation is used to choose the best size). |
decay |
The decay (between 0 and 1) of the backpropagation algorithm (if a vector, cross-validation is used to choose the best value). |
methodparameters |
Object containing the parameters. If given, it replaces |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
## Not run:
require (datasets)
data (iris)
MLP (iris [, -5], iris [, 5], hidden = 4, decay = .1)
## End(Not run)
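A note on the default shown for hidden in the usage: in base R, ifelse() returns a value shaped like its test, so a scalar test yields a single element. The sketch below only demonstrates that language behavior with the same kind of expression; how MLP treats its default internally is not documented here. The names train, n_classes, default_like and grid are illustrative.

```r
train <- iris[, -5]
n_classes <- nlevels(iris$Species)

# is.vector() is FALSE for a data frame, so the second branch is used,
# but ifelse() keeps only as many elements as its (scalar) test has
default_like <- ifelse(is.vector(train),
                       2:(1 + n_classes),
                       2:(ncol(train) + n_classes))
default_like

# An explicit vector of candidate sizes, built with if/else instead
grid <- if (is.vector(train)) 2:(1 + n_classes) else 2:(ncol(train) + n_classes)
grid
```

When several hidden-layer sizes should be compared, passing an explicit vector such as grid as hidden is therefore the safer choice.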
This function builds a regression model using MLP.
MLPREG( x, y, size = 2:(ifelse(is.vector(x), 2, ncol(x))), decay = 10^(-3:-1), params = NULL, tune = FALSE, ... )
x |
Predictor |
y |
Response |
size |
The size of the hidden layer (if a vector, cross-validation is used to choose the best size). |
decay |
The decay (between 0 and 1) of the backpropagation algorithm (if a vector, cross-validation is used to choose the best value). |
params |
Object containing the parameters. If given, it replaces |
tune |
If true, the function returns parameters instead of a regression model. |
... |
Other parameters. |
The regression model, as an object of class model-class.
## Not run:
require (datasets)
data (trees)
MLPREG (trees [, -3], trees [, 3])
## End(Not run)
This is a wrapper class containing the classification model obtained by any classification or regression method.
model
The wrapped model.
method
The name of the method.
Extract from the MovieLens dataset. Missing values have been imputed.
movies
A set of 49 movies, rated by 55 users.
https://grouplens.org/datasets/movielens/
This function builds a classification model using Naive Bayes.
NB(train, labels, tune = FALSE, ...)
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
require (datasets)
data (iris)
NB (iris [, -5], iris [, 5])
Return the NMF decomposition.
NMF(x, rank = 2, nstart = 10, ...)
x |
A numeric dataset (data.frame or matrix). |
rank |
Specification of the factorization rank. |
nstart |
How many random sets should be chosen? |
... |
Other parameters. |
## Not run:
install.packages ("BiocManager")
BiocManager::install ("Biobase")
install.packages ("NMF")
require (datasets)
data (iris)
NMF (iris [, -5])
## End(Not run)
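For readers who want to see what a rank-2 non-negative factorization computes without installing extra packages, the following base R sketch uses the classic multiplicative update rules. It is an assumption-laden illustration, not the algorithm used by the NMF function above: the helper name nmf_sketch, the fixed iteration count and the small constant eps are all choices made for this example.

```r
# Sketch of NMF by multiplicative updates: x is approximated by w %*% h,
# with all entries of w and h kept non-negative.
nmf_sketch <- function(x, rank = 2, iterations = 200, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- as.matrix(x)
  eps <- 1e-9  # guards against division by zero
  w <- matrix(runif(nrow(x) * rank), nrow(x), rank)
  h <- matrix(runif(rank * ncol(x)), rank, ncol(x))
  for (i in seq_len(iterations)) {
    h <- h * (t(w) %*% x) / (t(w) %*% w %*% h + eps)
    w <- w * (x %*% t(h)) / (w %*% h %*% t(h) + eps)
  }
  # Frobenius norm of the reconstruction error
  list(w = w, h = h, error = sqrt(sum((x - w %*% h)^2)))
}

fit <- nmf_sketch(iris[, -5], rank = 2, seed = 0)
dim(fit$w)  # one row per observation, one column per factor
dim(fit$h)  # one row per factor, one column per variable
```

Because the result depends on the random starting point, running several random starts and keeping the best one (the purpose of nstart above) is a common precaution.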
This dataset contains measurements of ozone levels.
ozone
Each instance is described by the maximum level of ozone measured during the day. Temperature, clouds, and wind are also recorded.
https://r-stat-sc-donnees.github.io/ozone.txt
This class contains the main parameters for various learning methods.
decay
The decay parameter.
hidden
The number of hidden nodes.
epsilon
The epsilon parameter.
gamma
The gamma parameter.
cost
The cost parameter.
Performs Principal Component Analysis (PCA) with supplementary individuals, supplementary quantitative variables and supplementary categorical variables. Missing values are replaced by the column mean.
PCA( d, scale.unit = TRUE, ncp = ncol(d) - length(quanti.sup) - length(quali.sup), ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL, row.w = NULL, col.w = NULL )
d |
A data frame with n rows (individuals) and p columns (numeric variables). |
scale.unit |
A boolean, if TRUE (value set by default) then data are scaled to unit variance. |
ncp |
The number of dimensions kept in the results (by default, the number of active variables: ncol(d) - length(quanti.sup) - length(quali.sup)). |
ind.sup |
A vector indicating the indexes of the supplementary individuals. |
quanti.sup |
A vector indicating the indexes of the quantitative supplementary variables. |
quali.sup |
A vector indicating the indexes of the categorical supplementary variables. |
row.w |
Optional row weights (by default, a vector of 1s for uniform row weights); the weights are given only for the active individuals. |
col.w |
Optional column weights (by default, uniform column weights); the weights are given only for the active variables. |
The PCA on the dataset.
PCA, CA, MCA, plot.factorial, kaiser, factorial-class
require (datasets)
data (iris)
PCA (iris, quali.sup = 5)
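To illustrate what scale.unit = TRUE amounts to, the sketch below runs a scaled PCA with stats::prcomp (a base R function, used here instead of the PCA wrapper so that the example has no extra dependencies) and compares the resulting variances with the eigenvalues of the correlation matrix. The variable names are illustrative.

```r
x <- iris[, -5]

# PCA on centered, unit-variance variables
pca <- stats::prcomp(x, center = TRUE, scale. = TRUE)
eig <- pca$sdev^2

# Eigenvalues of the correlation matrix, for comparison
eig_cor <- eigen(cor(x))$values
all.equal(eig, eig_cor)

# Percentage of variance carried by each axis
round(100 * eig / sum(eig), 1)
```

With scaled data the eigenvalues sum to the number of active variables, which is why an eigenvalue above 1 is often read as "carries more than one variable's worth of variance".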
Estimate the performance of classification or regression methods using bootstrap or cross-validation (accuracy, ROC curves, confusion matrices, ...).
performance( methods, train.x, train.y, test.x = NULL, test.y = NULL, train.size = round(0.7 * nrow(train.x)), type = c("evaluation", "confusion", "roc", "cost", "scatter", "avsp"), protocol = c("bootstrap", "crossvalidation", "loocv", "holdout", "train"), eval = ifelse(is.factor(train.y), "accuracy", "r2"), nruns = 10, nfolds = 10, new = TRUE, lty = 1, seed = NULL, methodparameters = NULL, names = NULL, ... )
methods |
The classification or regression methods to be evaluated. |
train.x |
The dataset (description/predictors), a |
train.y |
The target (class labels or numeric values), a |
test.x |
The test dataset (description/predictors), a |
test.y |
The (test) target (class labels or numeric values), a |
train.size |
The size of the training set (holdout estimation). |
type |
The type of evaluation (confusion matrix, ROC curve, ...) |
protocol |
The evaluation protocol (crossvalidation, bootstrap, ...) |
eval |
The evaluation functions. |
nruns |
The number of bootstrap runs. |
nfolds |
The number of folds (crossvalidation estimation). |
new |
A logical value indicating whether a new plot should be created or not (cost curves or ROC curves). |
lty |
The line type (and color) specified as an integer (cost curves or ROC curves). |
seed |
A specified seed for random number generation (useful for testing different methods with the same bootstrap samples). |
methodparameters |
Method parameters (if null tuning is done by cross-validation). |
names |
Method names. |
... |
Other specific parameters for the learning method. |
The evaluation of the predictions (numeric value).
confusion, evaluation, cost.curves, roc.curves
## Not run:
require ("datasets")
data (iris)
# One method, one evaluation criterion, bootstrap estimation
performance (NB, iris [, -5], iris [, 5], seed = 0)
# One method, two evaluation criteria, train set estimation
performance (NB, iris [, -5], iris [, 5], eval = c ("accuracy", "kappa"), protocol = "train", seed = 0)
# Three methods, ROC curves, LOOCV estimation
performance (c (NB, LDA, LR), linsep [, -3], linsep [, 3], type = "roc", protocol = "loocv", seed = 0)
# List of methods in a variable, confusion matrix, holdout estimation
classif = c (NB, LDA, LR)
performance (classif, iris [, -5], iris [, 5], type = "confusion", protocol = "holdout", seed = 0, names = c ("NB", "LDA", "LR"))
# List of strings (method names), scatterplot evaluation, crossvalidation estimation
classif = c ("NB", "LDA", "LR")
performance (classif, iris [, -5], iris [, 5], type = "scatter", protocol = "crossvalidation", seed = 0)
# Actual vs. predicted
data (trees)
performance (LINREG, trees [, -3], trees [, 3], type = "avsp")
## End(Not run)
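The "crossvalidation" protocol can be pictured with a short base R sketch. It is not the code behind performance: the nearest-centroid classifier is a hypothetical stand-in for a real learning method, and the names fit_centroids, predict_centroids and cv_accuracy are invented for this example.

```r
# Hypothetical stand-in classifier: one centroid (mean vector) per class
fit_centroids <- function(x, y) {
  t(sapply(split(as.data.frame(x), y, drop = TRUE), colMeans))
}

# Predict the class whose centroid is nearest (squared Euclidean distance)
predict_centroids <- function(centers, newdata) {
  newdata <- as.matrix(newdata)
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(newdata, 2, centers[j, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(newdata))
  rownames(centers)[max.col(-d2, ties.method = "first")]
}

# k-fold cross-validation: every observation is tested exactly once
cv_accuracy <- function(x, y, nfolds = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  folds <- sample(rep_len(seq_len(nfolds), nrow(x)))
  acc <- sapply(seq_len(nfolds), function(f) {
    test <- folds == f
    centers <- fit_centroids(x[!test, , drop = FALSE], y[!test])
    mean(predict_centroids(centers, x[test, , drop = FALSE]) == y[test])
  })
  mean(acc)
}

acc <- cv_accuracy(iris[, -5], iris[, 5], nfolds = 10, seed = 0)
acc
```

Fixing the seed plays the same role as the seed argument above: two methods evaluated with the same seed see the same folds, so their scores are directly comparable.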
Plot the learning set (and test set) on the canonical axes obtained by Canonical Discriminant Analysis (function CDA).
## S3 method for class 'cda'
plot(x, newdata = NULL, axes = 1:2, ...)
x |
The classification model (object of class |
newdata |
The test set ( |
axes |
The canonical axes to be printed (numeric |
... |
Other parameters. |
require (datasets)
data (iris)
model = CDA (iris [, -5], iris [, 5])
plot (model)
Plot PCA, CA or MCA.
## S3 method for class 'factorial'
plot(x, type = c("ind", "cor", "eig"), axes = c(1, 2), ...)
x |
The PCA, CA or MCA result (object of class |
type |
The graph to plot. |
axes |
The factorial axes to be printed (numeric |
... |
Other parameters. |
CA, MCA, PCA, plot.CA, plot.MCA, plot.PCA, factorial-class
require (datasets)
data (iris)
pca = PCA (iris, quali.sup = 5)
plot (pca)
plot (pca, type = "cor")
plot (pca, type = "eig")
Plot Kohonen's self-organizing maps.
## S3 method for class 'som'
plot(x, type = c("scatter", "mapping"), col = NULL, labels = FALSE, ...)
x |
The Kohonen's map (object of class |
type |
The type of plot. |
col |
Color of the data points. |
labels |
A |
... |
Other parameters. |
require (datasets)
data (iris)
som = SOM (iris [, -5], xdim = 5, ydim = 5, post = "ward", k = 3)
plot (som) # Scatter plot (default)
plot (som, type = "mapping") # Kohonen map
Plot actual vs. predicted values of a regression model.
plotavsp(predictions, gt)
predictions |
The predictions of a regression model ( |
gt |
The ground truth of the dataset ( |
confusion, evaluation.accuracy, evaluation.fmeasure, evaluation.fowlkesmallows, evaluation.goodness, evaluation.jaccard, evaluation.kappa, evaluation.precision, evaluation.recall, evaluation.msep, evaluation.r2, performance
require (datasets)
data (trees)
model = LINREG (trees [, -3], trees [, 3])
pred = predict (model, trees [, -3])
plotavsp (pred, trees [, 3])
Plot a word cloud based on the word frequencies in the documents.
plotcloud(corpus, k = NULL, stopwords = "en", ...)
corpus |
The corpus of documents (a vector of characters) or the vocabulary of the documents (result of function |
k |
A categorical variable (vector or factor). |
stopwords |
Stopwords, or the language of the documents. NULL if stop words should not be removed. |
... |
Other parameters. |
## Not run:
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
plotcloud (text)
vocab = getvocab (text, mincount = 1, lang = NULL, stopwords = "en")
plotcloud (vocab)
## End(Not run)
Plot a clustering according to various parameters.
plotclus( clustering, d = NULL, type = c("scatter", "boxplot", "tree", "height", "mapping", "words"), centers = FALSE, k = NULL, tailsize = 9, ... )
clustering |
The clustering to be plotted. |
d |
The dataset ( |
type |
The type of plot. |
centers |
Indicates whether or not cluster centers should be plotted (used only in scatter plots). |
k |
Number of clusters (used only for hierarchical methods). If not specified an "optimal" value is determined. |
tailsize |
Number of clusters shown (used only for height plots). |
... |
Other parameters. |
treeplot, scatterplot, plot.som, boxclus
## Not run:
require (datasets)
data (iris)
ward = HCA (iris [, -5], method = "ward", k = 3)
plotclus (ward, iris [, -5], type = "scatter") # Scatter plot
plotclus (ward, iris [, -5], type = "boxplot") # Boxplot
plotclus (ward, iris [, -5], type = "tree") # Dendrogram
plotclus (ward, iris [, -5], type = "height") # Distances between merging clusters
som = SOM (iris [, -5], xdim = 5, ydim = 5, post = "ward", k = 3)
plotclus (som, iris [, -5], type = "scatter") # Scatter plot for SOM
plotclus (som, iris [, -5], type = "mapping") # Kohonen map
## End(Not run)
Plot a dataset.
plotdata( d, k = NULL, type = c("pairs", "scatter", "parallel", "boxplot", "histogram", "barplot", "pie", "heatmap", "heatmapc", "pca", "cda", "svd", "nmf", "tsne", "som", "words"), legendpos = "topleft", alpha = 200, asp = 1, labels = FALSE, ... )
d |
A numeric dataset (data.frame or matrix). |
k |
A categorical variable (vector or factor). |
type |
The type of graphic to be plotted. |
legendpos |
Position of the legend. |
alpha |
Color opacity (0-255). |
asp |
Aspect ratio (default: 1). |
labels |
Indicates whether or not labels (row names) should be shown on the (scatter) plot. |
... |
Other parameters. |
require (datasets)
data (iris)
# Without classification
plotdata (iris [, -5]) # Default (pairs)
# With classification
plotdata (iris [, -5], iris [, 5]) # Default (pairs)
plotdata (iris, 5) # Column number
plotdata (iris) # Automatic detection of the classification (if only one factor column)
plotdata (iris, type = "scatter") # Scatter plot (PCA axis)
plotdata (iris, type = "parallel") # Parallel coordinates
plotdata (iris, type = "boxplot") # Boxplot
plotdata (iris, type = "histogram") # Histograms
plotdata (iris, type = "heatmap") # Heatmap
plotdata (iris, type = "heatmapc") # Heatmap (and hierarchical clustering)
plotdata (iris, type = "pca") # Scatter plot (PCA axis)
plotdata (iris, type = "cda") # Scatter plot (CDA axis)
plotdata (iris, type = "svd") # Scatter plot (SVD axis)
plotdata (iris, type = "som") # Kohonen map
# With only one variable
plotdata (iris [, 1], iris [, 5]) # Default (data vs. index)
plotdata (iris [, 1], iris [, 5], type = "scatter") # Scatter plot (data vs. index)
plotdata (iris [, 1], iris [, 5], type = "boxplot") # Boxplot
# With two variables
plotdata (iris [, 3:4], iris [, 5]) # Default (scatter plot)
plotdata (iris [, 3:4], iris [, 5], type = "scatter") # Scatter plot
data (titanic)
plotdata (titanic, type = "barplot") # Barplots
plotdata (titanic, type = "pie") # Pie charts
Plot the frequency of words in a document against the ranks of those words. It also plots Zipf's law.
plotzipf(corpus)
corpus |
The corpus of documents (a vector of characters) or the vocabulary of the documents (result of function |
## Not run:
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
plotzipf (text)
vocab = getvocab (text, mincount = 1, lang = NULL)
plotzipf (vocab)
## End(Not run)
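The quantities behind such a plot can be computed by hand on a tiny corpus. The sketch below is a base R illustration, not the code of plotzipf: the helper name zipf_table and the simple tokenization (lower-casing and splitting on non-letters) are assumptions. Zipf's law is taken here in its simplest form, where the expected frequency at rank r is the top frequency divided by r.

```r
# Word frequencies by rank, next to the ideal Zipf curve
zipf_table <- function(corpus) {
  words <- unlist(strsplit(tolower(corpus), "[^a-z]+"))
  words <- words[nzchar(words)]
  freq <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(freq),
             rank = seq_along(freq),
             freq = as.integer(freq),
             zipf = as.integer(freq)[1] / seq_along(freq))
}

z <- zipf_table(c("the cat and the dog", "the cat sat"))
z

# On a log-log scale, Zipf's law appears as a straight line
if (interactive()) {
  plot(z$rank, z$freq, log = "xy", xlab = "Rank", ylab = "Frequency")
  lines(z$rank, z$zipf, lty = 2)
}
```

On a corpus this small the fit is meaningless; the law only becomes visible with a large text such as the one used in the example above.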
This function builds a polynomial regression model.
POLYREG(x, y, degree = 2, tune = FALSE, ...)
x |
Predictor |
y |
Response |
degree |
The polynomial degree. |
tune |
If true, the function returns parameters instead of a regression model. |
... |
Other parameters. |
The regression model, as an object of class model-class.
## Not run:
require (datasets)
data (trees)
POLYREG (trees [, -3], trees [, 3])
## End(Not run)
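For comparison, a degree-2 polynomial regression on the same data can be written with base R's lm and poly. This is only a sketch of the general technique: how POLYREG builds its polynomial terms (for instance, whether it includes interactions between predictors) is not stated here, so the formula below is an assumption.

```r
# Degree-2 polynomial in each predictor, without interaction terms
fit <- lm(Volume ~ poly(Girth, 2) + poly(Height, 2), data = trees)

# Proportion of variance explained on the training data
r2 <- summary(fit)$r.squared
r2

# Predictions for a few observations
pred <- predict(fit, newdata = trees[1:3, c("Girth", "Height")])
pred
```

Raising the degree always improves the fit on the training data, so the degree is best chosen on held-out data rather than on the training R2.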
This function predicts values based upon a model trained by apriori.classif.
Observations that do not match any of the rules are given the label set by unmatched ("Unknown" by default).
## S3 method for class 'apriori'
predict(object, test, unmatched = "Unknown", ...)
object |
The classification model (of class |
test |
The test set (a |
unmatched |
The class label given to the unmatched observations (a character string). |
... |
Other parameters. |
A vector of predicted values (factor).
APRIORI, apriori-class, apriori
require ("datasets")
data (iris)
d = discretizeDF (iris, default = list (method = "interval", breaks = 3, labels = c ("small", "medium", "large")))
model = APRIORI (d [, -5], d [, 5], supp = .1, conf = .9, prune = TRUE)
predict (model, d [, -5])
This function predicts values based upon a model trained by a boosting method.
## S3 method for class 'boosting'
predict(object, test, fuzzy = FALSE, ...)
object |
The classification model (of class |
test |
The test set (a |
fuzzy |
A boolean indicating whether fuzzy classification is used or not. |
... |
Other parameters. |
A vector of predicted values (factor).
ADABOOST, BAGGING, boosting-class
## Not run:
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = BAGGING (d$train.x, d$train.y, NB)
predict (model, d$test.x)
model = ADABOOST (d$train.x, d$train.y, NB)
predict (model, d$test.x)
## End(Not run)
This function predicts values based upon a model trained by CDA.
## S3 method for class 'cda'
predict(object, test, fuzzy = FALSE, ...)
object |
The classification model (of class |
test |
The test set (a |
fuzzy |
A boolean indicating whether fuzzy classification is used or not. |
... |
Other parameters. |
A vector of predicted values (factor).
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = CDA (d$train.x, d$train.y)
predict (model, d$test.x)
Return the closest DBSCAN cluster for a new dataset.
## S3 method for class 'dbs'
predict(object, newdata, ...)
object |
The classification model (of class |
newdata |
A new dataset (a |
... |
Other parameters. |
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = DBSCAN (d$train.x, minpts = 5, eps = 0.65)
predict (model, d$test.x)
Return the closest EM cluster for a new dataset.
## S3 method for class 'em'
predict(object, newdata, ...)
object |
The classification model (of class |
newdata |
A new dataset (a |
... |
Other parameters. |
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = EM (d$train.x, 3)
predict (model, d$test.x)
Return the closest K-means cluster for a new dataset.
## S3 method for class 'kmeans'
predict(object, newdata, ...)
object |
The classification model (created by |
newdata |
A new dataset (a |
... |
Other parameters. |
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = KMEANS (d$train.x, k = 3)
predict (model, d$test.x)
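"Closest cluster" can be read as "cluster whose center is nearest to the new point". The following base R sketch shows that rule using squared Euclidean distance; it is an illustration under that assumption, not the code of the predict method above, and the helper name nearest_center is hypothetical.

```r
# Assign each row of newdata to the index of its nearest center
nearest_center <- function(centers, newdata) {
  centers <- as.matrix(centers)
  newdata <- as.matrix(newdata)
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(newdata, 2, centers[j, ])^2)
  })
  # Keep a matrix shape even when newdata has a single row
  d2 <- matrix(d2, nrow = nrow(newdata))
  max.col(-d2, ties.method = "first")
}

# Toy check with two obvious centers
centers <- rbind(c(0, 0), c(10, 10))
newdata <- rbind(c(1, 1), c(9, 8), c(0, -1))
assigned <- nearest_center(centers, newdata)
assigned

# Same rule applied to centers found by stats::kmeans
set.seed(0)
km <- stats::kmeans(iris[1:100, -5], centers = 2)
new_clusters <- nearest_center(km$centers, iris[101:150, -5])
```

The returned values are row indices of the centers matrix, so they line up with the cluster numbers of the fitted model.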
This function predicts values based upon a model trained by KNN.
## S3 method for class 'knn'
predict(object, test, fuzzy = FALSE, ...)
object |
The classification model (of class |
test |
The test set (a |
fuzzy |
A boolean indicating whether fuzzy classification is used or not. |
... |
Other parameters. |
A vector of predicted values (factor).
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = KNN (d$train.x, d$train.y)
predict (model, d$test.x)
Return the closest MeanShift cluster for a new dataset.
## S3 method for class 'meanshift'
predict(object, newdata, ...)
object |
The classification model (created by |
newdata |
A new dataset (a |
... |
Other parameters. |
## Not run:
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = MEANSHIFT (d$train.x, bandwidth = .75)
predict (model, d$test.x)
## End(Not run)
This function predicts values based upon a model trained by any classification or regression method.
## S3 method for class 'model'
predict(object, test, fuzzy = FALSE, ...)
object |
The classification model (of class |
test |
The test set (a |
fuzzy |
A boolean indicating whether fuzzy classification is used or not. |
... |
Other parameters. |
A vector of predicted values (factor).
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = LDA (d$train.x, d$train.y)
predict (model, d$test.x)
This function predicts values based upon a model built with feature selection (function FEATURESELECTION) on top of any classification or regression method.
## S3 method for class 'selection'
predict(object, test, fuzzy = FALSE, ...)
object |
The classification model (of class |
test |
The test set (a |
fuzzy |
A boolean indicating whether fuzzy classification is used or not. |
... |
Other parameters. |
A vector of predicted values (factor).
FEATURESELECTION, selection-class
## Not run:
require (datasets)
data (iris)
d = splitdata (iris, 5)
model = FEATURESELECTION (d$train.x, d$train.y, uninb = 2, mainmethod = LDA)
predict (model, d$test.x)
## End(Not run)
This function predicts values based upon a model trained for text mining.
## S3 method for class 'textmining'
predict(object, test, fuzzy = FALSE, ...)
object |
The classification model (of class |
test |
The test set (a |
fuzzy |
A boolean indicating whether fuzzy classification is used or not. |
... |
Other parameters. |
A vector of predicted values (factor).
## Not run:
require (text2vec)
data ("movie_review")
d = movie_review [, 2:3]
d [, 1] = factor (d [, 1])
d = splitdata (d, 1)
model = TEXTMINING (d$train.x, NB, labels = d$train.y, mincount = 50)
pred = predict (model, d$test.x)
evaluation (pred, d$test.y)
## End(Not run)
Print the set of rules in the classification model.
## S3 method for class 'apriori'
print(x, ...)
x |
The model to be printed. |
... |
Other parameters. |
APRIORI, predict.apriori, summary.apriori, apriori-class, apriori
require ("datasets")
data (iris)
d = discretizeDF (iris, default = list (method = "interval", breaks = 3, labels = c ("small", "medium", "large")))
model = APRIORI (d [, -5], d [, 5], supp = .1, conf = .9, prune = TRUE)
print (model)
Print PCA, CA or MCA.
## S3 method for class 'factorial'
print(x, ...)
x |
The PCA, CA or MCA result (object of class |
... |
Other parameters. |
CA, MCA, PCA, print.CA, print.MCA, print.PCA, factorial-class
require (datasets)
data (iris)
pca = PCA (iris, quali.sup = 5)
print (pca)
Compute the pseudo-F of a clustering result obtained by the K-means method.
pseudoF(clustering)
clustering |
The clustering result (obtained by the function |
The pseudo-F of the clustering result.
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
pseudoF (km)
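The exact formula used by pseudoF is not spelled out here. A common definition is the Calinski-Harabasz index, F = (B / (k - 1)) / (W / (n - k)), where B is the between-cluster sum of squares, W the total within-cluster sum of squares, k the number of clusters and n the number of observations. The sketch below computes it from a stats::kmeans result, assuming that definition; the helper name pseudo_f_sketch is hypothetical.

```r
# Calinski-Harabasz style pseudo-F from a stats::kmeans object
pseudo_f_sketch <- function(km, n) {
  k <- nrow(km$centers)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

set.seed(0)
km <- stats::kmeans(iris[, -5], centers = 3, nstart = 10)
f <- pseudo_f_sketch(km, n = nrow(iris))
f
```

Larger values indicate compact, well-separated clusters, so the index is often compared across several values of k.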
This function builds a classification model using Quadratic Discriminant Analysis.
QDA(train, labels, tune = FALSE, ...)
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
require (datasets)
data (iris)
QDA (iris [, -5], iris [, 5])
Search for documents similar to the query.
query.docs(docvectors, query, vectorizer, nres = 5)
docvectors |
The vectorized documents. |
query |
The query (vectorized or raw text). |
vectorizer |
The vectorizer that has been used to vectorize the documents. |
nres |
The number of results. |
The indices of the documents most similar to the query.
## Not run:
require (text2vec)
data (movie_review)
vectorizer = vectorize.docs (corpus = movie_review$review, minphrasecount = 50, returndata = FALSE)
docs = vectorize.docs (corpus = movie_review$review, vectorizer = vectorizer)
query.docs (docs, movie_review$review [1], vectorizer)
query.docs (docs, docs [1, ], vectorizer)
## End(Not run)
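A document search of this kind can be sketched in base R by ranking documents on their cosine similarity to an already vectorized query. The helper cosine_top and the toy vectors are hypothetical, and the use of cosine similarity is an assumption about the measure rather than a statement about what query.docs does internally.

```r
# Rank documents by cosine similarity to a query vector.
# cosine_top is a hypothetical helper, not part of fdm2id.
cosine_top <- function(docvectors, query, nres = 5) {
  sims <- as.vector(docvectors %*% query) /
    (sqrt(rowSums(docvectors^2)) * sqrt(sum(query^2)))
  head(order(sims, decreasing = TRUE), nres)   # indices of the best matches
}

# Four toy "documents" in a three-term vector space
docs <- rbind(c(1, 0, 0), c(0.9, 0.1, 0), c(0, 1, 0), c(0, 0, 1))
res <- cosine_top(docs, docs[2, ], nres = 2)
res
```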
Search for words similar to the query.
query.words(wordvectors, origin, sub = NULL, add = NULL, nres = 5, lang = "en")
wordvectors |
The vectorized words |
origin |
The query (character). |
sub |
Words to be subtracted from the origin. |
add |
Words to be added to the origin. |
nres |
The number of results. |
lang |
The language of the words (NULL if no stemming). |
The words most similar to the query.
## Not run:
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
words = vectorize.words (text, minphrasecount = 50)
query.words (words, origin = "paris", sub = "france", add = "germany")
query.words (words, origin = "berlin", sub = "germany", add = "france")
query.words (words, origin = "new_zealand")
## End(Not run)
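The origin/sub/add arguments suggest the usual word-vector arithmetic: start from the origin vector, subtract some word vectors, add others, and look for the nearest words. The sketch below illustrates that idea on a hand-made three-dimensional embedding; the helper analogy, the toy vectors and the choice of cosine similarity are all assumptions made for illustration.

```r
# Word analogy by vector arithmetic followed by a cosine-similarity search.
# analogy is a hypothetical helper, not part of fdm2id.
analogy <- function(wordvectors, origin, sub = NULL, add = NULL, nres = 5) {
  target <- wordvectors[origin, ]
  if (!is.null(sub)) target <- target - colSums(wordvectors[sub, , drop = FALSE])
  if (!is.null(add)) target <- target + colSums(wordvectors[add, , drop = FALSE])
  sims <- as.vector(wordvectors %*% target) /
    (sqrt(rowSums(wordvectors^2)) * sqrt(sum(target^2)))
  names(sims) <- rownames(wordvectors)
  head(names(sort(sims, decreasing = TRUE)), nres)
}

# Toy embedding: dimensions stand for "capital", "France", "Germany"
toy <- rbind(paris   = c(1, 1, 0),
             france  = c(0, 1, 0),
             germany = c(0, 0, 1),
             berlin  = c(1, 0, 1))
res <- analogy(toy, origin = "paris", sub = "france", add = "germany", nres = 2)
res
```

On this toy data the closest word is berlin; with real GloVe vectors the result depends on the corpus and the training parameters.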
This function builds a classification model using Random Forest.
RANDOMFOREST( train, labels, ntree = 500, nvar = if (!is.null(labels) && !is.factor(labels)) max(floor(ncol(train)/3), 1) else floor(sqrt(ncol(train))), tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
ntree |
The number of trees in the forest. |
nvar |
Number of variables randomly sampled as candidates at each split. |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
## Not run:
require (datasets)
data (iris)
RANDOMFOREST (iris [, -5], iris [, 5])
## End(Not run)
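The default for nvar in the usage above depends on the type of labels: the square root of the number of columns for a factor (classification), and a third of the columns, with a minimum of one, for a numeric target. The snippet below only re-evaluates that default expression in base R; default_nvar is a hypothetical name used for illustration.

```r
# Re-evaluate the default of nvar shown in the RANDOMFOREST usage.
# default_nvar is a hypothetical helper, not part of fdm2id.
default_nvar <- function(train, labels) {
  if (!is.null(labels) && !is.factor(labels)) {
    max(floor(ncol(train) / 3), 1)   # numeric target: p / 3, at least 1
  } else {
    floor(sqrt(ncol(train)))         # factor target: sqrt(p)
  }
}

nvar_class <- default_nvar(iris[, -5], iris[, 5])    # 4 predictors, factor labels
nvar_reg <- default_nvar(trees[, -3], trees[, 3])    # 2 predictors, numeric target
c(nvar_class, nvar_reg)
```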
Artificial dataset for simple regression tasks.
reg1 reg1.train reg1.test
50 instances and 3 variables. X, a numeric, K, a factor, and Y, a numeric (the target variable).
Alexandre Blansché
Artificial dataset for simple regression tasks.
reg2 reg2.train reg2.test
50 instances and 2 variables. X and Y (the target variable) are both numeric variables.
Alexandre Blansché
Plot a regression model on a 2-D plot. The predictor x should be one-dimensional.
regplot(model, x, y, margin = 0.1, ...)
model |
The model to be plotted. |
x |
The predictor |
y |
The response |
margin |
A margin parameter. |
... |
Other graphical parameters |
require (datasets)
data (cars)
model = POLYREG (cars [, -2], cars [, 2])
regplot (model, cars [, -2], cars [, 2])
Plot the studentized residuals of a linear regression model.
resplot(model, index = NULL, labels = NULL)
model |
The model to be plotted. |
index |
The index of the variable used for the x-axis. |
labels |
The labels of the instances. |
require (datasets)
data (trees)
model = LINREG (trees [, -3], trees [, 3])
resplot (model) # Ordered by index
resplot (model, index = 0) # Ordered by variable "Volume" (dependent variable)
resplot (model, index = 1) # Ordered by variable "Girth" (independent variable)
resplot (model, index = 2) # Ordered by variable "Height" (independent variable)
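For readers who want to see the underlying quantities, studentized residuals are available in base R through rstudent applied to an lm fit. The sketch below fits the same kind of model on trees with lm instead of LINREG (an assumption made to keep the example self-contained) and flags observations whose residual exceeds 2 in absolute value, a common rule of thumb.

```r
# Studentized residuals of a linear model, computed with base R only.
model <- lm(Volume ~ Girth + Height, data = trees)
res <- rstudent(model)             # externally studentized residuals
flagged <- which(abs(res) > 2)     # rule-of-thumb threshold for unusual points

# Draw the residuals ordered by Girth when a graphics device is available
if (interactive()) {
  ord <- order(trees$Girth)
  plot(trees$Girth[ord], res[ord], xlab = "Girth", ylab = "Studentized residual")
  abline(h = c(-2, 2), lty = 2)
}
```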
This function plots ROC Curves of several classification predictions.
roc.curves(predictions, gt, methods.names = NULL)
predictions |
The predictions of a classification model ( |
gt |
Actual labels of the dataset ( |
methods.names |
The name of the compared methods ( |
The evaluation of the predictions (numeric value).
require (datasets)
data (iris)
d = iris
levels (d [, 5]) = c ("+", "+", "-") # Building a two-class dataset
model.nb = NB (d [, -5], d [, 5])
model.lda = LDA (d [, -5], d [, 5])
pred.nb = predict (model.nb, d [, -5])
pred.lda = predict (model.lda, d [, -5])
roc.curves (cbind (pred.nb, pred.lda), d [, 5], c ("NB", "LDA"))
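To make the notion of a ROC curve concrete, the sketch below builds the curve from numeric scores by sweeping a threshold from the highest score down, and integrates it with the trapezoidal rule. The helpers roc_points and auc are hypothetical, they ignore tied scores, and they assume numeric scores where higher means "more positive", which may differ from the predictions roc.curves expects.

```r
# ROC points and area under the curve from scores, in base R.
# roc_points and auc are hypothetical helpers, not part of fdm2id.
roc_points <- function(scores, labels, positive) {
  ord <- order(scores, decreasing = TRUE)
  pos <- labels[ord] == positive
  data.frame(fpr = c(0, cumsum(!pos) / sum(!pos)),   # false positive rate
             tpr = c(0, cumsum(pos) / sum(pos)))     # true positive rate
}

auc <- function(roc) {
  # Trapezoidal rule over consecutive ROC points
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

scores <- c(0.9, 0.8, 0.7, 0.6)
perfect <- auc(roc_points(scores, c("+", "+", "-", "-"), "+"))
inverted <- auc(roc_points(scores, c("-", "-", "+", "+"), "+"))
c(perfect, inverted)
```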
Rotation of two variables of a numeric dataset.
rotation(d, angle, axis = 1:2, range = 2 * pi)
d |
The dataset. |
angle |
The angle of the rotation. |
axis |
The axis. |
range |
The range of the angle (360, 2*pi, 100, ...) |
A rotated data matrix.
d = data.parabol ()
d [, -3] = rotation (d [, -3], 45, range = 360)
plotdata (d [, -3], d [, 3])
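The range argument can be read as the value that corresponds to a full turn, so that an angle is converted to radians as 2 * pi * angle / range before a standard 2-D rotation matrix is applied. The sketch below follows that reading; rotate2d is a hypothetical helper, and the counter-clockwise direction is an assumption.

```r
# Rotate two columns of a numeric dataset by a given angle.
# rotate2d is a hypothetical helper, not part of fdm2id.
rotate2d <- function(d, angle, axis = 1:2, range = 2 * pi) {
  theta <- 2 * pi * angle / range                      # convert to radians
  rot <- matrix(c(cos(theta), sin(theta),
                  -sin(theta), cos(theta)), nrow = 2)  # counter-clockwise rotation
  d <- as.matrix(d)
  d[, axis] <- d[, axis, drop = FALSE] %*% t(rot)      # rows are points
  d
}

# A quarter turn, expressed in degrees, sends (1, 0) to (0, 1)
r <- rotate2d(matrix(c(1, 0), nrow = 1), 90, range = 360)
r
```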
Return the running time of a function.
runningtime(FUN, ...)
FUN |
The function to be evaluated. |
... |
The parameters to be passed to function |
The running time of function FUN.
sqrt (x = 1:100)
runningtime (sqrt, x = 1:100)
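A timing wrapper of this shape can be sketched with base R's system.time. The helper running_time is hypothetical, and returning the elapsed (wall-clock) time in seconds is an assumption about what is reported.

```r
# Time a function call with system.time and return the elapsed seconds.
# running_time is a hypothetical helper, not part of fdm2id.
running_time <- function(FUN, ...) {
  timing <- system.time(FUN(...))   # arguments in ... are forwarded to FUN
  unname(timing["elapsed"])
}

elapsed <- running_time(sqrt, x = 1:100)
elapsed
```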
Produce a scatter plot for clustering results. If the dataset has more than two dimensions, the scatter plot will show the two first PCA axes.
scatterplot( d, clusters, centers = NULL, labels = FALSE, ellipses = FALSE, legend = c("auto1", "auto2"), ... )
d |
The dataset ( |
clusters |
Cluster labels of the training set ( |
centers |
Coordinates of the cluster centers. |
labels |
Indicates whether or not labels (row names) should be shown on the plot. |
ellipses |
Indicates whether or not ellipses should be drawn around clusters. |
legend |
Indicates where the legend is placed on the graphics. |
... |
Other parameters. |
require (datasets)
data (iris)
km = KMEANS (iris [, -5], k = 3)
scatterplot (iris [, -5], km$cluster)
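The projection step for datasets with more than two dimensions can be illustrated with base R: compute a PCA with prcomp, keep the first two components, and color the points by cluster. This is only a sketch under the assumption of an unscaled PCA; scatterplot may center, scale or draw the data differently.

```r
# Project a 4-D dataset on its first two principal components
# and color the points by k-means cluster (base R only).
x <- iris[, -5]
set.seed(0)
cl <- kmeans(x, centers = 3, nstart = 10)$cluster
proj <- prcomp(x)$x[, 1:2]          # coordinates on PC1 and PC2

if (interactive()) {
  plot(proj, col = cl, pch = 19, xlab = "PC1", ylab = "PC2")
}
```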
Select a subset of features for a classification task.
selectfeatures( train, labels, algorithm = c("ranking", "forward", "backward", "exhaustive"), unieval = if (algorithm[1] == "ranking") c("fisher", "fstat", "relief", "inertiaratio") else NULL, uninb = NULL, unithreshold = NULL, multieval = if (algorithm[1] == "ranking") NULL else c("mrmr", "cfs", "fstat", "inertiaratio", "wrapper"), wrapmethod = NULL, keep = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
algorithm |
The feature selection algorithm. |
unieval |
The (univariate) evaluation criterion. |
uninb |
The number of selected features (univariate evaluation). |
unithreshold |
The threshold for selecting features (univariate evaluation). |
multieval |
The (multivariate) evaluation criterion. |
wrapmethod |
The classification method used for the wrapper evaluation. |
keep |
If true, the dataset is kept in the returned result. |
... |
Other parameters. |
FEATURESELECTION, selection-class
## Not run:
require (datasets)
data (iris)
selectfeatures (iris [, -5], iris [, 5], algorithm = "forward", multieval = "fstat")
selectfeatures (iris [, -5], iris [, 5], algorithm = "ranking", uninb = 2)
selectfeatures (iris [, -5], iris [, 5], algorithm = "ranking", multieval = "wrapper", wrapmethod = LDA)
## End(Not run)
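To give an idea of what a univariate ranking does, the sketch below scores each feature with a textbook Fisher criterion (between-class scatter divided by within-class scatter) and keeps the two best ones. The helper fisher_score is hypothetical, and the exact definition used by the "fisher" option of selectfeatures is an assumption.

```r
# Rank features with a textbook Fisher score, in base R.
# fisher_score is a hypothetical helper, not part of fdm2id.
fisher_score <- function(x, y) {
  sapply(x, function(col) {
    n <- tapply(col, y, length)   # class sizes
    m <- tapply(col, y, mean)     # class means
    v <- tapply(col, y, var)      # class variances
    sum(n * (m - mean(col))^2) / sum(n * v)
  })
}

scores <- fisher_score(iris[, -5], iris[, 5])
top2 <- names(sort(scores, decreasing = TRUE))[1:2]
top2
```

On iris the two petal measurements obtain the highest scores, which mirrors what uninb = 2 asks for in the ranking example above.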
This class contains the result of feature selection algorithms.
selection
A vector of integers indicating the selected features.
unieval
The evaluation of the features (univariate).
multieval
The evaluation of the selected features (multivariate).
algorithm
The algorithm used to select features.
univariate
The evaluation criterion (univariate).
nbfeatures
The number of features to be kept.
threshold
The threshold to decide whether a feature is kept or not.
multivariate
The evaluation criterion (multivariate).
dataset
The dataset described by the selected features only.
model
The classification model.
FEATURESELECTION, predict.selection, selectfeatures
This dataset has been used in a study on snoring in Angers hospital.
snore
The dataset has 100 instances described by 7 variables. The variables are as follows:
Age
In years.
Weights
In kg.
Height
In cm.
Alcool
Number of glasses of alcohol per day.
Sex
M for male or F for female.
Snore
Snoring diagnosis (Y or N).
Tobacco
Y or N.
http://forge.info.univ-angers.fr/~gh/Datasets/datasets.htm
Run the SOM algorithm for clustering.
SOM( d, xdim = floor(sqrt(nrow(d))), ydim = floor(sqrt(nrow(d))), rlen = 10000, post = c("none", "single", "ward"), k = NULL, ... )
d |
The dataset ( |
xdim, ydim |
The dimensions of the grid. |
rlen |
The number of iterations. |
post |
The post-treatment method: |
k |
The number of clusters (only used if |
... |
Other parameters. |
The fitted Kohonen's map as an object of class som.
require (datasets)
data (iris)
SOM (iris [, -5], xdim = 5, ydim = 5, post = "ward", k = 3)
This class contains the model obtained by the SOM method.
som
An object of class kohonen representing the fitted map.
nodes
A vector of integers indicating the cluster to which each node is allocated.
cluster
A vector of integers indicating the cluster to which each observation is allocated.
data
The dataset that has been used to fit the map (as a matrix).
Run a Spectral clustering algorithm.
SPECTRAL(d, k, sigma = 1, graph = TRUE, ...)
d |
The dataset ( |
k |
The number of clusters. |
sigma |
Width of the Gaussian used to build the affinity matrix. |
graph |
A logical indicating whether or not a graphic should be plotted (projection on the spectral space of the affinity matrix). |
... |
Other parameters. |
## Not run:
require (datasets)
data (iris)
SPECTRAL (iris [, -5], k = 3)
## End(Not run)
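The role of sigma can be illustrated with a Gaussian affinity matrix, the usual starting point of spectral clustering: pairs of close points get an affinity near 1, distant pairs an affinity near 0. The helper gaussian_affinity is hypothetical; the 2 * sigma^2 scaling and the zero diagonal are common conventions taken as assumptions here, not a description of SPECTRAL's internals.

```r
# Gaussian (RBF) affinity matrix from pairwise Euclidean distances.
# gaussian_affinity is a hypothetical helper, not part of fdm2id.
gaussian_affinity <- function(d, sigma = 1) {
  dist2 <- as.matrix(dist(d))^2           # squared Euclidean distances
  a <- exp(-dist2 / (2 * sigma^2))        # closer points get higher affinity
  diag(a) <- 0                            # no self-affinity
  a
}

pts <- rbind(c(0, 0), c(0, 1), c(3, 4))
a <- gaussian_affinity(pts, sigma = 1)
a
```

A larger sigma makes distant points look more similar, so the resulting clusters tend to merge; a smaller sigma isolates points.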
This class contains the model obtained by Spectral clustering.
cluster
A vector of integers indicating the cluster to which each observation is allocated.
proj
The projection of the dataset in the spectral space.
centers
The cluster centers (in the spectral space).
The data have been organized in two different but related classification tasks. The first task consists in classifying patients as belonging to one out of three categories: Normal, Disk Hernia or Spondylolisthesis. For the second task, the categories Disk Hernia and Spondylolisthesis were merged into a single category labelled as 'abnormal'. Thus, the second task consists in classifying patients as belonging to one out of two categories: Normal or Abnormal.
spine spine.train spine.test
The dataset has 310 instances described by 8 variables.
Variables V1 to V6 are biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine.
The variable Classif2 is the classification into two classes AB and NO.
The variable Classif3 is the classification into 3 classes DH, SL and NO.
spine.train contains 217 instances and spine.test contains 93.
http://archive.ics.uci.edu/ml/datasets/vertebral+column
This function splits a dataset into a training set and a test set. It returns an object of class dataset-class.
splitdata(dataset, target, size = round(0.7 * nrow(dataset)), seed = NULL)
dataset |
The dataset to be split ( |
target |
The column index of the target variable (class label or response variable). |
size |
The size of the training set (as an integer value). |
seed |
A specified seed for random number generation. |
An object of class dataset-class.
require (datasets)
data (iris)
d = splitdata (iris, 5)
str (d)
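A random hold-out split of this kind can be sketched in base R as follows. The field names train.x, train.y, test.x and test.y are the ones used in other examples of this manual; the helper split_sketch itself is hypothetical and returns a plain list rather than an object of class dataset-class.

```r
# Random train/test split with an optional seed, in base R.
# split_sketch is a hypothetical helper, not part of fdm2id.
split_sketch <- function(dataset, target,
                         size = round(0.7 * nrow(dataset)), seed = NULL) {
  if (!is.null(seed)) set.seed(seed)      # reproducible sampling
  idx <- sample(nrow(dataset), size)      # rows of the training set
  list(train.x = dataset[idx, -target, drop = FALSE],
       train.y = dataset[idx, target],
       test.x = dataset[-idx, -target, drop = FALSE],
       test.y = dataset[-idx, target])
}

d <- split_sketch(iris, 5, seed = 1)
str(d, max.level = 1)
```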
Evaluate a clustering algorithm according to its stability, through a bootstrap procedure.
stability( clusteringmethods, d, originals = NULL, eval = "jaccard", type = c("cluster", "global"), nsampling = 10, seed = NULL, names = NULL, graph = FALSE, ... )
clusteringmethods |
The clustering methods to be evaluated. |
d |
The dataset. |
originals |
The original clustering. |
eval |
The evaluation criteria. |
type |
The comparison method. |
nsampling |
The number of bootstrap runs. |
seed |
A specified seed for random number generation (useful for testing different methods with the same bootstrap samples). |
names |
Method names. |
graph |
Indicates whether or not a graphic is plotted for each sample. |
... |
Parameters to be passed to the clustering algorithms. |
The evaluation of the clustering algorithm(s) (numeric values).
## Not run:
require (datasets)
data (iris)
stability (KMEANS, iris [, -5], seed = 0, k = 3)
stability (KMEANS, iris [, -5], seed = 0, k = 3, eval = c ("jaccard", "accuracy"), type = "global")
stability (KMEANS, iris [, -5], seed = 0, k = 3, type = "cluster")
stability (KMEANS, iris [, -5], seed = 0, k = 3, eval = c ("jaccard", "accuracy"), type = "cluster")
stability (c (KMEANS, HCA), iris [, -5], seed = 0, k = 3)
stability (c (KMEANS, HCA), iris [, -5], seed = 0, k = 3, eval = c ("jaccard", "accuracy"), type = "global")
stability (c (KMEANS, HCA), iris [, -5], seed = 0, k = 3, type = "cluster")
stability (c (KMEANS, HCA), iris [, -5], seed = 0, k = 3, eval = c ("jaccard", "accuracy"), type = "cluster")
stability (KMEANS, iris [, -5], originals = KMEANS (iris [, -5], k = 3)$cluster, seed = 0, k = 3)
stability (KMEANS, iris [, -5], originals = KMEANS (iris [, -5], k = 3), seed = 0, k = 3)
## End(Not run)
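The general idea of a bootstrap stability assessment can be sketched in base R: cluster the full dataset once, re-cluster bootstrap samples, and compare each result with the reference on the sampled points. The comparison below is a pair-counting Jaccard index; pair_jaccard is a hypothetical helper, and the exact "jaccard" criterion and resampling scheme used by stability are assumptions.

```r
# Pair-counting Jaccard index between two partitions of the same points.
# pair_jaccard is a hypothetical helper, not part of fdm2id.
pair_jaccard <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  up <- upper.tri(same_a)                  # each pair of points counted once
  sum(same_a[up] & same_b[up]) / sum(same_a[up] | same_b[up])
}

set.seed(0)
x <- as.matrix(iris[, -5])
ref <- kmeans(x, centers = 3, nstart = 10)$cluster   # reference clustering
scores <- replicate(10, {
  idx <- unique(sample(nrow(x), replace = TRUE))     # bootstrap sample
  boot <- kmeans(x[idx, ], centers = 3, nstart = 10)$cluster
  pair_jaccard(ref[idx], boot)
})
mean(scores)
```

Because the index compares pairs of points rather than cluster labels, it is insensitive to the arbitrary numbering of clusters between runs.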
This function builds a classification model using CART with maxdepth = 1.
STUMP(train, labels, randomvar = TRUE, tune = FALSE, ...)
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
randomvar |
If true, the model uses a random variable. |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other parameters. |
The classification model.
require (datasets)
data (iris)
STUMP (iris [, -5], iris [, 5])
Print summary of the set of rules in the classification model obtained by APRIORI.
## S3 method for class 'apriori'
summary(object, ...)
object |
The model to be printed. |
... |
Other parameters. |
APRIORI, predict.apriori, print.apriori, apriori-class, apriori
require ("datasets")
data (iris)
d = discretizeDF (iris, default = list (method = "interval", breaks = 3, labels = c ("small", "medium", "large")))
model = APRIORI (d [, -5], d [, 5], supp = .1, conf = .9, prune = TRUE)
summary (model)
Return the SVD decomposition.
SVD(x, ndim = min(nrow(x), ncol(x)), ...)
x |
A numeric dataset (data.frame or matrix). |
ndim |
The number of dimensions. |
... |
Other parameters. |
require (datasets)
data (iris)
SVD (iris [, -5])
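The effect of limiting the number of dimensions can be illustrated with base R's svd: keeping all singular values reconstructs the data exactly, while keeping only the first ndim gives a low-rank approximation. This sketch works on the raw matrix; whether SVD centers or scales the data beforehand is not stated here and is left as an assumption.

```r
# Full and truncated singular value decomposition with base R.
x <- as.matrix(iris[, -5])
s <- svd(x)                               # x = U diag(d) V'

# Exact reconstruction from all singular values
full <- s$u %*% diag(s$d) %*% t(s$v)

# Rank-2 approximation from the two largest singular values
ndim <- 2
keep <- seq_len(ndim)
approx2 <- s$u[, keep] %*% diag(s$d[keep], nrow = ndim) %*% t(s$v[, keep])

err <- sqrt(sum((x - approx2)^2))         # Frobenius norm of the residual
err
```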
This function builds a classification model using Support Vector Machine.
SVM( train, labels, gamma = 2^(-3:3), cost = 2^(-3:3), kernel = c("radial", "linear"), methodparameters = NULL, tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
gamma |
The gamma parameter (if a vector, cross-validation is used to choose the best value). |
cost |
The cost parameter (if a vector, cross-validation is used to choose the best value). |
kernel |
The kernel type. |
methodparameters |
Object containing the parameters. If given, it replaces |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other arguments. |
The classification model.
## Not run:
require (datasets)
data (iris)
SVM (iris [, -5], iris [, 5], kernel = "linear", cost = 1)
SVM (iris [, -5], iris [, 5], kernel = "radial", gamma = 1, cost = 1)
## End(Not run)
This function builds a classification model using Support Vector Machine with a linear kernel.
SVMl( train, labels, cost = 2^(-3:3), methodparameters = NULL, tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
cost |
The cost parameter (if a vector, cross-validation is used to choose the best value). |
methodparameters |
Object containing the parameters. If given, it replaces |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other arguments. |
The classification model.
## Not run:
require (datasets)
data (iris)
SVMl (iris [, -5], iris [, 5], cost = 1)
## End(Not run)
This function builds a classification model using Support Vector Machine with a radial kernel.
SVMr( train, labels, gamma = 2^(-3:3), cost = 2^(-3:3), methodparameters = NULL, tune = FALSE, ... )
train |
The training set (description), as a |
labels |
Class labels of the training set ( |
gamma |
The gamma parameter (if a vector, cross-validation is used to choose the best value). |
cost |
The cost parameter (if a vector, cross-validation is used to choose the best value). |
methodparameters |
Object containing the parameters. If given, it replaces |
tune |
If true, the function returns parameters instead of a classification model. |
... |
Other arguments. |
The classification model.
## Not run:
require (datasets)
data (iris)
SVMr (iris [, -5], iris [, 5], gamma = 1, cost = 1)
## End(Not run)
This function builds a regression model using Support Vector Machine.
SVR( x, y, gamma = 2^(-3:3), cost = 2^(-3:3), kernel = c("radial", "linear"), epsilon = c(0.1, 0.5, 1), params = NULL, tune = FALSE, ... )
x |
Predictor |
y |
Response |
gamma |
The gamma parameter (if a vector, cross-validation is used to choose the best value). |
cost |
The cost parameter (if a vector, cross-validation is used to choose the best value). |
kernel |
The kernel type. |
epsilon |
The epsilon parameter (if a vector, cross-validation is used to choose the best value). |
params |
Object containing the parameters. If given, it replaces |
tune |
If true, the function returns parameters instead of a regression model. |
... |
Other arguments. |
The regression model.
## Not run:
require (datasets)
data (trees)
SVR (trees [, -3], trees [, 3], kernel = "linear", cost = 1)
SVR (trees [, -3], trees [, 3], kernel = "radial", gamma = 1, cost = 1)
## End(Not run)
This function builds a regression model using Support Vector Machine with a linear kernel.
SVRl( x, y, cost = 2^(-3:3), epsilon = c(0.1, 0.5, 1), params = NULL, tune = FALSE, ... )
x |
Predictor |
y |
Response |
cost |
The cost parameter (if a vector, cross-validation is used to choose the best value). |
epsilon |
The epsilon parameter (if a vector, cross-validation is used to choose the best value). |
params |
Object containing the parameters. If given, it replaces |
tune |
If true, the function returns parameters instead of a regression model. |
... |
Other arguments. |
The regression model.
## Not run:
require (datasets)
data (trees)
SVRl (trees [, -3], trees [, 3], cost = 1)
## End(Not run)
This function builds a regression model using Support Vector Machine with a radial kernel.
SVRr( x, y, gamma = 2^(-3:3), cost = 2^(-3:3), epsilon = c(0.1, 0.5, 1), params = NULL, tune = FALSE, ... )
x |
Predictor |
y |
Response |
gamma |
The gamma parameter (if a vector, cross-validation is used to choose the best value). |
cost |
The cost parameter (if a vector, cross-validation is used to choose the best value). |
epsilon |
The epsilon parameter (if a vector, cross-validation is used to choose the best value). |
params |
Object containing the parameters. If given, it replaces |
tune |
If true, the function returns parameters instead of a regression model. |
... |
Other arguments. |
The regression model.
## Not run:
require (datasets)
data (trees)
SVRr (trees [, -3], trees [, 3], gamma = 1, cost = 1)
## End(Not run)
The data contain temperature measurements and geographic coordinates of 35 European cities.
temperature
The dataset has 35 instances described by 17 variables: the average temperature for each of the 12 months, the mean and amplitude of the temperature, the latitude and longitude of the city, and its location in Europe.
Apply a data mining function to vectorized text.
TEXTMINING(corpus, miningmethod, vector = c("docs", "words"), ...)
corpus |
The corpus. |
miningmethod |
The data mining method. |
vector |
Indicates the type of vectorization, documents (TF-IDF) or words (GloVe). |
... |
Parameters passed to the vectorization and to the data mining method. |
The result of the data mining method.
predict.textmining, textmining-class, vectorize.docs, vectorize.words
## Not run:
require (text2vec)
data ("movie_review")
d = movie_review [, 2:3]
d [, 1] = factor (d [, 1])
d = splitdata (d, 1)
model = TEXTMINING (d$train.x, NB, labels = d$train.y, mincount = 50)
pred = predict (model, d$test.x)
evaluation (pred, d$test.y)
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
clusters = TEXTMINING (text, HCA, vector = "words", k = 9, maxwords = 100)
plotclus (clusters$res, text, type = "tree", labels = TRUE)
## End(Not run)
Object used for text mining.
vectorizer
The vectorizer.
vectors
The vectorized dataset.
res
The result of the text mining method.
This dataset from the British Board of Trade depicts the fate of the passengers and crew during the RMS Titanic disaster.
titanic
The dataset has 2201 instances described by 4 variables. The variables are as follows:
Category
1st, 2nd, 3rd Class or Crew.
Age
Adult or Child.
Sex
Female or Male.
Fate
Casualty or Survivor.
British Board of Trade (1990), Report on the Loss of the ‘Titanic’ (S.S.). British Board of Trade Inquiry Report (reprint). Gloucester, UK: Allan Sutton Publishing.
Draws a dendrogram.
treeplot( clustering, labels = FALSE, k = NULL, split = TRUE, horiz = FALSE, ... )
clustering |
The dendrogram to be plotted (result of |
labels |
Indicates whether or not labels (row names) should be shown on the plot. |
k |
Number of clusters. If not specified an "optimal" value is determined. |
split |
Indicates whether or not the clusters should be highlighted in the graphics. |
horiz |
Indicates if the dendrogram should be drawn horizontally or not. |
... |
Other parameters. |
dendrogram, HCA, hclust, agnes
require (datasets)
data (iris)
hca = HCA (iris [, -5], method = "ward", k = 3)
treeplot (hca)
Return the t-SNE dimensionality reduction.
TSNE(x, perplexity = 30, nstart = 10, ...)
x |
A numeric dataset (data.frame or matrix). |
perplexity |
Specification of the perplexity. |
nstart |
How many random sets should be chosen? |
... |
Other parameters. |
require (datasets)
data (iris)
TSNE (iris [, -5])
The dataset presents the demographics of a French university.
universite
The dataset has 10 instances (university departments) described by 12 variables. The first six variables are the numbers of female and male students studying for a bachelor's degree (Licence), a master's degree (Master) and a doctorate (Doctorat). The last six variables are obtained by combining the first ones.
https://husson.github.io/data.html
Vectorize a corpus of documents.
vectorize.docs( vectorizer = NULL, corpus = NULL, lang = "en", stopwords = lang, ngram = 1, mincount = 10, minphrasecount = NULL, transform = c("tfidf", "lsa", "l1", "none"), latentdim = 50, returndata = TRUE, ... )
vectorizer |
The document vectorizer. |
corpus |
The corpus of documents (a vector of characters). |
lang |
The language of the documents (NULL if no stemming). |
stopwords |
Stopwords, or the language of the documents. NULL if stop words should not be removed. |
ngram |
Maximum size of n-grams. |
mincount |
Minimum word count to be considered as frequent. |
minphrasecount |
Minimum count for a collocation of words to be considered as frequent. |
transform |
Transformation (TF-IDF, LSA, L1 normalization, or nothing). |
latentdim |
Number of latent dimensions if LSA transformation is performed. |
returndata |
If true, the vectorized documents are returned. If false, a "vectorizer" is returned. |
... |
Other parameters. |
The vectorized documents.
query.docs, stopwords, vectorizers
## Not run:
require (text2vec)
data ("movie_review")
# Clustering
docs = vectorize.docs (corpus = movie_review$review, transform = "tfidf")
km = KMEANS (docs [sample (nrow (docs), 100), ], k = 10)
# Classification
d = movie_review [, 2:3]
d [, 1] = factor (d [, 1])
d = splitdata (d, 1)
vectorizer = vectorize.docs (corpus = d$train.x, returndata = FALSE, mincount = 50)
train = vectorize.docs (corpus = d$train.x, vectorizer = vectorizer)
test = vectorize.docs (corpus = d$test.x, vectorizer = vectorizer)
model = NB (as.matrix (train), d$train.y)
pred = predict (model, as.matrix (test))
evaluation (pred, d$test.y)
## End(Not run)
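To show what the default "tfidf" transformation computes in principle, the sketch below builds a tiny document-term matrix in base R and weights it with the textbook TF-IDF formula (term count times the log of the inverse document frequency). The three toy documents are made up for illustration, and this unsmoothed formula is an assumption: the weighting actually applied through text2vec may use smoothing or normalization.

```r
# Textbook TF-IDF on a toy corpus, in base R.
docs <- c("the cat sat", "the dog sat", "the cat ran fast")
tokens <- strsplit(tolower(docs), "\\s+")          # naive whitespace tokenizer
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term
tf <- t(vapply(tokens,
               function(tk) as.numeric(table(factor(tk, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

idf <- log(nrow(tf) / colSums(tf > 0))             # rare terms get larger weights
tfidf <- sweep(tf, 2, idf, "*")
round(tfidf, 3)
```

A term that occurs in every document, such as "the" here, receives a weight of zero, which is why TF-IDF tends to downplay very common words.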
Vectorize words from a corpus of documents.
vectorize.words( corpus = NULL, ndim = 50, maxwords = NULL, mincount = 5, minphrasecount = NULL, window = 5, maxcooc = 10, maxiter = 10, epsilon = 0.01, lang = "en", stopwords = lang, ... )
corpus |
The corpus of documents (a vector of characters). |
ndim |
The number of dimensions of the vector space. |
maxwords |
The maximum number of words. |
mincount |
Minimum word count to be considered as frequent. |
minphrasecount |
Minimum count for a collocation of words to be considered frequent. |
window |
Window size for the construction of the term co-occurrence matrix. |
maxcooc |
Maximum number of co-occurrences to use in the weighting function. |
maxiter |
The maximum number of iterations to fit the GloVe model. |
epsilon |
Defines the early stopping strategy when fitting the GloVe model. |
lang |
The language of the documents (NULL to disable stemming). |
stopwords |
Stopwords, or the language of the documents. NULL if stop words should not be removed. |
... |
Other parameters. |
The vectorized words.
query.words, stopwords, vectorizers
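The window and maxcooc arguments can be illustrated with a small base R sketch. It assumes that maxcooc plays the role of the x_max cap in the GloVe weighting function and uses the exponent 0.75 from the original GloVe paper. Both helper functions are hypothetical and are not part of the package; the counting is also simplified (plain counts, with no weighting by distance).

```r
# Symmetric co-occurrence counts within a fixed window (hypothetical helper)
cooccurrence <- function(tokens, window = 5) {
  vocab <- sort(unique(tokens))
  m <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  n <- length(tokens)
  for (i in seq_len(n)) {
    # Pair position i with the words that follow it, up to `window` positions away
    for (j in seq_len(window)) {
      if (i + j > n) break
      a <- tokens[i]
      b <- tokens[i + j]
      m[a, b] <- m[a, b] + 1
      m[b, a] <- m[b, a] + 1
    }
  }
  m
}

# GloVe weighting function: grows as (x / x_max)^alpha, capped at 1 for x >= x_max
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  pmin((x / x_max)^alpha, 1)
}

counts <- cooccurrence(c("a", "b", "a", "c"), window = 1)
weights <- glove_weight(counts, x_max = 10)
```

A larger window counts more distant word pairs as co-occurring, while the cap prevents very frequent pairs from dominating the fit.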
    ## Not run:
    text = loadtext ("http://mattmahoney.net/dc/text8.zip")
    words = vectorize.words (text, minphrasecount = 50)
    query.words (words, origin = "paris", sub = "france", add = "germany")
    query.words (words, origin = "berlin", sub = "germany", add = "france")
    query.words (words, origin = "new_zealand")
    ## End(Not run)
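The origin, sub and add arguments in the queries above suggest the usual word-vector arithmetic (origin - sub + add) followed by a nearest-neighbour search. The sketch below shows that idea in base R on hand-made 3-dimensional vectors. The vectors, the `nearest` helper, the use of cosine similarity and the exclusion of the query words are all assumptions made for illustration, not the package's actual behaviour.

```r
# Hypothetical word vectors: dimensions are (French, German, capital city)
vectors <- rbind(
  paris   = c(1, 0, 1),
  france  = c(1, 0, 0),
  berlin  = c(0, 1, 1),
  germany = c(0, 1, 0),
  madrid  = c(0, 0, 1)
)

# Hypothetical helper: nearest words to (origin - sub + add) by cosine similarity
nearest <- function(vectors, origin, sub = NULL, add = NULL, n = 1) {
  target <- vectors[origin, ]
  if (!is.null(sub)) target <- target - vectors[sub, ]
  if (!is.null(add)) target <- target + vectors[add, ]
  # Cosine similarity between the target and every word vector
  sims <- as.vector(vectors %*% target) /
    (sqrt(rowSums(vectors^2)) * sqrt(sum(target^2)))
  names(sims) <- rownames(vectors)
  # Exclude the query words themselves from the candidates
  sims <- sims[setdiff(names(sims), c(origin, sub, add))]
  head(sort(sims, decreasing = TRUE), n)
}

capital_of_germany <- nearest(vectors, origin = "paris", sub = "france", add = "germany")
capital_of_france <- nearest(vectors, origin = "berlin", sub = "germany", add = "france")
```

With real embeddings the match is approximate rather than exact, so several candidates are typically inspected instead of only the closest one.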
This class contains a vectorization model for textual documents.
vectorizer
The vectorizer.
transform
The transformation to be applied after vectorization (normalization, TF-IDF).
phrases
The phrase detection method.
tfidf
The TF-IDF transformation.
lsa
The LSA transformation.
tokens
The tokens from the original documents.
Excerpt of the Letter Recognition Data Set (UCI repository).
vowels vowels.train vowels.test
The dataset has 4664 instances described by 17 variables. The first variable is the classification into 6 classes (letters A, E, I, O, U and Y). vowels.train contains 233 instances and vowels.test contains 4431.
https://archive.ics.uci.edu/ml/datasets/letter+recognition
The data contains kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, with 70 randomly selected elements each. The internal kernel structure was visualized in high quality using a soft X-ray technique. The images were recorded on 13x18 cm X-ray KODAK plates. Source: Institute of Agrophysics of the Polish Academy of Sciences in Lublin.
wheat
The dataset has 210 instances described by 8 variables: area, perimeter, compactness, length, width, asymmetry coefficient, groove length and variety.
https://archive.ics.uci.edu/ml/datasets/seeds
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
wine
There are 178 observations and 14 variables. The first variable is the class label (1, 2, 3).
https://archive.ics.uci.edu/ml/datasets/wine
Animal description based on various features.
zoo
The dataset has 101 instances described by 17 qualitative variables.
https://archive.ics.uci.edu/ml/datasets/zoo