Preprocessing Functions

AddColBinnedToBinary()

Bin the values of a selected continuous column into 2 bins (halves) and add the bin assignments as a new column

AddColBinnedToQuartiles()

Bin the values of a selected continuous column into 4 bins (quartiles) and add the bin assignments as a new column
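
As a rough illustration of quartile binning, the base R sketch below cuts a column at its 25th, 50th, and 75th percentiles and stores the bin number in a new column. The helper name add_quartile_bins and its arguments are hypothetical and are not the signature of AddColBinnedToQuartiles().

```r
# Hypothetical helper for illustration; not the package's own function.
add_quartile_bins <- function(df, col, new_col = paste0(col, "_quartile")) {
  # Cut points at the minimum, the three quartiles, and the maximum.
  breaks <- quantile(df[[col]], probs = seq(0, 1, by = 0.25), na.rm = TRUE)
  # include.lowest = TRUE keeps the minimum value inside the first bin;
  # labels = FALSE returns integer bin codes 1..4 instead of interval labels.
  df[[new_col]] <- cut(df[[col]], breaks = breaks,
                       include.lowest = TRUE, labels = FALSE)
  df
}

example <- data.frame(x = 1:8)
binned <- add_quartile_bins(example, "x")
print(binned)
```

Note that cut() stops with an error when two quartiles coincide (heavily tied data), so a real implementation would need to handle that case.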

AddPCsToEnd()

Perform PCA and append the principal components to the end of the data frame as new columns

captureSessionInfo()

Capture session info

ConvertDataToPercentiles()

Use percentiles to assess for outliers in multidimensional data

CorAssoTestMultipleWithErrorHandling()

Take multiple vectors and run correlation/association tests on all of them

correlation.association.test()

Given two numerical data vectors, determine their correlation

describeNumericalColumns()

Describe each numerical feature: mean, standard deviation, median, skewness (symmetry), kurtosis (flatness), and whether it passes a normality test

describeNumericalColumnsWithLevels()

For each level, describe each numerical feature: mean, standard deviation, median, skewness (symmetry), kurtosis (flatness), and whether it passes a normality test

DownSampleDataframe()

Down sample an imbalanced dataset to get a balanced dataset
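
One common way to down-sample is to keep, for every class, as many randomly chosen rows as the smallest class has. The sketch below shows that idea in base R; downsample_to_balance and its arguments are hypothetical names, and the package function may work differently.

```r
# Hypothetical helper for illustration; not the package's own function.
downsample_to_balance <- function(df, class_col) {
  # The smallest class determines how many rows each class keeps.
  n_min <- min(table(df[[class_col]]))
  # Row indices grouped by class label.
  idx_by_class <- split(seq_len(nrow(df)), df[[class_col]])
  # Draw n_min row indices without replacement from every class.
  keep <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), n_min)]
  }), use.names = FALSE)
  # Preserve the original row order.
  df[sort(keep), , drop = FALSE]
}

set.seed(1)
imbalanced <- data.frame(y = c(rep("a", 10), rep("b", 3)), x = rnorm(13))
balanced <- downsample_to_balance(imbalanced, "y")
print(table(balanced$y))
```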

generate.descriptive.plots()

Use histograms and boxplots to get a general idea of what the data looks like

generate.descriptive.plots.save.pdf()

Use histograms and boxplots to get a general idea of what the data looks like, and save the plots to a PDF

GenerateElbowPlotPCA()

Create elbow plot to see how much total variance is explained by the components
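
The quantity behind an elbow plot is the share of total variance carried by each component, which can be computed from the standard deviations returned by prcomp(). The sketch below uses the built-in iris data as a stand-in; it is an illustration of the technique, not the package's implementation.

```r
# PCA on the four numeric iris columns, centered and scaled to unit variance.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Draw the plot only in interactive sessions so the script also runs in batch mode.
if (interactive()) {
  plot(seq_along(cum_var), cum_var, type = "b",
       xlab = "Number of components",
       ylab = "Cumulative proportion of variance explained")
}
```

The "elbow" is the point after which adding another component raises the cumulative proportion only slightly.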

GeneratePC1andPC2PlotsWithAndWithoutOutliers()

Generate PC1 vs PC2 plots with and without outliers.

Log2TargetDensityPlotComparison()

Apply a log2 transformation to a column, then compare the column with and without the transformation

LookAtPCFeatureLoadings()

Principal component feature loadings

MultipleColumnsNormalCheckThenBoxCox()

Check multiple columns in a data frame to see whether each is normally distributed. If a column is not, apply a Box-Cox transformation

NormalCheckThenBoxCoxTransform()

Check whether the data is normally distributed using the Shapiro-Wilk test. If it is not, apply a Box-Cox transformation.
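
The check-then-transform logic can be sketched in base R as follows. The helper names, the 0.05 cutoff, and the lambda grid are assumptions made for the example; lambda is chosen here by a simple grid search over the Box-Cox profile log-likelihood, which may differ from how the package picks it.

```r
# Box-Cox transform for a single lambda (log when lambda is effectively zero).
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical helper for illustration; not the package's own function.
normal_check_then_box_cox <- function(x, alpha = 0.05,
                                      lambdas = seq(-2, 2, by = 0.1)) {
  stopifnot(all(x > 0))  # Box-Cox is defined for positive values only
  # Shapiro-Wilk: a p-value at or above alpha gives no evidence against normality.
  if (shapiro.test(x)$p.value >= alpha) {
    return(list(values = x, transformed = FALSE, lambda = NA_real_))
  }
  n <- length(x)
  # Profile log-likelihood of each candidate lambda (up to a constant).
  loglik <- sapply(lambdas, function(l) {
    y <- box_cox(x, l)
    -n / 2 * log(mean((y - mean(y))^2)) + (l - 1) * sum(log(x))
  })
  best <- lambdas[which.max(loglik)]
  list(values = box_cox(x, best), transformed = TRUE, lambda = best)
}

set.seed(42)
skewed <- rlnorm(200)  # log-normal data, clearly non-normal
result <- normal_check_then_box_cox(skewed)
print(result$lambda)
```

For log-normal input the selected lambda should land near 0, which corresponds to a log transformation.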

RanomlySelectOneRowForEach()

Randomly select one row

RecodeIdentifier()

Recode the identifier column of a dataset

RemoveColWithAllZeros()

Remove columns with all zeros

RemoveRowsBasedOnCol()

Remove rows from the dataframe if the row contains a value in the specified columns

RemoveSamplesWithInstability()

Remove samples that have multiple values for a single column and those values are unstable

SplitIntoTrainTest()

Split into train and test
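
A random train/test split can be sketched in a few lines of base R. The helper name split_train_test and the train_fraction argument are hypothetical, and the 80/20 ratio is only an example.

```r
# Hypothetical helper for illustration; not the package's own function.
split_train_test <- function(df, train_fraction = 0.8) {
  n_train <- floor(train_fraction * nrow(df))
  train_idx <- sample.int(nrow(df), n_train)
  # A logical mask avoids the empty-negative-index pitfall when n_train is 0.
  in_train <- seq_len(nrow(df)) %in% train_idx
  list(train = df[in_train, , drop = FALSE],
       test = df[!in_train, , drop = FALSE])
}

set.seed(123)
parts <- split_train_test(iris, 0.8)
print(c(train = nrow(parts$train), test = nrow(parts$test)))
```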

StabilityTestingAcrossVisits()

Assess stability of values that correspond to a single identifier

SubsetDataByContinuousCol()

Subset data by two bounds on a continuous column

TwoSampleTTest()

Performs two sample t-test on multiple features
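
Running a two-sample t-test per feature amounts to looping t.test() over the columns of interest. The sketch below is an assumption about the general approach, with a hypothetical helper name; it uses R's default Welch test, which does not assume equal variances.

```r
# Hypothetical helper for illustration; not the package's own function.
two_sample_t_tests <- function(df, group_col, feature_cols) {
  groups <- unique(df[[group_col]])
  stopifnot(length(groups) == 2)  # exactly two groups are required
  # One p-value per feature, named after the feature column.
  sapply(feature_cols, function(f) {
    t.test(df[[f]][df[[group_col]] == groups[1]],
           df[[f]][df[[group_col]] == groups[2]])$p.value
  })
}

# Keep two of the three iris species so there are exactly two groups.
two_groups <- droplevels(subset(iris, Species != "virginica"))
p_values <- two_sample_t_tests(two_groups, "Species",
                               c("Sepal.Length", "Sepal.Width"))
print(p_values)
```

When many features are tested at once, the p-values usually need a multiple-testing correction such as p.adjust().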

ZScoreChallengeOutliers()

Remove outliers based on Z score of a particular variable
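
Z-score filtering standardizes a variable and drops rows whose absolute score exceeds a cutoff. The sketch below assumes a cutoff of 3, a common convention; the helper name and the default threshold are not taken from the package.

```r
# Hypothetical helper for illustration; not the package's own function.
remove_z_outliers <- function(df, col, threshold = 3) {
  # Standardize: distance from the mean in units of standard deviation.
  z <- (df[[col]] - mean(df[[col]], na.rm = TRUE)) /
    sd(df[[col]], na.rm = TRUE)
  # Keep rows within the threshold; rows with a missing value are dropped too.
  df[!is.na(z) & abs(z) <= threshold, , drop = FALSE]
}

values <- data.frame(v = c(rep(10, 20), 1000))
cleaned <- remove_z_outliers(values, "v")
print(nrow(cleaned))
```

Because the outlier itself inflates the mean and standard deviation, this method can miss extreme values in very small samples.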

Clustering Functions

CalcOptimalNumClustersForKMeans()

Generate plots to help decide the optimal number of clusters for k-means
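
A typical diagnostic for choosing the number of clusters is the total within-cluster sum of squares across a range of k. The sketch below computes it with stats::kmeans() on the built-in iris data; the range 1 to 8 and nstart = 20 are arbitrary choices for the example, and the package may produce different or additional plots.

```r
set.seed(7)
# Scale the features so each contributes equally to the distances.
features <- scale(iris[, 1:4])

k_values <- 1:8
# Total within-cluster sum of squares for each candidate k.
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 1))

# Draw the plot only in interactive sessions so the script also runs in batch mode.
if (interactive()) {
  plot(k_values, wss, type = "b",
       xlab = "Number of clusters (k)",
       ylab = "Total within-cluster sum of squares")
}
```

The value of k where the curve flattens is a reasonable candidate, although the choice remains a judgment call.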

generate.2D.clustering.with.labeled.subgroup()

Make a 2D scatter plot that shows the data as represented by PC1 and PC2

generate.3D.clustering.with.labeled.subgroup()

Make a 3D scatter plot that shows the data as represented by PC1, PC2, and PC3, with clusters labeled by color

generate.plots.comparing.clusters()

Compare clusters

GenerateParcoordForClusters()

Generate a parallel coordinates plot that shows each observation and the cluster it belongs to.

HierarchicalClustering()

Automated hierarchical clustering with labeling of observations and groups

Classification Functions

CVPredictionsRandomForest()

Create a cross-validated random forest model

CVRandomForestClassificationMatrixForPheatmap()

Generate a random forest model under cross validation (CV) for different subsets of the data and display results in a pheatmap to easily compare the different subsets

eval.classification.results()

Determine the performance of classification

find.best.number.of.trees()

Using the classification error rate for each number of trees, find the optimal number of trees to use for random forest classifier

GenerateExampleDataMachinelearnr()

Produce example data set for demonstrating package functions

LOOCVPredictionsRandomForestAutomaticMtryAndNtree()

Create a random forest model with leave-one-out cross-validation

LOOCVRandomForestClassificationMatrixForPheatmap()

Generate a random forest model under leave-one-out cross-validation (LOOCV) for different subsets of the data and display results in a pheatmap to easily compare the different subsets

RandomForestAutomaticMtryAndNtree()

Create a random forest classification model after optimizing mtry and ntree

RandomForestClassificationGiniMatrixForPheatmap()

Generate a random forest model for different subsets of the data and display the results in a matrix

RandomForestClassificationPercentileMatrixForPheatmap()

Generate a random forest model for different subsets of the data and display results in a pheatmap to easily compare the different subsets