Overview

ITS is evaluating options to best support advanced analytics and computational work on large datasets at UConn, based on the following appliances:

Netezza (a.k.a. PureData System for Analytics)

Netezza users can access its analytics algorithms using R and some Netezza packages, and can store structured data in the database. Users who do not need Netezza Advanced Analytics can still use Netezza as a relational database to store research data. Data can be managed via R or Aginity Workbench for PureData System for Analytics.

The following advanced analytics algorithms are available on Netezza:

  • Decision Trees, K-Means, Bayes Net, Naive Bayes, KNN
  • Clustering
  • PCA
  • Regression Tree, Linear Regression, Generalized Linear
  • Time Series, Spectral Analysis
  • One- and two-way ANOVA
  • Simple statistics and support for hypothesis testing

Additionally, Netezza provides fast in-database libraries for R. These can operate on large matrices and expose database tables as data frames, so database operations can be managed from R.

HPC (High Performance Computing Cluster)

Apache Spark on HPC provides access to many algorithms in the MLlib library.

ML algorithms include:

  • Classification: logistic regression, naive Bayes,...
  • Regression: generalized linear regression, survival regression,...
  • Decision trees, random forests, and gradient-boosted trees
  • Recommendation: alternating least squares (ALS)
  • Clustering: K-means, Gaussian mixtures (GMMs),...
  • Topic modeling: latent Dirichlet allocation (LDA)
  • Frequent itemsets, association rules, and sequential pattern mining

ML workflow utilities include:

  • Feature transformations: standardization, normalization, hashing,...
  • ML Pipeline construction
  • Model evaluation and hyper-parameter tuning
  • ML persistence: saving and loading models and Pipelines

Other utilities include:

  • Distributed linear algebra: SVD, PCA,...
  • Statistics: summary statistics, hypothesis testing,...

HPC users can access the algorithms using Python or R. Data can come from files or from external databases, including Netezza.
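As an illustration of the Python route, the following is a minimal sketch of an MLlib Pipeline that combines two of the items listed above: a feature transformation (standardization) followed by K-means clustering. It assumes PySpark is available in the user's HPC environment. The function name, the column names passed to it, and the CSV layout are hypothetical; how Spark is loaded and launched on the cluster is site-specific and not shown here.

```python
def cluster_with_spark(csv_path, feature_cols, k=3, seed=42):
    """Fit a standardize-then-cluster MLlib Pipeline on a CSV file.

    Returns the fitted PipelineModel and the input data with a
    'cluster' column added.
    """
    # Imports are deferred so this file can be loaded on a machine
    # without Spark; pyspark is only needed when the function is called.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

    # Assumes the file has a header row; inferSchema types numeric columns.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    stages = [
        # Pack the chosen numeric columns into a single vector column.
        VectorAssembler(inputCols=feature_cols, outputCol="raw_features"),
        # Center and scale each feature before clustering.
        StandardScaler(inputCol="raw_features", outputCol="features",
                       withMean=True, withStd=True),
        KMeans(featuresCol="features", predictionCol="cluster",
               k=k, seed=seed),
    ]
    model = Pipeline(stages=stages).fit(df)
    return model, model.transform(df)
```

A call such as cluster_with_spark("measurements.csv", ["height", "weight"]) (file and column names are placeholders) returns the fitted model, which can then be saved with model.write().overwrite().save(...) as an example of ML persistence.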

For more information regarding HPC (accounts, software, documentation, etc.), see https://hpc.uconn.edu/

Sample Use Cases

  1. Use R on a PC: data is read from files, stored in a Netezza database, and used later for in-database analytics modeling.
  2. Same as above, but the data is then read from Netezza to HPC and used in Apache Spark or any other software available on HPC (see https://wiki.hpc.uconn.edu/index.php/HPC_Software).
  3. Data is stored in Netezza, then queried and analyzed on PCs.
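The load-then-query pattern common to these use cases can be sketched in a few lines of Python. This is only an illustration: Python's built-in sqlite3 module stands in for Netezza, since a real Netezza connection needs its own driver and site-specific connection details that are not covered here. The table name, column names, and sample data are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a research data file.
SAMPLE_CSV = """site,temperature
A,20.5
A,21.5
B,18.0
B,19.0
B,20.0
"""


def load_rows(conn, csv_text):
    """Create the table if needed and insert the rows parsed from CSV text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (site TEXT, temperature REAL)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["site"], float(r["temperature"])) for r in reader]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def mean_by_site(conn):
    """Aggregate in the database so only the small result reaches the client."""
    cur = conn.execute(
        "SELECT site, AVG(temperature) FROM readings "
        "GROUP BY site ORDER BY site"
    )
    return dict(cur.fetchall())


# sqlite3 is a stand-in here; a Netezza connection would replace this line.
conn = sqlite3.connect(":memory:")
n = load_rows(conn, SAMPLE_CSV)
means = mean_by_site(conn)
print(n, means)
```

The point of the design is that the aggregation runs where the data lives, and the client only receives the summarized result. The same idea applies whether the client is a PC or a job on HPC.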