Data exploration in the data acquisition and understanding stage uses the IDEAR tool. Similar steps can be performed outside Azure, but they require some adaptation.
1. Read and Summarize the Data
Read the data into a pandas DataFrame and infer the column types (numerical or categorical)
Print the first n rows of data
Print the dimensions, column names and types of the data
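The steps above can be sketched with pandas as follows. The inline CSV text and its column names are made up for illustration, and the dtype-based rule for splitting numerical from categorical columns is an assumption, not necessarily the rule IDEAR applies.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real file path or URL.
csv_text = """age,income,education,label
39,77516,Bachelors,no
50,83311,Bachelors,no
38,215646,HS-grad,yes
53,234721,11th,no
28,338409,Masters,yes
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: numeric dtypes are numerical, everything else is categorical.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

n = 3
print(df.head(n))   # first n rows
print(df.shape)     # dimensions as (rows, columns)
print(df.dtypes)    # column names and their inferred types
```

A dtype-based split can misclassify integer-coded categories (for example, a numeric ZIP code), so the inferred lists are worth reviewing by hand.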
2. Extract Descriptive Statistics of Each Column
In this section, descriptive statistics of numerical and categorical columns are extracted and printed separately.
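As a minimal sketch, pandas can produce the two summaries separately with `DataFrame.describe`. The small data frame below is hypothetical and only illustrates the calls.

```python
import pandas as pd

# Hypothetical data with one numerical and one categorical column.
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Masters"],
})

# Numerical columns: count, mean, std, min, quartiles, max.
numerical_summary = df.describe(include=["number"])

# Non-numerical columns: count, number of unique values, most frequent value and its frequency.
categorical_summary = df.describe(exclude=["number"])

print(numerical_summary)
print(categorical_summary)
```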
3. Explore Individual Variables
In this section, you explore variables individually, including the target variable, numerical variables, and categorical variables. For numerical variables, a histogram, probability density plot, QQ-plot, and box plot are drawn, and a normality test is conducted. For categorical variables, a histogram and pie chart are drawn.
Categorical columns are explored on the entire dataset, and numerical columns are explored on a sampled dataset.
3.1. Explore the target variable
3.2. Explore individual numeric variables and test for normality (on sampled data)
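A normality test on sampled data might look like the sketch below. The text does not say which test is used, so the Shapiro-Wilk test from SciPy is an assumption here, as are the synthetic columns and the sample size of 500.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: one roughly normal column and one clearly skewed column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "normal_like": rng.normal(loc=0.0, scale=1.0, size=5000),
    "skewed": rng.exponential(scale=1.0, size=5000),
})

# Work on a random sample of rows rather than the full dataset.
sample = df.sample(n=500, random_state=0)

p_values = {}
for col in sample.columns:
    statistic, p_value = stats.shapiro(sample[col])
    p_values[col] = p_value
    print(f"{col}: W={statistic:.3f}, p={p_value:.3g}")
```

A small p-value is evidence against normality. With large samples even minor deviations become significant, so the QQ-plot remains a useful complement.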
3.3. Explore individual categorical variables (sorted by frequencies)
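For a categorical column, the frequencies behind the plots can be obtained with `Series.value_counts`, which sorts by descending frequency by default. The column below is hypothetical.

```python
import pandas as pd

# Hypothetical categorical column.
education = pd.Series(
    ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad", "Bachelors"],
    name="education",
)

frequencies = education.value_counts()                # counts, most frequent first
proportions = education.value_counts(normalize=True)  # shares, as used for a pie chart

print(frequencies)
print(proportions)
```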
4. Explore Interactions Between Variables
Investigating the interactions and associations between variables is an important analysis for understanding the dataset and for determining whether a dataset is relevant for the machine learning task, even before building machine learning models. In this section, we show how to evaluate and visualize inter-variable associations. The subsections correspond to the IDEAR panes:
4.1. Rank variables
The strength of linear relationships between variables in the dataset is calculated with a selected reference variable. By default, the reference variable is the target variable.
The associations between categorical and numerical variables are computed using the eta-squared metric.
The associations between categorical variables are computed using the Cramér's V metric.
If certain variables have markedly stronger associations with the target variable than others, they might be target leakers that already contain information from the target variable. If this situation arises, think twice, or consult someone who has domain expertise.
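The two association metrics named above can be sketched from their textbook definitions. The helper names `eta_squared` and `cramers_v` are hypothetical, the data is made up, and this Cramér's V applies no bias or continuity correction, which may differ from what IDEAR computes.

```python
import numpy as np
import pandas as pd


def eta_squared(categories: pd.Series, values: pd.Series) -> float:
    """Share of the variance in `values` explained by group membership."""
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    grouped = values.groupby(categories)
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V from the chi-squared statistic of the contingency table."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Hypothetical data: `same_as_group` duplicates `group` under different labels.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "same_as_group": ["x", "x", "x", "y", "y", "y"],
    "value": [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
})

eta2 = eta_squared(df["group"], df["value"])
v = cramers_v(df["group"], df["same_as_group"])
print(eta2, v)
```

Both metrics range from 0 (no association) to 1 (perfect association). A value near 1 against the target, as with `same_as_group` here, is the kind of result that suggests a possible target leaker.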
4.2. Explore interactions between categorical variables
A mosaic plot shows the proportion of one categorical variable within the classes of another, using tiles whose size is proportional to the cell frequency of a two-way contingency table. The two categorical variables are selected from the drop-down menus. The tiles are colored according to standardized Pearson residuals. This helps you understand whether two categorical variables are dependent or not.
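The residuals that drive the tile colors can be computed directly from the contingency table. The sketch below uses the common definition of the standardized (adjusted) Pearson residual, (observed - expected) / sqrt(expected * (1 - row share) * (1 - column share)). Whether IDEAR uses exactly this variant is an assumption, and the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 "Bachelors" rows and 60 "HS-grad" rows.
df = pd.DataFrame({
    "education": ["Bachelors"] * 40 + ["HS-grad"] * 60,
    "label": ["yes"] * 30 + ["no"] * 10 + ["yes"] * 15 + ["no"] * 45,
})

# Two-way contingency table of observed counts.
observed = pd.crosstab(df["education"], df["label"])
n = observed.to_numpy().sum()
row_prop = observed.sum(axis=1).to_numpy() / n
col_prop = observed.sum(axis=0).to_numpy() / n

# Expected counts under independence of the two variables.
expected = np.outer(row_prop, col_prop) * n

# Standardized Pearson residuals, one per cell.
denominator = np.sqrt(expected * np.outer(1 - row_prop, 1 - col_prop))
residuals = pd.DataFrame(
    (observed.to_numpy() - expected) / denominator,
    index=observed.index,
    columns=observed.columns,
)
print(residuals.round(2))
```

Under independence these residuals are approximately standard normal, so cells with an absolute value well above 2 indicate where the two variables depart from independence.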
4.3. Explore interactions between numerical variables (on sampled data)
A scatter plot shows the association between pairs of numerical variables in the dataset.
4.4. Explore correlation matrix between numerical variables
An all-by-all pairwise correlation plot shows the association between all pairs of numerical variables in the dataset. You can choose one of three correlation methods: pearson, kendall, or spearman.
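The three method names match the `method` argument of pandas `DataFrame.corr`, so the matrix behind such a plot can be sketched as below. The synthetic columns are assumptions chosen to show how the methods differ; the kendall method relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

# Synthetic data: x_cubed is a monotonic but nonlinear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_cubed": x ** 3,
    "noise": rng.normal(size=200),
})

# One all-by-all correlation matrix per method.
correlations = {
    method: df.corr(method=method)
    for method in ("pearson", "kendall", "spearman")
}

for method, matrix in correlations.items():
    print(method)
    print(matrix.round(2))
```

Pearson measures linear association, while Kendall and Spearman are rank-based and therefore report a perfect association for any strictly monotonic relationship such as `x` and `x_cubed`.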
4.5. Explore interactions between numerical and categorical variables
The association between a numerical and a categorical variable can be evaluated using a box plot. ANOVA is conducted to test the null hypothesis that the mean values of the numerical variable are the same across the levels of the categorical variable. The p-value of the ANOVA test is shown. If the categorical variable is the target variable for a classification problem, this function indicates whether the numerical variable helps differentiate the different levels of the target variable.
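A one-way ANOVA of this kind can be sketched with `scipy.stats.f_oneway`, passing one array of numerical values per level of the categorical variable. The data and column names below are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: three levels with clearly different means.
df = pd.DataFrame({
    "label": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
    "value": [1.0, 2.0, 3.0, 2.0, 1.5,
              5.0, 6.0, 5.5, 6.5, 5.0,
              9.0, 8.5, 9.5, 10.0, 9.0],
})

# One array of values per level of the categorical variable.
groups = [group["value"].to_numpy() for _, group in df.groupby("label")]

# Null hypothesis: all levels share the same mean.
f_statistic, p_value = stats.f_oneway(*groups)
print(f"F={f_statistic:.2f}, p={p_value:.3g}")
```

A small p-value rejects the hypothesis of equal means. It does not say which levels differ, so the box plot is still needed to see the pattern.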
4.6. Explore interactions between two numerical variables and a categorical variable (on sampled data)
A scatter plot of two numerical variables is drawn, with points marked according to the selected categorical variable. This plot helps you understand whether the levels of the categorical variable can be separated by the two numerical variables. If the categorical variable is the target variable, it can help assess whether these two numerical variables can differentiate the levels of the target variable. If you see a clear clustering pattern, where one cluster is dominated by a single level of the target variable, that is a good indicator that these two numerical variables are good predictors.
5. Visualize Numerical Data by Projecting to Principal Component Spaces
When the data has many dimensions, visualizing it is challenging, but doing so can reveal clustering patterns in the data. For classification tasks, if you see separated clusters that are dominated by different classes of the target variable, the classification task is probably not too difficult. If no such separation is visible, the task is likely to be harder. This can be used to infer the quality of your feature set.
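The projection itself can be sketched with NumPy alone, computing principal components from the singular value decomposition of the centered data. The two synthetic classes are assumptions for illustration, and the features are centered but not rescaled, which real data with mixed units would usually need.

```python
import numpy as np

# Two hypothetical classes in 10 dimensions, shifted apart along every axis.
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
class_1 = rng.normal(loc=3.0, scale=1.0, size=(100, 10))
X = np.vstack([class_0, class_1])
y = np.array([0] * 100 + [1] * 100)

# Center each feature, then take the SVD; rows of Vt are the principal axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two principal components for a 2-D plot.
n_components = 2
projected = X_centered @ Vt[:n_components].T

# Fraction of the total variance captured by each component.
explained_variance_ratio = (S ** 2) / (S ** 2).sum()
print(explained_variance_ratio[:n_components])
```

Plotting `projected[:, 0]` against `projected[:, 1]` with points marked by `y` gives the kind of view described above. Here the two classes separate along the first component because they were generated with different means.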