{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"\n",
"\n",
"| - |\n",
"|---------------------------------------------------------------------------|\n",
"| [Exercise 8 (explained variance)](<#Exercise-8-(explained-variance)>) |\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML: Principal component analysis\n",
"\n",
"## Principal component analysis\n",
"\n",
"Principal component analysis is an unsupervised learning method that tries to detect the directions in which the vector formed data varies most. It first finds the direction of highest variance, and then proceeds to discover directions of highest variance that are orthogonal to those direction already found. So, for n dimensional data, it returns, by default, n orthogonal directions and the corresponding variances. These directions are called *pricipal axes*, and if we project a data point to these axes, we get the *principal components* of each axis.\n",
"\n",
"To use another terminology, the set of principal axes forms a base for the vector space where the data points reside, and the principal components are the coordinates of the data points in this new coordinate system. The `PCA` class in the scikit-learn library has a `transform` method, which transforms data to this new coordinate system.\n",
"\n",
"Let's look at an example where the data is from multi-variate Gaussian distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import math\n",
"from sklearn.decomposition import PCA\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we sample data from this distribution, and then we plot it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rng=np.random.RandomState(0)\n",
"X=rng.randn(2,400)\n",
"scale=np.array([[1,0], [0,0.4]]) # Standard deviations are 1 and 0.4\n",
"rotate=np.array([[1,-1], [1,1]]) / math.sqrt(2)\n",
"transform = np.dot(rotate, scale)\n",
"X=np.dot(transform, X)\n",
"#X=np.dot(scale, X)\n",
"#X=np.dot(rotate, X)\n",
"X=X.T\n",
"plt.axis('equal')\n",
"plt.scatter(X[:,0], X[:,1]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first apply the PCA to the data to obtain the principal axes and their variances."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"def arrow(v1, v2, ax):\n",
" arrowprops=dict(arrowstyle='->',\n",
" linewidth=2,\n",
" shrinkA=0, shrinkB=0)\n",
" ax.annotate(\"\", v2, v1, arrowprops=arrowprops)\n",
"pca=PCA(2)\n",
"pca.fit(X)\n",
"print(\"Principal axes:\", pca.components_)\n",
"print(\"Explained variance:\", pca.explained_variance_)\n",
"print(\"Mean:\", pca.mean_)"
]
},
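{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check that the principal components really are the coordinates of the data points in the basis of principal axes, we can center the data and project it onto `pca.components_` ourselves; this should agree with `pca.transform` up to floating point error. The sketch below assumes the fitted `pca` object and the data `X` from the previous cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Manual projection: center the data and project it onto the principal axes\n",
"Z_manual = np.dot(X - pca.mean_, pca.components_.T)\n",
"# The transform method should give the same coordinates\n",
"print(np.allclose(Z_manual, pca.transform(X)))"
]
},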
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we draw vectors whose directions reflect those of the principal axes, and whose lengths are the corresponding variances. Then we plot the data in this new coordinate system."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Z=pca.transform(X)\n",
"fig, axes = plt.subplots(1,2, figsize=(12,4))\n",
"axes[0].axis('equal')\n",
"axes[0].scatter(X[:,0], X[:,1])\n",
"axes[1].axis('equal')\n",
"axes[1].set_xlim(-3,3)\n",
"axes[1].scatter(Z[:,0], Z[:,1])\n",
"for l, v in zip(pca.explained_variance_, pca.components_):\n",
" arrow([0,0], v*l*3, axes[0])\n",
"for l, v in zip([1.0,0.16], [np.array([1.0,0.0]),np.array([0.0,1.0])]):\n",
" arrow([0,0], v*l*3, axes[1])\n",
"axes[0].set_title(\"Original\")\n",
"axes[1].set_title(\"Transformed\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may have noticed that we gave the PCA constructor a parameter with value 2. This parameter gives the number of principal axes we want. If the parameter value is n, then the algorithm returns only n components with the highest variances and drops those components with lower variance. So, this algorithm can be used as a dimensionality reduction technique. The components with low variance are assumed not to contain any important information.\n",
"\n",
"Let's use PCA to project the above data to one dimension."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pca=PCA(n_components=1)\n",
"pca.fit(X)\n",
"Z=pca.transform(X)\n",
"print(pca.components_)\n",
"plt.axis('equal')\n",
"plt.scatter(Z[:,0],np.zeros(400), marker=\"d\", alpha=0.1);"
]
},
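{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dropping the low-variance component loses some information. One way to see what is kept is to map the one-dimensional projection back to the original space with the `inverse_transform` method and plot the reconstruction on top of the original data. This is a small sketch that assumes the `pca` object and the projection `Z` from the previous cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Map the one-dimensional projection back to the original two-dimensional space\n",
"X_reconstructed = pca.inverse_transform(Z)\n",
"plt.axis('equal')\n",
"plt.scatter(X[:,0], X[:,1], alpha=0.2, label=\"original\")\n",
"plt.scatter(X_reconstructed[:,0], X_reconstructed[:,1], alpha=0.5, label=\"reconstructed\")\n",
"plt.legend();"
]
},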
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dimensionality reduction can be used, for example, to project high-dimensional data in two or three dimensions to allow visualization of data. Additionally, dimensionality reduction can be used as a preprocessing method to obtain only the important features from the data. These important features can then be used, for example, for regression or classification."
]
},
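{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of the preprocessing idea: the digits dataset and a logistic regression classifier are only illustrative choices, and 20 components is an arbitrary number. PCA is fitted on the training data and the classifier is trained on the resulting components."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_digits\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"X_digits, y_digits = load_digits(return_X_y=True)\n",
"X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=0)\n",
"# Reduce the 64 pixel features to 20 principal components before classification\n",
"model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))\n",
"model.fit(X_train, y_train)\n",
"print(model.score(X_test, y_test))"
]
},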
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example of feature extraction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_diabetes\n",
"X, y = load_diabetes(True)\n",
"print(X.shape)\n",
"print(y.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pca=PCA(2)\n",
"pca.fit(X)\n",
"print(pca.explained_variance_)\n",
"Z=pca.transform(X)\n",
"plt.axis('equal')\n",
"plt.scatter(Z[:,0], Z[:,1]);"
]
},
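{
"cell_type": "markdown",
"metadata": {},
"source": [
"The absolute variances printed above are hard to interpret on their own. The `explained_variance_ratio_` attribute expresses them as fractions of the total variance, which tells how much of the variation in the diabetes data the two components actually capture. This assumes the `pca` object fitted in the previous cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fraction of the total variance captured by each of the two components\n",
"print(pca.explained_variance_ratio_)\n",
"print(\"Total:\", pca.explained_variance_ratio_.sum())"
]
},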
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rng=np.random.RandomState(0)\n",
"X=rng.randn(3,400)\n",
"p=rng.rand(10,3) # Random projection into 10d\n",
"X=np.dot(p, X)\n",
"print(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pca=PCA()\n",
"pca.fit(X)\n",
"v=pca.explained_variance_\n",
"print(v)\n",
"plt.plot(np.arange(1,11), np.cumsum(v));"
]
},
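{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cumulative explained variance flattens out after a few components, which reflects the fact that the 10-dimensional data was generated from only three independent variables. Instead of reading the number of useful components off the plot, we can give `n_components` a fraction between 0 and 1, in which case scikit-learn keeps just enough components to explain that fraction of the variance. A small sketch, assuming the data `X` from the cells above; the 0.95 threshold is an arbitrary choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keep just enough components to explain 95% of the variance\n",
"pca95 = PCA(n_components=0.95, svd_solver='full')\n",
"pca95.fit(X)\n",
"print(pca95.n_components_) # Number of components that were kept"
]
},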
{
"cell_type": "markdown",
"metadata": {},
"source": [
"####