Chapter 8

PCA

Machine learning problems usually involve a large number of features, and this abundance of features slows down training. We want to address this problem with PCA.

Fortunately, in real-world problems, it is often possible to reduce the
number of features considerably, turning an intractable problem into a
tractable one.
For example, recall the digit-recognition problem from Chapter 3:

For example, consider the MNIST images (introduced in
Chapter 3):
The image borders are almost always white, so they can easily be dropped from the dataset without losing anything valuable.

the pixels on the image borders are almost always white, so
you could completely drop these pixels from the training set without
losing much information.

Figure 7-6 confirms that these pixels are utterly
unimportant for the classification task.


Additionally, two neighboring
pixels are often highly correlated: if you merge them into a single pixel
(e.g., by taking the mean of the two pixel intensities), you will not lose
much information

Besides speeding up training, dimensionality reduction is also useful for data visualization.

Apart from speeding up training, dimensionality reduction is also
extremely useful for data visualization (or DataViz).

Reducing the number
of dimensions down to two (or three) makes it possible to plot a condensed
view of a high-dimensional training set on a graph and often gain some
important insights by visually detecting patterns, such as clusters.
Moreover, DataViz is essential to communicate your conclusions to people
who are not data scientists—in particular, decision makers who will use
your results.
In other words, for telling the story to people who are not data scientists.

In this chapter we will discuss the curse of dimensionality and get a sense
of what goes on in high-dimensional space.
In this chapter we will look at what goes on in spaces with a very large number of dimensions.

Then, we will consider the two
main approaches to dimensionality reduction (projection and Manifold
Learning), and we will go through three of the most popular
dimensionality reduction techniques: PCA, Kernel PCA, and LLE.

Then we will examine the two main approaches to dimensionality reduction: projection and manifold learning.

Then we will go through three important techniques: PCA, Kernel PCA, and LLE.

The Curse of Dimensionality
We are so used to living in three dimensions that our intuition fails us
when we try to imagine a high-dimensional space.
Because we live in three dimensions, imagining higher-dimensional spaces is somewhat difficult.

Even a basic 4D hypercube is incredibly hard to picture in our minds (see Figure 8-1), let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.


Picturing four dimensions is hard, let alone 1,000.

It turns out that many things behave very differently in high-dimensional
space.
Things behave differently in higher dimensions.
For example, if you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be "extreme" along any dimension).
In other words: if you take a 1 × 1 square, the probability that a random point lies less than 0.001 from the border is about 0.4%, so it is very unlikely that the point is extreme along any dimension.


But in a 10,000-dimensional unit
hypercube, this probability is greater than 99.999999%. Most points in a
high-dimensional hypercube are very close to the border.

But when the number of dimensions reaches 10,000, this probability is greater than 99.999999%.
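A quick way to check that figure (my own arithmetic, not from the book): a point stays away from every border only if each of its 10,000 coordinates falls inside [0.001, 0.999].

    # Probability that a random point in a 10,000-dimensional unit hypercube
    # lies within 0.001 of at least one border (back-of-the-envelope check).
    p_safe_one_dim = 1 - 2 * 0.001            # one coordinate avoids both borders
    p_near_border = 1 - p_safe_one_dim ** 10_000
    print(p_near_border)                      # ≈ 0.999999998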

Here is a more troublesome difference: if you pick two points randomly in
a unit square, the distance between these two points will be, on average,
roughly 0.52.
If we pick two random points in a 1 × 1 square, their distance will be about 0.52 on average.

If you pick two random points in a unit 3D cube, the average
distance will be roughly 0.66.
In 3D, this number rises to roughly 0.66.
But what about two points picked randomly
in a 1,000,000-dimensional hypercube?
What about in a 1,000,000-dimensional hypercube? → about 408.25

The average distance, believe it or not, will be about 408.25 (roughly √(1,000,000 / 6))! This is counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube?
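These averages are easy to verify empirically. The following Monte Carlo sketch (my own, not from the book) estimates the average distance between two random points in a unit hypercube:

    import numpy as np

    rng = np.random.default_rng(42)

    def avg_pair_distance(n_dims, n_pairs):
        """Estimate the mean distance between two points drawn uniformly
        at random from a unit hypercube with n_dims dimensions."""
        a = rng.random((n_pairs, n_dims))
        b = rng.random((n_pairs, n_dims))
        return np.linalg.norm(a - b, axis=1).mean()

    print(avg_pair_distance(2, 100_000))     # ~0.52 (unit square)
    print(avg_pair_distance(3, 100_000))     # ~0.66 (unit cube)
    print(avg_pair_distance(1_000_000, 10))  # ~408  (few pairs, to keep memory modest)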


Well, there’s just plenty of space in high
dimensions.
So, as the number of dimensions grows, the amount of available space grows enormously.

As a result, high-dimensional datasets are at risk of being
very sparse:
As a result, datasets with many dimensions are at risk of being very sparse.

most training instances are likely to be far away from each
other.
Training instances are likely to be very far from one another.

This also means that a new instance will likely be far away from any
training instance, making predictions much less reliable than in lower
dimensions, since they will be based on much larger extrapolations.
This means a new instance may be far from every existing instance, which makes predictions less reliable than when the number of dimensions is small.

In
short, the more dimensions the training set has, the greater the risk of
overfitting it.
In short, the more dimensions there are, the higher the risk of overfitting.

In theory, one solution to the curse of dimensionality could be to increase
the size of the training set to reach a sufficient density of training instances.
One way to fight the curse of dimensionality is to increase the number of available training instances.

Unfortunately, in practice, the number of training instances
required to reach a given density grows exponentially with the number of
dimensions.

With just 100 features (significantly fewer than in the MNIST
problem), you would need more training instances than atoms in the
observable universe in order for training instances to be within 0.1 of each
other on average, assuming they were spread out uniformly across all
dimensions.
Main Approaches for Dimensionality Reduction

Before we dive into specific dimensionality reduction algorithms, let’s
take a look at the two main approaches to reducing dimensionality:
projection and Manifold Learning
The two main approaches: projection and Manifold Learning

Projection
Approach 1

In most real-world problems, training instances are not spread out
uniformly across all dimensions.
In many real-world problems, training instances are not spread out uniformly across all dimensions.

Many features are almost constant, while
others are highly correlated (as discussed earlier for MNIST).

As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.
The instances are concentrated within (or close to) a much lower-dimensional subspace.
This sounds very abstract, so let's look at an example.

In Figure 8-2 you can see a 3D dataset represented by circles
The figure below shows a 3D dataset:

A 3D dataset whose behavior is essentially 2D (Figure 8-2)

Notice that all training instances lie close to a plane:


this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space.

If we project every training instance perpendicularly onto this subspace (as
represented by the short lines connecting the instances to the plane), we get the new 2D dataset shown in Figure 8-3.

The 3D dataset from the previous figure, after projection:



Ta-da! We have just reduced the dataset’s dimensionality from 3D to 2D.
We reduced the number of dimensions from 3 to 2.
Note that the axes correspond to new features z1 and z2 (the coordinates of the projections on the plane).

However, projection is not always the best approach to dimensionality
reduction.
This approach is not always the best one for dimensionality reduction.
In many cases the subspace may twist and turn, such as in the
famous Swiss roll toy dataset represented in Figure 8-4.

In many datasets, the data may twist and turn, like this:

Manifold Learning
Approach 2

The Swiss roll is an example of a 2D manifold.

Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space.

More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane.

In the case of the Swiss roll, d = 2 and n = 3:

it locally resembles a 2D plane, but it is rolled in the third dimension.

Many dimensionality reduction algorithms work by modeling the
manifold on which the training instances lie;

this is called Manifold Learning.

It relies on the manifold assumption, also called the manifold
hypothesis, which holds that most real-world high-dimensional datasets lie
close to a much lower-dimensional manifold.

This assumption is very often empirically observed.

Once again, think about the MNIST dataset:

all handwritten digit images have some similarities. They are made of connected lines, the borders are white, and they are more or less centered.

If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits. In other words, the degrees of freedom available to you if you try to create a digit image are dramatically lower than the degrees of freedom you would have if you were allowed to generate any image you wanted. These constraints tend to squeeze the dataset into a lower dimensional manifold.

The manifold assumption is often accompanied by another implicit
assumption: that the task at hand (e.g., classification or regression) will be
simpler if expressed in the lower-dimensional space of the manifold.

For example, in the top row of Figure 8-6 the Swiss roll is split into two
classes: in the 3D space (on the left), the decision boundary would be
fairly complex, but in the 2D unrolled manifold space (on the right), the
decision boundary is a straight line.

However, this implicit assumption does not always hold. For example, in the bottom row of Figure 8-6, the decision boundary is located at x1 = 5. This decision boundary looks very simple in the original 3D space (a vertical plane), but it looks more complex in the unrolled manifold (a collection of four independent line segments).

In short, reducing the dimensionality of your training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.

Hopefully you now have a good sense of what the curse of dimensionality is and how dimensionality reduction algorithms can fight it, especially when the manifold assumption holds. The rest of this chapter will go through some of the most popular algorithms.



***

Principal Component Analysis (PCA) is by far the most popular
dimensionality reduction algorithm.
PCA stands for Principal Component Analysis and is, to date, the most popular dimensionality reduction algorithm.


First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it, just like in Figure 8-2.
It finds the hyperplane that lies closest to the data, then projects the data onto it.

A 3D dataset whose behavior is essentially 2D (Figure 8-2)

Preserving the Variance

Before you can project the training set onto a lower-dimensional
hyperplane, you first need to choose the right hyperplane.
Before projecting the data onto a lower-dimensional hyperplane, we first need to find the right hyperplane.
For example, the 2D dataset on the left can be projected onto any of three different axes.

For example, a simple 2D dataset is represented on the left in Figure 8-7, along with three different axes (i.e., 1D hyperplanes).



On the right is the result of the projection of the dataset onto each of these axes.
On the right, the result of projecting the dataset onto each of these three axes is shown.
As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance and the projection onto the dashed line preserves an intermediate amount of
variance
The variance is greatest along the solid line, smallest along the dotted line, and intermediate along the dashed line.

It seems reasonable to select the axis that preserves the maximum amount
of variance, as it will most likely lose less information than the other
projections.
It seems reasonable to pick the axis that preserves the most variance for the projection, since that way the least information is lost.

Another way to justify this choice is that it is the axis that
minimizes the mean squared distance between the original dataset and its
projection onto that axis.
Another way to justify this choice: it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis.
This is the rather simple idea behind PCA.

Principal Components

PCA identifies the axis that accounts for the largest amount of variance in
the training set. In Figure 8-7, it is the solid line.


PCA finds the axis that accounts for the largest amount of variance in the data; in Figure 8-7 that is the solid line.
It then finds a second axis, orthogonal to the first, that captures the largest amount of the remaining variance.

It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance.

In this 2D example there is no choice: it is the dotted
line.
In this 2D example there is no choice other than the dotted line.
If the data had more dimensions, a third axis would also be found, orthogonal to the first two (and then a fourth, a fifth, and so on).
If it were a higher-dimensional dataset, PCA would also find a third
axis, orthogonal to both previous axes, and a fourth, a fifth, and so on—as
many axes as the number of dimensions in the dataset.

The i-th axis found this way is called the i-th PC, i.e., the i-th principal component of the dataset (the first principal component, the second principal component, and so on).

The i-th axis is called the i-th principal component (PC) of the data. In Figure 8-7, the first PC is the axis on which vector c1 lies, and the second PC is the axis on which vector c2 lies.
In Figure 8-7, the first PC is the axis that vector c1 lies on, and the second PC is the axis of c2.



In Figure 8-2 the first two PCs are the orthogonal axes on which the two arrows lie, on the plane, and the third PC is the axis orthogonal to that plane.
The first two PCs are orthogonal to each other and are the axes (on the plane) that c1 and c2 lie on; the third PC is orthogonal to the plane formed by c1 and c2.

A 3D dataset whose behavior is essentially 2D (Figure 8-2)

For each PC, PCA finds a zero-centered unit vector pointing in the direction of that principal component.

For each principal component, PCA finds a zero-centered unit vector pointing in the direction of the PC.

Since two opposing unit vectors lie on the same axis, the direction of the unit vectors returned by PCA is not stable:

if you perturb the training set slightly and run PCA again, the unit vectors may point in the opposite direction as the original vectors. However, they will generally still lie on the same axes. In some cases, a pair of unit vectors may even rotate or swap (if the variances along these two axes are close), but the plane they define will generally remain the same.

So how can you find the principal components of a training set?
Question: how can we obtain the PCs of a dataset?

Fortunately, there is a standard matrix factorization technique called SVD (Singular Value Decomposition) that decomposes the dataset matrix into its principal components.

Luckily, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the matrix multiplication of three matrices U Σ V⊺, where V contains the unit vectors that define all the principal components that we are looking for, as shown in Equation 8-1: X = U Σ V⊺.

The code below uses the SVD function to find the principal components of the dataset.

The following Python code uses NumPy’s svd() function to obtain all the
principal components of the training set, then extracts the two unit vectors
that define the first two PCs:
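A minimal sketch of that approach; the synthetic X below is only a stand-in for the book's small 3D dataset:

    import numpy as np

    # Stand-in data: 60 random points lying roughly near a 2D plane in 3D space
    rng = np.random.default_rng(42)
    X = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(60, 3))

    X_centered = X - X.mean(axis=0)        # PCA requires zero-centered data
    U, s, Vt = np.linalg.svd(X_centered)   # rows of Vt are the principal components
    c1 = Vt.T[:, 0]                        # unit vector of the first PC
    c2 = Vt.T[:, 1]                        # unit vector of the second PC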


Warning:

Remember that PCA assumes the dataset is centered around the origin. Scikit-Learn's PCA classes take care of centering the data for you, but if you implement PCA yourself (as in the code above), do not forget to center the data first.

Projecting Down to d Dimensions

Once you have identified all the principal components, you can reduce the
dimensionality of the dataset down to d dimensions by projecting it onto
the hyperplane defined by the first d principal components.

Once we have identified the principal components, we can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components.

Selecting this hyperplane ensures that the projection will preserve as much variance as possible.
Selecting this hyperplane ensures that the projection preserves as much of the variance as possible.

For example, in Figure 8-2 the plane defined by the first two PCs is the one that preserves the largest possible share of the dataset's variance.
For example, in Figure 8-2 the 3D dataset is projected down to
the 2D plane defined by the first two principal components, preserving a
large part of the dataset’s variance. As a result, the 2D projection looks
very much like the original 3D dataset

To obtain the reduced dataset (with d dimensions) we perform the matrix multiplication below, where X is the original dataset matrix and W_d is the matrix containing the first d columns of V. V is the matrix whose columns are the principal components, ordered from the most important to the least important.

To project the training set onto the hyperplane and obtain a reduced dataset X_d-proj of dimensionality d, compute the matrix multiplication of the training set matrix X by the matrix W_d, defined as the matrix containing the first d columns of V, as shown in Equation 8-2: X_d-proj = X W_d.
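Continuing the NumPy sketch above, Equation 8-2 for d = 2 looks roughly like this:

    W2 = Vt.T[:, :2]        # W_d for d = 2: the first two columns of V
    X2D = X_centered @ W2   # project the (centered) training set onto the 2D plane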

The corresponding code in the book (p. 290) reduces the dataset down to a 2D dataset.
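With Scikit-Learn, the same reduction is typically done with the PCA class, which also takes care of centering the data; a sketch, assuming X is the training set:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)   # keep the first two principal components
    X2D = pca.fit_transform(X)  # center the data, then project it onto the 2D plane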

Explained Variance Ratio

Another useful piece of information is the explained variance ratio of
each principal component, available via the explained_variance_ratio_
variable.

The ratio indicates the proportion of the dataset’s variance that
lies along each principal component.

For example, let’s look at the
explained variance ratios of the first two components of the 3D dataset
represented in Figure 8-2:
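Continuing the pca object fitted with n_components=2 above, the check looks something like this (the values in the comment are the ones quoted in the text below):

    print(pca.explained_variance_ratio_)
    # e.g. array([0.842..., 0.146...]) for the 3D dataset of Figure 8-2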

This output says that 84.2% of the dataset's variance lies along the first PC and 14.6% along the second PC.
That leaves less than 1.2% for the third PC, so it is reasonable to drop it.
This attribute also helps you figure out how many PCs are reasonable for a dataset.

This output tells you that 84.2% of the dataset’s variance lies along the
first PC, and 14.6% lies along the second PC. This leaves less than 1.2%
for the third PC, so it is reasonable to assume that the third PC probably
carries little information.

Choosing the Right Number of Dimensions


Instead of arbitrarily choosing the number of dimensions to reduce down
to, it is simpler to choose the number of dimensions that add up to a
sufficiently large portion of the variance (e.g., 95%).

Unless, of course, you are reducing dimensionality for data visualization—in that case you will want to reduce the dimensionality down to 2 or 3.
The following code performs PCA without reducing dimensionality, then
computes the minimum number of dimensions required to preserve 95%
of the training set’s variance:

The code below tells us how many dimensions we can reduce down to while still preserving 95% of the variance:
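A sketch of that computation, assuming X_train holds the training set:

    import numpy as np
    from sklearn.decomposition import PCA

    pca = PCA()                 # no dimensionality reduction yet
    pca.fit(X_train)
    cumsum = np.cumsum(pca.explained_variance_ratio_)
    d = np.argmax(cumsum >= 0.95) + 1   # smallest d whose cumulative ratio reaches 95%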

We would then set n_components=d and run PCA again.

You could then set n_components=d and run PCA again.



But there is a much better option: instead of specifying the number of principal
components you want to preserve, you can set n_components to be a float
between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:

The better approach is to set n_components to a float between 0.0 and 1.0 indicating the ratio of variance we want to preserve:
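For example, to keep 95% of the variance (again assuming X_train is the training set):

    pca = PCA(n_components=0.95)           # ratio of variance to preserve
    X_reduced = pca.fit_transform(X_train)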

Yet another option is to plot the explained variance as a function of the number of dimensions. There is usually an elbow where the growth of the explained variance stops; Figure 8-8 shows that beyond roughly 150 dimensions the explained variance barely changes.

Yet another option is to plot the explained variance as a function of the number of dimensions (simply plot cumsum; see Figure 8-8). There will usually be an elbow in the curve, where the explained variance stops growing fast.
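A quick way to produce such a plot, reusing the cumsum array computed above (a sketch; matplotlib assumed to be available):

    import matplotlib.pyplot as plt

    plt.plot(range(1, len(cumsum) + 1), cumsum)
    plt.xlabel("Number of dimensions")
    plt.ylabel("Cumulative explained variance")
    plt.grid(True)
    plt.show()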

PCA for Compression


After dimensionality reduction, the training set takes up much less space.
As an example, try applying PCA to the MNIST dataset while preserving
95% of its variance.

You should find that each instance will have just over 150 features, instead of the original 784 features.

So, while most of the variance is preserved, the dataset is now less than 20% of its original size!

This is a reasonable compression ratio, and you can see how this size reduction can speed up a classification algorithm (such as an SVM classifier) tremendously.

It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. This won't give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be close to the original data. The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error.

The following code compresses the MNIST dataset down to 154
dimensions, then uses the inverse_transform() method to decompress it
back to 784 dimensions:
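A sketch of that round trip, assuming X_train holds the MNIST training images (784 features each):

    from sklearn.decomposition import PCA

    pca = PCA(n_components=154)
    X_reduced = pca.fit_transform(X_train)          # compress: 784 -> 154 dimensions
    X_recovered = pca.inverse_transform(X_reduced)  # decompress back to 784 dimensions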
