Python Data Science Cookbook

(Second pass)

Recipe 8.7

Step 2
Use one of the datasets bundled with scikit-learn.
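
As a rough sketch of this step, the bundled iris data can be loaded as below. The variable names `iris`, `X`, and `y` are my assumption, chosen to match the names used in the snippets later in these notes.

```python
from sklearn import datasets

# Load the iris data that ships with scikit-learn (no download needed).
iris = datasets.load_iris()
X = iris.data    # feature matrix, one row per flower
y = iris.target  # integer class label for each row

print(X.shape, y.shape)    # number of samples and features
print(iris.feature_names)  # the four measured attributes
print(iris.target_names)   # the three species
```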

Print the dataset overview with print(iris['DESCR']):
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:
    (Note: there are 150 samples with 4 features: sepal length, sepal width, petal length, and petal width.)
    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
- Fisher,R.A. "The use of multiple measurements in taxonomic problems"
    Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
    Mathematical Statistics" (John Wiley, NY, 1950).
- Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
    (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
    Structure and Classification Rule for Recognition in Partially Exposed
    Environments".  IEEE Transactions on Pattern Analysis and Machine
    Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
    on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
    conceptual clustering system finds 3 classes in the data.
- Many, many more ...

Step 3
Plot the first two features, sepal length and sepal width.
There are three classes:
Iris-Setosa
Iris-Versicolour
Iris-Virginica

Setosa
f:id:bitop:20160403072733p:plain
Source: https://en.wikipedia.org/wiki/Iris_setosa

Versicolour
f:id:bitop:20160403072837p:plain
Source: https://en.wikipedia.org/wiki/Iris_versicolor

Virginica
f:id:bitop:20160403072938p:plain
Source: https://en.wikipedia.org/wiki/Iris_virginica

f:id:bitop:20160403073130p:plain
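
A scatter plot of the first two features could be produced roughly like this. It is only a sketch: the `Agg` backend line is an assumption so that it also runs without a display (drop it in a notebook), and the axis labels are my addition.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Columns 0 and 1 are sepal length and sepal width (both in cm).
fig, ax = plt.subplots(figsize=(6, 3))
points = ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.rainbow)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
```

Coloring by `y` is what makes the three classes distinguishable in the figure.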
Steps 4 and 5
With the PCA-transformed data, the differences between the species become visible.
f:id:bitop:20160403073416p:plain
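
For reference, a minimal sketch of a plain PCA on the iris data, assuming the `dec` alias for `sklearn.decomposition` that the later snippets use. The name `X_bis` is hypothetical. The explained variance ratio shows how much of the spread each component carries.

```python
from sklearn import datasets
import sklearn.decomposition as dec

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Fit PCA on all four features and project the data onto the components.
pca = dec.PCA()
X_bis = pca.fit_transform(X)  # hypothetical name for the transformed data

# Fraction of the total variance carried by each component.
print(pca.explained_variance_ratio_)

# Mean position of each species along the first component.
for label, name in enumerate(iris.target_names):
    print(name, X_bis[y == label, 0].mean())
```

Plotting `X_bis[:, 0]` against `X_bis[:, 1]` with `c=y` gives the kind of figure shown above.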
Step 6
f:id:bitop:20160403073555p:plain

Let's try the other PCA variants as well. First, SparsePCA.

# Assumes dec (sklearn.decomposition), plt (matplotlib.pyplot), X and y from the earlier steps.
X_ter = dec.SparsePCA().fit_transform(X)
plt.figure(figsize=(6,3));
plt.scatter(X_ter[:,0], X_ter[:,1], c=y, s=30, cmap=plt.cm.rainbow);

f:id:bitop:20160403074443p:plain
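
To see what "sparse" means here, one option is to look at the fitted components and count how many loadings are exactly zero. This is just a sketch with default settings. `random_state=0` is my addition for repeatability, and the counts depend on the scikit-learn version and the `alpha` penalty.

```python
from sklearn import datasets
import sklearn.decomposition as dec

X = datasets.load_iris().data

# random_state is an added assumption so repeated runs give the same result.
spca = dec.SparsePCA(random_state=0)
X_ter = spca.fit_transform(X)

# Each row of components_ is one component; count its exactly-zero loadings.
zeros_per_component = (spca.components_ == 0).sum(axis=1)
print(spca.components_.round(3))
print(zeros_per_component)
```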
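Next, RandomizedPCA (this class was removed in scikit-learn 0.20; newer releases offer it as dec.PCA(svd_solver='randomized')).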

X_ter = dec.RandomizedPCA().fit_transform(X)
plt.figure(figsize=(6,3));
plt.scatter(X_ter[:,0], X_ter[:,1], c=y, s=30, cmap=plt.cm.rainbow);

f:id:bitop:20160403074639p:plain
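
On a recent scikit-learn, where the RandomizedPCA class is no longer available, a rough equivalent is to pass `svd_solver="randomized"` to `PCA`. The sketch below compares it with the default solver. `n_components=2` and `random_state=0` are my choices, not part of the original notes.

```python
import numpy as np
from sklearn import datasets
import sklearn.decomposition as dec

X = datasets.load_iris().data

# Default solver versus the randomized SVD solver, keeping two components.
X_full = dec.PCA(n_components=2).fit_transform(X)
X_rand = dec.PCA(n_components=2, svd_solver="randomized",
                 random_state=0).fit_transform(X)

# Largest difference between the two projections, ignoring sign.
print(np.abs(np.abs(X_full) - np.abs(X_rand)).max())
```

On a dataset this small the two projections should agree closely. The randomized solver is mainly of interest for much larger matrices.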