Python for Data Analysis: Chapter 2


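Structure of the book
• Chapter 2
  • Introduces analysis examples that use real data
  • Usage examples for Python, pandas, and NumPy
  • Carries about 70% of the book's importance
• Chapters 3 to 9
  • Summarize the elements used in Chapter 2
  • Easier to digest once Chapter 2 is understood
  • Serve as a reference
• Chapter 10
  • Examples of handling time series data
  • Working with timestamp data
  • Visualization
• Chapter 11
  • Examples of handling financial and economic data
• Chapter 12
  • Advanced NumPy operations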

Master Chapter 2 (understand it and be able to reproduce it) and you have nearly everything you need.

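Three data sets are covered here
• usa.gov data from bit.ly
  • https://github.com/usagov/1.USA.gov-Data
  • URL access logs from the United States
  • Converting JSON to a pandas DataFrame
  • Basic DataFrame operations
  • Bar charts and stacked bar charts
• MovieLens 1M data set
  • http://www.grouplens.org/node/73
  • Movie rating data
  • Merging multiple data sets and DataFrames
  • Computing means and variances with DataFrames
• US baby names
  • http://www.ssa.gov/oact/babynames/limits.html
  • Names given to newborns in the United States
  • Advanced DataFrame operations
  • Goal-driven data analysis
    • Naming trends
    • Dramatic change in the last letter of names
    • Names becoming unisex (boy names → girl names)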

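What to prepare
1. Python
  • Version 2 is used here
  • Use pyenv, virtualenv, Vagrant, etc. as needed
2. Libraries
  • Install pandas, matplotlib, NumPy, etc. yourself
3. Jupyter Notebook (IPython)
  • Jupyter is used for its data visualization and ease of use
  • See Chapter 3 for keyboard shortcuts (some may not work)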

1. Simple visualization of Internet access data
https://github.com/usagov/1.USA.gov-Data
• Go to the site and download the data
• Extract it and save it somewhere convenient
• Start Jupyter Notebook
• Create a new notebook
• Check the raw data ↓
{ "a": "Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; HTC_PN071 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "gr": "CA", "g": "15r91", "h": "10OBm3W", "l": "pontifier", "al": "en-US", "hh": "j.mp", "r": "direct", "u": "http://www.nsa.gov/", "t": 1368832205, "hc": 1365701422, "cy": "Anaheim", "ll": [ 33.816101, -117.979401 ] }
1-1. Imports, loading the data into a DataFrame, handling missing values
import json
import pandas as pd
from pandas import DataFrame, Series
path = './usagov_bitly_data2013-05-17-1368832207'
# One JSON object per line
records = [json.loads(line) for line in open(path)]
frame = DataFrame(records)
# Records with no tz become 'Missing'; empty strings become 'Unknown'
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
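The load-and-clean step can be tried without downloading the file. The following is a minimal sketch, written for Python 3 and a recent pandas rather than the Python 2 setup used in these slides; the three sample records are made up for illustration and only mimic the "a" and "tz" fields of the real data.

```python
import json
import pandas as pd

# Hypothetical sample lines standing in for the downloaded file
# (one JSON object per line, as in the real data).
lines = [
    '{"a": "Mozilla/5.0 (Windows NT 6.1)", "tz": "America/New_York"}',
    '{"a": "Mozilla/5.0 (Linux; U; Android 4.1.2)", "tz": ""}',
    '{"a": "GoogleMaps/RochesterNY"}',
]

records = [json.loads(line) for line in lines]
frame = pd.DataFrame(records)

# A record without a "tz" key becomes NaN; an empty string is a separate kind of gap.
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'

print(clean_tz.tolist())
```

Keeping 'Missing' and 'Unknown' apart preserves the difference between a record with no time-zone field and one whose field is empty.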

1-2. value_counts(), drawing a bar chart
tz_counts = clean_tz.value_counts()
import matplotlib.pyplot as plt
%matplotlib inline
tz_counts[:10].plot(kind='barh', rot=0)
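value_counts() returns how often each distinct value occurs, most frequent first. As a rough illustration of what it computes, the same tally can be sketched with the standard library alone; the list of time zones below is hypothetical.

```python
from collections import Counter

# Hypothetical cleaned time-zone values.
time_zones = ['America/New_York', 'Unknown', 'America/New_York',
              'America/Chicago', 'Unknown', 'America/New_York']

counts = Counter(time_zones)
# most_common orders entries by descending count, much like value_counts().
top2 = counts.most_common(2)
print(top2)
```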
1-3. Splitting strings
results = Series([x.split()[0] for x in frame.a.dropna()])
1-4. Dropping NaN and searching DataFrame strings for a substring
import numpy as np
cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
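Steps 1-3 and 1-4 can be checked on a handful of strings. This is a small sketch for Python 3 and a recent pandas; the user-agent strings are hypothetical, and None stands for a record that has no "a" field.

```python
import numpy as np
import pandas as pd

# Hypothetical user-agent strings; None marks a record without an "a" field.
agents = pd.Series([
    'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    None,
    'Mozilla/5.0 (Linux; U; Android 4.1.2)',
    'GoogleMaps/RochesterNY',
])

# Drop the missing entry first so the string methods only see real strings.
present = agents[agents.notnull()]

# First whitespace-separated token of each user agent.
browsers = [ua.split()[0] for ua in present]

# Label each remaining record by whether its user agent mentions Windows.
operating_system = np.where(present.str.contains('Windows'),
                            'Windows', 'Not Windows')

print(browsers)
print(list(operating_system))
```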
1-5. Grouping, pivoting with unstack, and getting the sort order
by_tz_os = cframe.groupby(['tz', operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
indexer = agg_counts.sum(1).argsort()
The row totals give the access volume for each time zone
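The group, unstack, and argsort chain is easier to follow on a tiny table. A minimal sketch for a recent pandas, using made-up click records and shortened time-zone labels:

```python
import pandas as pd

# Hypothetical records: one row per click, with a time zone and an OS label.
clicks = pd.DataFrame({
    'tz': ['NY', 'NY', 'NY', 'LA', 'LA', 'Tokyo'],
    'os': ['Windows', 'Windows', 'Not Windows',
           'Windows', 'Not Windows', 'Not Windows'],
})

# Count rows per (tz, os) pair, then move the os level into columns.
agg_counts = clicks.groupby(['tz', 'os']).size().unstack().fillna(0)

# argsort returns the row positions that would sort the totals in ascending order.
indexer = agg_counts.sum(axis=1).argsort()

# take reorders rows by position; the last rows are the busiest time zones.
count_subset = agg_counts.take(indexer)[-2:]
print(count_subset)
```

Combinations that never occur (Windows in Tokyo here) come out of unstack as NaN, which is why fillna(0) follows it.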

1-6. Extracting the target rows from the DataFrame, drawing a stacked bar chart
count_subset = agg_counts.take(indexer)[-10:]
count_subset.plot(kind='barh', stacked=True)
This visualizes the amount of Windows usage
1-7. Showing the chart as percentages
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)
This shows the share of Windows usage for each time zone
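The normalization in 1-7 divides each row by its own total, so every row sums to 1. A short sketch with hypothetical counts:

```python
import pandas as pd

# Hypothetical counts per time zone and OS label.
count_subset = pd.DataFrame(
    {'Not Windows': [1.0, 1.0], 'Windows': [1.0, 3.0]},
    index=['LA', 'NY'],
)

# axis=0 aligns the row totals with the row index, so each row is
# divided by its own sum.
normed_subset = count_subset.div(count_subset.sum(axis=1), axis=0)
print(normed_subset)
```

Without axis=0, pandas would try to align the totals with the column labels instead of the rows.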

2. Organizing movie ratings by gender
http://www.grouplens.org/node/73
• Go to the site and download the data (users.dat, ratings.dat, movies.dat)
• Start Jupyter Notebook
• Create a new notebook
• Check the raw data ↓
users.dat
1::F::1::10::48067
ratings.dat
1::1193::5::978300760
movies.dat
1::Toy Story (1995)::Animation|Children's|Comedy
2-1. Importing the data, splitting fields on '::'
# A multi-character separator needs the Python parsing engine
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')
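Parsing the '::' separator can be tried on an in-memory string before touching the real files. A minimal sketch for Python 3 and a recent pandas; the first line is the users.dat sample shown on the slide, and the second line is made up for illustration.

```python
import io
import pandas as pd

# Two lines in the users.dat layout; the second row is hypothetical.
raw = "1::F::1::10::48067\n2::M::56::16::70072\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A separator longer than one character is handled by the Python engine.
users = pd.read_csv(io.StringIO(raw), sep='::', header=None,
                    names=unames, engine='python')
print(users)
```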

2-2. Merging DataFrames, computing mean ratings
data = pd.merge(pd.merge(ratings, users), movies)
print data.ix[1]
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
mean_ratings[:5]
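The merge and pivot can be followed on three tiny tables. This sketch assumes Python 3 and a recent pandas; all rows are invented, and only the column names follow the slides.

```python
import pandas as pd

# Hypothetical miniature versions of the three MovieLens tables.
ratings = pd.DataFrame({'user_id': [1, 1, 2, 2],
                        'movie_id': [10, 20, 10, 20],
                        'rating': [5, 3, 4, 2]})
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies = pd.DataFrame({'movie_id': [10, 20],
                       'title': ['Movie A', 'Movie B']})

# merge joins on the columns the two frames share:
# user_id for the inner call, movie_id for the outer one.
data = pd.merge(pd.merge(ratings, users), movies)

# One row per title, one column per gender, mean rating in each cell.
mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')
print(mean_ratings)
```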
2-3. Visualizing the ratings
# A separate name keeps the ratings DataFrame from 2-1 intact
first_ten = mean_ratings[:10]
first_ten.plot(kind='bar')
2-4. Keeping only titles with at least 250 ratings
ratings_by_title = data.groupby('title').size()
active_titles = ratings_by_title.index[ratings_by_title >= 250]
mean_ratings = mean_ratings.ix[active_titles]
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)

2-5. Gender gap in ratings
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_index(by='diff')
sorted_by_diff[::-1][:15]
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
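Steps 2-4 and 2-5 combine a filter on the number of ratings with a sort on the male-minus-female gap. The sketch below assumes a recent pandas, where .loc and sort_values take the place of the .ix and sort_index(by=...) calls used in these slides; the titles, means, and counts are hypothetical.

```python
import pandas as pd

# Hypothetical mean ratings per title and gender.
mean_ratings = pd.DataFrame(
    {'F': [3.5, 4.2, 2.7], 'M': [4.2, 4.0, 3.3]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)
# Hypothetical number of ratings per title.
rating_counts = pd.Series([300, 120, 410], index=mean_ratings.index)

# Keep only titles with at least 250 ratings.
active_titles = rating_counts.index[rating_counts >= 250]
active = mean_ratings.loc[active_titles].copy()

# Positive diff means men rated the title higher than women did.
active['diff'] = active['M'] - active['F']
sorted_by_diff = active.sort_values(by='diff', ascending=False)
print(sorted_by_diff)
```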
2-6. Spread (standard deviation) of ratings
rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.ix[active_titles]
rating_std_by_title.order(ascending=False)[:10]
title
Dumb & Dumber (1994)               1.321333
Blair Witch Project, The (1999)    1.316368
Natural Born Killers (1994)        1.307198

3. Baby name data
http://www.ssa.gov/oact/babynames/limits.html
• Go to the site and download the data
3-1. Checking the raw data
!head -n 3 names/yob1880.txt
Mary,F,7065
Anna,F,2604
Emma,F,2003
3-2. Importing the data
names1880 = pd.read_csv('names/yob1880.txt', names=['name', 'sex', 'births'])
names1880.groupby('sex').births.sum()
3-3. Concatenating DataFrames
years = range(1880, 2015)
pieces = []
columns = ['names', 'sex', 'births']
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    # The files carry no year column, so add one before stacking
    frame['year'] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
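The read-tag-concatenate pattern of 3-3 can be exercised on in-memory text. A minimal sketch for Python 3 and a recent pandas; apart from the "Mary,F,7065" line shown on the slide, the rows and counts are made up.

```python
import io
import pandas as pd

# Hypothetical two-line excerpts standing in for names/yob1880.txt
# and names/yob1881.txt.
files = {
    1880: "Mary,F,7065\nJohn,M,9000\n",
    1881: "Mary,F,6000\nJohn,M,8000\n",
}

columns = ['names', 'sex', 'births']
pieces = []
for year, text in files.items():
    frame = pd.read_csv(io.StringIO(text), names=columns)
    frame['year'] = year          # tag every row with its source year
    pieces.append(frame)

# ignore_index=True discards the per-file row numbers and renumbers from 0.
names = pd.concat(pieces, ignore_index=True)
births_by_year = names.groupby('year')['births'].sum()
print(births_by_year)
```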

3-4. Visualizing the concatenated data
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')
total_births.plot(title='Total births by sex and year')
                names sex  births  year      prop
year sex
1880 F   0       Mary   F    7065  1880  0.077643
         1       Anna   F    2604  1880  0.028618
         2       Emma   F    2003  1880  0.022013
         3  Elizabeth   F    1939  1880  0.021309
3-5-1. Building top1000 (function version)
def add_prop(group):
    # Integer division floors in Python 2, so cast to float first
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
names = names.groupby(['year', 'sex']).apply(add_prop)
3-5-2. Building top1000 (for-loop version)
pieces = []
for year, group in names.groupby(['year', 'sex']):
    pieces.append(group.sort_index(by='births', ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)
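The two steps of 3-5 (each name's share within its year and sex, then the most common names per group) can be sketched on a few rows. Note that this is a different route from the slides: it assumes a recent pandas and uses groupby(...).transform and groupby(...).head instead of apply and an explicit loop. The data are hypothetical, and n is 2 here rather than 1000.

```python
import pandas as pd

# Hypothetical births for a single year.
names = pd.DataFrame({
    'names':  ['Mary', 'Anna', 'Emma', 'John', 'William'],
    'sex':    ['F', 'F', 'F', 'M', 'M'],
    'births': [60, 30, 10, 75, 25],
    'year':   [1880] * 5,
})

# transform('sum') repeats each group's total on every row of that group,
# so the division gives each name's share within its (year, sex) group.
group_totals = names.groupby(['year', 'sex'])['births'].transform('sum')
names['prop'] = names['births'] / group_totals

# Sort by births, keep the first 2 rows of every group, then tidy the order.
top2 = (names.sort_values('births', ascending=False)
             .groupby(['year', 'sex'])
             .head(2)
             .sort_values(['year', 'sex', 'births'],
                          ascending=[True, True, False])
             .reset_index(drop=True))
print(top2)
```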

3-6. Splitting by sex
male = top1000[top1000.sex == 'M']
female = top1000[top1000.sex == 'F']
3-7. Trends over time for individual names
total_births = top1000.pivot_table('births', index='year', columns='names', aggfunc=sum)
total_births[:10]
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")

3-8. Change in the diversity of names
table = top1000.pivot_table('prop', index='year', columns='sex', aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex', yticks=np.linspace(0, 1.2, 13),
           xticks=range(1880, 2020, 10))
3-9. Cumulative sum
df = male[male.year == 2010]
df[:3]
          names sex  births  year      prop
260877    Jacob   M   22082  2010  0.011538
260878    Ethan   M   17985  2010  0.009397
260879  Michael   M   17308  2010  0.009044
prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:5]

3-10. Counting the most popular names that make up 50% of births (2010, then 1900)
prop_cumsum.searchsorted(0.5)
df = male[male.year == 1900]
in1900 = df.sort_index(by='prop', ascending=False).prop.cumsum()
in1900.searchsorted(0.5) + 1
3-11. Doing the same for every year, by sex
def get_quantile_count(group, q=0.5):
    group = group.sort_index(by='prop', ascending=False)
    # searchsorted is zero-based, so add 1 to get a count
    return group.prop.cumsum().searchsorted(q) + 1
diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
diversity.head()
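The cumulative-sum and searchsorted idea behind 3-9 to 3-11 fits in a few lines. A small sketch with hypothetical proportions, already sorted in descending order:

```python
import pandas as pd

# Hypothetical shares of births for the most popular names of one year and sex,
# sorted from most to least common.
prop = pd.Series([0.30, 0.15, 0.10, 0.05, 0.05])

# Running total: roughly 0.30, 0.45, 0.55, 0.60, 0.65.
prop_cumsum = prop.cumsum()

# searchsorted returns the zero-based position at which 0.5 would be inserted
# to keep the running total sorted, so adding 1 turns it into a count of names.
names_needed = int(prop_cumsum.searchsorted(0.5)) + 1
print(names_needed)
```

A larger count means births are spread over more names, which is how the slides measure naming diversity.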

3-12. Visualizing name diversity by sex
diversity = diversity.astype(float)
diversity.plot(title="Number of popular names in top 50%")

3-13. Building a table of births by the last letter of the name
get_last_letter = lambda x: x[-1]
last_letters = names.names.map(get_last_letter)
last_letters.name = 'last letter'
table = names.pivot_table('births', index=last_letters, columns=['sex', 'year'], aggfunc=sum)
3-14. Looking only at 1910, 1960, and 2010
subtitle = table.reindex(columns=[1910, 1960, 2010], level='year')
subtitle.head()
sex               F                      M
year           1910    1960    2010   1910    1960    2010
last letter
a            108397  691245  675901    977    5214   28814
b               NaN     694     454    411    3912   39208
c                 5      49     953    482   15466   23307
d              6751    3728    2635  22113  262143   44758
e            133601  435048  316288  28665  178810  130073

3-15. Computing last-letter proportions and plotting them by sex
letter_prop = subtitle / subtitle.sum().astype(float)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
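The last-letter table and its column-wise normalization can be reproduced on a few rows. A minimal sketch for Python 3 and a recent pandas; the names and counts are hypothetical, and the index label last_letter is an illustrative choice.

```python
import pandas as pd

# Hypothetical births for one year.
names = pd.DataFrame({
    'names':  ['Mary', 'Anna', 'Emma', 'John', 'Ethan', 'Andy'],
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'births': [50, 30, 20, 40, 40, 20],
    'year':   [2010] * 6,
})

# Last character of every name, used as the row key of the pivot table.
last_letters = names['names'].map(lambda x: x[-1])
last_letters.name = 'last_letter'

table = names.pivot_table('births', index=last_letters,
                          columns=['sex', 'year'], aggfunc='sum')

# Dividing by the column totals gives, for each sex and year,
# the share of births whose name ends in each letter.
letter_prop = table / table.sum()
print(letter_prop)
```

In Python 3 the division is already floating-point, so the astype(float) cast used in the slides for Python 2 is not needed here.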

3-16. Focusing on d, n, and y and plotting the change over time
letter_prop = table / table.sum().astype(float)
dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T
dny_ts.head()
last letter         d         n         y
year
1880         0.083057  0.153216  0.075762
1881         0.083240  0.153209  0.077453
1882         0.085339  0.149558  0.077537
1883         0.084059  0.151650  0.079146
1884         0.086120  0.149924  0.08040
dny_ts.plot()

3-17. Finding names that match lesl*
all_names = top1000.names.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])
lesly_like = all_names[mask]
lesly_like
array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)
filtered = top1000[top1000.names.isin(lesly_like)]
filtered.groupby('names').births.sum()
table = filtered.pivot_table('births', index='year', columns='sex', aggfunc='sum')
# Normalize each year so the F and M shares sum to 1
table = table.div(table.sum(1), axis=0)
table.tail(n=5)
sex   F    M
year
2010  1  NaN
2011  1  NaN
2012  1  NaN
2013  1  NaN
2014  1  NaN
table.plot(style={'M': 'k-', 'F': 'k--'})
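The lesl* analysis boils down to a substring filter followed by a per-year normalization. The sketch below assumes a recent pandas and uses invented rows; it selects the matching names with a list comprehension rather than the boolean-mask indexing shown in the slides.

```python
import pandas as pd

# Hypothetical excerpt of a top-names table for two years.
top = pd.DataFrame({
    'names':  ['Leslie', 'Leslie', 'Mary', 'Leslie', 'Lesley', 'John'],
    'sex':    ['M', 'F', 'F', 'F', 'F', 'M'],
    'births': [90, 10, 500, 80, 20, 400],
    'year':   [1900, 1900, 1900, 2000, 2000, 2000],
})

# Names containing "lesl", ignoring case.
all_names = top['names'].unique()
lesley_like = [x for x in all_names if 'lesl' in x.lower()]

filtered = top[top['names'].isin(lesley_like)]
table = filtered.pivot_table('births', index='year',
                             columns='sex', aggfunc='sum')

# Divide each year by its total so the F and M shares sum to 1.
table = table.div(table.sum(axis=1), axis=0)
print(table)
```

A year in which one sex has no matching births shows NaN for that sex and 1 for the other, which is the pattern in the 2010 to 2014 rows of the slide's output.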
