total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')
total_births.plot(title='Total births by sex and year')
3-4. Visualizing the concatenated data
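The pivot step above can be tried on a handful of rows before running it on the full data. The sketch below is only an illustration: the `toy` frame and its values are hypothetical, shaped like the `names` frame with the same column names.

```python
import pandas as pd

# Hypothetical toy rows shaped like the `names` frame (not real data).
toy = pd.DataFrame({
    'names': ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'sex': ['F', 'F', 'M', 'F', 'M'],
    'births': [10, 5, 8, 12, 9],
    'year': [1880, 1880, 1880, 1881, 1881],
})

# One row per year, one column per sex, each cell holding the summed births.
toy_total = toy.pivot_table('births', index='year', columns='sex', aggfunc='sum')

# The same aggregation spelled as groupby + unstack.
toy_total_alt = toy.groupby(['year', 'sex'])['births'].sum().unstack('sex')
```

Both spellings should give the same year-by-sex table; `pivot_table` is simply the more compact form.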
                   names sex  births  year      prop
year sex
1880 F   0          Mary   F    7065  1880  0.077643
         1          Anna   F    2604  1880  0.028618
         2          Emma   F    2003  1880  0.022013
         3     Elizabeth   F    1939  1880  0.021309
3-5-1. Creating top1000 (function version)
def add_prop(group):
    # Cast to float first: in Python 2, integer division floors
    births = group.births.astype(float)
    # Share of each name within its (year, sex) group
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)
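As a cross-check on `add_prop`, the same proportion can be computed without a helper function. This is a minimal sketch on hypothetical toy data, swapping in `groupby(...).transform('sum')` as an alternative technique; it is not the approach used in the slides.

```python
import pandas as pd

# Hypothetical toy rows shaped like the `names` frame (not real data).
toy = pd.DataFrame({
    'names': ['Mary', 'Anna', 'John', 'William'],
    'sex': ['F', 'F', 'M', 'M'],
    'births': [10, 5, 8, 2],
    'year': [1880, 1880, 1880, 1880],
})

# transform('sum') broadcasts each group's total back onto its own rows.
group_total = toy.groupby(['year', 'sex'])['births'].transform('sum')
toy['prop'] = toy['births'] / group_total

# Sanity check: the proportions within every (year, sex) group sum to 1.
prop_sums = toy.groupby(['year', 'sex'])['prop'].sum()
```

Checking that `prop_sums` is 1 for every group is a cheap way to confirm the grouping keys are the intended ones.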
pieces = []
for (year, sex), group in names.groupby(['year', 'sex']):
    # sort_values replaces the removed sort_index(by=...)
    pieces.append(group.sort_values(by='births', ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)
3-5-2. Creating top1000 (for-loop version)
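The per-group "keep the largest N" step can be checked on a small frame. The sketch below assumes hypothetical toy data and uses `N = 2` in place of 1000; the loop-free variant with `groupby(...).head(N)` is an alternative spelling, not something taken from the slides.

```python
import pandas as pd

# Hypothetical toy rows shaped like the `names` frame (not real data).
toy = pd.DataFrame({
    'names': ['Mary', 'Anna', 'Emma', 'John', 'William'],
    'sex': ['F', 'F', 'F', 'M', 'M'],
    'births': [10, 5, 3, 8, 6],
    'year': [1880, 1880, 1880, 1880, 1880],
})

N = 2  # stands in for 1000

# Loop version: sort each (year, sex) group and keep its first N rows.
pieces = []
for (year, sex), group in toy.groupby(['year', 'sex']):
    pieces.append(group.sort_values(by='births', ascending=False)[:N])
top_n = pd.concat(pieces, ignore_index=True)

# Loop-free version: sort once, then take the first N rows of each group.
top_n_alt = (toy.sort_values(by='births', ascending=False)
                .groupby(['year', 'sex'])
                .head(N))
```

Both versions keep the same rows here; only the row order and index of the result differ.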
male = top1000[top1000.sex == 'M']
female = top1000[top1000.sex == 'F']
3-6. Splitting by sex
total_births = top1000.pivot_table('births', index='year', columns='names', aggfunc=sum)
total_births[:10]
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")
3-7. Trends over time for individual names
table = top1000.pivot_table('prop', index='year', columns='sex', aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex',
           yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
df = male[male.year == 2010]
df[:3]
          names sex  births  year      prop
260877    Jacob   M   22082  2010  0.011538
260878    Ethan   M   17985  2010  0.009397
260879  Michael   M   17308  2010  0.009044
3-8. Changes in the diversity of names
3-9. Cumulative sum
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:5]
def get_quantile_count(group, q=0.5):
    # Sort names from most to least common within the group
    group = group.sort_values(by='prop', ascending=False)
    # Number of names needed for the cumulative share to reach q
    return group.prop.cumsum().searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
diversity.head()
3-10. Picking out only the names that cover the top 50% of births in 1900
prop_cumsum.searchsorted(0.5)
df = male[male.year == 1900]
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
# searchsorted returns a zero-based position, so add 1 to get a count
in1900.searchsorted(0.5) + 1
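The cumulative-sum and `searchsorted` steps above can be traced by hand on a tiny array. The proportions below are hypothetical and already sorted in descending order; the sketch only shows why 1 is added to the returned position.

```python
import numpy as np

# Hypothetical name proportions, sorted from most to least common.
props = np.array([0.4, 0.3, 0.2, 0.1])

# Running total of the share covered by the most common names so far.
cum = props.cumsum()

# First zero-based position where the running total reaches 0.5.
idx = int(cum.searchsorted(0.5))

# Positions start at 0, so the number of names needed is idx + 1.
count = idx + 1
```

Here the first name alone covers 0.4, which is short of 0.5, and the first two cover about 0.7, so `idx` is 1 and two names are needed to reach half of all births.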
3-11. Picking out, by sex, the names that are common across all years
sex               F                       M
year           1910    1960    2010    1910    1960    2010
last letter
a            108397  691245  675901     977    5214   28814
b               NaN     694     454     411    3912   39208
c                 5      49     953     482   15466   23307
d              6751    3728    2635   22113  262143   44758
e            133601  435048  316288   28665  178810  130073
letter_prop = table / table.sum().astype(float)
# .loc replaces the removed .ix indexer
dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
dny_ts.head()
dny_ts.plot()
last letter         d         n         y
year
1880         0.083057  0.153216  0.075762
1881         0.083240  0.153209  0.077453
1882         0.085339  0.149558  0.077537
1883         0.084059  0.151650  0.079146
1884         0.086120  0.149924  0.08040
3-16. Focusing on d, n, and y and visualizing the change over time
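The normalization and selection behind `dny_ts` can be sketched on a small table. The counts below are hypothetical; the frame only mimics the layout assumed for `table` (last letters as rows, a two-level sex/year column index).

```python
import pandas as pd

# Hypothetical counts laid out like `table`: rows are last letters,
# columns are a (sex, year) MultiIndex.
cols = pd.MultiIndex.from_product([['F', 'M'], [1880, 1881]],
                                  names=['sex', 'year'])
toy_table = pd.DataFrame(
    [[1, 2, 2, 3],
     [3, 2, 6, 3],
     [6, 6, 2, 4]],
    index=pd.Index(['d', 'n', 'y'], name='last letter'),
    columns=cols,
)

# Dividing by the column totals turns each (sex, year) column into shares.
toy_prop = toy_table / toy_table.sum()

# Select the male columns for the three letters, then transpose so that
# each row is a year and each column is a letter (one time series each).
toy_ts = toy_prop.loc[['d', 'n', 'y'], 'M'].T
```

After the transpose, calling `.plot()` on a frame shaped like `toy_ts` draws one line per letter with the year on the x-axis.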
all_names = top1000.names.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])
lesly_like = all_names[mask]
lesly_like
filtered = top1000[top1000.names.isin(lesly_like)]
filtered.groupby('names').births.sum()
table = filtered.pivot_table('births', index='year', columns='sex', aggfunc='sum')
# Normalize each row so the F and M shares within a year sum to 1
table = table.div(table.sum(axis=1), axis=0)
table.tail(n=5)
table.plot(style={'M': 'k-', 'F': 'k--'})
array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)
sex   F    M
year
2010  1  NaN
2011  1  NaN
2012  1  NaN
2013  1  NaN
2014  1  NaN
3-15. Finding names that match lesl*
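The row normalization with `div(..., axis=0)` explains why recent years show `1` for F and `NaN` for M. The sketch below uses hypothetical counts: when a year has no male rows, the row sum skips the missing value, so the female share becomes 1 and the male share stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical births per year and sex; 2010 has no male entries.
toy = pd.DataFrame(
    {'F': [10, 30, 5], 'M': [30, 10, np.nan]},
    index=pd.Index([1900, 2000, 2010], name='year'),
)

# sum(axis=1) gives one total per year (NaN is skipped);
# div(..., axis=0) divides each row by its own total.
share = toy.div(toy.sum(axis=1), axis=0)
```

In this toy frame the 1900 row becomes 0.25 / 0.75, while the 2010 row becomes 1.0 / NaN, the same pattern as in the output above.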