「ビックデータ分析・活用のためのSQLレシピ」を読む

4-2-2 ABC分析で売れ筋を判別する

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_detail_log", conn)
print(df.head(10))

f:id:bitop:20180729091655p:plain

df1 = df.groupby('category')['price'].sum()

df2 = pd.DataFrame({'category':df1.index,
                'amount':df1.values})

sum_amount = df2['amount'].sum()
df2['cumsum'] = df2['amount'].cumsum()
df2['composition'] = df2['amount'].map(lambda x:100.0 * x/sum_amount)
df2['cumulative_ratio'] = df2['cumsum'].map(lambda x:100.0 * x/sum_amount)
df2['abc_rank'] = df2['cumulative_ratio'].map(lambda x: 'A' if 0 < x < 70 else 'B' if 70 < x < 90 else 'C')
print(df2)

f:id:bitop:20180729091758p:plain

「ビックデータ分析・活用のためのSQLレシピ」を読む

4-4-6 売上を把握するための大事なポイント

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_log_long", conn)
print(df)

f:id:bitop:20180726072940p:plain

#難しい
df['dt'] = pd.to_datetime(df['dt'])
df = df.set_index('dt')
df['month'] = df.index.month

df_m = df.groupby(df.index)
df1 = pd.DataFrame({'month':df_m['month'].first(),
                'order_id':df_m['order_id'].count(),
                'avg_amount':df_m['purchase_amount'].mean(),
                'monthly':df['purchase_amount'].cumsum()})
print(df1.head(12))

f:id:bitop:20180726073043p:plain

4-2 多面的な軸を使ってデータを集約する

4-2-1 カテゴリ別の売上と小計を計算する

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_detail_log", conn)
print(df)

f:id:bitop:20180726073444p:plain

#うまくいかない
dfm = df.set_index(['category','sub_category']).sort_index()
print(dfm['price'])

f:id:bitop:20180726073510p:plain

「ビックデータ分析・活用のためのSQLレシピ」を読む

4-4-2 移動平均を用いて日別の推移を見る

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_log", conn)
print(df.head(10))

f:id:bitop:20180722084659p:plain

gb = df.groupby('dt')
s = pd.Series(gb['purchase_amount'].sum())
df1 = pd.DataFrame({'total_amount':s,
                'saven_day_avg_amount':s.rolling(window=7,min_periods=1).mean(),
                'saven_day_avg_amount_strict':s.rolling(window=7,min_periods=7).mean()
               })
print(df1)

f:id:bitop:20180722084814p:plain

4-4-3 当月売上の累計を求める

gb = df.groupby('dt')
df1 = pd.DataFrame({'total_amount':pd.Series(gb['purchase_amount'].sum())})
df1.index = pd.to_datetime(df1.index)
df1['agg_amount'] = df1['total_amount'].cumsum()
print(df1)

f:id:bitop:20180722084939p:plain

「ビックデータ分析・活用のためのSQLレシピ」を読む

4章 売上を把握するためのデータ抽出

4-4 時系列に沿ってデータを集約する

4-4-1 日別の売上を集計する

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_log", conn)
print(df.head(3))

f:id:bitop:20180721075517p:plain

gb = df.groupby('dt')
df1 = pd.DataFrame({'purchase_count':gb['purchase_amount'].count(),
                'total_amount':gb['purchase_amount'].sum(),
                'avg_amount':gb['purchase_amount'].mean()})
print(df1[['purchase_count','total_amount','avg_amount']])

f:id:bitop:20180721075738p:plain

%matplotlib inline
df1['total_amount'].plot(kind='bar',color='r')
df1['avg_amount'].plot(kind='line')

f:id:bitop:20180721075824p:plain

「ビックデータ分析・活用のためのSQLレシピ」を読む

3-4-3 条件のフラグを0と1で表現する

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df1 = pd.read_sql("SELECT * FROM mst_users_with_card_number", conn)
df2 = pd.read_sql("SELECT * FROM purchase_log", conn)
print(df1.head(3))
print('-'*50)
print(df2)

f:id:bitop:20180716101507p:plain

df4 = df2.groupby('user_id')
df4 = pd.DataFrame(df4['user_id'].count())
df4 = df4.rename(columns={'user_id':'purchased_count'})
df5 = df1.merge(df4,on='user_id',how='outer')

df4 = df2.groupby('user_id')
df4 = np.sign(pd.DataFrame(df4['user_id'].count()))
df4 = df4.rename(columns={'user_id':'has_purchased'})
df3 = df5.merge(df4,on='user_id',how='outer')

df3['has_card'] = df1['card_number'].map(lambda x : 0 if pd.isnull(x) else 1) 

print(df3[['user_id','card_number','purchased_count','has_card','has_purchased']])

f:id:bitop:20180716101618p:plain

「ビックデータ分析・活用のためのSQLレシピ」を読む

3-4 複数のテーブルに対する操作

3-4-1 複数のテーブルを縦に並べる

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df1 = pd.read_sql("SELECT * FROM app1_mst_users", conn)
df1['app_name'] = 'app1' 
print(df1)
df2 = pd.read_sql("SELECT * FROM app2_mst_users", conn)
df2['app_name'] = 'app2' 
print(df2)

f:id:bitop:20180715104712p:plain

df = pd.concat([df1,df2],sort=False)
print(df[['app_name','user_id','name','email']])

f:id:bitop:20180715104759p:plain

3-4-2 複数のテーブルを横に並べる

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df1 = pd.read_sql("SELECT * FROM mst_categories", conn)
print(df1)
df2 = pd.read_sql("SELECT * FROM category_sales", conn)
print(df2)
df3 = pd.read_sql("SELECT * FROM product_sale_ranking", conn)
print(df3)

f:id:bitop:20180715104858p:plain

df4 = pd.merge(df3,pd.merge(df1,df2,on='category_id'),on='category_id')
print(df4[['category_id','name','sales_y','product_id']])

f:id:bitop:20180715105008p:plain

#AND r.rank=1が表現できない
df4 = pd.merge(df3,pd.merge(df1,df2,on='category_id',how='outer'),on='category_id',how='outer')
print(df4)

f:id:bitop:20180715105140p:plain

「ビックデータ分析・活用のためのSQLレシピ」を読む

3-3-2 グループの中での順序を扱う

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM popular_products", conn)
print(df)

f:id:bitop:20180714125641p:plain

df['row'] = df['score'].rank(
    ascending=False,
    method='first'
).astype(int)
df['rank'] = df['score'].rank(
   ascending=False,
   method='min'
).astype(int)
df['dense_rank'] = df['score'].rank(
   ascending=False,
   method='dense'
).astype(int)
    df['lag1'] = df['product_id'].shift(1)
    df['lag2'] = df['product_id'].shift(2)
    df['lead1'] = df['product_id'].shift(-1)
    df['lead2'] = df['product_id'].shift(-2)
print(df)

f:id:bitop:20180714125732p:plain

ORDER BY句と集約関数を組み合わせる

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM popular_products", conn)
print(df)

f:id:bitop:20180714125846p:plain

df['row'] = df['score'].rank(
    ascending=False,
    method='first'
).astype(int)
df['cum_score'] = df['score'].cumsum()
df['loacl_avg'] = 
df['score'].rolling(window=3,center=True,min_periods=1).mean()
df['first_value'] = df['product_id'][df['score'].idxmax()]
df['last_value'] = df['product_id'][df['score'].idxmin()]
print(df)

f:id:bitop:20180714125959p:plain

PATITION BY とORDER BYを組み合わせる

いろいろ試してみたができなかった

3-3-3 縦持ちデータを横もちにする

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM daily_kpi", conn)
print(df)

f:id:bitop:20180714130128p:plain

pd.pivot_table(df,index='dt',columns='indicator')

f:id:bitop:20180714130203p:plain

3-3-4 横もちを縦持ちにする

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM quarterly_sales", conn)
print(df)

f:id:bitop:20180714130254p:plain

df.stack()

f:id:bitop:20180714130431p:plain