Reading 「ビッグデータ分析・活用のためのSQLレシピ」

5-2 Finding changes of state over time across the whole user base

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df_user = pd.read_sql("SELECT * FROM mst_users", conn)
print(df_user.head(3))

f:id:bitop:20180812073419p:plain

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df_act = pd.read_sql("SELECT * FROM action_log", conn)
print(df_act.head(3))

f:id:bitop:20180812073503p:plain

5-2-1 Looking at the trend in the number of registered users

df_user.groupby('register_date')['user_id'].count()

f:id:bitop:20180812073630p:plain

df_user['register_date'] = pd.to_datetime(df_user['register_date'])
# Zero-padded "YYYY-MM" keys so the months group and sort chronologically
df_user['year_month'] = df_user['register_date'].dt.strftime('%Y-%m')
df_month = pd.DataFrame(df_user.groupby('year_month')['register_date'].count())
df_month['last_month_count'] = df_month['register_date'].shift(1)
df_month['month_over_month_ratio'] = df_month['register_date'] / df_month['last_month_count']
print(df_month)

f:id:bitop:20180812073722p:plain
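Since register_date is already a datetime column at this point, the same monthly table can also be sketched with resample; this is just an alternative I tried, not the book's listing.

# Alternative sketch: monthly registration counts via resample;
# count / count.shift(1) is the same month-over-month ratio as above
monthly = df_user.set_index('register_date').resample('M')['user_id'].count()
print(pd.DataFrame({'count': monthly, 'ratio': monthly / monthly.shift(1)}))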

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

Chapter 5: Extracting data to understand users

5-1 Finding characteristics and trends across the whole user base

5-1-1 Aggregating the number of user actions

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM action_log", conn)
print(df)

f:id:bitop:20180805091114p:plain

# I got stuck here at first: UU means unique users, so count distinct sessions
total_uu = df['session'].nunique()
dg1 = df.groupby('action')
df1 = pd.DataFrame({
               'action_uu':dg1['session'].nunique(),
               'total_uu':total_uu
               }
              )
print(df1)

f:id:bitop:20180805091202p:plain
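For reference, the share of all users that performed each action (the book calls something like this usage_rate, if I remember right) can be added on top of the df1 above; the percentage scaling is my own choice.

# Share of all sessions (UU) that performed each action
df1['usage_rate'] = 100.0 * df1['action_uu'] / df1['total_uu']
print(df1)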

5-1-2 Aggregating users by age group

import pandas as pd
import psycopg2
import numpy as np
import datetime
import math as m

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM mst_users", conn)
print(df)

f:id:bitop:20180805091327p:plain

df['birth_date'] = pd.to_datetime(df['birth_date'])
df['today'] = pd.to_datetime(datetime.date.today())
# Approximate age in whole years: elapsed days / 365, ignoring leap days
dff = (df['today'] - df['birth_date'])
dff = dff.dt.total_seconds() / 3600 / 24 / 365
df['age'] = dff.map(m.floor)
print(df)

f:id:bitop:20180805091455p:plain
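The division by 365 drifts slightly because of leap years. An exact version using the YYYYMMDD-integer trick (my own addition, not the book's listing as-is) looks like this:

# Exact age: floor((today as YYYYMMDD - birthday as YYYYMMDD) / 10000)
today_int = int(datetime.date.today().strftime('%Y%m%d'))
birth_int = df['birth_date'].dt.strftime('%Y%m%d').astype(int)
df['age_exact'] = (today_int - birth_int) // 10000
print(df[['user_id', 'birth_date', 'age', 'age_exact']].head())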

def age_category(age):
    # Age bands: C = child (4-12), T = teens (13-19),
    # 1 = 20-34, 2 = 35-49, 3 = 50 and over
    if 4 <= age <= 12:
        return 'C'
    if 13 <= age <= 19:
        return 'T'
    if 20 <= age <= 34:
        return '1'
    if 35 <= age <= 49:
        return '2'
    if 50 <= age:
        return '3'

def sex_cut(x):
    # The child and teen bands are not split by sex, so drop the sex prefix
    if 'C' in x:
        x = 'C'
    if 'T' in x:
        x = 'T'
    return x

df['age_category'] = df['age'].map(age_category)
df['category'] = df['sex'] + df['age_category']
df['category'] = df['category'].map(sex_cut)
print(df[['user_id','sex','age','category']])

f:id:bitop:20180805091623p:plain
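As an aside, the same banding can be written more compactly with pd.cut; the bin edges below reproduce the ranges in age_category (200 is just a loose upper bound).

# Same age bands with pd.cut: (3,12]=C, (12,19]=T, (19,34]=1, (34,49]=2, (49,200]=3
df['age_category_cut'] = pd.cut(df['age'],
                                bins=[3, 12, 19, 34, 49, 200],
                                labels=['C', 'T', '1', '2', '3'])
print(df[['age', 'age_category', 'age_category_cut']].head())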

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

4-2-2 Identifying best sellers with ABC analysis

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_detail_log", conn)
print(df.head(10))

f:id:bitop:20180729091655p:plain

df1 = df.groupby('category')['price'].sum()

df2 = pd.DataFrame({'category':df1.index,
                'amount':df1.values})
# ABC analysis ranks categories from the biggest seller down,
# so sort by amount (descending) before taking the cumulative sum
df2 = df2.sort_values('amount', ascending=False).reset_index(drop=True)

sum_amount = df2['amount'].sum()
df2['cumsum'] = df2['amount'].cumsum()
df2['composition'] = df2['amount'].map(lambda x:100.0 * x/sum_amount)
df2['cumulative_ratio'] = df2['cumsum'].map(lambda x:100.0 * x/sum_amount)
df2['abc_rank'] = df2['cumulative_ratio'].map(lambda x: 'A' if x <= 70 else 'B' if x <= 90 else 'C')
print(df2)

f:id:bitop:20180729091758p:plain
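A Pareto-style chart is a quick visual check on the ABC ranks. A minimal matplotlib sketch over the df2 built above (assumes a %matplotlib inline notebook):

import matplotlib.pyplot as plt

# Pareto chart: amount per category as bars, cumulative ratio as a line
pos = list(range(len(df2)))
fig, ax1 = plt.subplots()
ax1.bar(pos, df2['amount'], color='steelblue')
ax1.set_xticks(pos)
ax1.set_xticklabels(df2['category'], rotation=45)
ax1.set_ylabel('amount')
ax2 = ax1.twinx()
ax2.plot(pos, df2['cumulative_ratio'], color='red', marker='o')
ax2.set_ylabel('cumulative_ratio (%)')
plt.show()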

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

4-4-6 Key points for understanding sales

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_log_long", conn)
print(df)

f:id:bitop:20180726072940p:plain

# Tricky part: aggregate per day first, then take a running total within each month
df['dt'] = pd.to_datetime(df['dt'])
df = df.set_index('dt')
df['month'] = df.index.month

df_d = df.groupby(df.index)
df1 = pd.DataFrame({'month':df_d['month'].first(),
                'order_count':df_d['order_id'].count(),
                'avg_amount':df_d['purchase_amount'].mean(),
                'daily_amount':df_d['purchase_amount'].sum()})
# Cumulative sales that reset at each month boundary
df1['monthly'] = df1.groupby('month')['daily_amount'].cumsum()
print(df1.head(12))

f:id:bitop:20180726073043p:plain

4-2 Aggregating data along multiple axes

4-2-1 Calculating sales and subtotals by category

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_detail_log", conn)
print(df)

f:id:bitop:20180726073444p:plain

# This attempt doesn't work as intended: a MultiIndex alone gives no subtotal
# rows per category (see the rollup-style sketch below)
dfm = df.set_index(['category','sub_category']).sort_index()
print(dfm['price'])

f:id:bitop:20180726073510p:plain
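What the book's SQL gets here with ROLLUP / UNION ALL, i.e. sales per sub_category plus a subtotal row for each category, can be sketched like this; the 'all' label for the subtotal rows is how I chose to mark them.

# Sales per (category, sub_category) plus one subtotal row per category
df_sub = df.groupby(['category', 'sub_category'], as_index=False)['price'].sum()
df_cat = df.groupby('category', as_index=False)['price'].sum()
df_cat['sub_category'] = 'all'   # label for the subtotal rows
rollup = pd.concat([df_sub, df_cat]).sort_values(['category', 'sub_category'])
print(rollup[['category', 'sub_category', 'price']])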

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

4-4-2 Looking at daily trends with moving averages

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_log", conn)
print(df.head(10))

f:id:bitop:20180722084659p:plain

gb = df.groupby('dt')
s = gb['purchase_amount'].sum()
df1 = pd.DataFrame({'total_amount':s,
                'seven_day_avg_amount':s.rolling(window=7,min_periods=1).mean(),
                'seven_day_avg_amount_strict':s.rolling(window=7,min_periods=7).mean()
               })
print(df1)

f:id:bitop:20180722084814p:plain
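rolling(window=7) above counts rows, so it only equals a 7-day window while the log has every date. A time-based variant for reference, using the same s as above:

# 7-calendar-day moving average; needs a sorted DatetimeIndex
s_daily = s.copy()
s_daily.index = pd.to_datetime(s_daily.index)
print(s_daily.rolling('7D').mean())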

4-4-3 Calculating cumulative sales for the current month

gb = df.groupby('dt')
df1 = pd.DataFrame({'total_amount':pd.Series(gb['purchase_amount'].sum())})
df1.index = pd.to_datetime(df1.index)
df1['agg_amount'] = df1['total_amount'].cumsum()
print(df1)

f:id:bitop:20180722084939p:plain
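Strictly, this section restarts the running total at each month, while the code above accumulates over the whole log. If the log spanned several months, a per-month version could be sketched as:

# Cumulative daily sales that reset at each month boundary
df1['year_month'] = df1.index.to_period('M')
df1['agg_amount_by_month'] = df1.groupby('year_month')['total_amount'].cumsum()
print(df1)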

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

Chapter 4: Extracting data to understand sales

4-4 Aggregating data along a time series

4-4-1 Aggregating daily sales

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_log", conn)
print(df.head(3))

f:id:bitop:20180721075517p:plain

gb = df.groupby('dt')
df1 = pd.DataFrame({'purchase_count':gb['purchase_amount'].count(),
                'total_amount':gb['purchase_amount'].sum(),
                'avg_amount':gb['purchase_amount'].mean()})
print(df1[['purchase_count','total_amount','avg_amount']])

f:id:bitop:20180721075738p:plain

%matplotlib inline
df1['total_amount'].plot(kind='bar',color='r')
df1['avg_amount'].plot(kind='line')

f:id:bitop:20180721075824p:plain
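total_amount and avg_amount are on quite different scales, so the line ends up nearly flat in the combined chart. Putting avg_amount on a secondary axis makes it readable; a small pandas-plotting sketch:

# Bars for total_amount, line for avg_amount on its own (right-hand) axis
ax = df1['total_amount'].plot(kind='bar', color='r')
df1['avg_amount'].plot(kind='line', color='b', secondary_y=True, ax=ax)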

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

3-4-3 Representing condition flags as 0 and 1

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df1 = pd.read_sql("SELECT * FROM mst_users_with_card_number", conn)
df2 = pd.read_sql("SELECT * FROM purchase_log", conn)
print(df1.head(3))
print('-'*50)
print(df2)

f:id:bitop:20180716101507p:plain

df4 = df2.groupby('user_id')
df4 = pd.DataFrame(df4['user_id'].count())
df4 = df4.rename(columns={'user_id':'purchased_count'})
df5 = df1.merge(df4,on='user_id',how='outer')

df4 = df2.groupby('user_id')
df4 = np.sign(pd.DataFrame(df4['user_id'].count()))
df4 = df4.rename(columns={'user_id':'has_purchased'})
df3 = df5.merge(df4,on='user_id',how='outer')

# Derive the flags from df3 itself so they stay aligned after the outer merges,
# and express has_purchased as 0/1 rather than NaN for users with no purchases
df3['has_card'] = df3['card_number'].map(lambda x : 0 if pd.isnull(x) else 1)
df3['has_purchased'] = df3['has_purchased'].fillna(0).astype(int)

print(df3[['user_id','card_number','purchased_count','has_card','has_purchased']])

f:id:bitop:20180716101618p:plain
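For comparison, the same flag table can be built with a single left join from the user master; the column names match the ones above, the rest is my own shorthand rather than the book's listing.

# One left join, then boolean flags cast to 0/1
counts = df2.groupby('user_id').size().rename('purchased_count').reset_index()
flags = df1.merge(counts, on='user_id', how='left')
flags['has_card'] = flags['card_number'].notnull().astype(int)
flags['has_purchased'] = flags['purchased_count'].notnull().astype(int)
print(flags[['user_id', 'card_number', 'purchased_count', 'has_card', 'has_purchased']])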