Reading 「ビッグデータ分析・活用のためのSQLレシピ」

3-4-3 Expressing condition flags as 0 and 1

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df1 = pd.read_sql("SELECT * FROM mst_users_with_card_number", conn)
df2 = pd.read_sql("SELECT * FROM purchase_log", conn)
print(df1.head(3))
print('-'*50)
print(df2)

f:id:bitop:20180716101507p:plain

# Count purchase_log rows per user to build purchased_count.
df4 = df2.groupby('user_id')
df4 = pd.DataFrame(df4['user_id'].count())
df4 = df4.rename(columns={'user_id':'purchased_count'})
df5 = df1.merge(df4,on='user_id',how='outer')

# np.sign() collapses the purchase count into a 0/1-style has_purchased flag.
df4 = df2.groupby('user_id')
df4 = np.sign(pd.DataFrame(df4['user_id'].count()))
df4 = df4.rename(columns={'user_id':'has_purchased'})
df3 = df5.merge(df4,on='user_id',how='outer')

# has_card: 0 when card_number is NULL, 1 otherwise (taken from the merged frame).
df3['has_card'] = df3['card_number'].map(lambda x : 0 if pd.isnull(x) else 1)

print(df3[['user_id','card_number','purchased_count','has_card','has_purchased']])

f:id:bitop:20180716101618p:plain

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

3-4 Operations on multiple tables

3-4-1 Stacking multiple tables vertically

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df1 = pd.read_sql("SELECT * FROM app1_mst_users", conn)
df1['app_name'] = 'app1' 
print(df1)
df2 = pd.read_sql("SELECT * FROM app2_mst_users", conn)
df2['app_name'] = 'app2' 
print(df2)

f:id:bitop:20180715104712p:plain

df = pd.concat([df1,df2],sort=False)
print(df[['app_name','user_id','name','email']])

f:id:bitop:20180715104759p:plain

3-4-2 Joining multiple tables horizontally

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df1 = pd.read_sql("SELECT * FROM mst_categories", conn)
print(df1)
df2 = pd.read_sql("SELECT * FROM category_sales", conn)
print(df2)
df3 = pd.read_sql("SELECT * FROM product_sale_ranking", conn)
print(df3)

f:id:bitop:20180715104858p:plain

df4 = pd.merge(df3,pd.merge(df1,df2,on='category_id'),on='category_id')
print(df4[['category_id','name','sales_y','product_id']])

f:id:bitop:20180715105008p:plain

# Could not express the AND r.rank = 1 condition here (a possible workaround is sketched after the output below)
df4 = pd.merge(df3,pd.merge(df1,df2,on='category_id',how='outer'),on='category_id',how='outer')
print(df4)

f:id:bitop:20180715105140p:plain
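As for the AND r.rank = 1 condition, a minimal sketch of one way it might be handled, assuming product_sale_ranking actually exposes a rank column (an assumption on my part), is to filter df3 down to the top-ranked rows before merging:

# Hypothetical: keep only the rank 1 rows of product_sale_ranking before joining.
# The 'rank' column name is assumed; adjust it to whatever the table really uses.
top1 = df3[df3['rank'] == 1]
df5 = pd.merge(top1, pd.merge(df1, df2, on='category_id'), on='category_id')
print(df5)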

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

3-3-2 Handling order within a group

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM popular_products", conn)
print(df)

f:id:bitop:20180714125641p:plain

# rank() with method 'first' / 'min' / 'dense' corresponds to ROW_NUMBER / RANK / DENSE_RANK.
df['row'] = df['score'].rank(
    ascending=False,
    method='first'
).astype(int)
df['rank'] = df['score'].rank(
   ascending=False,
   method='min'
).astype(int)
df['dense_rank'] = df['score'].rank(
   ascending=False,
   method='dense'
).astype(int)
# shift() reproduces LAG/LEAD: positive periods look back, negative ones look ahead.
df['lag1'] = df['product_id'].shift(1)
df['lag2'] = df['product_id'].shift(2)
df['lead1'] = df['product_id'].shift(-1)
df['lead2'] = df['product_id'].shift(-2)
print(df)

f:id:bitop:20180714125732p:plain

Combining the ORDER BY clause with aggregate functions

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM popular_products", conn)
print(df)

f:id:bitop:20180714125846p:plain

df['row'] = df['score'].rank(
    ascending=False,
    method='first'
).astype(int)
df['cum_score'] = df['score'].cumsum()
# Moving average over the current row and its neighbours (window of 3, centered).
df['local_avg'] = df['score'].rolling(window=3,center=True,min_periods=1).mean()
# FIRST_VALUE / LAST_VALUE equivalents: product_id at the highest and lowest score.
df['first_value'] = df['product_id'][df['score'].idxmax()]
df['last_value'] = df['product_id'][df['score'].idxmin()]
print(df)

f:id:bitop:20180714125959p:plain

Combining PARTITION BY and ORDER BY

I tried several approaches but could not reproduce this.
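For reference, a minimal sketch of one direction that might approximate PARTITION BY ... ORDER BY, assuming popular_products has a category column to partition on (an assumption on my part):

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM popular_products", conn)
# Rank within each (assumed) category, highest score first, roughly like
# ROW_NUMBER() OVER(PARTITION BY category ORDER BY score DESC).
df['row'] = df.groupby('category')['score'].rank(ascending=False, method='first').astype(int)
print(df.sort_values(['category', 'row']))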

3-3-3 Pivoting long (vertical) data into wide format

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM daily_kpi", conn)
print(df)

f:id:bitop:20180714130128p:plain

pd.pivot_table(df,index='dt',columns='indicator')

f:id:bitop:20180714130203p:plain

3-3-4 Unpivoting wide data into long format

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM quarterly_sales", conn)
print(df)

f:id:bitop:20180714130254p:plain

df.stack()

f:id:bitop:20180714130431p:plain
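Note that df.stack() also folds the year values into the stacked index. A sketch of an alternative that keeps year as an identifier column, using pd.melt on the q1 to q4 columns of quarterly_sales:

# Unpivot q1..q4 into (year, quarter, sales) rows.
df_long = pd.melt(df, id_vars='year', value_vars=['q1','q2','q3','q4'],
                  var_name='quarter', value_name='sales')
print(df_long.sort_values(['year','quarter']))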

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

3-2-6 Handling IP addresses

import pandas as pd
import psycopg2
import numpy as np
import ipaddress

ip1 = ipaddress.ip_address('127.0.0.1')
ip2 = ipaddress.ip_address('127.0.0.2')
ip3 = ipaddress.ip_address('192.168.0.1')
df = pd.DataFrame({'lt':[ip1 < ip2],'gt':[ip1 > ip3]})
print(df)

f:id:bitop:20180712162226p:plain

ip = ipaddress.ip_address('127.0.0.1')
nw = ipaddress.ip_network('127.0.0.0/8')
df = pd.DataFrame({'is_contained':[ip in nw]})
print(df)

f:id:bitop:20180712162437p:plain

3-3 Operations on a single table

3-3-1 Capturing the characteristics of a group

import pandas as pd
import psycopg2
import numpy as np

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM review", conn)
print(df)

f:id:bitop:20180712162634p:plain

df1 = pd.DataFrame({'user_id':df.groupby('user_id')['user_id'].count(),
                'product_id':df.groupby('user_id')['product_id'].count(),
                'sum':df.groupby('user_id')['score'].sum(),
                'mean':df.groupby('user_id')['score'].mean(),
                'max':df.groupby('user_id')['score'].max(),
                'min':df.groupby('user_id')['score'].min()
               })
print(df1)

f:id:bitop:20180712174213p:plain

# Alternative approach
df2 = df.groupby('user_id').agg({'score':['count','sum','mean','max','min'],
                             'product_id':['count']})
print(df2)

f:id:bitop:20180712174301p:plain

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

3-2 Operations on multiple values

3-2-1 Concatenating strings

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM mst_user_location", conn)
print(df)

f:id:bitop:20180708112536p:plain

df['pref_city'] = df['pref_name'] + df['city_name']
print(df[['user_id','pref_city']])

# Alternative approach
df = pd.read_sql("SELECT * FROM mst_user_location", conn)
# Put a space between the two values
df['pref_city'] = df['pref_name'].str.cat(df['city_name'],sep = ' ')
print(df[['user_id','pref_city']])

f:id:bitop:20180708112637p:plain

3-2-2 Comparing multiple values

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM quarterly_sales", conn)
print(df)

f:id:bitop:20180708112738p:plain

import numpy as np

df.loc[df['q1'] < df['q2'], 'judge_q1_q2'] = '+'
df.loc[df['q1'] > df['q2'], 'judge_q1_q2'] = '-'
df.loc[df['q1'] == df['q2'], 'judge_q1_q2'] = ' '
df['diff_q2_q1'] = df['q2'] - df['q1']
df['sign_q2_q1'] = np.sign(df['q2'] - df['q1'])

f:id:bitop:20180708112919p:plain

df['greatest_sales'] = df[['q1','q2','q3','q4']].max(axis=1)
df['least_sales'] =    df[['q1','q2','q3','q4']].min(axis=1)
df[['year','greatest_sales','least_sales']]

f:id:bitop:20180708113017p:plain

df['average'] = df[['q1','q2','q3','q4']].mean(axis=1)
print(df)

f:id:bitop:20180708113132p:plain

3-2-3 Computing the ratio of two values

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM advertising_stats", conn)
print(df)

f:id:bitop:20180708113246p:plain

# Restrict to one day; copy() avoids SettingWithCopyWarning on the assignment below.
df1 = df[df.dt=='2017-04-01'].copy()
df1['ctf'] = df1['clicks']/df1['impressions']
print(df1[['dt','ad_id','ctf']])

f:id:bitop:20180708113340p:plain

df['ctf'] = df['clicks']/df['impressions']
print(df[['dt','ad_id','ctf']])

f:id:bitop:20180708113614p:plain

3-2-4 Computing the distance between two values

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM location_1d", conn)
print(df)

f:id:bitop:20180708113718p:plain

df['abs'] = np.abs(df.x1 - df.x2)
df['rms'] = np.sqrt(np.power(df['x1'] - df['x2'],2)) # df['x1'] and df.x1 are equivalent
print(df)

f:id:bitop:20180708113828p:plain

Computing the Euclidean distance between two points in the xy-plane

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM location_2d", conn)
print(df)

f:id:bitop:20180708113936p:plain

df['dist'] = np.sqrt(np.power(df.x1 - df.x2,2) + np.power(df.y1 - df.y2,2))
print(df[['dist']])

f:id:bitop:20180708114017p:plain

3-2-5 Calculating dates and times

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM mst_users_with_birthday", conn)
print(df)

f:id:bitop:20180708114121p:plain

df['register_stamp'] = pd.to_datetime(df['register_stamp'])
df['birth_date'] = pd.to_datetime(df['birth_date'])
from pandas.tseries.offsets import Hour,Minute,Day
# Subtracting one month ("before 1 month") did not work here; I could not get the
# month-difference calculation right (a possible workaround is sketched after this example).
df['after_1_hour'] = df['register_stamp'] + Hour(1)
df['before_30_minutes'] = df['register_stamp'] - Minute(30)
df['after_1_day'] = df['register_stamp'] + Day(1)
print(df)

f:id:bitop:20180708114254p:plain

df[['user_id','register_stamp','after_1_hour','before_30_minutes','after_1_day']]

f:id:bitop:20180708114337p:plain
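The one-month offset flagged as not working in the comment above can, as far as I can tell, be written with pd.DateOffset; a minimal sketch:

# Subtract a calendar month (month lengths are handled, unlike a fixed Day(30)).
df['before_1_month'] = df['register_stamp'] - pd.DateOffset(months=1)
df[['user_id','register_stamp','before_1_month']]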

from datetime import datetime
df['today'] = datetime.today()
df['diff_days'] = df['today'] - df['register_stamp']
df[['user_id','today','register_stamp','diff_days']]

f:id:bitop:20180708114437p:plain

# The age calculation did not work properly (a workaround is sketched below)
df['age'] = (df['today'] -  df['birth_date'])/365
df[['user_id','today','birth_date','age']]

f:id:bitop:20180708114525p:plain
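For the age calculation, a minimal sketch that at least yields an integer number of years, assuming a flat 365-day year is acceptable (it ignores leap days):

# Approximate age in whole years from the day difference.
df['age'] = (df['today'] - df['birth_date']).dt.days // 365
df[['user_id','today','birth_date','age']]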

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

3-1-3 Splitting a string into an array

import re

def process(s):
    # Grab '//host/path' up to '?' or '#', then return the first path segment.
    m=re.search('//[^/]+([^?#]+)',s)
    return m.group(0).split('/')[3]

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM access_log", conn)
df['path1']=df['url'].map(process)

f:id:bitop:20180707153227p:plain

3-1-4 Handling dates and timestamps

import datetime 
import time

df = pd.DataFrame({'dt':[datetime.date.today()],'timestamp':[time.ctime()]})

f:id:bitop:20180707152303p:plain

Getting date and time data for specified values

import datetime 
import time
from dateutil.parser import parse

df = pd.DataFrame({'dt':[parse('2015-10-12')],'timestamp':[parse('2015-10-12 12:00:00')]})

f:id:bitop:20180707152418p:plain

Extracting specific fields from a date/time

import datetime 
import time
from dateutil.parser import parse

df = pd.DataFrame({'stamp':[parse('2015-10-12 12:00')]})
df['year'] = df['stamp'].map(lambda x:x.year)
df['month'] = df['stamp'].map(lambda x:x.month)
df['day'] = df['stamp'].map(lambda x:x.day)
df['hour'] = df['stamp'].map(lambda x:x.hour)

f:id:bitop:20180707152548p:plain

# Alternative approach
df = pd.DataFrame({'stamp':[parse('2015-11-30 23:00')]})
df['year'] = df['stamp'].dt.year
df['month'] = df['stamp'].dt.month
df['day'] = df['stamp'].dt.day
df['hour'] = df['stamp'].dt.hour

f:id:bitop:20180707152642p:plain

3-1-5 Replacing missing values with defaults

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM purchase_log_with_coupon", conn)

f:id:bitop:20180707152739p:plain

df = df.fillna(0)

f:id:bitop:20180707152816p:plain

Reading 「ビッグデータ分析・活用のためのSQLレシピ」

3-1 Operations on a single value

3-1-1 Replacing codes with labels

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=BigData host=localhost user=testuser")
df = pd.read_sql("SELECT * FROM mst_users", conn)

f:id:bitop:20180705133210p:plain

dev = {
    1:"PC",
    2:"SP",
    3:"アプリ"
}
df['register_device'] = df['register_device'].map(dev)

f:id:bitop:20180705133343p:plain

3-1-2 Extracting elements from a URL

conn = psycopg2.connect("dbname=BigData host=localhost 
user=testuser")
df = pd.read_sql("SELECT * FROM access_log", conn)

f:id:bitop:20180705133532p:plain

df['referrer_host'] =  df['url'].str.extract('https?://([^/]*)',expand=False)
df[['stamp','referrer_host']]

f:id:bitop:20180705133644p:plain

df = pd.read_sql("SELECT * FROM access_log", conn)
df['path'] = df['url'].str.extract('//[^/]+([^?#]+)',expand=False)
df['id'] = df['url'].str.extract('id=([^&]*)',expand=False)
df[['stamp','url','path','id']]

f:id:bitop:20180705133740p:plain