Python: A Recommender System Built on a Community Network

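After a while the site has accumulated some traffic, so this post shows how to use that traffic to build a recommender system. The idea is that articles viewed within the same session are linked to one another, and the Louvain community detection covered in an earlier post is then used to group the articles.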

Data Source

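The data comes mainly from what Google Analytics (GA) can export. The SessionId, however, has to be set up in advance; see Simon's website for how to do that.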

python

# Import Libraries
import re
import itertools
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from community import community_louvain
%matplotlib inline

# Import data set
df = pd.read_csv('data/smiliu.csv')
df.head()

Data Cleaning

python

# Clean Data
df['Page'] = df['Page'].apply(lambda x: re.search(r'\d+', x)[0])  # keep the numeric article id (raw string avoids an invalid-escape warning)
df['PageTitle'] = df['PageTitle'].apply(lambda x: x.split('|')[0])
df.head()
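
The cleaning step above assumes every Page value contains digits; a path without any (a home page, for instance) would make `re.search` return None and the subscript fail. A minimal sketch of a more defensive variant follows. The helper names and the sample paths and titles are hypothetical, not taken from the original data, and unlike the original the title helper also strips surrounding whitespace.

```python
import re

def extract_page_id(path):
    """Return the first run of digits in a URL path, or None if there is none."""
    match = re.search(r'\d+', path)
    return match.group(0) if match else None

def strip_site_suffix(title):
    """Keep only the part of the title before the first '|' separator."""
    return title.split('|')[0].strip()

# Hypothetical sample values, for illustration only
print(extract_page_id('/posts/42/'))   # 42
print(extract_page_id('/about/'))      # None
print(strip_site_suffix('Louvain Community Detection | Example Blog'))  # Louvain Community Detection
```

Rows for which `extract_page_id` returns None could then be dropped before building the graph.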

Building Weights Between Articles

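When a session contains views of more than one article, each pair of articles in that session gets a weight equal to the reciprocal of the number of articles in the session. The weights of each pair are then summed across all sessions.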

python

# Collect the pages viewed in each session
session_group = df.groupby('SessionId')['Page'].agg(list).reset_index()

# Keep only sessions with more than one page view
session_group = session_group[session_group['Page'].str.len() > 1]
session_group['Page'] = session_group['Page'].apply(sorted)
session_group['weight'] = session_group['Page'].apply(lambda x: 1 / len(x))  # reciprocal of the page count

# Expand every session into one row per page pair
session_group['page_pair'] = session_group['Page'].apply(lambda x: list(itertools.combinations(x, 2)))
session_group = session_group.explode('page_pair')
session_group[['page_1', 'page_2']] = session_group['page_pair'].apply(pd.Series)

# Sum the weights of each pair across all sessions
graph_data = session_group[['page_1', 'page_2', 'weight']]
graph_data = graph_data.groupby(['page_1', 'page_2']).sum().reset_index()
graph_data.head()
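
To make the weighting rule concrete, here is a minimal sketch of the same calculation in plain Python on three made-up sessions. The session ids, page ids, and the `pair_weights` helper are hypothetical; the sketch mirrors the pandas steps above and assumes each page appears at most once per session.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sessions: session id -> page ids viewed in that session
sessions = {
    's1': ['10', '12'],
    's2': ['10', '12', '15'],
    's3': ['12'],  # single-page session, contributes no pairs
}

def pair_weights(sessions):
    """Sum 1/len(session) over every page pair that shares a session."""
    weights = defaultdict(float)
    for pages in sessions.values():
        if len(pages) < 2:
            continue
        weight = 1 / len(pages)  # reciprocal of the page count
        for pair in combinations(sorted(pages), 2):
            weights[pair] += weight
    return dict(weights)

weights = pair_weights(sessions)
print({pair: round(w, 3) for pair, w in weights.items()})
# {('10', '12'): 0.833, ('10', '15'): 0.333, ('12', '15'): 0.333}
```

Pages 10 and 12 appear together in two sessions, so their weight is 1/2 + 1/3. Longer sessions contribute less per pair, which keeps a single long browsing session from dominating the graph.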

Clustering with the Louvain Algorithm

python

# Create Graph Data
G = nx.from_pandas_edgelist(graph_data, 'page_1', 'page_2', ['weight'])
partition = community_louvain.best_partition(G, resolution = 1)

# Drawing
communities = sorted(set(partition.values()))
size = len(communities)  # number of communities
pos = nx.spring_layout(G)  # xy positions of the nodes
cm = plt.get_cmap('RdYlBu', size)  # colormap resampled to one color per community
print('community: ', size)
plt.figure(1, figsize=(8, 8))
for idx, com in enumerate(communities):
    r, g, b = cm(idx, bytes=True)[:3]
    color_code = '#%02x%02x%02x' % (r, g, b)  # RGB to hex color code
    list_nodes = [node for node in partition if partition[node] == com]
    nx.draw_networkx_nodes(G, pos, nodelist=list_nodes, node_size=30, node_color=color_code)

# Draw labels and edges once, outside the loop
nx.draw_networkx_labels(G, pos, font_size=10)
nx.draw_networkx_edges(G, pos, width=0.1, alpha=1)
plt.show()
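
Louvain searches for the partition that maximizes modularity, a score that compares the edge weight inside each community with what a random graph with the same degrees would give. The following is a minimal sketch of that score in plain Python, not the library's implementation; the `modularity` helper and the toy graph (two triangles joined by one edge) are made up for illustration.

```python
from collections import defaultdict

def modularity(edges, partition):
    """Weighted modularity of a partition.

    edges: iterable of (u, v, weight); partition: node -> community id.
    """
    total_weight = 0.0
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for u, v, w in edges:
        total_weight += w
        degree[partition[u]] += w
        degree[partition[v]] += w
        if partition[u] == partition[v]:
            internal[partition[u]] += w
    return sum(
        internal[c] / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )

# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d
edges = [('a', 'b', 1), ('b', 'c', 1), ('a', 'c', 1),
         ('d', 'e', 1), ('e', 'f', 1), ('d', 'f', 1),
         ('c', 'd', 1)]
two_groups = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1}
one_group = {node: 0 for node in two_groups}

print(round(modularity(edges, two_groups), 3))  # 0.357
print(round(modularity(edges, one_group), 3))   # 0.0
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the kind of split Louvain is designed to find.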

Clustering Results

python

page_label = df.drop_duplicates(['Page', 'PageTitle'])[['Page', 'PageTitle']]
page_label['community_id'] = page_label['Page'].apply(lambda x: partition.get(str(x), None))
page_label.sort_values('community_id').dropna().reset_index(drop=True)
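
The post stops at the community table, so how recommendations are actually served is left open. One possible approach, sketched here as an assumption rather than the author's method, is to recommend the pages in the same community that were most often co-viewed with the current page. The `recommend` helper, its parameters, and the sample data below are all hypothetical.

```python
def recommend(page, partition, weights, top_n=3):
    """Rank other pages in the same community by co-view weight with `page`.

    partition: page id -> community id (as returned by Louvain)
    weights:   (page_1, page_2) -> summed session weight
    """
    community = partition.get(page)
    if community is None:
        return []  # page never appeared in a multi-page session
    scores = {}
    for (a, b), w in weights.items():
        if page not in (a, b):
            continue
        other = b if a == page else a
        if other == page or partition.get(other) != community:
            continue
        scores[other] = scores.get(other, 0.0) + w
    # Highest weight first; ties broken by page id for a stable order
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]

# Hypothetical partition and edge weights, for illustration only
sample_partition = {'10': 0, '12': 0, '15': 0, '20': 1}
sample_weights = {('10', '12'): 0.83, ('10', '15'): 0.33,
                  ('10', '20'): 0.5, ('12', '15'): 0.33}

print(recommend('10', sample_partition, sample_weights))  # ['12', '15']
```

Page 20 is left out even though it shares an edge with page 10, because it belongs to a different community. Whether to filter by community or only use it as a tie-breaker is a design choice that depends on how much traffic the site has.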