Website Log Analysis System

Abstract

In Internet applications, logs are extremely important data: Internet projects are usually required to run 7*24 hours without interruption, so being able to collect and analyze the logs that monitor how the system runs is essential. Website traffic statistics are one of the key means of improving a website's service: by capturing users' behavior on the site and analyzing it, valuable information can be extracted and used to improve the site. This system replaces manual monitoring with programmatic monitoring. The monitored data is stored in a database, and the corresponding big-data components then analyze this large and seemingly useless data set to produce results the site administrator can use to assess how the site is operating and decide how to respond. The system integrates basic user information, browser information, region information, user browse-depth information, external-link data, order data and event information, stores them in the appropriate database components, and finally extracts and analyzes the data according to the different analysis requirements.

Chapter 1  System Requirements Analysis

1.1  System requirements

The system provides the following functions:

• Record user, browser, region, browse-depth and related information.
• Store the collected data in a suitable database.
• Basic user information analysis
• Browser information analysis
• Region information analysis
• User browse-depth analysis
• External-link data analysis
• Order analysis
• Event analysis

1.2  Main functions

1. Basic user information analysis module
   Analyzes browsing information from the two main perspectives of users/visitors and members, including (but not limited to) new users, active users, total users, new members, active members, total members, and session analysis.
2. Browser information analysis module
   Adds the browser dimension on top of the basic user information analysis.
3. Region information analysis module
   Analyzes users and members per province, and the number of visitors acquired from different regions.
4. User browse-depth analysis module
   Analyzes the depth of users' visit records.
5. External-link data analysis module
   Analyzes the visit volume brought in by each external link source.
6. Order data analysis module
   Analyzes order-related figures.
7. Event analysis module
   In this experiment only one order-related event is analyzed.

1.3  Development environment

• Three CentOS 7.0 virtual machines.
• JDK 1.8 + Tomcat 8.5 + Tengine 2.1 + Flume 1.6 + Hadoop 2.5 + HBase 0.98.12 + ZooKeeper 3.4.7 + Hive 2.1.1 + Sqoop 1.4.6

Chapter 2  System Outline Design

2.1 JS SDK execution workflow

Figure 2.1  JS SDK execution workflow

2.2 Data flow diagram

Figure 2.2  Data flow diagram

2.3 Modules and events

The launch event is used only for the new-user metric; every other task is handled by the pageview event. The events corresponding to each analysis module are listed in the table below.

Analysis module                    JS SDK event
Basic user information analysis    pageview
Browser information analysis       pageview
Region information analysis        pageview
External-link data analysis        pageview
User browse-depth analysis         pageview
Order information analysis         chargeRequest
Event analysis                     event
New users                          launch

Table 2.1  Module-to-event mapping

2.4 Event data parameters

Each event collects different data and sends it to the nginx server, but the collected data shares a number of common fields. The parameters used are described in the table below.

Parameter  Type    Description
en         string  event name, e.g. e_pv
ver        string  version number, e.g. 0.0.1
pl         string  platform, e.g. website
sdk        string  SDK type, e.g. js
b_rst      string  browser resolution, e.g. 1800*678
b_iev      string  browser user agent
u_ud       string  unique user/visitor identifier
l          string  client language
u_mid      string  member id, consistent with the business system
u_sd       string  session id
c_time     string  client time
p_url      string  URL of the current page
p_ref      string  URL of the previous page
tt         string  title of the current page
ca         string  category name of an Event
ac         string  action name of an Event
kv_*       string  custom Event properties
du         string  duration of an Event
oid        string  order id
on         string  order name
cua        string  payment amount
cut        string  payment currency type
pt         string  payment method

Table 2.2  Event data parameters
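For the ETL and reporting code later in this report it helps to view these parameters as a single record. The following is a purely illustrative Java sketch; the field names come from table 2.2, but the class itself is not part of the original project.

// Illustrative value object for the common event parameters of table 2.2.
public class EventLog {
    public String en;     // event name, e.g. e_pv
    public String ver;    // version number, e.g. 0.0.1
    public String pl;     // platform, e.g. website
    public String sdk;    // SDK type, e.g. js
    public String b_rst;  // browser resolution, e.g. 1800*678
    public String b_iev;  // browser user agent
    public String u_ud;   // unique user/visitor identifier
    public String l;      // client language
    public String u_mid;  // member id
    public String u_sd;   // session id
    public String c_time; // client time
    public String p_url;  // current page URL
    public String p_ref;  // previous page URL
    public String tt;     // current page title
}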

a) Launch event

This event fires when a user visits the site for the first time. It exposes no public API; only the data collection for this event is implemented.

Method name   (none, not exposed)
Data sent     u_sd=8E9559B3-DA35-44E1-AC98-85EB37D1F263&c_time=1449137597974&ver=1&en=e_l&pl=website&sdk=js&b_rst=1920*1080&u_ud=12BF4079-223E-4A57-AC60-C1A04D8F7A2F&b_iev=Mozilla%2F5.0%20(Windows%20NT%206.1%3B%20WOW64)%20AppleWebKit%2F537.1%20(KHTML%2C%20like%20Gecko)%20Chrome%2F21.0.1180.77%20Safari%2F537.1&l=zh-CN

Table 2.3  launch event

b) Pageview event

This event fires when a user opens or refreshes a page. It is called automatically, and can also be called manually by the developer.

Method name   onPageView
Data sent     ver=1&en=e_pv&pl=website&sdk=js&b_rst=1920*1080&u_ud=12BF4079-223E-4A57-AC60-C1A04D8F7A2F&b_iev=Mozilla%2F5.0%20(Windows%20NT%206.1%3B%20WOW64)%20AppleWebKit%2F537.1%20(KHTML%2C%20like%20Gecko)%20Chrome%2F21.0.1180.77%20Safari%2F537.1&l=zh-CN&u_sd=8E9559B3-DA35-44E1-AC98-85EB37D1F263&c_time=1449137597979&ht=http://www.bjsxt.com%3A8080&p_url=http%3A%2F%2Fwww.bjsxt.com%3A8080%2Fvst_track%2Findex.html

Table 2.4  pageview event

c) chargeRequest event

This event fires when a user places an order; the application must call it explicitly.

Method name   onChargeRequest
Data sent     u_sd=8E9559B3-DA35-44E1-AC98-85EB37D1F263&c_time=1449139048231&oid=orderid123&on=%E4%BA%A7%E5%93%81%E5%90%8D%E7%A7%B0&cua=1000&cut=%E4%BA%BA%E6%B0%91%E5%B8%81&pt=%E6%B7%98%E5%AE%9D&ver=1&en=e_crt&pl=website&sdk=js&b_rst=1920*1080&u_ud=12BF4079-223E-4A57-AC60-C1A04D8F7A2F&b_iev=Mozilla%2F5.0%20(Windows%20NT%206.1%3B%20WOW64)%20AppleWebKit%2F537.1%20(KHTML%2C%20like%20Gecko)%20Chrome%2F21.0.1180.77%20Safari%2F537.1&l=zh-CN

Parameter        Type    Required  Description
orderId          string            order id
orderName        string            product/purchase description
currencyAmount   double            order amount
currencyType     string            currency type
paymentType      string            payment method

Table 2.5  chargeRequest event

d) Event event

When a visitor/user triggers a business-defined event, the front end calls this method.

Method name   onEventDuration
Data sent     ca=%E7%B1%BB%E5%9E%8B&ac=%E5%8A%A8%E4%BD%9C&c_time=1449139512665&u_sd=8E9559B3-DA35-44E1-AC98-85EB37D1F263&kv_p_url=http%3A%2F%2Fwwwbjsxt..com%3A8080%2Fvst_track%2Findex.html&kv_%E5%B1%9E%E6%80%A7key=%E5%B1%9E%E6%80%A7value&du=1000&ver=1&en=e_e&pl=website&sdk=js&b_rst=1920*1080&u_ud=12BF4079-223E-4A57-AC60-C1A04D8F7A2F&b_iev=Mozilla%2F5.0%20(Windows%20NT%206.1%3B%20WOW64)%20AppleWebKit%2F537.1%20(KHTML%2C%20like%20Gecko)%20Chrome%2F21.0.1180.77%20Safari%2F537.1&l=zh-CN

Parameter  Type    Required  Description
category   string            custom event category
action     string            custom event action
map        map               additional properties
duration   long              event duration

Table 2.6  event event

• Other API methods

Only the method for setting the member id is described here; the other helper methods are introduced later when the JS is written.

a) Set member id

Method name         setMemberId
Method description  Sets the member id; call it after a member logs in.
Parameter  Type    Required  Description
memberid   string            member id

Table 2.7  Other API methods
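All of the SDK methods above ultimately issue a request to /log.gif on the nginx server (see chapter 4) with the event fields as query parameters. Below is a minimal, hypothetical Java test client that emulates onPageView; the host name and page URL are placeholders, and in the real project this request is made by the browser-side JS SDK.

import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustrative test client: emulate onPageView by requesting /log.gif
// on the nginx host with the pageview (e_pv) parameters from table 2.2.
public class PageViewSender {
    public static void main(String[] args) throws Exception {
        String enc = StandardCharsets.UTF_8.name();
        String query = "en=e_pv&ver=1&pl=website&sdk=js"
                + "&u_ud=" + UUID.randomUUID()
                + "&u_sd=" + UUID.randomUUID()
                + "&c_time=" + System.currentTimeMillis()
                + "&l=zh-CN"
                + "&b_rst=" + URLEncoder.encode("1920*1080", enc)
                + "&p_url=" + URLEncoder.encode("http://localhost:8080/vst_track/index.html", enc);
        URL url = new URL("http://node1/log.gif?" + query);   // nginx host is a placeholder
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("nginx responded: " + conn.getResponseCode());
        conn.disconnect();
    }
}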

Chapter 3  SDK Implementation

3.1 Project structure

Figure 3.1  Project structure

The services are started as shown below.

Figure 3.2  Starting the services

3.2 Event tests

1. Open http://localhost:8080/BIG_DATA_LOG2/demo.jsp in a browser to test.

Figure 3.3  Test page 1

2. Clicking demo2 triggers the pageview event.

Figure 3.4  Test page 2

3. Test the demo3 page.

Figure 3.5  Test page 3

4. Test the demo4 page.

Figure 3.6  Test page 4

Chapter 4  nginx Configuration and Testing

4.1 nginx.conf

#user  nobody;
worker_processes  1;
events {
    worker_connections  1024;
}
http {
    include       mime.types;
    default_type  application/octet-stream;
    log_format my_format '$remote_addr^A$msec^A$http_host^A$request_uri';
    sendfile        on;
    keepalive_timeout  65;
    server {
        listen       80;
        server_name  localhost;
        location / {
            root   html;
            index  index.html index.htm;
        }
        location = /log.gif {
            default_type image/gif;
            access_log /opt/data/access.log my_format;
        }
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}
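The my_format log line separates its four fields with the ^A control character (\u0001). As a reference for the ETL chapter later, here is a minimal Java sketch that splits one such line; the sample values are made up.

// Minimal sketch: split one access.log line written with my_format
// ($remote_addr^A$msec^A$http_host^A$request_uri), assuming ^A is \u0001.
public class AccessLogLine {
    public static void main(String[] args) {
        String line = "192.168.1.10\u00011449137597.974\u0001node1\u0001/log.gif?en=e_pv&pl=website";
        String[] fields = line.split("\u0001", -1);
        String remoteAddr = fields[0];   // client IP
        String msec       = fields[1];   // request time in seconds.milliseconds
        String httpHost   = fields[2];   // Host header
        String requestUri = fields[3];   // /log.gif?... carrying the event parameters
        System.out.println(remoteAddr + " " + msec + " " + httpHost + " " + requestUri);
    }
}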

4.2 Local log file

Figure 4.1  Log data sent back by the test pages

Chapter 5  Flume Configuration and Testing

5.1 Flume configuration

At this point the data can already be collected. Since the data ultimately has to end up on HDFS, Flume is used to ship it. The configuration is as follows.

    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /opt/data/access.log
    a1.sinks.k1.type=hdfs
    a1.sinks.k1.hdfs.path=hdfs://log/%Y%m%d
    a1.sinks.k1.hdfs.rollCount=0
    a1.sinks.k1.hdfs.rollInterval=0
    a1.sinks.k1.hdfs.rollSize=10240
    a1.sinks.k1.hdfs.idleTimeout=5
    a1.sinks.k1.hdfs.fileType=DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp=true
    a1.sinks.k1.hdfs.callTimeout=40000
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

5.2 Testing Flume

Verify that Flume writes the data to HDFS.

Figure 5.1  Checking that Flume writes the data to HDFS

Chapter 6  ETL

6.1 Project structure

Figure 6.1  Project structure

6.2 Testing the data filter

Records that do not match the expected format are filtered out; valid records are stored in HBase. A sketch of such a filter mapper is given below.
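As an illustration of this filtering step, here is a minimal mapper sketch that drops malformed lines and writes valid ones to HBase. The log column family and eventlog table name follow the Hive mapping used later in this report; the row-key layout and the field checks are assumptions, and put.add is the HBase 0.98 API of the development environment (newer versions use addColumn).

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative ETL mapper: drop malformed lines, write valid events to HBase.
public class EventLogFilterMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] FAMILY = Bytes.toBytes("log");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\u0001", -1);
        if (fields.length != 4 || !fields[3].contains("en=")) {
            return; // malformed record: filter it out
        }
        String rowKey = fields[1] + "_" + fields[0];            // e.g. time_ip (illustrative)
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(FAMILY, Bytes.toBytes("ip"), Bytes.toBytes(fields[0]));
        put.add(FAMILY, Bytes.toBytes("request"), Bytes.toBytes(fields[3]));
        context.write(new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put);
    }
}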

Figure 6.2  Testing the data filter

6.3 Inspecting the HBase table

The data now stored in HBase is no longer dirty data.

Figure 6.3  Clean data in HBase

Chapter 7  New-User Metric Design

7.1 Design of the new-user metric

Basic user information module: new users keyed by (date)
Browser analysis module: new users keyed by (date, browser)

Example input:
20180810 www.bjsxt.com zhangsan firefox-48
20180810 www.bjsxt.com lisi firefox-53

MapReduce design, version 1:
1) map output:
20180810              zhangsan
20180810,firefox-48   zhangsan
20180810,firefox-all  zhangsan
20180810              lisi
20180810,firefox-53   lisi
20180810,firefox-all  lisi
2) reduce output (count of distinct users per key):
20180810              [zhangsan, lisi]   2
20180810,firefox-48   [zhangsan]         1
20180810,firefox-53   [lisi]             1
20180810,firefox-all  [zhangsan, lisi]   2

MapReduce design, version 2: make the module part of the key
(date, "user")              -> basic user information module
(date, browser, "browser")  -> browser analysis module

Example input:
20180810 www.bjsxt.com zhangsan firefox-48
20180810 www.bjsxt.com lisi firefox-53

1) map output:
20180810,user                 zhangsan
20180810,firefox-48,browser   zhangsan
20180810,firefox-all,browser  zhangsan
20180810,user                 lisi
20180810,firefox-53,browser   lisi
20180810,firefox-all,browser  lisi
2) reduce output:
20180810,user                 [zhangsan, lisi]   2
20180810,firefox-48,browser   [zhangsan]         1
20180810,firefox-53,browser   [lisi]             1
20180810,firefox-all,browser  [zhangsan, lisi]   2

A Java sketch of this mapper/reducer pair is given below.
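The following compact Java sketch mirrors the second design above; the input field order and the derivation of the browser "-all" roll-up key are assumptions made for illustration.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative new-user MR: map emits (date[,browser],module) -> user,
// reduce counts distinct users per key, as in the design above.
public class NewUserMR {
    public static class NewUserMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // assumed input layout: date  site  user  browser-version
            String[] f = value.toString().split("\\s+");
            String date = f[0], user = f[2], browser = f[3];
            // roll-up key: firefox-48 -> firefox-all, etc.
            String family = browser.contains("-") ? browser.substring(0, browser.indexOf('-')) : browser;
            ctx.write(new Text(date + ",user"), new Text(user));
            ctx.write(new Text(date + "," + browser + ",browser"), new Text(user));
            ctx.write(new Text(date + "," + family + "-all,browser"), new Text(user));
        }
    }
    public static class NewUserReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> users, Context ctx)
                throws IOException, InterruptedException {
            Set<String> distinct = new HashSet<>();
            for (Text u : users) distinct.add(u.toString());
            ctx.write(key, new IntWritable(distinct.size()));
        }
    }
}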

7.2 MySQL table design

The SQL statements for the tables used in this project are listed below.

DROP TABLE IF EXISTS dimension_browser;
CREATE TABLE dimension_browser (
id int(11) NOT NULL AUTO_INCREMENT,
browser_name varchar(45) NOT NULL DEFAULT '' COMMENT 'browser name',
browser_version varchar(255) NOT NULL DEFAULT '' COMMENT 'browser version',
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT COMMENT='browser dimension table';
DROP TABLE IF EXISTS dimension_currency_type;
CREATE TABLE dimension_currency_type (
id int(11) NOT NULL AUTO_INCREMENT,
currency_name varchar(10) DEFAULT NULL COMMENT 'currency name',
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT COMMENT='payment currency type dimension table';
DROP TABLE IF EXISTS dimension_date;
CREATE TABLE dimension_date (
id int(11) NOT NULL AUTO_INCREMENT,
year int(11) DEFAULT NULL,
season int(11) DEFAULT NULL,
month int(11) DEFAULT NULL,
week int(11) DEFAULT NULL,
day int(11) DEFAULT NULL,
calendar date DEFAULT NULL,
type enum('year','season','month','week','day') DEFAULT NULL COMMENT 'date granularity',
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT COMMENT='date dimension table';

Because of space constraints, the remaining table definitions are provided in the downloadable SQL file.

The figure below shows the tables after they have been created.

Figure 7.1  Querying the tables

7.3 Project structure

Figure 7.2  Project structure

7.4 Writing the data to MySQL

After the MR jobs run, the results are written to MySQL so that the web tier can access them.
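As a minimal illustration of that hand-off, the sketch below inserts one row into the dimension_browser table from section 7.2 over JDBC; the connection URL and credentials are placeholders, and the real project writes the stats tables from its MR runners rather than from a standalone class.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Illustrative JDBC write into the dimension_browser table of section 7.2.
public class DimensionBrowserWriter {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://node1:3306/report";   // placeholder host/database
        try (Connection conn = DriverManager.getConnection(url, "hive", "hive");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO dimension_browser (browser_name, browser_version) VALUES (?, ?)")) {
            ps.setString(1, "firefox");
            ps.setString(2, "48");
            ps.executeUpdate();
        }
    }
}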

Figure 7.3  Data in the stats_device_browser table in MySQL

Chapter 8  Integrating Hive and HBase

8.1 Sqoop configuration

In this project Sqoop is used to move data between MySQL and Hive. Edit configure-sqoop and comment out the checks for services that are not installed, for example HBase, HCatalog and Accumulo:

#if [ ! -d "${HBASE_HOME}" ]; then
#  echo "Error: $HBASE_HOME does not exist!"
#  echo 'Please set $HBASE_HOME to the root of your HBase installation.'
#  exit 1

    8.2 sqoop导入

    sqoop:
    导入:
    import
    –connect
    jdbc:mysql://hadoop1:3306/test
    –username
    root
    –password
    123
    –as-textfile
    –columns
    id,name,msg
    –table
    psn
    –delete-target-dir
    –target-dir
    /sqoop/data
    -m
    1
    命令:
    sqoop –options-file sqoop1
    导入:
    import
    –connect
    jdbc:mysql://hadoop1/test
    –username
    root
    –password
    123
    –as-textfile
    –query
    ‘select id, name, msg from psn where id like “1%” and $CONDITIONS’
    –delete-target-dir
    –target-dir
    /sqoop/tmp
    -m
    1
    –hive-home
    /home/hive-1.2.1
    –hive-import
    –create-hive-table
    –hive-table
    t_test
    导出:
    export
    –connect
    jdbc:mysql://node1/test
    –username
    root
    –password
    123
    -m
    1
    –columns
    id,name,msg
    –export-dir
    /sqoop/data
    –table
    h_psn
    网站日志分析系统插图16

    图8.1 hdfs里sqoop目录

8.3 Synchronizing Hive and HBase

1. Copy hive-hbase-handler-1.2.1.jar into hbase/lib, and copy all of the HBase jars into hive/lib.

2. Add the following property to the Hive configuration file:

<property>
<name>hbase.zookeeper.quorum</name>
<value>node1,node2,node3</value>
</property>

3. Create the mapped tables in Hive:

CREATE EXTERNAL TABLE tmp_order
(key string, id string, user_id string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,order:order_id,order:user_id")
TBLPROPERTIES ("hbase.table.name" = "t_order");
CREATE TABLE hbasetbl(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

As long as the mapping between Hive and HBase is defined, the two stores stay in sync. Note that only HBase columns included in the mapping are synchronized; unmapped columns are not.

Figure 8.2  The t_order table in HBase

Figure 8.3  The t_order table in Hive

Chapter 9  User Browse Depth

9.1 Creating the tables

1. Create the Hive external table mapped to the HBase event_log table:

CREATE EXTERNAL TABLE event_logs(
key string, pl string, en string, s_time bigint, p_url string, u_ud string, u_sd string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties('hbase.columns.mapping'=':key,log:pl,log:en,log:s_time,log:p_url,log:u_ud,log:u_sd')
tblproperties('hbase.table.name'='eventlog');

2. Create the Hive table corresponding to the MySQL result table. The HQL analysis results are saved into this table and then exported to MySQL with Sqoop:

CREATE TABLE stats_view_depth (
platform_dimension_id bigint ,
data_dimension_id bigint ,
kpi_dimension_id bigint ,
pv1 bigint ,
pv2 bigint ,
pv3 bigint ,
pv4 bigint ,
pv5_10 bigint ,
pv10_30 bigint ,
pv30_60 bigint ,
pv60_plus bigint ,
created string
) row format delimited fields terminated by '\t';

3. Create a temporary table in Hive to hold the intermediate results of the HQL analysis:

CREATE TABLE stats_view_depth_tmp(pl string, date string, col string, ct bigint);

4. Write the UDFs (platform dimension & date dimension). Note: remove every place where the DimensionConvertClient class closes the FileSystem.

5. Upload transformer-0.0.1.jar to the /sxt/transformer directory on HDFS.

6. Create the Hive functions:

#create function platform_convert as 'com.sxt.transformer.hive.PlatformDimensionUDF' using jar 'hdfs://sxt/sxt/transformer/transformer-0.0.1.jar';
create function date_convert as 'com.sxt.transformer.hive.DateDimensionUDF' using jar 'hdfs://sxt/sxt/transformer/transformer-0.0.1.jar';
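Step 4 asks for the platform and date dimension UDFs. The sketch below only shows the general shape such a Hive UDF might take; the class name mirrors the create function statements above, but the lookup logic is a placeholder, since the real class resolves dimension ids through the DimensionConverter service mentioned later.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Illustrative skeleton of a dimension-converting Hive UDF such as
// platform_convert(pl): map a platform name to its dimension id.
public class PlatformDimensionUDF extends UDF {
    public IntWritable evaluate(Text platform) {
        if (platform == null) {
            return new IntWritable(0);
        }
        // placeholder lookup: the real code asks the DimensionConverter service
        int id = "website".equals(platform.toString()) ? 1 : 2;
        return new IntWritable(id);
    }
}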

7. HQL for browse depth from the user perspective (note: the date range is supplied externally):

from (
select
pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_ud,
(case when count(p_url) = 1 then "pv1"
when count(p_url) = 2 then "pv2"
when count(p_url) = 3 then "pv3"
when count(p_url) = 4 then "pv4"
when count(p_url) >= 5 and count(p_url) <10 then "pv5_10"
when count(p_url) >= 10 and count(p_url) <30 then "pv10_30"
when count(p_url) >=30 and count(p_url) <60 then "pv30_60"
else 'pv60_plus' end) as pv
from event_logs
where
en='e_pv'
and p_url is not null
and pl is not null
and s_time >= unix_timestamp('2016-06-08','yyyy-MM-dd')*1000
and s_time < unix_timestamp('2016-06-09','yyyy-MM-dd')*1000
group by
pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_ud
) as tmp
insert overwrite table stats_view_depth_tmp
select pl,day,pv,count(distinct u_ud) as ct where u_ud is not null group by pl,day,pv;

-- pivot the multi-row temporary table into one row per (platform, date)

with tmp as
(
select pl,date as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all
select pl,date as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all
select pl,date as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all
select pl,date as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all
select pl,date as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all
select pl,date as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all
select pl,date as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all
select pl,date as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all
select 'all' as pl,date as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all
select 'all' as pl,date as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all
select 'all' as pl,date as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all
select 'all' as pl,date as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all
select 'all' as pl,date as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all
select 'all' as pl,date as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all
select 'all' as pl,date as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all
select 'all' as pl,date as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'
)
from tmp
insert overwrite table stats_view_depth
select 2,3,6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2017-01-10' group by pl,date1;

8. Sqoop script to export the user-perspective results:

sqoop export --connect jdbc:mysql://hh:3306/report --username hive --password hive --table stats_view_depth --export-dir /hive/bigdater.db/stats_view_depth/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,data_dimension_id,kpi_dimension_id

9. HQL for browse depth from the session perspective (note: the date range is supplied externally):

from (
select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_sd, (case when count(p_url) = 1 then "pv1" when count(p_url) = 2 then "pv2" when count(p_url) = 3 then "pv3" when count(p_url) = 4 then "pv4" when count(p_url) >= 5 and count(p_url) <10 then "pv5_10" when count(p_url) >= 10 and count(p_url) <30 then "pv10_30" when count(p_url) >=30 and count(p_url) <60 then "pv30_60" else 'pv60_plus' end) as pv
from event_logs
where en='e_pv' and p_url is not null and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_sd
) as tmp
insert overwrite table stats_view_depth_tmp select pl,day,pv,count(distinct u_sd) as ct where u_sd is not null group by pl,day,pv;
with tmp as
(
select pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all
select pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all
select pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all
select 'all' as pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all
select 'all' as pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all
select 'all' as pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'
)
from tmp
insert overwrite table stats_view_depth select platform_convert(pl),date_convert(date),6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2015-12-13' group by pl,date;

10. Sqoop script to export the session-perspective results:

sqoop export --connect jdbc:mysql://hh:3306/report --username hive --password hive --table stats_view_depth --export-dir /hive/bigdater.db/stats_view_depth/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,data_dimension_id,kpi_dimension_id

-- 1. Create the Hive external table mapped to the HBase event_logs table

CREATE EXTERNAL TABLE event_logs(row string, pl string, en string, s_time bigint, p_url string, u_ud string, u_sd string, ca string, ac string, oid string, on string, cua bigint, cut string, pt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties('hbase.columns.mapping'=':key,info:pl,info:en,info:s_time,info:p_url,info:u_ud,info:u_sd,info:ca,info:ac, info:oid,info:on,info:cua,info:cut,info:pt')
tblproperties('hbase.table.name'='eventlog');

-- 2. Write the custom UDFs (CurrencyType & PaymentType)

-- 3. Upload transformer-0.0.1.jar to the Linux nodes and to HDFS, then start the DimensionConverterServer service

-- 4. Create the functions

create function currency_type_convert as 'com.sxt.transformer.hive.CurrencyTypeDimensionUDF' using jar 'hdfs://node1:8020/sxt/transformer/transformer-0.0.1.jar';
create function payment_type_convert as 'com.sxt.transformer.hive.PaymentTypeDimensionUDF' using jar 'hdfs://node1:8020/sxt/transformer/transformer-0.0.1.jar';
create function order_info as 'com.sxt.transformer.hive.OrderInfoUDF' using jar 'hdfs://node1:8020/sxt/transformer/transformer-0.0.1.jar';
create function order_total_amount as 'com.sxt.transformer.hive.OrderTotalAmountUDF' using jar 'hdfs://node1:8020/sxt/transformer/transformer-0.0.1.jar';

-- 5. Create the temporary tables

create table stats_order_tmp1(pl string, dt string, cut string, pt string, values bigint);
CREATE TABLE stats_order_tmp2 (platform_dimension_id bigint, date_dimension_id bigint, currency_type_dimension_id bigint, payment_type_dimension_id bigint, values bigint, created string);

-- 6. Save the order data to MySQL

Create the intermediate Hive table:

CREATE TABLE order_info (order_id string,platform string,s_time bigint ,currency_type string ,payment_type string , amount bigint);

HQL to populate the Hive table:

from event_logs
insert overwrite table order_info select oid,pl,s_time,cut,pt,cua
where en='e_crt' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000

or, equivalently:

from (select oid,pl,s_time,cut,pt,cua from event_logs
where en='e_crt' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000) as tmp
insert overwrite table order_info select oid,pl,s_time,cut,pt,cua

Sqoop script:

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table order_info --export-dir /hive/bigdater.db/order_info/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key order_id

-- 7. HQL for order count (overall)

from (select pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date,cut,pt,count(distinct oid) as orders
from event_logs
where en='e_crt' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'),cut,pt) as tmp
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,orders
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),orders,date
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as orders,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt,cut

-- 8. Sqoop export for order count (overall)

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,orders,created

-- 9. HQL for order amount (overall)

from (from (select pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date,cut,pt,oid,max(cua) as amount
from event_logs
where en='e_crt' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'),cut,pt,oid) as tmp
select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),amount,date
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt,cut

-- 10. Sqoop export for order amount (overall)

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,order_amount,created

-- 11. HQL for successfully paid order count

from (select order_info(oid,'pl') as pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date,order_info(oid,'cut') as cut,order_info(oid,'pt') as pt,count(distinct oid) as orders
from event_logs
where en='e_cs' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by order_info(oid,'pl'),from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'),order_info(oid,'cut'),order_info(oid,'pt')) as tmp
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,orders
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),orders,date
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as orders,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt,cut

-- 12. Sqoop export for successfully paid order count

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,success_orders,created

-- 13. HQL for successfully paid order amount

from (from (select order_info(oid,'pl') as pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date,order_info(oid,'cut') as cut,order_info(oid,'pt') as pt,oid,max(order_info(oid)) as amount
from event_logs
where en='e_cs' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by order_info(oid,'pl'),from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'),order_info(oid,'cut'),order_info(oid,'pt'),oid) as tmp
select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),amount,date
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt,cut

-- 14. Sqoop export for successfully paid order amount

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,revenue_amount,created

-- 15. HQL for cumulative revenue of successfully paid orders to date

from (from (select order_info(oid,'pl') as pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date,order_info(oid,'cut') as cut,order_info(oid,'pt') as pt,oid,max(order_info(oid)) as amount
from event_logs
where en='e_cs' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by order_info(oid,'pl'),from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'),order_info(oid,'cut'),order_info(oid,'pt'),oid) as tmp
select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),order_total_amount(platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),cast(amount as int),'revenue') as amount,date
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),cast(sum(values) as int),"revenue") as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),cast(sum(values) as int),"revenue") as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),cast(sum(values) as int),"revenue") as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),cast(sum(values) as int),"revenue") as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),cast(sum(values) as int),"revenue") as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),cast(sum(values) as int),"revenue") as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),cast(sum(values) as int),"revenue") as amount,dt group by pl,dt,cut

-- 16. Sqoop export for cumulative revenue of successfully paid orders

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,total_revenue_amount,created

-- 17. HQL for refunded order count

from (select order_info(oid,'pl') as pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date,order_info(oid,'cut') as cut,order_info(oid,'pt') as pt,count(distinct oid) as orders
from event_logs
where en='e_cr' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by order_info(oid,'pl'),from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'),order_info(oid,'cut'),order_info(oid,'pt')) as tmp
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,orders
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),orders,date
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as orders,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt,cut

-- 18. Sqoop export for refunded order count

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,refund_orders,created

-- 19. HQL for refunded order amount

from (from (select order_info(oid,'pl') as pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date,order_info(oid,'cut') as cut,order_info(oid,'pt') as pt,oid,max(order_info(oid)) as amount
from event_logs
where en='e_cr' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by order_info(oid,'pl'),from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'),order_info(oid,'cut'),order_info(oid,'pt'),oid) as tmp
select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),amount,date
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt,cut

-- 20. Sqoop export for refunded order amount

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,refund_amount,created

-- 21. HQL for cumulative refunded amount to date

from (from (select order_info(oid,'pl') as pl,from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date,order_info(oid,'cut') as cut,order_info(oid,'pt') as pt,oid,max(order_info(oid)) as amount
from event_logs
where en='e_cr' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by order_info(oid,'pl'),from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'),order_info(oid,'cut'),order_info(oid,'pt'),oid) as tmp
select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),order_total_amount(platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),cast(amount as int),'refund') as amount,date
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),cast(sum(values) as int),'refund') as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),cast(sum(values) as int),'refund') as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),cast(sum(values) as int),'refund') as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),cast(sum(values) as int),'refund') as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),cast(sum(values) as int),'refund') as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),cast(sum(values) as int),'refund') as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),cast(sum(values) as int),'refund') as amount,dt group by pl,dt,cut

-- 22. Sqoop export for cumulative refunded amount to date

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,total_refund_amount,created

The figures below show the browse depth results (pv1, pv2, pv3, ...).

Figure 9.1  User browse depth

Figure 9.2  The stats_view_depth table

Figure 9.3  The stats_view_depth_tmp table

Chapter 10  Project Tuning

I. Purpose of tuning
Make full use of the machines' capacity so that MR jobs finish faster and, on limited hardware, so that enough MR jobs can run at the same time.

II. Overview
From the internal execution model of an MR job we know that it consists of a mapper phase and a reducer phase. The mapper phase includes reading the data, the map processing, and writing the output (sort & merge); the reducer phase includes fetching the mapper output, merging the data (sort & merge), the reduce processing, and writing the output. Of these sub-phases, the ones where tuning pays off most are the map output, the reducer-side merge, and the number of reducers. In other words, although performance tuning covers CPU, memory, disk I/O and network, the MR execution flow tells us the main levers are memory, disk I/O and network: tuning an MR job is mostly about reducing network transfer and disk I/O. The tuning in this project therefore covers five areas: server tuning, code tuning, mapper tuning, reducer tuning and runner tuning.

III. Server tuning
Server tuning covers server parameter tuning and JVM tuning. Because HBase is used as the store for the raw data being analyzed, some HBase tuning is needed as well. Beyond parameter tuning, JVM tuning is needed just as for any other Java program.

HDFS tuning
1. dfs.datanode.failed.volumes.tolerated: number of failed disks a datanode tolerates; default 0, meaning no disk failure is tolerated. Raise it when multiple disks are mounted.
2. dfs.replication: replication factor, default 3.
3. dfs.namenode.handler.count: number of concurrent namenode handler threads, default 10.
4. dfs.datanode.handler.count: number of concurrent handler threads between datanodes, default 10.
5. dfs.datanode.max.transfer.threads: number of concurrent threads a datanode uses for data streaming, default 4096.
Usually set to 85%~90% of the Linux open-file limit. Check the limit with ulimit -a and change it in /etc/security/limits.conf; do not set it too high.
Append at the end of the file:
* soft nofile 65535
* hard nofile 65535
Note: the file-handle limit must not be too large; any value below 1000000 is acceptable, and -1 is generally not used.
Troubleshooting: if the limit is set very high, re-login may fail with an "unable to load session" message; in that case fix it in single-user mode.
Single-user mode:
During boot press 'a' to enter the selection screen, press 'e' to enter the kernel edit screen, select the second line starting with 'kernel...', press 'e' to edit it, append a space plus "single" at the end, press Enter to return to the edit screen, then press 'b' to boot into single-user mode. Once booted, restore the file, save, exit and reboot.
6. io.file.buffer.size: buffer size for reading/writing data, default 4096. Usually left alone; if changed, use an integer multiple of 4096 (an integer multiple of the physical page size).

HBase tuning
1. Set the regionserver heap size; default 1 GB, 4 GB recommended.
Edit conf/hbase-env.sh and set HBASE_HEAPSIZE=4g.
2. hbase.regionserver.handler.count: number of concurrent client handler threads, default 10. Rule of thumb: use a smaller value when puts and scans dominate, and a larger value when gets and deletes dominate, to avoid memory problems caused by frequent GC.
3. Manage HBase splits and compactions yourself. By default HBase splits a region automatically when it reaches hbase.hregion.max.filesize (10 GB); 20~500 regions per regionserver is recommended. Compaction is a very important HBase maintenance mechanism but is expensive in memory and CPU, so on heavily loaded machines it is usually better to disable automatic compaction and trigger it manually.
4. hbase.balancer.period: load-balancer period, default 300000 (5 minutes). On heavily loaded clusters this can be increased.
5. hfile.block.cache.size: fraction of the heap used for the HFile block cache, default 0.4. Increase it on read-heavy systems and decrease it on write-heavy systems, but do not set it to 0.
6. hbase.regionserver.global.memstore.upperLimit: upper bound on the fraction of the heap used by memstores, default 0.4. When it is reached, a flush writes the contents to disk.
7. hbase.regionserver.global.memstore.lowerLimit: lower bound on the memstore heap fraction, default 0.38. After a flush, the memstore fraction must not exceed this value.
8. hbase.hregion.memstore.flush.size: flush the memstore when it grows beyond this value; default 134217728 (128 MB).
9. hbase.hregion.memstore.block.multiplier: memstore blocking multiplier, default 4. Writes are blocked when a memstore exceeds 4 * hbase.hregion.memstore.flush.size; otherwise an OOM error could eventually occur.

MapReduce tuning
1. mapreduce.task.io.sort.factor: number of files opened at once during merge sort, default 10.
2. mapreduce.task.io.sort.mb: memory used for merge sort and for the map-side output buffer, default 100 MB.
3. mapreduce.map.sort.spill.percent: threshold at which the map output buffer is spilled, default 0.80.
4. mapreduce.reduce.shuffle.parallelcopies: number of threads a reducer uses to copy map output, default 5.
5. mapreduce.reduce.shuffle.input.buffer.percent: fraction of the reducer heap used while copying map data, default 0.70. Increasing it reduces disk spills of map data and can improve performance.
6. mapreduce.reduce.shuffle.merge.percent: threshold at which the reducer starts merging and spilling during shuffle, default 0.66. If possible, increasing it reduces the number of spills and improves performance; use it together with mapreduce.reduce.shuffle.input.buffer.percent.
7. mapreduce.task.timeout: how long a task may go without reporting progress before it is considered failed, default 600000 (10 minutes); 0 disables the check.

IV. Code tuning
Code tuning mainly means hoisting objects that are created repeatedly inside the mapper and reducer out of the hot path, the same as for any ordinary Java program.

V. Mapper tuning
Mapper tuning has a single goal: reduce the amount of map output. This is done by adding a combine phase and by compressing the map output, as sketched after this section.
About combine:
A custom combiner extends the Reducer class. It takes the map output key/value pairs as both its input and its output, and its purpose is to reduce network output by pre-merging part of the data on the map node.
It is best suited to numeric map output that can simply be aggregated.
Compression:
When submitting the job, enable compression and choose the codec.
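A minimal sketch of these two mapper-side knobs, registering a combiner and compressing the map output, is shown below; the job name and the use of IntSumReducer as the combiner are illustrative, and the codec line assumes Snappy is available on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Illustrative mapper-side tuning: combiner + compressed map output.
public class MapperTuningExample {
    public static Job configure(Configuration conf) throws Exception {
        // compress the map output (items 7 and 8 of the runner parameters)
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "mapper-tuning-demo");
        // combine phase: pre-aggregate (Text, IntWritable) pairs on the map side
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setCombinerClass(IntSumReducer.class);
        return job;
    }
}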
VI. Reducer tuning
Reducer tuning is done through parameter tuning and by choosing the number of reducers.
Tuning the number of reducers:
Requirement: one reducer and multiple reducers must produce the same result; adding reducers must not change the output.
Rule of thumb: when an MR job runs on the cluster, after the map phase reaches 100% you want to see the reduce phase reach 33% as early as possible; check with hadoop job -status job_id or on the web UI.
Reason: the number of map tasks is determined by the RecordReaders returned by the InputFormat, while a reducer consists of three parts: reading the mapper output, merging all of it, and the reduce processing. The first part depends on the map phase, so when the data volume is large and a single reducer cannot meet the performance requirement, increasing the number of reducers solves the problem.
Advantage: makes full use of the cluster.
Drawback: some MR jobs cannot benefit from multiple reducers, for example a top-N job.

VII. Runner tuning
Runner tuning means setting job parameters at submission time, either in code or in the XML configuration.
Items 1~8 below are used in ActiveUserRunner (the before and configure methods); item 9 is used in TransformerBaseRunner (the initScans method).

1. mapred.child.java.opts: JVM options for the child processes, effective for both map and reduce; default -Xmx200m.
2. mapreduce.map.java.opts: JVM options for the map-phase child process; empty by default, in which case mapred.child.java.opts is used.
3. mapreduce.reduce.java.opts: JVM options for the reduce-phase child process; empty by default, in which case mapred.child.java.opts is used.
4. mapreduce.job.reduces: number of reducers, default 1. Can also be changed with job.setNumReduceTasks.
5. mapreduce.map.speculative: whether map-phase speculative execution is enabled, default true; in practice false is usually better. Can be set with job.setMapSpeculativeExecution.
6. mapreduce.reduce.speculative: whether reduce-phase speculative execution is enabled, default true; in practice false is usually better. Can be set with job.setReduceSpeculativeExecution.
7. mapreduce.map.output.compress: whether map output compression is enabled, default false. Set it to true when network transfer needs to be reduced.
8. mapreduce.map.output.compress.codec: the map output codec, default org.apache.hadoop.io.compress.DefaultCodec; SnappyCodec is recommended (older versions required a separate installation, see http://www.cnblogs.com/chengxin1982/p/3862309.html).
9. HBase scan parameters
By default HBase fetches rows one at a time, so a mapper scanning HBase makes one round trip per record. Setting the scan cache lets it fetch several rows at once and reduces network traffic, as in the sketch below.
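The following minimal sketch pulls the runner settings above together with the scan cache from item 9; the concrete values are only examples from the discussion, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.mapreduce.Job;

// Illustrative runner tuning: child JVM options, reducer count,
// speculative execution, and HBase scan caching (item 9).
public class RunnerTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.map.java.opts", "-Xmx512m");       // map child JVM
        conf.set("mapreduce.reduce.java.opts", "-Xmx1024m");   // reduce child JVM

        Job job = Job.getInstance(conf, "runner-tuning-demo");
        job.setNumReduceTasks(3);                 // mapreduce.job.reduces
        job.setMapSpeculativeExecution(false);    // mapreduce.map.speculative
        job.setReduceSpeculativeExecution(false); // mapreduce.reduce.speculative

        // item 9: fetch multiple rows per RPC when a mapper scans HBase
        Scan scan = new Scan();
        scan.setCaching(500);        // rows per round trip; default is one at a time
        scan.setCacheBlocks(false);  // avoid polluting the block cache during MR scans
        System.out.println("scan caching = " + scan.getCaching());
    }
}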

Chapter 11  Course Design Summary

This is the second time I have worked on this topic, but completing the whole practical project still taught me a great deal, because it was the first project I did after teaching myself the Hadoop components. The project comes from one of the three practical projects in the big-data curriculum of Beijing Shangxuetang class 1903. The teaching this semester helped me fill the gaps left over from self-study, and the project made me realize how much I still lack; I ran into many problems along the way. Although the code was provided by Shangxuetang, understanding each piece took me a long time, which showed me that the project is not easy. I am also not yet familiar with the coding conventions it uses, since I normally do not follow them when writing code, so reading code that makes heavy use of design patterns was hard for me. I also hit many problems while configuring the environment: some were careless mistakes in the configuration, others came from my limited understanding of networking and of how these components work, so on the cluster I was not always sure which configuration belonged where. Integrating the Hadoop components so that they all run correctly and share data also caused problems. Looking back on this project, I can clearly see my shortcomings: even on a second attempt, mistakes I had not made before still appeared, caused by my own carelessness and ignorance. So, on the road ahead, I need to work harder so that my abilities can improve further.
