基于BOS的Presto实践

更新时间：2023-07-07

概览

Presto是Facebook开发的数据查询引擎，可对海量数据进行快速地交互式分析，支持Hive，关系数据库等多种数据源。由于 BOS 在超低价格、超高性能、高可靠和高吞吐的强大存储优势，越来越多企业选择 BOS 作为大数据的存储媒介。因此，本文将对Presto在 BOS 上的使用方法作一个简要的介绍。

基于BOS HDFS的Presto使用

BOS HDFS

首先参考BOS HDFS一文安装并配置BOS HDFS，本机安装的hadoop版本为hadoop-3.3.2，参考文中"开始使用"一节完成BOS HDFS的基本试用，并设置环境变量：

Plain Text

1export HADOOP_HOME=/opt/hadoop-3.3.2
2export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath

Hive

hive是基于Hadoop的一个数据仓库工具，用来进行数据提取、转化、加载，这是一种可以存储、查询和分析存储在Hadoop中的大规模数据的机制。hive数据仓库工具能将结构化的数据文件映射为一张数据库表，并提供SQL查询功能，能将SQL语句转变成MapReduce任务来执行.

安装mysql

mysql用来存储hive的元数据，可以选择本地安装，也可以直接远程连接已安装的mysql或者RDS。本机安装的版本是：mysql-5.1.61-4.el6.x86_64 安装完成之后，可用 service mysqld status查看运行状态，并使用

Plain Text

1/usr/bin/mysqladmin -u root -h ${IP} password ${new-password} #设置新密码

可以在mysql中创建hive专用的用户，并设置密码。

安装hive

本机安装的版本为2.3.9.

修改conf文件夹下的两个配置：

Plain Text

1mv hive-env.sh.template hive-env.sh
2mv hive-site.xml.template hive-site.xml

在hive-env.sh中添加：

Plain Text

1export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf

在hive-site.xml中添加：

Plain Text

1<property>
2      <name>javax.jdo.option.ConnectionURL</name>
3      <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
4      <description>MySQL</description>
5  </property>
6  <property>
7        <name>javax.jdo.option.ConnectionDriverName</name>
8        <value>com.mysql.jdbc.Driver</value>
9        <description>JDBC</description>
10  </property>
11  <property>
12        <name>javax.jdo.option.ConnectionUserName</name>
13        <value>root</value>
14        <description>username</description>
15        </property>
16 <property>
17        <name>javax.jdo.option.ConnectionPassword</name>
18        <value>new-password</value>
19        <description>passward</description>
20 </property>

配置中 javax.jdo.option.ConnectionURL 是指mysql server的连接地址，javax.jdo.option.ConnectionUserName 就是用于hive的mysql用户名，javax.jdo.option.ConnectionPassword 是用户名对应的密码。设置完成后，把mysql的JBDC驱动复制到lib文件夹下，本机采用的驱动为 mysql-connector-java-5.1.32-bin.jar.

初始化msyql

Plain Text

1./bin/schematool -dbType mysql -initSchema

启动hive

Plain Text

1./bin/hive

hive测试

创建表

Plain Text

1create database hive;      // 创建数据库
2create table hive_test (a int, b string) //创建表
3ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

新建一个shell脚本，名为gen_data.sh

Plain Text

1#!/bin/bash
2MAXROW=1000000 #指定生成数据行数
3for((i = 0; i < $MAXROW; i++))
4do
5   echo $RANDOM, \"$RANDOM\"
6done

运行脚本，生成测试数据

Plain Text

1./gen_data.sh > hive_test.data

把数据加载到表中

Plain Text

1load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;

查询

Plain Text

1hive> select count(*) from hive_test;
2WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
3Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
4Total jobs = 1
5Launching Job 1 out of 1
6Number of reduce tasks determined at compile time: 1
7In order to change the average load for a reducer (in bytes):
8  set hive.exec.reducers.bytes.per.reducer=<number>
9In order to limit the maximum number of reducers:
10  set hive.exec.reducers.max=<number>
11In order to set a constant number of reducers:
12  set mapreduce.job.reduces=<number>
13Job running in-process (local Hadoop)
142023-05-28 17:30:16,548 Stage-1 map = 0%,  reduce = 0%
152023-05-28 17:30:18,558 Stage-1 map = 100%,  reduce = 100%
16Ended Job = job_local238749048_0001
17MapReduce Jobs Launched:
18Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
19Total MapReduce CPU Time Spent: 0 msec
20OK
211000000
22
23hive> select * from hive_test limit 10;
24OK
2511027	 "11345"
2610227	 "24281"
2732535	 "16409"
2824286	 "24435"
292498	 "10969"
3016662	 "16163"
315345	 "26005"
3221407	 "5365"
3330608	 "4588"
3419686	 "11831"

Presto

安装版本为349，可参考presto安装一文的过程。其中，如果本机也作为worker，在config.properties中，

Plain Text

1node-scheduler.include-coordinator=true #需要改为true

在etc/catalog/hive.properties中，配置为：

Plain Text

1connector.name=hive-hadoop2 
2hive.config.resources=/ssd2/hadoop-3.3.2/etc/hadoop/core-site.xml,/ssd2/hadoop-3.3.2/etc/hadoop/hdfs-site.xml #这里的地址一定要正确
3hive.metastore.uri=thrift://127.0.0.1:9083 
4hive.allow-drop-table=false
5hive.storage-format=ORC
6hive.metastore-cache-ttl=1s
7hive.metastore-refresh-interval=1s
8hive.metastore-timeout=35m
9hive.max-partitions-per-writers=1000
10hive.cache.enabled=true
11hive.cache.location=/opt/hive-cache

把bos filesystem的jar包复制到plugin/hive-hadoop2/下，之后运行：

Plain Text

1./bin/launcher start

启动presto-server，

Plain Text

1./presto-cli --server localhost:8881 --catalog hive --schema default

运行

Plain Text

1presto:default>use hive;
2 USE
3 presto:hive>select * from hive_test limit 10;
4     a   |    b
5-------+----------
6 11027 |  "11345"
7 10227 |  "24281"
8 32535 |  "16409"
9 24286 |  "24435"
10  2498 |  "10969"
11 16662 |  "16163"
12  5345 |  "26005"
13 21407 |  "5365"
14 30608 |  "4588"
15 19686 |  "11831"
16 (10 rows)
17
18Query 20230601_084130_00004_dzvjb, FINISHED, 1 node
19Splits: 18 total, 18 done (100.00%)
20[Latency: client-side: 0:02, server-side: 0:02] [59.4K rows, 831KB] [28.9K rows/s, 404KB/s]

在以上示例中，我们就通过presto查询到了存储在bos中的数据。

基于S3的presto访问

presto访问存储在BOS中的数据只能是通过hive，但是hive访问BOS中的数据有两种方式，第一种就是通过上述的介绍，基于bos-hdfs；第二种就是直接通过S3协议访问BOS。

hive配置

参考安装metastore中的步骤：

Plain Text

1wget "https://repo1.maven.org/maven2/org/apache/hive/hive-standalone-metastore/3.1.2/hive-standalone-metastore-3.1.2-bin.tar.gz"
2tar -zxvf hive-standalone-metastore-3.1.2-bin.tar.gz
3sudo mv apache-hive-metastore-3.1.2-bin /usr/local/metastore
4sudo chown user:user /usr/local/metastore

下载并使用hive-standalone-metastore，增加必需的jar包：

Plain Text

1rm /usr/local/metastore/lib/guava-19.0.jar
2cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar \
3  /usr/local/metastore/lib/
4cp /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-3.2.1.jar \
5  /usr/local/metastore/lib/
6cp /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar \
7  /usr/local/metastore/lib/

之后配置/usr/local/metastore/conf/metastore-site.xml

Plain Text

1<property>
2    <name>javax.jdo.option.ConnectionURL</name>
3    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
4</property>
5<property>
6    <name>javax.jdo.option.ConnectionDriverName</name>
7    <value>com.mysql.jdbc.Driver</value>
8</property>
9<property>
10    <name>javax.jdo.option.ConnectionUserName</name>
11    <value>hive</value>
12</property>
13<property>
14    <name>javax.jdo.option.ConnectionPassword</name>
15    <value>hive</value>
16</property>
17<property>
18    <name>hive.metastore.event.db.notification.api.auth</name>
19    <value>false</value>
20</property>
21<property>
22    <name>fs.s3a.access.key</name>
23    <value>S3_ACCESS_KEY</value>
24</property>
25<property>
26    <name>fs.s3a.secret.key</name>
27    <value>S3_SECRET_KEY</value>
28</property>
29<property>
30    <name>fs.s3a.connection.ssl.enabled</name>
31    <value>false</value>
32</property>
33<property>
34    <name>fs.s3a.path.style.access</name>
35    <value>true</value>
36</property>
37<property>
38    <name>fs.s3a.endpoint</name>
39    <value>S3_ENDPOINT</value>
40</property>

fs.s3a.endpoint 是指s3 的endponit，bos的s3 endpoint在BOS S3域名可查. 启动hive metastore:

Plain Text

1/usr/local/metastore/bin/start-metastore &

presto配置

preto的版本这里不作改变，hive.properties的内容改为：

Plain Text

1connector.name=hive-hadoop2
2hive.metastore.uri=thrift://localhost:9083
3hive.s3.path-style-access=true
4hive.s3.endpoint=S3_ENDPOINT 
5hive.s3.aws-access-key=S3_ACCESS_KEY
6hive.s3.aws-secret-key=S3_SECRET_KEY
7hive.s3.ssl.enabled=false

启动presto

Plain Text

1/usr/local/trino/bin/launcher start

#创建schema

Plain Text

1CREATE SCHEMA IF NOT EXISTS hive.iris
2WITH (location = 's3a://my-bos-bucket/'); //注意，这里的location就是以s3a开头，如果bucket为my-bos-bucket，则location应为s3a://my-bos-bucket/开头
3#创建表
4CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
5  sepal_length DOUBLE,
6  sepal_width  DOUBLE,
7  petal_length DOUBLE,
8  petal_width  DOUBLE,
9  class        VARCHAR
10)
11WITH (
12  external_location = 's3a://my-bos-bucket/iris_parquet',
13  format = 'PARQUET'
14);
15
16SELECT 
17  sepal_length,
18  class
19FROM hive.iris.iris_parquet 
20LIMIT 10;

评价此篇文章

有帮助没帮助

百度智能云

BOS 对象存储

BOS 对象存储

基于BOS的Presto实践

概览

基于BOS HDFS的Presto使用

BOS HDFS

Hive

安装mysql

安装hive

Presto

基于S3的presto访问

hive配置

presto配置