Presto User Guide

Last updated: 2024-08-15

Overview

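Presto is a data query engine developed by Facebook. It supports fast, interactive analysis of massive datasets and works with many data sources, including Hive and relational databases. BOS offers low cost, high performance, high reliability, and high throughput, so a growing number of enterprises use it as the storage layer for big data. This article briefly introduces how to use Presto with BOS.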

Prerequisites

Install and configure Hive as described in the Hive User Guide.

Installation and configuration

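The version installed here is 349; for the procedure, see the Presto deployment guide. If the local machine also acts as a worker, set the following in config.properties: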

# Must be set to true
node-scheduler.include-coordinator=true

In etc/catalog/hive.properties, use the following configuration:

connector.name=hive-hadoop2
# These paths must be correct
hive.config.resources=/ssd2/hadoop-3.3.2/etc/hadoop/core-site.xml,/ssd2/hadoop-3.3.2/etc/hadoop/hdfs-site.xml
hive.metastore.uri=thrift://127.0.0.1:9083
hive.allow-drop-table=false
hive.storage-format=ORC
hive.metastore-cache-ttl=1s
hive.metastore-refresh-interval=1s
hive.metastore-timeout=35m
hive.max-partitions-per-writers=1000
hive.cache.enabled=true
hive.cache.location=/opt/hive-cache
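
Because the paths listed in hive.config.resources must be correct, it can help to check them before starting the server. The following is a minimal Python sketch and not part of Presto: the helper names parse_properties and missing_resources are made up for illustration, and the parser only handles simple key=value lines. Like Java's properties loader, it treats # as a comment marker only at the start of a line, so any text placed after a value on the same line is read as part of that value.

```python
import os


def parse_properties(text):
    """Parse simple key=value lines; lines starting with # or ! are comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_resources(props):
    """Return the hive.config.resources entries that are not existing files."""
    entries = props.get("hive.config.resources", "").split(",")
    paths = [entry.strip() for entry in entries]
    return [path for path in paths if path and not os.path.isfile(path)]
```

A possible use is to read etc/catalog/hive.properties, pass its text to parse_properties, and refuse to start the server if missing_resources returns a non-empty list.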

Copy the BOS filesystem JAR into plugin/hive-hadoop2/, then run:

./bin/launcher start

This starts presto-server. Next, connect with the CLI:

./presto-cli --server localhost:8881 --catalog hive --schema default

Then run a query:

presto:default> use hive;
USE
presto:hive> select * from hive_test limit 10;
     a   |    b
-------+----------
 11027 |  "11345"
 10227 |  "24281"
 32535 |  "16409"
 24286 |  "24435"
  2498 |  "10969"
 16662 |  "16163"
  5345 |  "26005"
 21407 |  "5365"
 30608 |  "4588"
 19686 |  "11831"
 (10 rows)

Query 20230601_084130_00004_dzvjb, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [59.4K rows, 831KB] [28.9K rows/s, 404KB/s]

In the example above, Presto queried data stored in BOS.
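
For scripted checks, the same query can be run without an interactive session. The sketch below is only an illustration under stated assumptions: it assumes the presto-cli executable is in the current directory and the server listens on localhost:8881, as in the session above, and it relies on the CLI's --execute and --output-format options. The function names build_cli_command and run_query are hypothetical.

```python
import subprocess


def build_cli_command(sql, cli="./presto-cli", server="localhost:8881",
                      catalog="hive", schema="default"):
    """Build the argument list for a single non-interactive CLI query."""
    return [
        cli,
        "--server", server,
        "--catalog", catalog,
        "--schema", schema,
        "--output-format", "CSV",
        "--execute", sql,
    ]


def run_query(sql, **cli_options):
    """Run the query and return the CLI's stdout as a list of lines.

    Raises subprocess.CalledProcessError if the CLI exits with an error.
    """
    result = subprocess.run(
        build_cli_command(sql, **cli_options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

For example, run_query("select * from hive_test limit 10", schema="hive") would issue the query shown above once the server is running.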

Accessing BOS from Presto over S3

Presto can access data stored in BOS only through Hive, but Hive can reach BOS in two ways: through bos-hdfs, as described above, or directly over the S3 protocol.

Hive configuration

Install the metastore:

wget "https://repo1.maven.org/maven2/org/apache/hive/hive-standalone-metastore/3.1.2/hive-standalone-metastore-3.1.2-bin.tar.gz"
tar -zxvf hive-standalone-metastore-3.1.2-bin.tar.gz
sudo mv apache-hive-metastore-3.1.2-bin /usr/local/metastore
sudo chown user:user /usr/local/metastore

With hive-standalone-metastore downloaded and in place, replace the bundled Guava JAR and add the required JARs:

rm /usr/local/metastore/lib/guava-19.0.jar
cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar \
  /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-3.2.1.jar \
  /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar \
  /usr/local/metastore/lib/

Then add the following properties to /usr/local/metastore/conf/metastore-site.xml, inside its <configuration> element:

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
</property>
<property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
</property>
<property>
    <name>fs.s3a.access.key</name>
    <value>S3_ACCESS_KEY</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>S3_SECRET_KEY</value>
</property>
<property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
</property>
<property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
</property>
<property>
    <name>fs.s3a.endpoint</name>
    <value>S3_ENDPOINT</value>
</property>

fs.s3a.endpoint is the S3 endpoint. The S3 endpoints for BOS are listed in the BOS S3 domain names documentation.
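
If the S3 values are filled in by a script rather than by hand, the fs.s3a properties can be generated. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and S3_ENDPOINT, S3_ACCESS_KEY, and S3_SECRET_KEY remain placeholders for your own values. It emits only the five fs.s3a properties from the listing above, wrapped in a configuration root element.

```python
import xml.etree.ElementTree as ET


def s3a_settings(endpoint, access_key, secret_key):
    """Return the fs.s3a properties used above as a name -> value mapping."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.connection.ssl.enabled": "false",
        "fs.s3a.path.style.access": "true",
        "fs.s3a.endpoint": endpoint,
    }


def build_configuration(settings):
    """Build a <configuration> element with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


def render_s3a_xml(endpoint, access_key, secret_key):
    """Serialize the S3 settings as an XML string (values are escaped)."""
    root = build_configuration(s3a_settings(endpoint, access_key, secret_key))
    return ET.tostring(root, encoding="unicode")
```

The resulting property elements can be merged into metastore-site.xml alongside the JDBC settings.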

Start the Hive metastore:

/usr/local/metastore/bin/start-metastore &

Presto configuration

The Presto version stays the same. Change the contents of hive.properties to:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.path-style-access=true
hive.s3.endpoint=S3_ENDPOINT
hive.s3.aws-access-key=S3_ACCESS_KEY
hive.s3.aws-secret-key=S3_SECRET_KEY
hive.s3.ssl.enabled=false
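
The S3 settings in hive.properties mirror the fs.s3a settings in metastore-site.xml, so a quick consistency check can catch a mismatch between the two files. The sketch below is an illustration only: the helper names and the key pairing in S3_KEY_PAIRS are assumptions drawn from the two listings in this guide, not an official mapping.

```python
import xml.etree.ElementTree as ET

# Pairing inferred from the two configuration listings in this guide.
S3_KEY_PAIRS = {
    "fs.s3a.endpoint": "hive.s3.endpoint",
    "fs.s3a.access.key": "hive.s3.aws-access-key",
    "fs.s3a.secret.key": "hive.s3.aws-secret-key",
    "fs.s3a.path.style.access": "hive.s3.path-style-access",
    "fs.s3a.connection.ssl.enabled": "hive.s3.ssl.enabled",
}


def read_site_xml(xml_text):
    """Read <property> name/value pairs from a Hadoop-style XML document."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            settings[name] = (prop.findtext("value") or "").strip()
    return settings


def read_properties(text):
    """Read simple key=value lines, trimming whitespace for comparison."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith(("#", "!")) and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def mismatched_settings(site, props):
    """Return the (xml key, properties key) pairs whose values differ."""
    return [(s, p) for s, p in S3_KEY_PAIRS.items() if site.get(s) != props.get(p)]
```

An empty list from mismatched_settings means that both files agree on the endpoint, credentials, path-style access, and SSL setting.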

Start Presto:

/usr/local/trino/bin/launcher start

Create a schema

-- Note: location must start with s3a://. For a bucket named my-bos-bucket, use a location beginning with s3a://my-bos-bucket/.
CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://my-bos-bucket/');

-- Create the table
CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
  sepal_length DOUBLE,
  sepal_width  DOUBLE,
  petal_length DOUBLE,
  petal_width  DOUBLE,
  class        VARCHAR
)
WITH (
  external_location = 's3a://my-bos-bucket/iris_parquet',
  format = 'PARQUET'
);

SELECT 
  sepal_length,
  class
FROM hive.iris.iris_parquet 
LIMIT 10;
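
The location rule above (an s3a:// prefix followed by the bucket name) is easy to get wrong when statements are generated by a script. The following is a small sketch with hypothetical helper names that builds the location and the CREATE SCHEMA statement from a bucket name. The catalog name hive and the bucket my-bos-bucket are taken from the example above.

```python
def s3a_location(bucket, prefix=""):
    """Return an s3a:// URI for a bucket and an optional key prefix."""
    if not bucket or "/" in bucket:
        raise ValueError(f"invalid bucket name: {bucket!r}")
    return f"s3a://{bucket}/{prefix.strip('/')}"


def create_schema_sql(schema, bucket, catalog="hive"):
    """Build a CREATE SCHEMA statement whose location points at the bucket."""
    return (
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}\n"
        f"WITH (location = '{s3a_location(bucket)}')"
    )
```

Here s3a_location("my-bos-bucket", "iris_parquet") yields the external_location used for the table above.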
