I recently started learning about big data, and the best-known platform in that space is open-source Hadoop. This post records the process of setting up the current latest version of Hadoop on Ubuntu.

Prerequisites

Before installing Hadoop, update Ubuntu with the latest available software patches.

sudo apt-get update && sudo apt-get -y dist-upgrade

Installing the JDK

Download the JDK

  • JDK download link
https://www.jb51.net/softs/551521.html
  • Create a directory for downloads
lichaoxiang@ubuntu-vm:~$ mkdir Download
  • Upload the JDK to the Linux environment
lichaoxiang@ubuntu-vm:~/Download$ ls

jdk-linux-x64.tar.gz

Install the JDK

  • Create a directory for installed software
lichaoxiang@ubuntu-vm:~$ sudo mkdir /usr/local/opt
  • Extract the JDK into the target directory
lichaoxiang@ubuntu-vm:~/Download$ sudo tar -zxvf jdk-linux-x64.tar.gz -C /usr/local/opt/

Configure the JDK

  • Edit /etc/profile to configure the environment variables
lichaoxiang@ubuntu-vm:~$ sudo vim /etc/profile

Add the following lines at the end of the file:

export JAVA_HOME=/usr/local/opt/jdk1.8.0_131
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin
  • Apply the environment variables
lichaoxiang@ubuntu-vm:~$ source /etc/profile
  • Verify the installation
lichaoxiang@ubuntu-vm:~$ java -version

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
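
You can also check the variables directly; assuming no other JDK sits earlier on the PATH, both of these should point at the new install:

echo $JAVA_HOME    # expected: /usr/local/opt/jdk1.8.0_131
which java         # expected: /usr/local/opt/jdk1.8.0_131/bin/java
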
  • Check the Linux version
lichaoxiang@ubuntu-vm:~$ more /etc/os-release

NAME="Ubuntu"
VERSION="21.04 (Hirsute Hippo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 21.04"
VERSION_ID="21.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=hirsute
UBUNTU_CODENAME=hirsute

Installing MySQL

  • Check whether MySQL is already installed
dpkg -l | grep mysql
  • Install MySQL

Note: installing via APT always gives the latest packaged version; it also configures the service to start on boot, and the command-line tools work without any manual environment-variable setup.

sudo apt-get install mysql-server

[Figure: Hadoop-Install02]

If dependency errors are reported, run the command below to fix them; if no dependency problem appears, this command is not needed.

sudo apt-get install -f
  • Verify the installation succeeded
netstat -tap | grep mysql

If the check above shows a MySQL socket in the LISTEN state, the installation succeeded.

lichaoxiang@ubuntu-vm:~$ sudo netstat -tap | grep mysql
tcp        0      0 localhost:33060         0.0.0.0:*               LISTEN      46392/mysqld
tcp        0      0 localhost:mysql         0.0.0.0:*               LISTEN      46392/mysqld
  • Check the installed MySQL version
lichaoxiang@ubuntu-vm:~$ mysql -V

mysql  Ver 8.0.27-0ubuntu0.21.04.1 for Linux on x86_64 ((Ubuntu))
  • Check the MySQL service status
lichaoxiang@ubuntu-vm:~$ service mysql status

 mysql.service - MySQL Community Server
     Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-11-08 16:43:58 CST; 19min ago
    Process: 46384 ExecStartPre=/usr/share/mysql/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
   Main PID: 46392 (mysqld)
     Status: "Server is operational"
      Tasks: 38 (limit: 4596)
     Memory: 354.5M
     CGroup: /system.slice/mysql.service
             └─46392 /usr/sbin/mysqld

11月 08 16:43:58 ubuntu-vm systemd[1]: Starting MySQL Community Server...
11月 08 16:43:58 ubuntu-vm systemd[1]: Started MySQL Community Server.
  • Look up the default password (strictly, this is the password of the debian-sys-maint maintenance account rather than root's)
lichaoxiang@ubuntu-vm:~$ sudo cat /etc/mysql/debian.cnf

# Automatically generated for Debian scripts. DO NOT TOUCH!
[client]
host     = localhost
user     = debian-sys-maint
password = nygkYwDcGppIO6qc
socket   = /var/run/mysqld/mysqld.sock
[mysql_upgrade]
host     = localhost
user     = debian-sys-maint
password = nygkYwDcGppIO6qc
socket   = /var/run/mysqld/mysqld.sock
  • Log in to MySQL as root using the default password
lichaoxiang@ubuntu-vm:~$ sudo mysql -u root -pnygkYwDcGppIO6qc

mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 13
Server version: 8.0.27-0ubuntu0.21.04.1 (Ubuntu)

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.00 sec)
  • Change the root password
mysql> use mysql
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed


mysql> alter user 'root'@'localhost' IDENTIFIED BY '123456';
Query OK, 0 rows affected (0.01 sec)


mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
  • Create a new MySQL user hive and grant it privileges
mysql> create user 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.01 sec)


mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)


mysql> grant all privileges on *.* to 'hive'@'localhost' with grant option;
Query OK, 0 rows affected (0.01 sec)


mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
  • Exit the root session, log in to MySQL as hive, and create a database
lichaoxiang@ubuntu-vm:~$ mysql -uhive -phive

mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 11
Server version: 8.0.27-0ubuntu0.21.04.1 (Ubuntu)

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.


mysql> create database hive_test;
Query OK, 1 row affected (0.01 sec)


mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| hive_test          |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)
  • MySQL service management
# Service status
service mysql status

# Start the service
sudo service mysql start

# Restart the service
sudo service mysql restart

# Stop the service
sudo service mysql stop
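
Optionally, the MySQL packages also ship an interactive hardening script that can set a password policy, remove anonymous users, and disable remote root login; it is worth running once on a fresh install:

sudo mysql_secure_installation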

Configuring Passwordless SSH Login

  • Install openssh-server
sudo apt install openssh-server
  • Check that ssh was installed successfully
lichaoxiang@ubuntu-vm:~$ ssh localhost

lichaoxiang@localhost's password:
Welcome to Ubuntu 21.04 (GNU/Linux 5.11.0-38-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

0 updates can be applied immediately.

New release '21.10' available.
Run 'do-release-upgrade' to upgrade to it.

*** System restart required ***
Last login: Mon Nov  8 18:01:35 2021 from 127.0.0.1

If the installation succeeded, you will be prompted for a password, and after entering it the login completes.

  • Close the ssh connection
lichaoxiang@ubuntu-vm:~$ logout

Connection to localhost closed.
  • Configure passwordless ssh login (generate the authorized_keys file)

After the first login, a .ssh folder will exist under the current user's HOME directory; change into it.

lichaoxiang@ubuntu-vm:~$ cd ~/.ssh

lichaoxiang@ubuntu-vm:~/.ssh$ pwd
/home/lichaoxiang/.ssh
  • Generate an RSA private/public key pair
lichaoxiang@ubuntu-vm:~/.ssh$ ssh-keygen -t rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/home/lichaoxiang/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/lichaoxiang/.ssh/id_rsa
Your public key has been saved in /home/lichaoxiang/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:FVRq4ghF+qLznEY4moNn2gAc3Y0g6tcJttJV1uqB2kc lichaoxiang@ubuntu-vm
The key's randomart image is:
+---[RSA 3072]----+
| . . .oo..o..    |
|. o oo=  . o     |
|.. +o+..o +      |
|o + =+.E +       |
|.+ =+o= S        |
|. =o.o o         |
|oooo  .          |
|=oo+..           |
|.=..+            |
+----[SHA256]-----+

Just press Enter through all the prompts. The first prompt asks where to save the key pair; the default is ~/.ssh/id_rsa.

  • Append the public key to authorized_keys (authorize the key for login)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • Test passwordless login to localhost
lichaoxiang@ubuntu-vm:~$ ssh localhost

Welcome to Ubuntu 21.04 (GNU/Linux 5.11.0-38-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

0 updates can be applied immediately.

New release '21.10' available.
Run 'do-release-upgrade' to upgrade to it.

*** System restart required ***
Last login: Mon Nov  8 18:03:50 2021 from 127.0.0.1

This time, ssh localhost logs in without a password.
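
If ssh still asks for a password at this point, the most common cause is that sshd ignores key files whose permissions are too open; tightening them usually fixes it:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys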

Installing Hadoop

Download Hadoop

  • Hadoop download link
https://www.apache.org/dyn/closer.cgi/hadoop/common

[Figure: Hadoop-Install03]

  • Upload Hadoop to the Linux environment
lichaoxiang@ubuntu-vm:~/Download$ ls
hadoop-3.3.1.tar.gz  jdk-17_linux-x64_bin.tar.gz
  • Extract Hadoop into the target directory
lichaoxiang@ubuntu-vm:~/Download$ sudo tar -zxvf hadoop-3.3.1.tar.gz -C /usr/local/opt/
  • Add the Hadoop environment variables
lichaoxiang@ubuntu-vm:/usr/local/opt$ sudo vim /etc/profile

export HADOOP_HOME=/usr/local/opt/hadoop-3.3.1
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1$ source /etc/profile
  • Check the environment variables
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1$ echo $HADOOP_HOME

/usr/local/opt/hadoop-3.3.1
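
As a further check that the Hadoop binaries are on the PATH, ask for the version; for this install it should report 3.3.1:

hadoop version
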

Pseudo-Distributed Mode Configuration

Hadoop can run on a single node in pseudo-distributed mode: each Hadoop daemon runs as a separate Java process, the node acts as both NameNode and DataNode, and files are read from HDFS. For this install, the configuration files live in /usr/local/opt/hadoop-3.3.1/etc/hadoop/.

  • Configure hadoop-env.sh
# set the Java path
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ sudo vim hadoop-env.sh

Change JAVA_HOME=${JAVA_HOME} to JAVA_HOME=/usr/local/opt/jdk1.8.0_131.

[Figure: Hadoop-Install04]

  • Configure core-site.xml
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ sudo vim core-site.xml

<configuration>
    <!-- fs.default.name is the deprecated predecessor of fs.defaultFS; both are set to the same value here -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <!-- Default HDFS (NameNode) address; localhost is the hostname and can also be an IP address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <!-- Directory for Hadoop's runtime files (adjust to your own setup) -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/usr/local/opt/hadoop-3.3.1/tmp</value>
    </property>
</configuration>

Note: if the directory /usr/local/opt/hadoop-3.3.1/tmp does not exist, create it yourself.

lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ ll /usr/local/opt/hadoop-3.3.1/

总用量 120
drwxr-xr-x 10 lichaoxiang lichaoxiang  4096  6月 15 13:52 ./
drwxr-xr-x  4 root        root         4096 11月  9 09:18 ../
drwxr-xr-x  2 lichaoxiang lichaoxiang  4096  6月 15 13:52 bin/
drwxr-xr-x  3 lichaoxiang lichaoxiang  4096  6月 15 13:15 etc/
drwxr-xr-x  2 lichaoxiang lichaoxiang  4096  6月 15 13:52 include/
drwxr-xr-x  3 lichaoxiang lichaoxiang  4096  6月 15 13:52 lib/
drwxr-xr-x  4 lichaoxiang lichaoxiang  4096  6月 15 13:52 libexec/
-rw-rw-r--  1 lichaoxiang lichaoxiang 23450  6月 15 13:02 LICENSE-binary
drwxr-xr-x  2 lichaoxiang lichaoxiang  4096  6月 15 13:52 licenses-binary/
-rw-rw-r--  1 lichaoxiang lichaoxiang 15217  6月 15 13:02 LICENSE.txt
-rw-rw-r--  1 lichaoxiang lichaoxiang 29473  6月 15 13:02 NOTICE-binary
-rw-rw-r--  1 lichaoxiang lichaoxiang  1541  5月 22 00:11 NOTICE.txt
-rw-rw-r--  1 lichaoxiang lichaoxiang   175  5月 22 00:11 README.txt
drwxr-xr-x  3 lichaoxiang lichaoxiang  4096  6月 15 13:15 sbin/
drwxr-xr-x  4 lichaoxiang lichaoxiang  4096  6月 15 14:18 share/


lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ mkdir /usr/local/opt/hadoop-3.3.1/tmp
  • Configure hdfs-site.xml
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ sudo vim hdfs-site.xml

<configuration>
<!-- Number of HDFS block replicas -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/usr/local/opt/hadoop-3.3.1/hdfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/usr/local/opt/hadoop-3.3.1/hdfs/data</value>
    </property>
</configuration>

Note: if the directories /usr/local/opt/hadoop-3.3.1/hdfs/name and /usr/local/opt/hadoop-3.3.1/hdfs/data do not exist, create them yourself.

lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ mkdir -p /usr/local/opt/hadoop-3.3.1/hdfs/name

lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ mkdir -p /usr/local/opt/hadoop-3.3.1/hdfs/data
  • Configure mapred-site.xml
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ sudo vim mapred-site.xml

<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
  • Configure yarn-site.xml
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ sudo vim yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
        <!-- ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost</value>
    </property>
        <!-- How reducers fetch map output -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.application.classpath</name>
        <value>/usr/local/opt/hadoop-3.3.1/etc/hadoop:/usr/local/opt/hadoop-3.3.1/share/hadoop/common/lib/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/common/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/hdfs:/usr/local/opt/hadoop-3.3.1/share/hadoop/hdfs/lib/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/hdfs/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/mapreduce/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/yarn:/usr/local/opt/hadoop-3.3.1/share/hadoop/yarn/lib/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/yarn/*</value>
    </property> 
</configuration>

The value for yarn.application.classpath is obtained with the command below; just copy its output into the file.

lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1$ hadoop classpath

/usr/local/opt/hadoop-3.3.1/etc/hadoop:/usr/local/opt/hadoop-3.3.1/share/hadoop/common/lib/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/common/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/hdfs:/usr/local/opt/hadoop-3.3.1/share/hadoop/hdfs/lib/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/hdfs/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/mapreduce/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/yarn:/usr/local/opt/hadoop-3.3.1/share/hadoop/yarn/lib/*:/usr/local/opt/hadoop-3.3.1/share/hadoop/yarn/*

Starting Hadoop

  • Format the NameNode

This formats the HDFS filesystem, which is where the cluster's data is stored.

lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ hdfs namenode -format

WARNING: /usr/local/opt/hadoop-3.3.1/logs does not exist. Creating.
2021-11-09 10:53:03,906 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu-vm/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.3.1
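
If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml above is populated; a quick sanity check (path as configured in this guide):

ls /usr/local/opt/hadoop-3.3.1/hdfs/name/current
# a successful format creates VERSION and fsimage_* files here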
  • Start the Hadoop daemons (start-all.sh launches both HDFS and YARN)
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ start-all.sh

WARNING: Attempting to start all Apache Hadoop daemons as lichaoxiang in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ubuntu-vm]
Starting resourcemanager
Starting nodemanagers
  • List the running Hadoop services with jps (the YARN daemons, ResourceManager and NodeManager, may take a few seconds to appear)
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ jps

4082 Jps
3603 SecondaryNameNode
3214 NameNode
3359 DataNode
  • Run a sample computation job
lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar pi 10 10

Number of Maps  = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
2021-11-09 12:52:12,134 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-11-09 12:52:12,613 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/lichaoxiang/.staging/job_1636433506029_0001
2021-11-09 12:52:13,167 INFO input.FileInputFormat: Total input files to process : 10
2021-11-09 12:52:13,640 INFO mapreduce.JobSubmitter: number of splits:10
2021-11-09 12:52:13,830 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1636433506029_0001
2021-11-09 12:52:13,831 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-11-09 12:52:14,031 INFO conf.Configuration: resource-types.xml not found
2021-11-09 12:52:14,032 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-11-09 12:52:14,367 INFO impl.YarnClientImpl: Submitted application application_1636433506029_0001
2021-11-09 12:52:14,420 INFO mapreduce.Job: The url to track the job: http://ubuntu-vm:8088/proxy/application_1636433506029_0001/
2021-11-09 12:52:14,421 INFO mapreduce.Job: Running job: job_1636433506029_0001
2021-11-09 12:52:23,575 INFO mapreduce.Job: Job job_1636433506029_0001 running in uber mode : false
2021-11-09 12:52:23,577 INFO mapreduce.Job:  map 0% reduce 0%
2021-11-09 12:52:45,409 INFO mapreduce.Job:  map 60% reduce 0%
2021-11-09 12:52:58,527 INFO mapreduce.Job:  map 100% reduce 0%
2021-11-09 12:53:00,549 INFO mapreduce.Job:  map 100% reduce 100%
2021-11-09 12:53:00,563 INFO mapreduce.Job: Job job_1636433506029_0001 completed successfully
2021-11-09 12:53:00,698 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=226
                FILE: Number of bytes written=3009266
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2690
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=45
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=165307
                Total time spent by all reduces in occupied slots (ms)=11850
                Total time spent by all map tasks (ms)=165307
                Total time spent by all reduce tasks (ms)=11850
                Total vcore-milliseconds taken by all map tasks=165307
                Total vcore-milliseconds taken by all reduce tasks=11850
                Total megabyte-milliseconds taken by all map tasks=169274368
                Total megabyte-milliseconds taken by all reduce tasks=12134400
        Map-Reduce Framework
                Map input records=10
                Map output records=20
                Map output bytes=180
                Map output materialized bytes=280
                Input split bytes=1510
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=280
                Reduce input records=20
                Reduce output records=0
                Spilled Records=40
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=4139
                CPU time spent (ms)=10010
                Physical memory (bytes) snapshot=3187793920
                Virtual memory (bytes) snapshot=37711335424
                Total committed heap usage (bytes)=2828533760
                Peak Map Physical memory (bytes)=329674752
                Peak Map Virtual memory (bytes)=3427729408
                Peak Reduce Physical memory (bytes)=209952768
                Peak Reduce Virtual memory (bytes)=3436142592
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Job Finished in 48.652 seconds
Estimated value of Pi is 3.20000000000000000000
  • Open the Hadoop cluster web UI
http://localhost:9870/

The YARN ResourceManager web UI (the host:8088 address shown in the job log above) is at http://localhost:8088/.

Stopping Hadoop

lichaoxiang@ubuntu-vm:/usr/local/opt/hadoop-3.3.1/etc/hadoop$ stop-all.sh 

Installing Hive

Download Hive

  • Hive download link
http://www.apache.org/dyn/closer.cgi/hive/

[Figure: Hadoop-Install08]

  • Upload Hive to the Linux environment
lichaoxiang@ubuntu-vm:~/Download$ ls

apache-hive-3.1.2-bin.tar.gz 
  • Extract Hive into the target directory
lichaoxiang@ubuntu-vm:~/Download$ sudo tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /usr/local/opt/

Install Hive

  • Edit /etc/profile to configure the Hive environment variables
lichaoxiang@ubuntu-vm:~$ sudo vim /etc/profile

Add the following lines at the end of the file:

export HIVE_HOME=/usr/local/opt/apache-hive-3.1.2-bin
export PATH=${PATH}:${HIVE_HOME}/bin
  • Apply the environment variables
lichaoxiang@ubuntu-vm:~$ source /etc/profile
  • Configure Hive
lichaoxiang@ubuntu-vm:~$ cd /usr/local/opt/apache-hive-3.1.2-bin/conf/

lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/conf$ sudo cp ./hive-default.xml.template ./hive-default.xml

lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/conf$ sudo cp ./hive-env.sh.template ./hive-env.sh

lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/conf$ sudo cp ./hive-log4j2.properties.template ./hive-log4j2.properties

lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/conf$ sudo touch ./hive-site.xml
  • Edit hive-env.sh
lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/conf$ sudo vim ./hive-env.sh

export HADOOP_HOME=/usr/local/opt/hadoop-3.3.1
export HIVE_CONF_DIR=/usr/local/opt/apache-hive-3.1.2-bin/conf
export HIVE_AUX_JARS_PATH=/usr/local/opt/apache-hive-3.1.2-bin/lib

[Figure: Hadoop-Install09]

  • Edit hive-log4j2.properties
lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/conf$ sudo vim ./hive-log4j2.properties

property.hive.log.dir = /usr/local/opt/apache-hive-3.1.2-bin/logs
appender.DRFA.fileName = /usr/local/opt/apache-hive-3.1.2-bin/logs/${sys:hive.log.file}
appender.DRFA.filePattern = /usr/local/opt/apache-hive-3.1.2-bin/logs/${sys:hive.log.file}.%d{yyyy-MM-dd}

[Figure: Hadoop-Install10]

Note: if the directory /usr/local/opt/apache-hive-3.1.2-bin/logs does not exist, create it yourself.

lichaoxiang@ubuntu-vm:~$ sudo mkdir -p /usr/local/opt/apache-hive-3.1.2-bin/logs
  • Edit hive-site.xml
lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/conf$ sudo vim hive-site.xml

<configuration>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>/tmp/hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
</configuration>

The hive-site.xml above is enough for a first-time setup. Once you are familiar with the options, you can instead start from the hive-default.xml mentioned earlier: configure it, delete the minimal hive-site.xml above, and rename the configured hive-default.xml to hive-site.xml.

Preparation Before Running Hive

  • Create the HDFS paths configured for Hive (a permissions note follows the listing below)
lichaoxiang@ubuntu-vm:~$ hdfs dfs -mkdir -p /user/hive/warehouse

lichaoxiang@ubuntu-vm:~$ hdfs dfs -ls -R / | grep 'hive'
drwx-wx-wx   - lichaoxiang supergroup          0 2021-11-09 13:05 /tmp/hive
drwx------   - lichaoxiang supergroup          0 2021-11-09 13:05 /tmp/hive/lichaoxiang
drwxr-xr-x   - lichaoxiang supergroup          0 2021-11-09 13:34 /user/hive
drwxr-xr-x   - lichaoxiang supergroup          0 2021-11-09 13:34 /user/hive/warehouse
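
Hive's getting-started documentation also recommends making the scratch and warehouse directories group-writable before running Hive; a sketch using the paths configured above:

hdfs dfs -chmod g+w /tmp/hive
hdfs dfs -chmod g+w /user/hive/warehouse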
  • Copy the MySQL driver jar into hive/lib
lichaoxiang@ubuntu-vm:~$ sudo cp Download/mysql-connector-java-8.0.21/mysql-connector-java-8.0.21.jar /usr/local/opt/apache-hive-3.1.2-bin/lib/

lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ ls -al | grep 'mysql'
-rw-r--r--  1 root root   2397321 11月  9 13:53 mysql-connector-java-8.0.21.jar
-rw-r--r--  1 root staff    10476 11月 16  2018 mysql-metadata-storage-0.12.0.jar

lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ sudo chmod -R 755 ./mysql-connector-java-8.0.21.jar

Download link: https://dev.mysql.com/downloads/file/?id=496589

  • Replace the outdated guava jar (see the note after the listings below)
lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ ls -al | grep 'guava'

-rw-r--r--  1 root staff  2308517  9月 27  2018 guava-19.0.jar
-rw-r--r--  1 root staff   971309  5月 21  2019 jersey-guava-2.25.1.jar


lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ ls -al /usr/local/opt/hadoop-3.3.1/share/hadoop/common/lib/*.jar | grep 'guava'

-rw-r--r-- 1 lichaoxiang lichaoxiang 2747878  6月 15 13:15 /usr/local/opt/hadoop-3.3.1/share/hadoop/common/lib/guava-27.0-jre.jar
-rw-r--r-- 1 lichaoxiang lichaoxiang 3362359  6月 15 13:15 /usr/local/opt/hadoop-3.3.1/share/hadoop/common/lib/hadoop-shaded-guava-1.1.1.jar
-rw-r--r-- 1 lichaoxiang lichaoxiang    2199  6月 15 13:15 /usr/local/opt/hadoop-3.3.1/share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar


lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ sudo cp /usr/local/opt/hadoop-3.3.1/share/hadoop/common/lib/guava-27.0-jre.jar ./


lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ ls -al | grep 'guava'

-rw-r--r--  1 root staff  2308517  9月 27  2018 guava-19.0.jar
-rw-r--r--  1 root root   2747878 11月  9 13:57 guava-27.0-jre.jar
-rw-r--r--  1 root staff   971309  5月 21  2019 jersey-guava-2.25.1.jar
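
Note that guava-19.0.jar is still present: the copy above only adds the newer jar. To keep the two versions from clashing on Hive's classpath, it is safest to also move the old jar aside (run in the same lib directory):

sudo mv guava-19.0.jar guava-19.0.jar-bak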
  • Initialize the Hive schema
lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ cd

lichaoxiang@ubuntu-vm:~$ sudo /usr/local/opt/apache-hive-3.1.2-bin/bin/schematool -dbType mysql -initSchema

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/opt/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.cj.jdbc.Driver
Metastore connection User:       hive
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.mysql.sql
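
If initialization completed, the metastore tables now live in the MySQL database named in the JDBC URL (hive, auto-created thanks to createDatabaseIfNotExist=true); a quick check:

mysql -uhive -phive -e 'show tables from hive;'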
  • Move Hive's log4j-slf4j-impl jar out of the way (it duplicates the SLF4J binding shipped with Hadoop, as the warnings above show)
lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ ls -al | grep 'log4j-slf4j-impl'
-rw-r--r--  1 root staff    24173  9月 27  2018 log4j-slf4j-impl-2.10.0.jar

lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/lib$ sudo mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.jar-bak

Running Hive

Before running Hive, start the pseudo-distributed Hadoop environment with start-all.sh.

lichaoxiang@ubuntu-vm:~$ hive

Logging initialized using configuration in file:/usr/local/opt/apache-hive-3.1.2-bin/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 931026ae-acd8-4d70-be83-3ef397c45a8e
  • Create a Hive database and tables and insert data (a table example follows the session below)
hive> create database hive;
OK
Time taken: 0.198 seconds

hive> use hive;
OK
Time taken: 0.033 seconds

Starting HiveServer2

  • Edit core-site.xml and add the following
lichaoxiang@ubuntu-vm:~$ sudo vim /usr/local/opt/hadoop-3.3.1/etc/hadoop/core-site.xml

    <property>
        <name>hadoop.proxyuser.lichaoxiang.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.lichaoxiang.groups</name>
        <value>*</value>
    </property>
  • Edit hive-site.xml and add the following
lichaoxiang@ubuntu-vm:~$ sudo vim /usr/local/opt/apache-hive-3.1.2-bin/conf/hive-site.xml

    <property>
        <name>beeline.hs2.connection.user</name>
        <value>hive</value>
    </property>
    <property>
        <name>beeline.hs2.connection.password</name>
        <value>hive</value>
    </property>
    <property>
        <name>beeline.hs2.connection.hosts</name>
        <value>localhost:10000</value>
    </property>
  • Restart Hadoop
lichaoxiang@ubuntu-vm:~$ stop-all.sh

lichaoxiang@ubuntu-vm:~$ start-all.sh
  • Go to Hive's bin directory and start hiveserver2
lichaoxiang@ubuntu-vm:~$ cd /usr/local/opt/apache-hive-3.1.2-bin/bin


lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/bin$ hiveserver2

2021-11-09 17:26:01: Starting HiveServer2
.
.
.
Hive Session ID = d9d6f67a-cb2b-4007-b074-a3c8ef046cc1
Hive Session ID = 3aa9ef99-315a-4a1a-82b0-c965aacca06b
Hive Session ID = 55b2814f-68ec-478e-919d-944efaaff474
Hive Session ID = 8bdaf869-7ab1-4d7f-b3e2-9d1bddde47b5
OK
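
hiveserver2 occupies the foreground. If you would rather not dedicate a terminal to it, one common pattern is to run it in the background instead (the log path here is arbitrary):

nohup hiveserver2 > /tmp/hiveserver2.log 2>&1 &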
  • In a separate terminal window
lichaoxiang@ubuntu-vm:~$ beeline

Beeline version 3.1.2 by Apache Hive


beeline> !connect jdbc:hive2://localhost:10000

Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: lichaoxiang
Enter password for jdbc:hive2://localhost:10000: ********
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>
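
The connection parameters can also be passed on the beeline command line instead of typing !connect interactively (add -p for the password if needed):

beeline -u jdbc:hive2://localhost:10000 -n lichaoxiang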

Configuring DBeaver

  • Download DBeaver

Link: https://pan.baidu.com/s/1nSmsiFd-KCi5-gCYsO-uWA

Extraction code: e92x

  • Create a new connection

[Figure: Hadoop-Install23]

  • Choose the connection type

[Figure: Hadoop-Install24]

  • Edit the driver settings

[Figure: Hadoop-Install25]

  • Use the JDBC jar bundled with Hive
lichaoxiang@ubuntu-vm:/usr/local/opt/apache-hive-3.1.2-bin/jdbc$ ls

hive-jdbc-3.1.2-standalone.jar

Delete DBeaver's default jars, then add Hive's bundled JDBC jar.

[Figure: Hadoop-Install26]

  • Configure the connection

[Figure: Hadoop-Install27]

[Figure: Hadoop-Install28]

Troubleshooting

  • Browse Directory error

[Figure: Hadoop-Install05]

Fix: this is caused by the Java version. Multiple versions were installed locally and the default was Java 16, which is too new; switching the Java version to 1.8 resolves the problem.

After switching to version 1.8, the page renders as follows:

[Figure: Hadoop-Install06]

  • Running a compute job fails with: /bin/bash: line 1: /bin/java: No such file or directory

[Figure: Hadoop-Install07]

Fix: create a symlink.

lichaoxiang@ubuntu-vm:/usr/local/opt/jdk1.8.0_131/bin$ sudo ln -s /usr/local/opt/jdk1.8.0_131/bin/java /bin/java
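
An alternative to a bare symlink is Debian's alternatives system, which keeps the java path managed and reversible (the priority value 100 is arbitrary); on releases with a merged /usr, such as this 21.04 VM, /bin/java then resolves as well:

sudo update-alternatives --install /usr/bin/java java /usr/local/opt/jdk1.8.0_131/bin/java 100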


Original article; when reposting, please credit the source: http://www.opcoder.cn/article/48/