In a previous post, I had talked a bit about using Pig with Hadoop. Of recent I have been playing with Apache Hive and as always thought I'd share :-). In particular this BLOG demonstrates using JSON data with Hive and also the Hive Web Interface. I wanted to have some fun playing with some technologies like Apache Wicket and Spring Java Config so I created my own version of Hive Web Interface which I am sharing as well.
Hive is a data warehouse that allows one to issue ad hoc queries to access data in HDFS via a language that is very similar to SQL. Hive's version of SQL is called HiveQL and does not support the full SQL-92 specification. Apparently Hive was created by Facebook to allow business analysts with strong SQL skills and 'little' java skills to analyze the large volumes of data collected. Hive is not a OLTP processing system by any means and querying of data can take some time to complete, depending on the size of data and complexity of the query. Hive takes the SQL queries submitted and converts the same into a series of MapReduce jobs to run on a Hadoop Cluster.
My steps for getting Hive going on a single node -
1. Install Hadoop:
Instructions for installation of Hadoop is quite well detailed on FamousPhil's blog. Note that you want to substitute the version of the Hadoop Tarball with the one from Cloudera. I prefer to use Cloudera's distribution as the different Hadoop artifacts are readily made available via their maven repositories and I needed them for my example. I installed Hadoop at /opt/hadoop/hadoop-0.20.2-cdh3u1
2. Install Hive:
Obtain Cloudera's version of Hive and install the same. Instructions on the same is available on the Cloudera Installation Page. I installed Hive at /opt/hive/hive-0.7.1-cdh3u1
3. Install MySQL and Configure metastore
Hive stores metadata in a standard relational database. Hive out of the box runs on Derby. This is fine for a single user setup but in a multi-user environment one would like to share metadata and for this reason, a centralized database like MySQL is recommended. Detailed instructions on configuring MySQL as the metastore are available on the Cloudera MySQL installation page. Make sure that you copy the mysql driver to $HIVE_HOME/lib folder. At this point, you should be able to connect to hive. The following shows a command to list all tables:
Leaning on my previous simple example with Pig, a "comments" file is available in the attached example (at the bottom). Load the file provided into HDFS via:
In Hive one finds the concept of a Managed Table and an External Table. When one creates a Managed table, Hive will manage the data moving forward. In other words, Hive will move the data from a source location in HDFS into Hive's warehouse. In the case of an External table, one tells Hive to refer to data in a particular HDFS location and Hive will not copy the data into its warehouse. For the comment data,we will use a managed table and create the same using:
Once data is loaded into the table, one can query the same using HiveQL. Hive's get_json_object function is to extract JSON data as shown below:
6. Hive Web Interface (HWI):
Hive is a data warehouse that allows one to issue ad hoc queries to access data in HDFS via a language that is very similar to SQL. Hive's version of SQL is called HiveQL and does not support the full SQL-92 specification. Apparently Hive was created by Facebook to allow business analysts with strong SQL skills and 'little' java skills to analyze the large volumes of data collected. Hive is not a OLTP processing system by any means and querying of data can take some time to complete, depending on the size of data and complexity of the query. Hive takes the SQL queries submitted and converts the same into a series of MapReduce jobs to run on a Hadoop Cluster.
My steps for getting Hive going on a single node -
1. Install Hadoop:
Instructions for installation of Hadoop is quite well detailed on FamousPhil's blog. Note that you want to substitute the version of the Hadoop Tarball with the one from Cloudera. I prefer to use Cloudera's distribution as the different Hadoop artifacts are readily made available via their maven repositories and I needed them for my example. I installed Hadoop at /opt/hadoop/hadoop-0.20.2-cdh3u1
2. Install Hive:
Obtain Cloudera's version of Hive and install the same. Instructions on the same is available on the Cloudera Installation Page. I installed Hive at /opt/hive/hive-0.7.1-cdh3u1
3. Install MySQL and Configure metastore
Hive stores metadata in a standard relational database. Hive out of the box runs on Derby. This is fine for a single user setup but in a multi-user environment one would like to share metadata and for this reason, a centralized database like MySQL is recommended. Detailed instructions on configuring MySQL as the metastore are available on the Cloudera MySQL installation page. Make sure that you copy the mysql driver to $HIVE_HOME/lib folder. At this point, you should be able to connect to hive. The following shows a command to list all tables:
>SHOW TABLES; OK Time taken:1.2 seconds4. Creating and Loading a Table:
Leaning on my previous simple example with Pig, a "comments" file is available in the attached example (at the bottom). Load the file provided into HDFS via:
>hadoop fs -put /tmp/comments /user/hadoop/commentsNote that we are assuming that the comments file is available at /tmp.
In Hive one finds the concept of a Managed Table and an External Table. When one creates a Managed table, Hive will manage the data moving forward. In other words, Hive will move the data from a source location in HDFS into Hive's warehouse. In the case of an External table, one tells Hive to refer to data in a particular HDFS location and Hive will not copy the data into its warehouse. For the comment data,we will use a managed table and create the same using:
hive>CREATE TABLE comment_data(comment_val STRING); hive> SHOW TABLES; OK comment_data Time taken: 2.42 seconds hive> DESCRIBE comment_data; OK comment_val stringThe JSON data from the file is loaded into a single COLUMN of the comment_data table.
hive>LOAD DATA INPATH '/user/hadoop/comments' OVERWRITE INTO TABLE comment_data;
hive>SELECT * from comment_data LIMIT 5;
OK
{"commenterId":"nemesis","commentData":"Java is dead, long live Java","country":"TANZANIA"}
{"commenterId":"rambo","commentData":"Felix Lighter and James Bond work well together as they are cave men","country":"SPAIN"}
{"commenterId":"donaldduck","commentData":"Fred Flintstone is in his cave","country":"INDIA"}
{"commenterId":"donaldduck","commentData":"The world is a cave. James bond lives in a Cave.","country":"TANZANIA"}
{"commenterId":"jamesbond","commentData":"The world is a cave. James bond lives in a Cave.","country":"BHUTAN"}
Time taken: 0.117 seconds
5. Querying Hive
Once data is loaded into the table, one can query the same using HiveQL. Hive's get_json_object function is to extract JSON data as shown below:
// Select commenter Identifier and comment
hive>select get_json_object(comment_data.comment_val, '$.commenterId'), get_json_object(comment_data.comment_val, '$.commentData') from comment_data;
Total MapReduce jobs = 1
...
Ended Job = job_201110171315_0001
OK
nemesis Java is dead, long live Java
rambo Felix Lighter and James Bond work well together as they are cave men
donaldduck Fred Flintstone is in his cave
donaldduck The world is a cave. James bond lives in a Cave.
jamesbond The world is a cave. James bond lives in a Cave.
....
// Select all comments which contain the string '%cave%' and group them
hive> SELECT b.commentData,COUNT(b.commentData)
FROM comment_data a LATERAL VIEW json_tuple(a.comment_val, 'commentData') b
AS commentData where b.commentData LIKE '%cave%' GROUP BY b.commentData;
Total MapReduce jobs = 1
Launching Job 1 out of 1
....
Starting Job = job_201110171733_0015, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201110171733_0015
Kill Command = /opt/hadoop/hadoop-0.20.2-cdh3u1/bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201110171733_0015
2011-10-17 19:57:59,797 Stage-1 map = 0%, reduce = 0%
2011-10-17 19:58:01,807 Stage-1 map = 100%, reduce = 0%
2011-10-17 19:58:09,855 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201110171733_0015
OK
Felix Lighter and James Bond work well together as they are cave men 927
Fred Flintstone is in his cave 935
Only a cave man could do this 907
The world is a cave. James bond lives in a Cave. 881
Time taken: 13.715 seconds
Another option when working with JSON Data is to use the hive-json-serde. With hive-json-serde one can load the JSON data into individual columns. In order to use the UDF's in hive-json-serde, the jar needs to be registered with Hive. A comment table can then be created using the following. Note that the following is an example of an EXTERNAL Table and while creating the same, we are specifying the directory that contains the comments file:
hive>ADD JAR /opt/hive-0.7.1-cdh3u1/lib/hive-json-serde-0.2.jar; hive>CREATE EXTERNAL TABLE comment_data_split(commenterId STRING, commentData STRING, country STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' LOCATION '/user/hadoop/';An example query of the data:
hive> select a.commenterId, a.commentData, count(a.commentData) from comment_data_split a where a.commentData like '%cave%' group by a.commenterId, a.commentData; .... Ended Job = job_201110171315_0011 OK donaldduck Felix Lighter and James Bond work well together as they are cave men 69 donaldduck Fred Flintstone is in his cave 83 donaldduck Only a cave man could do this 65 donaldduck The world is a cave. James bond lives in a Cave. 68 factorypilot Felix Lighter and James Bond work well together as they are cave men 67 factorypilot Fred Flintstone is in his cave 77 factorypilot Only a cave man could do this 81 ...Clearly the above examples are trivial queries.
Hive comes with a web interface for those unable to get to a command line or do not wish to install Hive on their work stations. HWI allows a user to work with Hive just as they would using the console client. One can create Hive Sessions and execute HiveSQL with them. The Hive web interface is started using:
>hive --service hwiExecuting the same starts a jetty container and one can access the hive web application at http://localhost:9999/hwi.
7. HWI Amped
One of the things I wanted to play with is to create a little more friendly web user interface to Hive than what HWI provides. So I proceeded to have some fun and do the same :-). In addition, I wanted to play with Wicket and Spring as well to get a feel for the framework. Killing quite a few birds with one stone here. The web application itself is largely based on the hwi source code with a few tweaks. HWI Amped uses the Thrift client to establish a connection to the Hive thrift service and allows a user to:
- Browse Schemas and Tables
- Create Sessions and Manage Sessions
- Execute HiveQL queries either in streaming where results are displayed in the browser or background mode where results are piped to a file.
Running the web application:
The Hwi Amped web application expects a Hive Server running at port 10001. It communicates with the Hive Server using Thrift. Start the server using:
>hive --service hiveserver 10001The Hwi Amped web application itself is a maven project, so from the command line execute the following to start it:
>mvn jetty:runThe Web application can be accessed from http://localhost:9090. You will be required to sign in. Currently there is no authentication hook-in built but one can provide the user "hadoop" and any dummy password to access the application.
Some screen shots are shown below:
The application is a learning experience and is devoid of unit-tests. The URL for the Hive server and metastore are also hard coded and not externalized.
Wicket? I found programming Wicket very much like programming Swing and would definitely consider it for other applications. A book in hand would have helped me move faster through my code. I did not struggle too much in creating the application, but then the struggle is usually not for the creator but the user of the application heh? One thing I fear is I have not understood the Wicket life cycle yet and worry if Page objects created might linger for longer than expected. Wicket integration with Spring was a breeze and I used Spring Java Config to inject a Session Service into the different Wicket Pages. My hope with Java Config was to be able to get rid of applicationContext.xml totally but found that I needed it to bootstrap Wicket. Spring Java Config is definitely the way forward IMHO.
On Hive, well my initial perspective (far from an expert by any means) was it feels very familiar due to it being SQLish and that the learning curve to using Hive is lesser that Pig for a person who is familiar with SQL. In addition, I think that for ad hoc querying, Hive is definitely the way to go with Pig being reserved for programmers and for periodic jobs. That said, Phase 2 of Hwi Amped might be to add Pig support, wot?
Overall, quite some fun for a weekend hackathon..
Download HWI Amped from HERE. In the top level directory the comments file is available. Any feedback on the UI is welcomed and if I have got some Hive concept wrong, please do let me know. Good luck and happy Hives..I mean Hive ;-)







2 comments:
Hi,
thanks for this interesting post and the UI :-)
Where exactly can I find the comments with JSON content? I've downloaded the zip and had a look into the resources folders, however all I found were configuration files.
Thanks
George
The comments file is at the root level of the project as mentioned in the blog. It is in the same level as the pom.xml file
Post a Comment