rhive tutorials - basic functions
DESCRIPTION
One can learn how to use the basic functions in RHive by reading this document. This document was last updated on 5 March 2012.
TRANSCRIPT
RHive tutorial - basic functions
This tutorial explains how to load the RHive library and use the basic functions of RHive.
Loading RHive
Load RHive the same way you would load any other R package:
library(RHive)
But before loading RHive, do not forget to configure the HADOOP_HOME and HIVE_HOME environment variables. If they are not set, you can set them temporarily before loading the library, as follows. HADOOP_HOME is the home directory where Hadoop is installed, and HIVE_HOME is the home directory where Hive is installed. Consult "RHive tutorial - RHive installation and setting" for details on the environment variables.
Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
library(RHive)
rhive.init
rhive.init performs RHive's internal initialization. If the environment variables were set correctly before loading RHive, it runs automatically. But if the environment variables were not configured when RHive was loaded via library(RHive), the following errors will result:
rhive.connect()
Error in .jcall("java/lang/Class", "Ljava/lang/Class;", "forName", cl, :
No running JVM detected. Maybe .jinit() would help.
Error in .jfindClass(as.character(class)) :
No running JVM detected. Maybe .jinit() would help.
In this case, set HIVE_HOME and HADOOP_HOME as shown below, or exit R, configure the environment variables, and restart R.
Sys.setenv(HIVE_HOME="/service/hive-0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop-0.20.203.0")
rhive.init()
Or,
close R
export HIVE_HOME="/service/hive-0.7.1"
export HADOOP_HOME="/service/hadoop-0.20.203.0"
open R
rhive.connect
All RHive functions work only after a connection to the Hive server has been established. If you have not established a connection with the rhive.connect function before using other RHive functions, they will fail with errors like the following:
Error in .jcast(hiveclient[[1]], new.class = "org/apache/hadoop/hive/service/HiveClient", :
cannot cast anything but Java objects
Establishing a connection to the Hive server is simple:
rhive.connect()
rhive.connect can also take additional arguments, such as the address of the Hive server, and returns a connection object:
rhiveConnection <- rhive.connect("10.1.1.1")
If the Hive server is installed on a machine other than the one running RHive, connect remotely by passing the server address to the rhive.connect function as above.
If you have multiple Hadoop and Hive clusters configured for RHive and want to switch between them, make a connection to each and, just as with a database client such as MySQL, pass the connection to RHive functions as an argument to select it explicitly.
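As a rough sketch, switching between two clusters might look like the following. The server addresses are hypothetical placeholders, and only the rhive.connect and rhive.close calls shown in this tutorial are used; a running Hive server is required.

```r
# Sketch only: both IP addresses below are hypothetical placeholders.
# Each call to rhive.connect returns a connection object that can be
# passed to RHive functions to select a cluster explicitly.
conn1 <- rhive.connect("10.1.1.1")  # first Hive cluster
conn2 <- rhive.connect("10.1.1.2")  # second Hive cluster

# ... run queries against either cluster here ...

rhive.close(conn1)
rhive.close(conn2)
```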
rhive.query
If you have experience using Hive, you probably know that Hive supports a SQL-like syntax for handling data on Map/Reduce and HDFS. rhive.query submits SQL to Hive and receives the results. Users who know SQL will find examples like the following familiar:
rhive.query("SELECT * FROM usarrests")
Running the example above prints the contents of the table named 'usarrests' to the screen. Instead of only printing the returned result, you can also assign it to a data.frame object:
resultDF <- rhive.query("SELECT * FROM usarrests")
Beware: if the data returned by rhive.query is larger than the available memory of the machine running RHive, memory will be exhausted and an error will result. Do not assign data of that size to an R object. It is better to first create a temporary table and insert the results of the SQL into it, as follows:
rhive.query("
CREATE TABLE new_usarrests (
rowname string,
murder double,
assault int,
urbanpop int,
rape double
)")
rhive.query("INSERT OVERWRITE TABLE new_usarrests SELECT * FROM usarrests")
Consult the Hive documentation for a detailed account of how to use Hive SQL.
rhive.close
When you have finished using Hive and no longer need RHive functions, use the rhive.close function to terminate the connection.
rhive.close()
Alternatively, you can close a specific connection:
conn <- rhive.connect()
rhive.close(conn)
rhive.list.tables
The rhive.list.tables function returns the list of tables in Hive.
rhive.list.tables()
tab_name
1 aids2
2 new_usarrests
3 usarrests
This is effectively identical to this:
rhive.query("SHOW TABLES")
rhive.desc.table
The rhive.desc.table function shows the description of the chosen table.
rhive.desc.table("usarrests")
col_name data_type comment
1 rowname string
2 murder double
3 assault int
4 urbanpop int
5 rape double
This is effectively identical to this:
rhive.query("DESC usarrests")
rhive.load.table
The rhive.load.table function loads a Hive table's contents into an R data.frame object.
df1 <- rhive.load.table("usarrests")
df1
This is effectively identical to this:
df1 <- rhive.query("SELECT * FROM usarrests")
df1
rhive.write.table
The rhive.write.table function is the opposite of rhive.load.table, but it is even more useful. Ordinarily, to add data to Hive you must first create a table. rhive.write.table requires no such extra work: it creates a Hive table from an R data.frame and inserts all the data in one step.
head(UScrime)
M So Ed Po1 Po2 LF M.F Pop NW U1 U2 GDP Ineq Prob Time y
1 151 1 91 58 56 510 950 33 301 108 41 394 261 0.084602 26.2011 791
2 143 0 113 103 95 583 1012 13 102 96 36 557 194 0.029599 25.2999 1635
3 142 1 89 45 44 533 969 18 219 94 33 318 250 0.083401 24.3006 578
4 136 0 121 149 141 577 994 157 80 102 39 673 167 0.015801 29.9012 1969
5 141 0 121 109 101 591 985 18 30 91 20 578 174 0.041399 21.2998 1234
6 121 0 110 118 115 547 964 25 44 84 29 689 126 0.034201 20.9995 682
rhive.write.table(UScrime)
[1] "UScrime"
rhive.list.tables()
tab_name
1 aids2
2 new_usarrests
3 usarrests
4 uscrime
rhive.query("SELECT * FROM uscrime LIMIT 10")
rowname m so ed po1 po2 lf mf pop nw u1 u2 gdp ineq prob time
1 1 151 1 91 58 56 510 950 33 301 108 41 394 261 0.084602 26.2011
2 2 143 0 113 103 95 583 1012 13 102 96 36 557 194 0.029599 25.2999
3 3 142 1 89 45 44 533 969 18 219 94 33 318 250 0.083401 24.3006
4 4 136 0 121 149 141 577 994 157 80 102 39 673 167 0.015801 29.9012
5 5 141 0 121 109 101 591 985 18 30 91 20 578 174 0.041399 21.2998
6 6 121 0 110 118 115 547 964 25 44 84 29 689 126 0.034201 20.9995
7 7 127 1 111 82 79 519 982 4 139 97 38 620 168 0.042100 20.6993
8 8 131 1 109 115 109 542 969 50 179 79 35 472 206 0.040099 24.5988
9 9 157 1 90 65 62 553 955 39 286 81 28 421 239 0.071697 29.4001
10 10 140 0 118 71 68 632 1029 7 15 100 24 526 174 0.044498 19.5994
y
1 791
2 1635
3 578
4 1969
5 1234
6 682
7 963
8 1555
9 856
10 705
The rhive.write.table function fails with an error if the table to be saved already exists in Hive. Hence, before saving a data.frame whose name matches an existing Hive table, delete the existing table first:
if (rhive.exist.table("uscrime")) {
rhive.query("DROP TABLE uscrime")
}
rhive.write.table(UScrime)
RHive - alias functions
RHive's function names look as if they follow S3 generic naming conventions, but many of them are not actually generic; the naming leaves room for S3 generics that RHive may or may not support in the future. For users bothered by the confusion of functions that contain "." yet are not generic, RHive provides functions with different names that serve the same roles. The alias functions are as follows.
hiveConnect This is the same as rhive.connect.
hiveQuery This is the same as rhive.query.
hiveClose This is the same as rhive.close.
hiveListTables This is the same as rhive.list.tables.
hiveDescTable This is the same as rhive.desc.table.
hiveLoadTable This is the same as rhive.load.table.
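For instance, the aliases can be used interchangeably with their rhive.* counterparts. This sketch reuses only functions introduced in this tutorial and requires a running Hive server:

```r
# The alias functions behave identically to their rhive.* counterparts.
conn <- hiveConnect()                               # same as rhive.connect()
hiveListTables()                                    # same as rhive.list.tables()
df <- hiveQuery("SELECT * FROM usarrests LIMIT 5")  # same as rhive.query(...)
hiveDescTable("usarrests")                          # same as rhive.desc.table(...)
hiveClose(conn)                                     # same as rhive.close(conn)
```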
rhive.basic.cut
rhive.basic.cut converts a numerical column of a table into a factorized column. The range of the numerical column is divided into intervals, and the values are factorized according to which interval they fall into. rhive.basic.cut takes six arguments: tablename (a table name), col (a numerical column name), breaks, right, summary, and forcedRef. breaks gives the numerical cut points for the column. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. With summary = TRUE, the function returns the total count of values falling into each interval; with summary = FALSE, it returns the name of a new table containing the factorized column. forcedRef = TRUE forces rhive.basic.cut to return a table name rather than a data.frame, which is returned for forcedRef = FALSE. The defaults of right, summary, and forcedRef are TRUE, FALSE, and TRUE, respectively.
Example for summary = FALSE
> table_name = rhive.basic.cut(tablename = "iris", col = "sepallength", breaks = seq(0, 5, 0.5), right = FALSE, summary = FALSE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330382904"
attr(,"result:size")
[1] 4296
> results = rhive.query("select * from rhive_result_1330382904")
> head(results)
rowname sepalwidth petallength petalwidth species sepallength
1 1 3.5 1.4 0.2 setosa NULL
2 2 3.0 1.4 0.2 setosa [4.5,5.0)
3 3 3.2 1.3 0.2 setosa [4.5,5.0)
4 4 3.1 1.5 0.2 setosa [4.5,5.0)
5 5 3.6 1.4 0.2 setosa NULL
6 6 3.9 1.7 0.4 setosa NULL
Example for summary = TRUE
> summary = rhive.basic.cut(tablename = "iris", col = "sepallength", breaks = seq(0, 5, 0.5), right = FALSE, summary = TRUE, forcedRef = TRUE)
> summary
NULL [4.0,4.5) [4.5,5.0)
128 4 18
rhive.basic.cut2
rhive.basic.cut2 converts two numerical columns of a table into two factorized columns. That is, the range of each column is divided into intervals, and the values in each column are factorized according to which interval they fall into. rhive.basic.cut2 takes eight arguments: tablename (a table name), col1 and col2 (two column names), breaks1, breaks2, right, keepCol, and forcedRef. breaks1 and breaks2 give the numerical cut points for the two columns. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. keepCol = TRUE keeps the two original numerical columns after the conversion; otherwise, the factorized columns replace the original numerical columns. forcedRef = TRUE forces rhive.basic.cut2 to return a table name rather than a data.frame, which is returned for forcedRef = FALSE. The defaults of right, keepCol, and forcedRef are TRUE, FALSE, and TRUE, respectively.
Example for right = TRUE and keepCol = FALSE
> table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2 = "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right = TRUE, keepCol = FALSE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330385833"
attr(,"result:size")
[1] 5272
> results = rhive.query("select * from rhive_result_1330385833")
> head(results)
rowname sepalwidth petalwidth species sepallength petallength rep
1 1 3.5 0.2 setosa NULL (1.0,1.5] 1
2 2 3.0 0.2 setosa (4.5,5.0] (1.0,1.5] 1
3 3 3.2 0.2 setosa (4.5,5.0] (1.0,1.5] 1
4 4 3.1 0.2 setosa (4.5,5.0] (1.0,1.5] 1
5 5 3.6 0.2 setosa (4.5,5.0] (1.0,1.5] 1
6 6 3.9 0.4 setosa NULL (1.5,2.0] 1
Example for right = FALSE and keepCol = TRUE
> table_name = rhive.basic.cut2(tablename = "iris", col1 = "sepallength", col2 = "petallength", breaks1 = seq(0, 5, 0.5), breaks2 = seq(0, 5, 0.5), right = FALSE, keepCol = TRUE, forcedRef = TRUE)
> table_name
[1] "rhive_result_1330315663"
attr(,"result:size")
[1] 6374
> results = rhive.query("select * from rhive_result_1330315663")
> head(results)
rowname sepalwidth petalwidth species sepallength sepallength_cut petallength petallength_cut rep
1 1 3.5 0.2 setosa 5.1 NULL 1.4 [1.0,1.5) 1
2 2 3.0 0.2 setosa 4.9 [4.5,5.0) 1.4 [1.0,1.5) 1
3 3 3.2 0.2 setosa 4.7 [4.5,5.0) 1.3 [1.0,1.5) 1
4 4 3.1 0.2 setosa 4.6 [4.5,5.0) 1.5 [1.5,2.0) 1
5 5 3.6 0.2 setosa 5.0 NULL 1.4 [1.0,1.5) 1
rhive.basic.xtabs
rhive.basic.xtabs builds a contingency table by cross-classifying factors. It takes a formula object and a table name as input arguments and returns a contingency table in matrix format based on the given formula. For instance, in the formula "ncontrols ~ agegp + alcgp", the column names agegp and alcgp are the cross-classifying factors, and the observations for each combination of those factors are summed over the column ncontrols.
Example for esoph data
> xtab_formula = as.formula(paste("ncontrols","~", "agegp", "+","alcgp",sep =""))
> xtab_formula
ncontrols ~ agegp + alcgp
> table_result = rhive.basic.xtabs(formula = xtab_formula, tablename = "esoph")
> head(table_result)
alcgp
agegp 0-39g/day 120+ 40-79 80-119
25-34 61 5 45 5
35-44 89 10 80 20
45-54 78 15 81 39
55-64 89 26 84 43
65-74 71 8 53 29
75+ 27 3 12 2
rhive.basic.t.test
The rhive.basic.t.test function runs Welch's t-test on two samples. The difference between the two sample means is tested against the alternative hypothesis that the difference is not 0; that is, a two-sided test is performed.
The following example tests the difference between the means of the irises' sepal lengths and petal lengths. Note how the function is called with the "sepallength" and "petallength" columns.
> rhive.basic.t.test("iris", "sepallength", "iris", "petallength")
[1] "t = 13.1422338118038, df = 211.542688378717, p-value = 0, mean of x : 5.84333333333333, mean of y : 3.758"
$statistic
t
13.14223
$parameter
df
211.5427
$p.value
[1] 0
$estimate
$estimate[[1]]
mean of x
5.843333
$estimate[[2]]
mean of y
3.758
>
The p-value of 0 indicates a significant difference between the means of sepal length and petal length. The resulting statistics are returned as an R list object, and a string summarizing them is printed to the console.
The iris data shipped with R has 150 observations. Running R's t.test on the same data yields a slightly different t-statistic of 13.0984. The difference arises because t.test computes the t-statistic using the sample variance, while rhive.basic.t.test uses the population variance. With little data, as in this example, the t-statistics deviate slightly, but the deviation shrinks as the data grows. Since rhive.basic.t.test was designed with massive data analysis in mind, it uses the population variance for fast calculation.
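The difference between the two variance conventions can be reproduced in plain R, without Hive, using the iris data. This is a sketch of the calculation, not RHive's actual implementation:

```r
# Reproduce the gap between the sample-variance t-statistic (R's t.test)
# and the population-variance t-statistic (as reported by rhive.basic.t.test).
x <- iris$Sepal.Length
y <- iris$Petal.Length
n1 <- length(x)
n2 <- length(y)

# Welch t-statistic with the sample variance (denominator n - 1), as t.test uses
t_sample <- (mean(x) - mean(y)) / sqrt(var(x) / n1 + var(y) / n2)

# Welch t-statistic with the population variance (denominator n)
pvar <- function(v) var(v) * (length(v) - 1) / length(v)
t_pop <- (mean(x) - mean(y)) / sqrt(pvar(x) / n1 + pvar(y) / n2)

round(t_sample, 3)  # about 13.098, matching t.test
round(t_pop, 3)     # about 13.142, matching rhive.basic.t.test
```

With only 150 observations the two statistics differ in the second decimal place; as n grows, the factor sqrt(n / (n - 1)) separating them approaches 1.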
rhive.block.sample
The rhive.block.sample function samples data from a Hive table by block. Its optional percent argument sets the percentage of data to extract from the total data; the default is 0.01, meaning 0.01% of the total data. Note that percent is not the ratio of actually sampled rows to total rows but, rather, roughly the ratio of sampled blocks to total blocks: rhive.block.sample takes samples block by block.
Consequently, the entire data set may be returned when rhive.block.sample is used on a Hive table with little data. This occurs when the data is smaller than the block size configured in Hive.
The seed argument specifies the random seed used when executing block sampling in Hive. If the seeds are identical, Hive's block sampling returns the same results. Thus, to guarantee a different random sample on every call, it is best to assign a fresh value to the seed argument, for example by using R's sample function.
The subset argument is optional and specifies a condition on the data to be extracted from the target Hive table. It is a character value corresponding to the WHERE clause in Hive QL, so it must use syntax valid in an HQL WHERE clause.
rhive.block.sample returns, as a character value, the name of the Hive table containing the sampled blocks. That is, rhive.block.sample automatically creates a temporary Hive table from the sampled blocks and returns that table's name. The following example samples 0.01% of the data in the Hive table called listvirtualmachines, using R's sample function to generate the random seed for Hive's block sampling.
seedNumber <- sample(1:2^16, 1)
rhive.block.sample("listvirtualmachines", seed=seedNumber )
[1] "rhive_sblk_1330404552"
As this example shows, a Hive table named "rhive_sblk_1330404552", containing 0.01% of the data from the Hive table "listvirtualmachines", has been created.
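Combining the optional arguments described above might look like the following sketch. The column name "state" and its value are hypothetical and assume a suitable column exists in the table; the subset string is passed through to the HQL WHERE clause, so it must be valid HQL.

```r
# Sketch only: "state" is a hypothetical column of listvirtualmachines.
seedNumber <- sample(1:2^16, 1)
rhive.block.sample("listvirtualmachines",
                   percent = 0.1,                 # sample roughly 0.1% of blocks
                   seed    = seedNumber,          # fresh random seed per call
                   subset  = "state = 'Running'") # HQL WHERE-clause condition
```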
rhive.basic.scale
The rhive.basic.scale function standardizes a numerical column to mean 0 and standard deviation 1. The first argument is the table name and the second is the column to scale.
The returned list contains, as a string, the name of a new table with an added "scaled_<column name>" column. This table can be accessed and manipulated in RHive just like any other Hive table.
scaled <- rhive.basic.scale("iris", "sepallength")
attr(scaled, "scaled:center")
# [1] 5.843333
attr(scaled, "scaled:scale")
# [1] 0.8253013
> rhive.desc.table(scaled[[1]])
col_name data_type comment
# 1 rowname string
# 2 sepalwidth double
# 3 petallength double
# 4 petalwidth double
# 5 species string
# 6 sepallength double
# 7 sacled_sepallength double
rhive.basic.by
The rhive.basic.by function runs a group-by on a specified column. The code below groups by the "species" column and returns the result of applying the sum function to "sepallength". The results show the sum of sepallength for each species.
rhive.basic.by("iris", "species", "sum","sepallength")
# species sum
# 1 setosa 250.3
# 2 versicolor 296.8
# 3 virginica 329.4
rhive.basic.merge
rhive.basic.merge creates a new data set by merging two tables on the specified key columns.
# checking data
rhive.query('select * from iris limit 5')
rowname sepallength sepalwidth petallength petalwidth species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
rhive.query('select * from usarrests limit 5')
rowname murder assault urbanpop rape
1 Alabama 13.2 236 58 21.2
2 Alaska 10.0 263 48 44.5
3 Arizona 8.1 294 80 31.0
4 Arkansas 8.8 190 50 19.5
5 California 9.0 276 91 40.6
##rhive.basic.merge
rhive.basic.merge('iris','usarrests',by.x='sepallength',by.y='murder')
sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1 4.3 3.0 1.1 0.1 setosa 102 62 16.5 14
2 4.4 2.9 1.4 0.2 setosa 149 85 16.3 9
3 4.4 3.0 1.3 0.2 setosa 149 85 16.3 39
4 4.4 3.2 1.3 0.2 setosa 149 85 16.3 43
5 4.9 3.1 1.5 0.1 setosa 159 67 29.3 10
Merging is similar to a 'join' in SQL. The following join query produces the same result as the merge above.
# When row names overlap between the two tables, only the first (left) table's rowname is kept in the result.
rhive.big.query('select a.sepallength,a.sepalwidth,a.petallength,a.petalwidth,a.species,b.assault,b.urbanpop,b.rape,a.rowname from iris a join usarrests b on a.sepallength = b.murder')
sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1 4.3 3.0 1.1 0.1 setosa 102 62 16.5 14
2 4.4 2.9 1.4 0.2 setosa 149 85 16.3 9
3 4.4 3.0 1.3 0.2 setosa 149 85 16.3 39
4 4.4 3.2 1.3 0.2 setosa 149 85 16.3 43
5 4.9 3.1 1.5 0.1 setosa 159 67 29.3 10
rhive.basic.mode
rhive.basic.mode returns the mode and its frequency for a specified column of a Hive table.
rhive.basic.mode('iris', 'sepallength')
sepallength freq
1 5 10
rhive.basic.range
rhive.basic.range returns the minimum and maximum values of a specified numerical column of a Hive table.
rhive.basic.range('iris', 'sepallength')
[1] 4.3 7.9