Random Query Generator for Hive
November 2015 Hive Contributor Meetup
Szehon Ho
© 2014 Cloudera, Inc. All rights reserved.
Overview
• Collaboration with the Impala team; work to run against Hive
• Automates generation of test cases, which solves:
  • Humans can only generate so many test queries
  • Humans focus on positive queries (what about machine-generated queries?)
• The idea is to have two databases: a test database (Hive, Impala) and a reference database (Postgres, MySQL, Oracle)
• Generate random data, then issue random queries against both
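The test-versus-reference idea can be sketched in a few lines. This is only an illustration, not the tool's code: two in-memory SQLite databases stand in for the test and reference databases, and the table `t`, its columns, and the helper names are hypothetical.

```python
import random
import sqlite3


def make_db(rows):
    """Create an in-memory database and load the given rows into table t."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


def random_query(rng):
    """Build a simple random filter query (hypothetical, very small grammar)."""
    col = rng.choice(["a", "b"])
    op = rng.choice(["<", ">", "=", "!="])
    return f"SELECT a, b FROM t WHERE {col} {op} {rng.randint(0, 9)}"


def run_comparison(seed, n_queries=20):
    """Load identical random data into both databases and compare query results."""
    rng = random.Random(seed)
    rows = [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(50)]
    test_db, ref_db = make_db(rows), make_db(rows)  # stand-ins for Hive / Postgres
    mismatches = []
    for _ in range(n_queries):
        sql = random_query(rng)
        # The queries have no ORDER BY, so sort before comparing.
        got = sorted(test_db.execute(sql).fetchall())
        want = sorted(ref_db.execute(sql).fetchall())
        if got != want:
            mismatches.append(sql)
    return mismatches


mismatches = run_comparison(seed=0)
```

With two identical engines the mismatch list stays empty; against two different engines, every entry is a candidate bug or dialect discrepancy to investigate.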
Data Generator
• Table-count (max, min)
• Column-count (max, min)
• Row-count (max, min)

Column Data Types
Boolean     Float
TinyInt     Decimal(r_precision, r_scale)
SmallInt    Char(r_length)
BigInt      Varchar(r_length)
Double      Timestamp
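A minimal sketch of how such a data generator might pick a random schema. The function names, the `(min, max)` tuple order, and the numeric bounds used for precision, scale, and length are assumptions made for illustration, not values taken from the tool.

```python
import random

SIMPLE_TYPES = ["Boolean", "TinyInt", "SmallInt", "BigInt",
                "Double", "Float", "Timestamp"]


def random_type(rng):
    """Pick a column type; parameterized types get random parameters."""
    kind = rng.choice(SIMPLE_TYPES + ["Decimal", "Char", "Varchar"])
    if kind == "Decimal":
        precision = rng.randint(1, 38)      # assumed upper bound
        scale = rng.randint(0, precision)   # scale never exceeds precision
        return f"Decimal({precision},{scale})"
    if kind in ("Char", "Varchar"):
        return f"{kind}({rng.randint(1, 255)})"  # assumed length range
    return kind


def random_schema(rng, tables=(1, 3), columns=(1, 5), rows=(0, 100)):
    """Each argument is an assumed (min, max) range, as in the slide's settings."""
    schema = []
    for t in range(rng.randint(*tables)):
        cols = [(f"col_{c}", random_type(rng))
                for c in range(rng.randint(*columns))]
        schema.append({"name": f"table_{t}",
                       "columns": cols,
                       "row_count": rng.randint(*rows)})
    return schema


schema = random_schema(random.Random(1))
```

Seeding the random generator keeps a run reproducible, which matters when a failing query has to be regenerated later.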
Query Generator
1. Generate a QueryModel based on a QueryProfile (HiveProfile, ImpalaProfile)
2. A ModelTranslator translates the Model into the database’s SQL dialect
3. Execute the SQL via DbConnectors
4. Result comparison (sort if unsorted)

Translators applied to the QueryModel:
• “Test databases”
  • HiveTranslator: HiveQL
• “Reference databases”
  • PostgresTranslator: SQL (Postgres dialect)
  • MysqlTranslator: SQL (Mysql dialect)
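The translation step can be illustrated with a toy model. The class names below mirror the ones on the slide, but the bodies are hypothetical: the only dialect difference shown is identifier quoting (backticks in Hive and MySQL, double quotes in Postgres), and the dictionary-based model is an assumption.

```python
class Translator:
    """Renders one dialect-neutral model as SQL text for a specific database."""
    quote = '"'

    def ident(self, name):
        return f"{self.quote}{name}{self.quote}"

    def translate(self, model):
        cols = ", ".join(self.ident(c) for c in model["select"])
        return f"SELECT {cols} FROM {self.ident(model['table'])}"


class HiveTranslator(Translator):
    quote = "`"   # Hive quotes identifiers with backticks


class PostgresTranslator(Translator):
    quote = '"'   # Postgres uses double quotes


class MysqlTranslator(Translator):
    quote = "`"   # MySQL uses backticks by default


model = {"select": ["a", "b"], "table": "t"}   # hypothetical model shape
hive_sql = HiveTranslator().translate(model)
postgres_sql = PostgresTranslator().translate(model)
```

Keeping the model dialect-neutral means a single generated query yields one statement per database, so any difference in results points at the engines rather than at the generator.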
Query Model, High Level
Hierarchy: Query > Clause > Constant/Col, Funcs, TableExpr
• Represents a valid SQL query
• A query consists of one or more clauses (from, select, group-by, union)
• A clause has one or more expressions (constants, columns, functions of columns, tables), which differ by clause type
• The model is recursive in nature:
  • Funcs can be run on the output of other funcs
  • A union clause can contain another query
  • Some boolean funcs can contain a subquery
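The recursive shape of the model can be sketched with a few dataclasses. The class and field names here are hypothetical and far smaller than a real query model; the point is only that a func may take another func or a whole query as an argument.

```python
from dataclasses import dataclass


@dataclass
class Col:
    name: str

    def sql(self):
        return self.name


@dataclass
class Const:
    value: int

    def sql(self):
        return str(self.value)


@dataclass
class Func:
    name: str
    args: list  # each argument may be a Col, Const, Func, or Query

    def sql(self):
        return f"{self.name}({', '.join(a.sql() for a in self.args)})"


@dataclass
class Query:
    select: list            # expressions in the select clause
    from_table: str
    where: object = None    # optional boolean expression

    def sql(self):
        text = "SELECT " + ", ".join(e.sql() for e in self.select)
        text += " FROM " + self.from_table
        if self.where is not None:
            text += " WHERE " + self.where.sql()
        return text


# A func applied to the output of another func, plus a boolean func
# that contains a subquery.
inner = Query([Const(1)], "s")
query = Query([Func("abs", [Func("floor", [Col("a")])])], "t",
              where=Func("exists", [inner]))
rendered = query.sql()
```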
Query Model, Funcs
• Func types:
  • Boolean funcs (isnull, and, or, in, =, !=, >, <)
  • Subquery funcs (exists, not exists, in, not in): may contain another Query
  • Val funcs (Trim, Length, Concat, Add, Abs, Floor, Ceil, Greatest, Least, etc.)
  • Agg funcs (e.g. Max, Min, Sum, Avg, Count)
  • Analytic funcs (Rank, DenseRank, RowNumber, Lead, Lag, FirstValue, LastValue, Max, Min, etc.)
    • Window specification (“rows between x and y”, “rows unbounded preceding”, etc.)
    • PartitionByClause (“over (partition by x)”)
    • OrderByClause
• Rules determine where a func can be used, based on func type and return type
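One way such placement rules could be expressed is a table of signatures filtered by clause and by required return type. The signature list, the type names, and the clause rules below are illustrative assumptions; the only SQL facts relied on are that aggregate and analytic functions cannot appear in a WHERE clause.

```python
from collections import namedtuple

Signature = namedtuple("Signature", "name kind returns args")

# Hypothetical, heavily simplified signatures.
SIGNATURES = [
    Signature("IsNull", "boolean", "Boolean", ("Int",)),
    Signature("Length", "val", "Int", ("String",)),
    Signature("Abs", "val", "Int", ("Int",)),
    Signature("Trim", "val", "String", ("String",)),
    Signature("Concat", "val", "String", ("String", "String")),
    Signature("Max", "agg", "Int", ("Int",)),
    Signature("Count", "agg", "Int", ("Int",)),
    Signature("Rank", "analytic", "Int", ()),
]

# Which func kinds each clause accepts (assumed rule table).
ALLOWED_KINDS = {
    "where": {"boolean", "val"},
    "select": {"boolean", "val", "agg", "analytic"},
    "having": {"boolean", "val", "agg"},
}


def candidates(clause, returns):
    """Funcs that may appear in the clause and produce the required type."""
    return [s for s in SIGNATURES
            if s.kind in ALLOWED_KINDS[clause] and s.returns == returns]


where_int = [s.name for s in candidates("where", "Int")]
select_int = [s.name for s in candidates("select", "Int")]
```

A generator can then fill each argument slot recursively by calling `candidates` again with the argument's type, which is how nested funcs arise.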
QueryModel: Clauses
• QueryModel
  • WithClause
  • SelectClause
  • FromClause: Table Expression
  • WhereClause:
    • Predicate (Boolean expr)
  • GroupByClause: if Select (Basic or AggFunc)
  • HavingClause: if Select (AggFunc)
    • Predicate (Boolean expr)
  • UnionClause (Query)
  • OrderByClause
  • LimitClause

• SelectClause, List of Expr’s:
  • Constant
  • Col
  • Val Funcs
  • AggFunc
  • AnalyticFunc
    • Window
    • PartitionByClause
    • OrderByClause

WithClause: Adds a table expression:
“With bar as (select * from foo) select * from bar;”

GroupByClause, List of:
• Constant
• Col

OrderByClause, List of:
• Constant
• Col
• Func
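The WithClause example can be tried directly. The snippet below runs it against an in-memory SQLite database, used here only as a convenient stand-in engine; the contents of `foo` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (x INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?)", [(1,), (2,), (3,)])

# The WITH clause introduces the table expression "bar", which the main
# query then reads from like an ordinary table.
with_rows = conn.execute(
    "WITH bar AS (SELECT * FROM foo) SELECT * FROM bar"
).fetchall()
direct_rows = conn.execute("SELECT * FROM foo").fetchall()
```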
QueryModel: Joins
• QueryModel
  • WithClause
  • SelectClause
  • FromClause:
    • Multiple table expressions
    • JoinClause (defines the table relationship)
  • WhereClause:
    • Predicate (Boolean function, using expr from tables in JoinClause)
  • GroupByClause
  • HavingClause

• JoinClause Types:
  • Inner
  • Left
  • Right
  • Left semi
  • Right semi
  • Right anti
  • Full outer
  • Cross
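Not every database has a keyword for each join type, so a reference query sometimes needs an equivalent form. As an illustration of the semantics only (how the tool itself handles this is not stated here), a left semi join can be written with EXISTS; the tables `l` and `r` and their rows are made up, and SQLite stands in for the reference database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE l (k INTEGER);
    CREATE TABLE r (k INTEGER);
    INSERT INTO l VALUES (1), (2), (2), (3);
    INSERT INTO r VALUES (2), (2), (4);
""")

# Left semi join expressed with EXISTS: each left row appears at most once,
# however many right rows match it.
semi = conn.execute(
    "SELECT l.k FROM l WHERE EXISTS (SELECT 1 FROM r WHERE r.k = l.k) "
    "ORDER BY l.k"
).fetchall()

# Inner join for contrast: one output row per matching pair.
inner = conn.execute(
    "SELECT l.k FROM l INNER JOIN r ON l.k = r.k ORDER BY l.k"
).fetchall()
```

Here the semi join keeps the two left rows with key 2, while the inner join returns four rows because each of them matches two right rows.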
Demo
Results 1: HiveQL Discrepancies
• Language Deficiencies (as of Hive 1.1)
  • No support for “Interval” in date arithmetic operations: date + INTERVAL expr unit
  • With {…} cannot be used in a subquery
  • Having must have a group by
  • Cannot sort by two expressions in a window function unless a window is specified
  • Negative lag or lead amount not allowed
  • Only “Union all” and not “Union” (since fixed)
• Null Ordering
  • Hive lacks a way to specify null order (its default is the opposite of Postgres)
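Because the two databases place NULLs at opposite ends of a sorted result, a comparison step has to normalize row order before checking equality. A minimal sketch, assuming rows arrive as Python tuples with `None` for NULL; the function name and sample rows are hypothetical.

```python
def sort_rows(rows, nulls_first=True):
    """Sort rows with an explicit NULL placement, never comparing None to a value."""
    def key(row):
        # The flag sorts before the value, so None is only ever compared
        # with another None (which is an equality check, not an ordering).
        return tuple(((v is not None) if nulls_first else (v is None), v)
                     for v in row)
    return sorted(rows, key=key)


# Same rows, as two engines with opposite NULL ordering might return them.
from_test_db = [(None,), (1,), (2,)]   # NULLs first
from_ref_db = [(1,), (2,), (None,)]    # NULLs last

normalized_test = sort_rows(from_test_db)
normalized_ref = sort_rows(from_ref_db)
nulls_last = sort_rows(from_test_db, nulls_first=False)
```

After normalization the two results compare equal, so only genuine content differences are reported.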
Results 2: JIRAs so far
• Many valid issues found, fixed since 1.1
• HIVE-12082: Null comparison for greatest and least operator
• HIVE-12070: Relax type restrictions on ‘Greatest’ and ‘Least’
• HIVE-11737: IndexOutOfBounds compiling query with duplicated groupby keys
• HIVE-11712: Duplicate groupby keys cause ClassCastException
• HIVE-11835: Type decimal(1,1) reads 0.0, 0.00, etc. from text file as NULL
• HIVE-12296: ClassCastException when selecting constant in inner select (pending)
Going Forward
• Tackle non-SQL-92 query support
  • Nested types
  • Partitioned tables
  • Multi-insert
Thank you.