Connect to Hive from Apache Spark
I have a simple program that I'm running on a standalone Cloudera VM. I have created a managed table in Hive, and I want to read it in Apache Spark, but the initial connection to Hive is not being established. Please advise.
I'm running the program in IntelliJ, and I have copied hive-site.xml from /etc/hive/conf to /etc/spark/conf, but the Spark job is not connecting to the Hive metastore.
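A quick way to check whether hive-site.xml is actually visible to the program is a classpath lookup before building the session. A minimal sketch (plain JDK calls, nothing Spark-specific):

// If this prints null, hive-site.xml is not on the classpath and Spark
// will fall back to an embedded Derby metastore and a local spark-warehouse dir.
java.net.URL hiveSite = Thread.currentThread().getContextClassLoader()
        .getResource("hive-site.xml");
System.out.println("hive-site.xml found at: " + hiveSite);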
import org.apache.spark.SparkContext;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.hive.HiveContext;

public static void main(String[] args) throws AnalysisException {
    String master = "local[*]";

    SparkSession sparkSession = SparkSession
            .builder()
            .appName(ConnectToHive.class.getName())
            .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
            .enableHiveSupport()
            .master(master)
            .getOrCreate();

    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel("ERROR");

    SQLContext sqlCtx = sparkSession.sqlContext();

    HiveContext hiveContext = new HiveContext(sparkSession);
    hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");

    hiveContext.sql("show databases").show();
    hiveContext.sql("show tables").show();

    sparkSession.close();
}
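(Side note: in Spark 2.x, HiveContext is deprecated and enableHiveSupport() already makes the session Hive-aware, so the same queries can go through the session directly. A minimal sketch, assuming the sparkSession above; Dataset and Row come from org.apache.spark.sql:)

// With enableHiveSupport() the session itself talks to the metastore,
// so no HiveContext is needed.
sparkSession.sql("show databases").show();
sparkSession.sql("show tables").show();

// Once the metastore connection works, the managed table can be read as:
Dataset<Row> employee = sparkSession.table("default.employee");
employee.show();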
The output is below. I expect to see the "employee" table so that I can query it. Since I'm running on the standalone VM, the Hive metastore is in the local MySQL server.
+------------+
|databaseName|
+------------+
|     default|
+------------+

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true is the configuration for the Hive metastore.
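For reference, that URL is the JDO connection setting in hive-site.xml; the relevant entry looks roughly like this (a sketch of the config; javax.jdo.option.ConnectionURL is the standard property name):

<!-- hive-site.xml (sketch): metastore backed by the local MySQL server -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
</property>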
hive> show databases;
OK
default
sxm
temp
Time taken: 0.019 seconds, Fetched: 3 row(s)
hive> use default;
OK
Time taken: 0.015 seconds
hive> show tables;
OK
employee
Time taken: 0.014 seconds, Fetched: 1 row(s)
hive> describe formatted employee;
OK
# col_name            data_type               comment
id                    string
firstname             string
lastname              string
addresses             array<struct<street:string,city:string,state:string>>

# Detailed Table Information
Database:             default
Owner:                cloudera
CreateTime:           Tue Jul 25 06:33:01 PDT 2017
LastAccessTime:       UNKNOWN
Protect Mode:         None
Retention:            0
Location:             hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee
Table Type:           MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1500989581

# Storage Information
SerDe Library:        org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:          org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:           No
Num Buckets:          -1
Bucket Columns:       []
Sort Columns:         []
Storage Desc Params:
        serialization.format    1
Time taken: 0.07 seconds, Fetched: 29 row(s)
hive>
Added the Spark logs:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/25 11:38:30 INFO SparkContext: Running Spark version 2.1.0
17/07/25 11:38:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 11:38:30 INFO SecurityManager: Changing view acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing view acls groups to:
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls groups to:
17/07/25 11:38:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); groups with view permissions: Set(); users with modify permissions: Set(cloudera); groups with modify permissions: Set()
17/07/25 11:38:31 INFO Utils: Successfully started service 'sparkDriver' on port 55232.
17/07/25 11:38:31 INFO SparkEnv: Registering MapOutputTracker
17/07/25 11:38:31 INFO SparkEnv: Registering BlockManagerMaster
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/25 11:38:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-eb1e611f-1b88-487f-b600-3da1ff8353db
17/07/25 11:38:31 INFO MemoryStore: MemoryStore started with capacity 1909.8 MB
17/07/25 11:38:31 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/25 11:38:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/25 11:38:31 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
17/07/25 11:38:31 INFO Executor: Starting executor ID driver on host localhost
17/07/25 11:38:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41433.
17/07/25 11:38:31 INFO NettyBlockTransferService: Server created on 10.0.2.15:41433
17/07/25 11:38:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/25 11:38:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:41433 with 1909.8 MB RAM, BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:32 INFO SharedState: Warehouse path is 'file:/home/cloudera/works/jsonhive/spark-warehouse/'.
17/07/25 11:38:32 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/07/25 11:38:32 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/07/25 11:38:32 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
17/07/25 11:38:32 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
17/07/25 11:38:32 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/07/25 11:38:32 INFO ObjectStore: ObjectStore, initialize called
17/07/25 11:38:32 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/07/25 11:38:32 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/07/25 11:38:34 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/07/25 11:38:35 INFO ObjectStore: Initialized ObjectStore
17/07/25 11:38:36 INFO HiveMetaStore: Added admin role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: Added public role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_all_databases
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_all_databases
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_functions: db=default pat=*
17/07/25 11:38:36 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/76258222-81db-4ac1-9566-1d8f05c3ecba_resources
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba/_tmp_space.db
17/07/25 11:38:36 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/cloudera/works/jsonhive/spark-warehouse/
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: default
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: default
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: global_temp
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: global_temp
17/07/25 11:38:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

+------------+
|databaseName|
+------------+
|     default|
+------------+

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

Process finished with exit code 0
UPDATE:
/usr/lib/hive/conf/hive-site.xml was not in the classpath, so it was not reading the tables; after adding it to the classpath it worked fine ... I only hit this problem because I'm running from IntelliJ .. in production, the spark-conf folder has a link to hive-site.xml ...
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
This is the hint that you're not connected to the remote Hive metastore (the one you've set up on MySQL), and that the XML file is not correctly on your classpath; without hive-site.xml, Spark falls back to an embedded Derby metastore and a local spark-warehouse directory.
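To see what the session actually picked up, you can print the effective settings. A small diagnostic sketch, assuming spark is your SparkSession:

// If hive.metastore.uris is null and the warehouse is a local file: path,
// the session is on the embedded Derby metastore, not your MySQL one.
System.out.println(spark.sparkContext().hadoopConfiguration().get("hive.metastore.uris"));
System.out.println(spark.conf().get("spark.sql.warehouse.dir"));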
You can also do this programmatically, without the XML file, before you create the SparkSession:
System.setProperty("hive.metastore.uris", "thrift://metastore:9083");
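Or pass it straight to the builder. A sketch, assuming the metastore service on the quickstart VM (9083 is the default thrift port; the host must match your setup):

// Sketch: set the metastore URI on the builder instead of a system property.
SparkSession spark = SparkSession.builder()
        .appName("ConnectToHive")
        .master("local[*]")
        // must point at a running Hive metastore service
        .config("hive.metastore.uris", "thrift://quickstart.cloudera:9083")
        .enableHiveSupport()
        .getOrCreate();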