Start this tutorial by launching the Zeppelin notebook 6_ORC and Parquet files.
To create ORC and Parquet files you will switch to the Spark context. More information about Spark SQL methods and functions can be found here:
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/types/package-summary.html
Run the highlighted paragraphs in the notebook.

In the next step, use StructType to create a structured data frame from the data set. StructType is a built-in data type in Spark SQL that represents a collection of fields.
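As a rough sketch, a schema for the product data might look like the following; the actual column names and types in the tutorial's data set may differ, so check the notebook paragraph for the exact definition.

```scala
import org.apache.spark.sql.types._

// Illustrative schema: field names and types are assumptions,
// not the tutorial's exact definition.
val productSchema = StructType(Seq(
  StructField("PRODUCT_ID",  StringType, nullable = true),
  StructField("DESCRIPTION", StringType, nullable = true),
  StructField("PRICE",       DoubleType, nullable = true)
))
```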

Then create a fixed value called options that stores the required options as strings. These will be passed in as Vora options.
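A minimal sketch of such an options map is shown below. The keys and host names here are placeholders; the set of options your landscape requires (hosts, ZooKeeper URLs, and so on) is defined in the notebook itself.

```scala
// Hypothetical Vora option keys and host names -- substitute the
// values given in your notebook for your cluster.
val options = Map(
  "hosts"       -> "host1,host2",
  "zkurls"      -> "host1:2181,host2:2181",
  "namenodeurl" -> "host1:8020"
)
```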

Create a data frame with the defined schema and options. This also includes the source file being loaded.
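Assuming a schema and options map like the sketches above, loading the source file could look roughly like this. The CSV data source package and the input path are assumptions for illustration.

```scala
// Sketch only: in Spark 1.6 a CSV file is typically read through the
// spark-csv package; your notebook may use a different source or path.
val productDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(productSchema)
  .options(options)
  .load("/user/vora/products.csv")
```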

Select the desired columns from the PRODUCT_CSV data frame and write them to HDFS in Parquet format. If you receive an error that the file already exists, change the output file to something like /user/vora/products_p_2.parquet to ensure its uniqueness.
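The write step can be sketched as follows; the column names and output path are illustrative and should match your notebook.

```scala
// Select illustrative columns and write them to HDFS as Parquet.
// Change the path (e.g. products_p_2.parquet) if it already exists.
productDf
  .select("PRODUCT_ID", "PRICE")
  .write
  .parquet("/user/vora/products_p.parquet")
```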
Move to the next paragraph and create an ORC output file. If you receive an error that the file already exists, update the statement and change the output file to something like /user/vora/products_o_2.orc.
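The ORC write is analogous; note that in Spark 1.6 the ORC writer requires Hive support, which the Zeppelin Spark context normally provides. Paths and columns are again illustrative.

```scala
// Write the same selection to HDFS as ORC. Change the path
// (e.g. products_o_2.orc) if it already exists.
productDf
  .select("PRODUCT_ID", "PRICE")
  .write
  .orc("/user/vora/products_o.orc")
```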
Now you can create an in-memory Vora resident table, using the ORC and Parquet files to load the data. Ensure the paths option is updated with your changes from the previous step.
Run the next paragraph and select data from the newly created table containing data loaded from your Parquet file.
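A sketch of such a table definition is shown below. The data source name and option keys follow the Vora 1.x convention, but the exact statement, table name, and option set are those given in your notebook.

```scala
// Illustrative Vora table over the Parquet file written earlier.
// Update "paths" if you changed the output file name above.
sqlContext.sql("""
  CREATE TABLE PRODUCTS_PARQUET (PRODUCT_ID string, PRICE double)
  USING com.sap.spark.vora
  OPTIONS (
    tableName "PRODUCTS_PARQUET",
    paths "/user/vora/products_p.parquet",
    format "parquet"
  )""")

// Select from the newly created table.
sqlContext.sql("SELECT * FROM PRODUCTS_PARQUET").show()
```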

The final step of this tutorial is to create a table from your ORC-based file and select data from it. Ensure the paths option is updated with your changes from the previous step.
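Under the same assumptions as the Parquet sketch above, the ORC-based table could look like this; again, match the table name, paths, and options to your notebook.

```scala
// Illustrative Vora table over the ORC file written earlier.
// Update "paths" if you changed the output file name above.
sqlContext.sql("""
  CREATE TABLE PRODUCTS_ORC (PRODUCT_ID string, PRICE double)
  USING com.sap.spark.vora
  OPTIONS (
    tableName "PRODUCTS_ORC",
    paths "/user/vora/products_o.orc",
    format "orc"
  )""")

sqlContext.sql("SELECT * FROM PRODUCTS_ORC").show()
```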