Skip to Content
Previous

Import the MovieLens dataset

By Abdel DADOUCHE

Learn how to easily import flat dataset files in your SAP HANA, express edition instance using the Design-Time objects and Core Data Services (CDS) features

Details

You will learn

  • How to create CDS artifacts to expose your dataset as CDS entities.
  • How to load your dataset content in the CDS entities.
Step 1: Which MovieLens dataset?

In order to build your movie recommendation engine, you will be using one of the MovieLens dataset.

These datasets are made available by the GroupLens Research © group.

They have collected and made available movie rating data sets from the MovieLens web site which were collected over various periods of time.

The data set that you will be using for this series is the small version of the MovieLens Latest Datasets downloadable here.

This dataset, thanks to its size, can quickly be uploaded in your SAP HANA, express edition instance.

If you have additional capacity resources, you can run this tutorial series with the larger datasets, but the validations steps implemented were built based on the small dataset.

Before using these data sets, please review the README file for the usage licenses and other details.

Please log in to access this content.
Step 2: SAP HANA data import options

There are multiple ways to import data set files inside of your SAP HANA, express edition instance.

  • Eclipse IDE

The SAP HANA Tools plugin for Eclipse provides an Import/Export feature which would allow you to create the appropriate physical tables first and then import the data.

However, this would require the Eclipse IDE to be locally installed and properly configured with the SAP HANA Tools plugin.

Then, you would need to know the complete data file format description in order to create the tables with the proper columns structure. And, last but not least, any changes would require to recreate the all structure and reload the data.

If you want to learn more about this import method, you can check the following tutorial: Import CSV into SAP HANA, express edition using the SAP HANA Tools for Eclipse

  • SAP HANA HDB Client

The SAP HANA HDB Client provides an IMPORT FROM statement allowing you to import CSV files physically located on your SAP HANA, express edition host using a SQL command.

However, this method requires that the table are created before the execution of the command.

If you want to learn more about this import method, you can check the following tutorial: Import CSV into SAP HANA, express edition using IMPORT FROM SQL command

  • SAP HANA Persistence Model

The SAP HANA extended application services, advanced model, (XS advanced) provide a comprehensive platform for the development and execution of native data-intensive applications.

The application data model comprises all the database artifacts used to store and provision data for your application’s back end and user interface.

As part of the process of defining the database persistence model for your application, you can create database design-time artifacts such as tables and views, for example using Core Data Services (CDS).

At the same time, you can also create procedures and functions using SQLScript, which can be used to insert data into (and remove data from) tables or views.

Solution

As the purpose of this tutorial series is to discover how to use Machine Learning algorithms to build an end to end solution including a native SAP HANA application, you will be using the SAP HANA Persistence Model with the Core Data Service (CDS) artifacts in a SAP HANA Database Module.

Please log in to access this content.
Step 3: Create a SAP HANA Database Module

Open the Web IDE, and login using the XSA_DEV credentials.

Switch to the Development perspective using the Web IDE Development icon.

Web IDE

As a reminder the default URL for the Web IDE is:

  • https://hxehost:53075

A link to the Web IDE can also be found on the XSA Controller page at:

  • https://hxehost:39030

In the left panel, right click on the movielens project folder, then select New > SAP HANA Database Module.

Web IDE

Set the name to db and click on Next.

Web IDE

Set the following details on the next screen:

Name Value
Name Space aa.movielens.db
Schema Name ml
SAP HANA Database Version 2.0 SPS 03
Build module after creation checked

Click on Finish.

Web IDE
Please log in to access this content.
Step 4: Create the folder structure

Expand the movielens/db/src folder.

Create the following directory structure:

|-- movielens/db/src
    |-- data
    |-- hdb
        |-- apl
            |-- afllang
            |-- procedures
            |-- views
        |-- pal
            |-- afllang
            |-- procedures
            |-- views        
        |-- summary

You can use the right click on the target folder and select New > Folder.

Enter the folder name, then click on OK.

Please log in to access this content.
Step 5: Import the dataset CSV files

As the dataset files get updated periodically, and in order to ensure consistency of content and validation, a copy of the data is available under the data directory within the SAP Tutorial GitHub repository.

Download the following files locally (right click on the link, then use the Save link as option):

Again, before using these files, make sure you have reviewed the dataset README file for the usage licenses and other details.

In the left side panel, expand the movielens/db/src/data tree node.

Right click on the data folder, and use the Import > File or Project menu item.

Select one of the previously downloaded files.

Web IDE

Click on OK.

Repeat the operation for all the previously downloaded files:

Your package structure should now look like this:

Web IDE

Note:

You should close the tabs that got opened for each of the imported files in order to release the resources used by your browser.

Please log in to access this content.
Step 6: Create the CDS Entity Artifacts

CDS Entities Artifacts

In the SAP HANA database, a CDS entity is a table with a set of data elements that are organized using columns and rows.
 
A CDS entity has a specified number of columns, defined at the time of entity creation, but can have any number of rows.
 
Database entities also typically have meta-data associated with them; the meta-data might include constraints on the entity or on the values within particular columns.
 
SAP HANA Extended Application Services (SAP HANA XS) enables you to create a database entity as a design-time file. All design-time files, including your CDS entity definition, can be transported to other SAP HANA systems and, when deployed, used to generate the same catalog objects. You can define the entity using CDS compliant DDL.
 
They are stored in a CDS documents which are design-time source files that contain DDL code according to rules defined in Core Data Services.
 
The SAP HANA Core Data Services (CDS) plug-in transforms a design-time SAP HANA CDS resource (defined in a .hdbcds artifact) into the database objects defined in the CDS document, for example, tables, views, types, etc.
 
For additional details, check the CDS Entities in XS Advanced documentation.

Create a new file named data.hdbcds in the movielens/db/src/hdb folder.

By default the Graphical Editor will open. Close it, and right click on the data.hdbcds file, and go to Open with > Code Editor.

This is the full path of the created file:

movielens/db/src/hdb/data.hdbcds

Paste the following content:

namespace aa.movielens.db.hdb;

context data {
    entity links {
        key MOVIEID : Integer;
            IMDBID  : Integer;
            TMDBID  : Integer;
    };

    entity movies {
        key MOVIEID : Integer;
            TITLE   : String(255);
            GENRES  : String(255);
    };

    entity ratings {
        key USERID    : Integer;
        key MOVIEID   : Integer;
            RATING    : DecimalFloat;
            TIMESTAMP : Integer;
    };

    entity tags {
        key USERID    : Integer;
        key MOVIEID   : Integer;
        key TAG       : String(255);
            TIMESTAMP : Integer;
    };
};

Save the file using the save icon from the menu.

Please log in to access this content.
Step 7: Create the Table Data Artifacts

Table Data Artifact

The Table Data plug-in can be used to insert data defined in other design-time artifacts into database tables which are managed by SAP HANA DI and are not system-versioned, temporary, or virtual tables.
 
For additional details, check the Table Data in XS Advanced documentation.

Create a new file named data.hdbtabledata in the movielens/db/src/hdb folder.

This is the full path of the created file:

movielens/db/src/hdb/data.hdbtabledata

Paste the following content:

{
  "format_version": 1,
  "imports": [
    {
      "column_mappings" : {
        "MOVIEID" : "movieId",
        "IMDBID" : "imdbId",
        "TMDBID" : "tmdbId"
      },
      "import_settings" : {
        "import_columns" : [ "MOVIEID", "IMDBID", "TMDBID"],
        "include_filter" : [], "exclude_filter" : []
      },
      "source_data" : {
        "data_type" : "CSV", "has_header" : true, "dialect"   : "HANA",
        "file_name" : "aa.movielens.db.data::links.csv"
      },
      "target_table" : "aa.movielens.db.hdb::data.links"  
    },
    {
      "column_mappings" : {
        "MOVIEID" : "movieId",
        "TITLE" : "title",
        "GENRES" : "genres"
      },
      "import_settings" : {
        "import_columns" : [ "MOVIEID", "TITLE", "GENRES"],
        "include_filter" : [], "exclude_filter" : []
      },
      "source_data" : {
        "data_type" : "CSV", "has_header" : true, "dialect"   : "HANA",
        "file_name" : "aa.movielens.db.data::movies.csv"
      },
      "target_table" : "aa.movielens.db.hdb::data.movies"  
    },
    {
      "column_mappings" : {
        "USERID" : "userId",
        "MOVIEID" : "movieId",
        "RATING" : "rating",
        "TIMESTAMP": "timestamp"
      },
      "import_settings" : {
        "import_columns" : [ "USERID", "MOVIEID", "RATING", "TIMESTAMP"],
        "include_filter" : [], "exclude_filter" : []
      },
      "source_data" : {
        "data_type" : "CSV", "has_header" : true, "dialect"   : "HANA",
        "file_name" : "aa.movielens.db.data::ratings.csv"
      },
      "target_table" : "aa.movielens.db.hdb::data.ratings"  
    },
    {
      "column_mappings" : {
        "USERID" : "userId",
        "MOVIEID" : "movieId",
        "TAG" : "tag",
        "TIMESTAMP": "timestamp"
      },
      "import_settings" : {
        "import_columns" : [ "USERID", "MOVIEID", "TAG", "TIMESTAMP"],
        "include_filter" : [], "exclude_filter" : []
      },
      "source_data" : {
        "data_type" : "CSV", "has_header" : true, "dialect"   : "HANA",
        "file_name" : "aa.movielens.db.data::tags.csv"
      },
      "target_table" : "aa.movielens.db.hdb::data.tags"  
    }    
  ]
}

Save the file using the plus icon from the menu.

Please log in to access this content.
Step 8: Build the SAP HANA Database Module

Right click on the db folder and select Build.

Web IDE

The console should display at the end the following message:

(Builder) Build of /movielens/db completed successfully.
Please log in to access this content.
Step 9: Open the Database Explorer

On the left side bar, click on the Database Explorer icon Web IDE icon.

Web IDE

Use the Add a database to the Database Explorer icon Web IDE.

Select HDI Container as Database Type and pick the entry that starts with XSA_DEV and ends with movielens-hdi_db that belongs to the development Space, then click on OK.

Web IDE

Select the Tables element, and your tables should appear in the list.

Web IDE
Please log in to access this content.
Step 10: Validate the import

Let’s now validate that the data was properly loaded.

Open a new SQL Console using the sql icon.

Paste the following content in the console, and use the execute icon run.

select 'links'   as "table name", count(1) as "row count" from "aa.movielens.db.hdb::data.links"
union all
select 'movies'  as "table name", count(1) as "row count" from "aa.movielens.db.hdb::data.movies"
union all
select 'ratings' as "table name", count(1) as "row count" from "aa.movielens.db.hdb::data.ratings"
union all
select 'tags'    as "table name", count(1) as "row count" from "aa.movielens.db.hdb::data.tags";

Based on the result returned by the above SQL statement, provide an answer to the question below then click on Validate.

How many rows were loaded in the "ratings" table?
×
Step 11: Commit your changes

On the icon bar located on the right side of the Web IDE, click on the Git Pane icon Web IDE.

Click on Stage All, enter a commit comment, then click on Commit and Push > origin master.

Please log in to access this content.
Next
Back to top