Test the "Outliers" service

By Abdel Dadouche

Using a REST client, you will test the "Outliers" SAP Cloud Platform predictive service

You will learn

  • How to use the “Outliers” SAP Cloud for predictive services from a REST Client.

Details

Only the synchronous mode is tested here. For the asynchronous mode, you can mimic what was done in the Test the “Forecast” SAP Cloud for predictive services using a REST client tutorial.

To keep this tutorial readable, tokens are used in place of long URLs.
Replace every occurrence of a token with the corresponding value listed below.

Token | Value
<Account name> | your SAP Cloud Platform account name. On a developer trial account, it should end with trial
<C4PA URL> | https://aac4paservices<Account name>.hanatrial.ondemand.com/com.sap.aa.c4pa.services

If you are unsure what your SAP Cloud Platform account name is, you can refer to the following blog entry: SAP Cloud Platform login, user name, account id, name or display name: you are lost? Not anymore!
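
The token substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name `c4pa_url` and the account name `p1234567trial` are made up for the example, and the URL pattern is the one given for the <C4PA URL> token on a trial account.

```python
def c4pa_url(account_name: str) -> str:
    """Expand the <C4PA URL> token for a given <Account name>."""
    return (
        f"https://aac4paservices{account_name}"
        ".hanatrial.ondemand.com/com.sap.aa.c4pa.services"
    )


# "p1234567trial" is a made-up account name used only for illustration.
print(c4pa_url("p1234567trial") + "/api/analytics/outliers/sync")
```

The same base URL is then reused for every service endpoint mentioned in this tutorial.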

Info: A short description of the Outliers service

The Outliers service identifies the odd profiles of a dataset whose target indicator is significantly different from what is expected.

This service:

  • Identifies outliers contained in a dataset with regard to a target indicator
  • Ranks the outliers to get the oddest on top
  • Provides the reasons why an identified outlier is odd

In general, an outlier can either result from a data quality issue to correct or represent a suspicious case to investigate.

An observation is considered an outlier if the difference between its “predicted value” and its “real value” exceeds the value of the error bar, where the error bar is a measure of the deviation of the values around the predicted score.
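
As a rough sketch of this rule, the check can be written as a one-line comparison. The function name and the sample numbers are hypothetical, and the "difference" is assumed here to be the absolute difference; the service may compute it differently internally.

```python
def is_outlier(predicted_value: float, real_value: float, error_bar: float) -> bool:
    """Return True when the prediction error exceeds the error bar.

    Assumption: the difference is taken as an absolute value.
    """
    return abs(predicted_value - real_value) > error_bar


# Hypothetical values, chosen only to illustrate the rule.
print(is_outlier(predicted_value=0.10, real_value=1.0, error_bar=0.25))  # large gap
print(is_outlier(predicted_value=0.90, real_value=1.0, error_bar=0.25))  # small gap
```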

Reasons list the variables whose values have the most influence on the score. For each variable, its contribution to the score of the record is compared with its contribution for the whole population. The variables whose contributions differ the most are selected as the most important reasons.
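
The selection of reasons can be pictured with a toy example. This is not the service's actual computation, which is not detailed here; it only assumes that each variable has a numeric contribution for the record and for the population, and that the gap is measured as an absolute difference. All names and numbers below are invented for illustration.

```python
def top_reasons(record_contrib: dict, population_contrib: dict, number_of_reasons: int = 3) -> list:
    """Rank variables by the gap between record and population contributions."""
    gaps = {
        variable: abs(record_contrib[variable] - population_contrib[variable])
        for variable in record_contrib
    }
    # Largest gap first, then keep only the requested number of reasons.
    return sorted(gaps, key=gaps.get, reverse=True)[:number_of_reasons]


# Made-up contribution values, for illustration only.
record = {"marital_status": 0.40, "relationship": 0.30, "education": 0.05}
population = {"marital_status": 0.10, "relationship": 0.12, "education": 0.04}
print(top_reasons(record, population, number_of_reasons=2))
```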

Note: The target of the dataset must be either binary or continuous. Multinomial targets are not supported.

To summarize, in order to execute the outliers service, you need a dataset with:

  • a target variable
  • a set of variables that will be analyzed

Optionally, you can define the following parameters to enhance your analysis:

  • number of outliers: number of outliers to return
  • number of reasons: number of reasons to return for each outlier
  • weight variable: a column to be used to increase the importance of a row
  • skipped variables: a list of variables to skip from the analysis
  • variable description: a more detailed description of the dataset
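
A request body carrying these settings could be assembled as in the sketch below. The field names `datasetID`, `targetColumn`, `skippedVariables` and `variableDescription` are the ones used in the request body of Step 2; `numberOfOutliers`, `numberOfReasons` and `weightVariable` are assumed names for the remaining options and should be checked against the service documentation. The helper function itself is hypothetical.

```python
import json


def build_outliers_payload(dataset_id, target_column, number_of_outliers=None,
                           number_of_reasons=None, weight_variable=None,
                           skipped_variables=None, variable_description=None):
    """Assemble the request body, leaving out optional parameters that are not set."""
    payload = {"datasetID": dataset_id, "targetColumn": target_column}
    optional = {
        "numberOfOutliers": number_of_outliers,   # assumed field name
        "numberOfReasons": number_of_reasons,     # assumed field name
        "weightVariable": weight_variable,        # assumed field name
        "skippedVariables": skipped_variables,
        "variableDescription": variable_description,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


print(json.dumps(build_outliers_payload(3, "age", number_of_outliers=10,
                                        skipped_variables=["id"]), indent=2))
```

Omitting unset options keeps the body minimal, so the service falls back to its own defaults for anything not specified.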
Info: A short description of the Census dataset

The dataset used in this tutorial is extracted from the sample dataset available with SAP BusinessObjects Predictive Analytics.

The Census sample data file that you will use to follow the scenarios for Regression/Classification and Segmentation/Clustering is an excerpt from the American Census Bureau database, completed in 1994 by Barry Becker.

Note: For more information about the American Census Bureau, see http://www.census.gov (information published on a non-SAP site).

This file presents the data on 48,842 individual Americans, of at least 17 years of age. Each individual is characterized by 15 data items. These data, or variables, are described in the following table.

Variable | Description | Example of Values
age | Age of individuals | Any numerical value greater than 17
workclass | Employer category of individuals | Private, Self-employed-not-inc, …
fnlwgt | Weight variable, allowing each individual to represent a certain percentage of the population | Any numerical value, such as 0, 2341 or 205019
education | Level of study, represented by a schooling level, or by the title of the degree earned | 11th, Bachelors
education_num | Number of years of study, represented by a numerical value | A numerical value between 1 and 16
marital_status | Marital status | Divorced, Never-married, …
occupation | Job classification | Sales, Handlers-cleaners, …
relationship | Position in family | Husband, Wife, …
race | Ethnicity |
sex | Gender | Male, Female, …
capital_gain | Annual capital gains | Any numerical value
capital_loss | Annual capital losses | Any numerical value
native_country | Country of origin | United States, France, …
class | Variable indicating whether the salary of the individual is greater or less than $50,000 | “1” if the individual has a salary of greater than $50,000 and “0” if the individual has a salary of less than $50,000
Step 1: Register the Census dataset

As described in Step 1 of Test the “Dataset” services tutorial, register the Census dataset using the following elements:

Open a new tab in Postman.

If you don’t have Postman installed yet, you can refer to the following how-to guide: Install Postman extension for Google Chrome as a REST client

Field Name | Value
Request Type | POST
URL | <C4PA URL>/api/analytics/dataset/sync
{
  "location": {
    "schema" : "DEMO",
    "table" : "Census"
  }
}

Take note of the returned dataset identifier.
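
For readers who prefer a script to Postman, the same registration call could be prepared with the Python standard library, roughly as follows. The helper name and the placeholder base URL are assumptions for illustration. Authentication is left out of this sketch and the request is only built, not sent.

```python
import json
import urllib.request


def build_register_request(c4pa_url: str, schema: str, table: str) -> urllib.request.Request:
    """Build the POST request that registers a table as a dataset."""
    body = json.dumps({"location": {"schema": schema, "table": table}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{c4pa_url}/api/analytics/dataset/sync",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder base URL; substitute the value of the <C4PA URL> token.
request = build_register_request(
    "https://example.invalid/com.sap.aa.c4pa.services", "DEMO", "Census"
)
print(request.get_method(), request.full_url)
# Sending it requires network access and credentials:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```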

Step 2: Run the Outliers service

Open a new tab in Postman.

Fill in the following information:

Field Name | Value
Request Type | POST
URL | <C4PA URL>/api/analytics/outliers/sync

Select the Authorization tab and fill in the following information:

Field Name | Value
Type | Basic Auth
Username | your SAP Cloud Platform Account login (usually the email address used to register your SAP Cloud Platform account)
Password | your SAP Cloud Platform Account password

Select the Body tab, enable the raw mode and select JSON (application/json) in the drop down, then add the following content:

{
  "datasetID": 3,
  "targetColumn": "age",
  "skippedVariables" : ["id", "class", "sex", "race"],
  "variableDescription" : [
  	{"position" : "1", "variable" : "id", "storage" : "number" , "value" : "nominal" ,  "key" : "1"},
  	{"position" : "2", "variable" : "age", "storage" : "number" , "value" : "continuous"},
  	{"position" : "3", "variable" : "workclass", "storage" : "string" , "value" : "nominal" ,  "missing" : "?"},
  	{"position" : "4", "variable" : "fnlwgt", "storage" : "number" , "value" : "continuous"},
  	{"position" : "5", "variable" : "education", "storage" : "string" , "value" : "nominal"},
  	{"position" : "6", "variable" : "education_num", "storage" : "number" , "value" : "ordinal"},
  	{"position" : "7", "variable" : "marital_status", "storage" : "string" , "value" : "nominal"},
  	{"position" : "8", "variable" : "occupation", "storage" : "string" , "value" : "nominal" ,  "missing" : "?"},
  	{"position" : "9", "variable" : "relationship", "storage" : "string" , "value" : "nominal"},
  	{"position" : "10", "variable" : "race", "storage" : "string" , "value" : "nominal"},
  	{"position" : "11", "variable" : "sex", "storage" : "string" , "value" : "nominal"},
  	{"position" : "12", "variable" : "capital_gain", "storage" : "number" , "value" : "continuous" ,  "missing" : "99999"},
  	{"position" : "13", "variable" : "capital_loss", "storage" : "number" , "value" : "continuous"},
  	{"position" : "14", "variable" : "hours_per_week", "storage" : "number" , "value" : "continuous"},
  	{"position" : "15", "variable" : "native_country", "storage" : "string" , "value" : "nominal" ,  "missing" : "?"},
  	{"position" : "16", "variable" : "class", "storage" : "number" , "value" : "nominal"}
  ]  
}

Make sure the datasetID (here the value 3) is correct. To get the list of valid identifiers, you can run Step 6: List all registered datasets from the Test the “Data Set” SAP Cloud for predictive services using a REST client tutorial.
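
The Postman settings above (POST, Basic Auth, raw JSON body) map onto a short standard-library script. This is a sketch under assumptions: the helper names, the placeholder base URL and the credentials are invented, and the variableDescription array is left out of the payload for brevity.

```python
import base64
import json
import urllib.request


def build_outliers_request(c4pa_url, username, password, payload):
    """Build a POST request with a Basic Auth header, as configured in Postman."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{c4pa_url}/api/analytics/outliers/sync",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def run_outliers(request):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values; the variableDescription array is omitted for brevity.
payload = {"datasetID": 3, "targetColumn": "age",
           "skippedVariables": ["id", "class", "sex", "race"]}
request = build_outliers_request("https://example.invalid/com.sap.aa.c4pa.services",
                                 "user@example.com", "secret", payload)
print(request.get_method(), request.full_url)
```

Calling `run_outliers(request)` with a real base URL and real credentials would return the parsed JSON response as a Python dictionary.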

With these settings, the service searches for outliers with respect to the column set in targetColumn and leaves the variables listed in skippedVariables out of the analysis. The variableDescription entries also adjust the dataset description with the proper settings.

Click on Send.

Congratulations! You have just run the outliers service on the Census dataset.

Here is the result:

{
  "modelPerformance": {
    "confidenceIndicator": 1,
    "predictionConfidence": 0.9925,
    "predictivePower": 0.8196,
    "qualityRating": 5
  },
  "numberOfOutliers": 356,
  "outliers": [
    {
      "dataPoint": {
        "id": 43706,
        "age": 28,
        "workclass": "Private",
        "fnlwgt": 103432,
        "education": "HS-grad",
        "education_num": 9,
        "marital_status": "Never-married",
        "occupation": "Transport-moving",
        "relationship": "Own-child",
        "race": "White",
        "sex": "Male",
        "capital_gain": 0,
        "capital_loss": 0,
        "hours_per_week": 45,
        "native_country": "Portugal",
        "class": 1
      },
      "errorBar": 0.07390104953508236,
      "predictedValue": -0.19959287909119922,
      "realValue": "1",
      "reasons": [
        {
          "value": "Never-married",
          "variable": "marital_status"
        },
        {
          "value": "Own-child",
          "variable": "relationship"
        },
        {
          "value": "28",
          "variable": "age"
        }
      ]
    },
    ...
  ],
  "parameters": {
    "datasetID": 3,
    "skippedVariables": [
      "id"
    ],
    "targetColumn": "class",
    "variableDescription": [
      {
        "key": 1,
        "position": 1,
        "storage": "integer",
        "value": "nominal",
        "variable": "id"
      },
      ...
    ]
  }
}

We can see that 356 records out of the 48,842 are marked as outliers, meaning that the difference between their “predicted value” and their “real value” exceeds the value of the error bar. The list is sorted in descending order, so the records with the largest difference come first.
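
To inspect such a response programmatically, a small helper can compute the deviation of each outlier and collect its reasons. The sketch below uses a trimmed copy of the first outlier shown above. The helper name is hypothetical, and the deviation is assumed to be the absolute difference between predictedValue and realValue.

```python
def summarize_outliers(result: dict) -> list:
    """Return (id, deviation, error bar, reason variables) for each outlier."""
    rows = []
    for outlier in result["outliers"]:
        # realValue is returned as a string in the response shown above.
        deviation = abs(outlier["predictedValue"] - float(outlier["realValue"]))
        reasons = [reason["variable"] for reason in outlier["reasons"]]
        rows.append((outlier["dataPoint"]["id"], deviation, outlier["errorBar"], reasons))
    return rows


# Trimmed copy of the first outlier from the response above.
result = {
    "numberOfOutliers": 356,
    "outliers": [{
        "dataPoint": {"id": 43706},
        "errorBar": 0.07390104953508236,
        "predictedValue": -0.19959287909119922,
        "realValue": "1",
        "reasons": [
            {"value": "Never-married", "variable": "marital_status"},
            {"value": "Own-child", "variable": "relationship"},
            {"value": "28", "variable": "age"},
        ],
    }],
}
for record_id, deviation, error_bar, reasons in summarize_outliers(result):
    print(record_id, round(deviation, 4), deviation > error_bar, reasons)
```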

You can also play with the following parameters and check the differences:
- number of outliers: ask for 10, 50 and 100
- number of reasons: ask for 1, 5 and 10
- skipped variables: exclude “marital_status”
- variable description: for example, describe a variable as ordinal

Optional

For more details on the SAP Cloud for predictive services, you can check the following URL, which also allows you to run the service:
- <C4PA URL>/raml/console/index.html?raml=../api/aa-cloud-services.raml
Or the public documentation:
- https://help.hana.ondemand.com/c4pa/api/aa-cloud-services.html#api_analytics_forecast_post
