Dativa Pipeline API: Sample Data

We have uploaded an extensive set of the sample data used in our automated test suite to the S3 bucket s3://pipeline-api-demo, which can be accessed by any AWS user on a 'Requester Pays' basis.

You can view these using the AWS command line interface by running the following command:

aws s3 ls s3://pipeline-api-demo/source --recursive --request-payer requester
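
Individual files can also be downloaded locally with 'aws s3 cp'; for example, to fetch one of the sample CSV files referenced later in this article ('Requester Pays' charges apply):

aws s3 cp s3://pipeline-api-demo/source/number/test_int_is_unique_dirty.csv . --request-payer requester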

Running the examples

Each sample CSV file is accompanied by a set of rules that can be applied, a version of the processed file after the rules have been run, and an example 'curl' command that can be executed to call the API with the rules as defined.

To run a sample, select a sample file and retrieve the example 'curl' command associated with it, e.g.:

aws s3api get-object --bucket pipeline-api-demo --key source/number/test_int_is_unique_dirty.csv.curl --output table --query 'Body' --request-payer requester /dev/stdout

returns the following:

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' --header 'x-api-key: <your-api-key-here>' -d "{ \
    \"source\": { \
        \"s3_url\": \"https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv\", \
        \"delimiter\": \",\", \
        \"encoding\": \"UTF-8\" \
    }, \
    \"destination\": { \
        \"s3_url\": \"https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv\", \
        \"delimiter\": \",\", \
        \"encoding\": \"UTF-8\" \
    }, \
    \"rules\": [ \
        { \
            \"append_results\": false, \
            \"params\": { \
                \"attempt_closest_match\": false, \
                \"decimal_places\": 1, \
                \"default_value\": 0, \
                \"fallback_mode\": \"use_default\", \
                \"fix_decimal_places\": true, \
                \"is_unique\": false, \
                \"lookalike_match\": false, \
                \"maximum_value\": 100.0, \
                \"minimum_value\": 1.0, \
                \"skip_blank\": false, \
                \"string_distance_threshold\": 0.7 \
            }, \
            \"rule_type\": \"Number\", \
            \"field\": \"TotalEpisodes\" \
        } \
    ] \
}" 'https://pipeline-api.dativa.com/clean'

Substitute your own API key in the command and run it, ensuring that the processing job starts successfully:

{
  "job_id": "e7f3de81-0dab-4bd1-b7c9-4bbca9a62309",
  "status": "PREPARING",
  "reason": "Method called"
}
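
If you are scripting this step rather than pasting the command, one approach is to save the JSON request body to a local file and capture the returned job ID. The sketch below assumes 'jq' is installed, that your API key is exported in a PIPELINE_API_KEY environment variable, and that 'request.json' is a hypothetical local file holding the JSON body shown above (without the line-continuation backslashes):

# Submit the cleaning job and keep the returned job ID for status polling.
JOB_ID=$(curl -s -X POST \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header "x-api-key: ${PIPELINE_API_KEY}" \
  -d @request.json \
  'https://pipeline-api.dativa.com/clean' | jq -r '.job_id')
echo "Submitted job ${JOB_ID}"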

Use the Job ID returned above to poll the status of the processing:

curl -X GET --header 'Accept: application/json' --header 'x-api-key: <your-api-key>' 'https://pipeline-api.dativa.com/status/e7f3de81-0dab-4bd1-b7c9-4bbca9a62309'

Ensure that the processing completes successfully:

{
    "job_id": "e7f3de81-0dab-4bd1-b7c9-4bbca9a62309", 
    "status": "COMPLETED", 
    "reason": "Processing successful", 
    "report": [
        {
            "date": "2018-06-05 09:21:45.640178", 
            "source_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv", 
            "field": "TotalEpisodes", 
            "rule": "Number", 
            "category": "modified", 
            "description": "Automatically fixed to 1 decimal places", 
            "modified_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report.modified.totalepisodes"
        }, 
        {
            "date": "2018-06-05 09:21:46.280061", 
            "source_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv", 
            "field": "TotalEpisodes", 
            "rule": "Number", 
            "category": "replaced", 
            "description": "Replaced with default value", 
            "modified_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report.replaced.totalepisodes"            
        }
    ], 
    "source": {
        "s3_url": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv", 
        "delimiter": ",", 
        "encoding": "UTF-8", 
        "header": 0, 
        "skiprows": 0, 
        "quotechar": "\"", 
        "strip_whitespace": true
    }, 
    "destination": {
        "s3_url": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.cleaned", 
        "delimiter": ",", 
        "encoding": "UTF-8"
    }, 
    "seconds_taken": "1.4441330432891846"
}
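
If you are automating this check, a simple polling loop is enough. The sketch below reuses the JOB_ID and PIPELINE_API_KEY variables from the earlier sketch and assumes 'jq' is installed; status values other than PREPARING and COMPLETED are not shown in this article, so the loop simply gives up after 30 attempts:

# Poll the status endpoint until the job reports COMPLETED (or 30 attempts have passed).
for attempt in $(seq 1 30); do
  STATUS=$(curl -s -X GET \
    --header 'Accept: application/json' \
    --header "x-api-key: ${PIPELINE_API_KEY}" \
    "https://pipeline-api.dativa.com/status/${JOB_ID}" | jq -r '.status')
  echo "Attempt ${attempt}: status is ${STATUS}"
  if [ "${STATUS}" = "COMPLETED" ]; then
    break
  fi
  sleep 5
done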

Check that the processed file has been generated in the destination, e.g.:

aws s3 ls s3://pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.cleaned --recursive --request-payer requester

returns the following:

2018-06-05 10:21:47        133 destination/number/test_int_is_unique_dirty.csv.cleaned

View the Pipeline API report of the processing performed, e.g.:

aws s3api get-object --bucket pipeline-api-demo --key destination/number/test_int_is_unique_dirty.csv.report --output table --query 'Body' --request-payer requester /dev/stdout

returns the following:

[
    {
        "date": "2018-06-05 09:21:45.640178",
        "source_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv",
        "field": "TotalEpisodes",
        "rule": "Number",
        "category": "modified",
        "description": "Automatically fixed to 1 decimal places",
        "modified_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report.modified.totalepisodes"
    },
    {
        "date": "2018-06-05 09:21:46.280061",
        "source_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv",
        "field": "TotalEpisodes",
        "rule": "Number",
        "category": "replaced",
        "description": "Replaced with default value",
        "modified_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report.replaced.totalepisodes"
    }
]
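
If you prefer local copies of the output rather than streaming it to stdout, 'aws s3 cp' also accepts the '--request-payer' option, for example:

# Download the cleaned file and the processing report ('Requester Pays' charges apply).
aws s3 cp s3://pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.cleaned . --request-payer requester
aws s3 cp s3://pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report . --request-payer requester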

Note that each call to the API (i.e. each run of the 'curl' command in the samples) incurs the standard metered charge for using the Pipeline API. Access to the sample S3 bucket is also billed on a 'Requester Pays' basis.

Useful commands

The following AWS CLI commands may be useful when navigating the sample files:

Recursively list all files in the 'source' directory

aws s3 ls s3://pipeline-api-demo/source --recursive --request-payer requester

View the contents of the example 'curl' file

For example, for the 'number/test_int_is_unique_dirty.csv' demo:

aws s3api get-object --bucket pipeline-api-demo --key source/number/test_int_is_unique_dirty.csv.curl --output table --query 'Body' --request-payer requester /dev/stdout

Check that the cleaned file was created

For example, after running the 'number/test_int_is_unique_dirty.csv' example:

aws s3 ls s3://pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.cleaned --recursive --request-payer requester

View the contents of the report file

For example, after running the 'number/test_int_is_unique_dirty.csv' example:

aws s3api get-object --bucket pipeline-api-demo --key destination/number/test_int_is_unique_dirty.csv.report --output table --query 'Body' --request-payer requester /dev/stdout
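
Find all examples with a ready-made 'curl' command

To locate every sample that has an associated example command, you can filter the recursive listing on the '.curl' suffix, for example:

# List only the example '.curl' command files in the sample bucket.
aws s3 ls s3://pipeline-api-demo/source --recursive --request-payer requester | grep '\.curl$'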

Note that for the above commands you must use AWS credentials for the account against which the 'Requester Pays' access will be billed.

Troubleshooting

We've come across the following errors:

Curl command returns '{"message":"Forbidden"}' error

Ensure you have replaced the '<your-api-key-here>' text in the curl command with your actual API key, obtained from the Dativa Developer Portal (charges apply).

An error occurred (AccessDenied) when calling the ListObjects/GetObject operation

Ensure that you have included the '--request-payer requester' option in the AWS CLI command (charges apply). Only the S3 ListObjects and GetObject operations are permitted on the sample bucket.

Curl command returns '{"message": "Invalid request body"}' error

Ensure the JSON in the curl command body is well formed and correctly specified.
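
A quick way to catch malformed JSON before calling the API is to run the body through a local JSON parser; for example, assuming the body has been saved to the hypothetical 'request.json' file used earlier:

# Either command reports a parse error if the JSON is malformed.
jq . request.json
python -m json.tool request.json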

Full list of example files

We have provided a number of sample files to demonstrate the use of the Pipeline API in a publicly accessible S3 bucket (s3://pipeline-api-demo/source/). These consist of a set of CSV files, each illustrating one or more issues that can be resolved using the Pipeline API:

Folder | File | Example of
anonymization | emails.csv | TOKENIZATION, DATE VALIDATION, STRING VALIDATION
anonymization | emails_rotating_encrypted.csv | TOKENIZATION, DATE VALIDATION, STRING VALIDATION, DECRYPTION
date | check_epoch_time.csv | DATE VALIDATION
date | date_format.csv | DATE VALIDATION
date | date_format1.csv | DATE VALIDATION
date | date_is_unique_dirty.csv | DATE VALIDATION
generic | US1_movies_dirty.csv | REFERENTIAL INTEGRITY, STRING VALIDATION
generic | email_test.csv | STRING VALIDATION
generic | ip_list.csv | STRING VALIDATION, UNIQUENESS
generic | names_blank.csv | STRING VALIDATION
generic | test_cities_dirty1.csv | REFERENTIAL INTEGRITY, STRING VALIDATION
generic | test_cities_dirty_skip_blank.csv | REFERENTIAL INTEGRITY
lookup | test_cities_dirty.csv | REFERENTIAL INTEGRITY
lookup | test_cities_dirty_windows1252.csv | REFERENTIAL INTEGRITY
lookup | test_short_original.csv | REFERENTIAL INTEGRITY
number | test_int_is_unique_dirty.csv | NUMBER VALIDATION
number | test_int_range_dirty.csv | NUMBER VALIDATION
number | test_int_range_dirty1.csv | NUMBER VALIDATION
number | test_int_range_dirty2.csv | NUMBER VALIDATION
number | test_int_range_dirty3.csv | NUMBER VALIDATION
number | test_int_range_dirty_4.csv | NUMBER VALIDATION
session | NewSession rule checking file.csv | SESSION VALIDATION
session | NewSession rule checking file1.csv | SESSION VALIDATION
session | Session Test 2 only 2 records.csv | SESSION VALIDATION
session | Session_Test1.csv | SESSION VALIDATION
session | allowed_gap_seconds.csv | SESSION VALIDATION
session | alternate_overlap_second.csv | SESSION VALIDATION
session | check_epoch_time.csv | SESSION VALIDATION
session | check_epoch_time_insert_new_dirty1.csv | SESSION VALIDATION
session | check_epoch_time_threshold_dirty.csv | SESSION VALIDATION
session | clean_overlap_full_with_zero.csv | SESSION VALIDATION
session | gap_first.csv | SESSION VALIDATION
session | gap_first_time.csv | SESSION VALIDATION
session | gap_second.csv | SESSION VALIDATION
session | gap_two.csv | SESSION VALIDATION
session | overlap_child.csv | SESSION VALIDATION
session | overlap_first.csv | SESSION VALIDATION
session | overlap_first_2.csv | SESSION VALIDATION
session | overlap_full.csv | SESSION VALIDATION
session | overlap_second.csv | SESSION VALIDATION
session | overlap_second_2.csv | SESSION VALIDATION
session | session_check.csv | SESSION VALIDATION
string | test_string_dirty.csv | STRING VALIDATION
string | test_string_dirty1.csv | STRING VALIDATION
string | test_string_is_unique_dirty.csv | STRING VALIDATION
unique | duplication_test.csv | UNIQUENESS
unique | names_blank.csv | UNIQUENESS

Each sample CSV file is accompanied by the following supporting files:

Filename | Contents
<sample-file> | Sample CSV data file
<sample-file>.rules | Rules to apply, in JSON format
<sample-file>.curl | Example 'curl' command to run. Substitute your own API key to execute; results are placed in the 'destination' folder.
<sample-file>.cleaned | The cleaned file after processing with the Pipeline API using the above rules
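
To inspect just the rules for an example, the same 'get-object' pattern shown above for the '.curl' file also works for the '.rules' file, e.g. for the 'number/test_int_is_unique_dirty.csv' example:

aws s3api get-object --bucket pipeline-api-demo --key source/number/test_int_is_unique_dirty.csv.rules --output table --query 'Body' --request-payer requester /dev/stdout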

The processed files will be stored in the same S3 bucket under the 'destination' folder. Note that any files generated will overwrite previous files and will automatically expire and be deleted after 3 days.

Related documentation

  • Dativa Pipeline API on AWS - The Dativa Pipeline API is available through the AWS marketplace (more)
  • Dativa Pipeline API: Validating basic data types - Validating incoming datasets for basic string, number, and date type formatting and range checks using the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Anonymizing data - The Dativa Pipeline API supports tokenization, hashing, and encryption of incoming datasets for anonymization and pseudonymization (more)
  • Dativa Pipeline API: Referential Integrity - Using the Dativa Pipeline API to validate data against other known good datasets to ensure referential integrity (more)
  • Dativa Pipeline API: Handling invalid data - Invalid data can be quarantined or automatically fixed by the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Working with session data - The Dativa Pipeline API can check for gaps and overlaps in session data and automatically fix them (more)
  • Dativa Pipeline API: Reporting and monitoring data quality - The Dativa Pipeline API logs data that does not meet the defined rules and quarantines bad data (more)
  • Dativa Pipeline API: Full API reference - A field by field breakdown of the full functionality of the Dativa Data Pipeline API (more)
