Dativa Pipeline API: Sample Data


We have uploaded an extensive set of sample data, used in our automated test suite, to the S3 bucket s3://pipeline-api-demo. Any user can access it on a requester-pays basis.

You can view these using the AWS command line interface by running the following command:

aws s3 ls s3://pipeline-api-demo/source --recursive --request-payer requester
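
If you are scripting against the bucket rather than using the AWS CLI, the same listing can be done with boto3. The sketch below is a minimal example, assuming you have AWS credentials configured for the account that will be billed for the requester-pays access:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# List every object under the 'source' prefix; RequestPayer is required
# because the bucket is accessed on a requester-pays basis.
for page in paginator.paginate(
    Bucket="pipeline-api-demo",
    Prefix="source/",
    RequestPayer="requester",
):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])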

Running the examples

Each sample CSV file is accompanied by a set of rules that can be applied, a version of the file after the rules have been run, and an example ‘curl’ command that calls the API with those rules.

To run a sample, select a sample file and fetch the example ‘curl’ command associated with it, e.g.:

aws s3api get-object --bucket pipeline-api-demo --key source/number/test_int_is_unique_dirty.csv.curl --output table --query 'Body' --request-payer requester /dev/stdout

returns the following:

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' --header 'x-api-key: <your-api-key-here>' -d "{ \
    \"source\": { \
        \"s3_url\": \"https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv\", \
        \"delimiter\": \",\", \
        \"encoding\": \"UTF-8\" \
    }, \
    \"destination\": { \
        \"s3_url\": \"https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv\", \
        \"delimiter\": \",\", \
        \"encoding\": \"UTF-8\" \
    }, \
    \"rules\": [ \
        { \
            \"append_results\": false, \
            \"params\": { \
                \"attempt_closest_match\": false, \
                \"decimal_places\": 1, \
                \"default_value\": 0, \
                \"fallback_mode\": \"use_default\", \
                \"fix_decimal_places\": true, \
                \"is_unique\": false, \
                \"lookalike_match\": false, \
                \"maximum_value\": 100.0, \
                \"minimum_value\": 1.0, \
                \"skip_blank\": false, \
                \"string_distance_threshold\": 0.7 \
            }, \
            \"rule_type\": \"Number\", \
            \"field\": \"TotalEpisodes\" \
        } \
    ] \
}" 'https://pipeline-api.dativa.com/clean'

Substitute your own API key in the command and run it, ensuring that the processing job starts successfully:

{
  "job_id": "e7f3de81-0dab-4bd1-b7c9-4bbca9a62309",
  "status": "PREPARING",
  "reason": "Method called"
}
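
If you prefer to drive the API from a script rather than ‘curl’, the same request can be issued with Python's requests library. This is a minimal sketch that reuses the endpoint, headers, and rule payload from the example above; substitute your own API key before running it (charges apply):

import requests

API_KEY = "<your-api-key-here>"  # substitute your own key from the Developer Portal

# Request body taken from the example 'curl' command above.
body = {
    "source": {
        "s3_url": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv",
        "delimiter": ",",
        "encoding": "UTF-8",
    },
    "destination": {
        "s3_url": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv",
        "delimiter": ",",
        "encoding": "UTF-8",
    },
    "rules": [
        {
            "append_results": False,
            "rule_type": "Number",
            "field": "TotalEpisodes",
            "params": {
                "attempt_closest_match": False,
                "decimal_places": 1,
                "default_value": 0,
                "fallback_mode": "use_default",
                "fix_decimal_places": True,
                "is_unique": False,
                "lookalike_match": False,
                "maximum_value": 100.0,
                "minimum_value": 1.0,
                "skip_blank": False,
                "string_distance_threshold": 0.7,
            },
        }
    ],
}

response = requests.post(
    "https://pipeline-api.dativa.com/clean",
    json=body,
    headers={"x-api-key": API_KEY, "Accept": "application/json"},
)
job = response.json()
print(job["job_id"], job["status"], job["reason"])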

Use the Job ID returned above to poll the status of the processing:

curl -X GET --header 'Accept: application/json' --header 'x-api-key: <your-api-key>' 'https://pipeline-api.dativa.com/status/e7f3de81-0dab-4bd1-b7c9-4bbca9a62309'

Ensure that the processing completes successfully:

{
    "job_id": "e7f3de81-0dab-4bd1-b7c9-4bbca9a62309", 
    "status": "COMPLETED", 
    "reason": "Processing successful", 
    "report": [
        {
            "date": "2018-06-05 09:21:45.640178", 
            "source_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv", 
            "field": "TotalEpisodes", 
            "rule": "Number", 
            "category": "modified", 
            "description": "Automatically fixed to 1 decimal places", 
            "modified_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report.modified.totalepisodes"
        }, 
        {
            "date": "2018-06-05 09:21:46.280061", 
            "source_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv", 
            "field": "TotalEpisodes", 
            "rule": "Number", 
            "category": "replaced", 
            "description": "Replaced with default value", 
            "modified_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report.replaced.totalepisodes"            
        }
    ], 
    "source": {
        "s3_url": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv", 
        "delimiter": ",", 
        "encoding": "UTF-8", 
        "header": 0, 
        "skiprows": 0, 
        "quotechar": "\"", 
        "strip_whitespace": true
    }, 
    "destination": {
        "s3_url": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.cleaned", 
        "delimiter": ",", 
        "encoding": "UTF-8"
    }, 
    "seconds_taken": "1.4441330432891846"
}
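
For scripted runs, the status call can be wrapped in a simple polling loop. This is a minimal sketch using the same endpoint and header as the ‘curl’ example; only the PREPARING and COMPLETED statuses appear in the sample responses, so any other in-progress status names the API returns would need to be added:

import time
import requests

API_KEY = "<your-api-key-here>"
job_id = "e7f3de81-0dab-4bd1-b7c9-4bbca9a62309"  # job_id returned by the /clean call

IN_PROGRESS = {"PREPARING"}  # extend with any other in-progress statuses the API reports

while True:
    status = requests.get(
        f"https://pipeline-api.dativa.com/status/{job_id}",
        headers={"x-api-key": API_KEY, "Accept": "application/json"},
    ).json()
    print(status["status"], "-", status.get("reason", ""))
    if status["status"] not in IN_PROGRESS:
        break  # COMPLETED, or an error state
    time.sleep(5)  # wait a few seconds between polls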

Check that the processed file has been generated in the destination, e.g.:

aws s3 ls s3://pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.cleaned --recursive --request-payer requester

returns the following:

2018-06-05 10:21:47        133 destination/number/test_int_is_unique_dirty.csv.cleaned

View the Pipeline API report of the processing performed, e.g.:

aws s3api get-object --bucket pipeline-api-demo --key destination/number/test_int_is_unique_dirty.csv.report --output table --query 'Body' --request-payer requester /dev/stdout

returns the following:

[
    {
        "date": "2018-06-05 09:21:45.640178",
        "source_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv",
        "field": "TotalEpisodes",
        "rule": "Number",
        "category": "modified",
        "description": "Automatically fixed to 1 decimal places",
        "modified_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report.modified.totalepisodes"
    },
    {
        "date": "2018-06-05 09:21:46.280061",
        "source_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/source/number/test_int_is_unique_dirty.csv",
        "field": "TotalEpisodes",
        "rule": "Number",
        "category": "replaced",
        "description": "Replaced with default value",
        "modified_file": "https://s3-us-west-2.amazonaws.com/pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.report.replaced.totalepisodes"
    }
]
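
The cleaned file and the report can also be fetched programmatically. This is a minimal boto3 sketch using the destination keys from the example above; as with the CLI commands, access is on a requester-pays basis:

import boto3

s3 = boto3.client("s3")
bucket = "pipeline-api-demo"

# Download and print the cleaned output and the processing report.
for key in (
    "destination/number/test_int_is_unique_dirty.csv.cleaned",
    "destination/number/test_int_is_unique_dirty.csv.report",
):
    obj = s3.get_object(Bucket=bucket, Key=key, RequestPayer="requester")
    print(f"--- {key} ---")
    print(obj["Body"].read().decode("utf-8"))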

Note that each call to the API (i.e. each run of a sample ‘curl’ command) incurs the standard metered charge for using the Pipeline API. Access to the sample S3 bucket is also on a ‘Requester Pays’ basis.

Useful commands

The following AWS CLI commands may be useful when navigating the sample files:

Recursively list all files in the ‘source’ directory

aws s3 ls s3://pipeline-api-demo/source --recursive --request-payer requester

View the example ‘curl’ command for a demo, e.g. ‘number/test_int_is_unique_dirty.csv’

aws s3api get-object --bucket pipeline-api-demo --key source/number/test_int_is_unique_dirty.csv.curl --output table --query 'Body' --request-payer requester /dev/stdout

Check the file created after running the ‘number/test_int_is_unique_dirty.csv’ example

aws s3 ls s3://pipeline-api-demo/destination/number/test_int_is_unique_dirty.csv.cleaned --recursive --request-payer requester

View the contents of the report file after running the ‘number/test_int_is_unique_dirty.csv’ example

aws s3api get-object --bucket pipeline-api-demo --key destination/number/test_int_is_unique_dirty.csv.report --output table --query 'Body' --request-payer requester /dev/stdout

Note that the above commands must be run with AWS credentials for the account against which the ‘Requester Pays’ access will be billed.

Troubleshooting

Curl command returns '{"message":"Forbidden"}' error

Ensure that you have replaced the '<your-api-key-here>' placeholder in the ‘curl’ command with your actual API key, obtained from the Dativa Developer Portal (charges apply)

'An error occurred (AccessDenied) when calling the ListObject/GetObject operation: Access Denied'

Ensure that you have included the '--request-payer requester' option in the AWS CLI command (charges apply). Only ListObjects and GetObject S3 operations are permitted on the bucket

Curl command returns '{"message": "Invalid request body"}' error

Ensure the JSON in the Curl command body is correctly specified
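
If you are editing the request body by hand, it can help to validate the JSON locally before calling the API, so that malformed JSON is caught without incurring a charge. A minimal sketch:

import json

body_text = '{"source": {"s3_url": "..."}}'  # paste the contents of the -d argument here

try:
    json.loads(body_text)
    print("Request body is valid JSON")
except json.JSONDecodeError as err:
    print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")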

Full list of example files

We have provided a number of sample files demonstrating the use of the Pipeline API in a publicly accessible S3 bucket (https://s3.amazonaws.com/pipeline-api-demo/source/). These are CSV files, each illustrating one or more issues that can be resolved using the Pipeline API:

Folder          File                                      Example
anonymization   emails.csv                                TOKENIZATION, DATE VALIDATION, STRING VALIDATION
                emails_rotating_encrypted.csv             TOKENIZATION, DATE VALIDATION, STRING VALIDATION, DECRYPTION
date            check_epoch_time.csv                      DATE VALIDATION
                date_format.csv                           DATE VALIDATION
                date_format1.csv                          DATE VALIDATION
                date_is_unique_dirty.csv                  DATE VALIDATION
generic         US1_movies_dirty.csv                      REFERENTIAL INTEGRITY, STRING VALIDATION
                email_test.csv                            STRING VALIDATION
                ip_list.csv                               STRING VALIDATION, UNIQUENESS
                names_blank.csv                           STRING VALIDATION
                test_cities_dirty1.csv                    REFERENTIAL INTEGRITY, STRING VALIDATION
                test_cities_dirty_skip_blank.csv          REFERENTIAL INTEGRITY
lookup          test_cities_dirty.csv                     REFERENTIAL INTEGRITY
                test_cities_dirty_windows1252.csv         REFERENTIAL INTEGRITY
                test_short_original.csv                   REFERENTIAL INTEGRITY
number          test_int_is_unique_dirty.csv              NUMBER VALIDATION
                test_int_range_dirty.csv                  NUMBER VALIDATION
                test_int_range_dirty1.csv                 NUMBER VALIDATION
                test_int_range_dirty2.csv                 NUMBER VALIDATION
                test_int_range_dirty3.csv                 NUMBER VALIDATION
                test_int_range_dirty_4.csv                NUMBER VALIDATION
session         NewSession rule checking file.csv         SESSION VALIDATION
                NewSession rule checking file1.csv        SESSION VALIDATION
                Session Test 2 only 2 records.csv         SESSION VALIDATION
                Session_Test1.csv                         SESSION VALIDATION
                allowed_gap_seconds.csv                   SESSION VALIDATION
                alternate_overlap_second.csv              SESSION VALIDATION
                check_epoch_time.csv                      SESSION VALIDATION
                check_epoch_time_insert_new_dirty1.csv    SESSION VALIDATION
                check_epoch_time_threshold_dirty.csv      SESSION VALIDATION
                clean_overlap_full_with_zero.csv          SESSION VALIDATION
                gap_first.csv                             SESSION VALIDATION
                gap_first_time.csv                        SESSION VALIDATION
                gap_second.csv                            SESSION VALIDATION
                gap_two.csv                               SESSION VALIDATION
                overlap_child.csv                         SESSION VALIDATION
                overlap_first.csv                         SESSION VALIDATION
                overlap_first_2.csv                       SESSION VALIDATION
                overlap_full.csv                          SESSION VALIDATION
                overlap_second.csv                        SESSION VALIDATION
                overlap_second_2.csv                      SESSION VALIDATION
                session_check.csv                         SESSION VALIDATION
string          test_string_dirty.csv                     STRING VALIDATION
                test_string_dirty1.csv                    STRING VALIDATION
                test_string_is_unique_dirty.csv           STRING VALIDATION
unique          duplication_test.csv                      UNIQUENESS
                names_blank.csv                           UNIQUENESS

Each sample CSV file is stored alongside its rules, an example ‘curl’ command, and the processed output, using the following naming convention:

Filename               Contents
<sample-file>          Sample CSV data file
<sample-file>.rules    Rules to apply, in JSON format
<sample-file>.curl     Example ‘curl’ command to run. Substitute your own API key to execute; results are placed in the ‘destination’ folder.
<sample-file>.cleaned  The cleaned file after processing with the Pipeline API using the above rules.
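
To illustrate how these companion files fit together, the sketch below downloads the sample CSV, its rules, and the example ‘curl’ command for one demo with boto3. The ‘.curl’ key is shown earlier on this page; the ‘.rules’ file is assumed to sit alongside the sample under the same ‘source’ prefix:

import boto3

s3 = boto3.client("s3")
bucket = "pipeline-api-demo"
sample = "source/number/test_int_is_unique_dirty.csv"

# Fetch the sample file and its companion .rules and .curl files
# (requester pays, charges apply).
for suffix in ("", ".rules", ".curl"):
    key = sample + suffix
    body = s3.get_object(Bucket=bucket, Key=key, RequestPayer="requester")["Body"].read()
    print(f"--- {key} ({len(body)} bytes) ---")
    print(body.decode("utf-8"))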

The processed files are stored in the same S3 bucket under the ‘destination’ folder. Note that newly generated files overwrite any previous files, and automatically expire and are deleted after 3 days.


Related documentation

  • Dativa Pipeline API on AWS - The Dativa Pipeline API is available through the AWS marketplace (more)
  • Dativa Pipeline API: Validating basic data types - Validating incoming datasets for basic string, number, and date type formatting and range checks using the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Anonymizing data - The Dativa Pipeline API supports tokenization, hashing, and encryption of incoming datasets for anonymization and pseudonymization (more)
  • Dativa Pipeline API: Referential Integrity - Using the Dativa Pipeline API to validate data against other known good datasets to ensure referential integrity (more)
  • Dativa Pipeline API: Handling invalid data - Invalid data can be quarantined or automatically fixed by the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Working with session data - The Dativa Pipeline API can check for gaps and overlaps in session data and automatically fix them (more)
  • Dativa Pipeline API: Reporting and monitoring data quality - The Dativa Pipeline API logs data that does not meet the defined rules and quarantines bad data (more)
  • Dativa Pipeline API: Full API reference - A field by field breakdown of the full functionality of the Dativa Data Pipeline API (more)
