sql-data-profiler

0.2.6 • Public • Published

npm version Build Status

sql-data-profiler

sql-data-profiler is a utility module that generates sql code to profile data sets (eg. tables) in Redshift.

Installation

npm install sql-data-profiler --save

Quick start

var sql_data_profiler = require('sql-data-profiler');

Table profiler

Generates a sql statement that provides basic stats on the data in the "contacts" table.

var data_profiler = sql_data_profiler.data_profiler;
 
var options = {
    target_table: 'contacts',
    target_columns: ['email', 'a_industry']
};
var sql_code = data_profiler(options);

which accepts the following options

  • target_table
  • target_columns
  • results_table
  • calculate_frequency
  • use_perm_table
  • truncate_table

Distribution analysis

** What does this do?**

var distribution_analyzer = sql_data_profiler.distribution_analyzer;
 
var sql_code = distribution_analyzer({
    target_table: 'contacts',
    target_variable: {
        data_set: 'SELECT DISTINCT contact_id FROM events WHERE meta_event = \'conversion\'',
        join: {
            source_column: 'contact_id',
            target_column: 'contact_id'
        }
    },
    target_columns: [
        {
            name: 'a_name',
            max_number_of_values: 10,
        },
        {
            name: 'a_gender',
            max_number_of_values: 10
        },
        {
            name: 'a_address',
            max_number_of_values: 10,
            transformation: 'has_any_value'
        }
    ]
});

Table stats

The following stats are calculated for each column.

Stats Description
count_total number of records in the table
count_not_null number of records where the value for the specified column is not null
fill_rate number of non-null values divided by number of records
count_distinct number of distinct values
dupe_rate number of distinct values divided by number of records where the value for the specified column is not null
maximum_value
minimum_value
most_frequent_value_1
most_frequent_value_1_frequency
most_frequent_value_2
most_frequent_value_2
most_frequent_value_3
most_frequent_value_3

TO DO

  • handle different data type (eg. boolean)
  • performance improvement

Readme

Keywords

none

Package Sidebar

Install

npm i sql-data-profiler

Weekly Downloads

4

Version

0.2.6

License

MIT

Last publish

Collaborators

  • pcothenet
  • anselme