Owl Analytics Admin Guide

Introduction to Owl Analytics

Purpose

Owl-Analytics software takes a machine-learning-first, rules-second approach to the data integrity and data quality of datasets. Owl is unobtrusive to your current data science tools. Data quality is an essential prerequisite for a defensible governance program.

Who Should Use This Guide

This guide is intended for administrators of Owl-Analytics software.

Administrators will learn how to install and configure the application to fit business needs, integrate Owl with corporate security and infrastructure, and schedule Owlcheck jobs to run on a recurring basis.

Terminology

  • Owl-web - a Tomcat web server that users log into to view the results of all data quality scans that have been run.

  • Owl-core - the main jar file that processes the data.

  • Owl-agent - remote execution of a job via an agent.

  • Owlcheck - the main execution point of jobs. Owlcheck is a shell script.

  • Hoot - the results of an Owlcheck.

  • Metastore - the Postgres repository that stores all information surfaced in Owl-web.

  • Dataset - the name given to a dataset (DS) at Owlcheck execution time.

Scope

This guide covers planning, setting up, administering, securing, and working with Owl-Analytics software. It covers the two main functional components, Owl-core and Owl-web.

Owl Features

Data Quality

  • Row count validation (ex: today has 50% of the usual record volume)

  • Column type validation (ex: is this type the same)

  • Mixed column, data shifting (ex: is the type mixed)

  • Column outlier detection (ex: is the value actually correct)

  • Records removed / added detection

  • Column format consistency (is the column shape the same)

  • Null / Empty check

  • Validate against source (check column values against source values, authoritative source use case)

  • Auto schema evolution

  • Schema change detection

  • Incremental / Micro-batch

  • Stream quality

Business Functions

  • Downstream data impacts to the business

  • Detect which models are affected most by DQ issues

  • Cataloging of data assets (which datasets have been Owlchecked)

  • Usage ranked catalog

  • Model performance

  • Annotation of DQ issues discovered

  • Distributed rule generation

  • Suppression of false positives

  • Correlation matrix

  • Scorecard

  • Built-in alerting
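To make two of the data quality features above more concrete, the following is a minimal Python sketch of what a row count validation and a null / empty check compute in principle. It is an illustration only and not Owl's implementation: the function names and the 0.75 minimum ratio are hypothetical and chosen for the example, and Owl's own checks are described as machine-learning driven rather than fixed thresholds.

```python
from typing import Optional, Sequence


def row_count_ratio(today_rows: int, baseline_rows: int) -> float:
    """Return today's row count as a fraction of a baseline row count."""
    if baseline_rows <= 0:
        raise ValueError("baseline_rows must be positive")
    return today_rows / baseline_rows


def row_count_ok(today_rows: int, baseline_rows: int, min_ratio: float = 0.75) -> bool:
    """Pass when today's volume is at least min_ratio of the baseline.

    min_ratio is a hypothetical threshold used only for this illustration.
    """
    return row_count_ratio(today_rows, baseline_rows) >= min_ratio


def null_or_empty_fraction(values: Sequence[Optional[str]]) -> float:
    """Return the fraction of values that are None or blank strings."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v.strip() == "")
    return missing / len(values)
```

For instance, a day with 500 rows against a baseline of 1,000 gives a ratio of 0.5, which matches the "50% record volume" case in the list and would fail the hypothetical 0.75 threshold.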

In this depiction, the Owlcheck driver and job are pushed to the cluster (the deploymode = cluster flag was sent), so connections back to the metastore can originate from any node on the cluster.
