What is Data Profiling ?
Data Profiling is a process whereby one examines the data available in an existing database and collects statistics and information about that data.
Typical types of metadata sought are:

® Domain whether the data in the column conforms to the defined values or range of values it is expected to take

®        Type: Alphabetic or numeric

®        pattern: a North American phone number should be (999)999-9999

®        frequency counts: most of our customers should be in Loa Angeles; so the largest number of occurrences of state code should be LA

®        Statistics:

®        Interdependency:

Broadly speaking, most vendors who provide tools in the data profiling space divide the functionality into three categories. The names for these categories often differ depending on the vendor, but the overall process is in three steps, which must be executed in order:

®        Column Profiling (Including the statistics and domain examples provided above)

®        Dependency Profiling, which identifies intra-table dependencies. Dependency profiling is related to the normalization of a data source, and addresses whether or not there are non-key attributes that determine or are dependent on other non-key attributes. The existence of transitive dependencies here may be evidence of second-normal form.

®        Redundancy Profiling, which identifies overlapping values between tables. This is typically used to identify candidate foreign keys within tables, to validate attributes that should be foreign keys (but that may not have constraints to enforce integrity), and to identify other areas of data redundancy. Example: redundancy analysis could provide the analyst with the fact that the ZIP field in table A contained the same values as the ZIP_CODE field in table B, 80% of the time.

Column profiling provides critical metadata which is required in order to perform dependency profiling, and as such, must be executed before dependency profiling. Similarly, dependency profiling must be performed before redundancy profiling. While the output of previous steps may not be interesting to an analyst depending on his or her purpose, the analyst will most likely be obliged to move through these steps anyway.


Links
Links-2
2002-2006 Westech Data Conversion Services. All rights reserved.
Home
Contact Us
Data Conversion

Data Management

Data Mining

Data Modeling

Data Refining

Data Cleansing

Data Movement

Data Profiling

Data Security

Data warehousing