data
breakdown, 33
consolidation, 53
as data profile input, 126-129
decay, 92
delivery delays, 13
discovery from, 153-154
duplicate, 268
extracting, 58-59, 126-129
flattening, 59
integration, 7, 62
loading, 61
matching, 56-57
moving/restructuring, 52-62
normal forms of, 181-184
as precious resource, 3-5
qualifying, 82-83
rejection, 11
replication, 6-7
standardized representation, 170
using, 62-63
data accuracy
air quality analogy, 34
business case for, 103-118
characteristics, 29
consistency, 29-30
content, 29
costs, 108
as data quality assurance cornerstone, 257-259
decay, 50-51
defined, 29-32
form, 29
as fundamental requirement, 3, 23
improvement effects, 40
lack of, 3
object-level, 31-32
percentages, 34
problems, occurrence of, 67
summary, 41-42
total, 34-35
value of, 103-107
data capture processes, 89-92
auto-assist in recording process, 91-92
distance between event and recording, 90
error checking in recording process, 92
evaluation factors, 90
fact availability at recording, 90
feedback to recorder, 91
information verification at recording, 91
motivation of person doing recording, 91
number of handoffs after recording, 90
remedies, 95
skill of person doing recording, 91
time between event and recording, 90
data cleansing, 59-60
adding, 96-97
defined, 59
leaving out rows and, 60
problems, 59
routines, 59, 60
as short-term remedy, 97
tools, 21-22, 53
uses, 96-97
as value-level remedy, 171
data elements
analysis, 37
delay-prone, 50
matching, 56-57
revealing significant variances, 89
use of, 33
value indicators, 46
See also values
data entry
data rules checked during, 233
deliberate errors, 47-48
flawed processes, 44-46
forms, 45
as inaccuracy source, 44-49
mistakes, 44
null problem, 46
processes, 45
system problems, 48-49
windows, 45
data events analysis, 89-94
conversion to information products, 93-94
data capture processes, 89-92
data decay, 92
data movement/restructuring processes, 92-93
points of examination, 89
data gathering (complex data rules), 239-240
business procedures, 240
database-stored procedures, 239
source code scavenging, 239
speculation, 240
See also complex data rule analysis
data gathering (simple data rules), 221-224
business procedures, 223-224
database-stored procedures, 222-223
source code scavenging, 221-222
speculation, 224
See also simple data rule analysis
data management
lack of, 1
team training, 15
technology, 1
data marts, 61
data models, 189-190
building, 209
developing, 209-210
for primary/foreign key pairs identification, 200
validating, 210
data monitoring, 20-21
adding, 96
continuous checking, 100-101
database, 21
defined, 20
post-implementation, 99-101
transaction, 20-21
validation, 100
data profiling, 20
"analysis paralysis," 142
analysts, 123, 127, 134
analytical methods, 136-140
approaches, 20
assertion testing, 137
bottom-up approach, 131
column property analysis, 132, 143-172
complex data rules, 240-244
conclusion, 83
as core competency technology, 142
data rule analysis, 134-135, 215-245
data type and, 159
defined, 20, 53, 119
discovery, 136
emergence, 119-120, 141
errors, 82
extraction for, 126-129
as foundation for remedies, 258-259
general model, 123-130
goals, 122
important databases, 140
inputs, 124-129
iterations and backtracking, 139
for knowledge base creation, 122
metadata verification, 139
methodology, 130-135
model illustration, 123
output, 20
overview, 121-142
participants, 123-124
process, 122
process steps, 131-132
products, 53
of secondary data stores, 140
software support, 139-140
steps diagram, 131
structure analysis, 132-134, 173-214
technology, 119-120, 122, 258
text columns, 163
value rule analysis, 135, 246-254
visual inspection, 138
when to use, 140-141
data profiling outputs, 129-130
facts, 130
latency, 130
metadata, 129
data profiling repository
business objects, 272
content, 272-278
data rules, 217, 277
data source, 273-274
defined, 121
domains, 273
inconsistency points in, 167
information, 124
issues, 278
schema definition, 272
synonyms, 276
table definitions, 274-276
value rules, 277-278
data quality
awareness, 9, 10-12
characterization of state, 9
defined, 24
definitions, 24-27
emergence, 70
as everyone's job, 257
facts, 130
high, moving to position of, 256-257
improvement requirements, 14-15
issues management, 80-102
as maintenance function, 104
as major corporate issue, 255-256
money spent on, 105
as universally poor, 10
visibility, 1
data quality assessment project, 110-112
age of application, 112
future costs potential, 111
hidden costs potential, 111
identified costs, 111
importance to corporation, 111
likelihood of major change, 112
primary value, 259
pure, 257
robustness of implementation, 112
See also business case
data quality assurance, 67-79
activities, 75-78
comparison, 74-75
data accuracy as cornerstone, 257-259
department, 69-71
educational materials, 18
elements, 16
experts and consultants, 17
as explicit effort, 256-257
as full-time task, 70
functions, 71
group, 69-71
implementation, 118
initiatives, 23
methodologies, 18
organizing, 257
program components, 71
program goals, 68
program structure, 69-78
project services, 75-77
rationale, 105
software development parallel, 70
software tools, 18-22
stand-alone assessments, 77
summary, 78-79
teach and preach function, 77-78
team, 68, 76, 77
technology, 16-22
data quality assurance methods, 71-75
comparison illustration, 72
inside-out, 72-73
outside-in, 73-74
types of, 71-72
data quality problems, 3-23
fixing requirements, 14-15
hiding, 11
impact, 12-14
liability consequences, 12
reasons for not addressing, 12
scope, 14
data rule analysis, 134-135
complex, 135, 237-245
definitions, 216-220
simple, 134-135, 215-236
data rule checkers, 234
data rule repository, maintaining, 235
data rules
in assertion testing, 137
column properties vs., 220
data profiling repository, 217, 277
dates, 226-227
defined, 134, 215, 217, 238
derived-value, 229
durations, 227
evaluation, 232-234
exceptions, 219
execution, 225-226, 241
hard, 218-219, 238
loose definition, 219
multiple-rows/same column, 230
as negative rules, 217
object subgrouping columns, 227-228
process rules vs., 219-220, 238
relationships, 215
soft, 218-219, 238
sources, 137
syntax examples, 218
tight definition, 219
types of, 226-230, 241-244
work flow, 228-229
See also complex data rules; simple data rules
data source, 273-274
data transformation routines, building, 170-171
data types, 157-159
character, noncharacter data in, 157-158
defined, 157
profiling and, 159
typical, 158
See also column properties
data warehouses, 61
database management systems (DBMSs), 22
correct data, 22
for structural role enforcement, 134
database monitors, 21
database procedures
complex data rule analysis, 239
simple data rule analysis, 222-223
databases, 6
data integration, 7
definitions, 190
demands on, 8
design anticipation, 27-28
errors, 49
factors, 10
flexibility, 28-29
importance, 8
quality, 9
source, 54, 56-57, 59, 169-170, 212
target, 56-57
date(s)
columns, 195-196
complex data rules, 241-242
domain, 146
extreme numbers on, 252
simple data rules, 226-227
decay
in cause investigation, 92
problems, 92
decay-prone elements, 50-52
accuracy, over time, 51
characteristics, 50
decay rate, 52
handling, 51-52
decision-making efficiency, 41
decisions
based on hard facts, 115-116
based on intuition, 116-117
based on probable value, 116
defensive checkers, 96
deliberate errors, 47-48
correct information not given, 47-48
correct information not known, 47
falsifying for benefit, 48
See also errors
denormalization, 59
cases of, 182-183
use of, 59
denormalized form, 182-183
denormalized keys, 179
denormalized tables, 182, 183
data repetition, 183
in relational applications, 214
derived columns, 179
derived-value rules, 229
descriptor columns, 195
discovery, 136
of column properties, 153-154
of functional dependencies, 191, 196-197
homonym, 204
in structure analysis, 191-192
of synonyms, 203-204
discrete value list, 160-161
domains, 145-148
concept, 145
data profiling repository, 273
date, 146
defined, 145
external standards, 147
macro-issues, 148
metadata repository, 145
micro-issues, 148
special, 164-165
unit of measure, 147-148
zip code, 147
domain synonyms, 186-187
defined, 186
existence, 187
structural value and, 186
testing for, 205
See also synonyms
duplicate data, 268
durations, 227