If you receive errors when attempting to view this white paper, please install the latest version of
Adobe Reader.
"
Data Cleansing and Synchronization Services
The pace with which companies are forced to operate and to compete globally
has taxed exisitng systems and increased their inefficiencies."
Source : PM ATLAS Business Group, LLC
Four Critical Success Factors to Cleansing Data
Data Cleansing is also known as :
Data,
Data Cleansing,
Consumer Data,
Precise Data Cleansing,
Data Quality,
Data Cleansing Tools,
Data Cleansing Solutions,
Accurate Data,
Source Data,
Data Warehousing,
Data Scrubbing,
Outsource Data Cleansing,
Data Bureau Services,

Data Cleansing Software,
Data Transformation,
Consumer Database Cleaning,
Cleansing Data Warehouse,
Data Cleansing Services,
Data Discovery,
Data Profiling,
Data Analysis,
Data Processing,
Financials Research Data Cleaning,
Master Data Management,
Data Integration,
Data Sets,
Data Integrity,
Data Cleansing Projects,
After presenting a case study “Supply Chain Data- Dont let it be
your weakest link!'” with Alcon Laboratories at the TCU Global
Supply Chain conference in April 2005 there were a flurry of
questions from the attendees. While the presentation was
mainly focused on the value of quality data in the supply chain
and in todays automated mass sharing of information with internal
and external customers, we had several questions including:
- Why is data important now, it has been around forever?
- There aren't enough internal resources but how could outsiders/
contractors possible know their data well enough to
cleanse it?
- How would you even know where to start?
This article addresses how we all got here and what we can do
about it. There are four critical success factors of 1) Scope, 2)
Team, 3) Process and 4) Technology and this article will give
you the stepping stones to be successful in cleaning your critical
data.
Executive Summary of the issue: Think of data cleansing as
system immunization.
Competition for business has fuelled system complexities. We
are clearly in the information age. There is more data being sent
around the world today than ever, and this is only going to grow
exponentially. To have the competitive edge today you have to be
able to share clean, accurate data quickly. It's not enough just to
have a good product, your customers have to recognize the right
product, color and size from a internet catalogue, the system has
to process the correct item on the order and send the information
to distribution systems, distribution has to pick and ship the right
item. All of this is dependent on each system being able to share
and communicate the same product item data.
The pace with which companies are forced to operate and to compete globally has taxed existing
systems and increased their inefficiencies. Mergers and Acquisitions have introduced new
ERP implementations, forced incorporation of legacy systems, merged processes, combined
products, and consolidated customers. Technology has moved forward but so has the creativity
of the humans forcing the data requirements into fields and applications to meet their business
needs to keep existing customers satisfied or to capture new sales. Gartner research firm
indicates that the average Fortune 100 organization has more than eight data stores, 15 information
platforms, 10 critical systems, and hundreds to thousands of business applications. This
is not an unusual problem; almost all large Fortune 500 companies have been putting Band-
Aids on their data and systems for years. Unfortunately rubber bands and bailing wire isn't going
to fix their problems now, there is no silver bullet to fixing data.
What one system or company may call a product, another may call a component; bring together
unit of measure irregularities, calendar invoicing differences, selling product under three
different item codes and your company has data integrity issues. Now think about how you are
propagating bad data through interfaces with internal systems or even worse, through EDI,
XML, etc to your external customers, suppliers and vendors.
Typical ERP implementations are measured on getting them completed on-time and within
budget. Find an implementation including metrics of data integrity and process and you will be
hard pressed. Project teams would rather perform system configuration than perform manual
data review and validation. No wonder, according to Ovum Research, 50 % of data warehousing
projects fail. If more companies would incorporate the concepts of data quality and data
integrity into their project lifecycle, the IT department would have less data to clean up post golive.
We have all heard this before: "Fix the records on the fly." So basically, the old "garbage
in - garbage out" pretty much says it all.
Systems that support human decision-making should be systems which possess clean data.
To get valid data, companies must create a common business language- business rules. Once
established, now bringing disparate data sources together is easy and sets the foundations for
business performance management (BPM).
So with all of that said the next round of questions seems simple....How do I know my data is
"dirty"? This question is not as simple as it appears, most management think because they are
using the data in production systems to run their business then it must be clean....or clean
enough...right? Wrong!
What is dirty data?
Dirty data occurs when reality is different than the data captured and stored,
therefore it is dirty. It can also be described as anomalies in the data values that renders a
wrong representation in:
- Sales order processing and reporting
- Manufacturing and logistics
- Administration, e.g keep track of employees and customers
- Business processes, e.g. using customer address lists for direct mailings, using unique
item identification for processing orders
- Analysis and interpretation to support decision making and generate new information
and knowledge
Why is this an issue now when data has been around since the beginning of the company?
- Industry Changes
- Increased need for competitive advantage
- Regulatory compliance (Homeland Security, etc.)
- - Raised visibility
- Sarbanes-Oxley
- HIPPA
- ER/ES
- Global Data Sharing Requirements
- Electronic Commerce
- Spreadsheets versus master data repository
- Supply chain collaboration & optimization
- - And the list goes on...
Simply put data impacts:
- Ability to process orders
- Ability to share data among internal systems
- Ability to share data electronically with external sources
- Reliable reports
- Correct information for good decision making
At a 2004 UCCNet conference, a presenter from Wal-mart stated that only 30% of their EDI orders
are able to processed automatically. The other 70% requires someone at Wal-mart to manually
manipulate the data inputs into their system.
"Data Quality problems will cost U.S. businesses more than $600 billion a year"
Data Warehouse Institute, 2002. Data is either an asset or liability to your company, which is it for
your company?
1) Scope - How to know what is critical data to your business
Before we can talk about a team doing the work, we have to take one step back and determine
"what" is the work. Having a project team start anything without scope definition or a work plan is
a waste of valuable company time and money. So take the time to analyze what is really needed,
clearly answer the following questions and then plan the work. Trust me, we could have saved
some companies hundreds of thousands of dollars if they had known the ROI in taking this step.
A data assessment needs to be completed and scope definition should clearly be defined by the
following parameters:
- What data is critical to your business, shared externally (B2B, EDI, Barcodes, RFID, etc) or
passed to other internal systems?
- Who cares about your data? Internal? External? Who is the sponsor? At a minimum the list
should include your C- Level, if they are making business decisions on reports from any of
your systems then they are the ones with most to risk. How can you make good decisions on
bad data?
- How many systems have master data? Master data is static data such as product number,
product attributes, customer or vendor records. Unless you are implementing an ERP or CRM
system, initial data cleansing efforts on transactional or historical data would be discouraged.
- How many integrations do you have that feed your master data to other systems (internal or
external) Create a master list by system and data field name. Do you have a data warehouse,
one central location for master data? Is now the time to make that move?
- Now identify the data owner and the business owners for each field. A data owner is the person
that is setting up or editing master data. The business owners are the ones that determined
what the data means, such as what data values are acceptable (UOM standards, EA vs
BX) or analyzes/uses the data outputs.
- What level of cleanliness is needed for our business, industry or to meet federal regulations?
Depending on your industry baselines, you may need to be 100% clean on critical (shared)
data. Many customers, such as DOD or Wal-mart are threatening fines if data is not clean.
- Do you have any internal business intelligence tools, Product information management tools?
If so, these will make the physical data pulls, and cleansing steps easier. If not, no big deal,
you just need to be aware that you will have some overhead for some "behind the scenes"
technical tasks that are required in the process.
Dirty data manifests itself in many different anomalies, below are just a few:
- Discrepancies in the structure of the data items and specified format
- Irregularities
- Integrity constraint violations
- Contradictions
- Duplicates
- Invalid
- Missing values (part or whole records)
- Orphaned data
Examples of data anomalies:
- Multiple addresses for IBM
- Same addresses for IBM
- 10005 Park Lane
- 10005 Park LN
- 1005 Park Lane
- Multiple ways to identify a vendor...Is it Coca Cola, Cocacola, CocaCola - Uppercase,
lowercase, one word, two words?
- Same product with multiple variations of the SKU
What causes dirty data?
- Duplicate or overlapping data from :
- Business Mergers and Acquisitions
- Combined products
- Inherited Legacy Systems
- Poor or undisciplined data capture
- Internal static data repositories
- Automating, merging or implementing new ERP and CRM solutions
- External file feeds
- Default entries are used during information gathering for missing data elements
- Aged data
- No initiative to maintain clean data
So simply put, "dirty data" creates errors that are having to be manually identified, intercepted and
corrected by you, your employees and your customers. Yes, even your customers!
So after completing the data assessment, you now should know what the work is....How many
fields, how many systems and how many people are involved in the decision making and cleansing
process
2) The People - The right people with the right skills, knowledge level and ability to make
decisions are critical to success!
Staffing a Data Cleansing Team is not about how many warm bodies can you throw at it! We have
seen so many customers think that it is a menial, low level work or that if they purchase an application
that it can do the work for them. Even better, We have been told "We have a data entry group
that is working on this part time"...so how do they think they got dirty data in the first place.
Hopefully, by the end of this section you will understand why it takes more than a cool new application
or monkey to do the work. Just remember what the first step is with any application... a person
has to analyze and make a conscious decision of what is the right answer (data value, process),
when and where change is needed in order to implement or configure the system. There is no way
to avoid the human decision making step of getting to what is the right answer for a data field.
Now, let's talk about what is needed to be successful in creating a Data Cleansing Team, it's bigger
than a breadbox. The Data Project Team is different than the Data Cleansing Team, and all
data projects at a minimum should consist of the following roles:
- Sponsor - Is the decision maker, responsible for Sr Management awareness and approval,
budget approval and issue/conflict resolution.
- Subject Matter Experts (SME's):
- Data Owners - Responsible for the maintenance or entry of data into systems.
- Business Owners - Responsible to communicate business needs, business processes
and approved values per data field.
- IT Support - Responsible to provide access to necessary systems, architect of systems and
integrations, and test environments.
- Data Cleansing Team - Responsible for facilitating data values and process reviews, system
data analysis to find anomalies, documenting approved/allowed data values and business processes,
obtain approval from all stakeholders for a specific field, cleanse the master data, support
testing of data being passed to other systems, support functional requirements specifications
for any system changes, support SOP changes/development for processes and develop
maintenance routines or reports for ongoing data maintenance.
While the majority of the Data Project Team is only needed about 10% of the time, the Data
Cleansing Team should consist of 100% dedicated team members . The Data Cleansing Team
should consist of the following roles/skill sets:
Project Manager - The PM creates and manages budget and schedule, resource allocation,
issue/risk management, communication and facilitation with sponsors and SME's, metrics and
status reporting.
Functional Analyst(s) - These analysts facilitate and capture the business needs, are process
engineers and technical writers, as well as understand basic data and technical designs.
They are responsible for gathering the business process and developing/documenting the
business processes and data values, functional requirement documentation, SOP's development,
edits, test scripts and testing.
Data/Technical Analyst(s) - These analysts are strong technical resources that can write/
code SQL, understand data structures, tables and values, data warehouses and farming data.
They are responsible for "snorkeling the data" based on what the Functional Analyst has
documented and look for anomalies, provide feedback and support to documentation effort,
pull data for functional analyst and team reviews, write and run routines to cleanse the data in
the systems, write integrity or exception reports for every field that supports the business/data
rule.
3) The Data Cleansing Process - Now to the hard work
Not to belabor the data cleansing process, this is another article all by itself, but the frameworks
and concepts that we use are listed below. We just want to stress... have a process defined before
you start or your team will struggle and your data project will never end.
Process Options-
- Field by Field - This process is simple and most often used, especially if you are cleansing
data in production system and involves the Data Project Team about 10% and the Data
Cleansing team 100%.
- The simplest way to manage this is via a scorecard. Excel works just fine, or you could
use MS Project.
- List every data field as an individual work unit
- Then setup the milestones/gates for measurement. We use the following for column
headers and time estimates to build the scorecard.
- Functional BA Assigned
- Analysis Start Date - Stagger start dates. A seasoned functional BA should be
able to start 2 to 3 fields per week.
- Analysis Due Date
- Draft Data/Business Rule
- Core Team Meeting - Presentation of findings and recommendations
- Approval/Distribution of business rule to greater team
- Technical BA assigned
- Data Cleansed per business rule
- Integrity reports written, tested and moved to production
- Data Owners trained in daily maintenance of report
- The process takes about 6 weeks for each field to move from analysis to production
support. This means at any given time 1 Senior Functional BA will have 12 to 18 fields
opened at a time.
- Duration planning - If you have 50 fields to cleanse then you are looking at a 3 to 4
month effort, but if you have 200 fields to cleanse then you really need to add more
Functional and Technical BA's. The workload is really 1 Technical BA to 1 Functional
BA.
- Then monitor and measure to your dates, some fields will be easy and will take less
time than benchmarked on the schedule. Others will be more complex and may have a
longer run time to complete. If you know the complexity of the field is high, then add
more analysis time up front.
- Mass Cleaning - This is a larger scale, more integrated process and requires about 30% to
50% involvement of the Data Project Team and 100% of the Data Cleansing Team.
- The concept here is to schedule workshops for each group of related fields/same business
owners, analyze data and identify known/approved values as a team.
- Functional BA documents the data and business rules for that group of fields
- Functional BA sends the documentation to users for approval and completes a technical
spec detailing the changes, and creates test specifications for users.
- Technical BA codes all data changes based on the documentation in test environment.
- Data and Business Owners test the data loads
- Functional BA tests the data load and captures all edits from the data and business
owners and creates a technical spec
- This process is iterative until Data and Business Owners accepts the data in test.
- Then it is promoted into a integrated environment for testing with the same iterative
test cycle.
- Then final load is into Production as "Go Live"
- Duration Planning is more difficult since it is dependent on how many different data
sets you working and if you are working them single threaded or parallel. For example:
Your data sets are Vendor records, Inventory and Purchasing. You have the option to
work just Vendor records until it is ready and moved to production, or you could opt to
have 3 Data Cleansing Teams each responsible for one data set and run the cleansing
parallel.
- Since this is an integrated project and cross departmental, it is highly recommended
that MS Project be used for schedule development and project control.
4) Technology- How much to use or not to use... that is the question
For years company's have been struggling with implementing new technology to solve all of
these issues, IT has been in the spot light for years spending millions of dollars to provide
"business solutions" such as PIM's, PDM's, CRM's, ERP's, SVC's, etc. With the belief , that if
you haven't implemented at least one major enterprise system in the last 5 years then you have
lost your competitive edge.
But what in reality has happened, is that these new complex, technical solutions are only as
good as the data that is stored in them. They can only have "seamless integration" if the data
can be shared without human intervention. So is it no wonder that 90% of all IT projects fail? At
some point aren't the human decisions that form and run a business required as part of the solution.
And doesn't the electronic sharing of data only work when the data is correct?
We believe in using technology when it makes sense to do so, as long as everyone understands
that the root of data cleansing is the Data and Business Owners who define what is
needed. If you have internal data tools already implemented, then use them. If you don't, then
don't run out and buy one just for this project. Implementing a PIM or PDM, is a whole separate
project. Most vendors will only import clean data, so you still have to do the hard work first.
Not to worry, most any technical analyst can link to tables via ODBC and extract out the data
needed for analysis. Most data cleansing vendors have built their own proprietary tools that
they load and run to do the project and then remove once the project is complete.
In summary, cleansing data is a big commitment of time and resources. Businesses are completely
dependent on their data. It is a critical corporate asset and needs to be treated that way.
From one perspective, a business is only its data - its customer data, its employee data, its
product data, its financial data. Even its processes are represented by data. Compromise the
data and you compromise the business. It's that simple.