Data Cleaning
Data cleaning, also called data cleansing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality. Data quality problems are present in single data collections, such as files and databases, due to misspellings during data entry, missing information or other invalid data. When multiple data sources need to be integrated, e.g., in data warehouses, distributed database systems or global web-based information systems, the need for data cleaning increases significantly. This is because the sources often contain redundant data in different representations. In order to provide access to accurate and consistent data, consolidation of different data representations and elimination of duplicate information become necessary. In general, data cleaning involves several phases.
Data analysis:
In order to detect which kinds of errors and inconsistencies are to be removed, a detailed data analysis is required.
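As a rough illustration of such an analysis, the sketch below profiles a few columns by counting missing and distinct values. The function name `profile`, the column names and the sample records are all hypothetical, and real profiling would normally look at many more properties (value ranges, patterns, key uniqueness).

```python
from collections import Counter

def profile(records, columns):
    """Return per-column counts of missing values, distinct values and top values."""
    report = {}
    for col in columns:
        values = [rec.get(col) for rec in records]
        # Treat None and blank strings as missing.
        present = [v for v in values if v is not None and str(v).strip() != ""]
        counts = Counter(present)
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "most_common": counts.most_common(3),
        }
    return report

# Hypothetical sample records, not taken from any real source.
sample = [
    {"name": "Ann Lee", "city": "Izmir"},
    {"name": "Ann Lee", "city": "izmir"},
    {"name": "Bob Ray", "city": ""},
    {"name": None, "city": "Ankara"},
]
report = profile(sample, ["name", "city"])
```

In this sample the `city` column reports three distinct values because "Izmir" and "izmir" are counted separately, which is the kind of inconsistency the analysis phase is meant to surface.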
Definition of transformation workflow and mapping rules:
Depending on the number of data sources, their degree of heterogeneity and the dirtiness of the data, a large number of data transformation and cleaning steps may have to be executed. Sometimes, a schema translation is used to map sources to a common data model. Early data cleaning steps can correct single-source instance problems and prepare the data for integration. Later steps deal with schema/data integration and cleaning multi-source instance problems, e.g., duplicates. For data warehousing, the control and data flow for these transformation and cleaning steps should be specified within a workflow that defines the ETL process.
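The ordering described above (schema translation, single-source cleaning, then multi-source duplicate removal) can be sketched as a small workflow. This is a minimal sketch under assumptions: the source names `crm` and `billing`, the field names, and the exact-match duplicate key are all hypothetical, and real duplicate detection usually needs fuzzy matching rather than exact keys.

```python
# Hypothetical mapping rules: source field name -> field name in the common model.
MAPPING_RULES = {
    "crm": {"cust_name": "name", "town": "city"},
    "billing": {"fullName": "name", "city": "city"},
}

def to_common_model(source, record):
    """Schema translation: rename source fields to the common model."""
    rules = MAPPING_RULES[source]
    return {target: record.get(field) for field, target in rules.items()}

def clean_instance(record):
    """Single-source cleaning: collapse whitespace and normalize case."""
    return {
        key: (" ".join(str(value).split()).title() if value else None)
        for key, value in record.items()
    }

def remove_duplicates(records):
    """Multi-source cleaning: keep the first record for each (name, city) key."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["name"], rec["city"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def run_workflow(sources):
    """Run the steps in order: translate, clean per source, then deduplicate."""
    merged = []
    for source, records in sources.items():
        for rec in records:
            merged.append(clean_instance(to_common_model(source, rec)))
    return remove_duplicates(merged)

# Hypothetical input from two sources with different schemas.
sources = {
    "crm": [{"cust_name": "ann  lee", "town": "IZMIR"}],
    "billing": [
        {"fullName": "Ann Lee", "city": "Izmir"},
        {"fullName": "Bob Ray", "city": "Ankara"},
    ],
}
result = run_workflow(sources)
```

Keeping the mapping rules as data rather than code means a new source can be added by extending `MAPPING_RULES` without touching the workflow itself.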
Verification:
The correctness and effectiveness of the transformation definitions should be tested and evaluated, e.g., on a sample or copy of the source data, to improve the definitions if necessary. Multiple iterations of the analysis, design and verification steps may be needed, e.g., since some errors only become apparent after applying some transformations.
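One way to test a transformation definition on a sample is shown below. The transformation `normalize_phone`, the validity rule (exactly ten digits) and the sample rows are hypothetical and serve only to illustrate the idea.

```python
import random

def normalize_phone(raw):
    """Hypothetical transformation under test: keep digits only."""
    return "".join(ch for ch in raw if ch.isdigit())

def verify(transform, rows, is_valid, sample_size=100, seed=0):
    """Apply `transform` to a sample of rows; return the inputs whose output is invalid."""
    rng = random.Random(seed)
    # Use all rows when there are few of them, otherwise a reproducible random sample.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    return [row for row in sample if not is_valid(transform(row))]

# Hypothetical source values.
rows = ["(212) 555-0100", "212.555.0101", "555-0102 ext 7", "n/a"]
failures = verify(normalize_phone, rows, lambda out: len(out) == 10)
```

Here the last two rows fail the check, which suggests the definition needs rules for extensions and placeholder values before it is run on the full data.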
Transformation:
The transformation steps are executed by running the workflow that loads and refreshes the related databases.
Backflow of cleaned data:
After (single-source) errors are removed, the cleaned data should also replace the dirty data in the original sources in order to give applications the improved data and to avoid redoing the cleaning work for future data extractions.
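A backflow step could look roughly like the following, which writes cleaned values back to a source table and touches only the rows that actually differ. The table `customers`, its columns and the cleaned values are hypothetical, and an in-memory SQLite database stands in for the original source.

```python
import sqlite3

# In-memory stand-in for the original source (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, " izmir"), (2, "Ankara"), (3, "ANKARA ")],
)

def backflow(conn, cleaned):
    """Write cleaned values back to the source; return the number of rows changed."""
    changed = 0
    for row_id, city in cleaned.items():
        # IS NOT compares like <> but also treats NULL as a comparable value.
        cur = conn.execute(
            "UPDATE customers SET city = ? WHERE id = ? AND city IS NOT ?",
            (city, row_id, city),
        )
        changed += cur.rowcount
    conn.commit()
    return changed

cleaned = {1: "Izmir", 2: "Ankara", 3: "Ankara"}
changed = backflow(conn, cleaned)
cities = [row[0] for row in conn.execute("SELECT city FROM customers ORDER BY id")]
```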
B2B Data Exchange
InfoTelica can develop custom B2B data exchange applications that let a company synchronize data across business units through seamless data exchange, either in real time or in batch processes. Besides supporting both open data standards and custom data formats, InfoTelica’s B2B data exchange applications check data quality for both structured and unstructured data, generate response files, send back invalid data, log exchanges and trap errors.
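To make the batch case concrete, the sketch below validates a CSV batch, traps per-row errors, logs the exchange and builds a response file listing the rejected rows. It is a generic illustration, not InfoTelica's implementation: the field names, the `process_batch` and `response_file` functions and the CSV layout are all assumptions.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("exchange")

REQUIRED = ("order_id", "amount")  # hypothetical required fields

def process_batch(csv_text):
    """Split a CSV batch into accepted rows and rejected rows with a reason."""
    accepted, rejected = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        try:
            missing = [f for f in REQUIRED if not (row.get(f) or "").strip()]
            if missing:
                raise ValueError("missing " + ", ".join(missing))
            row["amount"] = float(row["amount"])
            accepted.append(row)
        except ValueError as err:
            # Error trapping: one bad row must not abort the whole batch.
            rejected.append({"line": line_no, "reason": str(err)})
    log.info("batch processed: %d accepted, %d rejected", len(accepted), len(rejected))
    return accepted, rejected

def response_file(rejected):
    """Render the rejected rows as CSV text to send back to the partner."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["line", "reason"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rejected)
    return out.getvalue()

# Hypothetical incoming batch with one valid and two invalid rows.
batch = "order_id,amount\nA1,10.50\n,3.00\nA3,abc\n"
accepted, rejected = process_batch(batch)
response = response_file(rejected)
```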
Data Integration
Data integration is also called system or product integration. When a new product or system is bought or developed, we must integrate it with our existing systems to keep all systems synchronized. Synchronizing systems or products requires work at the data level, such as data migration, data consistency, data synchronization, data quality and data replication.
When we change or upgrade one of our existing systems to meet changing business requirements, we must apply these changes to all of our systems. Change management in the enterprise requires the same data-level work, which together is called data integration.
Data Migration
Data migration is the process of making an exact copy of an organization’s current data from one format to another format and from one device to another device without disrupting or disabling active applications.
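The "exact copy" requirement is commonly checked with checksums. The sketch below covers only the device-to-device case, copying a file and comparing SHA-256 digests; the file names are hypothetical, and a format change would need a content-level comparison instead of a byte-level one.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def migrate_file(src, dst):
    """Copy src to dst and confirm the copy is byte-identical; src is left in place."""
    shutil.copy2(src, dst)
    if sha256(src) != sha256(dst):
        raise IOError(f"checksum mismatch after copying {src}")
    return sha256(dst)

# Temporary files stand in for the old and new storage devices.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp, "old_device.dat")
    new = Path(tmp, "new_device.dat")
    old.write_bytes(b"customer records")
    checksum = migrate_file(old, new)
    copied = new.read_bytes()
```

Because the source file is never modified or removed, applications reading it are not disrupted while the copy is made and verified.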
Data Migration Reasons:
Server or storage replacement or upgrade.
Server or storage consolidation.
Data center relocation.
Server or storage maintenance.
Workload balancing.
Performance improvement.
Data Migration Stages:
Pre Migration:
  Analyzing.
  Mapping.
  Normalizing.
  Transforming.
  Testing Backup.
Migration.
Post Migration:
  Quality Control.
  Backup.
  Cleanup.
  Update Cataloging Guidelines.
Data Migration Methodology:
Plan:
  Determine migration requirements.
  Identify current storage environment.
  Create migration plan.
  Develop design requirements.
  Create migration architecture.
  Develop test plan.
Migrate:
  Communicate deployment plan.
  Validate hardware and software requirements.
  Customize migration procedures.
  Run pre-validation test.
  Perform migration.
  Verify migration completion.
Validate:
  Run post-validation test.
  Perform knowledge transfer.
  Communicate project information.
  Create report on migration statistics.
  Conduct migration close-out meeting.
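The pre-validation test, migration, post-validation test and statistics report in the methodology above can be sketched as follows. This is a minimal sketch assuming both source and target are SQLite databases and that row counts are an adequate check; the table `orders` and the helper names are hypothetical, and a real project would usually compare content as well as counts.

```python
import sqlite3

def table_counts(conn):
    """Return {table_name: row_count} for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0] for t in tables}

def migrate(src, dst):
    """Copy schema and rows from src to dst (stand-in for the 'perform migration' step)."""
    src.backup(dst)

def migration_report(before, after):
    """Per-table row statistics plus an overall pass/fail flag."""
    rows = {t: {"source": n, "target": after.get(t, 0)} for t, n in before.items()}
    return {"tables": rows, "ok": all(r["source"] == r["target"] for r in rows.values())}

# Hypothetical source database.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
src.executemany("INSERT INTO orders (total) VALUES (?)", [(10.0,), (25.5,), (7.25,)])
src.commit()
dst = sqlite3.connect(":memory:")

before = table_counts(src)   # pre-validation test
migrate(src, dst)            # perform migration
after = table_counts(dst)    # post-validation test
report = migration_report(before, after)
```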