Cleaning Up Mistakes in Legacy Data to Improve Searching

 <  Day Day Up  >  

Speaking of free-form fields, how do you search in fields in which users have been allowed to type in whatever they wanted? I've done a search of a table of 50,000 entries, and found the name of the city in which the client was located (Las Vegas) spelled 75 different ways . Suffice it to say that you might never find some of your customers if you inherit data that looks like this.

I know that you'd never allow such a thing to happen. However, if you inherit a legacy system, like maybe one that you're migrating to Visual Basic .NET and SQL Server, you might want to clean up the data.

When you inherit a legacy application, you often inherit a lot of really, really sloppy data. As a result, your combo boxes don't work, your reports don't make sense, and you look sloppy, too. Here's a tool to fix the data and your reputation.

Migrating Legacy FoxPro Data to SQL Server

I've done a lot of work upgrading FoxPro 2.6 applications to Visual FoxPro 6.0, Visual FoxPro 7.0, and now Visual FoxPro 8.0. It's a tribute to Visual FoxPro that these applications have been working so well for so long. It's also a tribute to the patience of the users that they've been able to get so much use out of software that wasn't always very well-designed.

After moving a user 's data to new tables and building some very nice screens, I noticed that the combo boxes were coming up blank on many records. I thought I had entered all of the values the users said were valid, but because values that aren't in the combo box's list don't appear in the control, something wasn't right. I used

 

 SELECT DISTINCT (FieldName) FROM (TableName) 

to look at the data, and the variety was amazing. I found the name of their city spelled 47 different ways.

This poses real problems. You can't count how many people live in Hoboken, Hooboken, Hobooken, Haboken, and so forth and get an accurate count of what you think you're getting. And it just looks sloppy.

I love FoxPro, but DBFs have problems, especially on Local Area Networks. The main problem lies not in DBFs themselves , but in the CDX files used to enable Rushmore and thereby gain enormous performance enhancements. When you USE a DBF in a networked environment, FoxPro brings part of the index to the workstation. When your user makes a change, it's propagated to all other users. The result can be a slow FoxPro application. (It's hard for me to use the words slow and FoxPro in the same sentence , but it's true.) There goes our competitive advantage.

Add to that the result when you pull the plug on your workstation. The result is often a corrupted index file. And worst of all, the solution is to rebuild the index, an operation that requires exclusive use of the table. So all users have to be logged off ”including the ones who went to lunch without exiting the application. You get a lot of exercise walking around the building looking for the offending user. Often it's someone who feels put upon by the requests to close the application, which means that you're soon dealing with a hostile user, a real nightmare for a technician who's also a salesman .

In addition, anyone can get into a DBF and see what's there. In some applications, security is a real issue. DBFs have no protection from prying eyes (there's Cryptor; but I've implemented it, and although it's a technical marvel, implementation is at best a mixed blessing). There's one white paper on implementing security in FoxPro systems published on the Microsoft site, and I wrote it. However, as proud as I am of the techniques described therein, the paper deals with data access issues, but does nothing to solve this fundamental problem. SQL solves it straight out of the box.

Finally, users have actually sued me when they lost their data and didn't back it up. Worse, they won the lawsuit, because the judge was as ignorant as my client. With SQL Server, backup is a one-word command; and what's even better, it's their command.

I don't write applications based on DBFs any more, for all of these reasons. So part of every legacy application rewrite is the migration of their data to SQL Server or to MSDE. And in every single one, I find that the data is a mess. So while migrating it to SQL, I normalize the tables and clean up the data. It's just part of the job.

Migrating to SQL

The migration to SQL requires adding CREATE CURSOR commands to the LOAD event of all of my screens. The actual acquisition of the data and sending back the changes is relatively easy. So's the CREATE CURSOR step. In fact, the migration of a DBF-based application to MSDE or SQL Server is not difficult. The only hard part is cleaning up the data.

Cleaning Up the Data

I built a screen to help the users clean up their data. After all, they have a pretty good idea of what the correct spelling should be. And in the case of client names , they're probably the only people who could clean up their data. Besides, it was a fixed-price job.

The screen appears in Figure 9.11. This screen, called AUDITOR.SCX , runs in the directory where the user's data files are located. I have a list ( TabList.DBF ) of the tables that need to be audited because there are other tables in the directory that aren't user data.

Figure 9.11. The Auditor screen.
graphics/09fig11.jpg

When I pick a table, the program loads its fields and data types into an array and displays them in the Field picker (see Listing 9.22).

Listing 9.22. Loading the FieldPicker Combo Box with Text Column Names
 IF EMPTY ( THIS.Value )    RETURN ENDIF WITH THISFORM.Combo2 .Clear SELECT 10 USE ( THIS.Value ) ALIAS TheTable FOR I = 1 TO FCOUNT()         .AddListItem ( FIELD(I) )          nRow = .NewItemID          Fld  = FIELD(I)         .AddListItem ( TYPE (Fld), nRow, 2 ) ENDFOR USE ENDWITH 

I can audit other fields, and their names do appear in the list, but generally the character fields are the targets. After I've picked a character field, I can click on Audit Selected Field to see the counts of disparate values found in the named field (see Listing 9.23).

Listing 9.23. Displaying Unique Values and Counts for the Selected Field
 IF EMPTY ( THISFORM.Combo2.Value )    RETURN ENDIF IF USED ( "THETABLE" )    USE IN  THETABLE ENDIF SELECT Viewer ZAP WITH THISFORM Cmd = [SELECT TRANSFORM(] ;         + .Combo2.Value + [) AS Data, COUNT(*) AS Kount FROM ] ;         + .Combo1.Value ;         + [ GROUP BY ] + .Combo2.Value ;         + [ ORDER BY 2 DESC INTO CURSOR C1] &Cmd SELECT Viewer ZAP APPEND FROM DBF( [C1] ) GO TOP USE IN C1 USE IN ALLTRIM( ( .Combo1.Value ) ) .Grid1.Refresh ENDWITH 

If the user highlights a value that appears to be incorrect, clicking on the Search and Replace button will bring up a screen that permits specifying the replacement value:

 

 DO FORM FixData  WITH   ;   THISFORM.Combo1.Value,;   THISFORM.Combo2.Value,;   Viewer.Data,          ;   THISFORM.DataType     ;  TO Result IF Result = [Changed]    THISFORM.cmdGo.Click ENDIF 

Figure 9.12 shows the screen.

Figure 9.12. The Search and Replace screen.

graphics/09fig12.jpg


This screen displays the highlighted value and lets the user enter a replacement value. The replacement value is used immediately to replace the offending string with the correct one in the selected field.

The table name, field name, bad value, and replacement value are all written to a table called CHANGES.DBF . This table will be used later to do a batch replacement. Why this is necessary will be made clear shortly.

The program can also print out a summary of all values found in all fields. The program to do this is called AUDIT.PRG (see Listing 9.24). It uses a single SQL SELECT statement to prepare the data, then prints a report of its findings. The SQL statement is simply

 

 SELECT (Fld), COUNT(*) FROM (Table) GROUP BY (Fld) ORDER BY 2 DESC 

It's the quickest way to see what's wrong with your data. It's also interesting to see how many Smiths, for example, are found in a file of a few thousand names. And it's remarkably fast. Nothing runs like a Fox <g>.

Listing 9.24. The AUDIT Program
 * Program-ID..: AUDIT.PRG PARAMETERS Tbl IF PCOUNT() = 0    Tbl = GETFILE ( [DBF], [Table name?] )    IF EMPTY ( Tbl )           RETURN    ENDIF    Tbl = JUSTSTEM( Tbl )   ELSE    Tbl = JUSTSTEM( Tbl )    IF NOT FILE ( Tbl + [.DBF] )       MESSAGEBOX( [No such table: ] + Tbl + [.DBF], 64, _Visual FoxPro.Caption, 3000 )       RETURN    ENDIF ENDIF SET TEXTMERGE TO &Tbl..Txt ON NOSHOW SELECT 10 USE ( Tbl ) ALIAS ( Tbl ) FOR I = 1 TO FCOUNT()         Fld = FIELD(I)         IF TYPE ( Fld ) = [M]            LOOP         ENDIF         \<<REPLICATE([-],10) + [ ] + Fld + REPLICATE([-],10) + CHR(13)>>         Cmd = [SELECT ]    + Fld + [, COUNT(*) AS Kount FROM ] + Tbl ;                 + [ GROUP BY ] + Fld + [ ORDER BY 2 DESC INTO CURSOR C1]         &Cmd         GO TOP         IF RECCOUNT() > 75            \<<TRANSFORM(RECCOUNT()) + [ records found - first 5 follow:]>>                 SCAN WHILE RECNO() <= 5                 \<<PADL(TRANSFORM(EVALUATE(Fld)),50) + [ - ] + TRANSFORM(Kount)>>                 ENDSCAN          ELSE                 SCAN                 \<<PADL(TRANSFORM(EVALUATE(Fld)),50) + [ - ] + TRANSFORM(Kount)>>                 ENDSCAN         ENDIF         USE         SELECT ( Tbl ) ENDFOR SET TEXTMERGE TO CREATE CURSOR Reporter ( Line Char(132) ) APPEND FROM &Tbl..Txt SDF REPORT FORM Reporter PREVIEW USE 

REPORTER.FRX is a two-column report (see Figure 9.13) with a table called LINE.DBF , which contains a single field called LINE 80 characters wide. It's just right for dumping the contents of an ASCII file into a report and printing it "two-up."

Figure 9.13. The Audit Report.

graphics/09fig13.gif


Running the Data Conversion in Batch Mode

I often build replacement software while the old version is still in use. So every day, more data is being added to the system. And every day, if errors were being allowed into the tables in the past, more erroneous data is being added to those tables. Also, I'm not taking responsibility for the old system ”just the new one. So the data that I'm cleaning up is for the new system, not the old one. On conversion day, I've already run the data cleanup dozens of times, so I know it's going to work.

Running the Conversion

The program shown in Listing 9.25 converts the data on the Big Day. I don't want anything to go wrong on that day, so I run this program pretty much every day during the development process. If the users are creating new problems during data entry, this will point them out and give me time to fix them before we go live. And you'd be surprised the number of little details that no one spots until they've been looking at the converted data for a month and finally see a problem. So this ongoing conversion test gives everyone time to think clearly and get it right.

The program, which is shown in Listing 9.25, does more than simply apply the corrections in CHANGES.DBF , so I'll provide a running commentary .

Listing 9.25. The Convert Program That Actually Cleans the Data
 * Program-ID....: Convert.prg * Purpose.......: Convert application tables to new format *                  and apply cleanup changes CLOSE ALL SET EXCLUSIVE ON SET STRICTDATE TO 0 SET STATUS BAR ON SET TALK ON SET SAFETY OFF SET CPDIALOG OFF StartTime = TIME() ON KEY LABEL F1 *                && I don't want to see Help, even accidentally PUBLIC oDataMgr oDataMgr = NEWOBJECT ( "DATAMGR", "DATAMGR.PRG" ) USE CONTROL FOR I = 1 TO FCOUNT()         Fld = FIELD(I)         PUBLIC &Fld ENDFOR SCATTER MEMVAR USE ConnString = ALLTRIM(m.ConnString) PUBLIC Supervisor AS Integer PUBLIC DataAccess AS String Supervisor = 1 DataAccess = [SQL] =SQLDISCONNECT(0) ConnString = [Driver={SQLServer};Server=(local);Database=Data;Uid=Les;Pwd=lp;] oDataMgr.Handle = SQLStringConnect ( m.ConnString ) IF oDataMgr.Handle < 1    MessageBox ( [Couldn't connect to SQL Server] + CHR(13)        ;          + [Connection string: ] + CHR(13) + ConnString,                ;            64, _Visual FoxPro.Caption ) ENDIF lcPath = [\UserComputer\c-drive\userdir\] lcPath = GETDIR([Where are the original tables?]) IF EMPTY ( lcPath )    RETURN ENDIF lcPath = ADDBS ( lcPath ) _Visual FoxPro.Caption = [Migrating data from ] + lcPath *(1) Remove all entries for the highest key value in each table SELECT 99 USE KEYFIELDS ZAP *(2) Remove duplicate cases and create MOVES.DBF MakeMove()        && detailed after this listing; moves Add1, Add2 to MOVES table CLOSE TABLES *(3) Remote duplicate MEMBERS USE MEMBERS ZAP APPEND FROM ( lcPath + [MEMBER] ) GO TOP REPLACE ALL KEYVALUE WITH RECNO() GO BOTTOM INSERT INTO KEYFIELDS VALUES ( [MEMBERS], MEMBERS.KeyValue ) SELECT                        ;   AlienNum,                 ;   COUNT(*)                        ;   FROM MEMBERS                 ;  HAVING COUNT(*) > 1        ;  GROUP BY AlienNum        ;  ORDER BY 2 DESC                ;  INTO ARRAY laAliens FOR I = 1 TO ALEN ( laAliens, 1 )         IF VAL(laAliens(I,1)) = 0            LOOP         ENDIF         GO TOP         LOCATE FOR AlienNum = laAliens(I,1)         FOR J = 1 TO laAliens(I,2) - 1                 CONTINUE                 DELETE NEXT 1         ENDFOR ENDFOR *(4) Convert tables that added a KeyValue field: USE TABLIST FldList = [,MOVES, STATUS,] SCAN FOR ( [,] + TRIM(TableName) + [,] ) $ FldList          lcFile = TRIM(TableName)          SELECT 0          USE ( lcFile ) * Don't change MOVES, which is a new file.          IF lcFile <> [MOVES]                  ZAP                  lcOldFile = lcFile * Deal with tables whose names have changed here                  IF lcFile    = [CONTACTS]                     lcOldFile = [CONTACT]                  ENDIF                  APPEND FROM ( lcPath + lcOldFile )                  GO TOP          ENDIF * Add key field values here          REPLACE ALL KEYVALUE WITH RECNO()          GO BOTTOM          INSERT INTO KEYFIELDS VALUES ( lcFile, EVAL(lcFile + [.KeyValue]) )          GO TOP * Change all check number to seven digits with leading zeroes          IF ( [,] + lcFile + [,] ) $ [CHECKS,PAYMENTS,]                 REPLACE ALL CHECKNUM WITH TRANSFORM ( VAL(CHECKNUM),[@L #######])          ENDIF          USE ENDSCAN USE ESL REPLACE ALL ;  Class                WITH STRTRAN(Class,'AM','am'),        ;  Class                WITH STRTRAN(Class,'PM','pm') USE *(6) Convert tables with no KeyValue field that changed their name USE GRANTS ZAP APPEND FROM ( lcPath + [GRANT] ) USE *(7) Convert tables that required a field name change *    because a reserved word was used as a field name. USE ALLOCATE ZAP SELECT 0 CREATE CURSOR CarryAll ( Allocate Char(4), CALLOCATE Char(4), Descrip Char(25) ) APPEND FROM x && ( lcPath +    [Allocate] ) REPLACE ALL CALLOCATE WITH ALLOCATE SELECT ALLOCATE APPEND FROM DBF([CarryAll]) USE IN CarryAll USE *(8) Convert all other tables USE TABLIST ExcludedList = [,NOTES,MOVES,MEMBERS,CASES,ESL,PREMIUM,STATUS,UPGRADE,] SCAN FOR NOT ( [,] + ALLTRIM(UPPER(TableName)) + [,] ) $ ExcludedList              *****************************************         SELECT 0         USE ( ALLTRIM(TABLIST.TableName) )         ZAP         APPEND FROM ( lcPath + ALLTRIM(TABLIST.TableName) )         USE         SELECT TABLIST ENDSCAN USE *(9) Fix all data previously corrected by users SELECT DISTINCT TableName        ;   FROM CHANGES                        ;   INTO ARRAY FixerUppers         ;  ORDER BY TableName FOR EACH TName IN FixerUppers         SELECT * FROM Changes WHERE TableName = TName INTO ARRAY laChanges         SELECT 5         USE ( TName )         FOR I = 1 TO  ALEN ( laChanges, 1 )                 F = laChanges[I,2]                 D = VARTYPE  ( F )                 O = ALLTRIM(laChanges[I,3])                 N = ALLTRIM(laChanges[I,4])                 L = []                 R = []                 Cmd = [UPDATE ] + TName + [ SET ] + F + [ = ] + L + N + R ;                    + [ WHERE ] + F + [ = ] + L + O + R                 WAIT WINDOW Cmd NOWAIT                 &Cmd         ENDFOR ENDFOR WAIT CLEAR *(10) Upload all tables to SQL USE TABLIST ExcludedList = [,NOTES,FOXUSER,TABLIST,USERS,] SCAN FOR NOT ( [,] + UPPER(TableName) + [,] ) $ ExcludedList          Tbl = ALLTRIM(TABLIST.TableName)          oDataMgr.CloseIfOpen  ( Tbl )          oDataMgr.MakeSQLTable ( Tbl )          oDataMgr.Upload       ( Tbl ) ENDSCAN oDataMgr.CloseIfOpen  ( [KeyFields] ) oDataMgr.MakeSQLTable ( [KeyFields] ) oDataMgr.Upload               ( [KeyFields] ) *(11) Create the CLASSES table, which doesn't exist in the old system SQLEXEC ( oDataMgr.Handle, [DROP TABLE   CLASSES] ) SQLEXEC ( oDataMgr.Handle, [CREATE TABLE CLASSES ( Class Character(15) )] ) SQLEXEC ( oDataMgr.Handle, [INSERT INTO CLASSES VALUES ( 'Computer Lab   ')]) SQLEXEC ( oDataMgr.Handle, [INSERT INTO CLASSES VALUES ( 'Level I 11:15am')]) SQLEXEC ( oDataMgr.Handle, [INSERT INTO CLASSES VALUES ( 'Level II 6:00pm')]) * Final data cleanup... Cmd = [DELETE MEMBERS WHERE CaseNum NOT IN ( SELECT CaseNum FROM Cases )] lr = SQLEXEC ( oDataMgr.Handle, Cmd ) IF lr < 0    MESSAGEBOX( [SQL command failed:] + CHR(13) + Cmd, 16 ) ENDIF * Remove records containing fatal data errors that old system couldn't delete Cmd = [DELETE MEMBERS WHERE KeyValue IN ( 123,6110,6189 )] lr = SQLEXEC ( oDataMgr.Handle, Cmd ) IF lr < 0    MESSAGEBOX( [SQL command failed:] + CHR(13) + Cmd, 16 ) ENDIF * Create a key value for records with a blank value in a required field * in cases where we couldn't reconstruct the correct value, * but don't want to discard the data Cmd = [UPDATE MEMBERS SET AlienNum = LEFT(Last,4) + LEFT(First,4) ;       WHERE KeyValue IN ( 314,323,1141)] lr = SQLEXEC ( oDataMgr.Handle, Cmd ) IF lr < 0    MESSAGEBOX( [SQL command failed:] + CHR(13) + Cmd, 16 ) ENDIF * Throw away orphans if all attempts to build a relation have failed: Cmd = [DELETE ESL WHERE aliennum NOT IN ( SELECT aliennum FROM members )] lr = SQLEXEC ( oDataMgr.Handle, Cmd ) IF lr < 0    MESSAGEBOX( [SQL command failed:] + CHR(13) + Cmd, 16 ) ENDIF * Remove ESL records with blank case numbers Cmd = [DELETE ESL WHERE casenum = '          '] lr = SQLEXEC ( oDataMgr.Handle, Cmd ) IF lr < 0    MESSAGEBOX( [SQL command failed:] + CHR(13) + Cmd, 16 ) ENDIF WAIT CLEAR Took = [ Took ] + TRANSFORM ( TIME() - StartTime ) + [ seconds] MESSAGEBOX( [Ta-da!] + CHR(13) + Took, 64, [Done] ) SET TALK OFF SET STATUS BAR OFF 

Data Cleanup: Conclusion

Data cleanup is an essential part of the migration of a legacy application to its new home. If you use a tool like the conversion program presented in this section, the data starts out its new life on the right foot .

 <  Day Day Up  >  


Visual Fox Pro to Visual Basic.NET
Visual FoxPro to Visual Basic .NET
ISBN: 0672326493
EAN: 2147483647
Year: 2004
Pages: 130
Authors: Les Pinter

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net