Parsing Large Files

   

The method above works well for relatively simple files, but what if you need to parse a large file with a basic structure that has a few subtle variations? One method is to use the ereg() function[1] to seek out patterns of data.

[1] See Chapter 4 for details on the ereg() function.

This next script parses a large text file and places its contents into a database. The file is a list of 789 multiple-choice questions that appear on the FAA's private-pilot written exam.[2]

[2] The questions are not under copyright and are freely available.

The document follows a basic structure, in that it has numbered questions followed by three possible answers. Some of the questions reference a figure, while other questions reference multiple figures. The goal of the script is to parse the text file and place each of the questions into a MySQL database, along with the corresponding choices for the answers. If there is a figure, or figures, that is referenced by the question, then the script also places it in the database.

The document is available on the Internet at av-info.faa.gov/data/airmanknowledge/par.txt. The document is very large, and the structure of some questions is slightly different, especially when referenced figures are mentioned.

The script takes the questions and places all of the information in a database, so that it is easier to manage the data and port it into a format usable by other applications.

This script makes use of another regular expression function called preg_split(). preg_split() is able to break up a string into an array, similar to the commonly used explode() function. Instead of breaking up the string based on a character, it breaks up the string based on a regular expression match.

Look at the file at av-info.faa.gov/data/airmanknowledge/par.txt to become familiar with the general struture of the data before reading over the script descriptions. Things will seem much more clear.

The script is comprised of three files:

  • parsefile.sql The SQL statements required to create and set up the database.

  • parsefile.php The PHP script that parses the text file and places the data into a database.

  • displaydata.php The PHP script that reads the data from the database and displays it to the user. This script is just an example of what you can do with the data once it has been placed in the database.

Script 6 3 parsefile.sql
  1.  //mysqladmin create faa  2.  3.  grant ALL on faa.* to php@localhost identified by "password";  4.  5.  create table questions(  6.    id INT NOT NULL,  7.    code varchar(8),  8.    question TEXT,  9.    primary key(id)); 10. 11.  create table answers( 12.    question INT NOT NULL, 13.    letter CHAR(1), 14.    answer TEXT); 15. 16.  create table figures( 17.    question INT NOT NULL, 18.    figure VARCHAR(3), 19.    additional VARCHAR(24)); 20. 21.  create table info( 22.    id INT NOT NULL, 23.    version VARCHAR(8), 24.    date VARCHAR(10), 25.    bank VARCHAR(16)); 

Script 6-3. parsefile.sql Line-by-Line Explanation

LINE

DESCRIPTION

1

Use the mysqladmin command line program to create a database called "faa".

3

Grant the required privileges so that a user can access the database using the username "php" and the password "password".

5 9

Create a table to hold the text of the questions.

11 14

Create a table to hold the answers. There are three answers for each question.

16 19

Create a table to hold any figure references contained in a question.

21 25

Create a table to hold the version and date information for the question bank.

Script 6 4 parsefile.php

[View full width]

   1. <?   2. function connect() {   3.   if(!mysql_connect("localhost","php","password")) {   4.     die("<h1>Could Not Connect To MySQL Server</h1>");   5.   } elseif(!mysql_select_db("faa")) {   6.     die("<h1>Could Not Access DB</h1>");   7.   }   8.   return 1;   9. }  10.  11. function parse($data) {  12.   global $question_num;  13.   //IF REFERRING TO A FIGURE, account for inconsistent way the FAA labels these  14.   if(ereg("(\(Refer to figure )([0-9]+)(\.\)|(\)\.))", $data, $regs)) {  15.     $data2 = preg_split("/(\(Refer to figure )([0-9]+)(\.\)|(\)\.))/", $data);  16.     sql_query("insert into figures values('$question_num','" . $regs[2] . "', NULL)");  17.     return trim($data2[1]);  18.   //IF REFERRING TO MULTIPLE FIGURES, account for inconsistent way FAA labels these  19.   } elseif (ereg("(\(Refer to figures |\(Refer to figure )([0-9]+|[0-9]+) (, and |and  graphics/ccc.giffigure )([0-9]+)(\.)(\))", $data, $regs)) {  20.     $data2 = preg_split("/(\(Refer to figures |\(Refer to figure )([0-9]+|[0-9]+) (, and  graphics/ccc.gif|and figure )([0-9]+)(\.)(\))/", $data);  21.     sql_query("insert into figures values('$question_num','" . $regs[2] . "', NULL)");  22.     sql_query("insert into figures values('$question_num','" . $regs[4] . "', NULL)");  23.     return trim($data2[1]);  24.   //IF REFERRING TO A FIGURE WITH AREA(S), account for inconsistent way FAA labels these  25.   } elseif(ereg("(\(Refer to figure )([0-9]+)(, | )([a-zA-Z]+)( [ a-zA-Z0-9]+)(\.)(\))",  graphics/ccc.gif$data, $regs)) {  26.     $data2 = preg_split("/(\(Refer to figure )([0-9]+)(, | )([a-zA-Z]+)( [ a-zA-Z0-9]+)(\.)(\ graphics/ccc.gif))/", $data);  27.     sql_query("insert into figures values('$question_num','" . $regs[2] . "', 'area " .  graphics/ccc.gif$regs[5] . "')");  28.     return trim($data2[1]);  29.   //FUNKY SEMICOLON AND MULTIPLE FIGURES AREAS SCREWING EVERYTHING UP...  30.   } elseif(ereg("(\(Refer to figure )([0-9]+)(, | )([a-zA-Z]+)( [ a-zA-Z0-9]+)(; and  graphics/ccc.giffigure )([0-9]+)(\.)(\))", $data, $regs)) {  31.     $data2 = preg_split("/(\(Refer to figure )([0-9]+)(, | )([a-zA-Z]+)( [ a-zA-Z0-9]+)(;  graphics/ccc.gifand figure )([0-9]+)(\.)(\))/", $data);  32.     sql_query("insert into figures values('$question_num','" . $regs[2] . "', 'area " .  graphics/ccc.gif$regs[5] . "')");  33.     sql_query("insert into figures values('$question_num','" . $regs[7] . "', NULL)");  34.     return trim($data2[1]);  35.   //EVERYTHING ELSE  36.   } else {  37.     return $data;;  38.   }  39. }  40.  41. function sql_query($query){  42.   if(!mysql_query($query)) {  43.     die("<h1>Error with QUERY: $query</h1>");  44.   }  45.   //echo "<P>$query";  46. }  47.  48. /***** MAIN *****/  49. //$file = "http://av-info.faa.gov/data/airmanknowledge/par.txt";  50. $file = "par.txt";  51. $contents = file($file);  52. $size = sizeof($contents);  53. $questions = "not started";  54.  55. $answer_let = "A";  56. $question = "";  57.  58. connect();  59. mysql_query("delete from questions");  60. mysql_query("delete from answers");  61. mysql_query("delete from figures");  62. mysql_query("delete from info");  63. insert into info values ('1',NULL,NULL,NULL);  64. for($i = 0; $i < $size; $i++) {  65.   if(trim($contents[$i]) != ""){  66.     $data = $contents[$i];  67.     if(ereg("^Version", $data)) {  68.       $versioninfo = explode(":", $data);  69.       $version = trim($versioninfo[1]);  70.       sql_query("update info set version = '$version' where id = '1'");  71.     } elseif(ereg("[0-9]+/[0-9]+/[0-9]+", $data)) {  72.       $date = trim($data);  73.       sql_query("update info set date = '$date' where id = '1'");  74.     } elseif(ereg("^Bank", $data)) {  75.       $bankinfo = explode(":", $data);  76.       $bank = trim($bankinfo[1]);  77.       sql_query("update info set bank = '$bank' where id = '1'");  78.     } elseif(ereg("^[0-9]+\.", $data)) {  79.       if($questions == "started") {  80.         sql_query("update questions set question = '" . addslashes($question) . "' where id =  graphics/ccc.gif'$question_num'");  81.       $question = "";  82.     }  83.      $questions = "started";  84.      $data2 = preg_split("/\s+/", $data);  85.      $question_num = rtrim($data2[0], ".");  86.      $code = $data2[1];  87.      sql_query("insert into questions values('$question_num', '$code', NULL)");  88.    }elseif(ereg("(^A\.)|(^B\.)|(^C\.)", $data)) {  89.      $answer = preg_split("/(^A\.)|(^B\.)|(^C\.)/", $data);  90.      sql_query("insert into answers values ('$question_num','$answer_let','" . addslashes( graphics/ccc.gif$answer[1]) . "')");  91.      $answer_let++;  92.      if($answer_let == "D"){  93.        $answer_let = "A";  94.      }  95.    } else {  96.      if($questions == "started") {  97.        $question .= parse($data);  98.      }  99.     } 100.   } 101. } 102. echo "Operation Completed with No Errors"; 103. ?> 

Script 6-4. parsefile.php Line-by-Line Explanation

LINE

DESCRIPTION

2 9

Define a function, connect(), that is used to establish the connection to the database. Replace the mysql_connect variables with those for your MySQL server.

11 39

Define a function, parse(), that parses the data to extract figure references. Since we may want to hyperlink the figure references to actual figures, we need to parse the data using the ereg() function to look for these figure references in the text. This function takes one argument, $data, which is data that has been read by the main program and established as question data.

12

Allow the $question_num variable to be read by this function.

13

Provide a comment explaining the case that is to follow.

14

If $data matches the ereg() expression, then execute lines 15 17. If not, go on to line 19.

This line tries to match the data that has a reference to a figure. Since the text is inconsistent in how it labels figures, we need to use a regular expression match that takes into account several variances found in the text. The regular expression matches against the following (where # is any range of numerals):

  • (Refer to figure #.)

  • (Refer to figure #)

  • (Refer to figure #).

Notice also that we use the $regs argument to the ereg() function. The $regs argument, when used with ereg(), causes ereg() to return any matches that occur in the parentheses of the ereg search as items in the $regs array. For example, if the text was "(Refer to Figure 8). How many licks does it take to get to the center…", then ereg() would return the following:

  • $regs[0] = "(Refer to Figure 8.) How many licks does it take…" //the complete match

  • $regs[1] = "(Refer to figure "

  • $regs[2] = "8"

  • $regs[3] = ".)"

Nothing else is returned, because we did not specify anything else within parentheses in the ereg() expression.

15

Since the ereg() function matched on line 14, this line separates the figure reference from the actual question. The preg_split() function, as noted earlier, works like explode(). In this case, it takes the match above and "explodes" it so that $data is broken into two parts. The first part is the figure reference, and the second part is the actual question. If we follow the example of $data being "(Refer to Figure 8.) How many licks does it take…", then this line would create an array with two items:

  • $data2[0] = "(Refer to Figure 8.)"

  • $data2[1] = "How many licks does it take…"

16

Place the figure reference into the database using the $question_num variable as a pseudo-foreign key.

17

Return the actual questions, sans the figure reference, back to the main program. Since file() breaks the text file into an array based on each line, it is possible that the question spans more than one line, and that a line is only a fragment of the question. The main program concatenates the question parts from each line into a full question and later places the full question into the database.

18

Provide a comment explaining the case that is to follow.

19

This case is much the same as the previous, only it tries to match against questions that have multiple figure references, for example:

  • (Refer to figures # and #)

  • (Refer to figure # and figure #)

  • (Refer to figures #, and #)

Since the text has, at most, two figures per question, we need to extract both of the figure numbers to place in the database. Again we use the $regs argument to ereg(), and by carefully grouping the regular expression with parentheses, we know that the figure number will be found in the second and fourth groupings, $regs[2] and $regs[4].

20

As before, we need to split the text to remove the figure reference from the actual question, using the same regular expression as in the previous line for the split.

21 22

Place the multiple figure references in the database.

23

Return the actual question portion of the data back to the main program for further processing.

24

Provide a comment explaining the case that is to follow.

25

If you looked at the text file, you may have noticed that some questions with figure references also have a specific area of the figure that was to be referenced. This ereg() function takes these area references into account.

26

Split the text to extract the actual question portion.

27

Place the figure reference into the database, along with the area reference to the figure.

28

Return the question to the main program.

29

Provide a colorful comment explaining the case that is to follow.

30

Finally, we take into account multiple figures that include areas, with an odd semicolon or comma thrown in for good measure. This ereg() function matches against text such as the following:

  • (Refer to figure #, area A; and figure #)

  • (Refer to figure # area #, and figure #)

  • (Refer to figure # area A, and #)

31

Once again, split the text to extract the actual question.

32 33

Place the figure/area references into the database.

34

Return the actual question to the main program.

35

Provide a comment explaining the case that is to follow.

36 38

If there is no figure reference, then we assume that the remaining data is just lines of the question and return it to the main program to be appended to the other parts of the question that have already been sent.

39

End the function declaration.

41 46

Define a function, sql_query(), that attempts to send the query to the database. If there is an error, the script is killed. There is an echo line that is commented out. Uncomment this line if you'd like the query statements to be printed to the browser.

48

Begin the main program.

49 50

Define the $file variable that will be used to open the file. Each of these lines does the same thing. If you copied the file to your local machine, then uncomment line 50 and comment line 49. If you have a fast connection to the Internet, do the opposite.

51

Use the file() function to read the contents of $file into an array.

52

Determine how many lines are in the file by seeing how many records are in the array.

53

Define a variable, $questions, and assign it the value "not started". This is done because there is some extraneous information at the beginning of the file that must be parsed over before we can start parsing the questions themselves.

55

Define a variable, $answer_let, and assign it the value of "A". This is used to track the letter of the current answer (A, B, or C).

56

Define a variable, $question, and give it a blank value. This will be used to hold and append the question fragments as they are read from the file.

58

Establish a connection to the database using the connect() function.

59 62

Execute queries to delete the contents of the database. Since the database is only designed to hold one version of questions at a time, you must ensure that there is no existing data in the database.

63

Insert a "default" row into the info table as a placeholder.

64 102

Loop through each line in the text file and parse it.

65

If the line is not blank, then continue. Otherwise, go to line 101.

66

Assign the value of the current line ($contents[$i]) to the $data variable to save some typing.

67 70

If the line starts with "Version", then explode the line on the ":" character, trim any whitespace, and place the version number of the test bank into the database by sending the SQL to the sql_query() function.

71 73

If the line contains a date format for example, "6/11/2002" then trim the whitespace around the text and place it into the database.

74 77

If the line starts with "Bank", then explode the line on the ":" character, trim any whitespace, and place the bank information in the database.

78

If the line begins with a number immediately followed by a period, then we know, from studying the text, that it is a question number. Continue through line 87.

79 82

If the $question variable has been set to started, then we know we have some question data already assembled. Place that question in the database and clear the text from our $question variable, since we will be placing a new question in that variable.

83

Set the $questions variable to started (it may already be set that way, but if this is the first question encountered, it has not been).

84

Split the current line up along any whitespace. The initial line for a question looks like this:

 

"1.

A01

PVT"

 

This breaks the line into three array elements and assigns it to the $data2 variable.

85

Get the question number from the $data2 array, trim off the trailing period, and assign it to the $question_num variable.

86

Get the question code from the $data2 variable and assign it to the $code variable.

87

Insert the question number and code into the database. We leave the actual question part out at this time and only place a NULL into the database in its place. A little later, we'll put the complete question in, since it spans multiple lines.

88

If the line begins with an "A.", "B.", or "C.", then we know that this is one of the multiple choice answers to the question. Continue to line 95.

89

Split the answer letter from the actual answer by using the preg_split() function.

90

Insert the answer into the database using the question number as a foreign key. Use the $answer_let variable as the answer letter.

91

Increment the $answer_let variable, since we know the next answer letter is going to be one more than the current.

92 94

If the $answer_let variable is "D", then set it back to "A", since we know each question only has three possible answers, "A", "B", or "C".

95 99

If none of the above has matched, then we know we have an actual question, since it is not preceded by any of the above matches. Send this part of the text to be parsed by the parse() function.

100

End the if statement started on line 65.

101

End the for loop started on line 64.

102

If we've made it this far without the script dying, then we know that no errors have occurred. Print out a message telling yourself so!

103

End the PHP for the script. Now run displaydata.php to see the result of your labor.

This next script, shown in Figure 6-3, merely displays what we have put in the database. It's just a quick example to show you how thoroughly parsefile.php parsed the large text document.

Script 6 5 displaydata.php

[View full width]

  1.  <html>  2.  <head><title>Display Parsed File From Database</title>  3.  <style type=text/css>  4.  h1, h2, h3, p, td, a {font-family: verdana, arial, helvetica, sans-serif;}  5.  .navy {color: navy; }  6.  </style>  7.  </head>  8.  <body>  9.  <? 10. 11.    function connect() { 12.      if(!mysql_connect("localhost","php","password")) { 13.        die("<h1>Could not connect to MySQL Server</h1>"); 14.      } elseif(!mysql_select_db("faa")) { 15.        die("<h1>Could Not Access DB</h1>"); 16.      } 17.    return 1; 18.  } 19. 20.  function display($sql) { 21.    $color = "#FFFFFF"; 22.    connect(); 23.    $result = mysql_query($sql); 24.    while($row = mysql_fetch_array($result)) { 25.      $question_num = $row['id']; 26.      echo "<tr bgcolor=\"$color\"><td class=navy>" . $question_num . "</td><td><pre>" .  graphics/ccc.gifstripslashes($row['question'] . "</pre>"); 27.      $sql2 = "select * from figures where question = '$question_num' order by figure"; 28.      $result2 = mysql_query($sql2); 29.      if(mysql_num_rows($result2) > 0) { 30.        while($row2 = mysql_fetch_array($result2)) { 31.          echo "<br>See Figure: " . $row2['figure'] . " " . $row2['additional']; 32.        } 33.      } 34.      echo "</td></tr>"; 35.      $sql3 = "select * from answers where question = '$question_num' order by letter"; 36.      $result3 = mysql_query($sql3); 37.      while($row3 = mysql_fetch_array($result3)) { 38.        echo "<tr bgcolor=\"$color\"><td colspan=2>"; 39.        echo $row3['letter'] . ". " . stripslashes($row3['answer']); 40.        echo "</td></tr>"; 41.      } 42.      if($color == "#FFFFFF") { 43.        $color = "#FFFF80"; 44.      } else { 45.        $color= "#FFFFFF"; 46.      } 47.      echo "</td></tr>"; 48.      flush(); 49.    } 50.  } 51.  ?> 52.  <table border=1 cellspacing=0 cellpadding=5> 53.  <? 54.  if(!isset($next)) { 55.    $sql = "select * from questions limit 0, 200"; 56.    $next = 0; 57.  } else { 58.    $sql = "select * from questions order by id limit $next, 200"; 59.  } 60.  display($sql); 61.  $next += 200; 62.  ?> 63.  </table> 64.  <p><a href="displaydata.php?next=<?=$next?>">Next 200 Questions</a> 65.  </body> 66.  </html> 
Figure 6-3. displaydata.php

graphics/06fig03.jpg

Script 6-5. displaydata.php Line-by-Line Explanation

LINE

DESCRIPTION

1 9

Print out the beginning HTML for the script, including some style sheet formatting.

11 18

Define a function, connect(), to connect to the database.

20 50

Define a function, display(), that displays the data in the database. The display() function takes one argument, $sql, which is the initial SQL statement used to display a set number of questions.

21

Initialize the $color variable. This is used to alternate colors for each of the questions to make the page more readable.

22

Connect to the database using the connect() function.

23

Generate an SQL result using the $sql variable that was passed to the function.

24 34

Display the question and any figure references associated with the question.

35 41

Display the answers to the associated question.

42 46

Change the $color variable so that it is the opposite of the current color.

47

Close out the row.

48

Flush content to the browser so a timeout does not occur.

49

End the while loop started on line 24.

50

End the function declaration.

51

Get out of PHP mode for a moment.

52

Print out the beginning of a table.

53

Get back into PHP mode.

54 56

If the $next variable has not been set, then create an SQL statement to display the first 200 questions in the test bank and set $next to ")" to initialize it.

57 59

If the $next variable has been set, then we know the user has pushed the "Next…" link at the bottom of the page. Create an SQL statement to show the next 200 questions, depending on what questions they have already seen.

60

Execute the display() function using the generated SQL from the previous lines.

61

Increment the $next variable.

62

Get out of PHP mode again.

63

Close the table.

64

Print a link to display the next 200 questions in the bank.

65 66

Close out the HTML for the page.


   
Top


Advanced PHP for Web Professionals
Advanced PHP for Web Professionals
ISBN: 0130085391
EAN: 2147483647
Year: 2005
Pages: 92

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net