Hack 99. Put Wikipedia on Your PSP

Use MySQL and PHP to build a dictionary from Wikipedia that fits in your hip pocket.

Wikipedia (http://www.wikipedia.org) is probably the single most informative site on the Internet. It's a user-contributed encyclopedia, and its sister project, Wiktionary, is a user-contributed dictionary. What's even better is that you can download the entire contents of both and use them for your own purposes.

In my case, I wanted the Wikipedia dictionary on my PSP. Being a PHP hacker, of course I had to use PHP and MySQL; I created a set of static pages from Wikipedia and then downloaded those pages to my PSP memory stick. It's not dynamic, but it still impresses my buddies when I can look up grok on my PSP.

Figure 10-15 shows the basic flow of the processing in this hack. The Wiktionary contents are loaded into the MySQL database [Hack #1]. An elaborate dict.php script takes the contents of the database and creates a set of specially formatted HTML pages appropriate to the PSP.

Figure 10-15. The processing flow of the PSP dictionary creator


10.6.1. The Code

Save the code in Example 10-10 as dict.php.

Example 10-10. Downloading the current Wikipedia to create static HTML
<?php
require_once( "DB.php" );
require_once( "Text/Wiki.php" );

$g_wiki = new Text_Wiki();
$g_wiki->enableRule('html');
$g_wiki->enableRule('list');

// Strip or simplify the wiki markup with regular expressions, then let
// Text_Wiki render what is left as XHTML.
function wikiToHTML( $text )
{
  global $g_wiki;

  // Drop the Pronunciation section and convert headings to Text_Wiki style
  $text = preg_replace( "/\=\=\=\s* Pronunciation.*?\n\=\=\=/is", "\n===", $text );
  $text = preg_replace( "/\=\=\=\=\=\s*(.*?)\s*\=\=\=\=\=/", "\n+++++ $1", $text );
  $text = preg_replace( "/\=\=\=\=\s*(.*?)\s*\=\=\=\=/", "++++ $1", $text );
  $text = preg_replace( "/\=\=\=\s*(.*?)\s*\=\=\=/", "+++ $1", $text );
  $text = preg_replace( "/\=\=\s*(.*?)\s*\=\=/", "++ $1", $text );
  $text = preg_replace( "/\=\s*(.*?)\s*\=/", "++ $1", $text );

  // Remove images and interlanguage links, then flatten the remaining links
  $text = preg_replace( "/\[\[image:.*?\]\]/i", "", $text );
  $text = preg_replace( "/\[\[it:.*?\]\]/i", "", $text );
  $text = preg_replace( "/\[\[.*?\|(.*?)\]\]/", "$1", $text );
  $text = preg_replace( "/\[\[(.*?)\]\]/", "$1", $text );
  $text = preg_replace( "/\[(.*?)\]/", "$1", $text );

  // Normalize list markers
  $text = preg_replace( "/\n\#([^#])/", "\n# $1", $text );
  $text = preg_replace( "/\n\*([^*])/", "\n* $1", $text );

  // Remove comments, tables, and templates
  $text = preg_replace( "/\<\!\-\-.*?\-\-\>/mi", "", $text );
  $text = preg_replace( "/\n\|.*?\|\s*\n/", "", $text );
  $text = preg_replace( "/\n\{\|.*\n/", "", $text );
  $text = preg_replace( "/\n\|\}.*\n/", "", $text );
  $text = preg_replace( "/\n\|\}\n/", "", $text );
  $text = preg_replace( "/\{\{.*?\}\}/", "", $text );
  $text = preg_replace( "/\|\}/", "", $text );
  $text = preg_replace( "/\|.*?\|/", "", $text );
  $text = preg_replace( "/\'\'\'\'\'\'/", "'''\n'''", $text );

  return $g_wiki->transform( $text, 'Xhtml' );
}

// Accept only titles that look like ordinary words of 4 to 20 letters.
function goodWord( $word )
{
  if ( preg_match( "/^[A-Za-z]/", $word ) )
  {
    if ( preg_match( "/[^A-Za-z.-]/", $word ) ) return false;
    if ( preg_match( "/\-$/", $word ) ) return false;
    // The printed listing reads "/\[.]$/" here; a trailing-period test
    // appears to be what was intended.
    if ( preg_match( "/[.]$/", $word ) ) return false;
    if ( preg_match( "/^.-/", $word ) ) return false;
    if ( preg_match( "/^.[.]/", $word ) ) return false;
    $cutword = preg_replace( "/[^A-Za-z]/", "", $word );
    if ( strlen( $cutword ) < 4 ) return false;
    if ( strlen( $cutword ) > 20 ) return false;
    return true;
  }
  return false;
}

// Skip entries that are only redirects to other entries.
function goodText( $text )
{
  if ( preg_match( "/#REDIRECT/i", $text ) )
    return false;
  return true;
}

$g_words = array();
$g_wurl = array();

$dsn = 'mysql://root:password@localhost/wp';
$db =& DB::connect( $dsn, array() );
if (PEAR::isError($db)) { die($db->getMessage()); }

$blocksize = 100;
$total_html = "";
$block = 0;
$block_id = 0;

// Write one block of definitions to pages/words/<block>.html.
// fopen does not create directories, so pages/words must already exist.
function writeBlock( $block, $html )
{
  $fh = fopen( "pages/words/".$block.".html", "w" );
  fwrite( $fh, "<html><head>\n" );
  fwrite( $fh, "<link rel=\"stylesheet\" type=\"text/css\" href=\"../default.css\" />\n" );
  fwrite( $fh, "</head><body><div style='width:478px'>\n" );
  fwrite( $fh, $html );
  fwrite( $fh, "</div></body></html>\n" );
  fclose( $fh );
}

$res = $db->query( "SELECT cur_title as word, cur_text as text FROM cur WHERE cur_namespace=0" );
while ( $res->fetchInto( $row, DB_FETCHMODE_ASSOC ) )
{
  $word = $row['word'];
  $text = $row['text'];
  if ( goodWord( $word ) && goodText( $text ) )
  {
    $c1 = strtolower( $word[0] );
    if ( !isset( $g_words[ $c1 ] ) ) $g_words[ $c1 ] = array();
    $c2 = strtolower( $word[1] );
    if ( !isset( $g_words[ $c1 ][ $c2 ] ) ) $g_words[ $c1 ][ $c2 ] = array();

    $oword = $word;
    $word = strtolower( $word );
    $g_words[ $c1 ][ $c2 ][] = $oword;
    $g_wurl[ $word ] = "../words/".$block_id.".html#".$block;
    print( "$word\n" );

    $total_html .= "<a name=\"".$block."\" />";
    $total_html .= "<div class='word-header'>".$oword."</div>";
    $total_html .= "<table width='100%' cellspacing='0' cellpadding='0'><tr><td> ";
    $total_html .= wikiToHTML( $text );
    $total_html .= "</td></tr></table>";

    if ( $block >= $blocksize )
    {
      writeBlock( $block_id, $total_html );
      $block_id++;
      $block = 0;
      $total_html = "";
    }
    else
      $block++;
  }
}
writeBlock( $block_id, $total_html );

// Home page: one link per first letter
ob_start();
?>
<html><head><title>Index</title>
<link rel="stylesheet" type="text/css" href="default.css" />
</head><body><div style="width:478px;">
<div>
<?php foreach( array_keys( $g_words ) as $c1 ) { ?>
<a href="lev1/<?php echo( $c1 ); ?>.html"><?php echo( $c1 ); ?></a>
<?php } ?>
</div></div></body></html>
<?php
$index = ob_get_clean();
$ih = fopen( "pages/index.html", "w" );
fwrite( $ih, $index );
fclose( $ih );

// Row of first-letter links shared by every lower-level page
ob_start();
foreach( array_keys( $g_words ) as $c1 ) {
?>
<a href="../lev1/<?php echo( $c1 ); ?>.html"><?php echo( $c1 ); ?></a>
<?php
}
$c1header = ob_get_clean();

// First-level pages: one per first letter, linking to two-letter prefixes
foreach( array_keys( $g_words ) as $c1 ) {
  ob_start();
?>
<html><head><title><?php echo( $c1 ); ?></title>
<link rel="stylesheet" type="text/css" href="../default.css" />
</head><body><div style="width:478px;">
<div><?php echo( $c1header ); ?></div>
<?php foreach( array_keys( $g_words[$c1] ) as $c2 ) { ?>
<a href="../lev2/<?php echo( $c1.$c2 ); ?>.html"><?php echo( $c1.$c2 ); ?></a>
<?php } ?>
</div></body></html>
<?php
  $html = ob_get_clean();
  $fh = fopen( "pages/lev1/".$c1.".html", "w" );
  fwrite( $fh, $html );
  fclose( $fh );
}

// Second-level pages: one per two-letter prefix, linking to the words
foreach( array_keys( $g_words ) as $c1 ) {
  ob_start();
  foreach( array_keys( $g_words[$c1] ) as $c2 ) {
?>
<a href="<?php echo( $c1.$c2 ); ?>.html"><?php echo( $c1.$c2 ); ?></a>
<?php
  }
  $c2header = ob_get_clean();

  foreach( array_keys( $g_words[$c1] ) as $c2 ) {
    $words = $g_words[ $c1 ][ $c2 ];
    ob_start();
?>
<html><head><title><?php echo( $c1.$c2 ); ?></title>
<link rel="stylesheet" type="text/css" href="../default.css" />
</head><body><div style="width:478px;">
<div><?php echo( $c1header ); ?></div>
<div><?php echo( $c2header ); ?></div>
<?php foreach( $words as $word ) { ?>
<a href="<?php echo( $g_wurl[ strtolower( $word ) ] ); ?>"><?php echo( $word ); ?></a>
<?php } ?>
</div></body></html>
<?php
    $html = ob_get_clean();
    $fh = fopen( "pages/lev2/".$c1.$c2.".html", "w" );
    fwrite( $fh, $html );
    fclose( $fh );
  }
}
?>

There are four primary sections of this code. The first reads the data from the database. The second section iterates through all of the entries, discarding those it doesn't like and then cleaning up the ones it does like while converting them into HTML.
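To make the cleanup step concrete, here is a minimal sketch that applies only the three link-flattening rules from wikiToHTML, in the same order, to a made-up sample string. The helper name stripWikiLinks and the sample text are assumptions for illustration; the real script applies many more rules and then passes the result to Text_Wiki.

```php
<?php
// Illustrative helper (not part of dict.php): applies only the
// link-flattening rules from wikiToHTML, in the same order.
function stripWikiLinks( $text )
{
  // [[target|label]] becomes label
  $text = preg_replace( "/\[\[.*?\|(.*?)\]\]/", "$1", $text );
  // [[target]] becomes target
  $text = preg_replace( "/\[\[(.*?)\]\]/", "$1", $text );
  // [anything] becomes anything
  $text = preg_replace( "/\[(.*?)\]/", "$1", $text );
  return $text;
}

// Made-up sample text for demonstration.
$sample = "To [[understand|grok]] a [[word]] is to know it [fully].";
echo stripWikiLinks( $sample ), "\n";
```

The order of the rules matters: piped links are handled first so that the label, not the link target, is what survives into the final text.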

The HTML for the word entries is written in blocks of about 100 words per file. If the script were to create one file per word, the total size in bytes of the output would be the same, but the result would exceed the capacity of most memory sticks, because every file occupies at least one full allocation unit on the stick's filesystem no matter how small it is. So, the script groups words into files of roughly 100 words each. It keeps track of which words are in which file using the $g_wurl hash table, which maps each lowercased word (the key) to the URL of its definition (the value).
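The mapping from a word to its file and anchor can be sketched as follows. The helper blockLocation is hypothetical and uses plain division, so it illustrates the idea rather than reproducing the script's running counters exactly (those place one extra entry in each file before rolling over).

```php
<?php
// Hypothetical helper: maps the zero-based position of an accepted word
// to the page file and in-page anchor that would hold its definition.
function blockLocation( $index, $blockSize = 100 )
{
  $blockId = (int) floor( $index / $blockSize );  // which file
  $anchor = $index % $blockSize;                  // position inside that file
  return "../words/" . $blockId . ".html#" . $anchor;
}

echo blockLocation( 0 ), "\n";
echo blockLocation( 250 ), "\n";
```

Under this scheme the 251st accepted word lands in the third file, so a dictionary of tens of thousands of words needs only a few hundred files.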

The remaining two sections of the script output the first and second levels of letter pages. The first-level pages have the letters of the alphabet across the top and links to the letter plus a second letter in the row below that. The second-level pages have all of the words that start with a particular two-letter combination. This breakdown into first and second letters was to keep the size of each page manageable.
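The two-level breakdown relies on a nested array keyed by first letter and then by second letter. A minimal sketch of that grouping, using a hypothetical helper groupByPrefix and a made-up word list, might look like this:

```php
<?php
// Illustrative helper: builds the same nested shape as $g_words,
// keyed by lowercased first letter and then by lowercased second letter.
function groupByPrefix( array $words )
{
  $groups = array();
  foreach ( $words as $word )
  {
    $c1 = strtolower( $word[0] );
    $c2 = strtolower( $word[1] );
    $groups[ $c1 ][ $c2 ][] = $word;
  }
  return $groups;
}

// Made-up word list for demonstration.
$groups = groupByPrefix( array( "folklore", "Folly", "grok", "zymurgy" ) );

// Print each first letter followed by its two-letter prefixes,
// which is what a first-level page links to.
foreach ( $groups as $c1 => $seconds )
{
  $prefixes = array();
  foreach ( array_keys( $seconds ) as $c2 )
  {
    $prefixes[] = $c1 . $c2;
  }
  echo $c1, ": ", implode( " ", $prefixes ), "\n";
}
```

Each first-level page lists the keys of one inner array, and each second-level page lists the words stored under one two-letter key.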

10.6.2. Running the Hack

This hack requires the Text_Wiki PEAR module [Hack #2]. After that, you need to download the most recent English dictionary (Wiktionary) database from Wikipedia (the relevant URLs are http://download.wikipedia.org/ and http://download.wikipedia.org/wiktionary/).

The next step is to load the dictionary into your MySQL database:

 mysqladmin --user=root --password=password create wp
 mysql --user=root --password=password wp < 20050623_cur_table.sql

The name of the file will change based on when you download the dictionary from Wikipedia.
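The dict.php script writes its output with fopen into pages/words, pages/lev1, and pages/lev2, and fopen does not create missing directories. If those directories do not exist yet, a small setup script along these lines could create them first. The helper name ensureOutputDirs is an assumption for illustration and is not part of the original hack.

```php
<?php
// Hypothetical setup helper: creates the directories dict.php writes into.
function ensureOutputDirs( $base )
{
  foreach ( array( "words", "lev1", "lev2" ) as $sub )
  {
    $dir = $base . "/" . $sub;
    // The third argument asks mkdir to create parent directories as well.
    if ( !is_dir( $dir ) && !mkdir( $dir, 0755, true ) )
    {
      die( "Could not create " . $dir . "\n" );
    }
  }
}

// dict.php uses paths relative to the current directory, so run this
// from the same directory you will run dict.php from.
ensureOutputDirs( "pages" );
```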


With your dictionary in place, run the dict.php script:

 % php dict.php
 aant
 aave
 abta
 acas
 acats
 aclu
 acme
 acronym
 …
 zygapophysial
 zygapophysis
 zygote
 zymurgy
 zythum
 zyzzyva
 %

Grab some coffee; the Wikipedia databases are very large, and processing them takes a while. On my G4 PowerBook, it took about an hour to create all of the HTML files.

Next, copy the HTML files to your PSP memory stick: put your PSP into USB mode, attach it to your computer, and copy the generated files across.

Bring up the PSP browser and surf to file://common/index.html. You should see something like Figure 10-16.

Figure 10-16. The home page of the dictionary


From here, you can select an initial letter and then a second letter (to be more specific and selective). That brings up a page of words that start with the first two letters that you selected. This page is shown in Figure 10-17.

Figure 10-17. The drill-down page after selecting the first and second characters


Now, find a word that interests you and click on it. That will take you to the detail page, which defines the word. This page is shown in Figure 10-18.

Pretty sweet!

As Wikipedia keeps growing, the size of the dictionary will change depending on when you run this process. When I ran this script, the dictionary HTML was around 30 MB in size, which fit easily onto even the smallest memory stick.

10.6.3. See Also

  • "Read RSS Feeds on Your PSP" [Hack #90]

Figure 10-18. The display after selecting "folklore"




PHP Hacks
PHP Hacks: Tips & Tools For Creating Dynamic Websites
ISBN: 0596101392
EAN: 2147483647
Year: 2006
Pages: 163
