2007-06-08

Hitting the transaction limit

It looks like it will be necessary to occasionally break down and re-establish the connection to the database in the importDump.php utility. Running it for too long results in the following:
C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance>"C:\Program Files\Zend\Core\bin
\php.exe" ImportDump.php s:\enwiki-20070402-pages-articles.xml
100 (50.09 pages/sec 50.09 revs/sec)
200 (51.75 pages/sec 51.75 revs/sec)
300 (50.43 pages/sec 50.43 revs/sec)
400 (50.78 pages/sec 50.78 revs/sec)
500 (51.62 pages/sec 51.62 revs/sec)
600 (52.31 pages/sec 52.31 revs/sec)
700 (51.62 pages/sec 51.62 revs/sec)
800 (51.35 pages/sec 51.35 revs/sec)
900 (51.24 pages/sec 51.24 revs/sec)
1000 (51.57 pages/sec 51.57 revs/sec)
1100 (51.85 pages/sec 51.85 revs/sec)
1200 (52.11 pages/sec 52.11 revs/sec)
1300 (51.62 pages/sec 51.62 revs/sec)
1400 (51.11 pages/sec 51.11 revs/sec)
1500 (50.16 pages/sec 50.16 revs/sec)
1600 (50.30 pages/sec 50.30 revs/sec)
1700 (50.09 pages/sec 50.09 revs/sec)
1800 (50.31 pages/sec 50.31 revs/sec)
1900 (50.39 pages/sec 50.39 revs/sec)
2000 (50.37 pages/sec 50.37 revs/sec)
2100 (50.34 pages/sec 50.34 revs/sec)
2200 (50.30 pages/sec 50.30 revs/sec)
2300 (50.41 pages/sec 50.41 revs/sec)
2400 (50.51 pages/sec 50.51 revs/sec)
2500 (49.89 pages/sec 49.89 revs/sec)
2600 (50.17 pages/sec 50.17 revs/sec)
2700 (50.34 pages/sec 50.34 revs/sec)
2800 (50.08 pages/sec 50.08 revs/sec)
2900 (50.18 pages/sec 50.18 revs/sec)
3000 (49.94 pages/sec 49.94 revs/sec)
3100 (48.71 pages/sec 48.71 revs/sec)
3200 (48.58 pages/sec 48.58 revs/sec)
3300 (48.43 pages/sec 48.43 revs/sec)
3400 (48.49 pages/sec 48.49 revs/sec)
3500 (48.51 pages/sec 48.51 revs/sec)
3600 (48.54 pages/sec 48.54 revs/sec)
3700 (48.52 pages/sec 48.52 revs/sec)
3800 (25.39 pages/sec 25.39 revs/sec)
3900 (15.09 pages/sec 15.09 revs/sec)
4000 (10.41 pages/sec 10.41 revs/sec)
4100 (8.62 pages/sec 8.62 revs/sec)
4200 (6.98 pages/sec 6.98 revs/sec)
4300 (4.98 pages/sec 4.98 revs/sec)
4400 (3.79 pages/sec 3.79 revs/sec)
4500 (3.20 pages/sec 3.20 revs/sec)
4600 (2.66 pages/sec 2.66 revs/sec)
4700 (2.45 pages/sec 2.45 revs/sec)
4800 (2.25 pages/sec 2.25 revs/sec)
4900 (2.13 pages/sec 2.13 revs/sec)
5000 (1.99 pages/sec 1.99 revs/sec)
5100 (1.83 pages/sec 1.83 revs/sec)
5200 (1.70 pages/sec 1.70 revs/sec)
5300 (1.61 pages/sec 1.61 revs/sec)
5400 (1.56 pages/sec 1.56 revs/sec)
5500 (1.49 pages/sec 1.49 revs/sec)
5600 (1.43 pages/sec 1.43 revs/sec)
5700 (1.37 pages/sec 1.37 revs/sec)
5800 (1.34 pages/sec 1.34 revs/sec)
5900 (1.30 pages/sec 1.30 revs/sec)
6000 (1.28 pages/sec 1.28 revs/sec)
6100 (1.23 pages/sec 1.23 revs/sec)
6200 (1.19 pages/sec 1.19 revs/sec)
6300 (1.13 pages/sec 1.13 revs/sec)
6400 (1.09 pages/sec 1.09 revs/sec)
6500 (1.04 pages/sec 1.04 revs/sec)
6600 (1.01 pages/sec 1.01 revs/sec)
6700 (0.98 pages/sec 0.98 revs/sec)
6800 (0.96 pages/sec 0.96 revs/sec)
6900 (0.95 pages/sec 0.95 revs/sec)
7000 (0.94 pages/sec 0.94 revs/sec)
7100 (0.91 pages/sec 0.91 revs/sec)
7200 (0.89 pages/sec 0.89 revs/sec)
7300 (0.87 pages/sec 0.87 revs/sec)
7400 (0.86 pages/sec 0.86 revs/sec)
7500 (0.86 pages/sec 0.86 revs/sec)
7600 (0.87 pages/sec 0.87 revs/sec)
7700 (0.87 pages/sec 0.87 revs/sec)
7800 (0.88 pages/sec 0.88 revs/sec)
7900 (0.86 pages/sec 0.86 revs/sec)
8000 (0.86 pages/sec 0.86 revs/sec)
exception 'com_exception' with message 'Source: Microsoft OLE DB Provider for SQL Server
Description: Cannot start more transactions on this session.' in C:\PHP\includes\adodb\drivers\adodb-ado5.inc.php:290
Stack trace:
#0 C:\PHP\includes\adodb\drivers\adodb-ado5.inc.php(290): com->BeginTrans()
#1 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\DatabaseADODB.php(1135): ADODB_ado->BeginTrans()
#2 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\JobQueue.php(170): DatabaseADODB->begin()
#3 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\HTMLCacheUpdate.php(81): Job::batchInsert(Array)
#4 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\HTMLCacheUpdate.php(46): HTMLCacheUpdate->insertJobs(Object(ResultWrapper))
#5 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\Title.php(2424): HTMLCacheUpdate->doUpdate()
#6 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\Article.php(2644): Title->touchLinks()
#7 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\SpecialImport.php(363): Article::onArticleCreate(Object(Title))
#8 [internal function]: WikiRevision->importOldRevision()
#9 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\DatabaseADODB.php(1048): call_user_func_array(Array, Array)
#10 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\SpecialImport.php(510): DatabaseADODB->deadlockLoop(Array)
#11 [internal function]: WikiImporter->importRevision(Object(WikiRevision))
#12 C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance\importDump.php(58): call_user_func(Array, Object(WikiRevision))
#13 [internal function]: BackupReader->handleRevision(Object(WikiRevision), Object(WikiImporter))
#14 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\SpecialImport.php(761): call_user_func_array(Array, Array)
#15 [internal function]: WikiImporter->out_revision(Resource id #55, 'revision')
#16 C:\Inetpub\wwwroot\wiki193\mediawiki\includes\SpecialImport.php(426): xml_parse(Resource id #55, ']] and [[Botswa...', 0)
#17 C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance\importDump.php(109): WikiImporter->doImport()
#18 C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance\importDump.php(91): BackupReader->importFromHandle(Resource id #54)
#19 C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance\importDump.php(129): BackupReader->importFromFile('s:\bauch\enwiki...')
#20 {main}
C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance>

2007-06-07

Testing (and fixing) the Maintenance Utilities

With the wiki up and stable for a couple of weeks now, it was time to try to push some limits. To do this, I downloaded a dump of the English Wikipedia. The command-line tool ImportDump.php works for a while (albeit slower than I would have liked). Here's what prints to the console:
C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance>"C:\Program Files\Zend\Core\bin
\php.exe" ImportDump.php s:\enwiki-20070402-pages-articles.xml
100 (3.27 pages/sec 3.27 revs/sec)
200 (1.90 pages/sec 1.90 revs/sec)
300 (1.40 pages/sec 1.40 revs/sec)
400 (1.13 pages/sec 1.13 revs/sec)
500 (1.04 pages/sec 1.04 revs/sec)
600 (1.01 pages/sec 1.01 revs/sec)
700 (0.79 pages/sec 0.79 revs/sec)
800 (0.74 pages/sec 0.74 revs/sec)
900 (0.68 pages/sec 0.68 revs/sec)
1000 (0.66 pages/sec 0.66 revs/sec)
1100 (0.63 pages/sec 0.63 revs/sec)
1200 (0.62 pages/sec 0.62 revs/sec)
1300 (0.59 pages/sec 0.59 revs/sec)
1400 (0.59 pages/sec 0.59 revs/sec)
A database query syntax error has occurred.
The last attempted database query was:
"COMMIT"
from within function "Database::deadlockLoop".
MySQL returned error "3902: e (jmolafbwikidev)"
This turns out to require some fixes in my DatabaseADODB.php. In particular, in the deadlockLoop() method. It also looks like my indexes on the page table need a little tweeking. After that, rebuildrecentchanges.php doesn't want to work at all
C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance>"C:\Program Files\Zend\Core\bin
\php.exe" rebuildrecentchanges.php
PHP Notice:  Undefined variable: wgDBadminuser in C:\Inetpub\wwwroot\wiki193\med
iawiki\maintenance\rebuildrecentchanges.php on line 15
PHP Notice:  Undefined variable: wgDBadminpassword in C:\Inetpub\wwwroot\wiki193
\mediawiki\maintenance\rebuildrecentchanges.php on line 16
Loading from page and revision tables...
A database query syntax error has occurred.
The last attempted database query was:
"INSERT  INTO [recentchanges] (rc_timestamp,rc_cur_time,rc_user,rc_user_text,rc_
namespace,rc_title,rc_comment,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_
last_oldid,rc_type) SELECT TOP 5000  rev_timestamp,rev_timestamp,rev_user,rev_us
er_text,page_namespace,page_title,rev_comment,rev_minor_edit,0,page_is_new,page_
id,rev_id,0, IF(page_is_new != 0, 1, 0)  FROM [page],[revision]   WHERE (rev_tim
estamp > '20070531214411') AND (rev_page=page_id)  ORDER BY rev_timestamp DES
C"
from within function "rebuildRecentChangesTablePass1".
MySQL returned error "170: n (jmolafbwikidev)"
The fix for this is to promote the method conditional from the Database class to the DatabaseADODB class, where it is implemented as:
 function conditional( $cond, $trueVal, $falseVal ) {
  return " CASE WHEN $cond THEN $trueVal ELSE $falseVal END ";
 }
Running again reveals another problem:
C:\Inetpub\wwwroot\wiki193\mediawiki\maintenance>"C:\Program Files\Zend\Core\bin
\php.exe" rebuildrecentchanges.php
PHP Notice:  Undefined variable: wgDBadminuser in C:\Inetpub\wwwroot\wiki193\med
iawiki\maintenance\rebuildrecentchanges.php on line 15
PHP Notice:  Undefined variable: wgDBadminpassword in C:\Inetpub\wwwroot\wiki193
\mediawiki\maintenance\rebuildrecentchanges.php on line 16
Loading from page and revision tables...
Updating links...
A database query syntax error has occurred.
The last attempted database query was:
"SELECT rev_id FROM [revision] WHERE rev_page=87 AND rev_timestamp<'200706052
02808' ORDER BY rev_timestamp DESC LIMIT 1"
from within function "".
MySQL returned error "170: n (jmolafbwikidev)"
 
It looks like we need to add some alternative SQL to rebuildRecentChangesTablePass2() in rebuildrecentchanges.inc as follows:
if ( $wgDBtype == 'adodb' ) {
 $sql2 = "SELECT TOP 1 rev_id FROM $revision " .
  "WHERE rev_page={$lastCurId} ".
  "AND rev_timestamp < '{$emit}' ORDER BY rev_timestamp DESC";
} else {
 $sql2 = "SELECT rev_id FROM $revision " .
  "WHERE rev_page={$lastCurId} ".
  "AND rev_timestamp<'{$emit}' ORDER BY rev_timestamp DESC LIMIT 1";
}
And that fixes it.

2007-06-05

Changes required

In addition to the obvious (addition of the code to support Microsoft SQL Server), some other files need to change, and a few other files need to be added to the system.
includes/Autoloader.php (change)
This file needs just a couple of additions, as follows
static $localClasses = array(
    # Includes
    'DatabaseADODB' => 'includes/DatabaseADODB.php',
    ...
    'SearchADODB' => 'includes/SearchADODB.php',
    ...
includes/BagOStuff.php (change)
In SqlBagOStuff::get($key), the SQL needs a little fix, from
"SELECT value,exptime FROM $0 WHERE keyname='$1'", $key);
to
"SELECT value,exptime FROM $0 WHERE keyname=$1", $key);
Microsoft SQL Server is thrown off by what become extraneous single quotes. A similar change must be made to SqlBagOStuff::delete($key,$time=0). In the methods _serialize(&$data) and _unserialize($serial), I've had to remove the calls to gzdeflate and ginflate, since I couldn't get them to behave consistently.
includes/Database.php
To this base class, I've added two new abstract methods
public function setFetchModeAssoc() {
}
public function setFetchModeNum() {
}

These are used very little, and only by a couple of the Special: pages. By default, the ADODB connection operates in numeric fetch mode (i.e., it returns records in an array subscripted by numbers). After a call to setFetchModeAssoc the connection operates in associative fetch mode (i.e., it returns records in an array subscripted by field names). This is different from the MySQL connection, which returns arrays that can be indexed both ways.
includes/GlobalFunctions.php In function wfShellExec($cmd, &$retval=null), I changed the line
$cmd = '"' . $cmd . '"'; 
to
$cmd = 'cmd /C ' . '"' . str_replace( '/', '\\', $cmd) . '"';

In includes/MagicWord.php, I changed a line in MagicWord::initRegex() from
$case = $this-mCaseSensitive ? '' : 'iu';
to
$case = $this->mCaseSensitive ? '' : 'i';
I also had to rewrite matchAndRemove(&$text) and matchStartAndRemove(&$text)
In includes/Pager.php, method IndexPager::reallyDoQuery($offset,$limit,$ascending) the line
$res = $this->mDb->select( $tables, $fields, $conds, $fname, $options );
must be wrapped up like this:
$this->mDb->setFetchModeAssoc();
$res = $this->mDb->select( $tables, $fields, $conds, $fname, $options );
$this->mDb->setFetchModeNum();

2007-05-25

What the wiki looks like

The only page that looks much different from Wikipedia or any other MediaWiki-based wiki is the Special:Version page. I've included a screen-grab of that so you can see how my version is different from the mainstream MediaWiki engine. In particular, notice IIS instead of Apache, the ADOdb library and SQL Server instead of the MySQL library and MySQL, and a slightly different version of PHP (this is the version from Zend Core)

2007-05-17

More speed tests

Having been pleased with the effects of the Zend Optimizer, I began to wonder what the effects would be of running the entire MediaWiki software on the commercial Zend Core engine. Here are some samples (all times in seconds):
PagePHP No optimization PHPw/Optimizer Zend Core Wikipedia
2003 invasion of Baghdad3.323.200.690.34
2003 Invasion of Iraq20.765.456.664.09
Air Force Special Operations Command1.160.630.480.21
World Geodetic System1.321.751.300.36
United States Department of Justice2.030.940.770.54
United States Air Force6.477.242.171.17
The Sunshine Boys1.930.820.620.37
Ruby on Rails1.7515.880.650.25
PowerPoint3.992.071.241.68
Lee Meredith0.850.530.370.18
Arnold Schwarzenegger1573.914.03
Philippines2555.63.51
Note that these are the times reported by PHP (as can be seen with the view source command on the resulting web page). Also note that the times vary significantly depending on what the server may be doing at the time. The times reported for my server are for uncached results, whereas the times reported for Wikipedia are presumably usually for cached results -- except for Arnold, it looks like I found him out of the cache this time around!

2007-05-11

Version 1.10 on SQL Server

MediaWiki 1.10 is out. That's my cue to start trying to get my changes integrated into the shared source code. I've started on that with a Bugzilla, Bug # 9767 I pretty much have 1.10 all working, except the thumbnail functionality stopped working. In my configuration, ImageMagick 6.3.2-5 is used to generate the thumbnails. Now, instead of thumbnails, I get "Error creating thumbnail: The input line is too long". The first part of this message, up to the colon, is generated by the PHP code. The second part, however, appears to come from ImageMagick itself. In my modified MediaWiki version 1.9.3, this had been working -- so I need to figure out what's changed. (... a little time goes by ...) OK, It's not ImageMagick's fault. The problem had to do with the documented (http://news.php.net/php.internals/21796) flaw with PHP's invocation of cmd.exe. My implementation of the wfShellExec function in GlobalFunctions.php now contains this:
...
   } elseif (php_uname( 's' ) == 'Windows NT' ) {
      # This is a hack to work around PHP's flawed invocation of cmd.exe
      # http://news.php.net/php.internals/21796
      $cmd = 'cmd /C ' . '"' . str_replace( '/', '\\', $cmd) . '"';
   }
   wfDebug ( "wfShellExec: $cmd\n" );

   $output = array();
   $retval = 1; // error by default?
   exec( $cmd, $output, $retval ); // returns the last line of output.
   return implode( "\n", $output);
}
Note that this differs in two ways from the distributed version of GlobalFunctions.php:
  1. Prefixes the command with "cmd /C"
  2. Replaces all occurrences of slash with backslash
I did the first, as I remember, to get Batik to work (for processing SVGs), which is driven by a batch file. I did the second to get some consistency in the direction the slashes are facing. The MediaWiki code is a little bit unclear when it decides to stick slashes into strings that represent filenames, and sometimes uses pieces of a URL (forward slash) to append subdirectory names to existing directories (backward slash). All I do is format the result as though it's a filename (which it will always be in this situation, since we're trying to execute something). Naturally, there will be a problem if there ever arises a situation in which the slashes are being used for something else, but I haven't seen this happen. Presumably they would show up uuencoded if they were being used (say) as part of an argument to a command.

Making it fast

Until today, I have been reasonably disappointed at the speed at which the pages in my wiki have been coming up. Wikipedia uses lots of stunts which aren't really open to me, like using Squid together with lots & lots of slaved servers. Right now, I'm just running on a two-processor (2 x 3GHz Pentium) box with 2GB of memory. SQL Server, IIS, etc. are all running locally. In this configuration, pages were taking anywhere from 2 to 10 times as long to display as the corresponding pages from Wikipedia. The fix is in. I suspected that a PHP accelerator would work, and it does -- beyond my expectations. All of a sudden, this wiki is snappy! The trick is getting a PHP accelerator that works in our configuration (Windows 2003 SE, IIS 6, PHP 5, etc.) The one that does it is Zend Optimizer. It's a free download. Thanks, Zend. If you're running any PHP on IIS, give it a try. I think you'll be impressed. Here are some example timings for my wiki, along with times from wikipedia.org:
Page titleTime before optimizerTime after optimizerWikipedia time
Philippines25 seconds5 seconds4 seconds
Arnold Schwarzenegger15 seconds7 seconds5 seconds

2007-04-02

The long pause

I disappeared from the Blogosphere for a while. No need to panic, the project is going very well. Pretty much everything is working on SQL Server now. My intent was to record in the blog everything that was going on as it was happening. Instead, I got so wrapped up in working on the code I didn't have much energy left to do the blogging. My intent now is to get my work rolled into the MediaWiki baseline once MediaWiki 1.10 releases. The MediaWiki principals seem to be working fairly arduously on making the 1.10 release happen, and don't need to be distracted by worrying about incorporating completely new capabilities that come at them from out of the blue. Not surprisingly, there turned out to be a little bit of scope creep. MediaWiki doesn't seem to get much use on Microsoft Windows servers, so I discovered some things that needed to be done to make that work better too -- like getting the ImageMagick and Batik image management working (by default they don't). I also needed to work to get HTML Tidy working. It wasn't until pretty late in the game that I realized that was even going to be necessary. In particular, there's a lot of templates used in Wikipedia and related sites that just plain won't work right now without HTML Tidy. Having posted this latest status report now, it remains for me to go back and retrace the steps it took to get to a working MediaWiki engine on Microsoft SQL Server in upcoming blog posts.