August 2007

Well, I closed my books, Google searches, and my editor last Monday, and turned in my code.

The biggest thing I think I learned this summer is not to spend too long getting caught up on a single aspect or feature of your project if you can’t figure it out. In my case, I spent entirely too long in the beginning and middle stages of the project basically just reading documentation, trying to build a complete picture in my mind of how MW stores, locates, and processes uploaded files, without writing any test code myself. I did this partly because I was afraid I’d miss some existing work that would be useful, or else fail to make use of some battle-tested code someone had taken the time to design. My efforts were pretty fruitless, though, because I was trying to grasp too much at the same time. I found later that documentation is much more useful when your cursor is sitting halfway through a line and you need to figure out what function to use next. Tackling the same problems in this way, one step at a time, was of course much easier.

I also simply didn’t realize how much more complicated it would be to start transcoding uploaded media asynchronously immediately at upload time rather than periodically polling for items in a queue. When I wrote my proposal, it said “A second script (running perhaps as a cron job) will routinely monitor the job queue…” I had done a project like that in the past, but once I was accepted and had started working, I didn’t think it would be acceptable for there to be a gap of wasted time where an upload had occurred and there was an idle transcoding node that could be processing it but wasn’t. (That project also didn’t have to simultaneously monitor and control an encoder and decoder, just one app that did both.)

My proposal also makes no commitment to distributing the transcoding load among numerous transcoding nodes, but this too I decided was a must-have feature if my work was going to be widely used. In the end, it turned out not to be that complicated to implement, but in concept it did cause me to do a lot of thinking about “how should I go about this” that I could have totally skipped otherwise. (Actually, the version of my code that will be evaluated for SoC can only be used in a distributed fashion if the nodes can all write to a network filesystem that the recoded media is to be stored on, a requirement I was hoping to lift to be prepared for the possible upcoming removal of NFS from Wikimedia’s private network…but as there was no existing documented facility to write to a file repository that is not locally accessible, this wasn’t done.)

I didn’t expect the existing open source MPlayer -> Theora solution to be as limited as it was, but the improvements I made to the Theora reference encoder overcame this unexpected roadblock, didn’t take that much time, and turned out to be a valuable opportunity to broaden my programming skills and gain some appreciation for the “power of open source.”

Finally, I spent tons of time working on MW’s support for audio and video media formats, as I was surprised to find out it really didn’t have any at the beginning of the summer. As my past posts discuss, I wrote code to more accurately detect the MIME types of these media files and to identify and validate them at upload time, along with the beginnings of support for extracting metadata. I didn’t think I’d be doing any of this when I wrote my proposal, but what good is a system to recode uploaded videos if you can’t upload them in the first place?

All these things caused me to be behind schedule for the majority of the summer. I did produce a usable implementation of everything in time for Google’s “pencils down” deadline, but the time crunch at the end (also contributed to by confusion about the program end date…) did cause the code to suffer somewhat. Mainly, I want to add better handling for a variety of exceptional situations within the recoding daemon and within the UploadComplete hook that does the initial pushing of bits to add the job to the queue and notify an idle node of the event. Media support in MW still sucks too, and I might want to help out with that. For example, in the existing MW notion of “media handlers,” a handler is bound to a file based on the file’s MIME type. This works all right for JPEG files vs. DjVu files, but not so much for video files, all of which can be examined by existing utilities like ffmpeg. Indeed, to test my code currently, one will need to map each video MIME type they want to test to the video media handler in a config file.

I’m still awaiting feedback from my mentor. I hope of course that I’ll get a passing evaluation, but even if I don’t, I don’t think I would consider my efforts a lost cause. Surely sooner or later, MediaWiki will have full audio and video support, and I want to continue to be a part of making that happen and ensure that my efforts will be made use of as much as possible. And I hope that sometime in the future, I will wake up, scan the day’s tech headlines, see “Wikipedia adds support for videos,” and know that I had something to do with it.


Since it’s been so long, I thought I’d just let everyone know I haven’t abandoned my work, died, or anything of the like. Blogging just doesn’t get the code written, so I haven’t been spending as much time on it.

As it’s now a week into August, I can say with certainty how the queue will be implemented. I did end up going with the idea that surfaced while authoring my last post, to have a daemon implemented in PHP running on each recoding node. The primary means of interfacing with it is an accompanying script that must run on the webserver of a recode node (yes, they must run Apache), taking input via POST, forwarding it as commands to the daemon, and returning the outcome in an easily machine-processable format. This is accomplished over named pipes, the simplest means of interprocess communication that works for this task. All requests to and responses from the daemon are prefixed with their length to ensure complete transmission of messages. Notification of new jobs, as well as cancellation of the currently running job, is then achieved by simply contacting the node (selected with the aid of the DB) over HTTP. The daemon promptly provides the notify script with the results of a command so that the entire process can occur at upload time, specifically in an UploadComplete hook. (Cancellation of a current job is used if a re-upload occurs while the job generated by the old file is running.) The daemon uses a persistent instance of MPlayer for all assigned jobs, via MPlayer’s slave mode.
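To illustrate why the length prefix matters, here is a minimal sketch of that kind of framing. It is written in Python rather than the daemon's PHP, and the wire format (a 4-byte big-endian length followed by the payload) and all function names are my assumptions for illustration, not the format the daemon actually uses.

```python
import io
import struct

# Assumed framing: a 4-byte big-endian length, then the payload bytes.
HEADER = struct.Struct(">I")


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length so the reader knows when it is complete."""
    return HEADER.pack(len(payload)) + payload


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes, looping because a pipe may return short reads."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            # The writer went away before the whole message arrived.
            raise EOFError("stream closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_message(stream) -> bytes:
    """Read one length-prefixed message from a binary stream."""
    (length,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, length)
```

The same functions work on any binary file-like object, so a FIFO opened with `open(path, "rb")` could stand in for the `io.BytesIO` buffer used when trying it out. A truncated message shows up as an `EOFError` instead of being silently treated as a complete command.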

Although this isn’t quite as firm yet, I expect the following database design changes will facilitate operations:

  1. The image table will grow a new ENUM indicating whether the recoded editions of a particular file are available, in queue, or failed. That is, a file is marked available when all configured recoded versions of it are finished processing.
  2. A new table will serve as the queue, and will contain only job order and the file’s database key. (I did considerable performance testing here and concluded that both columns should be indexed…retrievals will be by both order and name, and individual inserts/deletes seem to keep constant time regardless of number of records. *Very* rarely the order could get reindexed just to keep the integers from running off to infinity, perhaps as a task through the existing job queue. This all only matters if the queue has significant length anyway, but this might be the case if a wiki decides it doesn’t like its existing resample quality or size and wants to redo them all.)
  3. A new table will contain the current status of each recode node, including address to the notify script and the current job if there is one.
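The queue table in item 2 can be sketched roughly as follows. This uses Python's built-in `sqlite3` as a stand-in for the wiki's real database, and the table name, column names, and helper functions are all hypothetical; it only shows the shape of a two-column queue where both the order and the file key are indexed.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Create the (hypothetical) queue table: job order plus the file's key."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS recode_queue ("
        " job_order INTEGER PRIMARY KEY,"  # indexed: used to fetch the next job
        " file_key TEXT NOT NULL UNIQUE)"  # indexed: used to find a job by name
    )
    return db


def enqueue(db, file_key):
    """Append a file to the end of the queue; a file already queued keeps its place."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO recode_queue (file_key) VALUES (?)", (file_key,)
        )


def next_job(db):
    """Return the file key with the lowest job order, or None if the queue is empty."""
    row = db.execute(
        "SELECT file_key FROM recode_queue ORDER BY job_order LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def remove_job(db, file_key):
    """Delete a job by name, e.g. once it finishes or the file is re-uploaded."""
    with db:
        db.execute("DELETE FROM recode_queue WHERE file_key = ?", (file_key,))
```

In this sketch SQLite hands out the next integer for `job_order` on each insert, which is also why the numbers only ever grow and might occasionally want renumbering on a long-lived queue.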

One of the recode daemon’s first tasks upon starting a new job is establishing which recoded versions of a given file are missing. I just like this better than storing such job details in the queue, as it greatly simplifies the queue as well as the adding of new formats in future. As a result, the daemon does need to interact with FileRepo and friends, so a good MediaWiki install must exist on recode nodes as well. In fact, much of the MW environment is actually loaded in the daemon at all times.
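The "which versions are missing" step amounts to comparing the configured versions against what already exists. A small Python sketch of that idea follows; the version names and the status strings are made up for illustration, and the real daemon would get this information through FileRepo rather than from plain lists.

```python
def missing_versions(configured, existing):
    """Return the configured version names not yet present, in configured order."""
    have = set(existing)
    return [name for name in configured if name not in have]


def recode_status(configured, existing, failed=False):
    """Map the comparison onto hypothetical values for the image-table status column."""
    if failed:
        return "failed"
    if missing_versions(configured, existing):
        return "queued"
    return "available"
```

Because the job details are derived at run time, adding a new format only means adding a name to the configured list; files already recoded would then report that one version as missing.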

I’ve still got plenty to keep me busy here as the time starts getting tight.