EasyChair data extraction

Have you ever heard of EasyChair? It is a free, simple and efficient way to manage a (scientific) conference: it provides most of the tools for handling paper submission, approval and camera-ready submission (for further details, please refer to the EasyChair website).

The first issue is that some advanced features (such as complete data access as XML) are only available as a paid service. What if the data you need is already available, but only as an (ugly) HTML file? I needed the whole list of accepted papers and the only option was an HTML page, laid out with DIVs and not, as some accessibility rules suggest, as a table. First solution: copy-and-paste from the HTML into a spreadsheet. More advanced: write a script that converts the file into “well written” HTML. In the generated file the list of papers is in an HTML table, no stylesheets are applied and all the links to the authors’ web pages are removed.

Here we go: a simple set of sed rules to convert the list of accepted papers to a table-based page.
Put all of these in a .sed file and invoke the sed command as:

sed -f file.sed < accepted-papers.html > accepted-papers-converted.html

The file.sed contents:

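# Replace line breaks with spaces
s/<br\/>/ /g
# Drop inline stylesheets (only matches when the whole <style> block is on one line)
s/<style>.*<\/style>//g
# Open the table after the page heading and close it before </body>
s/<\/h1>/<\/h1><table>/g
s/<\/body>/<\/table><\/body>/g
# Drop the "Abstract:" label
s/<b>Abstract: <\/b>//g
# Each paper div becomes a table row
s/<\/div><div class="paper">/<\/tr><tr><td>/g
s/<div class="paper">/<tr><td>/g
# Authors, title and abstract become table cells
s/<span class="authors"><span>//g
s/<\/span>\. <\/span>/<\/td>/g
s/<span class="authors">//g
s/\. <\/span>/<\/td>/g
s/<span class="title">/<td>/g
s/<\/div><div class="abstract">/<\/td><td>/g
# Strip links but keep their text (a doubled "gg" flag is rejected by GNU sed)
s/<a href="[^"]*">\([^<]*\)<\/a>/\1/g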
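
For readers who would rather not use sed, here is a minimal Python sketch of the last rule only (strip the links, keep their text). It is not part of the original script. The function name `strip_links` and the sample markup are made up for illustration, and the real EasyChair page may look different.

```python
import re

# Same idea as the last sed rule: drop the <a href="..."> wrapper and keep
# the link text. sed's \( \) group becomes plain ( ) in Python's syntax.
LINK = re.compile(r'<a href="[^"]*">([^<]*)</a>')


def strip_links(html: str) -> str:
    """Return html with every simple <a href="...">text</a> replaced by text."""
    return LINK.sub(r"\1", html)


# Hypothetical table cell with two author links.
sample = (
    '<td><a href="http://example.org/alice">Alice Rossi</a> and '
    '<a href="http://example.org/bob">Bob Bianchi</a></td>'
)
print(strip_links(sample))  # <td>Alice Rossi and Bob Bianchi</td>
```

Like the sed rule, this only handles anchors whose sole attribute is href and whose text contains no nested tags.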

Firefox sessionstore.js fixer

Sometimes Firefox opens with all of my tabs (and groups) empty: no session restore is offered, and it seems that all of my groups (more than 10 groups, with a total of 200 tabs) have simply disappeared. It has happened not only to me, as some googling shows.

What the heck? Firefox simply messed something up in your sessionstore.js file. Blogs suggest removing it and replacing it with the backup that Firefox creates automatically, named sessionstore.old
But what if even this gives you an empty set of tabs? Something is really wrong, given the 3.8 MB sessionstore.js file!

There is only one solution: open the sessionstore.js file, fix its JSON contents and save it. But the problem here is: how can I edit a 3.8 MB JSON file without messing everything up?
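
One possible first step, sketched in Python, is to find out where the file stops being valid JSON before editing anything by hand. This is not the procedure from the full post. It assumes the file is plain JSON, and the helper name `find_json_error` and the broken sample are hypothetical.

```python
import json


def find_json_error(text: str, context: int = 40):
    """Return None if text parses as JSON, else (line, column, snippet).

    The snippet is the raw text around the position where parsing failed.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(err.pos - context, 0)
        return err.lineno, err.colno, text[start:err.pos + context]
    return None


# Hypothetical damaged fragment: a comma is missing between two tab objects.
broken = '{"windows": [{"tabs": [{"url": "http://example.org"} {"url": "about:blank"}]}]}'
print(find_json_error(broken))

# On a real profile you would read the file instead (path is an assumption):
# with open("sessionstore.js", encoding="utf-8") as fh:
#     print(find_json_error(fh.read()))
```

Work on a copy of the file: the parser only reports the first error, so several rounds of fix and re-check may be needed.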

Drupal7 Image formatter

I’ve recently started using Drupal7 for a personal website project. I finally had the chance to field-test the new Drupal7 APIs and modules.

It’s amazing to see how much the Views module has evolved into something that I can’t do without :) It’s great to see CCK (now Fields) in core and the increasing number of themes supporting HTML5 (AdaptiveTheme, Omega, etc.).

The major difficulty is locating, inside the Admin section, where the configuration options have been moved. After a while I was able to fix Iconizer’s missing icons and finally, recognizing the same icons, started setting up Drupal7 as fast as I do in Drupal6!

Among the great enhancements, I found the new ability to easily configure a field formatter in the content-type settings form fantastic. What I feel is missing, for the Image field type, is a way to limit the number of images displayed by the formatter. Since the Image field is in core, it’s a little bit difficult to get a patch approved… so I developed a *new* Image formatter that extends the existing one, adding the “Limit images” feature.

Lucene with PlingStemmer

I’ve recently been working with Java Lucene and its Analyzers, and for a project I worked on the client needed to use the Porter Stemmer algorithm. I used the SnowballAnalyzer, but unfortunately I found out that, as someone before me said, the Porter stemmer works right in 90% of cases, but when it fails, it fails hard! Here is an example: consider the words “organic”, “organ” and “organization”. The three words don’t have much in common except their prefix, and they do not mean the same thing… but by Porter (and by the SnowballAnalyzer) they’re all stemmed into “organ”. In the Lucene 3.1.x release there will be plenty of new features allowing programmers to control and fine-tune each stemming algorithm.

So, what can I do, since I must use the 3.0.3 release? Well, I created a new PlingStemmerFilter using the YAGO Java Pling stemmer implementation, following the instructions found here.

MySQL recovery using ibdata and ib_logfile1.. files

I recently had my server out of order and could only access its files (thanks to a providential Linux-on-USB). I managed to back up my MySQL files (ibdata, ib_logfile1, ib_logfile2, and the tables’ *.frm files). There was no SQL dump to import into a new MySQL installation.

I remembered a good tutorial on recovering database structure and data from backups like mine, but I couldn’t find it on the web anymore. So I went step by step to recover my data: simply replacing the “data” directory inside the new installation will give you errors about InnoDB “sequence numbers” (and MySQL will suggest that you refer to the “InnoDB force recovery” feature).