EasyChair data extraction

Have you ever heard of EasyChair? It is a free, simple and efficient way to manage a (scientific) conference: it provides almost all of the tools for handling paper submission, acceptance and camera-ready submission (for further details, please refer to the EasyChair website).

The first issue is that some advanced features (such as complete data access as XML) are only available as a paid service. What if the data you need is already available, but only as an (ugly) HTML file? I needed the whole list of accepted papers and the only option was an HTML page, laid out with DIVs and not, as some accessibility rules suggest, as a table. First solution: copy and paste from the HTML into a spreadsheet. More advanced: write a script that converts the file to “well written” HTML. In the generated file the list of papers is in an HTML table, no stylesheets are applied and all the links to authors' web pages are removed.

Here we go: a simple set of sed rules to convert the list of accepted papers to a table-based page.
Put all of these in a .sed file and invoke the sed command as:

sed -f file.sed < accepted-papers.html > accepted-papers-converted.html
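
Before running the rules on the real page, it may help to try the same kind of invocation on a tiny file. The sketch below assumes a POSIX shell with mktemp available. The file names mini.sed, in.html and out.html and the sample line are made up for illustration, and the two rules are taken from the rule file.

```shell
#!/bin/sh
# Scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Hypothetical two-rule script: a small subset of the full rule file.
cat > "$workdir/mini.sed" <<'EOF'
s/<br\/>/ /g
s/<div class="paper">/<tr><td>/g
EOF

# Hypothetical one-line input, not real EasyChair output.
printf '%s\n' '<div class="paper">First line<br/>second line' > "$workdir/in.html"

# Same invocation pattern as for the real page: rules from a file,
# input and output through redirection.
sed -f "$workdir/mini.sed" < "$workdir/in.html" > "$workdir/out.html"

converted=$(cat "$workdir/out.html")
printf '%s\n' "$converted"

rm -rf "$workdir"
```

If the printed line starts with a table row and cell instead of the original div, the rule file is being picked up and applied as expected.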

The file.sed contents:

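# Turn line breaks into spaces and drop the inline stylesheet
# (the style rule only matches when <style>...</style> sits on one line)
s/<br\/>/ /g
s/<style>.*<\/style>//g
# Open the table after the page heading and close it before </body>
s/<\/h1>/<\/h1><table>/g
s/<\/body>/<\/table><\/body>/g
# Drop the "Abstract:" label
s/<b>Abstract: <\/b>//g
# Each paper becomes a table row; strip the author spans and close the cell
s/<\/div><div class="paper">/<\/tr><tr><td>/g
s/<div class="paper">/<tr><td>/g
s/<span class="authors"><span>//g
s/<\/span>\. <\/span>/<\/td>/g
s/<span class="authors">//g
s/\. <\/span>/<\/td>/g
# Title and abstract each get their own cell
s/<span class="title">/<td>/g
s/<\/div><div class="abstract">/<\/td><td>/g
# Replace each link with its text
s/<a href="[^"]*">\([^<]*\)<\/a>/\1/g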
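
The link-stripping rule is the least obvious one, so here is a minimal sketch of what it does in isolation, written with a single g flag. The author names and URLs in the sample line are invented for illustration and are not taken from a real EasyChair page.

```shell
#!/bin/sh
# Hypothetical author line with two links (names and URLs are made up).
sample='<a href="http://example.org/alice">Alice Smith</a> and <a href="http://example.org/bob">Bob Jones</a>'

# [^"]* matches the URL up to the closing quote, \([^<]*\) captures the
# link text, and \1 puts only that text back; g repeats this per line.
stripped=$(printf '%s\n' "$sample" | sed 's/<a href="[^"]*">\([^<]*\)<\/a>/\1/g')

printf '%s\n' "$stripped"
```

Note that the pattern only matches anchors whose first attribute is href and whose text contains no nested tags. Links written differently would be left untouched.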