September 17, 2012

NLP Meetup - we have a new confirmed speaker

Our new confirmed speaker is András Benczúr, head of the Data Mining and Search Group of SZTAKI (the Computer and Automation Research Institute of the Hungarian Academy of Sciences).

The LAWA Project: Towards a Virtual Web Observatory

The LAWA project on Longitudinal Analytics of Web Archive data builds an Internet-based experimental testbed for large-scale data analytics. Its focus is on developing a sustainable infrastructure, scalable methods, and easily usable software tools for aggregating, querying, and analyzing heterogeneous data at Internet scale for a deep understanding of Internet content characteristics (size, distribution, form, structure, evolution, dynamics).

I will show how far this (overly) ambitious project has taken us, and what main achievements and blockers we have identified. Some of the first (but really preliminary) demos are already up at http://vwo.lawa-project.eu:8080/. Some limitations of current systems for distributed data analysis, especially of Hadoop, are in part resolved. However, archival institutions still lack an easy-to-deploy, high-quality, and stable Web-scale search solution. We are now trying to join forces with the Stratosphere project (www.stratosphere.eu), and also to scale the SZTAKI plagiarism detection service (www.kopi.sztaki.hu) over the BonFIRE experimental cloud.

Short bio
András Benczúr received his Ph.D. in applied mathematics from the Massachusetts Institute of Technology in 1997. Since then he has been a researcher at the Institute for Computer Science and Control of the Hungarian Academy of Sciences (MTA SZTAKI), where he has headed the Informatics Laboratory of 30 researchers since 2008. The lab participates in international research and national industry projects in information retrieval and business intelligence. Among other recognitions, his research on Web information retrieval was honored with a Yahoo! Faculty Research Grant. He also led the winning team of KDD Cup 2007 and organized the ECML/PKDD 2010 Discovery Challenge on Web Quality.

