April 2, 2012

Book review: Natural Language Annotation for Machine Learning

The book's title is misleading. Its subtitle - A Guide to Corpus Building for Applications - is more descriptive. I believe that not only machine learning practitioners but also linguists (esp. corpus and computational linguists), practitioners of the digital humanities and others who use and/or collect linguistic data can deepen their knowledge with the help of this terrific book.

Although O'Reilly will publish the book in Sept. 2012, it is already available as an Early Release in electronic format. Keep in mind that this is a "work in progress" version when you come across sentences starting with lower case letters and references to Chapter??? and Appendix???. You will also find references to chapters that are not included in the book yet. Such gaps are part of any early release, and there are too few of them to spoil the reading experience.

It is hard to define the target group of this title. According to the preface, you can read it without any previous knowledge of linguistics and/or natural language processing, but when you read such things in a book from a publisher of technical books, you can assume that the authors' hands were guided by someone in the marketing department. You don't need to be a linguist or an NLP guru to understand the content, but you do need some background in the field. Previous exposure to NLTK (and the NLTK book) and some basic knowledge of corpus linguistics (e.g. Corpus Linguistics by McEnery and Wilson, Corpus Linguistics by McEnery and Hardie, or Gries's brilliant Quantitative Corpus Linguistics with R) are essential to understand the role of corpora in applied and academic research.

The first chapter ("The Basics") gives a detailed review of what corpus linguistics is, what a corpus is, and how corpora relate to machine learning tasks. If you want a broader overview of the theory and history of corpus linguistics, I recommend the first chapter of McEnery and Wilson. Although Leech's name is mentioned in this chapter, I missed a discussion of his seven maxims of annotation (again, McEnery and Wilson can help you out here). The chapter also gives a brief summary of the MATTER methodology, which is the main topic of the book. MATTER stands for Model, Annotate, Train, Test, Evaluate, Revise - the steps of the corpus development cycle. This high level intro puts the method into context, which helps in understanding the following chapters - and I think it can serve as an "executive summary" too. I loved the brief section on relevance testing (precision, recall, F-measure), as these measures are vitally important in real world applications.
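For readers who have not met these measures before, here is a minimal sketch of how precision, recall and the balanced F-measure (F1) can be computed for an annotation task. It is my own illustration, not code from the book: the function name `precision_recall_f1` and the toy named-entity data are made up for this example, and I assume that gold and predicted annotations can be compared as sets of (text, label) pairs.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of annotations and return (precision, recall, F1).

    Assumption: an annotation counts as correct only if the identical
    (text, label) pair appears in the gold standard.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)

    # Precision: what share of the predicted annotations is correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what share of the gold annotations was found?
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: four gold entities, three predicted ones.
gold = {("Budapest", "LOC"), ("O'Reilly", "ORG"), ("Leech", "PER"), ("Gries", "PER")}
predicted = {("Budapest", "LOC"), ("O'Reilly", "PER"), ("Leech", "PER")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this toy case two of the three predictions are correct, and two of the four gold entities are found, so precision is higher than recall; the F-measure sits between the two.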

The second chapter (Defining Your Goal and Dataset) is about the 'M' in the MATTER cycle. It gives practical advice on defining the statement of purpose and expanding it to see how you can reach your goals. I like the pragmatic tone of the chapter. Sure, you have a great idea, but you have to consider the task and the available resources, and you have to collect some data - so think it over and define why you are collecting data, what kind of data you want to collect and how you will process it. This process involves a lot of thinking and weighing of possibilities, and the book helps you go through these steps.

Chapter three (Building your Model and Specification) stays at the 'M', but it gets more concrete. It is about the formal definition of models and how to implement them (in XML). The topic - XML and various standards - may seem boring, but the authors did a great job, and it is very refreshing to see the fragmented pieces of information compiled into a compact yet enjoyable chapter (ok, maybe only linguists think this is not boring).
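To give a feel for what an XML implementation of a model can look like, here is a small sketch of my own. The tag names (`doc`, `s`, `ne`) and the `type` attribute are hypothetical and do not come from the book or from any particular standard; the snippet only shows that once annotations live in XML, reading them back takes a few lines with Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline annotation: <ne> marks a named entity,
# and its "type" attribute carries the label.
ANNOTATED = """<doc>
  <s id="s1"><ne type="ORG">O'Reilly</ne> will publish the book in <ne type="DATE">Sept. 2012</ne>.</s>
</doc>"""


def extract_entities(xml_text):
    """Return (label, text) pairs for every <ne> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(ne.get("type"), "".join(ne.itertext())) for ne in root.iter("ne")]


entities = extract_entities(ANNOTATED)
for label, text in entities:
    print(label, text)
```

A real specification would also pin down which labels are allowed and which attributes are required, which is where a DTD or schema comes in.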

The fourth chapter (Applying and Adopting Annotation Standards to Your Model) gives hints about bending standards and resources to your needs. It weighs technological considerations alongside human factors (aka annotators), and shows best practices that serve both sides well.

I do hope more chapters will be available soon. The practical focus and the vivid real world examples (e.g. named entity recognition, semantic role labeling, etc.) make the book very accessible to a wider audience. It contains valuable information that used to be almost inaccessible; until now, it took a long time to collect the knowledge necessary to build corpora. I think this title will be as great a success as Semantic Web for the Working Ontologist was in the semantic web and enterprise ontologist community.
