The industry seems to be conservative, and it is using more traditional frameworks (WEKA, OpenNLP, etc.). Although more and more developers are using NLTK, it is widely known that it was designed for teaching natural language processing, and one might think it is not suitable for ‘real-world’ tasks. Your NLTK Demos project is a good refutation of this view, isn't it? Do you think that this situation can change? How
As long as big companies continue to standardize on Java, I don't see the situation changing much. But it seems like Python is slowly but surely making its way into more and more companies, as more developers realize its benefits. I personally love Python for its simplicity, clarity, "Guido's Time Machine", and its numerous useful 3rd party libraries, of which NLTK is just one. And if you don't know Python, NLTK is a great reason to learn it.
I'm hoping my demos at http://text-processing.com/demo/ will help convince people of NLTK's usefulness, by showing off what it's capable of. I think a few developers have already been convinced by the demos to give NLTK a try. And the API has been getting well over 2000 calls/day lately, and I've been approached about using it for commercial purposes, so there's definitely a need for NLP, no matter what programming language you're using. There's so much text being created these days that people are looking for anything that will help them organize, aggregate, summarize, etc.
But I think the real problem isn't Java vs Python/NLTK - it's that most developers want to treat NLP as a magic black box, and don't want to learn how it's actually done. I think this explains the popularity of the desktop NLP programs, since they make it easier to do NLP without a deeper understanding. And I understand this motivation, because that was me a few years ago. But if you're going to use NLP as part of your core data processing, you need to take the time to dive in and understand what's going on, how it works, and how to use NLP effectively on your own data. I think the future of NLP is not generic tools & APIs, it'll be training internal models on custom data sets, as that's the best way to get high accuracy results.
Text Processing with nltk is very similar to your posts. But how was
Your blog also deals with Erlang, and your company’s blog mentions it too. Why are you dealing with a functional language? Why are functional languages becoming popular in natural language
All that being said, I'm not sure how well functional programming maps to natural language processing. Much of NLP these days involves training, which means constructing some kind of storable & reusable model. That kind of internal state handling lends itself very well to object oriented programming, which is used throughout NLTK. I think FP is more suited to *using* NLP, not *doing* NLP. Once you have your trained model, then you can use a functional programming style to transform your input text into output structures.
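To make the split between *doing* and *using* NLP concrete, here is a minimal sketch using only the Python standard library. `TinyKeywordModel` is a hypothetical toy stand-in for a trained model, not an NLTK class, and the training data is made up for illustration. The object holds the trained state and can be stored and reloaded, while the calling code simply maps a function over input text.

```python
import json
from collections import Counter


class TinyKeywordModel:
    """Toy word-count classifier (hypothetical; a stand-in for a real trained model)."""

    def __init__(self, counts=None):
        # Internal state: one word-frequency Counter per label.
        self.counts = {label: Counter(c) for label, c in (counts or {}).items()}

    def train(self, labeled_texts):
        # Stateful, object-oriented part: accumulate word counts per label.
        for text, label in labeled_texts:
            self.counts.setdefault(label, Counter()).update(text.lower().split())
        return self

    def classify(self, text):
        # Pick the label whose training words overlap most with the input.
        words = text.lower().split()
        scores = {label: sum(c[w] for w in words) for label, c in self.counts.items()}
        return max(scores, key=scores.get)

    def dumps(self):
        # Serialize the trained state so the model can be stored and reused.
        return json.dumps(self.counts)

    @classmethod
    def loads(cls, payload):
        return cls(json.loads(payload))


# Made-up training data, for illustration only.
TRAINING = [
    ("great food and friendly service", "pos"),
    ("loved the show", "pos"),
    ("terrible food and rude service", "neg"),
    ("boring show", "neg"),
]

model = TinyKeywordModel().train(TRAINING)
restored = TinyKeywordModel.loads(model.dumps())


def label_all(classify, texts):
    # Functional part: a pure transformation from input texts to labels.
    return list(map(classify, texts))


labels = label_all(restored.classify, ["friendly staff, great drinks", "rude and boring"])
```

With this toy data, `labels` comes out as `["pos", "neg"]`. A real system would swap in a proper classifier, but the shape stays the same: training builds and stores state, and usage is a stateless mapping from text to output.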
What are you working on at Weotta? For what kind of problems are you
At Weotta, we've created a "plan generator" that produces itineraries of things to do in a city, such as "Dinner, Drinks & a Show". It's a kind of multi search engine with both online & offline pieces. The online parts are powered by a custom rule based system, MongoDB & Redis, and some sentiment analysis (using text classification) to ensure quality. But it's the offline code where the real NLP happens, primarily using keyword & text classification. We classify every place & show in our system with various metrics in order to determine what it's appropriate for, and what other kinds of places/shows it can be paired with. Nearly all of our text comes from descriptions and reviews we've scraped off the web, so we also have to do a lot of cleaning, parsing, and general data scrubbing to normalize the text. In a sense, we're using NLP for quality control, to ensure we produce good results by eliminating all the bad ones. There's no such thing as PageRank when it comes to local data, and sometimes all you have to go on are a few keywords. When that's the case, you need to pull out your natural language toolkit and get to work :)
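As a rough illustration of the cleaning and keyword classification described above, here is a small standard-library sketch. It is not Weotta's actual code: the category names, keyword lists, and function names are all assumptions made up for the example.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")

# Hypothetical keyword lists; a real system would use far richer signals.
CATEGORY_KEYWORDS = {
    "dinner": {"restaurant", "menu", "entree", "dinner"},
    "drinks": {"cocktail", "cocktails", "bar", "wine", "beer"},
    "show": {"concert", "theater", "comedy", "band"},
}


def normalize(raw):
    """Strip tags, decode HTML entities, collapse whitespace, lowercase."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = WHITESPACE_RE.sub(" ", text)
    return text.strip().lower()


def keyword_categories(raw):
    """Return the sorted categories whose keywords appear in the cleaned text."""
    tokens = set(re.findall(r"[a-z']+", normalize(raw)))
    return sorted(cat for cat, kws in CATEGORY_KEYWORDS.items() if tokens & kws)


scraped = "<p>Craft cocktails &amp; live   band every Friday!</p>"
cleaned = normalize(scraped)
categories = keyword_categories(scraped)
```

Here `cleaned` is `"craft cocktails & live band every friday!"` and `categories` is `["drinks", "show"]`. Keyword matching like this is crude, but it shows how a few keywords can be enough to decide what a place or show might be paired with.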