2007
Topicalizer - The tool for topic extraction, text analysis and abstract generation
Free online tool for topic extraction, text analysis and abstract generation
2006
QDC 2007
This workshop focuses on methods and techniques for analyzing and evaluating quality, in the broad sense, in data mining and knowledge management. Topics of interest include data quality (data cleaning, preprocessing methodologies, evaluation metrics, and algorithmic approaches), the evaluation of model quality in data mining (methods, algorithms, and studies on interestingness measures, preference aggregation, knowledge classification, post-processing of results), and knowledge management (ontology quality, alignment quality, usage analysis).
Presentation of Theuth and Blue Moon
A new type of parsing algorithm, called "asyntagmatic". Without going into detail, the fact that the parsing is asyntagmatic unlocks everything: it becomes possible to take contexts into account, understand deictics, detect puns and spoonerisms, recognize the language of a text, and translate texts in which several languages are mixed, including within the same sentence.
Tame the Beast by Matching Similar Strings
I described the algorithms in two classes: equivalence methods and similarity ranking methods. Equivalence methods return a Boolean result, whereas the similarity ranking methods return a numeric similarity measure or distance metric. In information retrieval systems, it is possible to mix methods to produce a faster hybrid approach. A typical approach is to employ a two-pass mechanism in which an equivalence method is used by the database as a first pass filter, and a ranked similarity method is applied to the filtered entries for the second pass. Ranked similarity methods tend to be algorithmically more complex than equivalence methods, so are usually implemented as custom code outside of the database.
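The two-pass mechanism described above can be sketched as follows. This is a minimal illustration, assuming Soundex as the equivalence method and Levenshtein distance as the ranked similarity method; the article does not say which pair it has in mind, and the function names here are hypothetical.

```python
# Assumed mapping for the classic American Soundex code (illustrative choice).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Equivalence method: two words are 'equal' when their codes are equal."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate identical codes
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated code is kept
    return (code + "000")[:4]


def levenshtein(a, b):
    """Ranked similarity method: edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def two_pass_search(query, entries):
    """First pass: cheap Boolean filter. Second pass: rank the survivors."""
    key = soundex(query)
    candidates = [e for e in entries if soundex(e) == key]
    return sorted(candidates,
                  key=lambda e: levenshtein(query.lower(), e.lower()))


names = ["Robert", "Rupert", "Rubin", "Alice", "Roberta"]
print(two_pass_search("Robert", names))
```

In a real system the first pass would typically run inside the database (for example against a precomputed code column), so only the short candidate list reaches the more expensive distance computation.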
The Boyer-Moore Fast String Searching Algorithm
Our algorithm has the peculiar property that, roughly speaking, the longer the pattern is, the faster the algorithm goes. Furthermore, the algorithm is "sublinear" in the sense that it generally looks at fewer characters than it passes. The algorithm is described in
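To make the "sublinear" behaviour concrete, here is a small sketch of the Horspool simplification of Boyer-Moore. It keeps only the bad-character shift and is not the full algorithm from the page above. The function name and the comparison counter are assumptions added for illustration.

```python
def horspool_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    # For each pattern character except the last one, record how far the
    # window may slide when that character is aligned with the window's end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos + m <= n:
        j = m - 1
        # Compare right to left, as Boyer-Moore does.
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Characters absent from the pattern allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons


text = "x" * 20 + "needle"
index, comparisons = horspool_search(text, "needle")
print(index, comparisons, len(text))
```

On this input the search skips ahead six characters at a time through the run of `x`, so the number of comparisons stays well below the length of the text. A longer pattern permits longer jumps, which matches the property quoted above.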
Knuth-Morris-Pratt Algorithm
The problem: given a (short) pattern and a (long) text, both strings, determine whether the pattern appears somewhere in the text. Last time we saw how to do this with finite automata. This time we'll go through the Knuth-Morris-Pratt (KMP) algorithm, which can be thought of as an efficient way to build these automata. I also have some working C source code which might help you understand the algorithm better.
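As a companion to the C source mentioned above, the following is a short Python sketch of KMP. It is not the author's code. The names `build_failure_table` and `kmp_search` are chosen here for illustration, and the failure table plays the role of the automaton's fallback transitions.

```python
def build_failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = build_failure_table(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # reuse what is already known; never re-read text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


print(build_failure_table("abababca"))
print(kmp_search("abxabcabcaby", "abcaby"))
```

The text index `i` only moves forward, which is why the search takes time linear in the combined length of the text and the pattern.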
TRE home page
TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
The agrep tool is included in this library.
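To illustrate what approximate (fuzzy) matching means, here is a pure-Python sketch based on the classic dynamic-programming approach (often attributed to Sellers). It does not use TRE's API and makes no claim about how TRE is implemented. The function name `fuzzy_find` and its return convention are assumptions.

```python
def fuzzy_find(text, pattern, max_errors):
    """Find the substring of text closest to pattern in edit distance.

    Returns (end_index, errors) for the first best match with at most
    max_errors edits, where end_index is exclusive, or None if no
    substring is close enough.
    """
    m = len(pattern)
    # col[i] = fewest edits needed to turn pattern[:i] into some substring
    # of text that ends at the current position.
    col = list(range(m + 1))
    best = (0, col[m]) if col[m] <= max_errors else None
    for pos, ch in enumerate(text, 1):
        new = [0]  # a match may start anywhere, so an empty prefix costs 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new.append(min(col[i - 1] + cost,  # match or substitution
                           col[i] + 1,         # extra character in text
                           new[i - 1] + 1))    # missing character in text
        col = new
        if col[m] <= max_errors and (best is None or col[m] < best[1]):
            best = (pos, col[m])
    return best


print(fuzzy_find("approximate matching", "aproximate", 1))
print(fuzzy_find("abcdef", "xyz", 1))
```

The only difference from an ordinary edit-distance table is the zero in the first row, which lets a match begin at any position in the text.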
2005
PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning
The objective is to build a Europe-wide Distributed Institute which will pioneer principled methods of pattern analysis, statistical modelling and computational learning as core enabling technologies for multimodal interfaces that are capable of natural and seamless interaction with and among individual human users.
At each stage in the process, machine learning has a crucial role to play. It is proving an increasingly important tool in Machine Vision, Speech, Haptics, Brain Computer Interfaces, Information Extraction and Natural Language Processing; it provides a uniform methodology for multimodal integration; it is an invaluable tool in information extraction; while on-line learning provides the techniques needed for adaptively modelling the requirements of individual users. Though machine learning has such potential to improve the quality of multimodal interfaces, significant advances are needed, in both the fundamental techniques and their tailoring to the various aspects of the applications, before this vision can become a reality.
The institute will foster interaction between groups working on fundamental analysis including statisticians and learning theorists; algorithms groups including members of the non-linear programming community; and groups in machine vision, speech, haptics, brain-computer interfaces, natural language processing, information-retrieval, textual information processing and user modelling for computer human interaction, groups that will act as bridges to the application domains and end-users.
CLEF and Multilingual information retrieval
Our stemming procedure and stopword list are enhanced continuously. However, for French and German, our solutions can be viewed as definitive (last update Nov. 2002).
PorterStemmer - phpguru.org
A PHP5-only class to facilitate the stemming of words. This is the process of removing suffixes from words to produce their "stems", e.g. national -> nation, running -> run, adoption -> adopt. This class, as the name might suggest, implements the algorithm by Martin Porter.
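The idea of suffix stripping can be shown with a deliberately naive sketch. This is not Porter's algorithm, which applies several ordered rule steps with conditions on the remaining stem, and it is not the PHP class's API. The three suffixes and the function name `naive_stem` are assumptions picked to reproduce the examples above.

```python
def naive_stem(word):
    """Strip one known suffix, then undouble a trailing doubled consonant."""
    w = word.lower()
    for suffix in ("ing", "ion", "al"):  # illustrative suffix list only
        # Require at least three characters to remain, so "sing" is untouched.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            stem = w[:-len(suffix)]
            # "runn" -> "run"; leave vowels and l, s, z doubled
            if (len(stem) >= 2 and stem[-1] == stem[-2]
                    and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            return stem
    return w


for word in ("national", "running", "adoption", "sing"):
    print(word, "->", naive_stem(word))
```

A sketch this small over-stems easily (it would reduce "nation" itself to "nat"). The extra conditions in Porter's rules exist to prevent that kind of error.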
2004
Natural Language Toolkit
The Natural Language Toolkit is a suite of Python packages and data for natural language processing; it comes with extensive API documentation and tutorials. NLTK-Lite is the version under active development.
Charming Python: Get started with the Natural Language Toolkit
This installment introduces you to the Natural Language Toolkit, a Python library for applying academic linguistic techniques to collections of textual data. Programming that goes by the name "text processing" is a start; other capabilities for syntactic
Helmut Schmid's Homepage
Home page of Helmut Schmid, a researcher in NLP who works especially on probabilistic grammar models.
