James S. Adelman, Gordon D. A. Brown, and José F. Quesada.
Contextual Diversity Not Word Frequency Determines Word
Naming and Lexical Decision Times.
Psychological Science, 17(9):814-823, 2006.
Word frequency is an important predictor of word-naming and lexical decision times. It is, however, confounded with contextual diversity, the number of contexts in which a word has been seen. In a study using a normative, corpus-based measure of contextual diversity, word-frequency effects were eliminated when effects of contextual diversity were taken into account (but not vice versa) across three naming and three lexical decision data sets; the same pattern of results was obtained regardless of which of three corpora was used to derive the frequency and contextual-diversity values. The results are incompatible with existing models of visual word recognition, which attribute frequency effects directly to frequency, and are particularly problematic for accounts in which frequency effects reflect learning. We argue that the results reflect the importance of likely need in memory processes, and that the continuity between reading and memory suggests using principles from memory research to inform theories of reading.
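The distinction between the two predictors can be illustrated with a minimal sketch. It assumes that a "context" is one document of a tokenized corpus; the function name and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import Counter


def frequency_and_diversity(documents):
    """Return (word_frequency, contextual_diversity) for tokenized documents.

    Assumption: each document is one "context".
    """
    word_frequency = Counter()
    contextual_diversity = Counter()
    for tokens in documents:
        # Word frequency counts every occurrence of a token.
        word_frequency.update(tokens)
        # Contextual diversity counts each document at most once per word.
        contextual_diversity.update(set(tokens))
    return word_frequency, contextual_diversity


# Toy corpus (invented for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "tax tax tax tax".split(),
]
wf, cd = frequency_and_diversity(docs)
print(wf["the"], cd["the"])  # same frequency as "tax" ...
print(wf["tax"], cd["tax"])  # ... but "tax" occurs in fewer contexts
```

In this toy corpus "the" and "tax" have equal frequency but differ in contextual diversity, which is the kind of dissociation the abstract relies on.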
The Integration of Morphological and Whole-Word Form
Information During Eye Fixations on Prefixed and Suffixed Words.
Journal of Memory and Language, 35(6):801-820, 1996.
Two experiments explored the extent to which morphological information is used in reading. Experiment 1 showed that the surface frequency of the whole-word form of suffixed and prefixed words does not affect first-fixation duration. Experiment 2 demonstrated that cumulative root frequency affected first-fixation duration on suffixed words and second-fixation duration on prefixed words. The results demonstrate that morphological codes are used in integrating information in word reading. More precisely, morphological information is integrated at different times during the reading process as a function of the root's position within a word. The findings suggest that lexical access operates on the basis of the root only for suffixed words. (C) 1996 Academic Press, Inc.
Finite State Methods for Hyphenation.
Natural Language Engineering, 9(1):5-20, 2003.
Hyphenation is the task of identifying potential hyphenation points in words. In this paper, three finite-state hyphenation methods for Dutch are presented and compared in terms of accuracy and size of the resulting automata.
Max Coltheart, Eddy J. Davelaar, Jon Torfi Jonasson, and Derek Besner. Access to the Internal Lexicon. In Stanislav Dornic, editor, Attention and Performance, volume VI, pages 535-555. Lawrence Erlbaum Associates, Hillsdale, 1977.
T. Brants and A. Franz.
Web 1T 5-gram, 10 European Languages, Version 1. Linguistic Data Consortium, Philadelphia, London, 2009.
GRAFON: A Grapheme-to-Phoneme Conversion System for
In Dénes Vargha, editor, Proceedings of the 12th Conference
on Computational Linguistics, volume 1, pages 133-138, Morristown, NJ,
We describe a set of modules that together make up a grapheme-to-phoneme conversion system for Dutch. Modules include a syllabification program, a fast morphological parser, a lexical database, a phonological knowledge base, transliteration rules, and phonological rules. Knowledge and procedures were implemented object-orientedly. We contrast GRAFON to recent pattern recognition and rule-compiler approaches and try to show that the first fails for languages with concatenative compounding (like Dutch, German, and Scandinavian languages) while the second lacks the flexibility to model different phonological theories. It is claimed that syllables (and not graphemes/phonemes or morphemes) should be central units in a rule-based phonemisation algorithm. Furthermore, the architecture of GRAFON and its user interface make it ideally suited as a rule-testing tool for phonologists.
Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. Computer Science Series. McGraw-Hill, New York, NY, 1983.
The DWDS Corpus: A Reference Corpus for the German
Language of the 20th Century.
In Christiane Fellbaum, editor, Collocations and Idioms:
Linguistic, Lexicographic, and Computational Aspects. Continuum
Press, London, 2006.
The DWDS corpus, constructed at the Berlin-Brandenburg Academy of Sciences (BBAW) between 2000 and 2003, consists altogether of over a billion words of running text. Corpus building continues to be an activity at BBAW. The current corpus consists of two parts: a core corpus and an extended corpus. The core corpus contains approximately 100 million running words, balanced chronologically and by text genre in approximately 80,000 documents. About 40 percent of these, i.e. approximately 160,000 pages, were digitized from printed resources by the project. The remaining texts were obtained from publishing houses or donated by contributors. The core corpus is unique in German speaking countries and constitutes, for German, a resource equivalent in quality to the British National Corpus. The extended corpus contains more than 900 million text words. It is an opportunistic corpus, consisting essentially of newspaper sources from the last 15 years. Copyright clearance has been obtained from major publishing houses, enabling DWDS users to access the works of important literary and scientific authors including Heinrich Böll, Jürgen Habermas, Victor Klemperer, Karl Kraus, Siegfried Lenz, and Thomas and Heinrich Mann. All the texts of the core corpus are lemmatized and part-of-speech tagged, and can be queried with DDC (Dialing DWDS Concordancer), a linguistic search engine, on the project's web site. It is intended to gradually add further texts of the 20th century and to extend the corpora to the 21st century and also to texts before 1900.
Alexander Geyken and Thomas Hanneforth.
TAGH: A Complete Morphology for German based on Weighted
Finite State Automata.
In Finite State Methods and Natural Language
Processing, volume 4002 of Lecture Notes in Computer Science,
Berlin, Heidelberg, 2006. Springer.
TAGH is a system for automatic recognition of German word forms. It is based on a stem lexicon with allomorphs and a concatenative mechanism for inflection and word formation. Weighted FSA and a cost function are used in order to determine the correct segmentation of complex forms: the correct segmentation for a given compound is supposed to be the one with the least cost. TAGH is based on a large stem lexicon of almost 80,000 stems that was compiled within 5 years on the basis of large newspaper corpora and literary texts. The number of analyzable word forms is increased considerably by more than 1000 different rules for derivational and compositional word formation. The recognition rate of TAGH is more than 99% for modern newspaper text and approximately 98.5% for literary texts.
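The least-cost principle can be sketched with plain dynamic programming over a small dictionary. This is not TAGH's weighted finite-state implementation; the stem costs, the boundary penalty, and the function name below are all invented for illustration.

```python
def best_segmentation(word, stem_costs, boundary_cost=1.0):
    """Return (cost, segments) for the cheapest split of `word` into known
    stems, or None if no split exists.

    Assumptions: `stem_costs` maps each stem to a non-negative cost, and
    every internal boundary adds `boundary_cost`.
    """
    n = len(word)
    best = [None] * (n + 1)  # best[i] holds the cheapest analysis of word[:i]
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is None or piece not in stem_costs:
                continue
            cost = best[start][0] + stem_costs[piece]
            if start > 0:
                cost += boundary_cost  # penalize each additional segment
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, best[start][1] + [piece])
    return best[n]


# Hypothetical costs: "Staubecken" is ambiguous between Stau+Becken
# and Staub+Ecken; the rarer reading is given a higher cost.
costs = {"stau": 2.0, "becken": 2.0, "staub": 2.0, "ecken": 5.0}
print(best_segmentation("staubecken", costs))
```

With these invented weights the analysis stau + becken wins because its summed cost is lower than that of staub + ecken.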
A Hybrid Approach to Part-of-Speech Tagging.
Final report, Kollokationen im Wörterbuch, Berlin-Brandenburgische
Akademie der Wissenschaften, 2003.
Part-of-Speech (PoS) Tagging - the automatic annotation of lexical categories - is a widely used early stage of linguistic text analysis. One approach, rule-based morphological analysis, employs linguistic knowledge in the form of hand-coded rules to derive a set of possible analyses for each input token, but is known to produce highly ambiguous results. Stochastic tagging techniques such as Hidden Markov Models (HMMs) make use of both lexical and bigram probabilities estimated from a tagged training corpus in order to compute the most likely PoS tag sequence for each input sentence, but provide no allowance for prior linguistic knowledge. In this report, I describe the dwdst2 PoS tagging library, which makes use of a rule-based morphological component to extend traditional HMM techniques by the inclusion of lexical class probabilities and theoretically motivated search space reduction.
Alan Kennedy, Joël Pynte, and Stéphanie Ducrot.
Parafoveal-on-Foveal Interactions in Word Recognition.
The Quarterly Journal of Experimental Psychology Section A:
Human Experimental Psychology, 55(4):1307-1337, 2002.
An experiment is reported in which participants read sequences of five words, looking for items describing articles of clothing. The third and fourth words in critical sequences were defined as "foveal" and "parafoveal" words, respectively. The length and frequency of foveal words and the length, frequency, and initial-letter constraint of parafoveal words were manipulated. Gaze and refixation rate on the foveal word were measured as a function of properties of the parafoveal word. The results show that measured gaze on a given foveal word is systematically modulated by properties of an unfixated parafoveal word. It is suggested that apparent inconsistencies in previous studies of parafoveal-on-foveal effects relate to a failure to control for foveal word length and hence the visibility of parafoveal words. A serial-sequential attention-switching model of eye movement control cannot account for the pattern of obtained effects. The data are also incompatible with various forms of parallel-processing model. They are best accounted for by postulating a process-monitoring mechanism, sensitive to the simultaneous rate of acquisition of information from foveal and parafoveal sources.
Reinhold Kliegl, Ellen Grabner, Martin Rolfs, and Ralf Engbert.
Length, Frequency, and Predictability Effects of Words on
Eye Movements in Reading.
European Journal of Cognitive Psychology, 16(1-2):262-284,
We tested the effects of word length, frequency, and predictability on inspection durations (first fixation, single fixation, gaze duration, and reading time) and inspection probabilities during first-pass reading (skipped, once, twice) for a corpus of 144 German sentences (1138 words) and a subset of 144 target words uncorrelated in length and frequency, read by 33 young and 32 older adults. For corpus words, length and frequency were reliably related to inspection durations and probabilities, predictability only to inspection probabilities. For first-pass reading of target words all three effects were reliable for inspection durations and probabilities. Low predictability was strongly related to second-pass reading. Older adults read slower than young adults and had a higher frequency of regressive movements. The data are to serve as a benchmark for computational models of eye movement control in reading.
Reinhold Kliegl, Antje Nuthmann, and Ralf Engbert.
Tracking the Mind During Reading: The Influence of Past,
Present, and Future Words on Fixation Durations.
Journal of Experimental Psychology: General, 135(1):12-35,
Reading requires the orchestration of visual, attentional, language-related, and oculomotor processing constraints. This study replicates previous effects of frequency, predictability, and length of fixated words on fixation durations in natural reading and demonstrates new effects of these variables related to previous and next words. Results are based on fixation durations recorded from 222 persons, each reading 144 sentences. Such evidence for distributed processing of words across fixation durations challenges psycholinguistic immediacy-of-processing and eye-mind assumptions. Most of the time the mind processes several words in parallel at different perceptual and cognitive levels. Eye movements can help to unravel these processes.
Vladimir Iosifovich Levenshtein.
Binary Codes Capable of Correcting Deletions, Insertions,
Soviet Physics-Doklady, 10(8):845-848, 1966.
Investigations of transmissions of binary information usually consider a channel model in which failures of the type 0 -> 1 and 1 -> 0 (which we will call reversals) are admitted. In the present paper (as in ) we investigate a channel model in which it is also possible to have failures of the form 0 -> Lambda, 1 -> Lambda, which are called deletions, and failures of the form Lambda -> 0, Lambda -> 1, which are called insertions (here Lambda is the empty word). For such channels, by analogy to the combinatorial problem of constructing optimal codes capable of correcting s reversals, we will consider the problem of constructing optimal codes capable of correcting deletions, insertions, and reversals.
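The three failure types named in this abstract are the edit operations of the string metric now commonly called Levenshtein distance. The following is a minimal sketch of the standard dynamic-programming computation of that distance; it is not code from the paper, and the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions, and substitutions
    (the "reversals" of the binary case) needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # delete char_a
                current[j - 1] + 1,                  # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("0110", "010"))       # one deletion
print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.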
Susan D. Lima and Albrecht Werner Inhoff.
Lexical Access During Eye Fixations in Reading: Effects
of Word-Initial Letter Sequence.
Journal of Experimental Psychology: Human Perception and
Performance, 11(3):272-285, 1985.
Two experiments tested the hypothesis that lexical access in reading is initiated on the basis of word-initial letter information obtainable in the parafoveal region. Eye movements were monitored while college students read sentences containing target words whose initial trigram (Experiment 1) or bigram (Experiment 2) imposed either a high or a low degree of constraint in the lexicon. In contradiction to our hypothesis, high-constraint words (e.g., DWARF) received longer fixations than did low-constraint words (e.g., CLOWN), despite the fact that high-constraint words have an initial letter sequence shared by few other words in the lexicon. Moreover, a comparison of fixation times in viewing conditions with and without parafoveal letter information showed that the amount of decrease in target fixation time due to prior parafoveal availability was the same for high-constraint and low-constraint targets. We concluded that increased familiarity of word-initial letter sequence is beneficial to lexical access and that familiarity affects the efficiency of foveal but not parafoveal processing.
Wayne S. Murray and Kenneth I. Forster.
Serial Mechanisms in Lexical Access: The Rank Hypothesis.
Psychological Review, 111(3):721-756, 2004.
There is general agreement that the effect of frequency on lexical access time is roughly logarithmic, although little attention has been given to the reason for this. The authors argue that models of lexical access that incorporate a frequency-ordered serial comparison or verification procedure provide an account of this effect and predict that the underlying function directly relates access time to the rank order of words in a frequency-ordered set. For both group data and individual data, it is shown that rank provides a better fit to the data than does a function based on log frequency. Extensions to a search model are proposed that account for error rates and latencies and the effect of age of acquisition, which is interpreted as an effect of cumulative frequency.
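The two candidate predictors compared in this abstract can be derived from the same frequency table. The sketch below assumes a plain word-to-count mapping; the function name and the counts are hypothetical, and ties in frequency are broken by input order, which is an arbitrary choice.

```python
import math


def rank_and_log_frequency(frequencies):
    """Map each word to (rank, log frequency).

    Rank 1 is the most frequent word in the frequency-ordered set.
    """
    ordered = sorted(frequencies, key=frequencies.get, reverse=True)
    return {
        word: (rank, math.log(frequencies[word]))
        for rank, word in enumerate(ordered, start=1)
    }


# Invented counts for illustration.
counts = {"the": 1000, "cat": 100, "zebra": 10}
predictors = rank_and_log_frequency(counts)
print(predictors["the"], predictors["zebra"])
```

Either column could then be entered as a predictor of access time in a regression, which is where the two accounts would be compared.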
Steven T. Piantadosi, Harry Tily, and Edward Gibson.
Word lengths are optimized for efficient communication.
Proceedings of the National Academy of Sciences, 108(9):3526-3529, 2011.
We demonstrate a substantial improvement on one of the most celebrated empirical laws in the study of language, Zipf's 75-y-old theory that word length is primarily determined by frequency of use. In accord with rational theories of communication, we show across 10 languages that average information content is a much better predictor of word length than frequency. This indicates that human lexicons are efficiently structured for communication by taking into account interword statistical dependencies. Lexical systems result from an optimization of communicative pressures, coding meanings efficiently given the complex statistics of natural language use.
Anne Schiller, Simone Teufel, Christine Stöckert, and Christine Thielen. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report, Institut für maschinelle Sprachverarbeitung, Stuttgart, 1999.
Regex - POSIX.2 Regular Expressions.
Linux Programmer's Manual, 2007.
Regular expressions ("RE"s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep; POSIX.2 calls these "extended" REs) and obsolete REs (roughly those of ed(1); POSIX.2 "basic" REs). Obsolete REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. POSIX.2 leaves some aspects of RE syntax and semantics open; "(!)" marks decisions on these aspects that may not be fully portable to other POSIX.2 implementations.
Tal Yarkoni, David Balota, and Melvin Yap.
Moving Beyond Coltheart's n: A New Measure of
Psychonomic Bulletin & Review, 15(5):971-979, 2008.
Visual word recognition studies commonly measure the orthographic similarity of words using Coltheart's orthographic neighborhood size metric (ON). Although ON reliably predicts behavioral variability in many lexical tasks, its utility is inherently limited by its relatively restrictive definition. In the present article, we introduce a new measure of orthographic similarity generated using a standard computer science metric of string similarity (Levenshtein distance). Unlike ON, the new measure, named orthographic Levenshtein distance 20 (OLD20), incorporates comparisons between all pairs of words in the lexicon, including words of different lengths. We demonstrate that OLD20 provides significant advantages over ON in predicting both lexical decision and pronunciation performance in three large data sets. Moreover, OLD20 interacts more strongly with word frequency and shows stronger effects of neighborhood frequency than does ON. The discussion section focuses on the implications of these results for models of visual word recognition.
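The contrast between the two measures can be sketched as follows. The code assumes that OLD20 is the mean Levenshtein distance from a word to its 20 closest lexicon entries and that ON counts same-length words differing in exactly one letter; the function names and the toy lexicon are hypothetical, and when fewer than `n` other words exist the sketch simply averages over those available.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def old_n(word, lexicon, n=20):
    """Mean distance from `word` to its `n` nearest other lexicon entries."""
    distances = sorted(levenshtein(word, other)
                       for other in lexicon if other != word)
    nearest = distances[:n]
    if not nearest:
        raise ValueError("lexicon contains no other words")
    return sum(nearest) / len(nearest)


def coltheart_n(word, lexicon):
    """Count same-length lexicon entries differing in exactly one letter."""
    return sum(1 for other in lexicon
               if len(other) == len(word)
               and sum(x != y for x, y in zip(word, other)) == 1)


# Toy lexicon (invented); a real application would use a full word list.
lexicon = ["cat", "cot", "coat", "cart", "dog", "cab"]
print(coltheart_n("cat", lexicon))  # ignores "coat" and "cart"
print(old_n("cat", lexicon, n=5))   # includes words of other lengths
```

In the toy lexicon "coat" and "cart" are each one edit away from "cat" but do not count as ON neighbors, which illustrates the restriction the abstract describes.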
The Psychobiology of Language. Routledge, London, 1936.