Neural Network Methods for Natural Language Processing

Synthesis Lectures on Human Language Technologies
Series Editor: Graeme Hirst, University of Toronto

Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural language processing, computational linguistics, information retrieval, and spoken language understanding. Emphasis is on important new techniques, on new applications, and on topics that combine two or more HLT subfields.

Neural Network Methods for Natural Language Processing. Yoav Goldberg, 2017
Syntax-based Statistical Machine Translation. Philip Williams, Rico Sennrich, Matt Post, and Philipp Koehn, 2016
Domain-Sensitive Temporal Tagging. Jannik Strötgen and Michael Gertz, 2016
Linked Lexical Knowledge Bases: Foundations and Applications. Iryna Gurevych, Judith Eckle-Kohler, and Michael Matuschek, 2016
Bayesian Analysis in Natural Language Processing. Shay Cohen, 2016
Metaphor: A Computational Perspective. Tony Veale, Ekaterina Shutova, and Beata Beigman Klebanov, 2016
Grammatical Inference for Computational Linguistics. Jeffrey Heinz, Colin de la Higuera, and Menno van Zaanen, 2015
Automatic Detection of Verbal Deception. Eileen Fitzpatrick, Joan Bachenko, and Tommaso Fornaciari, 2015
Natural Language Processing for Social Media. Atefeh Farzindar and Diana Inkpen, 2015
Semantic Similarity from Natural Language and Ontology Analysis. Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain, 2015
Learning to Rank for Information Retrieval and Natural Language Processing, Second Edition. Hang Li, 2014
Ontology-Based Interpretation of Natural Language. Philipp Cimiano, Christina Unger, and John McCrae, 2014
Automated Grammatical Error Detection for Language Learners, Second Edition. Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault, 2014
Web Corpus Construction. Roland Schäfer and Felix Bildhauer, 2013
Recognizing Textual Entailment: Models and Applications. Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto, 2013
Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax. Emily M. Bender, 2013
Semi-Supervised Learning and Domain Adaptation in Natural Language Processing. Anders Søgaard, 2013
Semantic Relations Between Nominals. Vivi Nastase, Preslav Nakov, Diarmuid Ó Séaghdha, and Stan Szpakowicz, 2013
Computational Modeling of Narrative. Inderjeet Mani, 2012
Natural Language Processing for Historical Texts. Michael Piotrowski, 2012
Sentiment Analysis and Opinion Mining. Bing Liu, 2012
Discourse Processing. Manfred Stede, 2011
Bitext Alignment. Jörg Tiedemann, 2011
Linguistic Structure Prediction. Noah A. Smith, 2011
Learning to Rank for Information Retrieval and Natural Language Processing. Hang Li, 2011
Computational Modeling of Human Language Acquisition. Afra Alishahi, 2010
Introduction to Arabic Natural Language Processing. Nizar Y. Habash, 2010
Cross-Language Information Retrieval. Jian-Yun Nie, 2010
Automated Grammatical Error Detection for Language Learners. Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault, 2010
Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer, 2010
Semantic Role Labeling. Martha Palmer, Daniel Gildea, and Nianwen Xue, 2010
Spoken Dialogue Systems. Kristiina Jokinen and Michael McTear, 2009
Introduction to Chinese Natural Language Processing. Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang, 2009
Introduction to Linguistic Annotation and Text Analytics. Graham Wilcock, 2009
Dependency Parsing. Sandra Kübler, Ryan McDonald, and Joakim Nivre, 2009
Statistical Language Models for Information Retrieval. ChengXiang Zhai, 2008

© Springer Nature Switzerland AG 2022
Reprint of original edition © Morgan & Claypool 2017

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.

Neural Network Methods for Natural Language Processing
Yoav Goldberg
ISBN: 978-3-031-01037-8 (paperback)
ISBN: 978-3-031-02165-7 (ebook)
DOI: 10.1007/978-3-031-02165-7

A Publication in the Springer series SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES, Lecture #37
Series Editor: Graeme Hirst, University of Toronto
Series ISSN: 1947-4040 (print), 1947-4059 (electronic)

Neural Network Methods for Natural Language Processing
Yoav Goldberg, Bar Ilan University
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES #37

ABSTRACT

Neural networks are a family of powerful machine learning models. This book focuses on the application of neural network models to natural language data. The first half of the book (Parts I and II) covers the basics of supervised machine learning and feed-forward neural networks, the basics of working with machine learning over language data, and the use of vector-based rather than symbolic representations for words. It also covers the computation-graph abstraction, which allows one to easily define and train arbitrary neural networks, and is the basis behind the design of contemporary neural network software libraries. The second part of the book (Parts III and IV) introduces more specialized neural network architectures, including 1D convolutional neural networks, recurrent neural networks, conditioned-generation models, and attention-based models. These architectures and techniques are the driving force behind state-of-the-art algorithms for machine translation, syntactic parsing, and many other applications. Finally, we also discuss tree-shaped networks, structured prediction, and the prospects of multi-task learning.

KEYWORDS

natural language processing, machine learning, supervised learning, deep learning, neural networks, word embeddings, recurrent neural networks, sequence to sequence models

Contents

Preface
Acknowledgments

1 Introduction
  1.1 The Challenges of Natural Language Processing
  1.2 Neural Networks and Deep Learning
  1.3 Deep Learning in NLP
    1.3.1 Success Stories
  1.4 Coverage and Organization
  1.5 What's not Covered
  1.6 A Note on Terminology
  1.7 Mathematical Notation

PART I: Supervised Classification and Feed-forward Neural Networks

2 Learning Basics and Linear Models
  2.1 Supervised Learning and Parameterized Functions
  2.2 Train, Test, and Validation Sets
  2.3 Linear Models
    2.3.1 Binary Classification
    2.3.2 Log-linear Binary Classification
    2.3.3 Multi-class Classification
  2.4 Representations
  2.5 One-Hot and Dense Vector Representations
  2.6 Log-linear Multi-class Classification
  2.7 Training as Optimization
    2.7.1 Loss Functions
    2.7.2 Regularization
  2.8 Gradient-based Optimization
    2.8.1 Stochastic Gradient Descent
    2.8.2 Worked-out Example
    2.8.3 Beyond SGD

3 From Linear Models to Multi-layer Perceptrons
  3.1 Limitations of Linear Models: The XOR Problem
  3.2 Nonlinear Input Transformations
  3.3 Kernel Methods
  3.4 Trainable Mapping Functions

4 Feed-forward Neural Networks
  4.1 A Brain-inspired Metaphor
  4.2 In Mathematical Notation
  4.3 Representation Power
  4.4 Common Nonlinearities
  4.5 Loss Functions
  4.6 Regularization and Dropout
  4.7 Similarity and Distance Layers
  4.8 Embedding Layers

5 Neural Network Training
  5.1 The Computation Graph Abstraction
    5.1.1 Forward Computation
    5.1.2 Backward Computation (Derivatives, Backprop)
    5.1.3 Software
    5.1.4 Implementation Recipe
    5.1.5 Network Composition
  5.2 Practicalities
    5.2.1 Choice of Optimization Algorithm
    5.2.2 Initialization
    5.2.3 Restarts and Ensembles
    5.2.4 Vanishing and Exploding Gradients
    5.2.5 Saturation and Dead Neurons
    5.2.6 Shuffling
    5.2.7 Learning Rate
    5.2.8 Minibatches

PART II: Working with Natural Language Data

6 Features for Textual Data
  6.1 Typology of NLP Classification Problems
  6.2 Features for NLP Problems
    6.2.1 Directly Observable Properties
    6.2.2 Inferred Linguistic Properties
    6.2.3 Core Features vs. Combination Features
    6.2.4 Ngram Features
    6.2.5 Distributional Features

7 Case Studies of NLP Features
  7.1 Document Classification: Language Identification
  7.2 Document Classification: Topic Classification
  7.3 Document Classification: Authorship Attribution
  7.4 Word-in-context: Part of Speech Tagging
  7.5 Word-in-context: Named Entity Recognition
  7.6 Word in Context, Linguistic Features: Preposition Sense Disambiguation
  7.7 Relation Between Words in Context: Arc-Factored Parsing

8 From Textual Features to Inputs
  8.1 Encoding Categorical Features
    8.1.1 One-hot Encodings
    8.1.2 Dense Encodings (Feature Embeddings)
    8.1.3 Dense Vectors vs. One-hot Representations
  8.2 Combining Dense Vectors
    8.2.1 Window-based Features
    8.2.2 Variable Number of Features: Continuous Bag of Words
  8.3 Relation Between One-hot and Dense Vectors
  8.4 Odds and Ends
    8.4.1 Distance and Position Features
    8.4.2 Padding, Unknown Words, and Word Dropout
    8.4.3 Feature Combinations
    8.4.4 Vector Sharing
    8.4.5 Dimensionality
    8.4.6 Embeddings Vocabulary
    8.4.7 Network's Output
  8.5 Example: Part-of-Speech Tagging
  8.6 Example: Arc-factored Parsing

9 Language Modeling
  9.1 The Language Modeling Task
  9.2 Evaluating Language Models: Perplexity
  9.3 Traditional Approaches to Language Modeling
    9.3.1 Further Reading
    9.3.2 Limitations of Traditional Language Models
  9.4 Neural Language Models
  9.5 Using Language Models for Generation
  9.6 Byproduct: Word Representations

10 Pre-trained Word Representations
  10.1 Random Initialization
  10.2 Supervised Task-specific Pre-training
  10.3 Unsupervised Pre-training
    10.3.1 Using Pre-trained Embeddings
  10.4 Word Embedding Algorithms
    10.4.1 Distributional Hypothesis and Word Representations
    10.4.2 From Neural Language Models to Distributed Representations
    10.4.3 Connecting the Worlds
    10.4.4 Other Algorithms
  10.5 The Choice of Contexts
    10.5.1 Window Approach
    10.5.2 Sentences, Paragraphs, or Documents
    10.5.3 Syntactic Window
    10.5.4 Multilingual
    10.5.5 Character-based and Sub-word Representations
  10.6 Dealing with Multi-word Units and Word Inflections
  10.7 Limitations of Distributional Methods

11 Using Word Embeddings
  11.1 Obtaining Word Vectors
  11.2 Word Similarity
  11.3 Word Clustering
  11.4 Finding Similar Words
    11.4.1 Similarity to a Group of Words
  11.5 Odd-one Out
  11.6 Short Document Similarity
  11.7 Word Analogies
  11.8 Retrofitting and Projections
  11.9 Practicalities and Pitfalls

12 Case Study: A Feed-forward Architecture for Sentence Meaning Inference
  12.1 Natural Language Inference and the SNLI Dataset
  12.2 A Textual Similarity Network

PART III: Specialized Architectures

13 Ngram Detectors: Convolutional Neural Networks
  13.1 Basic Convolution + Pooling
    13.1.1 1D Convolutions Over Text
    13.1.2 Vector Pooling
    13.1.3 Variations
  13.2 Alternative: Feature Hashing
  13.3 Hierarchical Convolutions

14 Recurrent Neural Networks: Modeling Sequences and Stacks
  14.1 The RNN Abstraction
  14.2 RNN Training
  14.3 Common RNN Usage-patterns
    14.3.1 Acceptor
    14.3.2 Encoder
    14.3.3 Transducer
  14.4 Bidirectional RNNs (biRNN)
  14.5 Multi-layer (stacked) RNNs
  14.6 RNNs for Representing Stacks
  14.7 A Note on Reading the Literature

15 Concrete Recurrent Neural Network Architectures
  15.1 CBOW as an RNN
  15.2 Simple RNN
  15.3 Gated Architectures
    15.3.1 LSTM
    15.3.2 GRU
  15.4 Other Variants
  15.5 Dropout in RNNs

16 Modeling with Recurrent Networks
  16.1 Acceptors
    16.1.1 Sentiment Classification
    16.1.2 Subject-verb Agreement Grammaticality Detection
  16.2 RNNs as Feature Extractors
    16.2.1 Part-of-speech Tagging
    16.2.2 RNN–CNN Document Classification
    16.2.3 Arc-factored Dependency Parsing

17 Conditioned Generation
  17.1 RNN Generators
    17.1.1 Training Generators
  17.2 Conditioned Generation (Encoder-Decoder)
    17.2.1 Sequence to Sequence Models
    17.2.2 Applications
    17.2.3 Other Conditioning Contexts
  17.3 Unsupervised Sentence Similarity
  17.4 Conditioned Generation with Attention
    17.4.1 Computational Complexity
    17.4.2 Interpretability
  17.5 Attention-based Models in NLP
    17.5.1 Machine Translation
    17.5.2 Morphological Inflection
    17.5.3 Syntactic Parsing

PART IV: Additional Topics

18 Modeling Trees with Recursive Neural Networks
  18.1 Formal Definition
  18.2 Extensions and Variations
  18.3 Training Recursive Neural Networks
  18.4 A Simple Alternative–Linearized Trees
  18.5 Outlook

19 Structured Output Prediction
  19.1 Search-based Structured Prediction
    19.1.1 Structured Prediction with Linear Models
    19.1.2 Nonlinear Structured Prediction
    19.1.3 Probabilistic Objective (CRF)
    19.1.4 Approximate Search
    19.1.5 Reranking
    19.1.6 See Also
  19.2 Greedy Structured Prediction
  19.3 Conditional Generation as Structured Output Prediction
  19.4 Examples
    19.4.1 Search-based Structured Prediction: First-order Dependency Parsing
    19.4.2 Neural-CRF for Named Entity Recognition
    19.4.3 Approximate NER-CRF With Beam-Search

20 Cascaded, Multi-task and Semi-supervised Learning
  20.1 Model Cascading
  20.2 Multi-task Learning
    20.2.1 Training in a Multi-task Setup
    20.2.2 Selective Sharing
    20.2.3 Word-embeddings Pre-training as Multi-task Learning
    20.2.4 Multi-task Learning in Conditioned Generation
    20.2.5 Multi-task Learning as Regularization
    20.2.6 Caveats
  20.3 Semi-supervised Learning
  20.4 Examples
    20.4.1 Gaze-prediction and Sentence Compression
    20.4.2 Arc Labeling and Syntactic Parsing
    20.4.3 Preposition Sense Disambiguation and Preposition Translation Prediction
    20.4.4 Conditioned Generation: Multilingual Machine Translation, Parsing, and Image Captioning
  20.5 Outlook

21 Conclusion
  21.1 What Have We Seen?
  21.2 The Challenges Ahead

Bibliography
Author's Biography

Preface

Natural language processing (NLP) is a collective term referring to automatic computational processing of human languages.
This includes both algorithms that take human-produced text as input, and algorithms that produce natural-looking text as output. The need for such algorithms is ever increasing: humans produce ever-increasing amounts of text each year, and expect computer interfaces to communicate with them in their own language. Natural language processing is also very challenging, as human language is inherently ambiguous, ever changing, and not well defined.

Natural language is symbolic in nature, and the first attempts at processing language were symbolic: based on logic, rules, and ontologies. However, natural language is also highly ambiguous and highly variable, calling for a more statistical algorithmic approach. Indeed, the current-day dominant approaches to language processing are all based on statistical machine learning. For over a decade, core NLP techniques were dominated by linear modeling approaches to supervised learning, centered around algorithms such as Perceptrons, linear Support Vector Machines, and Logistic Regression, trained over very high dimensional yet very sparse feature vectors.

Around 2014, the field started to see some success in switching from such linear models over sparse inputs to nonlinear neural network models over dense inputs. Some of the neural-network techniques are simple generalizations of the linear models and can be used as almost drop-in replacements for the linear classifiers. Others are more advanced, require a change of mindset, and provide new modeling opportunities. In particular, a family of approaches based on recurrent neural networks (RNNs) alleviates the reliance on the Markov assumption that was prevalent in sequence models, allowing models to condition on arbitrarily long sequences and produce effective feature extractors. These advances led to breakthroughs in language modeling, automatic machine translation, and various other applications.

While powerful, the neural network methods exhibit a rather strong barrier of entry, for various reasons. In this book, I attempt to provide NLP practitioners as well as newcomers with the basic background, jargon, tools, and methodologies that will allow them to understand the principles behind neural network models for language, and apply them in their own work. I also hope to provide machine learning and neural network practitioners with the background, jargon, tools, and mindset that will allow them to effectively work with language data. Finally, I hope this book can also serve as a relatively gentle (if somewhat incomplete) introduction to both NLP and machine learning for people who are newcomers to both fields.

INTENDED READERSHIP

This book is aimed at readers with a technical background in computer science or a related field, who want to get up to speed with neural network techniques for natural language processing. While the primary audience of the book is graduate students in language processing and machine learning, I made an effort to make it useful also to established researchers in either NLP or machine learning (by including some advanced material), and to people without prior exposure to either machine learning or NLP (by covering the basics from the ground up). This last group of people will, obviously, need to work harder. While the book is self-contained, I do assume knowledge of mathematics, in particular undergraduate-level probability, algebra, and calculus, as well as basic knowledge of algorithms and data structures.
Prior exposure to machine learning is very helpful, but not required.

This book evolved out of a survey paper [Goldberg, 2016], which was greatly expanded and somewhat re-organized to provide a more comprehensive exposition, and more in-depth coverage of some topics that were left out of the survey for various reasons. This book also contains many more concrete examples of applications of neural networks to language data that do not exist in the survey. While this book is intended to be useful also for people without NLP or machine learning backgrounds, the survey paper assumes knowledge in the field. Indeed, readers who are familiar with natural language processing as practiced between roughly 2006 and 2014, with heavy reliance on machine learning and linear models, may find the journal version quicker to read and better organized for their needs. However, such readers may also appreciate reading the chapters on word embeddings (10 and 11), the chapter on conditioned generation with RNNs (17), and the chapters on structured prediction and multi-task learning (MTL) (19 and 20).

FOCUS OF THIS BOOK

This book is intended to be self-contained, while presenting the different approaches under a unified notation and framework. However, the main purpose of the book is to introduce the neural-network (deep-learning) machinery and its application to language data, and not to provide in-depth coverage of the basics of machine learning theory and natural language technology. I refer the reader to external sources when these are needed. Likewise, the book is not intended as a comprehensive resource for those who will go on to develop the next advances in neural network machinery (although it may serve as a good entry point). Rather, it is aimed at those readers who are interested in taking the existing, useful technology and applying it in useful and creative ways to their favorite language-processing problems.

Further reading

For in-depth, general discussion of neural networks, the theory behind them, advanced optimization methods, and other advanced topics, the reader is referred to other existing resources. In particular, the book by Bengio et al. [2016] is highly recommended. For a friendly yet rigorous introduction to practical machine learning, the freely available book of Daumé III [2015] is highly recommended. For a more theoretical treatment of machine learning, see the freely available textbook of Shalev-Shwartz and Ben-David [2014] and the textbook of Mohri et al. [2012]. For a strong introduction to NLP, see the book of Jurafsky and Martin [2008]. The information retrieval book by Manning et al. [2008] also contains relevant information for working with language data. Finally, for getting up to speed with linguistic background, the book of Bender [2013] in this series provides a concise but comprehensive coverage, directed at computationally minded readers. The first chapters of the introductory grammar book by Sag et al. [2003] are also worth reading.

As of this writing, the progress of research in neural networks and deep learning is very fast paced. The state of the art is a moving target, and I cannot hope to stay up to date with the latest and greatest. The focus is thus on covering the more established and robust techniques, which have proven to work well on several occasions, as well as selected techniques that are not yet fully functional but that I find to be established and/or promising enough for inclusion.
Yoav Goldberg
March 2017

Acknowledgments

This book grew out of a survey paper I've written on the topic [Goldberg, 2016], which in turn grew out of my frustration with the lack of organized and clear material on the intersection of deep learning and natural language processing, as I was trying to learn it and teach it to my students and collaborators. I am thus indebted to the numerous people who commented on the survey paper (in its various forms, from initial drafts to post-publication comments), as well as to the people who commented on various stages of the book's draft. Some commented in person, some over email, and some in random conversations on Twitter. The book was also influenced by people who did not comment on it per se (indeed, some never read it) but discussed topics related to it. Some are deep learning experts, some are NLP experts, some are both, and others were trying to learn both topics. Some (few) contributed through very detailed comments, others by discussing small details, others in between. But each of them influenced the final form of the book. They are, in alphabetical order: Yoav Artzi, Yonatan Aumann, Jason Baldridge, Miguel Ballesteros, Mohit Bansal, Marco Baroni, Tal Baumel, Sam Bowman, Jordan Boyd-Graber, Chris Brockett, Ming-Wei Chang, David Chiang, Kyunghyun Cho, Grzegorz Chrupala, Alexander Clark, Raphael Cohen, Ryan Cotterell, Hal Daumé III, Nicholas Dronen, Chris Dyer, Jacob Eisenstein, Jason Eisner, Michael Elhadad, Yad Faeq, Manaal Faruqui, Amir Globerson, Fréderic Godin, Edward Grefenstette, Matthew Honnibal, Dirk Hovy, Moshe Koppel, Angeliki Lazaridou, Tal Linzen, Thang Luong, Chris Manning, Stephen Merity, Paul Michel, Margaret Mitchell, Piero Molino, Graham Neubig, Joakim Nivre, Brendan O'Connor, Nikos Pappas, Fernando Pereira, Barbara Plank, Ana-Maria Popescu, Delip Rao, Tim Rocktäschel, Dan Roth, Alexander Rush, Naomi Saphra, Djamé Seddah, Erel Segal-Halevi, Avi Shmidman, Shaltiel Shmidman, Noah Smith, Anders Søgaard, Abe Stanway, Emma Strubell, Sandeep Subramanian, Liling Tan, Reut Tsarfaty, Peter Turney, Tim Vieira, Oriol Vinyals, Andreas Vlachos, Wenpeng Yin, and Torsten Zesch.

The list excludes, of course, the very many researchers I've communicated with through their academic writings on the topic. The book also benefited a lot from—and was shaped by—my interaction with the Natural Language Processing Group at Bar-Ilan University (and its soft extensions): Yossi Adi, Roee Aharoni, Oded Avraham, Ido Dagan, Jessica Ficler, Jacob Goldberger, Hila Gonen, Joseph Keshet, Eliyahu Kiperwasser, Ron Konigsberg, Omer Levy, Oren Melamud, Gabriel Stanovsky, Ori Shapira, Micah Shlain, Vered Shwartz, Hillel Taub-Tabib, and Rachel Wities. Most of them belong in both lists, but I tried to keep things short.

The anonymous reviewers of the book and the survey paper—while unnamed (and sometimes annoying)—provided a solid set of comments, suggestions, and corrections, which I can safely say dramatically improved many aspects of the final product. Thanks, whoever you are! And thanks also to Graeme Hirst, Michael Morgan, Samantha Draper, and C.L. Tondo for orchestrating the effort. As usual, all mistakes are of course my own. Do let me know if you find any, though, and be listed in the next edition if one is ever made.
Finally, I would like to thank my wife, Noa, who was patient and supportive when I disappeared into writing sprees, my parents Esther and Avner and brother Nadav who were in many cases more excited about the idea of me writing a book than I was, and the staff at The Streets Cafe (King George branch) and Shne'or Cafe who kept me well fed and served me drinks throughout the writing process, with only very minimal distractions.

Yoav Goldberg
March 2017

CHAPTER 1
Introduction

1.1 THE CHALLENGES OF NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is the field of designing methods and algorithms that take as input or produce as output unstructured, natural language data. Human language is highly ambiguous (consider the sentence I ate pizza with friends, and compare it to I ate pizza with olives), and also highly variable (the core message of I ate pizza with friends can also be expressed as friends and I shared some pizza). It is also ever changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language.

Understanding and producing language using computers is thus highly challenging. Indeed, the best known set of methods for dealing with language data rely on supervised machine learning algorithms that attempt to infer usage patterns and regularities from a set of pre-annotated input and output pairs. Consider, for example, the task of classifying a document into one of four categories: Sports, Politics, Gossip, and Economy. Obviously, the words in the documents provide very strong hints, but which words provide what hints? Writing up rules for this task is rather challenging. However, readers can easily categorize a document into its topic, and then, based on a few hundred human-categorized examples in each category, let a supervised machine learning algorithm come up with the patterns of word usage that help categorize the documents. Machine learning methods excel at problem domains where a good set of rules is very hard to define but annotating the expected output for a given input is relatively simple.

Besides the challenges of dealing with ambiguous and variable inputs in a system with an ill-defined and unspecified set of rules, natural language exhibits an additional set of properties that make it even more challenging for computational approaches, including machine learning: it is discrete, compositional, and sparse.

Language is symbolic and discrete. The basic elements of written language are characters. Characters form words that in turn denote objects, concepts, events, actions, and ideas. Both characters and words are discrete symbols: words such as "hamburger" or "pizza" each evoke in us a certain mental representation, but they are also distinct symbols, whose meaning is external to them and left to be interpreted in our heads. There is no inherent relation between "hamburger" and "pizza" that can be inferred from the symbols themselves, or from the individual letters they are made of.
Compare that to concepts such as color, prevalent in machine vision, or acoustic signals: these concepts are continuous, allowing, for example, to move from a colorful image to a gray-scale one using a simple mathematical operation, or to compare two different colors based on inherent properties such as hue and intensity. This cannot be easily done with words—there is no simple operation that will allow us to move from the word "red" to the word "pink" without using a large lookup table or a dictionary.

Language is also compositional: letters form words, and words form phrases and sentences. The meaning of a phrase can be larger than the meaning of the individual words that comprise it, and follows a set of intricate rules. In order to interpret a text, we thus need to work beyond the level of letters and words, and look at long sequences of words such as sentences, or even complete documents.

The combination of the above properties leads to data sparseness. The way in which words (discrete symbols) can be combined to form meanings is practically infinite. The number of possible valid sentences is tremendous: we could never hope to enumerate all of them. Open a random book, and the vast majority of sentences within it you have not seen or heard before. Moreover, it is likely that many sequences of four words that appear in the book are also novel to you. If you were to look at a newspaper from just 10 years ago, or imagine one 10 years in the future, many of the words, in particular names of persons, brands, and corporations, but also slang words and technical terms, will be novel as well. There is no clear way of generalizing from one sentence to another, or defining the similarity between sentences, that does not depend on their meaning—which is unobserved to us. This is very challenging when we come to learn from examples: even with a huge example set we are very likely to observe events that never occurred in the example set, and that are very different from all the examples that did occur in it.

1.2 NEURAL NETWORKS AND DEEP LEARNING

Deep learning is a branch of machine learning. It is a re-branded name for neural networks—a family of learning techniques that was historically inspired by the way computation works in the brain, and which can be characterized as learning of parameterized differentiable mathematical functions.¹ The name deep learning stems from the fact that many layers of these differentiable functions are often chained together.

¹ In this book we take the mathematical view rather than the brain-inspired view.

While all of machine learning can be characterized as learning to make predictions based on past observations, deep learning approaches work by learning to not only predict but also to correctly represent the data, such that it is suitable for prediction. Given a large set of desired input-output mappings, deep learning approaches work by feeding the data into a network that produces successive transformations of the input data until a final transformation predicts the output. The transformations produced by the network are learned from the given input-output mappings, such that each transformation makes it easier to relate the data to the desired label.

While the human designer is in charge of designing the network architecture and training regime, providing the network with a proper set of input-output examples, and encoding the input data in a suitable way, a lot of the heavy lifting of learning the correct representation is performed automatically by the network, supported by the network's architecture.
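To make the idea of "successive transformations" slightly more concrete, here is a deliberately tiny sketch in Python. It is illustrative only: the layer sizes, the random parameter values, and the variable names are made up, and no training is shown. The input is passed through two nonlinear transformations and a final linear transformation that produces the prediction; because every step is differentiable in its parameters, the whole chain can be trained with the gradient-based methods introduced in Chapters 2 and 5.

```python
import numpy as np

# Toy sketch only: all parameter shapes and values are invented for illustration.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(2, 4)), np.zeros(2)

def predict(x):
    h1 = np.tanh(W1 @ x + b1)   # first transformation of the raw input
    h2 = np.tanh(W2 @ h1 + b2)  # a further ("deeper") transformation of the representation
    return W3 @ h2 + b3         # final transformation produces the prediction (e.g., two class scores)

x = np.array([0.5, -1.0, 2.0])  # a made-up 3-dimensional input
print(predict(x))
```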
1.3 DEEP LEARNING IN NLP

Neural networks provide a powerful learning machinery that is very appealing for use in natural language problems. A major component in neural networks for language is the use of an embedding layer, a mapping of discrete symbols to continuous vectors in a relatively low dimensional space. When embedding words, they transform from being isolated distinct symbols into mathematical objects that can be operated on. In particular, distance between vectors can be equated to distance between words, making it easier to generalize the behavior from one word to another. This representation of words as vectors is learned by the network as part of the training process. Going up the hierarchy, the network also learns to combine word vectors in a way that is useful for prediction. This capability alleviates to some extent the discreteness and data-sparsity problems.

There are two major kinds of neural network architectures, which can be combined in various ways: feed-forward networks and recurrent/recursive networks.

Feed-forward networks, in particular multi-layer perceptrons (MLPs), allow working with fixed-sized inputs, or with variable-length inputs in which we can disregard the order of the elements. When feeding the network with a set of input components, it learns to combine them in a meaningful way. MLPs can be used whenever a linear model was previously used. The nonlinearity of the network, as well as the ability to easily integrate pre-trained word embeddings, often leads to superior classification accuracy.

Convolutional feed-forward networks are specialized architectures that excel at extracting local patterns in the data: they are fed arbitrarily sized inputs, and are capable of extracting meaningful local patterns that are sensitive to word order, regardless of where they appear in the input. These work very well for identifying indicative phrases or idioms of up to a fixed length in long sentences or documents.

Recurrent neural networks (RNNs) are specialized models for sequential data. These are network components that take as input a sequence of items, and produce a fixed-size vector that summarizes that sequence. As "summarizing a sequence" means different things for different tasks (i.e., the information needed to answer a question about the sentiment of a sentence is different from the information needed to answer a question about its grammaticality), recurrent networks are rarely used as standalone components; their power is in being trainable components that can be fed into other network components and trained to work in tandem with them. For example, the output of a recurrent network can be fed into a feed-forward network that will try to predict some value. The recurrent network is used as an input transformer that is trained to produce informative representations for the feed-forward network that operates on top of it. Recurrent networks are very impressive models for sequences, and are arguably the most exciting offer of neural networks for language processing. They allow abandoning the Markov assumption that was prevalent in NLP for decades, and designing models that can condition on entire sentences, while taking word order into account when it is needed, and without suffering much from statistical estimation problems stemming from data sparsity. This capability leads to impressive gains in language modeling, the task of predicting the probability of the next word in a sequence (or, equivalently, the probability of a sequence), which is a cornerstone of many NLP applications.
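To spell out what is being relaxed here: a language model assigns a probability to a sentence w_1, ..., w_n using the chain rule, and a traditional k-th order Markov model approximates each conditioning history by its last k words (with padding symbols at the start of the sequence). The notation below is a standard sketch rather than a quotation from the text; Chapter 9 develops this properly.

```latex
P(w_1,\ldots,w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1,\ldots,w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-k},\ldots,w_{i-1})
```

An RNN-based model can, in principle, keep the full history on the left-hand side instead of truncating it, which is what "conditioning on entire sentences" refers to.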
Recursive networks extend recurrent networks from sequences to trees.

Many of the problems in natural language are structured, requiring the production of complex output structures such as sequences or trees, and neural network models can accommodate that need as well, either by adapting known structured-prediction algorithms for linear models, or by using novel architectures such as sequence-to-sequence (encoder-decoder) models, which we refer to in this book as conditioned-generation models. Such models are at the heart of state-of-the-art machine translation.

Finally, many language prediction tasks are related to each other, in the sense that knowing how to perform one of them will help in learning to perform the others. In addition, while we may have a shortage of supervised (labeled) training data, we have an ample supply of raw text (unlabeled data). Can we learn from related tasks and un-annotated data? Neural network approaches provide exciting opportunities for both MTL (learning from related problems) and semi-supervised learning (learning from external, unannotated data).

1.3.1 SUCCESS STORIES

Fully connected feed-forward neural networks (MLPs) can, for the most part, be used as a drop-in replacement wherever a linear learner is used. This includes binary and multi-class classification problems, as well as more complex structured prediction problems. The nonlinearity of the network, as well as the ability to easily integrate pre-trained word embeddings, often leads to superior classification accuracy. A series of works² managed to obtain improved syntactic parsing results by simply replacing the linear model of a parser with a fully connected feed-forward network. Straightforward applications of a feed-forward network as a classifier replacement (usually coupled with the use of pre-trained word vectors) provide benefits for many language tasks, including the very basic task of language modeling,³ CCG supertagging,⁴ dialog state tracking,⁵ and pre-ordering for statistical machine translation.⁶ Iyyer et al. [2015] demonstrate that multi-layer feed-forward networks can provide competitive results on sentiment classification and factoid question answering. Zhou et al. [2015] and Andor et al. [2016] integrate them in a beam-search structured-prediction system, achieving stellar accuracies on syntactic parsing, sequence tagging, and other tasks.

² [Chen and Manning, 2014, Durrett and Klein, 2015, Pei et al., 2015, Weiss et al., 2015]
³ See Chapter 9, as well as Bengio et al. [2003], Vaswani et al. [2013].
⁴ [Lewis and Steedman, 2014]
⁵ [Henderson et al., 2013]
⁶ [de Gispert et al., 2015]

Networks with convolutional and pooling layers are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input. For example, in a document classification task, a single key phrase (or an ngram) can help in determining the topic of the document [Johnson and Zhang, 2015]. We would like to learn that certain sequences of words are good indicators of the topic, and do not necessarily care where they appear in the document. Convolutional and pooling layers allow the model to learn to find such local indicators, regardless of their position.
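For intuition, here is a deliberately small sketch of the convolution-and-pooling idea. It is illustrative only: the dimensions, the single random filter, and the random word vectors are all made up, and real models use many learned filters and a nonlinearity, as described in Chapter 13.

```python
import numpy as np

rng = np.random.default_rng(1)
emb_dim, k, n_words = 5, 3, 8
words = rng.normal(size=(n_words, emb_dim))      # stand-in word vectors for an 8-word text
filt = rng.normal(size=(k * emb_dim,))           # a single filter over windows of 3 consecutive words

# Score every 3-word window with the same filter (an "ngram detector"), then max-pool:
# the summary keeps the best-matching window, wherever in the text it appears.
windows = [words[i:i + k].reshape(-1) for i in range(n_words - k + 1)]
scores = np.array([w @ filt for w in windows])
pooled = scores.max()
print(np.round(scores, 2), round(float(pooled), 2))
```

The max over window scores is the same whether the best-matching window occurs at the start, middle, or end of the text, which is exactly the position-independence described above.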
Convolutional and pooling architectures show promising results on many tasks, including document classification,⁷ short-text categorization,⁸ sentiment classification,⁹ relation-type classification between entities,¹⁰ event detection,¹¹ paraphrase identification,¹² semantic role labeling,¹³ question answering,¹⁴ predicting box-office revenues of movies based on critic reviews,¹⁵ modeling text interestingness,¹⁶ and modeling the relation between character-sequences and part-of-speech tags.¹⁷

In natural language we often work with structured data of arbitrary sizes, such as sequences and trees. We would like to be able to capture regularities in such structures, or to model similarities between such structures. Recurrent and recursive architectures allow working with sequences and trees while preserving a lot of the structural information. Recurrent networks [Elman, 1990] are designed to model sequences, while recursive networks [Goller and Küchler, 1996] are generalizations of recurrent networks that can handle trees. Recurrent models have been shown to produce very strong results for language modeling,¹⁸ as well as for sequence tagging,¹⁹ machine translation,²⁰ parsing,²¹ and many other tasks including noisy text normalization,²² dialog state tracking,²³ response generation,²⁴ and modeling the relation between character sequences and part-of-speech tags.²⁵

⁷ [Johnson and Zhang, 2015]
⁸ [Wang et al., 2015a]
⁹ [Kalchbrenner et al., 2014, Kim, 2014]
¹⁰ [dos Santos et al., 2015, Zeng et al., 2014]
¹¹ [Chen et al., 2015, Nguyen and Grishman, 2015]
¹² [Yin and Schütze, 2015]
¹³ [Collobert et al., 2011]
¹⁴ [Dong et al., 2015]
¹⁵ [Bitvai and Cohn, 2015]
¹⁶ [Gao et al., 2014]
¹⁷ [dos Santos and Zadrozny, 2014]
¹⁸ Some notable works are Adel et al. [2013], Auli and Gao [2014], Auli et al. [2013], Duh et al. [2013], Jozefowicz et al. [2016], Mikolov [2012], Mikolov et al. [2010, 2011].
¹⁹ [Irsoy and Cardie, 2014, Ling et al., 2015b, Xu et al., 2015]
²⁰ [Cho et al., 2014b, Sundermeyer et al., 2014, Sutskever et al., 2014, Tamura et al., 2014]
²¹ [Dyer et al., 2015, Kiperwasser and Goldberg, 2016b, Watanabe and Sumita, 2015]
²² [Chrupala, 2014]
²³ [Mrkšić et al., 2015]
²⁴ [Kannan et al., 2016, Sordoni et al., 2015]
²⁵ [Ling et al., 2015b]

Recursive models were shown to produce state-of-the-art or near state-of-the-art results for constituency²⁶ and dependency²⁷ parse re-ranking, discourse parsing,²⁸ semantic relation classification,²⁹ political ideology detection based on parse trees,³⁰ sentiment classification,³¹ target-dependent sentiment classification,³² and question answering.³³

²⁶ [Socher et al., 2013a]
²⁷ [Le and Zuidema, 2014, Zhu et al., 2015a]
²⁸ [Li et al., 2014]
²⁹ [Hashimoto et al., 2013, Liu et al., 2015]
³⁰ [Iyyer et al., 2014b]
³¹ [Hermann and Blunsom, 2013, Socher et al., 2013b]
³² [Dong et al., 2014]
³³ [Iyyer et al., 2014a]

1.4 COVERAGE AND ORGANIZATION

The book consists of four parts. Part I introduces the basic learning machinery we'll be using throughout the book: supervised learning, MLPs, gradient-based training, and the computation-graph abstraction for implementing and training neural networks. Part II connects the machinery introduced in the first part with language data. It introduces the main sources of information that are available when working with language data, and explains how to integrate them with the neural networks machinery. It also discusses word-embedding algorithms and the distributional hypothesis, and feed-forward approaches to language modeling.
Part III deals with specialized architectures and their applications to language data: 1D convolutional networks for working with ngrams, and RNNs for modeling sequences and stacks. RNNs are the main innovation of the application of neural networks to language data, and most of Part III is devoted to them, including the powerful conditioned-generation framework they facilitate, and attention-based models. Part IV is a collection of various advanced topics: recursive networks for modeling trees, structured prediction models, and multi-task learning. Part I, covering the basics of neural networks, consists of four chapters. Chapter 2 introduces the basic concepts of supervised machine learning, parameterized functions, linear and log-linear models, regularization and loss functions, training as optimization, and gradient-based training methods. It starts from the ground up, and provides the needed material for the following chapters. Readers familiar with basic learning theory and gradient-based learning may consider skipping this chapter. Chapter 3 spells out the major limitation of linear models, motivates the need for nonlinear models, and lays the ground and motivation for multi-layer neural networks. Chapter 4 introduces feed-forward neural networks and the MLPs. It discusses the definition of multi-layer networks, their theoretical power, and common subcomponents such as nonlinearities and loss functions. Chapter 5 deals with neural network training. It introduces the computationgraph abstraction that allows for automatic gradient computations for arbitrary networks (the back-propagation algorithm), and provides several important tips and tricks for effective network training. ²⁶[Socher et al., 2013a] ²⁷[Le and Zuidema, 2014, Zhu et al., 2015a] ²⁸[Li et al., 2014] ²⁹[Hashimoto et al., 2013, Liu et al., 2015] ³⁰[Iyyer et al., 2014b] ³¹[Hermann and Blunsom, 2013, Socher et al., 2013b] ³²[Dong et al., 2014] ³³[Iyyer et al., 2014a] 1.4. COVERAGE AND ORGANIZATION 7 Part II introducing language data, consists of seven chapters. Chapter 6 presents a typology of common language-processing problems, and discusses the available sources of information (features) available for us when using language data. Chapter 7 provides concrete case studies, showing how the features described in the previous chapter are used for various natural language tasks. Readers familiar with language processing can skip these two chapters. Chapter 8 connects the material of Chapters 6 and 7 with neural networks, and discusses the various ways of encoding language-based features as inputs for neural networks. Chapter 9 introduces the language modeling task, and the feed-forward neural language model architecture. is also paves the way for discussing pre-trained word embeddings in the following chapters. Chapter 10 discusses distributed and distributional approaches to word-meaning representations. It introduces the word-context matrix approach to distributional semantics, as well as neural language-modeling inspired wordembedding algorithms, such as GV and W2V, and discusses the connection between them and the distributional methods. Chapter 11 deals with using word embeddings outside of the context of neural networks. Finally, Chapter 12 presents a case study of a task-specific feedforward network that is tailored for the Natural Language Inference task. Part III introducing the specialized convolutional and recurrent architectures, consists of five chapters. 
Chapter 13 deals with convolutional networks, that are specialized at learning informative ngram patterns. e alternative hash-kernel technique is also discussed. e rest of this part, Chapters 14–17, is devoted to RNNs. Chapter 14 describes the RNN abstraction for modeling sequences and stacks. Chapter 15 describes concrete instantiations of RNNs, including the Simple RNN (also known as Elman RNNs) and gated architectures such as the Long Shortterm Memory (LSTM) and the Gated Recurrent Unit (GRU). Chapter 16 provides examples of modeling with the RNN abstraction, showing their use within concrete applications. Finally, Chapter 17 introduces the conditioned-generation framework, which is the main modeling technique behind state-of-the-art machine translation, as well as unsupervised sentence modeling and many other innovative applications. Part IV is a mix of advanced and non-core topics, and consists of three chapters. Chapter 18 introduces tree-structured recursive networks for modeling trees. While very appealing, this family of models is still in research stage, and is yet to show a convincing success story. Nonetheless, it is an important family of models to know for researchers who aim to push modeling techniques beyond the state-of-the-art. Readers who are mostly interested in mature and robust techniques can safely skip this chapter. Chapter 19 deals with structured prediction. It is a rather technical chapter. Readers who are particularly interested in structured prediction, or who are already familiar with structured prediction techniques for linear models or for language processing, will likely appreciate the material. Others may rather safely skip it. Finally, Chapter 20 presents multitask and semi-supervised learning. Neural networks provide ample opportunities for multi-task and semi-supervised learning. ese are important techniques, that are still at the research stage. However, the existing techniques are relatively easy to implement, and do provide real gains. e chapter is not technically challenging, and is recommended to all readers. 8 1. INTRODUCTION Dependencies For the most part, chapters, depend on the chapters that precede them. An exception are the first two chapters of Part II, which do not depend on material in previous chapters and can be read in any order. Some chapters and sections can be skipped without impacting the understanding of other concepts or material. ese include Section 10.4 and Chapter 11 that deal with the details of word embedding algorithms and the use of word embeddings outside of neural networks; Chapter 12, describing a specific architecture for attacking the Stanford Natural Language Inference (SNLI) dataset; and Chapter 13 describing convolutional networks. Within the sequence on recurrent networks, Chapter 15, dealing with the details of specific architectures, can also be relatively safely skipped. e chapters in Part IV are for the most part independent of each other, and can be either skipped or read in any order. 1.5 WHAT’S NOT COVERED e focus is on applications of neural networks to language processing tasks. However, some subareas of language processing with neural networks were deliberately left out of scope of this book. Specifically, I focus on processing written language, and do not cover working with speech data or acoustic signals. 
Within written language, I remain relatively close to the lower level, relatively well-defined tasks, and do not cover areas such as dialog systems, document summarization, or question answering, which I consider to be vastly open problems. While the described techniques can be used to achieve progress on these tasks, I do not provide examples or explicitly discuss these tasks directly. Semantic parsing is similarly out of scope. Multi-modal applications, connecting language data with other modalities such as vision or databases are only very briefly mentioned. Finally, the discussion is mostly English-centric, and languages with richer morphological systems and fewer computational resources are only very briefly discussed. Some important basics are also not discussed. Specifically, two crucial aspects of good work in language processing are proper evaluation and data annotation. Both of these topics are left outside the scope of this book, but the reader should be aware of their existence. Proper evaluation includes the choice of the right metrics for evaluating performance on a given task, best practices, fair comparison with other work, performing error analysis, and assessing statistical significance. Data annotation is the bread-and-butter of NLP systems. Without data, we cannot train supervised models. As researchers, we very often just use “standard” annotated data produced by someone else. It is still important to know the source of the data, and consider the implications resulting from its creation process. Data annotation is a very vast topic, including proper formulation of the annotation task; developing the annotation guidelines; deciding on the source of annotated data, its coverage and class proportions, good train-test splits; and working with annotators, consolidating decisions, validating quality of annotators and annotation, and various similar topics. 1.6 A NOTE ON TERMINOLOGY 1.6. A NOTE ON TERMINOLOGY 9 e word “feature” is used to refer to a concrete, linguistic input such as a word, a suffix, or a partof-speech tag. For example, in a first-order part-of-speech tagger, the features might be “current word, previous word, next word, previous part of speech.” e term “input vector” is used to refer to the actual input that is fed to the neural network classifier. Similarly, “input vector entry” refers to a specific value of the input. is is in contrast to a lot of the neural networks literature in which the word “feature” is overloaded between the two uses, and is used primarily to refer to an input-vector entry. 1.7 MATHEMATICAL NOTATION We use bold uppercase letters to represent matrices (X , Y , Z ), and bold lowercase letters to represent vectors (b). When there are series of related matrices and vectors (for example, where each matrix corresponds to a different layer in the network), superscript indices are used (W 1, W 2). For the rare cases in which we want indicate the power of a matrix or a vector, a pair of brackets is added around the item to be exponentiated: .W /2; .W 3/2. We use Œ as the index operator of vectors and matrices: bŒi is the i th element of vector b, and W Œi;j  is the element in the ith column and j th row of matrix W . When unambiguous, we sometimes adopt the more standard mathematical notation and use bi to indicate the ith element of vector b, and sPimilarly wi;Pj for elements of a matrix W . We use to denote the dot-product operator: w v D i wi vi D i wŒivŒi. 
We use x1Wn to indicate a sequence of vectors x1; : : : ; xn, and similarly x1Wn is the sequence of items x1; : : : ; xn. We use xnW1 to indicate the reverse sequence. x1WnŒi  D xi , xnW1Œi  D xn iC1. We use Œv1I v2 to denote vector concatenation. While somewhat unorthodox, unless otherwise stated, vectors are assumed to be row vectors. e choice to use row vectors, which are right multiplied by matrices (xW C b), is somewhat non standard—a lot of the neural networks literature use column vectors that are left multiplied by matrices (W x C b). We trust the reader to be able to adapt to the column vectors notation when reading the literature.³⁴ ³⁴e choice to use the row vectors notation was inspired by the following benefits: it matches the way input vectors and network diagrams are often drawn in the literature; it makes the hierarchical/layered structure of the network more transparent and puts the input as the left-most variable rather than being nested; it results in fully connected layer dimensions being din dout rather than dout din; and it maps better to the way networks are implemented in code using matrix libraries such as numpy. PART I Supervised Classification and Feed-forward Neural Networks 13 CHAPTER 2 Learning Basics and Linear Models Neural networks, the topic of this book, are a class of supervised machine learning algorithms. is chapter provides a quick introduction to supervised machine learning terminology and practices, and introduces linear and log-linear models for binary and multi-class classification. e chapter also sets the stage and notation for later chapters. Readers who are familiar with linear models can skip ahead to the next chapters, but may also benefit from reading Sections 2.4 and 2.5. Supervised machine learning theory and linear models are very large topics, and this chapter is far from being comprehensive. For a more complete treatment the reader is referred to texts such as Daumé III [2015], Shalev-Shwartz and Ben-David [2014], and Mohri et al. [2012]. 2.1 SUPERVISED LEARNING AND PARAMETERIZED FUNCTIONS e essence of supervised machine learning is the creation of mechanisms that can look at examples and produce generalizations. More concretely, rather than designing an algorithm to perform a task (“distinguish spam from non-spam email”), we design an algorithm whose input is a set of labeled examples (“is pile of emails are spam. is other pile of emails are not spam.”), and its output is a function (or a program) that receives an instance (an email) and produces the desired label (spam or not-spam). It is expected that the resulting function will produce correct label predictions also for instances it has not seen during training. As searching over the set of all possible programs (or all possible functions) is a very hard (and rather ill-defined) problem, we often restrict ourselves to search over specific families of functions, e.g., the space of all linear functions with din inputs and dout outputs, or the space of all decision trees over din variables. Such families of functions are called hypothesis classes. By restricting ourselves to a specific hypothesis class, we are injecting the learner with inductive bias—a set of assumptions about the form of the desired solution, as well as facilitating efficient procedures for searching for the solution. For a broad and readable overview of the main families of learning algorithms and the assumptions behind them, see the book by Domingos [2015]. 
e hypothesis class also determines what can and cannot be represented by the learner. One common hypothesis class is that of high-dimensional linear function, i.e., functions of the 14 2. LEARNING BASICS AND LINEAR MODELS form:¹ f .x/ D x W C b (2.1) x 2 Rdin W 2 Rdin dout b 2 Rdout : Here, the vector x is the input to the function, while the matrix W and the vector b are the parameters. e goal of the learner is to set the values of the parameters W and b such that the function behaves as intended on a collection of input values x1Wk D x1; : : : ; xk and the corresponding desired outputs y1Wk D yi ; : : : ; yk. e task of searching over the space of functions is thus reduced to one of searching over the space of parameters. It is common to refer to parameters of the function as ‚. For the linear model case, ‚ D W ; b. In some cases we want the notation to make the parameterization explicit, in which case we include the parameters in the function’s definition: f .xI W ; b/ D x W C b. As we will see in the coming chapters, the hypothesis class of linear functions is rather restricted, and there are many functions that it cannot represent (indeed, it is limited to linear relations). In contrast, feed-forward neural networks with hidden layers, to be discussed in Chapter 4, are also parameterized functions, but constitute a very strong hypothesis class—they are universal approximators, capable of representing any Borel-measurable function.² However, while restricted, linear models have several desired properties: they are easy and efficient to train, they often result in convex optimization objectives, the trained models are somewhat interpretable, and they are often very effective in practice. Linear and log-linear models were the dominant approaches in statistical NLP for over a decade. Moreover, they serve as the basic building blocks for the more powerful nonlinear feed-forward networks which will be discussed in later chapters. 2.2 TRAIN, TEST, AND VALIDATION SETS Before delving into the details of linear models, let’s reconsider the general setup of the machine learning problem. We are faced with a dataset of k input examples x1Wk and their corresponding gold labels y1Wk, and our goal is to produce a function f .x/ that correctly maps inputs x to outputs yO , as evidenced by the training set. How do we know that the produced function f ./ is indeed a good one? One could run the training examples x1Wk through f ./, record the answers yO 1Wk, compare them to the expected labels y1Wk, and measure the accuracy. However, this process will not be very informative—our main concern is the ability of f ./ to generalize well to unseen examples. A function f ./ that is implemented as a lookup table, that is, looking for the input x in its memory and returning the corresponding value y for instances is has seen and a random value otherwise, will get a perfect score on this test, yet is clearly not a good classification function as it has zero generalization ability. We rather have a function f ./ that gets some of the training examples wrong, providing that it will get unseen examples correctly. ¹As discussed in Section 1.7. is book takes a somewhat un-orthodox approach and assumes vectors are row vectors rather than column vectors. ²See further discussion in Section 4.3. 2.2. TRAIN, TEST, AND VALIDATION SETS 15 Leave-one out We must assess the trained function’s accuracy on instances it has not seen during training. 
One solution is to perform leave-one-out cross-validation: train k functions f1Wk, each time leaving out a different input example xi , and evaluating the resulting function fi ./ on its ability to predict xi . en train another function f ./ on the entire trainings set x1Wk. Assuming that the training set is a representative sample of the population, this percentage of functions fi ./ that produced correct prediction on the left-out samples is a good approximation of the accuracy of f ./ on new inputs. However, this process is very costly in terms of computation time, and is used only in cases where the number of annotated examples k is very small (less than a hundred or so). In language processing tasks, we very often encounter training sets with well over 105 examples. Held-out set A more efficient solution in terms of computation time is to split the training set into two subsets, say in a 80%/20% split, train a model on the larger subset (the training set), and test its accuracy on the smaller subset (the held-out set). is will give us a reasonable estimate on the accuracy of the trained function, or at least allow us to compare the quality of different trained models. However, it is somewhat wasteful in terms training samples. One could then re-train a model on the entire set. However, as the model is trained on substantially more data, the error estimates of the model trained on less data may not be accurate. is is generally a good problem to have, as more training data is likely to result in better rather than worse predictors.³ Some care must be taken when performing the split—in general it is better to shuffle the examples prior to splitting them, to ensure a balanced distribution of examples between the training and held-out sets (for example, you want the proportion of gold labels in the two sets to be similar). However, sometimes a random split is not a good option: consider the case where your input are news articles collected over several months, and your model is expected to provide predictions for new stories. Here, a random split will over-estimate the model’s quality: the training and held-out examples will be from the same time period, and hence on more similar stories, which will not be the case in practice. In such cases, you want to ensure that the training set has older news stories and the held-out set newer ones—to be as similar as possible to how the trained model will be used in practice. A three-way split e split into train and held-out sets works well if you train a single model and wants to assess its quality. However, in practice you often train several models, compare their quality, and select the best one. Here, the two-way split approach is insufficient—selecting the best model according to the held-out set’s accuracy will result in an overly optimistic estimate of the model’s quality. You don’t know if the chosen settings of the final classifier are good in general, or are just good for the particular examples in the held-out sets. e problem will be even worse if you perform error analysis based on the held-out set, and change the features or the architecture of the model based on the observed errors. You don’t know if your improvements based on the held- ³Note, however, that some setting in the training procedure, in particular the learning rate and regularization weight may be sensitive to the training set size, and tuning them based on some data and then re-training a model with the same settings on larger data may produce sub-optimal results. 
out sets will carry over to new instances. The accepted methodology is to use a three-way split of the data into train, validation (also called development), and test sets. This gives you two held-out sets: a validation set (also called development set), and a test set. All the experiments, tweaks, error analysis, and model selection should be performed based on the validation set. Then, a single run of the final model over the test set will give a good estimate of its expected quality on unseen examples. It is important to keep the test set as pristine as possible, running as few experiments as possible on it. Some even advocate that you should not even look at the examples in the test set, so as to not bias the way you design your model.

2.3 LINEAR MODELS
Now that we have established some methodology, we return to describe linear models for binary and multi-class classification.

2.3.1 BINARY CLASSIFICATION
In binary classification problems we have a single output, and thus use a restricted version of Equation (2.1) in which d_out = 1, making w a vector and b a scalar:

f(x) = x · w + b.   (2.2)

The range of the linear function in Equation (2.2) is [−∞, +∞]. In order to use it for binary classification, it is common to pass the output of f(x) through the sign function, mapping negative values to −1 (the negative class) and non-negative values to +1 (the positive class).

Consider the task of predicting which of two neighborhoods an apartment is located in, based on the apartment's price and size. Figure 2.1 shows a 2D plot of some apartments, where the x-axis denotes the monthly rent price in USD, while the y-axis is the size in square feet. The blue circles are for Dupont Circle, DC and the green crosses are in Fairfax, VA. It is evident from the plot that we can separate the two neighborhoods using a straight line—apartments in Dupont Circle tend to be more expensive than apartments in Fairfax of the same size.⁴ The dataset is linearly separable: the two classes can be separated by a straight line.

Figure 2.1: Housing data: rent price in USD (x-axis) vs. size in square ft. (y-axis). Data source: Craigslist ads, collected from June 7–15, 2015.

⁴Note that looking at either size or price alone would not allow us to cleanly separate the two groups.

Each data-point (an apartment) can be represented as a 2-dimensional (2D) vector x where x[0] is the apartment's size and x[1] is its price. We then get the following linear model:

ŷ = sign(f(x)) = sign(x · w + b) = sign(size · w₁ + price · w₂ + b),

where · is the dot-product operation, b and w = [w₁, w₂] are free parameters, and we predict Fairfax if ŷ ≥ 0 and Dupont Circle otherwise. The goal of learning is setting the values of w₁, w₂, and b such that the predictions are correct for all data-points we observe.⁵ We will discuss learning in Section 2.7, but for now consider that we expect the learning procedure to set a high value to w₁ and a low value to w₂. Once the model is trained, we can classify new data-points by feeding them into this equation (a code sketch of this decision rule is given below). It is sometimes not possible to separate the data-points using a straight line (or, in higher dimensions, a linear hyperplane)—such datasets are said to be nonlinearly separable, and are beyond the hypothesis class of linear classifiers.
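To make the decision rule concrete, the following is a minimal sketch of applying Equation (2.2) with the sign function to the apartment example. The particular values of w₁, w₂, and b are hand-picked assumptions for illustration only; in practice they would be set by the learning procedure discussed in Section 2.7.

```python
import numpy as np

# Hand-picked, illustrative parameters (assumptions for this sketch, not
# values learned from the data behind Figure 2.1).
w = np.array([0.05, -0.02])   # w1 weights the size, w2 weights the price
b = -10.0

def predict_neighborhood(size_sqft, price_usd):
    """Apply the linear decision rule y = sign(x . w + b) from Equation (2.2)."""
    x = np.array([size_sqft, price_usd])
    score = x.dot(w) + b
    # Following the text: predict Fairfax for non-negative scores,
    # Dupont Circle otherwise.
    return "Fairfax" if score >= 0 else "Dupont Circle"

print(predict_neighborhood(800, 1400))   # relatively cheap for its size
print(predict_neighborhood(800, 3200))   # same size, much higher rent
```

With these illustrative weights, the first apartment (800 sq. ft. at $1,400) falls on the Fairfax side of the line, while the second, pricier one does not.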
e solution would be to either move to a higher dimension (add more features), move to a richer hypothesis class, or allow for some mis-classification.⁶ ⁵Geometrically, for a given w the points x w C b D 0 define a hyperplane (which in two dimensions corresponds to a line) that separates the space into two regions. e goal of learning is then finding a hyperplane such that the classification induced by it is correct. ⁶Misclassifying some of the examples is sometimes a good idea. For example, if we have reason to believe some of the datapoints are outliers—examples that belong to one class, but are labeled by mistake as belonging to the other class. 18 2. LEARNING BASICS AND LINEAR MODELS Feature Representations In the example above, each data-point was a pair of size and price measurements. Each of these properties is considered a feature by which we classify the datapoint. is is very convenient, but in most cases the data-points are not given to us directly as lists of features, but as real-world objects. For example, in the apartments example we may be given a list of apartments to classify. We then need to make a concious decision and select the measurable properties of the apartments that we believe will be useful features for the classification task at hand. Here, it proved effective to focus on the price and the size. We could also look at additional properties, such as the number of rooms, the height of the ceiling, the type of floor, the geo-location coordinates, and so on. After deciding on a set of features, we create a feature extraction function that maps a real world object (i.e., an apartment) to a vector of measurable quantities (price and size) which can be used as inputs to our models. e choice of the features is crucial to the success of the classification accuracy, and is driven by the informativeness of the features, and their availability to us (the geo-location coordinates are much better predictors of the neighborhood than the price and size, but perhaps we only observe listings of past transactions, and do not have access to the geo-location information). When we have two features, it is easy to plot the data and see the underlying structures. However, as we see in the next example, we often use many more than just two features, making plotting and precise reasoning impractical. A central part in the design of linear models, which we mostly gloss over in this text, is the design of the feature function (so called feature engineering). One of the promises of deep learning is that it vastly simplifies the feature-engineering process by allowing the model designer to specify a small set of core, basic, or “natural” features, and letting the trainable neural network architecture combine them into more meaningful higher-level features, or representations. However, one still needs to specify a suitable set of core features, and tie them to a suitable architecture. We discuss common features for textual data in Chapters 6 . and 7. We usually have many more than two features. Moving to a language setup, consider the task of distinguishing documents written in English from documents written in German. It turns out that letter frequencies make for quite good predictors (features) for this task. Even more informative are counts of letter bigrams, i.e., pairs of consecutive letters.⁷ Assuming we have an alphabet of 28 letters (a–z, space, and a special symbol for all other characters including digits, punctuations, etc.) 
we represent a document as a 28 × 28 dimensional vector x ∈ R⁷⁸⁴, where each entry x[i] represents a count of a particular letter combination in the document, normalized by the document's length. For example, denoting by x_ab the entry of x corresponding to the letter-bigram ab:

x_ab = #ab / |D|,   (2.3)

where #ab is the number of times the bigram ab appears in the document, and |D| is the total number of bigrams in the document (the document's length).

⁷While one may think that words will also be good predictors, letters, or letter-bigrams are far more robust: we are likely to encounter a new document without any of the words we observed in the training set, while a document without any of the distinctive letter-bigrams is significantly less likely.

Figure 2.2: Character-bigram histograms for documents in English (left, blue) and German (right, green). Underscores denote spaces.

Figure 2.2 shows such bigram histograms for several German and English texts. For readability, we only show the top frequent character-bigrams (_a, _d, _s, _t, d_, de, e_, en, er, ie, in, n_, on, re, t_, th) and not the entire feature vectors. On the left, we see the bigrams of the English texts, and on the right of the German ones. There are clear patterns in the data, and, given the bigram histogram of a new item (not reproduced here), you could probably tell that it is more similar to the German group than to the English one. Note, however, that you couldn't use a single definite rule such as "if it has th it's English" or "if it has ie it's German": while German texts have considerably less th than English, the th may and does occur in German texts, and similarly the ie combination does occur in English. The decision requires weighting different factors relative to each other. Let's formalize the problem in a machine-learning setup.

We can again use a linear model:

ŷ = sign(f(x)) = sign(x · w + b) = sign(x_aa · w_aa + x_ab · w_ab + x_ac · w_ac + … + b).   (2.4)

A document will be considered English if f(x) ≥ 0 and as German otherwise. Intuitively, learning should assign large positive values to w entries associated with letter pairs that are much more common in English than in German (i.e., th), negative values to letter pairs that are much more common in German than in English (ie, en), and values around zero to letter pairs that are either common or rare in both languages.

Note that unlike the 2D case of the housing data (price vs. size), here we cannot easily visualize the points and the decision boundary, and the geometric intuition is likely much less clear. In general, it is difficult for most humans to think of the geometries of spaces with more than three dimensions, and it is advisable to think of linear models in terms of assigning weights to features, which is easier to imagine and reason about.

2.3.2 LOG-LINEAR BINARY CLASSIFICATION
The output f(x) is in the range [−∞, ∞], and we map it to one of two classes {−1, +1} using the sign function. This is a good fit if all we care about is the assigned class.
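To make Equations (2.3) and (2.4) concrete, here is a minimal sketch of extracting normalized letter-bigram features from a string and applying the hard sign decision rule. The 28-symbol alphabet follows the text; the weight vector is a random stand-in (an assumption of the sketch), where a real classifier would use learned weights.

```python
import numpy as np

# 28-symbol alphabet: a-z, space, and a catch-all for digits, punctuation, etc.
LETTERS = "abcdefghijklmnopqrstuvwxyz"
SPACE, OTHER = 26, 27
DIM = 28 * 28  # dimensionality of the bigram-count vector (784)

def sym(c):
    c = c.lower()
    if c in LETTERS:
        return LETTERS.index(c)
    return SPACE if c.isspace() else OTHER

def bigram_features(doc):
    """Normalized letter-bigram counts, Equation (2.3): x_ab = #ab / |D|."""
    x = np.zeros(DIM)
    for a, b in zip(doc, doc[1:]):
        x[sym(a) * 28 + sym(b)] += 1.0
    return x / max(len(doc) - 1, 1)   # |D| = number of bigrams in the document

# Linear classifier over these features, Equation (2.4).  Random stand-in
# weights; a trained model would supply w and b.
rng = np.random.default_rng(0)
w, b = rng.normal(size=DIM), 0.0

x = bigram_features("the quick brown fox jumps over the lazy dog")
print("English" if x.dot(w) + b >= 0 else "German")
```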
However, we may also be interested in the confidence of the decision, or the probability that the classifier assigns to the class. An alternative that facilitates this is to map the output instead to the range [0, 1], by pushing it through a squashing function such as the sigmoid σ(x) = 1/(1 + e⁻ˣ), resulting in:

ŷ = σ(f(x)) = 1 / (1 + e^(−(x·w+b))).   (2.5)

Figure 2.3 shows a plot of the sigmoid function. It is monotonically increasing, and maps values to the range [0, 1], with 0 being mapped to 1/2. When used with a suitable loss function (discussed in Section 2.7.1) the binary predictions made through the log-linear model can be interpreted as class membership probability estimates σ(f(x)) = P(ŷ = 1 | x) of x belonging to the positive class. We also get P(ŷ = 0 | x) = 1 − P(ŷ = 1 | x) = 1 − σ(f(x)). The closer the value is to 0 or 1 the more certain the model is in its class membership prediction, with the value of 0.5 indicating model uncertainty.

Figure 2.3: The sigmoid function σ(x).

2.3.3 MULTI-CLASS CLASSIFICATION
The previous examples were of binary classification, where we had two possible classes. Binary-classification cases exist, but most classification problems are of a multi-class nature, in which we should assign an example to one of k different classes. For example, we are given a document and asked to classify it into one of six possible languages: English, French, German, Italian, Spanish, Other. A possible solution is to consider six weight vectors w_E, w_F, … and biases, one for each language, and predict the language resulting in the highest score:⁸

ŷ = f(x) = argmax_{L ∈ {E,F,G,I,S,O}} x · w_L + b_L.   (2.6)

The six sets of parameters w_L ∈ R⁷⁸⁴, b_L can be arranged as a matrix W ∈ R⁷⁸⁴ˣ⁶ and vector b ∈ R⁶, and the equation re-written as:

ŷ = f(x) = xW + b
prediction = argmax_i ŷ[i].   (2.7)

Here ŷ ∈ R⁶ is a vector of the scores assigned by the model to each language, and we again determine the predicted language by taking the argmax over the entries of ŷ.

⁸There are many ways to model multi-class classification, including binary-to-multi-class reductions. These are beyond the scope of this book, but a good overview can be found in Allwein et al. [2000].

2.4 REPRESENTATIONS
Consider the vector ŷ resulting from applying Equation (2.7) of a trained model to a document. The vector can be considered as a representation of the document, capturing the properties of the document that are important to us, namely the scores of the different languages. The representation ŷ contains strictly more information than the prediction argmax_i ŷ[i]: for example, ŷ can be used to distinguish documents in which the main language is German, but which also contain a sizeable amount of French words. By clustering documents based on their vector representations as assigned by the model, we could perhaps discover documents written in regional dialects, or by multilingual authors.

The vectors x containing the normalized letter-bigram counts for the documents are also representations of the documents, arguably containing a similar kind of information to the vectors ŷ. However, the representation in ŷ is more compact (6 entries instead of 784) and more specialized for the language prediction objective (clustering by the vectors x would likely reveal document similarities that are not due to a particular mix of languages, but perhaps due to the document's topic or writing styles).
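The following sketch ties Equations (2.5)–(2.7) to code: a sigmoid-squashed binary score and a six-way language scorer whose score vector ŷ is exactly the kind of compact representation discussed above. All parameters are random stand-ins (assumptions for the sketch) in place of trained values.

```python
import numpy as np

LANGS = ["En", "Fr", "Gr", "It", "Sp", "O"]
DIM = 28 * 28  # bigram feature dimensionality from the running example

rng = np.random.default_rng(1)
W = rng.normal(size=(DIM, 6)) * 0.01   # multi-class parameters (stand-ins)
b = np.zeros(6)
w_bin = rng.normal(size=DIM) * 0.01    # binary log-linear parameters (stand-ins)
b_bin = 0.0

def sigmoid(z):
    """The squashing function used in Equation (2.5)."""
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(DIM) / DIM              # a fake normalized bigram-count vector

# Binary log-linear model, Equation (2.5): probability of the positive class.
p_positive = sigmoid(x.dot(w_bin) + b_bin)

# Multi-class linear model, Equation (2.7): six language scores, then argmax.
y_hat = x.dot(W) + b                   # also a 6-dimensional representation
prediction = LANGS[int(np.argmax(y_hat))]

print(round(float(p_positive), 3), np.round(y_hat, 4), prediction)
```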
The trained matrix W ∈ R⁷⁸⁴ˣ⁶ can also be considered as containing learned representations. As demonstrated in Figure 2.4, we can consider two views of W, as rows or as columns. Each of the 6 columns of W corresponds to a particular language, and can be taken to be a 784-dimensional vector representation of this language in terms of its characteristic letter-bigram patterns. We can then cluster the 6 language vectors according to their similarity. Similarly, each of the 784 rows of W corresponds to a particular letter-bigram, and provides a 6-dimensional vector representation of that bigram in terms of the languages it prompts.

Representations are central to deep learning. In fact, one could argue that the main power of deep-learning is the ability to learn good representations. In the linear case, the representations are interpretable, in the sense that we can assign a meaningful interpretation to each dimension in the representation vector (e.g., each dimension corresponds to a particular language or letter-bigram). This is in general not the case—deep learning models often learn a cascade of representations of the input that build on top of each other, in order to best model the problem at hand, and these representations are often not interpretable—we do not know which properties of the input they capture. However, they are still very useful for making predictions. Moreover, at the boundaries of the model, i.e., at the input and the output, we get representations that correspond to particular aspects of the input (i.e., a vector representation for each letter-bigram) or the output (i.e., a vector representation of each of the output classes). We will get back to this in Section 8.3 after discussing neural networks and encoding categorical features as dense vectors. It is recommended that you return to this discussion once more after reading that section.

Figure 2.4: Two views of the W matrix. (a) Each column corresponds to a language. (b) Each row corresponds to a letter bigram.

2.5 ONE-HOT AND DENSE VECTOR REPRESENTATIONS
The input vector x in our language classification example contains the normalized bigram counts in the document D. This vector can be decomposed into an average of |D| vectors, each corresponding to a particular document position i:

x = (1/|D|) Σ_{i=1}^{|D|} x_{D[i]};   (2.8)

here, D[i] is the bigram at document position i, and each vector x_{D[i]} ∈ R⁷⁸⁴ is a one-hot vector, in which all entries are zero except the single entry corresponding to the letter bigram D[i], which is 1. The resulting vector x is commonly referred to as an averaged bag of bigrams (more generally averaged bag of words, or just bag of words). Bag-of-words (BOW) representations contain information about the identities of all the "words" (here, bigrams) of the document, without considering their order. A one-hot representation can be considered as a bag-of-a-single-word.

The view of the rows of the matrix W as representations of the letter bigrams suggests an alternative way of computing the document representation vector ŷ in Equation (2.7).
Denoting by W_{D[i]} the row of W corresponding to the bigram D[i], we can take the representation of a document D to be the average of the representations of the letter-bigrams in the document:

ŷ = (1/|D|) Σ_{i=1}^{|D|} W_{D[i]}.   (2.9)

This representation is often called a continuous bag of words (CBOW), as it is composed of a sum of word representations, where each "word" representation is a low-dimensional, continuous vector.

We note that Equation (2.9) and the term xW in Equation (2.7) are equivalent. To see why, consider:

y = xW = ((1/|D|) Σ_{i=1}^{|D|} x_{D[i]}) W = (1/|D|) Σ_{i=1}^{|D|} (x_{D[i]} W) = (1/|D|) Σ_{i=1}^{|D|} W_{D[i]}.   (2.10)

In other words, the continuous-bag-of-words (CBOW) representation can be obtained either by summing word-representation vectors or by multiplying a bag-of-words vector by a matrix in which each row corresponds to a dense word representation (such matrices are also called embedding matrices). We will return to this point in Chapter 8 (in particular Section 8.3) when discussing feature representations in deep learning models for text.

2.6 LOG-LINEAR MULTI-CLASS CLASSIFICATION
In the binary case, we transformed the linear prediction into a probability estimate by passing it through the sigmoid function, resulting in a log-linear model. The analog for the multi-class case is passing the score vector through the softmax function:

softmax(x)[i] = e^{x[i]} / Σ_j e^{x[j]},   (2.11)

resulting in:

ŷ = softmax(xW + b)
ŷ[i] = e^{(xW+b)[i]} / Σ_j e^{(xW+b)[j]}.   (2.12)

The softmax transformation forces the values in ŷ to be positive and sum to 1, making them interpretable as a probability distribution.

2.7 TRAINING AS OPTIMIZATION
Recall that the input to a supervised learning algorithm is a training set of n training examples x_{1:n} = x₁, x₂, …, xₙ together with corresponding labels y_{1:n} = y₁, y₂, …, yₙ. Without loss of generality, we assume that the desired inputs and outputs are vectors: x_{1:n}, y_{1:n}.⁹

The goal of the algorithm is to return a function f() that accurately maps input examples to their desired labels, i.e., a function f() such that the predictions ŷ = f(x) over the training set are accurate. To make this more precise, we introduce the notion of a loss function, quantifying the loss suffered when predicting ŷ while the true label is y. Formally, a loss function L(ŷ, y) assigns a numerical score (a scalar) to a predicted output ŷ given the true expected output y. The loss function should be bounded from below, with the minimum attained only for cases where the prediction is correct.

The parameters of the learned function (the matrix W and the biases vector b) are then set in order to minimize the loss L over the training examples (usually, it is the sum of the losses over the different training examples that is being minimized). Concretely, given a labeled training set (x_{1:n}, y_{1:n}), a per-instance loss function L and a parameterized function f(x; Θ) we define the corpus-wide loss with respect to the parameters Θ as the average loss over all training examples:

L(Θ) = (1/n) Σ_{i=1}^{n} L(f(xᵢ; Θ), yᵢ).   (2.13)

In this view, the training examples are fixed, and the values of the parameters determine the loss. The goal of the training algorithm is then to set the values of the parameters Θ such that the value of L is minimized:

Θ̂ = argmin_Θ L(Θ) = argmin_Θ (1/n) Σ_{i=1}^{n} L(f(xᵢ; Θ), yᵢ).   (2.14)

Equation (2.14) attempts to minimize the loss at all costs, which may result in overfitting the training data.
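Before turning to regularization, here is a minimal sketch of the corpus-wide loss of Equation (2.13) for the linear model. The per-instance loss used here (squared error) and the synthetic data are assumptions made only to keep the example self-contained; the losses actually used for classification are discussed in Section 2.7.1.

```python
import numpy as np

def corpus_loss(f, theta, X, Y, loss_fn):
    """Average per-instance loss over the training set, Equation (2.13)."""
    return sum(loss_fn(f(x, theta), y) for x, y in zip(X, Y)) / len(X)

def linear(x, theta):
    """The linear model f(x; W, b) = xW + b."""
    W, b = theta
    return x.dot(W) + b

def squared_error(y_hat, y):
    # Stand-in per-instance loss (an assumption for this sketch).
    return float(np.sum((y_hat - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # 100 toy examples, 4 features
Y = rng.normal(size=(100, 3))                    # 3-dimensional "labels"
theta = (rng.normal(size=(4, 3)), np.zeros(3))   # parameters Theta = (W, b)

print(corpus_loss(linear, theta, X, Y, squared_error))
```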
To counter this tendency to overfit, we often pose soft restrictions on the form of the solution. This is done using a function R(Θ) taking as input the parameters and returning a scalar that reflects their "complexity," which we want to keep low. By adding R to the objective, the optimization problem needs to balance between low loss and low complexity:

Θ̂ = argmin_Θ [ (1/n) Σ_{i=1}^{n} L(f(xᵢ; Θ), yᵢ) + λR(Θ) ],   (2.15)

where the first term is the loss and the second term is the regularization. The function R is called a regularization term. Different combinations of loss functions and regularization criteria result in different learning algorithms, with different inductive biases. We now turn to discuss common loss functions (Section 2.7.1), followed by a discussion of regularization and regularizers (Section 2.7.2). Then, in Section 2.8 we present an algorithm for solving the minimization problem (Equation (2.15)).

⁹In many cases it is natural to think of the expected output as a scalar (class assignment) rather than a vector. In such cases, y is simply the corresponding one-hot vector, and argmax_i y[i] is the corresponding class assignment.

2.7.1 LOSS FUNCTIONS
The loss can be an arbitrary function mapping two vectors to a scalar. For practical purposes of optimization, we restrict ourselves to functions for which we can easily compute gradients (or subgradients).¹⁰ In most cases, it is sufficient and advisable to rely on a common loss function rather than defining your own. For a detailed discussion and theoretical treatment of loss functions for binary classification, see Zhang [2004]. We now discuss some loss functions that are commonly used with linear models and with neural networks in NLP.

¹⁰A gradient of a function with k variables is a collection of k partial derivatives, one according to each of the variables. Gradients are discussed further in Section 2.8.

Hinge (binary) For binary classification problems, the classifier's output is a single scalar ỹ and the intended output y is in {+1, −1}. The classification rule is ŷ = sign(ỹ), and a classification is considered correct if y · ỹ > 0, meaning that y and ỹ share the same sign. The hinge loss, also known as margin loss or SVM loss, is defined as:

L_hinge(binary)(ỹ, y) = max(0, 1 − y · ỹ).   (2.16)

The loss is 0 when y and ỹ share the same sign and |ỹ| ≥ 1. Otherwise, the loss is linear. In other words, the binary hinge loss attempts to achieve a correct classification, with a margin of at least 1.

Hinge (multi-class) The hinge loss was extended to the multi-class setting by Crammer and Singer [2002]. Let ŷ = ŷ[1], …, ŷ[n] be the classifier's output vector, and y be the one-hot vector for the correct output class. The classification rule is defined as selecting the class with the highest score:

prediction = argmax_i ŷ[i].   (2.17)

Denote by t = argmax_i y[i] the correct class, and by k = argmax_{i≠t} ŷ[i] the highest scoring class such that k ≠ t. The multi-class hinge loss is defined as:

L_hinge(multi-class)(ŷ, y) = max(0, 1 − (ŷ[t] − ŷ[k])).   (2.18)

The multi-class hinge loss attempts to score the correct class above all other classes with a margin of at least 1. Both the binary and multi-class hinge losses are intended to be used with linear outputs. The hinge losses are useful whenever we require a hard decision rule, and do not attempt to model class membership probability.
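Both hinge losses are straightforward to implement. The sketch below follows Equations (2.16) and (2.18) directly; the example inputs are made up for illustration.

```python
import numpy as np

def hinge_binary(y_tilde, y):
    """Binary hinge loss, Equation (2.16): max(0, 1 - y * y_tilde)."""
    return max(0.0, 1.0 - y * y_tilde)

def hinge_multiclass(y_hat, y):
    """Multi-class hinge loss, Equation (2.18).

    y_hat is the vector of scores; y is a one-hot vector of the correct class.
    """
    t = int(np.argmax(y))                 # correct class
    masked = y_hat.copy()
    masked[t] = -np.inf
    k = int(np.argmax(masked))            # highest-scoring incorrect class
    return max(0.0, 1.0 - (y_hat[t] - y_hat[k]))

print(hinge_binary(0.3, +1))              # correct sign, margin < 1  ->  0.7
print(hinge_multiclass(np.array([2.0, 0.5, 1.8]),
                       np.array([1.0, 0.0, 0.0])))   # margin 0.2  ->  0.8
```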
Log loss e log loss is a common variation of the hinge loss, which can be seen as a “soft” version of the hinge loss with an infinite margin [LeCun et al., 2006]: Llog .yO ; y/ D log.1 C exp. .yO Œt yO Œk//: (2.19) Binary cross entropy e binary cross-entropy loss, also referred to as logistic loss is used in binary classification with conditional probability outputs. We assume a set of two target classes labeled 0 and 1, with a correct label y 2 f0; 1g. e classifier’s output yQ is transformed using the sigmoid (also called the logistic) function .x/ D 1=.1 C e x/ to the range Œ0; 1, and is interpreted as the conditional probability yO D .yQ/ D P .y D 1jx/. e prediction rule is: ( prediction D 0 yO < 0:5 1 yO 0:5: e network is trained to maximize the log conditional probability log P .y D 1jx/ for each training example .x; y/. e logistic loss is defined as: Llogistic.yO; y/ D y log yO .1 y/ log.1 yO/: (2.20) e logistic loss is useful when we want the network to produce class conditional probability for a binary classification problem. When using the logistic loss, it is assumed that the output layer is transformed using the sigmoid function. Categorical cross-entropy loss e categorical cross-entropy loss (also referred to as negative log likelihood) is used when a probabilistic interpretation of the scores is desired. Let y D yŒ1; : : : ; yŒn be a vector representing the true multinomial distribution over the labels 1; : : : ; n,¹¹ and let yO D yO Œ1; : : : ; yO Œn be the linear classifier’s output, which was transformed by the softmax function (Section 2.6), and represent the class membership conditional distribu- tion yO Œi D P .y D ijx/. e categorical cross entropy loss measures the dissimilarity between the true label distribution y and the predicted label distribution yO , and is defined as cross entropy: X Lcross-entropy.yO ; y/ D yŒi log.yO Œi/: (2.21) i ¹¹is formulation assumes an instance can belong to several classes with some degree of certainty. 28 2. LEARNING BASICS AND LINEAR MODELS For hard-classification problems in which each training example has a single correct class assignment, y is a one-hot vector representing the true class. In such cases, the cross entropy can be simplified to: Lcross-entropy(hard classification).yO ; y/ D log.yO Œt/; (2.22) where t is the correct class assignment. is attempts to set the probability mass assigned to the correct class t to 1. Because the scores yO have been transformed using the softmax function to be non-negative and sum to one, increasing the mass assigned to the correct class means decreasing the mass assigned to all the other classes. e cross-entropy loss is very common in the log-linear models and the neural networks literature, and produces a multi-class classifier which does not only predict the one-best class label but also predicts a distribution over the possible labels. When using the cross-entropy loss, it is assumed that the classifier’s output is transformed using the softmax transformation. Ranking losses In some settings, we are not given supervision in term of labels, but rather as pairs of correct and incorrect items x and x0, and our goal is to score correct items above incorrect ones. Such training situations arise when we have only positive examples, and generate negative examples by corrupting a positive example. 
A useful loss in such scenarios is the margin-based ranking loss, defined for a pair of correct and incorrect examples: Lranking(margin).x; x0/ D max.0; 1 .f .x/ f .x0///; (2.23) where f .x/ is the score assigned by the classifier for input vector x. e objective is to score (rank) correct inputs over incorrect ones with a margin of at least 1. A common variation is to use the log version of the ranking loss: Lranking(log).x; x0/ D log.1 C exp. .f .x/ f .x0////: (2.24) Examples using the ranking hinge loss in language tasks include training with the auxiliary tasks used for deriving pre-trained word embeddings (see Section 10.4.2), in which we are given a correct word sequence and a corrupted word sequence, and our goal is to score the correct sequence above the corrupt one [Collobert and Weston, 2008]. Similarly, Van de Cruys [2014] used the ranking loss in a selectional-preferences task, in which the network was trained to rank correct verb-object pairs above incorrect, automatically derived ones, and Weston et al. [2013] trained a model to score correct (head, relation, tail) triplets above corrupted ones in an informationextraction setting. An example of using the ranking log loss can be found in Gao et al. [2014]. A variation of the ranking log loss allowing for a different margin for the negative and positive class is given in dos Santos et al. [2015]. 2.7. TRAINING AS OPTIMIZATION 29 2.7.2 REGULARIZATION Consider the optimization problem in Equation (2.14). It may admit multiple solutions, and, especially in higher dimensions, it can also over-fit. Consider our language identification example, and a setting in which one of the documents in the training set (call it xo) is an outlier: it is actually in German, but is labeled as French. In order to drive the loss down, the learner can identify features (letter bigrams) in xo that occur in only few other documents, and give them very strong weights toward the (incorrect) French class. en, for other German documents in which these features occur, which may now be mistakenly classified as French, the learner will find other German letter bigrams and will raise their weights in order for the documents to be classified as German again. is is a bad solution to the learning problem, as it learns something incorrect, and can cause test German documents which share many words with xo to be mistakenly classified as French. Intuitively, we would like to control for such cases by driving the learner away from such misguided solutions and toward more natural ones, in which it is OK to mis-classify a few examples if they don’t fit well with the rest. is is achieved by adding a regularization term R to the optimization objective, whose job is to control the complexity of the parameter value, and avoid cases of overfitting: ‚O D argmin L.‚/ C R.‚/ ‚ 1 Xn D argmin ‚ n i D1 L.f .xi I ‚/; yi/ C R.‚/: (2.25) e regularization term considers the parameter values, and scores their complexity. We then look for parameter values that have both a low loss and low complexity. A hyperparameter¹² is used to control the amount of regularization: do we favor simple model over low loss ones, or vice versa. e value of has to be set manually, based on the classification performance on a development set. While Equation (2.25) has a single regularization function and value for all the parameters, it is of course possible to have a different regularizer for each item in ‚. 
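As a concrete sketch of the regularized objective in Equation (2.25), the function below adds λ times a regularizer to the average training loss. The squared-L2 regularizer and the squared-error per-instance loss are stand-ins chosen to keep the example self-contained (the common regularizers are defined next).

```python
import numpy as np

def l2(W):
    """Sum of squared parameter values (the squared L2 norm, defined below)."""
    return float(np.sum(W ** 2))

def regularized_objective(f, W, b, X, Y, loss_fn, lam):
    """Equation (2.25): average training loss plus lambda * R(parameters)."""
    data_loss = sum(loss_fn(f(x, W, b), y) for x, y in zip(X, Y)) / len(X)
    return data_loss + lam * (l2(W) + l2(b))

# Toy instantiation: linear scores with a squared-error stand-in loss.
f = lambda x, W, b: x.dot(W) + b
loss = lambda y_hat, y: float(np.sum((y_hat - y) ** 2))

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 4)), rng.normal(size=(50, 3))
W, b = rng.normal(size=(4, 3)), np.zeros(3)

print(regularized_objective(f, W, b, X, Y, loss, lam=0.1))
```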
In practice, the regularizers R equate complexity with large weights, and work to keep the parameter values low. In particular, the regularizers R measure the norms of the parameter matrices, and drive the learner toward solutions with low norms. Common choices for R are the L2 norm, the L1 norm, and the elastic-net.

¹²A hyperparameter is a parameter of the model which is not learned as part of the optimization process, but needs to be set by hand.

L2 regularization In L2 regularization, R takes the form of the squared L2 norm of the parameters, trying to keep the sum of the squares of the parameter values low:

R_L2(W) = ||W||₂² = Σ_{i,j} (W[i,j])².   (2.26)

The L2 regularizer is also called a gaussian prior or weight decay. Note that L2 regularized models are severely punished for high parameter weights, but once the value is close enough to zero, their effect becomes negligible. The model will prefer to decrease the value of one parameter with high weight by 1 than to decrease the value of ten parameters that already have relatively low weights by 0.1 each.

L1 regularization In L1 regularization, R takes the form of the L1 norm of the parameters, trying to keep the sum of the absolute values of the parameters low:

R_L1(W) = ||W||₁ = Σ_{i,j} |W[i,j]|.   (2.27)

In contrast to L2, the L1 regularizer penalizes low and high parameter values uniformly, and has an incentive to decrease all the non-zero parameter values toward zero. It thus encourages sparse solutions—models with many parameters with a zero value. The L1 regularizer is also called a sparse prior or lasso [Tibshirani, 1994].

Elastic-Net The elastic-net regularization [Zou and Hastie, 2005] combines both L1 and L2 regularization:

R_elastic-net(W) = λ₁ R_L1(W) + λ₂ R_L2(W).   (2.28)

Dropout Another form of regularization which is very effective in neural networks is Dropout, which we discuss in Section 4.6.

2.8 GRADIENT-BASED OPTIMIZATION
In order to train the model, we need to solve the optimization problem in Equation (2.25). A common solution is to use a gradient-based method. Roughly speaking, gradient-based methods work by repeatedly computing an estimate of the loss L over the training set, computing the gradients of the parameters Θ with respect to the loss estimate, and moving the parameters in the opposite direction of the gradient. The different optimization methods differ in how the error estimate is computed, and how "moving in the opposite direction of the gradient" is defined. We describe the basic algorithm, stochastic gradient descent (SGD), and then briefly mention the other approaches with pointers for further reading.

Motivating Gradient-based Optimization Consider the task of finding the scalar value x that minimizes a function y = f(x). The canonical approach is computing the derivative f′(x) of the function and solving for f′(x) = 0 to get the extrema points. For the sake of example, assume this approach cannot be used (indeed, it is challenging to use this approach in functions of multiple variables). An alternative approach is a numeric one: compute the first derivative f′(x). Then, start with an initial guess value xᵢ. Evaluating u = f′(xᵢ)
will give the direction of change. If u = 0, then xᵢ is an optimum point. Otherwise, move in the opposite direction of u by setting x_{i+1} ← xᵢ − ηu, where η is a rate parameter. With a small enough value of η, f(x_{i+1}) will be smaller than f(xᵢ). Repeating this process (with properly decreasing values of η) will find an optimum point xᵢ. If the function f() is convex, the optimum will be a global one. Otherwise, the process is only guaranteed to find a local optimum.

Gradient-based optimization simply generalizes this idea to functions with multiple variables. A gradient of a function with k variables is the collection of k partial derivatives, one according to each of the variables. Moving the inputs in the direction of the gradient will increase the value of the function, while moving them in the opposite direction will decrease it. When optimizing the loss L(Θ; x_{1:n}, y_{1:n}), the parameters Θ are considered as inputs to the function, while the training examples are treated as constants.

Convexity In gradient-based optimization, it is common to distinguish between convex (or concave) functions and non-convex (non-concave) functions. A convex function is a function whose second-derivative is always non-negative. As a consequence, convex functions have a single minimum point. Similarly, concave functions are functions whose second-derivatives are always negative or zero, and as a consequence have a single maximum point. Convex (concave) functions have the property that they are easy to minimize (maximize) using gradient-based optimization—simply follow the gradient until an extremum point is reached, and once it is reached we know we obtained the global extremum point. In contrast, for functions that are neither convex nor concave, a gradient-based optimization procedure may converge to a local extremum point, missing the global optimum.

2.8.1 STOCHASTIC GRADIENT DESCENT
An effective method for training linear models is using the SGD algorithm [Bottou, 2012, LeCun et al., 1998a] or a variant of it. SGD is a general optimization algorithm. It receives a function f parameterized by Θ, a loss function L, and desired input and output pairs x_{1:n}, y_{1:n}. It then attempts to set the parameters Θ such that the cumulative loss of f on the training examples is small. The algorithm is shown in Algorithm 2.1.

Algorithm 2.1 Online stochastic gradient descent training.
Input:
- Function f(x; Θ) parameterized with parameters Θ.
- Training set of inputs x₁, …, xₙ and desired outputs y₁, …, yₙ.
- Loss function L.
1: while stopping criteria not met do
2:   Sample a training example xᵢ, yᵢ
3:   Compute the loss L(f(xᵢ; Θ), yᵢ)
4:   ĝ ← gradients of L(f(xᵢ; Θ), yᵢ) w.r.t. Θ
5:   Θ ← Θ − ηₜ ĝ
6: return Θ

The goal of the algorithm is to set the parameters Θ so as to minimize the total loss L(Θ) = Σ_{i=1}^{n} L(f(xᵢ; Θ), yᵢ) over the training set. It works by repeatedly sampling a training example and computing the gradient of the error on the example with respect to the parameters Θ (line 4)—the input and expected output are assumed to be fixed, and the loss is treated as a function of the parameters Θ. The parameters Θ are then updated in the opposite direction of the gradient, scaled by a learning rate ηₜ (line 5). The learning rate can either be fixed throughout the training process, or decay as a function of the time step t.¹³ For further discussion on setting the learning rate, see Section 5.2. Note that the error calculated in line 3 is based on a single training example, and is thus just a rough estimate of the corpus-wide loss L that we are aiming to minimize. The noise in the loss computation may result in inaccurate gradients.
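The following is a runnable sketch of Algorithm 2.1 for a linear model. To keep the gradient computation in closed form, it uses a squared-error loss on synthetic data; both choices are assumptions of the example rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear function, so we can check the result.
n, d = 200, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X.dot(w_true) + 0.01 * rng.normal(size=n)

w, b = np.zeros(d), 0.0        # parameters Theta
eta = 0.05                     # learning rate

for step in range(2000):                   # stopping criterion: fixed budget
    i = rng.integers(n)                    # line 2: sample a training example
    y_hat = X[i].dot(w) + b                # f(x_i; Theta)
    err = y_hat - Y[i]                     # from the loss 0.5 * err**2
    g_w, g_b = err * X[i], err             # line 4: gradients w.r.t. w and b
    w -= eta * g_w                         # line 5: move against the gradient
    b -= eta * g_b

print(np.max(np.abs(w - w_true)))          # small: w has moved close to w_true
```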
A common way of reducing this noise is to estimate the error and the gradients based on a sample of m examples. This gives rise to the minibatch SGD algorithm (Algorithm 2.2). In lines 3–6, the algorithm estimates the gradient of the corpus loss based on the minibatch. After the loop, ĝ contains the gradient estimate, and the parameters Θ are updated toward ĝ. The minibatch size can vary from m = 1 to m = n. Higher values provide better estimates of the corpus-wide gradients, while smaller values allow more updates and in turn faster convergence. Besides the improved accuracy of the gradient estimation, the minibatch algorithm provides opportunities for improved training efficiency. For modest sizes of m, some computing architectures (e.g., GPUs) allow an efficient parallel implementation of the computation in lines 3–6.

Algorithm 2.2  Minibatch stochastic gradient descent training.
Input:
- Function f(x; Θ) parameterized with parameters Θ.
- Training set of inputs x_1, ..., x_n and desired outputs y_1, ..., y_n.
- Loss function L.
1: while stopping criteria not met do
2:     Sample a minibatch of m examples {(x_1, y_1), ..., (x_m, y_m)}
3:     ĝ ← 0
4:     for i = 1 to m do
5:         Compute the loss L(f(x_i; Θ), y_i)
6:         ĝ ← ĝ + gradients of (1/m) L(f(x_i; Θ), y_i) w.r.t Θ
7:     Θ ← Θ − η_t ĝ
8: return Θ

With a properly decreasing learning rate, SGD is guaranteed to converge to a global optimum if the function is convex, which is the case for linear and log-linear models coupled with the loss functions and regularizers discussed in this chapter. However, it can also be used to optimize non-convex functions such as multi-layer neural networks. While there are no longer guarantees of finding a global optimum, the algorithm has proved to be robust and performs well in practice.¹⁴ A sketch of the minibatch update closes this subsection.

¹⁴Recent work from the neural networks literature argues that the non-convexity of the networks is manifested in a proliferation of saddle points rather than local minima [Dauphin et al., 2014]. This may explain some of the success in training neural networks despite using local search techniques.
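The following illustrative sketch (again not the book's code) implements one minibatch update, lines 3–7 of Algorithm 2.2, for the same squared-loss linear model used in the previous sketch; the batch size and learning rate are arbitrary.

import numpy as np

def minibatch_sgd_step(W, b, X_batch, Y_batch, eta=0.05):
    # One execution of lines 3-7 of Algorithm 2.2 for f(x) = xW + b with
    # squared loss; each example's gradient is accumulated with a 1/m scaling.
    m = len(X_batch)
    grad_W, grad_b = np.zeros_like(W), np.zeros_like(b)   # line 3: g_hat <- 0
    for x, y in zip(X_batch, Y_batch):                    # lines 4-6: accumulate gradients
        y_hat = x @ W + b
        g = 2 * (y_hat - y)
        grad_W += np.outer(x, g) / m
        grad_b += g / m
    return W - eta * grad_W, b - eta * grad_b             # line 7: update against the gradient

# Usage, with the X, Y, W, b arrays from the previous sketch:
# W, b = minibatch_sgd_step(W, b, X[:32], Y[:32])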
X bŒt C xŒi W Œi;k C bŒk i i t D argmax yŒi i k D argmax yO Œi i ¤ t: i e first observation is that if 1 .yO Œt yO Œk/ Ä 0 then the loss is 0 and so is the gradient (the derivative of the max operation is the derivative of the maximal value). Otherwise, consider the derivative of @L @bŒi  . For the partial derivative, bŒi  is treated as a variable, and all others are consid- ered as constants. For i ¤ k; t, the term bŒi does not contribute to the loss, and its derivative it is 0. For i D k and i D t we trivially get: 8 @L ˆ< 1 @bŒi D ˆ:1 0 i Dt i Dk otherwise: ¹⁵More advanced derivation techniques allow working with matrices and vectors directly. Here, we stick to high-school level techniques. 2.8. GRADIENT-BASED OPTIMIZATION 35 Similarly, for W Œi;j , only j D k and j D t contribute to the loss. We get: @L @W Œi;j  D 8 ˆˆˆˆˆ< @. x W Œi Œi;t/ W @ Œi;t x W @. Œi Œi;k/ ˆˆˆˆˆ: W @ Œi;k 0 D xŒi j D t D xŒi j D k otherwise: is concludes the gradient calculation. As a simple exercise, the reader should try and compute the gradients of a multi-class linear model with hinge loss and L2 regularization, and the gradients of multi-class classification with softmax output transformation and cross-entropy loss. 2.8.3 BEYOND SGD While the SGD algorithm can and often does produce good results, more advanced algorithms are also available. e SGD+Momentum [Polyak, 1964] and Nesterov Momentum [Nesterov, 1983, 2004, Sutskever et al., 2013] algorithms are variants of SGD in which previous gradients are accumulated and affect the current update. Adaptive learning rate algorithms including AdaGrad [Duchi et al., 2011], AdaDelta [Zeiler, 2012], RMSProp [Tieleman and Hinton, 2012], and Adam [Kingma and Ba, 2014] are designed to select the learning rate for each minibatch, sometimes on a per-coordinate basis, potentially alleviating the need of fiddling with learning rate scheduling. For details of these algorithms, see the original papers or [Bengio et al., 2016, Sections 8.3, 8.4]. 37 CHAPTER 3 From Linear Models to Multi-layer Perceptrons 3.1 LIMITATIONS OF LINEAR MODELS: THE XOR PROBLEM e hypothesis class of linear (and log-linear) models is severely restricted. For example, it cannot represent the XOR function, defined as: xor.0; 0/ D 0 xor.1; 0/ D 1 xor.0; 1/ D 1 xor.1; 1/ D 0: at is, there is no parameterization w 2 R2; b 2 R such that: .0; 0/ w C b < 0 .0; 1/ w C b 0 .1; 0/ w C b 0 .1; 1/ w C b < 0: To see why, consider the following plot of the XOR function, where blue Os denote the positive class and green Xs the negative class. 1 0 0 1 38 3. FROM LINEAR MODELS TO MULTI-LAYER PERCEPTRONS It is clear that no straight line can separate the two classes. 3.2 NONLINEAR INPUT TRANSFORMATIONS However, if we transform the points by feeding each of them through the nonlinear function .x1; x2/ D Œx1 x2; x1 C x2, the XOR problem becomes linearly separable. 2 1 0 0 1 2 e function mapped the data into a representation that is suitable for linear classification. Having at our disposal, we can now easily train a linear classifier to solve the XOR problem. yO D f .x/ D .x/W C b: In general, one can successfully train a linear classifier over a dataset which is not linearly separable by defining a function that will map the data to a representation in which it is linearly separable, and then train a linear classifier on the resulting representation. 
In the XOR example the transformed data has the same dimensions as the original one, but often in order to make the data linearly separable one needs to map it to a space with a much higher dimension. is solution has one glaring problem, however: we need to manually define the function , a process which is dependent on the particular dataset, and requires a lot of human intuition. 3.3 KERNEL METHODS Kernelized Support Vectors Machines (SVMs) [Boser and et al., 1992], and Kernel Methods in general [Shawe-Taylor and Cristianini, 2004], approach this problem by defining a set of generic mappings, each of them mapping the data into very high dimensional—and sometimes even infinite—spaces, and then performing linear classification in the transformed space. Working in very high dimensional spaces significantly increase the probability of finding a suitable linear separator. One example mapping is the polynomial mapping, .x/ D .x/d . For d D 2, we get .x1; x2/ D .x1x1; x1x2; x2x1; x2x2/. is gives us all combinations of the two variables, allowing to solve the XOR problem using a linear classifier, with a polynomial increase in the number of parameters. In the XOR problem the mapping increased the dimensionality of the input (and 3.4. TRAINABLE MAPPING FUNCTIONS 39 hence the number of parameters) from 2–4. For the language identification example, the input dimensionality would have increased from 784 to 7842 D614,656 dimensions. Working in very high dimensional spaces can become computationally prohibitive, and the ingenuity in kernel methods is the use of the kernel trick [Aizerman et al., 1964, Schölkopf, 2001] that allows one to work in the transformed space without ever computing the transformed representation. e generic mappings are designed to work on many common cases, and the user needs to select the suitable one for its task, often by trial and error. A downside of the approach is that the application of the kernel trick makes the classification procedure for SVMs dependent linearly on the size of the training set, making it prohibitive for use in setups with reasonably large training sets. Another downside of high dimensional spaces is that they increase the risk of overfitting. 3.4 TRAINABLE MAPPING FUNCTIONS A different approach is to define a trainable nonlinear mapping function, and train it in conjunction with the linear classifier. at is, finding the suitable representation becomes the responsibility of the training algorithm. For example, the mapping function can take the form of a parameterized linear model, followed by a nonlinear activation function g that is applied to each of the output dimensions: yO D .x/W C b .x/ D g.xW 0 C b0/: (3.1) By taking g.x/ D max.0; x/ and W 0 D 11 11 , b0 D . 1 0 / we get an equivalent mapping to .x1 x2; x1 C x2/ for the our points of interest (0,0), (0,1), (1,0), and (1,1), successfully solv- ing the XOR problem. e entire expression g.xW 0 C b0/W C b is differentiable (although not convex), making it possible to apply gradient-based techniques to the model training, learning both the representation function and the linear classifier on top of it at the same time. is is the main idea behind deep learning and neural networks. In fact, Equation (3.1) describes a very common neural network architecture called a multi-layer perceptron (MLP). Having established the motivation, we now turn to describe multi-layer neural networks in more detail. 
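To close the chapter, here is a small numpy sketch (not the book's code) verifying the mapping from Section 3.4: with g = ReLU, W′ the 2×2 matrix of ones, and b′ = (−1, 0), the transformed XOR points become linearly separable. The particular separating weights w and b at the end are a hand-picked illustrative choice.

import numpy as np

g = lambda v: np.maximum(0, v)            # ReLU nonlinearity
W_prime = np.array([[1.0, 1.0],
                    [1.0, 1.0]])
b_prime = np.array([-1.0, 0.0])

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])           # xor(x1, x2)

phi = g(points @ W_prime + b_prime)       # rows: (0,0), (0,1), (0,1), (1,2)
# In the transformed space a single linear rule separates the classes,
# e.g. w = (-2, 1), b = -0.5 scores exactly the positive points above zero:
w, b = np.array([-2.0, 1.0]), -0.5
scores = phi @ w + b
print((scores > 0).astype(int))           # matches labels: [0 1 1 0]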
41 CHAPTER 4 Feed-forward Neural Networks 4.1 A BRAIN-INSPIRED METAPHOR As the name suggests, neural networks were inspired by the brain’s computation mechanism, which consists of computation units called neurons. While the connections between artificial neural networks and the brain are in fact rather slim, we repeat the metaphor here for completeness. In the metaphor, a neuron is a computational unit that has scalar inputs and outputs. Each input has an associated weight. e neuron multiplies each input by its weight, and then sums¹ them, applies a nonlinear function to the result, and passes it to its output. Figure 4.1 shows such a neuron. Output y1 Neuron ∫ Input x1 x2 x3 x4 Figure 4.1: A single neuron with four inputs. e neurons are connected to each other, forming a network: the output of a neuron may feed into the inputs of one or more neurons. Such networks were shown to be very capable computational devices. If the weights are set correctly, a neural network with enough neurons and a nonlinear activation function can approximate a very wide range of mathematical functions (we will be more precise about this later). A typical feed-forward neural network may be drawn as in Figure 4.2. Each circle is a neuron, with incoming arrows being the neuron’s inputs and outgoing arrows being the neuron’s outputs. Each arrow carries a weight, reflecting its importance (not shown). Neurons are arranged in layers, reflecting the flow of information. e bottom layer has no incoming arrows, and is ¹While summing is the most common operation, other functions, such as a max, are also possible. 42 4. FEED-FORWARD NEURAL NETWORKS the input to the network. e top-most layer has no outgoing arrows, and is the output of the network. e other layers are considered “hidden.” e sigmoid shape inside the neurons in the middle layers represent a nonlinear function (i.e., the logistic function 1=.1 C e x/) that is applied to the neuron’s value before passing it to the output. In the figure, each neuron is connected to all of the neurons in the next layer—this is called a fully connected layer or an affine layer. Output layer y1 y2 y3 Hidden layer ∫∫∫∫∫ Hidden layer ∫∫∫∫∫∫ Input layer x1 x2 x3 x4 Figure 4.2: Feed-forward neural network with two hidden layers. While the brain metaphor is sexy and intriguing, it is also distracting and cumbersome to manipulate mathematically. We therefore switch back to using more concise mathematical notation. As will soon become apparent, a feed-forward network as the one in Figure 4.2 is simply a stack of linear models separated by nonlinear functions. e values of each row of neurons in the network can be thought of as a vector. In Figure 4.2 the input layer is a 4-dimensional vector (x), and the layer above it is a 6-dimensional vector (h1). e fully connected layer can be thought of as a linear transformation from 4 dimensions to 6 dimensions. A fully connected layer implements a vector-matrix multiplication, h D xW where the weight of the connection from the ith neuron in the input row to the j th neuron in the output row is W Œi;j .² e values of h are then transformed by a nonlinear function g that is applied to each value before being passed on as input to the next layer. e whole computation from input to output can be written as: .g.xW 1//W 2 where W 1 are the weights of the first layer and W 2 are the weights of the second one. Taking this view, the single neuron in Figure 4.1 is equivalent to a logistic (log-linear) binary classifier .xw/ without a bias term . 
²ThoŒjseeDwPhy4iDth1isxisŒitheWcaŒsie;,jd.enote the weight of the i th input of the j th neuron in h as W Œi;j . e value of hŒj  is then 4.2. IN MATHEMATICAL NOTATION 43 4.2 IN MATHEMATICAL NOTATION From this point on, we will abandon the brain metaphor and describe networks exclusively in terms of vector-matrix operations. e simplest neural network is called a perceptron. It is simply a linear model: NNPerceptron.x/ D xW C b (4.1) x 2 Rdin ; W 2 Rdin dout ; b 2 Rdout ; where W is the weight matrix and b is a bias term.³ In order to go beyond linear functions, we introduce a nonlinear hidden layer (the network in Figure 4.2 has two such layers), resulting in the Multi Layer Perceptron with one hidden-layer (MLP1). A feed-forward neural network with one hidden-layer has the form: NNMLP1.x/ D g.xW 1 C b1/W 2 C b2 (4.2) x 2 Rdin ; W 1 2 Rdin d1 ; b1 2 Rd1 ; W 2 2 Rd1 d2 ; b2 2 Rd2 : Here W 1 and b1 are a matrix and a bias term for the first linear transformation of the input, g is a nonlinear function that is applied element-wise (also called a nonlinearity or an activation function), and W 2 and b2 are the matrix and bias term for a second linear transform. Breaking it down, xW 1 C b1 is a linear transformation of the input x from din dimensions to d1 dimensions. g is then applied to each of the d1 dimensions, and the matrix W 2 together with bias vector b2 are then used to transform the result into the d2 dimensional output vector. e nonlinear activation function g has a crucial role in the network’s ability to represent complex functions. Without the nonlinearity in g, the neural network can only represent linear transfor- mations of the input.⁴ Taking the view in Chapter 3, the first layer transforms the data into a good representation, while the second layer applies a linear classifier to that representation. We can add additional linear-transformations and nonlinearities, resulting in an MLP with two hidden-layers (the network in Figure 4.2 is of this form): NNMLP2.x/ D .g2.g1.xW 1 C b1/W 2 C b2//W 3: (4.3) It is perhaps clearer to write deeper networks like this using intermediary variables: NNMLP2.x/ Dy h1 Dg1.xW 1 C b1/ h2 Dg2.h1W 2 C b2/ y Dh2W 3: (4.4) ³e network in Figure 4.2 does not include bias terms. A bias term can be added to a layer by adding to it an additional neuron that does not have any incoming connections, whose value is always 1. ⁴To see why, consider that a sequence of linear transformations is still a linear transformation. 44 4. FEED-FORWARD NEURAL NETWORKS e vector resulting from each linear transform is referred to as a layer. e outer-most linear transform results in the output layer and the other linear transforms result in hidden layers. Each hidden layer is followed by a nonlinear activation. In some cases, such as in the last layer of our example, the bias vectors are forced to 0 (“dropped”). Layers resulting from linear transformations are often referred to as fully connected, or affine. Other types of architectures exist. In particular, image recognition problems benefit from convolutional and pooling layers. Such layers have uses also in language processing, and will be discussed in Chapter 13. Networks with several hidden layers are said to be deep networks, hence the name deep learning. When describing a neural network, one should specify the dimensions of the layers and the input. A layer will expect a din dimensional vector as its input, and transform it into a dout dimensional vector. 
e dimensionality of the layer is taken to be the dimensionality of its output. For a fully connected layer l.x/ D xW C b with input dimensionality din and output dimensionality dout, the dimensions of x is 1 din, of W is din dout and of b is 1 dout. Like the case with linear models, the output of a neural network is a dout dimensional vector. In case dout D 1, the network’s output is a scalar. Such networks can be used for regression (or scoring) by considering the value of the output, or for binary classification by consulting the sign of the output. Networks with dout D k > 1 can be used for k-class classification, by associating each dimension with a class, and looking for the dimension with maximal value. Similarly, if the output vector entries are positive and sum to one, the output can be interpreted as a distribution over class assignments (such output normalization is typically achieved by applying a softmax transformation on the output layer, see Section 2.6). e matrices and the bias terms that define the linear transformations are the parameters of the network. Like in linear models, it is common to refer to the collection of all parameters as ‚. Together with the input, the parameters determine the network’s output. e training algorithm is responsible for setting their values such that the network’s predictions are correct. Unlike linear models, the loss function of multi-layer neural networks with respect to their parameters is not convex,⁵ making search for the optimal parameter values intractable. Still, the gradient-based optimization methods discussed in Section 2.8 can be applied, and perform very well in practice. Training neural networks is discussed in detail in Chapter 5. 4.3 REPRESENTATION POWER In terms of representation power, it was shown by Hornik et al. [1989] and Cybenko [1989] that MLP1 is a universal approximator—it can approximate with any desired non-zero amount of error a family of functions that includes all continuous functions on a closed and bounded subset of Rn, and any function mapping from any finite dimensional discrete space to another.⁶ is ⁵Strictly convex functions have a single optimal solution, making them easy to optimize using gradient-based methods. ⁶Specifically, a feed-forward network with linear output layer and at least one hidden layer with a “squashing” activation function can approximate any Borel measurable function from one finite dimensional space to another. e proof was later extended by Leshno et al. [1993] to a wider range of activation functions, including the ReLU function g.x/ D max.0; x/. 4.4. COMMON NONLINEARITIES 45 may suggest there is no reason to go beyond MLP1 to more complex architectures. However, the theoretical result does not discuss the learnability of the neural network (it states that a representation exists, but does not say how easy or hard it is to set the parameters based on training data and a specific learning algorithm). It also does not guarantee that a training algorithm will find the correct function generating our training data. Finally, it does not state how large the hidden layer should be. Indeed, Telgarsky [2016] show that there exist neural networks with many layers of bounded size that cannot be approximated by networks with fewer layers unless these layers are exponentially large. 
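Before turning to practical considerations, here is a minimal numpy sketch (not the book's code) of the MLP1 forward computation of Equation (4.2); the dimensions, the random initialization, and the choice of tanh for g are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_1, d_2 = 4, 6, 3                    # illustrative layer dimensions

W1, b1 = rng.normal(size=(d_in, d_1)), np.zeros(d_1)
W2, b2 = rng.normal(size=(d_1, d_2)), np.zeros(d_2)
g = np.tanh                                 # the elementwise nonlinearity

def mlp1(x):
    h = g(x @ W1 + b1)                      # hidden layer: d_in -> d_1, then g applied elementwise
    return h @ W2 + b2                      # output layer: d_1 -> d_2 (a linear classifier over h)

x = rng.normal(size=d_in)                   # a 1 x d_in input vector
y = mlp1(x)                                 # a d_2-dimensional output, e.g. scores for 3 classes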
In practice, we train neural networks on relatively small amounts of data using local search methods such as variants of stochastic gradient descent, and use hidden layers of relatively modest sizes (up to several thousands). As the universal approximation theorem does not give any guarantees under these non-ideal, real-world conditions, there is definitely benefit to be had in trying out more complex architectures than MLP1. In many cases, however, MLP1 does indeed provide strong results. For further discussion on the representation power of feed-forward neural networks, see Bengio et al. [2016, Section 6.5].

4.4 COMMON NONLINEARITIES

The nonlinearity g can take many forms. There is currently no good theory as to which nonlinearity to apply in which conditions, and choosing the correct nonlinearity for a given task is for the most part an empirical question. I will now go over the common nonlinearities from the literature: the sigmoid, tanh, hard tanh, and the rectified linear unit (ReLU). Some NLP researchers have also experimented with other forms of nonlinearities such as cube and tanh-cube.

Sigmoid  The sigmoid activation function σ(x) = 1/(1 + e^{−x}), also called the logistic function, is an S-shaped function, transforming each value x into the range [0, 1]. The sigmoid was the canonical nonlinearity for neural networks since their inception, but is currently considered to be deprecated for use in internal layers of neural networks, as the choices listed below prove to work much better empirically.

Hyperbolic tangent (tanh)  The hyperbolic tangent tanh(x) = (e^{2x} − 1)/(e^{2x} + 1) activation function is an S-shaped function, transforming the values x into the range [−1, 1].

Hard tanh  The hard-tanh activation function is an approximation of the tanh function which is faster to compute and to find derivatives thereof:

    hardtanh(x) = −1 if x < −1;   1 if x > 1;   x otherwise.    (4.5)

Rectifier (ReLU)  The rectifier activation function [Glorot et al., 2011], also known as the rectified linear unit, is a very simple activation function that is easy to work with and was shown many times to produce excellent results.⁷ The ReLU unit clips each value x < 0 at 0. Despite its simplicity, it performs well for many tasks, especially when combined with the dropout regularization technique (see Section 4.6):

    ReLU(x) = max(0, x) = 0 if x < 0;   x otherwise.    (4.6)

As a rule of thumb, both ReLU and tanh units work well, and significantly outperform the sigmoid. You may want to experiment with both tanh and ReLU activations, as each one may perform better in different settings. Figure 4.3 shows the shapes of the different activation functions, together with the shapes of their derivatives.

Figure 4.3: Activation functions (top) and their derivatives (bottom).

4.5 LOSS FUNCTIONS

When training a neural network (more on training in Chapter 5), much like when training a linear classifier, one defines a loss function L(ŷ, y), stating the loss of predicting ŷ when the true output is y. The training objective is then to minimize the loss across the different training examples.
e loss L.yO ; y/ assigns a numerical score (a scalar) to the network’s output yO given the true expected output y. e loss functions discussed for linear models in Section 2.7.1 are relevant and widely used also for neural networks. For further discussion on loss functions in the ⁷e technical advantages of the ReLU over the sigmoid and tanh activation functions is that it does not involve expensiveto-compute functions, and more importantly that it does not saturate. e sigmoid and tanh activation are capped at 1, and the gradients at this region of the functions are near zero, driving the entire gradient near zero. e ReLU activation does not have this problem, making it especially suitable for networks with multiple layers, which are susceptible to the vanishing gradients problem when trained with the saturating units. 4.6. REGULARIZATION AND DROPOUT 47 context of neural networks, see LeCun and Huang [2005], LeCun et al. [2006] and Bengio et al. [2016]. 4.6 REGULARIZATION AND DROPOUT Multi-layer networks can be large and have many parameters, making them especially prone to overfitting. Model regularization is just as important in deep neural networks as it is in linear models, and perhaps even more so. e regularizers discussed in Section 2.7.2, namely L2, L1 and the elastic-net, are also relevant for neural networks. In particular, L2 regularization, also called weight decay is effective for achieving good generalization performance in many cases, and tuning the regularization strength is advisable. Another effective technique for preventing neural networks from overfitting the training data is dropout training [Hinton et al., 2012, Srivastava et al., 2014]. e dropout method is designed to prevent the network from learning to rely on specific weights. It works by randomly dropping (setting to 0) half of the neurons in the network (or in a specific layer) in each training example in the stochastic-gradient training. For example, consider the multi-layer perceptron with two hidden layers (MLP2): NNMLP2.x/ Dy h1 Dg1.xW 1 C b1/ h2 Dg2.h1W 2 C b2/ y Dh2W 3: When applying dropout training to MLP2, we randomly set some of the values of h1 and h2 to 0 at each training round: NNMLP2.x/ Dy h1 Dg1.xW 1 C b1/ m1 Bernouli.r1/ hQ1 Dm1 ˇ h1 h2 Dg2.hQ1W 2 C b2/ m2 Bernouli.r2/ (4.7) hQ2 Dm2 ˇ h2 y DhQ2W 3: Here, m1 and m2 are random masking vectors with the dimensions of h1 and h2, respectively, and ˇ is the element-wise multiplication operation. e values of the elements in the masking 48 4. FEED-FORWARD NEURAL NETWORKS vectors are either 0 or 1, and are drawn from a Bernouli distribution with parameter r (usually r D 0:5). e values corresponding to zeros in the masking vectors are then zeroed out, replacing the hidden layers h with hQ before passing them on to the next layer. Work by Wager et al. [2013] establishes a strong connection between the dropout method and L2 regularization. Another view links dropout to model averaging and ensemble techniques [Srivastava et al., 2014]. e dropout technique is one of the key factors contributing to very strong results of neuralnetwork methods on image classification tasks [Krizhevsky et al., 2012], especially when combined with ReLU activation units [Dahl et al., 2013]. e dropout technique is effective also in NLP applications of neural networks. 4.7 SIMILARITY AND DISTANCE LAYERS We sometimes wish to calculate a scalar value based on two vectors, such that the value reflects the similarity, compatibility or distance between the two vectors. 
For example, vectors v1 2 Rd and v2 2 Rd may be the output layers of two MLPs, and we would like to train the network to produce similar vectors for some training examples, and dissimilar vectors for others. In what follows we describe common functions that take two vectors u 2 Rd and v 2 Rd , and return a scalar. ese functions can (and often are) integrated in feed-forward neural networks. Dot Product A very common options is to use the dot-product: Xd simdot.u; v/ Du v D uŒivŒi i D1 (4.8) Euclidean Distance Another popular options is the Euclidean Distance: disteuclidean.u; v/ D v u u tXd .uŒi p vŒi/2 D .u v/ .u v/ D jju vjj2 i D1 (4.9) Note that this is a distance metric and not a similarity: here, small (near zero) values indicate similar vectors and large values dissimilar ones. e square-root is often omitted. Trainable Forms e dot-product and the euclidean distance above are fixed functions. We sometimes want to use a parameterized function, that can be trained to produce desired similarity (or dissimilarity) values by focusing on specific dimensions of the vectors. A common trainable similarity function is the bilinear form: 4.8. EMBEDDING LAYERS 49 simbilinear.u; v/ D uM v M 2 Rd d where the matrix M is a parameter that needs to be trained. Similarly, for a trainable distance function we can use: (4.10) dist.u; v/ D .u v/M .u v/ (4.11) Finally, a multi-layer perceptron with a single output neuron can also be used for producing a scalar from two vectors, by feeding it the concatenation of the two vectors. 4.8 EMBEDDING LAYERS As will be further discussed in Chapter 8, when the input to the neural network contains symbolic categorical features (e.g., features that take one of k distinct symbols, such as words from a closed vocabulary), it is common to associate each possible feature value (i.e., each word in the vocabulary) with a d -dimensional vector for some d . ese vectors are then considered parameters of the model, and are trained jointly with the other parameters. e mapping from a symbolic feature values such as “word number 1249” to d -dimensional vectors is performed by an embedding layer (also called a lookup layer). e parameters in an embedding layer are simply a matrix E 2 Rjvocabj d where each row corresponds to a different word in the vocabulary. e lookup operation is then simply indexing: v1249 D E Œ1249;W. If the symbolic feature is encoded as a one-hot vector x, the lookup operation can be implemented as the multiplication xE . e word vectors are often concatenated to each other before being passed on to the next layer. Embeddings are discussed in more depth in Chapter 8 when discussing dense representations of categorical features, and in Chapter 10 when discussing pre-trained word representations. 51 CHAPTER 5 Neural Network Training Similar to linear models, neural network are differentiable parameterized functions, and are trained using gradient-based optimization (see Section 2.8). e objective function for nonlinear neural networks is not convex, and gradient-based methods may get stuck in a local minima. Still, gradient-based methods produce good results in practice. Gradient calculation is central to the approach. e mathematics of gradient computation for neural networks are the same as those of linear models, simply following the chain-rule of differentiation. However, for complex networks this process can be laborious and error-prone. 
Fortunately, gradients can be efficiently and automatically computed using the backpropagation algorithm [LeCun et al., 1998b, Rumelhart et al., 1986]. The backpropagation algorithm is a fancy name for methodically computing the derivatives of a complex expression using the chain-rule, while caching intermediary results. More generally, the backpropagation algorithm is a special case of the reverse-mode automatic differentiation algorithm [Neidinger, 2010, Section 7], [Baydin et al., 2015, Bengio, 2012]. The following section describes reverse-mode automatic differentiation in the context of the computation graph abstraction. The rest of the chapter is devoted to practical tips for training neural networks in practice.

5.1 THE COMPUTATION GRAPH ABSTRACTION

While one can compute the gradients of the various parameters of a network by hand and implement them in code, this procedure is cumbersome and error prone. For most purposes, it is preferable to use automatic tools for gradient computation [Bengio, 2012]. The computation-graph abstraction allows us to easily construct arbitrary networks, evaluate their predictions for given inputs (forward pass), and compute gradients for their parameters with respect to arbitrary scalar losses (backward pass).

A computation graph is a representation of an arbitrary mathematical computation as a graph. It is a directed acyclic graph (DAG) in which nodes correspond to mathematical operations or (bound) variables and edges correspond to the flow of intermediary values between the nodes. The graph structure defines the order of the computation in terms of the dependencies between the different components. The graph is a DAG and not a tree, as the result of one operation can be the input of several continuations. Consider for example a graph for the computation of (a·b + 1)·(a·b + 2):

[Figure: the computation graph for (a·b + 1)·(a·b + 2), with inputs a and b feeding a shared multiplication node, two addition nodes (with constants 1 and 2), and a final multiplication node.]

The computation of a·b is shared. We restrict ourselves to the case where the computation graph is connected (in a disconnected graph, each connected component is an independent function that can be evaluated and differentiated independently of the other connected components).

Figure 5.1: (a) Graph with unbound input. (b) Graph with concrete input. (c) Graph with concrete input, expected output, and a final loss node.

Since a neural network is essentially a mathematical expression, it can be represented as a computation graph. For example, Figure 5.1a presents the computation graph for an MLP with one hidden-layer and a softmax output transformation. In our notation, oval nodes represent mathematical operations or functions, and shaded rectangle nodes represent parameters (bound variables). Network inputs are treated as constants, and drawn without a surrounding node. Input and parameter nodes have no incoming arcs, and output nodes have no outgoing arcs.
The output of each node is a matrix, the dimensionality of which is indicated above the node. This graph is incomplete: without specifying the inputs, we cannot compute an output. Figure 5.1b shows a complete graph for an MLP that takes three words as inputs, and predicts the distribution over part-of-speech tags for the third word. This graph can be used for prediction, but not for training, as the output is a vector (not a scalar) and the graph does not take into account the correct answer or the loss term. Finally, the graph in Figure 5.1c shows the computation graph for a specific training example, in which the inputs are the (embeddings of) the words "the," "black," "dog," and the expected output is "NOUN" (whose index is 5). The pick node implements an indexing operation, receiving a vector and an index (in this case, 5) and returning the corresponding entry in the vector.

Once the graph is built, it is straightforward to run either a forward computation (compute the result of the computation) or a backward computation (computing the gradients), as we show below. Constructing the graphs may look daunting, but is actually very easy using dedicated software libraries and APIs.

5.1.1 FORWARD COMPUTATION

The forward pass computes the outputs of the nodes in the graph. Since each node's output depends only on itself and on its incoming edges, it is trivial to compute the outputs of all nodes by traversing the nodes in a topological order and computing the output of each node given the already computed outputs of its predecessors. More formally, in a graph of N nodes, we associate each node with an index i according to their topological ordering. Let f_i be the function computed by node i (e.g., multiplication, addition, etc.). Let π(i) be the parent nodes of node i, and π⁻¹(i) = {j | i ∈ π(j)} the children nodes of node i (these are the arguments of f_i). Denote by v(i) the output of node i, that is, the application of f_i to the output values of its arguments π⁻¹(i). For variable and input nodes, f_i is a constant function and π⁻¹(i) is empty. The computation-graph forward pass computes the values v(i) for all i ∈ [1, N].

Algorithm 5.3  Computation graph forward pass.
1: for i = 1 to N do
2:     Let a_1, ..., a_m = π⁻¹(i)
3:     v(i) ← f_i(v(a_1), ..., v(a_m))

5.1.2 BACKWARD COMPUTATION (DERIVATIVES, BACKPROP)

The backward pass begins by designating a node N with scalar (1 × 1) output as a loss-node, and running forward computation up to that node. The backward computation computes the gradients of the parameters with respect to that node's value. Denote by d(i) the quantity ∂N/∂i. The backpropagation algorithm is used to compute the values d(i) for all nodes i. The backward pass fills a table of values d(1), ..., d(N) as in Algorithm 5.4.

Algorithm 5.4  Computation graph backward pass (backpropagation).
1: d(N) ← 1                                          ▷ ∂N/∂N = 1
2: for i = N−1 down to 1 do
3:     d(i) ← \sum_{j ∈ π(i)} d(j) · ∂f_j/∂i          ▷ ∂N/∂i = \sum_{j ∈ π(i)} (∂N/∂j)·(∂j/∂i)

The backpropagation algorithm (Algorithm 5.4) is essentially following the chain-rule of differentiation. The quantity ∂f_j/∂i is the partial derivative of f_j(π⁻¹(j)) w.r.t the argument i ∈ π⁻¹(j). This value depends on the function f_j and the values v(a_1), ..., v(a_m) (where a_1, ..., a_m = π⁻¹(j)) of its arguments, which were computed in the forward pass. Thus, in order to define a new kind of node, one needs to define two methods: one for calculating the forward value v(i) based on the node's inputs, and another for calculating ∂f_i/∂x for each x ∈ π⁻¹(i). A toy end-to-end illustration of these two passes is sketched below.
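The following toy Python sketch (not the book's code, and not how the toolkits discussed next are implemented) runs Algorithms 5.3 and 5.4 on the (a·b + 1)·(a·b + 2) example graph from the beginning of this section; the Node class and helper names are invented for the illustration.

class Node:
    def __init__(self, op, args, value=None):
        self.op, self.args, self.value, self.grad = op, args, value, 0.0

def const(v):  return Node("const", [], v)
def add(x, y): return Node("add", [x, y])
def mul(x, y): return Node("mul", [x, y])

def forward(nodes):                # Algorithm 5.3 (nodes listed in topological order)
    for n in nodes:
        if n.op == "add": n.value = n.args[0].value + n.args[1].value
        if n.op == "mul": n.value = n.args[0].value * n.args[1].value

def backward(nodes):               # Algorithm 5.4
    nodes[-1].grad = 1.0           # d(N) <- 1
    for n in reversed(nodes):
        for i, arg in enumerate(n.args):   # d(arg) accumulates d(n) * (d f_n / d arg)
            if n.op == "add": arg.grad += n.grad
            if n.op == "mul": arg.grad += n.grad * n.args[1 - i].value

a, b, one, two = const(2.0), const(3.0), const(1.0), const(2.0)
ab = mul(a, b)                     # the a*b computation is shared
out = mul(add(ab, one), add(ab, two))
graph = [a, b, one, two, ab, out.args[0], out.args[1], out]   # topological order

forward(graph)
backward(graph)
print(out.value)                   # (2*3 + 1) * (2*3 + 2) = 56
print(a.grad, b.grad)              # d out / d a = 45, d out / d b = 30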
Derivatives of "non-mathematical" functions  While defining ∂f_i/∂x for mathematical functions such as log or + is straightforward, some find it challenging to think about the derivative of operations such as pick(x, 5) that selects the fifth element of a vector. The answer is to think in terms of the contribution to the computation. After picking the i-th element of a vector, only that element participates in the remainder of the computation. Thus, the gradient of pick(x, 5) is a vector g with the dimensionality of x where g_{[5]} = 1 and g_{[i≠5]} = 0. Similarly, for the function max(0, x) the value of the gradient is 1 for x > 0 and 0 otherwise.

For further information on automatic differentiation, see Neidinger [2010, Section 7] and Baydin et al. [2015]. For a more in-depth discussion of the backpropagation algorithm and computation graphs (also called flow graphs), see Bengio et al. [2016, Section 6.5], Bengio [2012], and LeCun et al. [1998b]. For a popular yet technical presentation, see Chris Olah's description at http://colah.github.io/posts/2015-08-Backprop/.

5.1.3 SOFTWARE

Several software packages implement the computation-graph model, including Theano¹ [Bergstra et al., 2010], TensorFlow² [Abadi et al., 2015], Chainer,³ and DyNet⁴ [Neubig et al., 2017]. All these packages support all the essential components (node types) for defining a wide range of neural network architectures, covering the structures described in this book and more. Graph creation is made almost transparent by use of operator overloading. The framework defines a type for representing graph nodes (commonly called expressions), methods for constructing nodes for inputs and parameters, and a set of functions and mathematical operations that take expressions as input and result in more complex expressions. For example, the python code for creating the computation graph from Figure 5.1c using the DyNet framework is:

¹http://deeplearning.net/software/theano/
²https://www.tensorflow.org/
³http://chainer.org
⁴https://github.com/clab/dynet

import dynet as dy

# model initialization.
model = dy.Model()
mW1 = model.add_parameters((20, 150))
mb1 = model.add_parameters(20)
mW2 = model.add_parameters((17, 20))
mb2 = model.add_parameters(17)
lookup = model.add_lookup_parameters((100, 50))
trainer = dy.SimpleSGDTrainer(model)

def get_index(x):
    pass  # Logic omitted. Maps words to numeric IDs.

# The following builds and executes the computation graph,
# and updates the model parameters.
# Only one data point is shown; in practice the following
# should run in a data-feeding loop.

# Building the computation graph:
dy.renew_cg()  # create a new graph.
# Wrap the model parameters as graph-nodes.
W1 = dy.parameter(mW1)
b1 = dy.parameter(mb1)
W2 = dy.parameter(mW2)
b2 = dy.parameter(mb2)
# Generate the embeddings layer.
vthe = lookup[get_index("the")]
vblack = lookup[get_index("black")]
vdog = lookup[get_index("dog")]

# Connect the leaf nodes into a complete graph.
x = dy.concatenate([vthe, vblack, vdog])
output = dy.softmax(W2 * (dy.tanh(W1 * x + b1)) + b2)
loss = -dy.log(dy.pick(output, 5))

loss_value = loss.forward()
loss.backward()   # the gradient is computed and stored in the corresponding parameters.
trainer.update()  # update the parameters according to the gradients.
Most of the code involves various initializations: the first block defines model parameters that are shared between different computation graphs (recall that each graph corresponds to a specific training example). The second block turns the model parameters into the graph-node (Expression) types. The third block retrieves the Expressions for the embeddings of the input words. Finally, the fourth block is where the graph is created. Note how transparent the graph creation is—there is an almost one-to-one correspondence between creating the graph and describing it mathematically. The last block shows a forward and backward pass. The equivalent code in the TensorFlow package is:⁵

import tensorflow as tf

W1 = tf.get_variable("W1", [20, 150])
b1 = tf.get_variable("b1", [20])
W2 = tf.get_variable("W2", [17, 20])
b2 = tf.get_variable("b2", [17])
lookup = tf.get_variable("W", [100, 50])

def get_index(x):
    pass  # Logic omitted

p1 = tf.placeholder(tf.int32, [])
p2 = tf.placeholder(tf.int32, [])
p3 = tf.placeholder(tf.int32, [])
target = tf.placeholder(tf.int32, [])

v_w1 = tf.nn.embedding_lookup(lookup, p1)
v_w2 = tf.nn.embedding_lookup(lookup, p2)
v_w3 = tf.nn.embedding_lookup(lookup, p3)
x = tf.concat([v_w1, v_w2, v_w3], 0)
output = tf.nn.softmax(
    tf.einsum("ij,j->i", W2, tf.tanh(tf.einsum("ij,j->i", W1, x) + b1)) + b2)
loss = -tf.log(output[target])
trainer = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Graph definition done, compile it and feed concrete data.
# Only one data-point is shown; in practice we will use a data-feeding loop.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed_dict = {
        p1: get_index("the"),
        p2: get_index("black"),
        p3: get_index("dog"),
        target: 5
    }
    loss_value = sess.run(loss, feed_dict)
    # update; no call of backward necessary
    sess.run(trainer, feed_dict)

⁵TensorFlow code provided by Tim Rocktäschel. Thanks Tim!

The main difference between DyNet (and Chainer) and TensorFlow (and Theano) is that the former use dynamic graph construction while the latter use static graph construction. In dynamic graph construction, a different computation graph is created from scratch for each training sample, using code in the host language. Forward and backward propagation are then applied to this graph. In contrast, in the static graph construction approach, the shape of the computation graph is defined once in the beginning of the computation, using an API for specifying graph shapes, with place-holder variables indicating input and output values. Then, an optimizing graph compiler produces an optimized computation graph, and each training example is fed into the (same) optimized graph. The graph compilation step in the static toolkits (TensorFlow and Theano) is both a blessing and a curse. On the one hand, once compiled, large graphs can be run efficiently on either the CPU or a GPU, making it ideal for large graphs with a fixed structure, where only the inputs change between instances. However, the compilation step itself can be costly, and it makes the interface more cumbersome to work with.
In contrast, the dynamic packages focus on building large and dynamic computation graphs and executing them “on the fly” without a compilation step. While the execution speed may suffer compared to the static toolkits, in practice the computation speeds of the dynamic toolkits are very competitive. e dynamic packages are especially convenient when working with the recurrent and recursive networks described in Chapters 14 and 18 as well as in structured prediction settings as described in Chapter 19, in which the graphs of different data-points have different shapes. See Neubig et al. [2017] for further discussion on the dynamic-vs.-static approaches, and speed benchmarks for the different toolkits. Finally, packages such as Keras⁶ provide a higher level interface on top of packages such as eano and TensorFlow, allowing the definition and training of complex neural networks with even fewer lines of code, provided that the architectures are well established, and hence supported in the higher-level interface. 5.1.4 IMPLEMENTATION RECIPE Using the computation graph abstraction and dynamic graph construction, the pseudo-code for a network training algorithm is given in Algorithm 5.5. Here, build_computation_graph is a user-defined function that builds the computation graph for the given input, output, and network structure, returning a single loss node. update_parameters is an optimizer specific update rule. e recipe specifies that a new graph is created for each training example. is accommodates cases in which the network structure varies between training examples, such as recurrent and recursive neural networks, to be discussed in ⁶https://keras.io 58 5. NEURAL NETWORK TRAINING Algorithm 5.5 Neural network training with computation graph abstraction (using minibatches of size 1). 1: Define network parameters. 2: for iteration = 1 to T do 3: for Training example xi ; yi in dataset do 4: loss_node build_computation_graph(xi , yi , parameters) 5: loss_node.forward() 6: gradients loss_node().backward() 7: parameters update_parameters(parameters, gradients) 8: return parameters. Chapters 14–18. For networks with fixed structures, such as an MLPs, it may be more efficient to create one base computation graph and vary only the inputs and expected outputs between examples. 5.1.5 NETWORK COMPOSITION As long as the network’s output is a vector (1 k matrix), it is trivial to compose networks by making the output of one network the input of another, creating arbitrary networks. e computation graph abstractions makes this ability explicit: a node in the computation graph can itself be a computation graph with a designated output node. One can then design arbitrarily deep and complex networks, and be able to easily evaluate and train them thanks to automatic forward and gradient computation. is makes it easy to define and train elaborate recurrent and recursive networks, as discussed in Chapters 14–16 and 18, as well as networks for structured outputs and multi-objective training, as we discuss in Chapters 19 and 20. 5.2 PRACTICALITIES Once the gradient computation is taken care of, the network is trained using SGD or another gradient-based optimization algorithm. e function being optimized is not convex, and for a long time training of neural networks was considered a “black art” which can only be done by selected few. Indeed, many parameters affect the optimization process, and care has to be taken to tune these parameters. 
While this book is not intended as a comprehensive guide to successfully training neural networks, we do list here a few of the prominent issues. For further discussion on optimization techniques and algorithms for neural networks, refer to Bengio et al. [2016, Chapter 8]. For some theoretical discussion and analysis, refer to Glorot and Bengio [2010]. For various practical tips and recommendations, see Bottou [2012], LeCun et al. [1998a].

5.2.1 CHOICE OF OPTIMIZATION ALGORITHM

While the SGD algorithm works well, it may be slow to converge. Section 2.8.3 lists some alternative, more advanced stochastic-gradient algorithms. As most neural network software frameworks provide implementations of these algorithms, it is easy and often worthwhile to try out different variants. In my research group, we found that when training larger networks, using the Adam algorithm [Kingma and Ba, 2014] is very effective and relatively robust to the choice of the learning rate.

5.2.2 INITIALIZATION

The non-convexity of the objective function means the optimization procedure may get stuck in a local minimum or a saddle point, and that starting from different initial points (e.g., different random values for the parameters) may result in different results. Thus, it is advised to run several restarts of the training starting at different random initializations, and to choose the best one based on a development set.⁷ The amount of variance in the results due to different random seed selections is different for different network formulations and datasets, and cannot be predicted in advance.

⁷When debugging, and for reproducibility of results, it is advised to use a fixed random seed.

The magnitude of the random values has a very important effect on the success of training. An effective scheme due to Glorot and Bengio [2010], called xavier initialization after Glorot's first name, suggests initializing a weight matrix W ∈ R^{d_in × d_out} as:

    W ~ U[ −√6 / √(d_in + d_out),  +√6 / √(d_in + d_out) ],    (5.1)

where U[a, b] is a uniformly sampled random value in the range [a, b]. The suggestion is based on properties of the tanh activation function, works well in many situations, and is the preferred default initialization method by many. Analysis by He et al. [2015] suggests that when using ReLU nonlinearities, the weights should be initialized by sampling from a zero-mean Gaussian distribution whose standard deviation is √(2/d_in). This initialization was found by He et al. [2015] to work better than xavier initialization in an image classification task, especially when deep networks were involved. A small sketch of both schemes follows.
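The following is a small illustrative sketch (not the book's code) of the two initialization schemes just described, for a weight matrix of shape d_in × d_out; the example dimensions at the end are arbitrary.

import numpy as np

def xavier_init(d_in, d_out, rng=np.random.default_rng()):
    # Glorot and Bengio [2010]: uniform in [-sqrt(6)/sqrt(d_in+d_out), +sqrt(6)/sqrt(d_in+d_out)]
    bound = np.sqrt(6.0) / np.sqrt(d_in + d_out)
    return rng.uniform(-bound, bound, size=(d_in, d_out))

def he_init(d_in, d_out, rng=np.random.default_rng()):
    # He et al. [2015], for ReLU layers: zero-mean Gaussian with standard deviation sqrt(2/d_in)
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out))

W_tanh = xavier_init(150, 20)   # e.g., a 150 -> 20 layer used with tanh
W_relu = he_init(150, 20)       # the same layer when using a ReLU nonlinearity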
5.2.3 RESTARTS AND ENSEMBLES

When training complex networks, different random initializations are likely to end up with different final solutions, exhibiting different accuracies. Thus, if your computational resources allow, it is advisable to run the training process several times, each with a different random initialization, and choose the best one on the development set. This technique is called random restarts. The average model accuracy across random seeds is also interesting, as it gives a hint as to the stability of the process.

While the need to "tune" the random seed used to initialize models can be annoying, it also provides a simple way to get different models for performing the same task, facilitating the use of model ensembles. Once several models are available, one can base the prediction on the ensemble of models rather than on a single one (for example, by taking the majority vote across the different models, or by averaging their output vectors and considering the result as the output vector of the ensembled model). Using ensembles often increases the prediction accuracy, at the cost of having to run the prediction step several times (once for each model).

5.2.4 VANISHING AND EXPLODING GRADIENTS

In deep networks, it is common for the error gradients to either vanish (become exceedingly close to 0) or explode (become exceedingly high) as they propagate back through the computation graph. The problem becomes more severe in deeper networks, and especially so in recursive and recurrent networks [Pascanu et al., 2012]. Dealing with the vanishing gradients problem is still an open research question. Solutions include making the networks shallower, step-wise training (first train the first layers based on some auxiliary output signal, then fix them and train the upper layers of the complete network based on the real task signal), performing batch-normalization [Ioffe and Szegedy, 2015] (for every minibatch, normalizing the inputs to each of the network layers to have zero mean and unit variance), or using specialized architectures that are designed to assist in gradient flow (e.g., the LSTM and GRU architectures for recurrent networks, discussed in Chapter 15). Dealing with exploding gradients has a simple but very effective solution: clipping the gradients if their norm exceeds a given threshold. Let ĝ be the gradients of all parameters in the network, and ||ĝ|| be their L2 norm. Pascanu et al. [2012] suggest to set ĝ ← (threshold / ||ĝ||) · ĝ if ||ĝ|| > threshold.

5.2.5 SATURATION AND DEAD NEURONS

Layers with tanh and sigmoid activations can become saturated—resulting in output values for that layer that are all close to one, the upper limit of the activation function. Saturated neurons have very small gradients, and should be avoided. Layers with the ReLU activation cannot be saturated, but can "die"—most or all values are negative and thus clipped at zero for all inputs, resulting in a gradient of zero for that layer. If your network does not train well, it is advisable to monitor the network for layers with many saturated or dead neurons. Saturated neurons are caused by too-large values entering the layer. This may be controlled for by changing the initialization, scaling the range of the input values, or changing the learning rate. Dead neurons are caused by all signals entering the layer being negative (for example, this can happen after a large gradient update). Reducing the learning rate will help in this situation. For saturated layers, another option is to normalize the values in the saturated layer after the activation, i.e., instead of g(h) = tanh(h) using g(h) = tanh(h) / ||tanh(h)||. Layer normalization is an effective measure for countering saturation, but is also expensive in terms of gradient computation. A related technique is batch normalization, due to Ioffe and Szegedy [2015], in which the activations at each layer are normalized so that they have mean 0 and variance 1 across each mini-batch. The batch-normalization technique became a key component for effective training of deep networks in computer vision. As of this writing, it is less popular in natural language applications.

5.2.6 SHUFFLING

The order in which the training examples are presented to the network is important.
e SGD formulation above specifies selecting a random example in each turn. In practice, most implementations go over the training example in random order, essentially performing random sampling without replacement. It is advised to shuffle the training examples before each pass through the data. 5.2.7 LEARNING RATE Selection of the learning rate is important. Too large learning rates will prevent the network from converging on an effective solution. Too small learning rates will take a very long time to converge. As a rule of thumb, one should experiment with a range of initial learning rates in range Œ0; 1, e.g., 0:001, 0:01, 0:1, 1. Monitor the network’s loss over time, and decrease the learning rate once the loss stops improving on a held-out development set. Learning rate scheduling decreases the rate as a function of the number of observed minibatches. A common schedule is dividing the initial learning rate by the iteration number. Léon Bottou [2012] recommends using a learning rate of the form Át D Á0.1 C Á0 t / 1 where Á0 is the initial learning rate, Át is the learning rate to use on the tth training example, and is an additional hyperparameter. He further recommends determining a good value of Á0 based on a small sample of the data prior to running on the entire dataset. 5.2.8 MINIBATCHES Parameter updates occur either every training example (minibatches of size 1) or every k training examples. Some problems benefit from training with larger minibatch sizes. In terms of the computation graph abstraction, one can create a computation graph for each of the k training examples, and then connecting the k loss nodes under an averaging node, whose output will be the loss of the minibatch. Large minibatched training can also be beneficial in terms of computation efficiency on specialized computing architectures such as GPUs, and replacing vector-matrix operations by matrix-matrix operations. is is beyond the scope of this book. PART II Working with Natural Language Data 65 CHAPTER 6 Features for Textual Data In the previous chapters we discussed the general learning problem, and saw some machine learning models and algorithms for training them. All of these models take as input vectors x and produce predictions. Up until now we assumed the vectors x are given. In language processing, the vectors x are derived from textual data, in order to reflect various linguistic properties of the text. e mapping from textual data to real valued vectors is called feature extraction or feature representation, and is done by a feature function. Deciding on the right features is an integral part of a successful machine learning project. While deep neural networks alleviate a lot of the need in feature engineering, a good set of core features still needs to be defined. is is especially true for language data, which comes in the form of a sequence of discrete symbols. is sequence needs to be converted somehow to a numerical vector, in a non-obvious way. We now diverge from the training machinery in order to discuss the feature functions that are used for language data, which will be the topic of the next few chapters. is chapter provides an overview of the common kinds of information sources that are available for use as features when dealing with textual language data. Chapter 7 discusses feature choices for some concrete NLP problems. Chapter 8 deals with encoding the features as input vectors that can be fed to a neural network. 
PART II
Working with Natural Language Data

CHAPTER 6
Features for Textual Data

In the previous chapters we discussed the general learning problem, and saw some machine learning models and algorithms for training them. All of these models take as input vectors x and produce predictions. Up until now we assumed the vectors x are given. In language processing, the vectors x are derived from textual data, in order to reflect various linguistic properties of the text. The mapping from textual data to real-valued vectors is called feature extraction or feature representation, and is done by a feature function. Deciding on the right features is an integral part of a successful machine learning project. While deep neural networks alleviate a lot of the need for feature engineering, a good set of core features still needs to be defined. This is especially true for language data, which comes in the form of a sequence of discrete symbols. This sequence needs to be converted somehow to a numerical vector, in a non-obvious way. We now diverge from the training machinery in order to discuss the feature functions that are used for language data, which will be the topic of the next few chapters. This chapter provides an overview of the common kinds of information sources that are available for use as features when dealing with textual language data. Chapter 7 discusses feature choices for some concrete NLP problems. Chapter 8 deals with encoding the features as input vectors that can be fed to a neural network.

6.1 TYPOLOGY OF NLP CLASSIFICATION PROBLEMS
Generally speaking, classification problems in natural language can be categorized into several broad categories, depending on the item being classified (some problems in natural language processing do not fall neatly into the classification framework—for example, problems in which we are required to produce sentences or longer texts, as in document summarization and machine translation; these will be discussed in Chapter 17).

Word In these problems, we are faced with a word, such as "dog," "magnificent," "magnifficant," or "parlez" and need to say something about it: Does it denote a living thing? What language is it in? How common is it? What other words are similar to it? Is it a mis-spelling of another word? And so on. These kinds of problems are actually quite rare, as words seldom appear in isolation, and for many words their interpretation depends on the context in which they are used.

Texts In these problems we are faced with a piece of text, be it a phrase, a sentence, a paragraph, or a document, and need to say something about it. Is it spam or not? Is it about politics or sports? Is it sarcastic? Is it positive, negative, or neutral (toward some issue)? Who wrote it? Is it reliable? Which of a fixed set of intents does this text reflect (or none)? Will this text be liked by 16–18 year-old males? And so on. These types of problems are very common, and we'll refer to them collectively as document classification problems.

Paired Texts In these problems we are given a pair of words or longer texts, and need to say something about the pair. Are words A and B synonyms? Is word A a valid translation for word B? Are documents A and B written by the same author? Can the meaning of sentence A be inferred from sentence B?

Word in Context Here, we are given a piece of text, and a particular word (or phrase, or letter, etc.) within it, and we need to classify the word in the context of the text. For example, is the word book in I want to book a flight a noun, a verb, or an adjective? Is the word apple in a given context referring to a company or a fruit? Is on the right preposition to use in I read a book on London? Does a given period denote a sentence boundary or an abbreviation? Is the given word part of a name of a person, location, or organization? And so on. These types of questions often arise in the context of larger goals, such as annotating a sentence for parts-of-speech, splitting a document into sentences, finding all the named entities in a text, finding all documents mentioning a given entity, and so on.

Relation between two words Here we are given two words or phrases within the context of a larger document, and need to say something about the relations between them. Is word A the subject of verb B in a given sentence? Does the "purchase" relation hold between words A and B in a given text? And so on.

Many of these classification cases can be extended to structured problems in which we are interested in performing several related classification decisions, such that the answer to one decision can influence others. These are discussed in Chapter 19.

What is a word? We are using the term word rather loosely. The question "what is a word?" is a matter of debate among linguists, and the answer is not always clear. One definition (which is the one being loosely followed in this book) is that words are sequences of letters that are separated by whitespace. This definition is very simplistic.
First, punctuation in English is not separated by whitespace, so according to our definition dog, dog?, dog. and dog) are all different words. Our corrected definition is then words separated by whitespace or punctuation. A process called tokenization is in charge of splitting text into tokens (what we call here words) based on whitespace and punctuation. In English, the job of the tokenizer is quite simple, although it does need to consider cases such as abbreviations (I.B.M) and titles (Mr.) that needn't be split. In other languages, things can become much trickier: in Hebrew and Arabic some words attach to the next one without whitespace, and in Chinese there are no whitespaces at all. These are just a few examples.

When working in English or a similar language (as this book assumes), tokenizing on whitespace and punctuation (while handling a few corner cases) can provide a good approximation of words. However, our definition of word is still quite technical: it is derived from the way things are written. Another common (and better) definition takes a word to be "the smallest unit of meaning." By following this definition, we see that our whitespace-based definition is problematic. After splitting by whitespace and punctuation, we still remain with sequences such as don't, which are actually two words, do not, that got merged into one symbol. It is common for English tokenizers to handle these cases as well. The symbols cat and Cat have the same meaning, but are they the same word? More interestingly, take something like New York: is it two words, or one? What about ice cream? Is it the same as ice-cream or icecream? And what about idioms such as kick the bucket?

In general, we distinguish between words and tokens. We refer to the output of a tokenizer as a token, and to the meaning-bearing units as words. A token may be composed of multiple words, multiple tokens can be a single word, and sometimes different tokens denote the same underlying word. Having said that, in this book, we use the term word very loosely, and take it to be interchangeable with token. It is important to keep in mind, however, that the story is more complex than that.

6.2 FEATURES FOR NLP PROBLEMS
In what follows, we describe the common features that are used for the above problems. As words and letters are discrete items, our features often take the form of indicators or counts. An indicator feature takes a value of 0 or 1, depending on the existence of a condition (e.g., a feature taking the value of 1 if the word dog appeared at least once in the document, and 0 otherwise). A count takes a value depending on the number of times some event occurred, e.g., a feature indicating the number of times the word dog appears in the text.

6.2.1 DIRECTLY OBSERVABLE PROPERTIES
Features for Single Words When our focus entity is a word outside of a context, our main source of information is the letters comprising the word and their order, as well as properties derived from these such as the length of the word, the orthographic shape of the word (Is the first letter capitalized? Are all letters capitalized? Does the word include a hyphen? Does it include a digit? And so on), and prefixes and suffixes of the word (Does it start with un? Does it end with ing?). We may also look at the word with relation to external sources of information: How many times does the word appear in a large collection of text? Does the word appear in a list of common person names in the U.S.? And so on.
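As a small illustration of the whitespace-and-punctuation tokenization discussed above, here is a deliberately simple regex-based sketch; the regular expression and the handled abbreviations are ad hoc choices for the example, and real tokenizers handle many more corner cases:

```python
import re

# Keep abbreviations like I.B.M. and titles like Mr. together; otherwise split off punctuation.
TOKEN_RE = re.compile(r"[A-Za-z]\.(?:[A-Za-z]\.)+|Mr\.|Mrs\.|Dr\.|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mr. Smith visited I.B.M. headquarters, didn't he?"))
# ['Mr.', 'Smith', 'visited', 'I.B.M.', 'headquarters', ',', 'didn', "'", 't', 'he', '?']
# A real English tokenizer would further map didn't to the two tokens did and n't.
```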
Lemmas and Stems We often look at the lemma (the dictionary entry) of the word, mapping forms such as booking, booked, books to their common lemma book. This mapping is usually performed using lemma lexicons or morphological analyzers, which are available for many languages. The lemma of a word can be ambiguous, and lemmatizing is more accurate when the word is given in context. Lemmatization is a linguistically defined process, and may not work well for forms that are not in the lemmatization lexicon, or for mis-spellings. A coarser process than lemmatization, which can work on any sequence of letters, is called stemming. A stemmer maps words to shorter sequences of letters, based on some language-specific heuristics, such that different inflections will map to the same sequence. Note that the result of stemming need not be a valid word: picture, pictures, and pictured will all be stemmed to pictur. Various stemmers exist, with different levels of aggressiveness.

Lexical Resources An additional source of information about word forms are lexical resources. These are essentially dictionaries that are meant to be accessed programmatically by machines rather than read by humans. A lexical resource will typically contain information about words, linking them to other words and/or providing additional information. For example, for many languages there are lexicons that map inflected word forms to their possible morphological analyses (i.e., telling you that a certain word may be either a plural feminine noun or a past-perfect verb). Such lexicons will typically also include lemma information.

A very well-known lexical resource in English is WordNet [Fellbaum, 1998]. WordNet is a very large manually curated dataset attempting to capture conceptual semantic knowledge about words. Each word belongs to one or several synsets, where each synset describes a cognitive concept. For example, the word star as a noun belongs to the synsets astronomical celestial body, someone who is dazzlingly skilled, any celestial body visible from earth, and an actor who plays a principal role, among others. The second synset of star also contains the words ace, adept, champion, sensation, maven, and virtuoso, among others. Synsets are linked to each other by means of semantic relations such as hypernymy and hyponymy (more specific or less specific words). For example, for the first synset of star these would include sun and nova (hyponyms) and celestial body (hypernym). Other semantic relations in WordNet include antonyms (opposite words) and holonyms and meronyms (part-whole and whole-part relations). WordNet contains information about nouns, verbs, adjectives, and adverbs.

FrameNet [Fillmore et al., 2004] and VerbNet [Kipper et al., 2000] are manually curated lexical resources that focus around verbs, listing for many verbs the kinds of arguments they take (e.g., that giving involves the core arguments Donor, Recipient, and Theme (the thing that is being given), and may have non-core arguments such as Time, Place, Purpose, and Manner, among others). The Paraphrase Database (PPDB) [Ganitkevitch et al., 2013, Pavlick et al., 2015] is a large, automatically created dataset of paraphrases. It lists words and phrases, and for each one provides a list of words and phrases that can be used to mean roughly the same thing.

Lexical resources such as these contain a lot of information, and can serve as a good source of features. However, the means of using such symbolic information effectively is task dependent, and often requires non-trivial engineering efforts and/or ingenuity. They are currently not often used in neural network models, but this may change.
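As an illustration of accessing such a resource programmatically, the sketch below queries WordNet through NLTK's interface. It assumes the NLTK package and its WordNet data are installed, and the exact synsets printed depend on the WordNet version:

```python
from nltk.corpus import wordnet as wn  # requires: pip install nltk; nltk.download("wordnet")

# All synsets (concepts) that the noun "star" belongs to.
for synset in wn.synsets("star", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

# For one synset: its member words and its more/less specific neighbors.
s = wn.synsets("star", pos=wn.NOUN)[0]
print("lemmas:   ", s.lemma_names())   # words belonging to the synset
print("hypernyms:", s.hypernyms())     # less specific concepts
print("hyponyms: ", s.hyponyms()[:3])  # more specific concepts
```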
Distributional Information Another important source of information about words is distributional—which other words behave similarly to it in the text? This deserves its own separate treatment, and is discussed in Section 6.2.5 below. In Section 11.8, we discuss how lexical resources can be used to inject knowledge into distributional word vectors that are derived from neural network algorithms.

Features for Text When we consider a sentence, a paragraph, or a document, the observable features are the counts and the order of the letters and the words within the text.

Bag of words A very common feature extraction procedure for sentences and documents is the bag-of-words approach (BOW). In this approach, we look at the histogram of the words within the text, i.e., considering each word count as a feature. By generalizing from words to "basic elements," the bag of letter-bigrams we used in the language identification example in Section 2.3.1 is an example of the bag-of-words approach. We can also compute quantities that are directly derived from the words and the letters, such as the length of the sentence in terms of number of letters or number of words. When considering individual words, we may of course use the word-based features from above, counting for example the number of words in the document that have a specific prefix or suffix, or computing the ratio of short words (with length below a given threshold) to long words in a document.

Weighting As before, we can also integrate statistics based on external information, focusing for example on words that appear many times in the given document, yet appear relatively few times in an external set of documents (this will distinguish words that have high counts in the document because they are generally common, like a and for, from words that have a high count because they relate to the document's topic). When using the bag-of-words approach, it is common to use TF-IDF weighting [Manning et al., 2008, Chapter 6]. Consider a document d which is part of a larger corpus D. Rather than representing each word w in d by its normalized count in the document, $\frac{\#_d(w)}{\sum_{w' \in d} \#_d(w')}$ (the Term Frequency), TF-IDF weighting represents it instead by $\frac{\#_d(w)}{\sum_{w' \in d} \#_d(w')} \cdot \log \frac{|D|}{|\{d' \in D : w \in d'\}|}$. The second term is the Inverse Document Frequency: the inverse of the (relative) number of distinct documents in the corpus in which this word occurred. This highlights words that are distinctive of the current text.

Besides words, one may also look at consecutive pairs or triplets of words. These are called ngrams. Ngram features are discussed in depth in Section 6.2.4.
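To make the weighting concrete, here is a minimal sketch that computes the TF-IDF weight directly from the formula above over a toy three-document corpus; the function and variable names are our own, not taken from any library:

```python
import math
from collections import Counter

def tf_idf(word, doc, corpus):
    """TF-IDF weight of `word` in `doc`, where `corpus` is a list of token lists."""
    counts = Counter(doc)
    tf = counts[word] / sum(counts.values())               # normalized count in the document
    df = sum(1 for d in corpus if word in d)                # documents containing the word
    return tf * math.log(len(corpus) / df) if df else 0.0   # down-weight corpus-wide common words

corpus = [
    "the pasta at the restaurant was great".split(),
    "the film was great and the cast was great".split(),
    "we saw the film at the festival".split(),
]
doc = corpus[1]
for w in ["the", "film", "great"]:
    print(w, round(tf_idf(w, doc, corpus), 3))  # "the" gets weight 0; "great" gets the highest
```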
Features of Words in Context When considering a word within a sentence or a document, the directly observable features of the word are its position within the sentence, as well as the words or letters surrounding it. Words that are closer to the target word are often more informative about it than words that are further apart.¹

¹However, note that this is a gross generalization, and in many cases language exhibits long-range dependencies between words: a word at the end of a text may well be influenced by a word at the beginning.

Windows For this reason, it is common to focus on the immediate context of a word by considering a window surrounding it (i.e., k words to each side, with typical values of k being 2, 5, and 10), and take the features to be the identities of the words within the window (e.g., a feature will be "word X appeared within a window of five words surrounding the target word"). For example, consider the sentence the brown fox jumped over the lazy dog, with the target word jumped. A window of 2 words to each side will produce the set of features { word=brown, word=fox, word=over, word=the }. The window approach is a version of the bag-of-words approach, but restricted to items within the small window. The fixed size of the window gives the opportunity to relax the bag-of-words assumption that order does not matter, and take the relative positions of the words in the window into account. This results in relative-positional features such as "word X appeared two words to the left of the target word." For example, in the example above the positional window approach will result in the set of features { word-2=brown, word-1=fox, word+1=over, word+2=the }. Encoding of window-based features as vectors is discussed in Section 8.2.1. In Chapters 14 and 16 we will introduce the biRNN architecture, which generalizes window features by providing a flexible, adjustable, and trainable window.

Position Besides the context of the word, we may be interested in its absolute position within a sentence. We could have features such as "the target word is the 5th word in the sentence," or a binned version indicating more coarse-grained categories: does it appear within the first 10 words, between words 10 and 20, and so on.

Features for Word Relations When considering two words in context, besides the position of each one and the words surrounding them, we can also look at the distance between the words and the identities of the words that appear between them.
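The window and relative-positional features described above are straightforward to extract from a tokenized sentence; the following is a minimal sketch, where the feature-string formats are our own conventions for the example:

```python
def window_features(words, i, k=2, positional=False):
    """Features for the word at index i, from a window of k words to each side."""
    feats = []
    for offset in range(-k, k + 1):
        if offset == 0 or not (0 <= i + offset < len(words)):
            continue  # skip the target word itself and positions outside the sentence
        if positional:
            feats.append("word%+d=%s" % (offset, words[i + offset]))
        else:
            feats.append("word=%s" % words[i + offset])
    return feats

sent = "the brown fox jumped over the lazy dog".split()
print(window_features(sent, sent.index("jumped")))
# ['word=brown', 'word=fox', 'word=over', 'word=the']
print(window_features(sent, sent.index("jumped"), positional=True))
# ['word-2=brown', 'word-1=fox', 'word+1=over', 'word+2=the']
```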
6.2.2 INFERRED LINGUISTIC PROPERTIES
Sentences in natural language have structures beyond the linear order of their words. The structure follows an intricate set of rules that are not directly observable to us. These rules are collectively referred to as syntax, and the study of the nature of these rules and regularities in natural language is the object of study of linguistics.² While the exact structure of language is still a mystery, and the rules governing many of the more intricate patterns are either unexplored or still open for debate among linguists, a subset of phenomena governing language are well documented and well understood. These include concepts such as word classes (part-of-speech tags), morphology, syntax, and even parts of semantics. While the linguistic properties of a text are not observable directly from the surface forms of the words in sentences and their order, they can be inferred from the sentence string with varying degrees of accuracy. Specialized systems exist for the prediction of parts of speech, syntactic trees, semantic roles, discourse relations, and other linguistic properties with various degrees of accuracy,³ and these predictions often serve as good features for further classification problems.

²This last sentence is, of course, a gross simplification. Linguistics has a much wider breadth than syntax, and there are other systems that regulate human linguistic behavior besides the syntactic one. But for the purpose of this introductory book, this simplistic view will be sufficient. For a more in-depth overview, see the further reading recommendations at the end of this section.
³Indeed, for many researchers, improving the prediction of these linguistic properties is the natural language processing problem they are trying to solve.

Linguistic Annotation Let's explore some forms of linguistic annotations. Consider the sentence the boy with the black shirt opened the door with a key. One level of annotation assigns to each word its part of speech:

the/Det boy/Noun with/Prep the/Det black/Adj shirt/Noun opened/Verb the/Det door/Noun with/Prep a/Det key/Noun

Going further up the chain, we mark syntactic chunk boundaries, indicating that the boy is a noun phrase:

[NP the boy ] [PP with ] [NP the black shirt ] [VP opened ] [NP the door ] [PP with ] [NP a key ]

Note that the word opened is marked as a verbal chunk (VP). This may not seem very useful because we already know it's a verb. However, VP chunks may contain more elements, covering also cases such as will open and did not open. The chunking information is local. A more global syntactic structure is a constituency tree, also called a phrase-structure tree; in bracketed form, the tree for our sentence is:

(S (NP (DT the) (NN boy) (PP (IN with) (NP (DT the) (JJ black) (NN shirt)))) (VP (VBD opened) (NP (DT the) (NN door)) (PP (IN with) (NP (DT a) (NN key)))))

Constituency trees are nested, labeled bracketings over the sentence, indicating the hierarchy of syntactic units: the noun phrase the boy with the black shirt is made of the noun phrase the boy and the preposition phrase (PP) with the black shirt. The latter itself contains the noun phrase the black shirt. Having with a key nested under the VP and not under the NP the door signals that with a key modifies the verb opened (opened with a key) rather than the NP (a door with a key).

A different kind of syntactic annotation is a dependency tree. Under dependency syntax, each word in the sentence is a modifier of another word, which is called its head. Each word in the sentence is headed by another sentence word, except for the main word, usually a verb, which is the root of the sentence and is headed by a special "root" node. For our sentence, the dependency tree consists of the labeled arcs det(boy, the), nsubj(opened, boy), prep(boy, with), pobj(with, shirt), det(shirt, the), amod(shirt, black), root(opened), dobj(opened, door), det(door, the), prep(opened, with), pobj(with, key), and det(key, a).

While constituency trees make explicit the grouping of words into phrases, dependency trees make explicit the modification relations and connections between words. Words that are far apart in the surface form of the sentence may be close in its dependency tree. For example, boy and opened have four words between them in the surface form, but have a direct nsubj edge connecting them in the dependency tree.

The dependency relations are syntactic: they are concerned with the structure of the sentence. Other kinds of relations are more semantic. For example, consider the modifiers of the verb open, also called the arguments of the verb. The syntactic tree clearly marks the boy (with the black shirt), the door, and with a key as arguments, and also tells us that with a key is an argument of open rather than a modifier of door. It does not tell us, however, what the semantic roles of the arguments with respect to the verb are, i.e., that the boy is the Agent performing the action, and that a key is an Instrument (compare that to the boy opened the door with a smile.
Here, the sentence will have the same syntactic structure, but, unless we are in a magical world, a smile is a Manner rather than an Instrument). The semantic role labeling annotations reveal these structures:

[Agent the boy with the black shirt ] [Predicate opened ] [Patient the door ] [Instrument with a key ]
[Agent the boy with the black shirt ] [Predicate opened ] [Patient the door ] [Manner with a smile ]

Besides the observable properties (letters, words, counts, lengths, linear distances, frequencies, etc.), we can also look at such inferred linguistic properties of words, sentences, and documents. For example, we could look at the part-of-speech tag (POS) of a word within a document (Is it a noun, a verb, an adjective, or a determiner?), the syntactic role of a word (Does it serve as a subject or an object of a verb? Is it the main verb of the sentence? Is it used as an adverbial modifier?), or its semantic role (e.g., in "the key opened the door," key acts as an Instrument, while in "the boy opened the door" boy is an Agent). When given two words in a sentence, we can consider the syntactic dependency tree of the sentence, and the subtree or paths that connect the two words within this tree, as well as properties of that path. Words that are far apart in the sentence in terms of the number of words separating them can be close to each other in the syntactic structure.

When moving beyond the sentence, we may want to look at the discourse relations that connect sentences together, such as Elaboration, Contrast, Cause-Effect, and so on. These relations are often expressed by discourse-connective words such as moreover, however, and and, but are also expressed with less direct cues.

Another important phenomenon is that of anaphora—consider the sentence sequence the boy opened the door with a key. It1 wasn't locked and he2 entered the room. He3 saw a man. He4 was smiling. Anaphora resolution (also called coreference resolution) will tell us that It1 refers to the door (and not the key or the boy), he2 and He3 refer to the boy, and He4 is likely to refer to the man.

Part-of-speech tags, syntactic roles, discourse relations, anaphora, and so on are concepts that are based on linguistic theories that were developed by linguists over a long period of time, with the aim of capturing the rules and regularities in the very messy system of the human language. While many aspects of the rules governing language are still open for debate, and others may seem overly rigid or simplistic, the concepts explored here (and others) do indeed capture a wide and important array of generalizations and regularities in language.

Are linguistic concepts needed? Some proponents of deep learning argue that such inferred, manually designed, linguistic properties are not needed, and that the neural network will learn these intermediate representations (or equivalent, or better ones) on its own. The jury is still out on this. My current personal belief is that many of these linguistic concepts can indeed be inferred by the network on its own if given enough data and perhaps a push in the right direction.⁴ However, for many other cases we do not have enough training data available for the task we care about, and in these cases providing the network with the more explicit general concepts can be very valuable.
Even if we do have enough data, we may want to focus the network on certain aspects of the text and hint to it that it should ignore others, by providing the generalized concepts in addition to, or even instead of, the surface forms of the words. Finally, even if we do not use these linguistic properties as input features, we may want to help guide the network by using them as additional supervision in a multi-task learning setup (see Chapter 20), or by designing network architectures or training paradigms that are more suitable for learning certain linguistic phenomena. Overall, we see enough evidence that the use of linguistic concepts helps improve language understanding and production systems.

⁴See, for example, the experiment in Section 16.1.2 in which a neural network learns the concept of subject-verb agreement in English, inferring the concepts of nouns, verbs, grammatical number, and some hierarchical linguistic structures.

Further Reading When dealing with natural language text, it is well advised to be aware of the linguistic concepts beyond letters and words, as well as of the current computational tools and resources that are available. This book barely scratches the surface of this topic. The book by Bender [2013] provides a good and concise overview of linguistic concepts directed at computationally minded people. For a discussion of current NLP methods, tools, and resources, see the book by Jurafsky and Martin [2008] as well as the various specialized titles in this series.⁵

⁵Syntactic dependency structures are discussed in Kübler et al. [2008] and semantic roles in Palmer et al. [2010].

6.2.3 CORE FEATURES VS. COMBINATION FEATURES
In many cases, we are interested in a conjunction of features occurring together. For example, knowing the two indicators "the word book appeared in a window" and "the part-of-speech Verb appeared in a window" is strictly less informative than knowing "the word book with the assigned part of speech Verb appeared in a window." Similarly, if we assign a distinct parameter weight to each indicator feature (as is the case in linear models), then knowing that the two distinct features "word in position 1 is like" and "word in position 2 is not" occur is almost useless compared to the very indicative combined indicator "word in position 1 is like and word in position 2 is not." Similarly, knowing that a document contains the word Paris is an indication toward the document being in the Travel category, and the same holds for the word Hilton. However, if the document contains both words, it is an indication away from the Travel category and toward the Celebrity or Gossip categories. Linear models cannot assign a score to a conjunction of events (X occurred and Y occurred and …) that is not a sum of their individual scores, unless the conjunction itself is modeled as its own feature. Thus, when designing features for a linear model, we must define not only the core features but also many combination features.⁶ The set of possible combinations is very large, and human expertise, coupled with trial and error, is needed in order to construct a set of combinations that is both informative and relatively compact. Indeed, a lot of effort has gone into design decisions such as "include features of the form word at position -1 is X and at position +1 is Y but do not include features of the form word at position -3 is X and at position -1 is Y."

⁶This is a direct manifestation of the XOR problem discussed in Chapter 3, and the manually defined combination features are the mapping function that maps the nonlinearly separable vectors of core features to a higher-dimensional space in which the data is more likely to be separable by a linear model.
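To make the manual effort concrete, the following sketch expands a small set of core features into explicit pairwise combination features for a linear model; the feature-string conventions are assumptions made for the example:

```python
from itertools import combinations

def add_pairwise_combinations(core_feats):
    """Return the core features plus an explicit conjunction feature for every pair."""
    combined = list(core_feats)
    for f1, f2 in combinations(sorted(core_feats), 2):
        combined.append(f1 + "&" + f2)  # the conjunction gets its own weight in a linear model
    return combined

core = ["w[-1]=like", "w[0]=not", "pos[0]=ADV"]
for f in add_pairwise_combinations(core):
    print(f)
# 3 core features expand to 3 + 3 pairwise combinations;
# with n core features the number of pairs grows as n*(n-1)/2.
```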
Neural networks provide nonlinear models, and do not suffer from this problem. When using a neural network such as a multi-layer perceptron (Chapter 4), the model designer can specify only the set of core features, and rely on the network training procedure to pick up on the important combinations on its own. This greatly simplifies the work of the model designer. In practice, neural networks indeed manage to learn good classifiers based on core features only, sometimes surpassing the best linear classifier with human-designed feature combinations. However, in many other cases a linear classifier with a good hand-crafted feature set is hard to beat, with the neural network models with core features getting close to but not surpassing the linear models.

6.2.4 NGRAM FEATURES
A special case of feature combinations is that of ngrams—consecutive word sequences of a given length. We already saw letter-bigram features in the language classification case (Chapter 2). Word-bigrams, as well as trigrams (sequences of three items) of letters or words, are also common. Beyond that, 4-grams and 5-grams are sometimes used for letters, but rarely for words due to sparsity issues. It should be intuitively clear why word-bigrams are more informative than individual words: they capture structures such as New York, not good, and Paris Hilton. Indeed, a bag-of-bigrams representation is much more powerful than bag-of-words, and in many cases proves very hard to beat. Of course, not all bigrams are equally informative; bigrams such as of the, on a, and the boy are very common and, for most tasks, not more informative than their individual components. However, it is very hard to know a priori which ngrams will be useful for a given task. The common solution is to include all ngrams up to a given length, and let the model regularization discard the less interesting ones by assigning them very low weights.

Note that vanilla neural network architectures such as the MLP cannot infer ngram features from a document on their own in the general case: a multi-layer perceptron fed with a bag-of-words feature vector of a document could learn combinations such as "word X appears in the document and word Y appears in the document" but not "the bigram X Y appears in the document." Thus, ngram features are useful also in the context of nonlinear classification. Multi-layer perceptrons can infer ngrams when applied to fixed-size windows with positional information—the combination of "word at position 1 is X" and "word at position 2 is Y" is in effect the bigram X Y. More specialized neural network architectures such as convolutional networks (Chapter 13) are designed to find informative ngram features for a given task based on a sequence of words of varying lengths. Bidirectional RNNs (Chapters 14 and 16) generalize the ngram concept even further, and can be sensitive to informative ngrams of varying lengths, as well as ngrams with gaps in them.
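A minimal sketch of the common solution described above—extracting all ngrams up to a given length as count features; the function name and feature strings are our own choices:

```python
from collections import Counter

def bag_of_ngrams(tokens, max_n=2):
    """Counts of all 1..max_n grams in the token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

tokens = "the movie was not good , not good at all".split()
feats = bag_of_ngrams(tokens, max_n=2)
print(feats["not good"])  # 2 -- the informative bigram is captured
print(feats["good"])      # 2 -- the unigram alone misses the negation
```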
6.2.5 DISTRIBUTIONAL FEATURES
Up until now our treatment of words was as discrete and unrelated symbols: the words pizza, burger, and chair are all equally similar (and equally dissimilar) to each other as far as the algorithm is concerned. We did achieve some form of generalization across word types by mapping them to coarser-grained categories such as parts-of-speech or syntactic roles ("the, a, an, some are all determiners"); generalizing from inflected word forms to their lemmas ("book, booking, booked all share the lemma book"); looking at membership in lists or dictionaries ("John, Jack, and Ralph appear in a list of common U.S. first names"); or looking at their relation to other words using lexical resources such as WordNet. However, these solutions are quite limited: they either provide very coarse-grained distinctions, or otherwise rely on specific, manually compiled dictionaries. Unless we have a specialized list of foods, we will not learn that pizza is more similar to burger than it is to chair, and it will be even harder to learn that pizza is more similar to burger than it is to icecream.

The distributional hypothesis of language, set forth by Firth [1957] and Harris [1954], states that the meaning of a word can be inferred from the contexts in which it is used. By observing co-occurrence patterns of words across a large body of text, we can discover that the contexts in which burger occurs are quite similar to those in which pizza occurs, less similar to those in which icecream occurs, and very different from those in which chair occurs. Many algorithms were derived over the years to make use of this property, and learn generalizations of words based on the contexts in which they occur. These can be broadly categorized into clustering-based methods, which assign similar words to the same cluster and represent each word by its cluster membership [Brown et al., 1992, Miller et al., 2004], and embedding-based methods, which represent each word as a vector such that similar words (words having a similar distribution) have similar vectors [Collobert and Weston, 2008, Mikolov et al., 2013b]. Turian et al. [2010] discuss and compare these approaches. These algorithms uncover many facets of similarity between words, and can be used to derive good word features: for example, one could replace words by their cluster ID (e.g., replacing both the words June and aug by cluster732), replace rare or unseen words with the common word most similar to them, or just use the word vector itself as the representation of the word. However, care must be taken when using such word-similarity information, as it can have unintended consequences. For example, in some applications it is very useful to treat London and Berlin as similar, while for others (for example when booking a flight or translating a document) the distinction is crucial. We will discuss word embedding methods and the use of word vectors in more detail in Chapters 10 and 11.

CHAPTER 7
Case Studies of NLP Features

After discussing the different sources of information available to us for deriving features from natural language text, we will now explore examples of concrete NLP classification tasks, and suitable features for them.
While the promise of neural networks is to alleviate the need for manual feature engineering, we still need to take these sources of information into consideration when designing our models: we want to make sure that the network we design can make effective use of the available signals, either by giving it direct access to them through feature engineering; by designing the network architecture to expose the needed signals; or by adding them as additional loss signals when training the models.¹

¹Additionally, linear or log-linear models with manually designed features are still very effective for many tasks. They can be very competitive in terms of accuracy, as well as being very easy to train and deploy at scale, and easier to reason about and debug than neural networks. If nothing else, such models should be considered as strong baselines for whatever networks you are designing.

7.1 DOCUMENT CLASSIFICATION: LANGUAGE IDENTIFICATION
In the language identification task, we are given a document or a sentence, and want to classify it into one of a fixed set of languages. As we saw in Chapter 2, a bag of letter-bigrams is a very strong feature representation for this task. Concretely, each possible letter-bigram (or each letter-bigram appearing at least k times in at least one language) is a core feature, and the value of a core feature for a given document is the count of that feature in the document. A similar task is that of encoding detection. Here, a good feature representation is a bag of byte-bigrams.

7.2 DOCUMENT CLASSIFICATION: TOPIC CLASSIFICATION
In the topic classification task, we are given a document and need to classify it into one of a predefined set of topics (e.g., Economy, Politics, Sports, Leisure, Gossip, Lifestyle, Other). Here, the letter level is not very informative, and our basic units will be words. Word order is not very informative for this task (except maybe for consecutive word pairs such as bigrams). Thus, a good set of features will be the bag-of-words in the document, perhaps accompanied by a bag of word-bigrams (each word and each word-bigram is a core feature).

If we do not have many training examples, we may benefit from pre-processing the document by replacing each word with its lemma. We may also replace or supplement words by distributional features such as word clusters or word-embedding vectors. When using a linear classifier, we may want to also consider word pairs, i.e., consider each pair of words (not necessarily consecutive) that appear in the same document as a core feature. This will result in a huge number of potential core features, and the number will need to be trimmed down by designing some heuristic, such as considering only word pairs that appear in a specified number of documents. Nonlinear classifiers alleviate this need.

When using a bag-of-words, it is sometimes useful to weight each word in proportion to its informativeness, for example using TF-IDF weighting (Section 6.2.1). However, the learning algorithm is often capable of coming up with the weighting on its own. Another option is to use word indicators rather than word counts: each word in the document (or each word above a given count) will be represented once, regardless of its number of occurrences in the document.
7.3 DOCUMENT CLASSIFICATION: AUTHORSHIP ATTRIBUTION
In the authorship attribution task [Koppel et al., 2009] we are given a text and need to infer the identity of its author (from a fixed set of possible authors), or other characteristics of the author of the text, such as their gender, their age, or their native language. The kind of information used to solve this task is very different from that used for topic classification—the clues are subtle, and involve stylistic properties of the text rather than content words. Thus, our choice of features should shy away from content words and focus on more stylistic properties.²

A good feature set for such tasks focuses on part-of-speech (POS) tags and function words. These are words like on, of, the, and, and before that do not carry much content on their own but rather serve to connect to content-bearing words and assign meanings to their compositions, as well as pronouns (he, she, I, they, etc.). A good approximation of function words is the list of the top-300 or so most frequent words in a large corpus. By focusing on such features, we can learn to capture subtle stylistic variations in writing that are unique to an author and very hard to fake.

A good feature set for the authorship attribution task includes a bag of function words and pronouns, a bag of POS tags, and bags of POS bigrams, trigrams, and 4-grams. Additionally, we may want to consider the density of function words (i.e., the ratio between the number of function words and content words in a window of text), a bag of bigrams of function words after removing the content words, and the distributions of the distances between consecutive function words.

²One could argue that for age or gender identification, we may as well also observe the content words, as there are strong correlations between the age and gender of a person and the topics they write about and the language register they use. This is generally true, but if we are interested in a forensic or adversarial setting in which the author has an incentive to hide their age or gender, we had better not rely on content-based features, as these are rather easy to fake, compared to the more subtle stylistic cues.

7.4 WORD-IN-CONTEXT: PART OF SPEECH TAGGING
In the part-of-speech tagging task, we are given a sentence, and need to assign the correct part-of-speech to each word in the sentence. The parts of speech come from a pre-defined set; for this example, assume we will be using the tagset of the Universal Treebank Project [McDonald et al., 2013, Nivre et al., 2015], containing 17 tags.³

Part-of-speech tagging is usually modeled as a structured task—the tag of the first word may depend on the tag of the third one—but it can be approximated quite well by classifying each word in isolation into a POS tag based on a window of two words to each side of the word. If we tag the words in a fixed order, for example from left to right, we can also condition each tagging prediction on tag predictions made for previous words. Our feature function when classifying a word wi has access to all the words in the sentence (and their letters) as well as all the previous tagging decisions (i.e., the assigned tags for words w1, …, wi-1). Here, we discuss features as if they are used in an isolated classification task. In Chapter 19 we discuss the structured learning case—using the same set of features.
The sources of information for the POS tagging task can be divided into intrinsic cues (based on the word itself) and extrinsic cues (based on its context). Intrinsic cues include the identity of the word (some words are more likely than others to be nouns, for example), prefixes, suffixes, and the orthographic shape of the word (in English, words ending in -ed are likely past-tense verbs, words starting with un- are likely to be adjectives, and words starting with a capital letter are likely to be proper names), and the frequency of the word in a large corpus (for example, rare words are more likely to be nouns). Extrinsic cues include the word identities, prefixes, and suffixes of words surrounding the current word, as well as the part-of-speech predictions for the previous words.

Overlapping features If we have the word form as a feature, why do we need the prefixes and suffixes? After all, they are deterministic functions of the word. The reason is that if we encounter a word that we have not seen in training (an out-of-vocabulary or OOV word) or a word we've seen only a handful of times in training (a rare word), we may not have robust enough information to base a decision on. In such cases, it is good to back off to the prefixes and suffixes, which can provide useful hints. By including the prefix and suffix features also for words that are observed many times in training, we allow the learning algorithms to better adjust their weights, and hopefully use them properly when encountering OOV words.

³adjective, adposition, adverb, auxiliary verb, coordinating conjunction, determiner, interjection, noun, numeral, particle, pronoun, proper noun, punctuation, subordinating conjunction, symbol, verb, other.

An example of a good set of core features for POS tagging is:
• word=X
• 2-letter-suffix=X
• 3-letter-suffix=X
• 2-letter-prefix=X
• 3-letter-prefix=X
• word-is-capitalized
• word-contains-hyphen
• word-contains-digit
• for P in [-2, -1, +1, +2]:
  – word at position P=X
  – 2-letter-suffix of word at position P=X
  – 3-letter-suffix of word at position P=X
  – 2-letter-prefix of word at position P=X
  – 3-letter-prefix of word at position P=X
  – word at position P is capitalized
  – word at position P contains hyphen
  – word at position P contains digit
• predicted POS of word at position -1=X
• predicted POS of word at position -2=X

In addition to these, distributional information such as word clusters or word-embedding vectors of the word and of surrounding words can also be useful, especially for words not seen in the training corpus, as words with similar POS tags tend to occur in more similar contexts to each other than words of different POS tags.
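The core feature set listed above translates almost directly into code. The following sketch extracts most of it for a single word position; the feature-string formats and the padding convention at sentence boundaries are assumptions made for the example:

```python
def pos_core_features(words, i, prev_tags):
    """Core features for tagging the word at index i; prev_tags holds predictions so far."""
    pad = lambda j: words[j] if 0 <= j < len(words) else "*PAD*"
    w = words[i]
    feats = [
        "word=" + w,
        "suf2=" + w[-2:], "suf3=" + w[-3:],
        "pre2=" + w[:2], "pre3=" + w[:3],
        "is-cap=" + str(w[0].isupper()),
        "has-hyphen=" + str("-" in w),
        "has-digit=" + str(any(c.isdigit() for c in w)),
    ]
    for p in (-2, -1, +1, +2):  # context words in a window of two to each side
        feats.append("word[%+d]=%s" % (p, pad(i + p)))
    feats.append("prev-tag=" + (prev_tags[i - 1] if i >= 1 else "*START*"))
    feats.append("prev2-tag=" + (prev_tags[i - 2] if i >= 2 else "*START*"))
    return feats

sent = "the brown Fox-like dog jumped".split()
print(pos_core_features(sent, 2, prev_tags=["DET", "ADJ"]))
```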
7.5 WORD-IN-CONTEXT: NAMED ENTITY RECOGNITION
In the named-entity recognition (NER) task we are given a document and need to find named entities such as Milan, John Smith, McCormik Industries, and Paris, as well as to categorize them into a pre-defined set of categories such as Location, Organization, Person, or Other. Note that this task is context dependent, as Milan can be a location (the city) or an organization (a sports team, "Milan played against Barsa Wednesday evening"), and Paris can be the name of a city or a person. A typical input to the problem would be a sentence such as:

John Smith , president of McCormik Industries visited his niece Paris in Milan , reporters say .

and the expected output would be:

[PER John Smith ] , president of [ORG McCormik Industries ] visited his niece [PER Paris ] in [LOC Milan ] , reporters say .

While NER is a sequence segmentation task—it assigns labeled brackets over non-overlapping sentence spans—it is often modeled as a sequence tagging task, like POS tagging. The use of tagging to solve segmentation tasks is performed using BIO-encoded tags.⁴ Each word is assigned one of the following tags, as seen in Table 7.1:

Table 7.1: BIO tags for named entity recognition
O        Not part of a named entity
B-PER    First word of a person name
I-PER    Continuation of a person name
B-LOC    First word of a location name
I-LOC    Continuation of a location name
B-ORG    First word of an organization name
I-ORG    Continuation of an organization name
B-MISC   First word of another kind of named entity
I-MISC   Continuation of another kind of named entity

⁴Variants on the BIO tagging scheme are explored in the literature, and some perform somewhat better than it. See Lample et al. [2016], Ratinov and Roth [2009].

The sentence above would be tagged as:

John/B-PER Smith/I-PER ,/O president/O of/O McCormik/B-ORG Industries/I-ORG visited/O his/O niece/O Paris/B-PER in/O Milan/B-LOC ,/O reporters/O say/O ./O

The translation from non-overlapping segments to BIO tags and back is straightforward. Like POS tagging, the NER task is a structured one, as tagging decisions for different words interact with each other (it is more likely to remain within the same entity type than to switch, and it is more likely to tag "John Smith Inc." as B-ORG I-ORG I-ORG than as B-PER I-PER B-ORG). However, we again assume it can be approximated reasonably well using independent classification decisions.

The core feature set for the NER task is similar to that of the POS-tagging task, and relies on words within a 2-word window to each side of the focus word. In addition to the features of the POS-tagging task, which are useful for NER as well (e.g., -ville is a suffix indicating a location, Mc- is a prefix indicating a person), we may want to also consider the identities of the words that surround other occurrences of the same word in the text, as well as indicator functions that check whether the word occurs in pre-compiled lists of persons, locations, and organizations. Distributional features such as word clusters or word vectors are also extremely useful for the NER task. For a comprehensive discussion of features for NER, see Ratinov and Roth [2009].
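The translation from labeled segments to BIO tags mentioned above takes only a few lines of code. The sketch below assumes the segments are given as (start, end, label) spans over the token list—an input convention chosen for the example:

```python
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, label) with end exclusive, non-overlapping."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label        # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label        # continuation words
    return tags

tokens = "John Smith , president of McCormik Industries visited his niece Paris in Milan".split()
spans = [(0, 2, "PER"), (5, 7, "ORG"), (10, 11, "PER"), (12, 13, "LOC")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
```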
7.6 WORD IN CONTEXT, LINGUISTIC FEATURES: PREPOSITION SENSE DISAMBIGUATION
Prepositions, words like on, in, with, and for, serve for connecting predicates with their arguments and nouns with their prepositional modifiers. Prepositions are very common, and also very ambiguous. Consider, for example, the word for in the following sentences:

(1) a. We went there for lunch.
    b. He paid for me.
    c. We ate for two hours.
    d. He would have left for home, but it started raining.

The word for plays a different role in each of them: in (a) it indicates a Purpose, in (b) a Beneficiary, in (c) a Duration, and in (d) a Location. In order to fully understand the meaning of a sentence, one should arguably know the correct senses of the prepositions within it. The preposition-sense disambiguation task deals with assigning the correct sense to a preposition in context, from a finite inventory of senses. Schneider et al. [2015, 2016] discuss the task, present a unified sense inventory that covers many prepositions, and provide a small annotated corpus of sentences from online reviews, covering 4,250 preposition mentions, each annotated with its sense.⁵

⁵Earlier sense inventories and annotated corpora for the task are also available. See, for example, Litkowski and Hargraves [2005, 2007], Srikumar and Roth [2013a].

What is a good set of features for the preposition sense disambiguation task? We follow here a feature set inspired by the work of Hovy et al. [2010]. Obviously, the preposition itself is a useful feature (the distribution of possible senses for in is very different from the distribution of senses for with or about, for example). Besides that, we will look at the context in which the word occurs. A fixed window around the preposition may not be ideal in terms of information content, though. Consider, for example, the following sentences:

(2) a. He liked the round object from the very first time he saw it.
    b. He saved the round object from him the very first time they saw it.

The two instances of from have different senses, but most of the words in a window around the word are either not informative or even misleading. We need a better mechanism for selecting informative contexts. One option would be to use a heuristic, such as "the first verb on the left" and "the first noun on the right." These will capture the triplets ⟨liked, from, time⟩ and ⟨saved, from, him⟩, which indeed contain the essence of the preposition sense. In linguistic terms, we say that this heuristic helps us capture the governor and the object of the preposition. By knowing the identity of the preposition, as well as its governor and object, humans can in many cases infer the sense of the preposition, using reasoning processes about the fine-grained semantics of the words.

The heuristic for extracting the object and governor requires the use of a POS tagger in order to identify the nouns and verbs. It is also somewhat brittle—it is not hard to imagine cases in which it fails. We could refine the heuristic with more rules, but a more robust approach would be to use a dependency parser: the governor and object information is easily readable from the syntactic tree (in the dependency tree of sentence (2a), from attaches to the verb liked and takes time as its object), reducing the need for complex heuristics. Of course, the parser used for producing the tree may be wrong too. For robustness, we may look at both the governor and object extracted from the parser and the governor and object extracted using the heuristic, and use all four as sources for features (i.e., parse_gov=X, parse_obj=Y, heur_gov=Z, heur_obj=W), letting the learning process decide which of the sources is more reliable and how to balance between them.

After extracting the governor and the object (and perhaps also words adjacent to the governor and the object), we can use them as the basis for further feature extraction. For each of the items, we could extract the following pieces of information:
• the actual surface form of the word;
• the lemma of the word;
• the part-of-speech of the word;
• prefixes and suffixes of the word (indicating adjectives of degree, number, order, etc., such as ultra-, poly-, post-, as well as some distinctions between agentive and non-agentive verbs); and
• the word cluster or distributional vector of the word.
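Here is a minimal sketch of the "first verb on the left, first noun on the right" heuristic described above, operating on a POS-tagged sentence. It is intentionally simple and easy to break, which is exactly why one would back it up with a parser; the tags in the example are our own choices:

```python
def heuristic_gov_obj(tagged, prep_index):
    """tagged: list of (word, coarse POS) pairs; returns (governor, object) of the preposition."""
    gov = next((w for w, p in reversed(tagged[:prep_index]) if p == "VERB"), None)
    obj = next((w for w, p in tagged[prep_index + 1:] if p == "NOUN"), None)
    return gov, obj

tagged = [("he", "PRON"), ("liked", "VERB"), ("the", "DET"), ("round", "ADJ"),
          ("object", "NOUN"), ("from", "ADP"), ("the", "DET"), ("very", "ADV"),
          ("first", "ADJ"), ("time", "NOUN"), ("he", "PRON"), ("saw", "VERB"), ("it", "PRON")]
print(heuristic_gov_obj(tagged, prep_index=5))  # ('liked', 'time')
```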
If we allow the use of external lexical resources and don't mind greatly enlarging the feature space, Hovy et al. [2010] found the use of WordNet-based features to be helpful as well. For each of the governor and the object, we could extract many WordNet indicators, such as:
• does the word have a WordNet entry?;
• hypernyms of the first synset of the word;
• hypernyms of all synsets of the word;
• synonyms of the first synset of the word;
• synonyms of all synsets of the word;
• all terms in the definition of the word;
• the super-sense of the word (super-senses, also called lexicographer-files in the WordNet jargon, are relatively high levels in the WordNet hierarchy, indicating concepts such as being an animal, being a body part, being an emotion, being food, etc.); and
• various other indicators.

This process may result in tens or even over a hundred core features for each preposition instance, e.g., hyper_1st_syn_gov=a, hyper_all_syn_gov=a, hyper_all_syn_gov=b, hyper_all_syn_gov=c, ..., hyper_1st_syn_obj=x, hyper_all_syn_obj=y, ..., term_in_def_gov=q, term_in_def_gov=w, etc. See the work of Hovy et al. [2010] for the finer details.

The preposition-sense disambiguation task is an example of a high-level semantic classification problem, for which we need a set of features that cannot be readily inferred from the surface forms, and that can benefit from linguistic pre-processing (i.e., POS tagging and syntactic parsing) as well as from selected pieces of information from manually curated semantic lexicons.

7.7 RELATION BETWEEN WORDS IN CONTEXT: ARC-FACTORED PARSING
In the dependency parsing task, we are given a sentence and need to return a syntactic dependency tree over it, such as the tree in Figure 7.1. Each word is assigned a parent word, except for the main word of the sentence, whose parent is a special *ROOT* symbol.

Figure 7.1: Dependency tree of the sentence the boy with the black shirt opened the door with a key (the same tree shown in Section 6.2.2).

For more information on the dependency parsing task, its linguistic foundations, and approaches to its solution, see the book by Kübler et al. [2008]. One approach to modeling the task is the arc-factored approach [McDonald et al., 2005], where each of the possible n² word-word relations (arcs) is assigned a score independent of the others, and we then search for the valid tree with the maximal overall score. The score assignment is made by a trained scoring function ArcScore(h, m, sent), receiving a sentence as well as the indices h and m of two words within it that are considered as candidates for attachment (h is the index of the candidate head word and m is the index of the candidate modifier). Training the scoring function such that it works well with the search procedure will be discussed in Chapter 19. Here, we focus on the features used in the scoring function. Assume a sentence of n words w1:n and their corresponding parts-of-speech p1:n, i.e., sent = (w1, w2, …, wn, p1, p2, …, pn). When looking at an arc between words wh and wm, we can make use of the following pieces of information. We begin with the usual suspects: