{"title": "Shared Context Probabilistic Transducers", "book": "Advances in Neural Information Processing Systems", "page_first": 409, "page_last": 415, "abstract": null, "full_text": "Shared Context Probabilistic Transducers \n\nYoshua Bengio* \n\nDept. IRO , \n\nUniversite de Montreal, \n\nMontreal (QC) , Canada, H3C 3J7 \n\nbengioyOiro.umontreal.ca \n\nSamy Bengio t \nMicrocell Labs, \n\n1250 , Rene Levesque Ouest, \n\nMontreal (QC) , Canada, H3B 4W8 \n\nsamy.bengioOmicrocell.ca \n\nJean-Fran~ois Isabellet \n\nMicrocell Labs, \n\n1250, Rene Levesque Ouest , \n\nMontreal (QC), Canada, H3B 4W8 \n\njean-francois.isabelleCmicrocell.ca \n\nYoram Singer \n\nAT&T Laboratories, \n\nMurray Hill , NJ 07733, USA, \nsingerOresearch.att.com \n\nAbstract \n\nRecently, a model for supervised learning of probabilistic transduc(cid:173)\ners represented by suffix trees was introduced. However, this algo(cid:173)\nrithm tends to build very large trees, requiring very large amounts \nof computer memory. In this paper, we propose anew, more com(cid:173)\npact, transducer model in which one shares the parameters of distri(cid:173)\nbutions associated to contexts yielding similar conditional output \ndistributions . We illustrate the advantages of the proposed algo(cid:173)\nrithm with comparative experiments on inducing a noun phrase \nrecogmzer. \n\n1 \n\nIntroduction \n\nLearning algorithms for sequential data modeling are important in many applica(cid:173)\ntions such as natural language processing and time-series analysis, in which one has \nto learn a model from one or more sequences of training data. Many of these algo(cid:173)\nrithms can be cast as weighted transducers (Pereira, Riley and Sproat, 1994), which \nassociate input sequences to output sequences, with weights for each input/output \n\n* Yoshua Bengio is also with AT&T Laboratories, Holmdel, NJ 07733, USA. 
\nt This work was performed while Samy Bengio was at INRS-Telecommunication, Ile(cid:173)\n\ndes-Soeurs, Quebec, Canada, H3E IH6 \n\nt This work was performed while Jean-Franc;ois Isabelle was at INRS-Telecom(cid:173)\n\nmunication, Ile-des-Soeurs, Quebec, Canada, H3E IH6 \n\n\f410 \n\ny. Bengio, S. Bengio, J-F. Isabelle and Y. Singer \n\nsequence pair. When these weights are interpreted as probabilities, such models are \ncalled probabilistic transducers. In particular, a probabilistic transducer can rep(cid:173)\nresent the conditional probability distribution of output sequences given an input \nsequence. For example, algorithms for combining several transducers were found \nuseful in natural language and speech processing (Riley and Pereira, 1994). Very \noften, weighted transducers use an intermediate variable that represents \"context\", \nsuch as the state variable of Hidden Markov Models (Baker, 1975; Jelinek, 1976). A \nparticular type of weighted transducer, called Input/Output Hidden Markov Model, \nis one in which the input-to-context distribution and context-to-output distribution \nare represented by flexible parameterized models (such as neural networks) (Bengio \nand Frasconi, 1996) . In this paper, we will study probabilistic transducers with a \ndeterministic input-to-state mapping (i.e., a function from the past input subse(cid:173)\nquence to the current value of the context variable). One such transducer is the one \nwhich assigns a value of the context variable to every value of the past input sub(cid:173)\nsequence already seen in the data. This input-to-state mapping can be efficiently \nrepresented by a tree. Such transducers are called suffix tree transducers (Singer, \n1996). \n\nA problem with suffix tree transducers is that they tend to yield very large trees \n(whose size may grow as O(n 2 ) for a sequence of data of length n). 
For example, in the application studied in this paper, one obtains trees requiring over a gigabyte of memory. Heuristics may be used to limit the growth of the tree (e.g., by limiting the maximum depth of the context, i.e., of the tree, and by limiting the maximum number of contexts, i.e., nodes of the tree). In this paper, instead, we propose a new model for a probabilistic transducer with a deterministic input-to-state function in which this function is compactly represented, by sharing the parameters of contexts which are associated to similar output distributions. Another way to look at the proposed algorithm is that it searches for a clustering of the nodes of a suffix tree transducer. The data structure that represents the contexts is no longer a tree but a single-root acyclic directed graph.

2 Background: Suffix Tree Probabilistic Transducers

The learning algorithm for suffix tree probabilistic transducers (Singer, 1996) constructs the model P(y_1^n | x_1^n) from discrete input sequences x_1^n = {x_1, x_2, ..., x_n} to output sequences y_1^n = {y_1, y_2, ..., y_n}, where the x_t are elements of a finite alphabet Sigma_in. This distribution is represented by a tree in which each internal node may have a child for every element of Sigma_in, therefore associating a label in Sigma_in to each arc. A node at depth d is labeled with the sequence u_1^d of labels on the arcs from root to node, corresponding to a particular input context; e.g., at some position n in the sequence, a context of length d is the value u_1^d of the preceding subsequence x_{n-d+1}^n. Each node at depth d is therefore associated with a model of the output distribution in this context, P(y_n | x_{n-d+1}^n = u_1^d) (independent of n).

To obtain a local output probability for y_n (i.e., given x_1^n), one follows the longest possible path from the root to a node at depth d according to the labels x_n, x_{n-1}, ..., x_{n-d+1}.
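This longest-suffix lookup can be sketched in code as follows; this is a minimal illustration under assumed names (the `Node` class, `lookup` function, and the toy distributions are hypothetical, not the implementation used in the paper):

```python
class Node:
    """A suffix tree transducer node: children keyed by input symbol,
    plus a local output distribution P(y | context)."""
    def __init__(self, output_dist):
        self.children = {}              # input symbol -> Node
        self.output_dist = output_dist  # output symbol -> probability

def lookup(root, history):
    """Follow the longest path from the root matching the input history
    read backwards (x_n, x_{n-1}, ...); return the distribution at the
    deepest node reached."""
    node = root
    for symbol in reversed(history):
        child = node.children.get(symbol)
        if child is None:
            break
        node = child
    return node.output_dist

# Tiny two-node transducer; the probability values are illustrative only.
root = Node({"a": 0.5, "b": 0.3, "c": 0.2})
root.children["0"] = Node({"a": 0.6, "b": 0.3, "c": 0.1})

lookup(root, ["1", "0"])  # matches the "0" child, the deepest context
lookup(root, ["1"])       # no matching child: falls back to the root
```

The fallback to the root mirrors the paper's rule that the longest available context (possibly the empty one) supplies the local output model.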
The local output probability at this node is used to model y_n. Since P(y_1^T | x_1^T) can always be written prod_{n=1}^T P(y_n | x_1^n), the overall input/output conditional distribution can be decomposed, according to this model, as follows:

P(y_1^T | x_1^T) = prod_{n=1}^T P(y_n | x_{n-d(x_1^n)+1}^n),    (1)

where d(x_1^n) is the depth of the node of the tree associated with the longest suffix u_1^d = x_{n-d+1}^n of x_1^n. Figure 1 gives a simple example of a suffix tree transducer.

Figure 1: Example of a suffix tree transducer (Singer, 1996). The input alphabet is Sigma_in = {0, 1} and the output alphabet is Sigma_out = {a, b, c}. For instance, P(a|00110) = P(a|110) = 0.5.

3 Proposed Model and Learning Algorithm

In the model proposed here, the input/output conditional distribution P(y_1^T | x_1^T) is represented by a single-root acyclic directed graph. Each node of this graph is associated with a set of contexts C_node = {u_1^{d_i}}, corresponding to all the paths i (of various lengths d_i) from the root of the graph to this node. All these contexts are associated with the same local output distribution P(y_n | x_1^n has a suffix in C_node). As in suffix tree transducers, each internal node may have a child for every element of Sigma_in, and each arc is labeled with the corresponding element of Sigma_in. Also as in suffix tree transducers, to obtain P(y_n | x_1^n), one follows the path from the root to the deepest node, called deepest(x_1^n), according to the labels x_n, x_{n-1}, etc. The local output distribution at this node is used to predict y_n or its probability. The overall conditional distribution is therefore given by

P(y_1^T | x_1^T) = prod_{n=1}^T P(y_n | deepest(x_1^n)),    (2)

where the set of contexts C_deepest(x_1^n) associated to the deepest node deepest(x_1^n) contains a suffix of x_1^n.
The model can be used both to compute the conditional probability of a given input/output sequence pair, and to guess an output sequence given an input sequence. Note that the input variable can contain delayed values of the output variable (as in Variable Length Markov Models).

3.1 Proposed Learning Algorithm

We present here a constructive learning algorithm for building the graph of the model, and specify which data points are used to update each local output model (associated to nodes of the graph). The algorithm is on-line and operates according to two regimes: (1) adding new nodes and simply updating the local output distributions at existing nodes, and (2) merging parts of the graph which represent similar distributions. If there are multiple sequences in the training data, they are concatenated in order to obtain a single input/output sequence pair.

(1) After every observation (x_n, y_n), the algorithm updates the output distributions of the nodes whose C_node contains a suffix of x_1^n, possibly adding new nodes (with labels x_{n-d_i}^n) until x_1^n is in C_node for some node.

(2) Every T_merge observations, the algorithm attempts to merge sub-graphs which are found similar enough, by comparing the N(N-1)/2 pairs of sub-graphs rooted at the N nodes that have seen at least min_n observations. Merging two subgraphs is equivalent to forcing them to share parameters (as well as reducing the size of the representation of the distribution). A merge is performed between the graphs rooted at nodes a and b if Delta(a, b) < min_Delta and the merge succeeds. The details of the similarity measure and merging algorithm are given in the next subsections.
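Regime (1) can be sketched as follows. This is a minimal, hypothetical illustration: the `Node` fields, the `update` and `output_dist` helpers, the count-based relative-frequency estimates, and the explicit `max_depth` cap are all assumptions of this sketch, not details given in the paper:

```python
from collections import defaultdict

class Node:
    def __init__(self):
        self.children = {}              # input symbol -> Node
        self.counts = defaultdict(int)  # output symbol -> observation count
        self.n_obs = 0

def update(root, history, y, max_depth):
    """After observing (x_n, y_n): update the output counts of every node
    whose context is a suffix of the input history, creating missing nodes
    along the suffix path up to max_depth."""
    node = root
    node.counts[y] += 1
    node.n_obs += 1
    for symbol in reversed(history[-max_depth:]):
        node = node.children.setdefault(symbol, Node())
        node.counts[y] += 1
        node.n_obs += 1

def output_dist(node):
    """Relative-frequency estimate of the local distribution P(y | context)."""
    return {y: c / node.n_obs for y, c in node.counts.items()}

root = Node()
update(root, ["0", "1"], "a", max_depth=2)
update(root, ["0", "1"], "b", max_depth=2)
output_dist(root)  # both outputs seen once at the empty context
```

Every node on the suffix path is updated, so shallow contexts accumulate statistics from all their extensions, which is what later makes count-based comparisons between subgraphs meaningful.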
\n\n3.2 Similarity Measure Between Rooted Subgraphs \n\nIn order to compare (asymmetrically) output distributions P(yla) \ntwo nodes a and b, one can use the Kullback-Liebler divergence: \n\nand P(ylb) at \n\n_ \nP(ylb) \nIi. L(a, b) = L...J P(ylb) log P(yla) \n\n~ \n\nyEE out \n\n(3) \n\nHowever, we want to compare the whole acyclic graphs rooted at these 2 nodes. In \norder to do so, let us define the following. Let s be a string of input labels, and b \na node. Define desc(b, s) as the most remote descendant of b obtained by following \nfrom b the arcs whose labels correspond to the sequence s. Let descendents(a) be \nthe set of strings obtained by following the arcs starting from node a until reaching \nthe leaves which have a as an ancestor. Let P(sla) be the probability offollowing the \narcs according to string s, starting from node a. This distribution can be estimated \nby counting the relative number of descendents through each of the children of each \nnode. \nTo compare the graphs rooted at two nodes a and b, we extend the KL divergence \nby weighing each of the descendents of a, as follows: \n\nW K L(a, b) = \n\nP(sla)K L(desc(a, s), desc(b, s)) \n\n(4) \n\nL \n\n3 E de3cendent8 (a) \n\nFinally, to obtain a symmetric measure, we define \n\n(5) \nthat is used in the merge phase of the constructive learning algorithm to decide \nwhether the subgraphs rooted at a and b should be merged. \n\nLl(a,b) = WKL(a,b) + WKL(b,a) \n\n3.3 Merging Two Rooted Subgraphs \nIf Ll (a, b) < mina (a predefined threshold) we want to merge the two subgraphs \nrooted at a and b and create a new subgraph rooted at c. 
The local output distribution at c is obtained from the local output distributions at a and b as follows:

P(y_n|c) = P(y_n|a) P(a | a or b) + P(y_n|b) P(b | a or b),    (6)

where we define

P(a | a or b) = alpha^{d(a)} / (alpha^{d(a)} + alpha^{d(b)}),    (7)

where d(a) is the length of the longest path from the root to node a, and alpha represents a prior parameter (between 0 and 1) on the depth of the acyclic graphs. This prior parameter can be used to induce a prior distribution over possible rooted acyclic graph structures which favors smaller graphs and shorter contexts (see the mixture of probabilistic transducers of (Singer, 1996)).

The merging algorithm can then be summarized as follows:

* The parents of a and b become parents of c.
* Some verifications are made to prevent merges which would yield cycles in the graph. The nodes a and b are not merged if they are parents of one another.
* We make each child of a a child of c. For each child u of b (following an arc labeled l), look for the corresponding child v of c (also following the arc labeled l). If there is no such child, and u is not a parent of c, make u a new child of c. Else, if u and v are not parents of each other, recursively merge them.
* Delete nodes a and b, as well as all the links from and to these nodes.

This algorithm is symmetric with respect to a and b, except when a merge cannot be done because a and b are parents of one another. In this case, an asymmetric decision must be taken: we chose to keep only a and reject b. Figure 2 gives a simple example of a merge.

Figure 2: This figure shows how two nodes are merged. The result is no longer a tree, but a directed graph. Some verifications are done to avoid cycles in the graph. Each node can have multiple labels, corresponding to the multiple possible paths from the root to the node.
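The combination of local output distributions in equations (6) and (7) can be sketched as follows (a minimal illustration; the function names, the dictionary representation, and the example alpha value are assumptions of this sketch):

```python
def merge_weight(depth_a, depth_b, alpha):
    """Eq. (7): P(a | a or b) = alpha^d(a) / (alpha^d(a) + alpha^d(b)).
    With alpha in (0, 1), deeper nodes get smaller weight, encoding the
    prior preference for shorter contexts."""
    wa, wb = alpha ** depth_a, alpha ** depth_b
    return wa / (wa + wb)

def merged_dist(dist_a, dist_b, depth_a, depth_b, alpha=0.5):
    """Eq. (6): local output distribution at the merged node c, a convex
    combination of the distributions at a and b."""
    p_a = merge_weight(depth_a, depth_b, alpha)
    outputs = set(dist_a) | set(dist_b)
    return {y: dist_a.get(y, 0.0) * p_a + dist_b.get(y, 0.0) * (1 - p_a)
            for y in outputs}

# Equal depths give equal weight, so the merged distribution is the average:
c = merged_dist({"a": 0.8, "b": 0.2}, {"a": 0.4, "b": 0.6}, 3, 3)
# c is approximately {"a": 0.6, "b": 0.4}
```

Note that with equal depths the weight is 1/2 regardless of alpha, while a shallower node (shorter context) always dominates a deeper one.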
\n\n4 Comparative Experiments \n\nWe compared experimentally our model to the one proposed in (Singer, 1996) on \nmixtures of suffix tree transducers, using the same task. Given a text where each \nword is assigned an appropriate part-of-speech value (verb, noun, adjective, etc), \nthe task is to identify the noun phrases in the text. The UPENN tree-bank corpus \ndatabase was used in these experiments. The input vocabulary size, IEinl = 41, \nis the number of possible part-of-speech tags, and the output vocabulary size is \nIEouti = 2. The model was trained over 250000 marked tags, constraining the \ntree to be of maximal depth 15. The model was then tested (freezing the model \nstructure and its parameters) over 37000 other tags. Using the mixture of suffix tree \ntransducers (Singer, 1996) and thresholding the output probability at 0.5 to take \noutput decisions, yielded an accuracy rate of 97.6% on the test set, but required \nover 1 gigabyte of computer memory. \nTo make interesting comparisons with the shared context transducers, we chose \nthe following experimental scheme. Not only did we fix the maximal depth of the \ndirected graph to 15, but we also fixed the maximal number of allocated nodes, i.e., \nsimulating fixed memory resources. When this number was reached, we froze the \nstructure but continued to update the parameters of the model until the end of the \ntraining database was reached. For the shared context version, whenever a merge \nfreed some nodes, we let the graph grow again to its maximal node size. At the end \nof this process, we evaluated the model on the test set. \n\n\f414 \n\nY. Bengio, S. Bengio, J-F. Isabelle and Y. Singer \n\nWe tested this method for various values of the maximum number of nodes in the \ngraph. 
For each experiment, we tried different values of the other parameters (the similarity threshold min_Delta for merging, the minimum number of observations min_n at a node before it can be considered for a merge, and the delay T_merge between two merging phases), and we picked the one which performed best on the training set. Results are reported in figure 3.

maximal number of nodes | with merge | without merge
                     20 |      0.762 |         0.584
                     50 |      0.827 |         0.624
                    100 |      0.861 |         0.727
                    500 |      0.924 |         0.867
                   1000 |      0.949 |         0.917
                   2000 |      0.949 |         0.935
                   5000 |      0.952 |         0.948

Figure 3: This figure shows the generalization accuracy rate of a transducer with merges (shared contexts graph) against one without merges (suffix tree), for different maximum numbers of nodes. The maximum numbers of nodes are on a logarithmic scale, and the accuracy rates are expressed as relative frequencies of correct classification.

As can be seen from the results, the accuracy rate over the test set is better for transducers with shared contexts than without. More precisely, the gain is greater when the maximum number of nodes is smaller. When we fix the maximum number of nodes to a very small value (20), a shared context transducer performs 1.3 times better (in classification error) than a non-shared one. This gain becomes smaller and smaller as the maximum size increases. Beyond a certain maximum size, there is almost no gain, and one could probably observe a loss for some large sizes. We also need to keep in mind that the larger the transducer is, the slower the program to create the shared context transducer is, compared to the non-shared one. Finally, it is interesting to note that using only 5000 nodes, we were able to obtain 95.2% accuracy, which is only 2.4% less than that obtained with no constraint on the number of nodes.
\n\n5 Conclusion \n\nIn this paper, we have presented the following: \n\n\u2022 A new probabilistic model for probabilistic transducers with deterministic \ninput-to-state function, represented by a rooted acyclic directed graph with \nnodes associated to a set of contexts and children associated to the different \ninput symbols. This is a generalization of the suffix tree transducer. \n\n\u2022 A constructive learning algorithm for this model, based on construction and \nmerging phases. The merging is obtained by clustering parts of the graph \nwhich represent a similar conditional distribution. \n\n\u2022 Experimental results on a natural-language task showing that when the size \nof the graph is constrained, this algorithm performs better than the purely \nconstructive (no merge) suffix tree algorithm. \n\n\fShared Context Probabilistic Transducers \n\n4}5 \n\nReferences \n\nBaker, J. (1975). Stochastic modeling for automatic speech understanding. \n\nIn \nReddy, D., editor, Speech Recognition, pages 521-542. Academic Press, New \nYork. \n\nBengio, S. and Bengio, Y. (1996). An EM algorithm for asynchronous input/output \n\nhidden markov models. In Proceedings of the International Conference on Neu(cid:173)\nral Information Processing, Honk Kong. \n\nBengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. \n\nIEEE Transactions on Neural Networks , 7(5):1231-1249. \n\nJelinek, F. (1976). Continuous speech recognition by statistical methods. Proceed(cid:173)\n\nings of the IEEE, 64:532-556 . \n\nPereira, F. , Riley, M., and Sproat, R. (1994). Weighted rational transductions and \ntheir application to human language processing. In ARPA Natural Language \nProcessing Workshop. \n\nRiley, M. and Pereira, F. (1994). Weighted-finite-automata tools with applications \nto speech and language processing. Technical Report Technical Memorandum \n11222-931130-28TM, AT&T Bell Laboratories. \n\nSinger, Y. (1996). 
Adaptive mixtures of probabilistic transducers. In Mozer, M., Touretzky, D., and Perrone, M., editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA.
", "award": [], "sourceid": 1379, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}, {"given_name": "Jean-Franc", "family_name": "Isabelle", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}