Subword Encoding in Lattice LSTM for Chinese Word Segmentation

Jie Yang; Shuailong Liang; Yue Zhang

arxiv: 1810.12594 · v1 · pith:QSBOOBINnew · submitted 2018-10-30 · 💻 cs.CL

classification 💻 cs.CL

keywords lattice · lstm · encoding · information · lexicon · subword · word · embeddings

We investigate a lattice LSTM network for Chinese word segmentation (CWS) that utilizes words or subwords. It integrates character sequence features with information from all subsequences matched against a lexicon. The matched subsequences serve as information shortcut tunnels that directly link their start and end characters. Gated units control the contribution of the multiple input links. Through formula derivation and comparison, we show that the lattice LSTM is an extension of the standard LSTM with the ability to take multiple inputs. The previous lattice LSTM model takes word embeddings as the lexicon input; we prove that subword encoding gives comparable performance and has the benefit of not relying on any external segmentor. The contribution of the lattice LSTM comes from both lexicon and pretrained embedding information; through controlled experiments, we find that the lexicon information contributes more than the pretrained embedding information. Our experiments show that the lattice structure with subword encoding gives results competitive with or better than previous state-of-the-art methods on four segmentation benchmarks. Detailed analyses compare the performance of word encoding and subword encoding in the lattice LSTM. We also investigate how the lattice LSTM structure performs under different circumstances, and when the model works or fails.
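As a rough illustration of the lexicon-matching step described in the abstract, the sketch below enumerates every multi-character subsequence of a sentence that appears in a lexicon and groups the matches by the character they end on, since that is where a shortcut link would feed into the lattice. It is not the authors' implementation: the function names `match_lexicon` and `group_by_end`, the `max_len` cap, and the toy lexicon are assumptions made for this example, and the gated LSTM cell itself is not shown.

```python
from collections import defaultdict


def match_lexicon(chars, lexicon, max_len=4):
    """Return (start, end, span) for every multi-character span found in the lexicon.

    `start` and `end` are inclusive character indices. `max_len` is an
    assumed cap on span length, used only to bound the search.
    """
    matches = []
    n = len(chars)
    for start in range(n):
        # Spans of length 2..max_len that begin at `start`.
        for stop in range(start + 2, min(n, start + max_len) + 1):
            span = chars[start:stop]
            if span in lexicon:
                matches.append((start, stop - 1, span))
    return matches


def group_by_end(matches):
    """Group matched spans by end index: the position receiving the shortcut."""
    by_end = defaultdict(list)
    for start, end, span in matches:
        by_end[end].append((start, span))
    return dict(by_end)


# Toy lexicon and sentence, chosen for illustration only.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
sentence = "南京市长江大桥"

matches = match_lexicon(sentence, lexicon)
by_end = group_by_end(matches)

for end, incoming in sorted(by_end.items()):
    print(sentence[end], incoming)
```

In this toy sentence the final character receives two incoming links (from 长江大桥 and 大桥) in addition to the ordinary character-to-character link, which is the multiple-input situation the gated units are described as handling.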

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Investigating Self-Attention Network for Chinese Word Segmentation
cs.CL 2019-07 unverdicted novelty 4.0

Self-attention networks achieve results competitive with BiLSTM-CRF on Chinese word segmentation, with BERT and word integration yielding the best reported performance on six heterogeneous domain benchmarks.