One-hot encoding is very inefficient because it produces a sparse vector, meaning that most indices are 0. This just wouldn’t make sense efficiency-wise if, for example, there were 10,000 words in the vocabulary: to one-hot encode a word, 99.99% of the indices would be 0 and just one would have 1 stored in it. In comparison, word embeddings are efficient, dense representations. A dense vector is one where every element carries information instead of being mostly zeros. For example, “the cat sat on the mat” could be encoded as the dense vector [5, 1, 4, 3, 5, 2], one integer per word. An embedding is a dense vector of floating point values; these values are trainable parameters/weights learned by the model during training. After the weights are learned, you can encode each word by looking up the dense vector it corresponds to.
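To make the size difference concrete, here is a minimal sketch of the two representations in TensorFlow. The vocabulary size, embedding dimension, and the word id are hypothetical numbers chosen just for illustration:

```python
import tensorflow as tf

# Hypothetical sizes for illustration.
vocab_size = 10_000
embedding_dim = 16

# One-hot: a 10,000-element vector with a single 1 per word.
word_id = 42  # hypothetical integer id for one word
one_hot = tf.one_hot(word_id, depth=vocab_size)
print(one_hot.shape)  # (10000,) -- 9,999 zeros and a single 1

# Embedding: a trainable lookup table of short, dense float vectors.
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
dense = embedding(tf.constant([word_id]))
print(dense.shape)  # (1, 16) -- 16 learned floats per word
```

Each word costs 10,000 numbers in the one-hot scheme but only 16 in the embedding, and those 16 floats are adjusted during training like any other weights.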
Above are the two plots from the word embedding TensorFlow exercise. Training accuracy continued to increase over the course of the epochs, while validation accuracy decreased. The divergence of the two curves indicates that the model may be overfit. The other plot shows that training loss is much lower than validation loss, which points to the same conclusion: the network may be overfitting.
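For reference, plots like these are typically produced from the object returned by `model.fit()`. A minimal sketch, assuming the model was compiled with `metrics=["accuracy"]` and trained with validation data (so the default Keras history keys `accuracy`/`val_accuracy` and `loss`/`val_loss` exist):

```python
import matplotlib.pyplot as plt

def plot_history(history):
    # One figure for accuracy, one for loss, each with both curves.
    epochs = range(1, len(history.history["loss"]) + 1)
    for metric in ("accuracy", "loss"):
        plt.figure()
        plt.plot(epochs, history.history[metric], label="training " + metric)
        plt.plot(epochs, history.history["val_" + metric], label="validation " + metric)
        plt.xlabel("epochs")
        plt.legend()
    plt.show()
```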
The above 4 plots show training/validation accuracy and training/validation loss. The bottom 2 plots are from after the addition of 2 more LSTM layers to the RNN. Honestly, I don’t see much of a difference between the plots. It’s clear from all 4 that the model is overfit, given the divergence of the training and validation loss curves.
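As a sketch of what adding the extra layers looks like (the layer widths here are hypothetical, not necessarily the ones used in the exercise), the key detail when stacking LSTMs in Keras is `return_sequences=True` on every LSTM except the last, so each layer passes its full output sequence to the next:

```python
import tensorflow as tf

vocab_size = 10_000
embedding_dim = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.LSTM(32, return_sequences=True),  # added layer
    tf.keras.layers.LSTM(32),                         # added layer
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Since the extra capacity didn’t change the curves much, the overfitting here seems to come from training dynamics rather than model depth; regularization (e.g. dropout or early stopping) would be the more direct fix.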