This is due to the quantity of probable word sequences boosts, and also the designs that inform final results become weaker. By weighting text within a nonlinear, distributed way, this model can "discover" to approximate terms instead of be misled by any unfamiliar values. Its "being familiar with" of a presented term isn't really as tightly tether