# Notes from Speech and Language Processing

## Ch. 2 (sections 2, 3 and 4)

A corpus is a (computer-readable) collection of texts. Choosing corpora and analysing them can be a sensitive task. First, we must ensure they accurately represent our goal, with proper language diversity, representation of dialects/sociolects, attention to text genre, platform-specific quirks (e.g. hashtags on social media), etc. A dataset should come with a corresponding datasheet properly explaining such features and the motivation behind choosing this specific data.

We can look at the following quantities inside a corpus:

• Types: The set of unique words inside the corpus.
• Tokens: The total running words inside the corpus - whether repeated or not.
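As a minimal illustration of the two quantities (the whitespace split here is a crude stand-in for real tokenization, discussed below):

```python
from collections import Counter

text = "the cat sat on the mat and the cat slept"
tokens = text.split()    # every running word, repeats included
types = Counter(tokens)  # unique words, with their frequencies

print(len(tokens))  # N, the number of tokens → 10
print(len(types))   # |V|, the number of types → 7
```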

Given the set $V$ of types and the number $N$ of tokens inside a corpus, Heaps'/Herdan's law says that:

$|V| = kN^\beta$

where $k$ and $\beta$ are positive, and $\beta \in (0, 1)$. $|V|$ here represents the size of the type set, namely the number of unique words. One can easily output the types of a text file, sorted from most used to least used, with the following shell command:

```sh
$ tr -sc 'A-Za-z' '\n' < <text-file> | tr A-Z a-z | sort | uniq -c | sort -n -r
```

This command obviously has limitations compared to more rigorous tokenization methods, but it works as a quick-and-dirty solution.
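A rough Python equivalent of the pipeline above (a sketch; the `word_frequencies` name is mine, and the `[a-z]+` pattern mirrors the tr call's treatment of every non-letter as a separator):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase, split on non-letters, count, and sort by frequency,
    like the tr | sort | uniq -c | sort -n -r pipeline."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()  # most used to least used

print(word_frequencies("The cat and the other cat"))
# → [('the', 2), ('cat', 2), ('and', 1), ('other', 1)]
```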

Given our corpus, it is necessary to normalize it before applying any NLP techniques. This process commonly involves:

1. Tokenizing: This splits the corpus into separate words. We have to beware of some difficulties.

• Punctuation marks should be kept as separate tokens; they convey meaningful information by themselves.
• Many words have internal punctuation. Examples include Ph.D., Mr., U.S.A., etc. These should be parsed as single tokens.
• Some words should be expanded into multiple tokens. In English, these are typically clitic contractions marked by an apostrophe, e.g. doesn't → does + n't. The apostrophe is ambiguous, though: we also need to handle genitive markers, as in the book's cover, where the 's marks possession rather than a contraction.
2. Word normalization: This involves normalizing tokens into standard forms, e.g. stripping plural markers, stripping other prefixes/suffixes, or reducing each word to its common dictionary root (lemmatization).

3. Sentence segmentation: Here we split on punctuation to find where one sentence stops and the next begins. Some punctuation marks (like ? and !) are fairly unambiguous, while others - like the period, which also ends abbreviations and decimal numbers - have ambiguities that must be handled.
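The tokenization difficulties above can be sketched with a small rule-based regular-expression tokenizer. This is a minimal illustration, not a full solution: the abbreviation list is a toy stand-in for a real lexicon, and the `tokenize` name is mine.

```python
import re

# Ordering matters: earlier alternatives win at each position.
TOKEN_PATTERN = re.compile(r"""
    Mr\.|Ph\.D\.|U\.S\.A\.   # abbreviations kept as single tokens (toy lexicon)
  | \w+(?='s)                # word stem before a genitive 's clitic: book|'s
  | \w+(?=n't)               # word stem before the n't clitic: does|n't
  | n't | 's                 # the clitics themselves, as separate tokens
  | \w+                      # ordinary words
  | [^\w\s]                  # each punctuation mark as its own token
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Mr. Smith doesn't like the book's cover."))
# → ['Mr.', 'Smith', 'does', "n't", 'like', 'the', 'book', "'s", 'cover', '.']
```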
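The period ambiguity in sentence segmentation can likewise be sketched with a naive splitter (an assumption-laden sketch: the abbreviation lexicon and the `split_sentences` name are mine, and the heuristic still fails when an abbreviation really does end a sentence):

```python
import re

ABBREVIATIONS = {"Mr.", "Dr.", "U.S.A.", "Ph.D."}  # toy lexicon

def split_sentences(text):
    # Candidate boundaries: ., ! or ? followed by whitespace and a capital.
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    merged = []
    for piece in pieces:
        if merged and merged[-1].split()[-1] in ABBREVIATIONS:
            merged[-1] += " " + piece  # period belonged to an abbreviation
        else:
            merged.append(piece)
    return merged

print(split_sentences("Mr. Smith arrived. He sat down."))
# → ['Mr. Smith arrived.', 'He sat down.']
```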