27 Aug 2020
Each day, a script scrapes new articles from nytimes.com. That text is tokenized, or split into words based on whitespace and punctuation. Each word then must pass several criteria. Containing a number or special character is criteria for disqualification. To avoid proper nouns, all capitalized words are filtered. The most important check is against the New York Time’s archive search service. The archive goes back to 1851 and contains more than 13 million articles. The paper publishes many thousands of words each day, but only a very few are firsts.
I like these kinds of Twitter accounts, finding genuinely unique content.