In this post I will explain how this can be done faster using TF-IDF, N-Grams, and sparse matrix multiplication. Super Fast String Matching in Python Traditional approaches to string matching such as the Jaro-Winkler or Levenshtein distance measure are too slow for large datasets. I just grabbed a random dataset with lots of company names from Kaggle. It contains all company names in the SEC EDGAR database. I don’t know anything about the data or the amount of duplicates in this dataset (it should be 0), but most likely there will be some very similar names. ![]() TF-IDF is a method to generate features from text by multiplying the frequency of a term (usually a word) in a document (the Term Frequency, or TF) by the importance (the Inverse Document Frequency or IDF) of the same term in an entire corpus. This last term weights less important words (e.g. the, it, and etc) down, and words that don’t occur frequently up. IDF(t) = log_e(Total number of documents / Number of documents with term t in it).Īn example (from Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. TF-IDF is very useful in text classification and text clustering. It is used to transform documents into numeric vectors, that can easily be compared. While the terms in TF-IDF are usually words, this is not a necessity. In our case using words as terms wouldn’t help us much, as most company names only contain one or two words. This is why we will use n-grams: sequences of N contiguous items, in this case characters. The following function cleans a string and generates all n-grams in this string:Īs you can see, the code above does some cleaning as well. Next to removing some punctuation (dots, comma’s etc) it removes the string “ BD”. This is a nice example of one of the pitfalls of this approach: some terms that appear very infrequent will result in a high bias towards this term. In this case there where some company names ending with “ BD” that where being identified as similar, even though the rest of the string was not similar. ![]() The code to generate the matrix of TF-IDF values for each is shown below. The resulting matrix is very sparse as most terms in the corpus will not appear in most company names. Scikit-learn deals with this nicely by returning a sparse CSR matrix. You can see the first row (“!J INC”) contains three terms for the columns 11, 16196, and 15541. The last term (‘INC’) has a relatively low value, which makes sense as this term will appear often in the corpus, thus receiving a lower IDF weight. To calculate the similarity between two vectors of TF-IDF values the Cosine Similarity is usually used.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |