Model coefficients, vocabulary, and TFIDF inputs

Previously I wrote an article on how to train a text classification model. Since then, I have sometimes been asked questions like: how does the model pick one category/class over another? What are the important features the model uses for prediction?

To answer these questions, let’s dissect the model we built and look inside its parts to gain some insight.

Train a text classification model

First, let’s again quickly train a text classification model using scikit-learn’s TfidfVectorizer and SGDClassifier, on the BBC news article data that you can download from Kaggle:

import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('bbc-text.csv')
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['category'], test_size=.2…
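
For readers who want to try the idea without downloading the data, here is a minimal, self-contained sketch of the same kind of setup. It is only an illustration under stated assumptions: the six sentences and their two labels are made up (they are not taken from bbc-text.csv), and wiring the two steps together with a Pipeline, along with the step names "tfidf" and "clf", is my own choice here rather than something the training code above confirms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in for the BBC data: made-up sentences, not rows from bbc-text.csv.
texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "shares fell as the market closed lower",
    "the bank raised interest rates again",
    "profits rose and investors bought shares",
]
labels = ["sport", "sport", "sport", "business", "business", "business"]

# Assumption: a Pipeline with hypothetical step names "tfidf" and "clf".
model = Pipeline([
    ("tfidf", TfidfVectorizer()),            # text -> TF-IDF feature matrix
    ("clf", SGDClassifier(random_state=0)),  # linear classifier on those features
])
model.fit(texts, labels)

# The two parts we want to look inside:
vocabulary = model.named_steps["tfidf"].vocabulary_  # dict: term -> column index
coefficients = model.named_steps["clf"].coef_        # learned weight per term

print(len(vocabulary), coefficients.shape)
print(model.predict(["investors sold shares"]))
```

With only two classes, scikit-learn stores a single row of coefficients, one weight per vocabulary term; with more classes there is one row per class. These two objects, the vocabulary and the coefficients, are what we need in order to see which words push a prediction toward a given category.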

