C# and Microsoft.ML. Removing words from a sentence. Getting started with Machine Learning

Microsoft.ML is the NuGet package for ML.NET, Microsoft’s open-source Machine Learning framework.

In this introduction I will create a stopword engine, capable of removing unwanted words from a sentence. I know that it is overkill to use machine learning to do this, but it serves as a great introduction to how to initialize and call Microsoft.ML.

STEP 1: THE NUGET PACKAGE

You need the following NuGet package:

Microsoft.ML

STEP 2: CREATE A LIST OF STOPWORDS

We need a list of unwanted words to remove from the sentence:

public class StopWords
{
  internal static readonly string[] Custom =
  {
    "profanity",
    "swearing",
    "degrading"
  };
}

STEP 3: CREATE THE TEXTPROCESSING SERVICE

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public class TextProcessingService
{
  // The PredictionEngine is part of the Microsoft.ML package
  private readonly PredictionEngine<InputData, OutputData> _stopWordEngine;

  // The PredictionEngine receives an array of words
  private class InputData
  {
    public string[] Words { get; set; }
  }

  // The PredictionEngine returns an array of words
  private class OutputData : InputData
  {
    public string[] WordsWithoutStopWords { get; set; }
  }

  public TextProcessingService()
  {
    var context = new MLContext();

    // Getting the list of words to remove from our sentence
    var stopWords =
      StopWords.Custom.ToArray();

    // Define the transformation
    var transformerChain = context.Transforms.Text
      .RemoveDefaultStopWords(
        inputColumnName: "Words",
        outputColumnName: "WordsWithoutDefaultStopWords",
        language: StopWordsRemovingEstimator.Language.English)
      .Append(context.Transforms.Text.RemoveStopWords(
        inputColumnName: "WordsWithoutDefaultStopWords",
        outputColumnName: "WordsWithoutStopWords",
        stopwords: stopWords));

    var emptySamples = new List<InputData>();
    var emptyDataView = context.Data.LoadFromEnumerable(emptySamples);
    var textTransformer = transformerChain.Fit(emptyDataView);

    _stopWordEngine = context.Model.CreatePredictionEngine<InputData, OutputData>(textTransformer);
  }

  public string[] ExtractWords(string text)
  {
    // Split the sentence on spaces and let the engine remove the stopwords
    var input = new InputData { Words = text.Split(' ') };
    return _stopWordEngine.Predict(input).WordsWithoutStopWords;
  }
}

USAGE:

public static void Main()
{
  var textProcessing = new TextProcessingService();
  var words = textProcessing.ExtractWords("my code removes swearing and degrading language");
  Console.WriteLine(string.Join(' ', words));
}

The code above will generate the following output:

  • code removes language

But why does it do that? The input string is “my code removes swearing and degrading language” and I have only defined “swearing” and “degrading” as words that need to be removed?

The answer lies in the RemoveDefaultStopWords call in the TextProcessingService (line 37, where the language is set to English). The transformation is a chain of two steps. RemoveDefaultStopWords uses a StopWordsRemovingEstimator, and that Microsoft class is pre-loaded with a number of stopwords for the chosen language, among those “my” and “and”. RemoveStopWords then removes my own words from whatever is left. So my list of words does not replace the built-in list, it is applied in addition to it.
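To see the two steps separately, the intermediate column can be exposed on the output class. The following is only a sketch built on the same pipeline as the service above. StageInput, StageOutput, StopWordStages and StageDemo are hypothetical names made up for this example, and the classes are public simply to keep the sketch free of accessibility questions.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

// Hypothetical input type for this sketch (same shape as InputData in the service)
public class StageInput
{
  public string[] Words { get; set; }
}

// Hypothetical output type exposing BOTH pipeline columns
public class StageOutput
{
  // Filled by RemoveDefaultStopWords (built-in English list)
  public string[] WordsWithoutDefaultStopWords { get; set; }

  // Filled by RemoveStopWords (our custom list)
  public string[] WordsWithoutStopWords { get; set; }
}

public static class StopWordStages
{
  // Runs the same two-step chain as the service and returns the result of each step
  public static (string[] AfterDefault, string[] AfterCustom) Run(string sentence)
  {
    var context = new MLContext();

    var pipeline = context.Transforms.Text
      .RemoveDefaultStopWords(
        inputColumnName: "Words",
        outputColumnName: "WordsWithoutDefaultStopWords",
        language: StopWordsRemovingEstimator.Language.English)
      .Append(context.Transforms.Text.RemoveStopWords(
        inputColumnName: "WordsWithoutDefaultStopWords",
        outputColumnName: "WordsWithoutStopWords",
        stopwords: new[] { "profanity", "swearing", "degrading" }));

    // Nothing is learned from data here, so an empty data view is enough to fit
    var emptyData = context.Data.LoadFromEnumerable(new List<StageInput>());
    var model = pipeline.Fit(emptyData);

    using (var engine = context.Model.CreatePredictionEngine<StageInput, StageOutput>(model))
    {
      var result = engine.Predict(new StageInput { Words = sentence.Split(' ') });
      return (result.WordsWithoutDefaultStopWords, result.WordsWithoutStopWords);
    }
  }
}

public static class StageDemo
{
  public static void Main()
  {
    var (afterDefault, afterCustom) =
      StopWordStages.Run("my code removes swearing and degrading language");

    Console.WriteLine("After default stopwords: " + string.Join(" ", afterDefault));
    Console.WriteLine("After custom stopwords:  " + string.Join(" ", afterCustom));
  }
}
```

Going by the output shown earlier, the first line should no longer contain “my” and “and” while still containing “swearing” and “degrading”, and the second line should match the output of the service. If you only want your own words removed, dropping the RemoveDefaultStopWords step and pointing RemoveStopWords at the "Words" column should be enough.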

That’s it. Happy coding.

About briancaos

Developer at Pentia A/S since 2003. Have developed Web Applications using Sitecore Since Sitecore 4.1.