For simple transactional phrases (such as flash messages), this is more than enough.
But if you want to generate an entire article in plain English from data structures, then the logic becomes very convoluted very fast.
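To make "convoluted" concrete, here is a minimal hand-rolled sketch for a single sentence. The method name is made up for illustration. Every grammatical agreement (verb, determiner, noun) has to be handled by hand, and this covers only one sentence shape:

```ruby
# Naive, hand-rolled generation: every grammatical detail is our problem.
# `untested_stories_sentence` is a hypothetical helper, not from any library.
def untested_stories_sentence(count)
  verb       = count == 1 ? 'is'    : 'are'
  determiner = count == 1 ? 'a'     : 'some'
  noun       = count == 1 ? 'story' : 'stories'
  "There #{verb} #{determiner} finished and delivered #{noun} that may not have been tested."
end

puts untested_stories_sentence(1)
puts untested_stories_sentence(3)
```

Multiply those ternaries by tense, negation, passive voice and irregular words across a whole article and the template approach stops scaling.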
I looked around and found a few Ruby projects that could help, for example:
- "nameable" which can do useful stuff like this:
```ruby
Nameable::Latin.new('Chris').gender
#=> :male
Nameable::Latin.new('Janine').female?
#=> true
```
- "calyx" which can be used to generate simple phrases like this:
```ruby
class GreenBottle < Calyx::Grammar
  mapping :pluralize, /(.+)/ => '\\1s'
  start 'One green {bottle}.', 'Two green {bottle.pluralize}.'
  bottle 'bottle'
end

# => "One green bottle."
# => "Two green bottles."
```
Nice and dandy, but still useless for the more complex needs I have in mind.
So I decided to dig a bit deeper into the dark world of NLG, or Natural Language Generation. This is not to be confused with NLP, or Natural Language Processing, which is the opposite of what I want: NLP takes plain English text and returns a parsed data structure.
For NLP (parsing, tokenization, etc) I'd highly recommend "Stanford CoreNLP". It seems to be one of the most robust and comprehensive out there (come on, it's from Stanford). Again a Java project, and a big download (more than 300MB!). Those linguistics projects are super heavy because they have to download entire dictionaries and lexicon databases.
But focusing on my problem at hand, NLG, there are several options out there. In all honesty, I did not do very extensive research, so if you know which one is the most robust, well maintained and easy to use, let me know in the comments section below.
My choice was SimpleNLG. From its GitHub page we can see that it seems to be quite well maintained to this day, it's a simple Java library, and it is one of the "simpler" alternatives. KPML is at the opposite end of the spectrum: it seems to be one of the oldest (since the 80s!) and most robust. But seriously, it feels like you need a PhD to even get started.
Reading the SimpleNLG Java source code was boring but easy enough. Give yourself one full day of study to get used to the code and you're in business.
The main problem is that it's written in Java, and I do not intend to write anything in Java (or any derivative) for now. For a short while I considered the endeavour of rewriting the damn thing in something more portable such as Rust, which I could load anywhere through FFI.
But even though SimpleNLG has "Simple" in its name, it has a few hairy dependencies to load the lexicon database. And the database itself is an HSQLDB dump, which is a database written in Java. And then there would be the issue of maintaining a fork.
I quickly gave up on that idea and instead worked around it by wrapping the library in a simple Rails-API endpoint. I had a few issues because I had Git LFS tracking the jar files on my system, and Heroku doesn't support it, so I ended up with a corrupted deployment (beware of those quirks, by the way!).
Finally, I was able to deploy a working JRuby + Rails-API project embedding SimpleNLG on Heroku. You can deploy your own copy by cloning my nlg_service. It works fine with the latest JRuby 9.1.5.0. You should pay for at least a Hobby tier on Heroku. Java takes a ridiculous amount of time to start up and more time to warm up. Heroku's free tier shuts down your dyno if it sits idle, and a subsequent web request will definitely time out or take an absurd amount of time to return.
Once deployed it starts up Rails, then loads this initializer:
```ruby
require 'java'
Java::JavaLang::System.set_property "file.encoding","UTF-8"

SIMPLE_NLG_DEFAULT_LEXICON_PATH = Rails.root.join("lib/SimpleNLG/resources/default-lexicon.xml").to_s.freeze
SIMPLE_NLG_PATH = Rails.root.join("lib/SimpleNLG").to_s.freeze

Dir["#{SIMPLE_NLG_PATH}/*.jar"].each { |jar| require jar }
```
And then I map the classes like this:
```ruby
module SimpleNLG
  %w(
    simplenlg.aggregation
    simplenlg.features
    simplenlg.format.english
    simplenlg.framework
    simplenlg.lexicon
    simplenlg.morphology.english
    simplenlg.orthography.english
    simplenlg.phrasespec
    simplenlg.realiser.english
    simplenlg.syntax.english
    simplenlg.xmlrealiser
    simplenlg.xmlrealiser.wrapper
  ).each { |package| include_package package }
end
```
Finally, I have a simple endpoint mapped to a controller action:
```ruby
class Api::RealisersController < ApplicationController
  def create
    reader = java::io::StringReader.new(params[:xml])
    begin
      records = SimpleNLG::XMLRealiser.getRecording(reader)
      output = records.getRecord.map do |record|
        SimpleNLG::XMLRealiser.realise(record&.getDocument)
      end
      @realisation = output.join("\n").strip
      render plain: @realisation
    ensure
      reader.close
    end
  end
end
```
The process of generating the final English text is called "realisation". SimpleNLG has a comprehensive Java API but it also exposes it as a simpler XML format. The full XML Realiser Schema is available as an XSD.
If I want to write this sentence:
"There are some finished and delivered stories that may not have been tested."
This is the XML that I need to assemble:
```xml
<?xml version="1.0"?>
<NLGSpec xmlns="https://simplenlg.googlecode.com/svn/trunk/res/xml" xmlns:xsd="https://www.w3.org/2001/XMLSchema" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance">
  <Recording>
    <Record>
      <Document cat="PARAGRAPH">
        <child xsi:type="SPhraseSpec">
          <subj xsi:type="NPPhraseSpec">
            <head cat="ADVERB">
              <base>there</base>
            </head>
          </subj>
          <vp xsi:type="VPPhraseSpec" PERSON="THIRD">
            <head cat="VERB">
              <base>be</base>
            </head>
            <compl xsi:type="NPPhraseSpec" NUMBER="PLURAL">
              <head cat="NOUN">
                <base>story</base>
              </head>
              <spec xsi:type="WordElement" cat="DETERMINER">
                <base>a</base>
              </spec>
              <preMod xsi:type="CoordinatedPhraseElement" conj="and">
                <coord xsi:type="VPPhraseSpec" TENSE="PAST">
                  <head cat="VERB">
                    <base>finish</base>
                  </head>
                </coord>
                <coord xsi:type="VPPhraseSpec" TENSE="PAST">
                  <head cat="VERB">
                    <base>deliver</base>
                  </head>
                </coord>
              </preMod>
              <compl xsi:type="SPhraseSpec" MODAL="may" PASSIVE="true" TENSE="PAST">
                <vp xsi:type="VPPhraseSpec" TENSE="PAST" NEGATED="true">
                  <head cat="VERB">
                    <base>test</base>
                  </head>
                </vp>
              </compl>
            </compl>
          </vp>
        </child>
      </Document>
    </Record>
  </Recording>
</NLGSpec>
```
Ok, this is preposterous, I know.
Which is why I decided to go ahead and use one of Ruby's most recognized strengths: creating DSLs or Domain Specific Languages.
The result of my initial endeavor to simplify this process is the nlg_xml_realiser_builder Ruby gem. Simply add the following to your Gemfile:

```ruby
gem 'nlg_xml_realiser_builder'
```
And the humongous XML above becomes something more manageable like this:
```ruby
dsl = NlgXmlRealiserBuilder::DSL.new
dsl.builder(true) do
  sp :child do
    subj :np, 'there', cat: 'ADVERB'
    verb 'be', PERSON: 'THIRD' do
      compl :np, ['a', 'story'], NUMBER: 'PLURAL' do
        preMod :cp, conj: 'and' do
          coord :vp, 'finish', TENSE: 'PAST'
          coord :vp, 'deliver', TENSE: 'PAST'
        end
        compl :sp, MODAL: 'may', PASSIVE: true, TENSE: 'PAST' do
          verb 'test', TENSE: 'PAST', NEGATED: true
        end
      end
    end
  end
end.to_xml
```
Understanding the intricacies of an NPPhraseSpec vs. a VPPhraseSpec, or the difference between a WordElement and a StringElement, is beyond the scope of this blog post. But most of the original XSD has been mapped through this constants file.
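For readers curious how a block-based builder can turn nested Ruby blocks into nested XML, here is a minimal sketch of the general technique (`instance_eval` plus a depth counter). `TinyXmlBuilder` and its `tag` method are hypothetical names invented for this illustration; this is not how the gem is actually implemented, and it skips XML escaping entirely.

```ruby
# Hypothetical mini-builder: NOT the gem's implementation, only an illustration
# of how nested Ruby blocks can be turned into nested XML elements.
class TinyXmlBuilder
  def initialize
    @lines = []
    @depth = 0
  end

  # Emits <name attr="value">...</name>.
  # A block adds child elements; without a block, `text` becomes the content.
  # Note: no XML escaping is done here, so this is only safe for trusted input.
  def tag(name, text = nil, **attrs, &block)
    attr_str = attrs.map { |key, value| %( #{key}="#{value}") }.join
    indent = '  ' * @depth
    if block
      @lines << "#{indent}<#{name}#{attr_str}>"
      @depth += 1
      instance_eval(&block) # run the block with the builder as `self`
      @depth -= 1
      @lines << "#{indent}</#{name}>"
    else
      @lines << "#{indent}<#{name}#{attr_str}>#{text}</#{name}>"
    end
    self
  end

  def to_xml
    @lines.join("\n")
  end
end

xml = TinyXmlBuilder.new.tag(:head, cat: 'VERB') do
  tag :base, 'be'
end.to_xml

puts xml
```

Because the block runs with the builder as `self`, inner calls need no explicit receiver, which is what makes this style of DSL read almost like the structure it produces.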
I have a few acceptance specs that generate XML documents like the one above, post them to my live online NLG Web Service and fetch the resulting English sentences. I will change this process in the future, but you can test it out yourself.
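If you deploy your own copy of the service, posting the XML to it could look roughly like the sketch below. The host name and the `/api/realisers` path are assumptions (the path is guessed from the controller name, so check your own routes). The `xml` parameter name comes from the `params[:xml]` read in the controller shown earlier. The helper names are made up for this example.

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint: both the host and the path are assumptions.
NLG_ENDPOINT = URI('https://your-nlg-service.example.com/api/realisers')

# Builds a form-encoded POST carrying the NLGSpec document in the "xml"
# parameter, which is what the controller reads through params[:xml].
def build_realise_request(xml, endpoint = NLG_ENDPOINT)
  request = Net::HTTP::Post.new(endpoint)
  request.set_form_data('xml' => xml)
  request
end

# Sends the request and returns the response body, which the controller
# renders as plain text.
def realise(xml, endpoint = NLG_ENDPOINT)
  use_ssl = endpoint.scheme == 'https'
  response = Net::HTTP.start(endpoint.host, endpoint.port, use_ssl: use_ssl) do |http|
    http.request(build_realise_request(xml, endpoint))
  end
  response.body
end
```

With a real deployment, `puts realise(xml_string)` should print the realised sentence. Keep in mind the slow JVM warm-up mentioned earlier, so a generous read timeout may be needed on the first request.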
The advantages start here. Now let's check out the previous example more closely. Again, it renders this phrase:
"There are some finished and delivered stories that may not have been tested."
Now, it's in plural form because I am talking about 'stories', but what if I want a singular version?
Below is the new version, where I just wrap it in a method and make the 'NUMBER' attribute accept either 'PLURAL' or 'SINGULAR':

```ruby
def example(plural = 'PLURAL')
  dsl = NlgXmlRealiserBuilder::DSL.new
  dsl.builder(true) do
    sp :child do
      subj :np, 'there', cat: 'ADVERB'
      verb 'be', PERSON: 'THIRD' do
        compl :np, ['a', 'story'], NUMBER: plural do
          preMod :cp, conj: 'and' do
            coord :vp, 'finish', TENSE: 'PAST'
            coord :vp, 'deliver', TENSE: 'PAST'
          end
          compl :sp, MODAL: 'may', PASSIVE: true, TENSE: 'PAST' do
            verb 'test', TENSE: 'PAST', NEGATED: true
          end
        end
      end
    end
  end.to_xml
end
```
And I can run the singular version like this:
```ruby
puts example('SINGULAR')
```
This is the resulting phrase:
"There is a finished and delivered story that may not have been tested."
Check out how it changed the verb from "are" to "is" and the noun determiner from "some" to "a" on its own! And of course, this is a contrived example. Now imagine an entire customizable article, full of paragraphs and sentences that I can customize depending on several variables I have.
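As a small sketch of what "depending on variables" could mean in practice, the NUMBER feature can be derived from a count in your data instead of being hard-coded. The `grammatical_number` helper and the sample project counts are hypothetical, invented for this illustration.

```ruby
# Maps a count coming from your data to the NUMBER feature value.
# `grammatical_number` is a hypothetical helper, not part of the gem.
def grammatical_number(count)
  count == 1 ? 'SINGULAR' : 'PLURAL'
end

# Hypothetical sample data: untested story counts per project.
untested = { 'Alpha' => 1, 'Beta' => 4 }

untested.each do |project, count|
  number = grammatical_number(count)
  puts "#{project}: NUMBER=#{number}"
  # With the gem loaded, the XML would come from: example(number)
end
```

The realiser then takes care of verb and determiner agreement, so the calling code only decides which feature value applies.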
While I was studying and writing this DSL I got a good enough grasp of the SimpleNLG structure, but if you have more examples for more complex phrase structures, please let me know in the comments section down below.
Most of the specs were copied from the XML Realiser tests from the original Java project to make sure I am covering most cases.
It will be interesting to see if this DSL makes it easier for more people to experiment with NLG. As usual, send your Pull Requests, ideas and suggestions on my GitHub public repositories:
And if you're interested in the subject of NLP and NLG I found this list of Ruby related open source projects as well.