The verbosified work of A.I. Space Rovers
Or: Convincing GPT-2 that more is more

In early 2019, OpenAI released a language model called GPT-2, which was widely sensationalized as being "too dangerous to release" due to its ability to generate large amounts of believable text. It's precisely that model which I chose as the main tool for this project. This art project accomplishes what a word-count-seeking writer could only dream of; is a statement on the needlessly verbose writing style I often find myself embodying; and showcases that the 'scariness' of GPT-2 depends largely on how the model is configured.

Parker Grey Addison

Ms. Wolper

Science Fiction, 4°

03.28.2016

Artificial Intelligence in Planetary Rovers

As of 1970, when the Soviet Union’s Lunokhod, the first unmanned rover, landed on the Moon, rovers have been agreed on as one of the best ways to gather scientific data about celestial bodies outside of our own Earth. Rovers, by nature not needing a human present, are able to travel to and survive on planets and moons not suitable for humans. Not only do they not need the same accommodations as humans, they also: last much longer than a human would on another planet, can transfer huge amounts of scientific data back to scientists on Earth, and they don’t necessarily need to have a plan to return to Earth. While starting off as directly remote controlled devices with limited data collection ability, planetary and lunar rovers have advanced greatly, and gained the capability of many more actions. Now, rovers can collect precise measurements, perform highly specific and intricate tasks, and even carry out entire scientific operations—ranging from basic to complex—solely on their own. A huge amount of this advancement can be contributed not to progress in equipment, but instead progress in the Artificial Intelligence the rovers possess. Just a few years after the initial use of rovers on other celestial bodies, space programs in the Soviet Union, the U.S., and all over the globe, saw the importance of autonomous capabilities on their rovers. Ever since that discovery, progress on this A.I. has been steadily increasing.

On November 17, 1970—a year after the Apollo 11 moon landing—the first unmanned space rover made contact with the lunar surface. This rover was the Lunokhod 1, from the Soviet Union’s Luna program. A couple months later, on January 15, 1973, another Soviet rover, Lunokhod 2, made a successful landing. The rovers were equipped with cameras and some basic scientific instruments meant to measure the composition and properties of the lunar soil.[1] Another year later, in December of 1971, the Soviet Mars 3 lander was making its descent onto the planet. However, as a result of a dust storm on the surface, a connection with the lander was only able to be sustained for 20 seconds. The lander contained a vehicle on board, called the Prop-M rover. While ultimately unsuccessful, the Prop-M rover was the first rover to have a form of artificial intelligence. This A.I. was very limited, and consisted merely of two metal rods on the front of the rover that would allow of autonomous obstacle avoidance.[2] For the period from the 1970’s until the turn of the century, A.I. in space rovers was no more than simple obstacle avoidance. With the launch of NASA’s two “robotic geologists”[3] , a huge shift occurred in the software of rovers.

The Spirit and Opportunity rovers, part of NASA’s Mars Exploration Rovers program, were launched on July 7 and 10, 2003, and landed on Mars on January 3 and 24, 2004. Featuring still limited artificial intelligence, these rovers were expected to only have a three month lifespan to collect and transmit data. Lasting long past their expected terminations—Spirit ceased to function as of 2010, but Opportunity is still functioning without issue—the two rovers have underwent a great deal of improvements.[3] Rovers are generally thought of as having cutting-edge technology in order to make scientific discoveries and advancements. Due to the length of development time and travel time, the building phase lasting over 8 years for NASA’s Curiosity, equipment onboard the rovers is generally robust, but outdated. While this equipment is something that cannot be easily upgraded once the rover has left Earth, the programming of the rover can be changed, and updated.[4] A large part of the updates to Discovery have been towards increasing the rover’s artificial intelligence. The same process is also in place for other rovers and machines, such as Curiosity and other planned rovers.[5] The benefits of this increased intelligence are numerous. On a basic level, the A.I. assists with navigation. The rovers can avoid obstacles and can find better paths to and from ordered positions with greater ease and safety, all with less input needed from the humans back on Earth. As software is updated, and intelligence is added, the rovers’ capabilities drastically increase.[6] For data collection, rovers are smarter about knowing when to transfer data back to Earth. While in the past some data has been lost due to poor connection, or not enough storage and being overwritten, the rovers now can decide when they should backup data according to amount of storage left, and broadband connectivity with mission control.[7] The rovers have also developed the intelligence to think scientifically. 
If a photo that they take, or object that they see is deemed scientifically interesting, they will seek closer inspection and take more pictures with different angles and filters. If the object is unimportant—an empty sky devoid of storms or interesting features, for example—then they will only store it temporarily, and won’t hesitate to overwrite it. Unless, of course, they are instructed otherwise.[6][8] All in all, as rovers become more autonomous, they are able to do their job quicker, instead of waiting for a command that can take upwards of 10 minutes to be received. As more rovers are sent to Mars and to other planets, it is entirely possible that there will be interconnected teams of rovers working together as scientists. Eventually, barely any human moderation will be needed for entire experiments and research projects to take place.[5]

Increased artificial intelligence for planetary rovers is not without sceptics. Indeed, there are plenty of obstacles and possible issues that could arise before the A.I. in the rovers can work as desired. A logistical problem that A.I. faces is the numerous unintended and unforeseen consequences of small errors in the programming, and the sheer amount of time it takes to correct these small errors. While the rovers, and the new software updates go through many tests before completion, and while there hasn’t yet been a large mishap relating to any updates, the possibility still exists. Especially as the intelligence gets more complex, it is likely for there to be situations in which the code doesn’t “think” the way we want it to.[9] The second main issue with increasing the A.I. of rovers stems from the fear a large proportion of people have about artificial intelligence. While A.I. is commonly accepted as benign, advanced but still limited, there is still great uncertainty about the future of its advancements. A great deal of people have concerns about creating something that, not only can think, but that can think differently than we do.[4] In order to address this fear, people must be guaranteed that the rovers will ultimately follow orders from humans. If the A.I. were to progress to superiority over humans, then there is a possibility the rover may ignore human input, unless programmed to ensure otherwise. Even then, there is a lot that a rover could do in the 10 to 20 minutes it takes for communication transmissions to be sent from Mars to Earth and back again. However, that doesn’t seem to bother too many scientists and engineers.

Since the turn of the century, artificial intelligence has become increasingly present in a wide variety of machines and robotics. Among experimental robots, Planetary rovers—developed for the most part by organizations such as NASA and the European Space Agency—have been at the cutting edge of A.I. advancements. The A.I. assists with many aspects of the rovers’ roles, allowing for more efficient basic functions, as well as scientifically thoughtful analysis and scientifically prioritized decisions that can be made autonomously. While there are dissenters to the increasing intelligence of rovers, primarily due to a fear of a created thought process, the development of A.I. is continuing to advance and is continuing to be uploaded to rovers on Mars.

For the technically inclined

To create the art piece seen above, I finetuned and explored OpenAI's 124M GPT-2 model. I also leveraged Max Woolf's gpt-2-simple package for Python to ease the finetuning process.

I wanted to purposefully misuse parts of the GPT-2 model in order to put an end to the model's creativity, and to produce something in stark contrast to the text-summarization capabilities that GPT-2 has been cited for. This involved a large amount of experimentation with different generation parameters such as temperature and top_k, as well as exploration of the use and over-use of finetuning.

I tried different methods of finetuning with the following findings:

  1. Finetuning on a large corpus of academic writing, magazine writing, and newspaper writing (medium number of epochs, ~500)
    • The model had a strong desire to talk about topics mentioned in this corpus—did not stay on topic to the prompt
  2. Finetuning on the essay that I wanted to verbosify (small number of epochs, ~1–5)
    • The model generally stayed on topic, but the specific subject discussed still had decent variability
  3. Finetuning on (1) followed by (2)
    • The model would generally stay on topic, but would often get derailed by topics from (1)
  4. Finetuning on the prompt that I wanted to verbosify (smallest number of epochs, only 1)
    • AMAZING! The model would generally output very similar information to that contained in the prompt, and would occasionally rely on general knowledge to complete an idea. I ended up using this method exclusively.
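
Method (4) might look roughly like the sketch below when written against gpt-2-simple. This is an illustration under assumptions, not the exact script used for the project: the helper names `generation_settings` and `verbosify` are hypothetical, the "one epoch" above is assumed to correspond to `steps=1` in gpt-2-simple's terms, `restore_from="fresh"` is assumed so that each run starts from the base 124M weights, and the default values are simply the settings reported in this write-up.

```python
import os
import tempfile

# Assumed defaults: the values this write-up reports using most often.
DEFAULT_SETTINGS = {"temperature": 0.8, "top_k": 30, "length": 100, "nsamples": 10}


def generation_settings(**overrides):
    """Return keyword arguments for one batch of samples (hypothetical helper)."""
    settings = dict(DEFAULT_SETTINGS)  # copy so the defaults are never mutated
    settings.update(overrides)
    return settings


def verbosify(prompt, model_name="124M", **overrides):
    """Finetune on `prompt` alone for one step, then sample with it as the prefix."""
    # Imported lazily: gpt-2-simple pulls in TensorFlow, which is slow to load.
    import gpt_2_simple as gpt2

    if not os.path.isdir(os.path.join("models", model_name)):
        gpt2.download_gpt2(model_name=model_name)

    # gpt-2-simple finetunes from a text file, so write the prompt to one.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(prompt)
        dataset_path = handle.name

    sess = gpt2.start_tf_sess()
    try:
        # Assumption: one "epoch" in the text corresponds to steps=1 here.
        gpt2.finetune(
            sess,
            dataset=dataset_path,
            model_name=model_name,
            steps=1,
            restore_from="fresh",
        )
        return gpt2.generate(
            sess,
            prefix=prompt,
            return_as_list=True,
            **generation_settings(**overrides),
        )
    finally:
        os.remove(dataset_path)
```

Calling `verbosify(passage)` would return a list of candidate continuations, and `verbosify(passage, top_k=80)` would loosen the restriction for a stubborn passage. When looping over many passages in one Python process, the TensorFlow graph probably needs to be reset between runs.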

I tried different parameters with the following findings:

  1. Temperature - affects variability of output
    • Low temperatures (0.1–0.4) resulted in low variability (duh) and high likelihood of the output getting caught in a cycle. If the text was off-subject, setting a low temperature did not fix the issue, and instead only cemented the output as slightly off-subject
    • Medium temperatures (0.4–0.7) resulted in moderate variability and low likelihood of the output getting caught in a cycle. This often led to the output repeating the prompt if the model didn't know much about the subject being discussed
    • High temperatures (0.7–1.0) resulted in good variability. I used primarily a temperature of 0.8, as this allowed for the model to reference its general knowledge about the subject being discussed. If many samples were returning blank or returning a limited output, then bumping the temperature up allowed the model to attempt to bring up a different subject
  2. Top-k - allows the model to choose from only the top-k tokens during a generation step
    • Default (0) was horrible since it actually removes all restrictions and allows sampling from all tokens. This meant that the model went off topic very easily
    • Recommended (40) was generally good, since it restricted the model to generating words that were seen in or similar to those seen in the prompt (since I also finetuned on the prompt)
    • Lower (30) was what I used for most of my generations, since it further restricted the model to using words similar to those in the prompt. Lower top-k forces the model to stay on subject, since it isn't allowed to mention unrelated words. A low top-k can be used with a high temperature to very effectively repeat the finetuning data/prompt while still attempting to avoid replication
    • Higher (80) was what I used if many samples were returning blank or the model was having a hard time avoiding replication of the prompt. Doing this returned a bit of creative freedom to the model and allowed it to rely on general knowledge to continue the prompt
  3. Length - determines how many tokens are generated
    • I generated 100 tokens for each sample, since too small a length often cut off sentences that seemed to have good potential, and too large a length let the output go off topic or enter a cycle
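
To make the temperature and top-k findings above more concrete, here is a toy, pure-Python sketch of how the two settings reshape the choice of the next token. It is not GPT-2's actual sampling code: the function names are hypothetical, the logits are made up, and ties at the top-k cutoff are all kept for simplicity.

```python
import math
import random


def top_k_filter(logits, k):
    """Mask everything outside the k highest logits; k == 0 means no restriction."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    # Masked tokens get -inf, which becomes probability 0 after the softmax.
    return [x if x >= cutoff else float("-inf") for x in logits]


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperatures sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.8, top_k=30, rng=random):
    """Pick one token index using top-k filtering followed by temperature sampling."""
    probs = softmax_with_temperature(top_k_filter(logits, top_k), temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens
    for temperature in (0.1, 0.8, 1.0):
        probs = softmax_with_temperature(top_k_filter(toy_logits, 2), temperature)
        print(temperature, [round(p, 3) for p in probs])
```

With `top_k=2`, the two lowest-scoring tokens can never be chosen no matter how high the temperature goes. That is the sense in which a low top-k paired with a high temperature keeps the output on subject while still leaving room for variation.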

I played around with these finetuning methods and parameters on a handful of prompts from high school essays, and then got to work doubling the word count of one essay in particular. To do this, I split the essay into a collection of "passages". Each passage consisted of three consecutive sentences that were all contained in the same paragraph. (I also played around with using three versus four sentences, but found that three gave better results for my particular high-school writing style.) Then, for each passage, I initialized a GPT-2 model and ran one iteration of finetuning on the passage. I then generated a handful of samples, using the passage as the prompt, varying the parameters in accordance with the findings above. Finally, I hand-picked some of my favorite generated sentences and added those to my essay. I found good sentences for most passages using only 10 samples, though for some passages I had to look at up to 40 samples (adjusting the parameters every 10) to find something worthwhile. Note that the only modification I made to the samples was adding a period if the output had been truncated.
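
The passage-building step could be sketched as follows. The details are assumptions rather than the procedure actually used: paragraphs are taken to be separated by blank lines, passages are non-overlapping, a paragraph's leftover sentences form a shorter final passage, and the sentence splitter is deliberately naive (it would, for instance, wrongly split after "A.I." and would not split before a citation marker such as "[1]"). The names `split_sentences` and `make_passages` are hypothetical.

```python
import re

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Split one paragraph into sentences using the naive rule above."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def make_passages(essay, size=3):
    """Group each paragraph's sentences into consecutive passages of `size` sentences."""
    passages = []
    for paragraph in essay.split("\n\n"):
        sentences = split_sentences(paragraph)
        # Passages never cross a paragraph boundary.
        for start in range(0, len(sentences), size):
            passages.append(" ".join(sentences[start:start + size]))
    return passages


if __name__ == "__main__":
    demo = "One. Two! Three? Four. Five.\n\nSix. Seven."
    for passage in make_passages(demo):
        print(repr(passage))
```

Each passage would then serve as both the one-step finetuning data and the generation prompt. Passing `size=4` gives the four-sentence variant mentioned above.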

The end result is (in my opinion) beautiful. For the most part, the sentences that I selected are very mundane. They don't pull the essay off topic, they don't detract from its value, and they certainly don't add to it... with the exception being that they add to the word count! However, there are some instances—particularly in the first few passages—where the creative side of GPT-2 was trying as hard as it could to reveal itself... these instances produced some interesting, amusing, and sometimes terrifying outputs.

Overall, this was a time-consuming process when done on an entire essay, but I could see it being somewhat useful, or at least somewhat amusing, as an extension to Google Docs. After all, the only thing the model needs to be finetuned on is the prompt itself, so training and generation are actually quite quick and very accessible: no need for a large corpus, just a few sentences.