stackTrace.blog();

[witty cs joke]

Ntrl Lngg Prcssng

Imagine you see the text "Wrnng: D nt ntr". Your brain is likely smart enough to figure out the message it's trying to convey: "Warning: Do not enter". Humans are very good at deciphering language and at looking at the whole picture at once, two things that computers have historically struggled with. The idea of having a computer fill in missing vowels arose while I was aimlessly browsing Wikipedia and came across this as an example of something computers struggle to do (I don't recall exactly what the page was about). My first thought was "wow, imagine a data compression system that removes vowels and then uses natural language processing to fill them back in". Now, clearly the thought was flawed: NLP is rarely, if ever, a perfect way of dealing with language, so the restored text wouldn't exactly match the original. Also, with modern storage capacities, nobody needs to compress a text file anymore.

I started the project anyway, just as NLP practice and to see if I could do it. I decided to stick with Java since I needed efficient processing speed as well as a decent object-oriented language. My intention was to create a statistical model based on a bag-of-words context: each vowel-less word would be analyzed based on the vowel-less words around it, and a word with vowels that had statistically similar words around it would be selected as the solution. I also thought about using an HMM (hidden Markov model), but decided to go with the bag-of-words context option to better simulate the concept of "looking at the whole picture". To build my simple statistical model, I constructed a dictionary that maps every word to a list of its surrounding words, along with the number of times each surrounding word appeared near the main word. Then I trained the system with the stories from textfiles.com.
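The dictionary described above could be sketched roughly like this. This is only an illustration, not the project's actual code: the class name ContextModel, the method names, and the window parameter (how many words on each side count as "surrounding") are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and the window size are illustrative assumptions,
// not taken from the original project.
class ContextModel {
    // word -> (neighboring word -> number of times it appeared near the word)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // Assumed context size: how many words on each side count as "surrounding".
    private final int window;

    ContextModel(int window) {
        this.window = window;
    }

    // Record, for each word in one sentence, every other word within the window.
    void train(String[] words) {
        for (int i = 0; i < words.length; i++) {
            Map<String, Integer> neighbors =
                counts.computeIfAbsent(words[i], k -> new HashMap<>());
            int start = Math.max(0, i - window);
            int end = Math.min(words.length - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j != i) {
                    neighbors.merge(words[j], 1, Integer::sum);
                }
            }
        }
    }

    // How often `neighbor` was seen near `word`; 0 if either was never seen.
    int count(String word, String neighbor) {
        Map<String, Integer> neighbors = counts.get(word);
        return neighbors == null ? 0 : neighbors.getOrDefault(neighbor, 0);
    }
}
```

Training is one call to train(...) per sentence, after which count("do", "not") reports how often "not" showed up near "do". A nested map keeps lookups cheap, which matters once the dictionary holds a few thousand words.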

I spent a great deal of time making the training shell, which had options to input training text manually or from a text file, along with options to check the current details of the dictionary. For example, I had a search function that would find a word in the dictionary and display all of the words found in context with it. Another option displayed the last n words added to the dictionary, which was helpful for seeing what was new after adding a large text file. Another thing I spent far too much time on was preprocessing the contents of a text file so they would fit with my system. I know there are already countless libraries available to do this, but I decided to try it myself. Some of the functions include a whitelist, which returns only specific characters from a given text, and my own String.split()-style method that splits paragraphs into sentences and then those sentences into words (for the bag-of-words context). I tried out many other functions in the process, but settled on these and just a few others.
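To give a feel for the preprocessing step, here is a minimal sketch of a whitelist filter and a paragraph-to-sentences-to-words splitter. The names Preprocessor, whitelist, and sentences are made up for this example, and the rules (sentences end at '.', '!' or '?'; only lowercase letters, apostrophes, and spaces are kept) are assumptions rather than what the original code did.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helpers; the original project's methods are not shown in the post.
class Preprocessor {
    // Assumed whitelist: lowercase letters, apostrophes, and spaces.
    private static final String ALLOWED = "abcdefghijklmnopqrstuvwxyz' ";

    // Keep only the characters that appear in `allowed`; drop everything else.
    static String whitelist(String text, String allowed) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (allowed.indexOf(c) >= 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Split a paragraph into sentences on '.', '!' or '?', then split each
    // cleaned, lowercased sentence into words on whitespace.
    static List<String[]> sentences(String paragraph) {
        List<String[]> result = new ArrayList<>();
        for (String sentence : paragraph.split("[.!?]+")) {
            String cleaned = whitelist(sentence.toLowerCase(Locale.ROOT), ALLOWED).trim();
            if (!cleaned.isEmpty()) {
                result.add(cleaned.split("\\s+"));
            }
        }
        return result;
    }
}
```

Each String[] in the returned list is one sentence's words, ready to be fed to the training step one sentence at a time. Splitting on punctuation alone will trip over abbreviations like "Mr.", which is one reason the existing libraries are more involved than this.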

Once all of the training data was compiled into the system (I had an option to save the dictionary so I wouldn't have to retrain each time), the vowel filler took in a user-supplied string of text, either with or without vowels (the system removed any vowels the input contained), and split it into contexts of the same size as those used in the dictionary to simulate the bag-of-words. It then searched the dictionary for any words that had the same consonants as each input word. Each of those candidate words was then checked for the consonants of the surrounding words in its context list. The dictionary word with the highest count of matches was selected as the most likely match, and thus the result for that word. An alternate method that I tried, which used little to no machine learning, was to replace the word with the first occurrence of its consonants in the dictionary. This actually yielded surprisingly decent results, perhaps even better than my statistical model (I didn't gather accuracy results for this other test). You can see the results of my system for this blog post here.
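The matching step can be sketched as follows. This is one possible reading of the description above, not the original implementation: the class VowelFiller and its methods are hypothetical, "y" is treated as a consonant, and a candidate's score is assumed to be the sum of the co-occurrence counts of its recorded neighbors whose consonants appear in the input context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the consonant-matching step; names and the scoring
// rule are assumptions made for illustration.
class VowelFiller {
    // full word -> (neighboring full word -> co-occurrence count); assumed layout
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Remove a, e, i, o, u in either case ("y" is assumed to be a consonant).
    static String stripVowels(String word) {
        return word.replaceAll("[aeiouAEIOU]", "");
    }

    // Record that `neighbor` was seen once near `word` during training.
    void addPair(String word, String neighbor) {
        counts.computeIfAbsent(word, k -> new HashMap<>())
              .merge(neighbor, 1, Integer::sum);
    }

    // Return the dictionary word whose consonants equal `skeleton` and whose
    // recorded neighbors best match the vowel-less words in `context`.
    // Ties are broken by map iteration order, so they are effectively arbitrary.
    String fill(String skeleton, String[] context) {
        String best = skeleton; // fall back to the input if nothing matches
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            if (!stripVowels(entry.getKey()).equals(skeleton)) {
                continue;
            }
            int score = 0;
            for (Map.Entry<String, Integer> n : entry.getValue().entrySet()) {
                String neighborSkeleton = stripVowels(n.getKey());
                for (String c : context) {
                    if (c.equals(neighborSkeleton)) {
                        score += n.getValue();
                    }
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

With "not" recorded near "do" and "enter", and "nut" recorded near "shell", calling fill("nt", new String[] {"d", "ntr"}) prefers "not", because both of its neighbors match the context. This version scans the whole dictionary on every lookup; indexing the dictionary by consonant skeleton up front would avoid that.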

Hello, World!

Welcome to my website/blog thing! Hi! Nice to meet you, my name is Tyler and I'm a Computer Science and Neuroscience double major at Oberlin College. This site is for me to express my progress in CS, as my progress in neuro happens almost exclusively in my head (also in my peripheral nervous system). In this blog I plan on chronicling (I don't like how that's spelled) any programming adventures I partake in, as well as some history on things that I did prior to the existence of this site.

A little history to start us off, alright? In high school it wasn't required, but it was tradition, for everyone to take a class called "Web Design", where students created content for the school's "award winning" website. Now, I'm not saying much for the class, although the teacher and webmaster was certainly enthusiastic about it. We learned basic Photoshop and some HTML; it was much more content oriented than actual computer science. Anyway, the webmaster and I had a pretty good rapport, and he encouraged me to look into web-building on my own and apply it to the school's site. I ran my own page on the site, a contest page for students to create and submit art made in MS Paint. Although the page existed before I was there, I added the vast majority of the content and am responsible for its most successful semester, when it was one of the most viewed pages on the site. It's still live here, and you can still view my personal webbuilder profile.

The following semester, having enjoyed web design, I decided to take Introduction to Computer Programming, a class on Java programming taught by a great dude. Although it was intro level, in retrospect it was more challenging than AP Computer Science, and pretty similar to a college-level intro class. Anyway, I learned the basics there and loved every minute of it. Some things I worked on include a simple physics engine, collision detection in Java (yeah, really), and part of a recreation of Pokémon Red Version. I loved it so much that the following year I took AP Computer Science (crap class) online, which was pretty much only good for helping me place out of the intro CS class here at Oberlin. And now here I am. There's quite a bit I've done since being here, but that's for future posts.

Honestly, the reason this site exists is to give me a place to host my resume (still need to get on that), and much of my productivity is due to procrastinating on getting that done, so trade-offs and such. Also, I figure I should have some legitimate web design experience, so here we are. Thanks for visiting!