Since this was the last week of my internship, I spent most of my time finalizing my code and learning how to upload it to their main database. I finished both my projects and made all the changes suggested to me by others at Armorblox.
This whole process was a huge learning experience for me, and I want to give a special thanks to Ms. Visa and Mr. Sampath for guiding me through it, and to Ms. Belcher for organizing the whole senior project and making sure I stayed on track and dedicated. I learned so much about machine learning and natural language processing, from the very basics such as part-of-speech tagging to more complex tools such as word2vec and RAKE. Without this project, I don't think I would have gotten a chance to explore the field of NLP in so much depth until my last year or two of college. Even non-academically, I was exposed to a whole new world. Working in a start-up, I learned the importance of communication and teamwork on large-scale projects where numerous components have to work in perfect unison for the product to function. The daily stand-ups were one way I saw Armorblox stay in sync, and I also saw how no one hesitated to ask questions or clarify anything to make sure everything worked out perfectly.
I was fairly lucky in that I didn't face many non-technical challenges during the course of my senior project. I think this was because I made sure I followed my schedule and worked extra hours whenever I was at risk of missing a deadline. The major thing that went wrong was during the middle couple of weeks, when my code for the threat insertion project deleted itself…twice. This meant I had to rewrite it twice, but in my opinion it turned out to be a good thing. Before, I wasn't 100% sure how or why my code was working; I just knew it was. After rewriting it twice, I knew exactly how everything worked.
My final product will be the code I wrote for both projects along with examples of them running. Although I haven’t created a visual representation of that yet, I will be working on it for the next couple weeks leading up to my presentation on May 22 at 7pm at the DoubleTree Hilton Hotel in San Jose.
For those of you reading this who are considering a senior project next year, I highly recommend that you do one. It may seem like a lot of work (and it is), but if you pick a topic you are interested in, it's a lot of fun and you learn valuable lessons that you wouldn't learn anywhere else.
I loved working on my senior project and interning at Armorblox! It’s an experience I’ll never forget!
I spent most of this week fixing and reformatting the code I wrote for the Twitter Email Project and finalizing the code for the Threat Simulation Project. A lot of the work involved cleaning up the code, making it readable, and adjusting it to the adopted coding standards. Since I had been testing a lot of different NLP algorithms, my code was very messy and disorganized. Going through it, I took the time to make sure that I completely understood what was going on and that others could too.
I also spent a large amount of time creating and practicing my presentation at home and in front of my external advisor and Ms. Belcher.
Since I am waiting on code reviews from my coworkers, I wasn't able to make a lot of progress on the Twitter Email Project. My plan for the next week is to prepare my final project for my presentation on May 22 and make any last-minute changes to my code that my coworkers might suggest.
I spent most of my spring break making up hours for this week (since I was in Houston for world championships for robotics) and the week I was in Davis. I didn’t spend any of my time at my internship working on the Twitter Email Project but rather focusing on standardization of the code I wrote in the first week.
I was introduced by my coworker to a tool called Pylint. Pylint checks the code you wrote against a standard set of adopted rules that make software easier to read. For example, lines can't be longer than 100 characters; if they are, they need to be split. This ensures each line fits within the window of anyone reading the code. Pylint also checks things like variable names and file names to make sure they follow consistent nomenclature.
I also added a logger to my code. The logger replaces print statements and gives updates on what the program is doing. There are five different log levels, and the user decides how much information to receive: for example, only warnings, or general info updates as well.
Afterwards, I spent a day reformatting the code, changing the way the program takes inputs and chooses which threats to insert. This was because my boss wanted to begin integrating my code with the code they had, so that they could fully demonstrate their algorithm to potential customers and their board. I ran into a couple of technical difficulties during this time: when I tried to upload my code to the repository where it needs to be stored, it somehow got deleted, meaning I had to redo an entire five hours of work.
The following day I was given a new assignment to improve the threat injection code I wrote in week one. I had to change the dates in the Enron email dataset so that it seemed as if the emails were sent within the last year instead of in 2002. Although this was more for the aesthetic of the Enron dataset, the project still took me around a day to complete.
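One way to do that date shift, as a minimal sketch (the helper name is mine, and it assumes each message's timestamp has already been parsed into a `datetime`): shifting forward by a whole number of weeks preserves the weekday and time of day while landing the message within a week of "now", and therefore within the last year.

```python
from datetime import datetime, timedelta

def shift_into_last_year(original: datetime, now: datetime) -> datetime:
    """Shift a 2002-era timestamp forward by a whole number of weeks,
    preserving the weekday and time of day, so the email appears to
    have been sent recently."""
    weeks = (now - original).days // 7
    return original + timedelta(weeks=weeks)
```

Because every email gets shifted by the same rule, the relative ordering of the dataset's messages is preserved too.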
I was also assigned another project which involved changing the users in the emails to a given set of users so that Armorblox can sync up the Enron database with actual accounts to make their demonstrations more realistic. This project also took me a day to create and integrate with my previously written code.
Although I didn’t get to work on the Twitter Email Project this week, I learned a lot about how software engineers operate in the real world and what structures and processes they followed to stay organized.
I plan to spend the next week finishing up the Twitter Email Project, since I am nearly done, and making sure it is up to par with the coding standards enforced by Pylint. I am also going to be creating my final presentation and rehearsing it in front of my Basis advisor on Thursday.
I spent the first couple of days this week researching the difference between TextRank and LexRank to see which would best fit my project. The two algorithms are similar in design but take different approaches to summarization. TextRank turned out to be optimized for extracting key words and phrases from text documents; it is based on Google's PageRank algorithm, which works on the backend of the Google search engine. LexRank, on the other hand, is an unsupervised machine learning approach to the same problem that also builds on PageRank. LexRank is optimized for generating summaries of longer texts in the form of sentences instead of just extracting key phrases, and it requires being trained on a corpus of data. Because of this, I discussed it with a couple of others at my internship and decided that TextRank would be the best way to go, since it was optimized for what we want and didn't require any training time.
I also began planning the structure of my presentation. I want to begin with an introduction to the threat insertion project (what I worked on during the first week) and how it has been implemented and used with the code at Armorblox. Following that, I hope to discuss the Twitter Email Project and the goals my advisors and I set for me during the first week of my internship. Then, I want to talk about the different types of natural language processing algorithms I came across over the course of my senior project, from RAKE to word2vec to Lesk to TextRank. I hope to go in depth into the pros and cons of each algorithm, which I will also be doing as part of my final product.
Along with the code I used for the Twitter Email Project, I hope to write a short paper detailing the benefits and limitations of different NLP algorithms and how machine learning plays, or doesn't play, a role in each one of them.
Finally, I want to touch upon the challenges I faced over the course of these 12 weeks and how I learned to overcome them in different ways.
Over the course of the next week, I plan on finishing the part of my code that extracts the subject line of the email from the tweet using NLP. I will then have a meeting with my external advisor to discuss how to improve my algorithm and what I can add to it.
I also hope to spend some time fixing my code for the threat insertion project as per the comments of my co-worker. My external advisor told me that they will soon be integrating the threat insertion code with their main code so I need to make sure everything runs smoothly and is at the same standard as the rest of their work.
My week began with going back to the code I wrote during the first week of my internship to insert threats into the Enron email database. I edited it based on suggestions from my fellow co-workers and then began to refactor the code so that someone other than myself could understand how it works. I also changed the URLs and attachments to real threats (modified a little to make sure no one gets accidentally hacked) to make the database as realistic as possible. This took me around two days, after which I was introduced to Bitbucket, a platform Armorblox uses to stay organized and check in code. It took me a while to get registered and get the hang of how to use Bitbucket, but now I feel comfortable doing so.
The following are a couple snippets of the code I wrote.
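The original snippets were screenshots that aren't reproduced here, but the idea can be sketched like this (the record fields, look-alike domain, and placeholder URL below are all hypothetical, not the actual schema or threats):

```python
# Hypothetical email record shape; the real Enron JSON schema may differ.
THREAT_URL = "http://enron-login.example.com/verify"  # defanged placeholder

def insert_threat(email: dict) -> dict:
    """Turn a benign email record into a simulated phishing email:
    keep the sender's display name, swap the address for a
    look-alike domain, and append a suspicious link to the body."""
    threat = dict(email)  # copy so the original record stays intact
    name = email["from_name"]
    threat["from_addr"] = name.lower().replace(" ", ".") + "@enron-corp.net"
    threat["body"] = email["body"] + "\nPlease verify your account: " + THREAT_URL
    return threat
```

Keeping the author's name while swapping the address is what makes the simulated attack look like the impersonation attempts the detection code is meant to catch.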
I also began and finished a side project given to me by my advisor. I wrote additional code that created workflow emails. Workflow emails are emails that are sent by companies with email addresses similar to firstname.lastname@example.org. They are usually meant for password reset notifications or order shipping notifications. My job was to write some code to inject a certain number of these emails into the Enron email database at the same time as I injected the threats.
After uploading this code to Bitbucket for review, I went back to working on my Twitter Email Project. Through a lucky Google search, I came across a project called PyTextRank. PyTextRank takes sentences and extracts the phrases that best summarize the text, utilizing Google's PageRank algorithm to determine which words in the text are more important. A simple explanation of PyTextRank can be found here. For the published paper on TextRank, click here.
While playing around with TextRank, I found that it is much faster than my combination of the Lesk algorithm and word2vec. My combination would take anywhere from 3-5 minutes to run on 10 emails, whereas TextRank takes a couple of seconds since it doesn't involve word2vec's huge training data. Additionally, in my opinion, TextRank was producing equivalent or better summarization results than my algorithm most of the time. The following is a snippet of the texts of tweets followed by the key words/phrases generated by TextRank:
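Since the output above was a screenshot, here instead is a toy illustration of the idea behind TextRank (a hand-rolled sketch, not PyTextRank itself, with a tiny made-up stopword list): words that co-occur within a small window become graph neighbors, and PageRank-style power iteration scores them.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "for"}

def textrank_keywords(text, window=3, iterations=30, damping=0.85, top_n=5):
    """Toy TextRank: build a co-occurrence graph over content words,
    then run PageRank-style power iteration to rank them."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOPWORDS]

    # Undirected edges between words that co-occur within the window
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])

    # Power iteration: score(v) = (1 - d) + d * sum(score(u) / degree(u))
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in scores
        }

    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Well-connected words end up with the highest scores, which is why the algorithm surfaces the terms a tweet keeps circling back to.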
Although TextRank is working great on the tweets, I want to have a couple more words extracted from each sentence, so I am working on how to do that efficiently. After further research, I found LexRank, a more sophisticated version of TextRank. LexRank is used to summarize multiple documents into a few human-like summary sentences. I read in a couple of places that LexRank can be adapted to summarize a sentence into a phrase, so I want to explore that possibility a little.
Due to these discoveries, I have to push off working on the machine learning aspect for another couple of days until I can get the key phrase extraction the way I want it. My final product will be a combination of my final subject-line-generating code as well as a paper analyzing the differences between the various natural language processing algorithms I have worked with over my internship and how machine learning plays, or doesn't play, a role in each of them.
I haven’t had much chance this past week to work on my project because I was out of town for a robotics competition. I am still working on understanding the Lesk algorithm and how it outputs data. Using the algorithm I can figure out the meaning of a word given a context, but I still need to figure out how to find the synonyms of a word given its definition.
One idea I had was to get all possible synonyms for that word and then filter them based on their definitions. I am currently working on this method but am stuck on how to compare definitions to definitions. I have a rough idea that may work, but I am still testing it out.
My goal for the following week is to completely implement the Lesk algorithm so that I can find close to perfect synonyms of words in context. Once I have done that, I want to begin exploring the Machine Learning program that I will have to write to convert these synonyms into a subject line. I believe this algorithm will take me around two weeks to create, implement, and optimize. Afterwards, I want to spend some time sifting through my data by varying the parameters and then begin integrating it with the code Armorblox is building to test how well their cybersecurity algorithms are working.
I believe that I am on pace with what I had planned to do and will stay on track for the rest of the year.
Today marks the end of another week working on my senior project. I realized that a lot of the tools I found last week would not work in my program. I had planned to use RAKE to extract keywords from the tweets to form subjects, but it turns out that RAKE is not very accurate on tweets. Since it is designed for more wholesome and formal writing, RAKE can't extract keywords from semi-broken sentences.
On the other hand, I am continuing to use GloVe for its doesnt_match() method. This method takes a list of words and returns the word that fits the worst in the list. The most common example is the input "breakfast lunch dinner food": doesnt_match returns "food" because, although food is a key component of breakfast, lunch, and dinner, the three meals share more of a correlation with each other. I started using this method to reduce all the synonyms of a word to just two or three.
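The idea can be illustrated with a toy re-implementation (real GloVe vectors are 100-dimensional; the 2-D vectors below are made up purely for demonstration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def doesnt_match(vectors, words):
    """Return the word least similar, on average, to the rest of the
    list: the idea behind gensim's KeyedVectors.doesnt_match()."""
    def avg_similarity(w):
        others = [o for o in words if o != w]
        return sum(cosine(vectors[w], vectors[o]) for o in others) / len(others)
    return min(words, key=avg_similarity)

# Toy 2-D "embeddings": the three meals cluster together, food doesn't
vectors = {
    "breakfast": (0.90, 0.10),
    "lunch":     (0.80, 0.20),
    "dinner":    (0.85, 0.15),
    "food":      (0.10, 0.90),
}
```

With these vectors, `doesnt_match(vectors, ["breakfast", "lunch", "dinner", "food"])` returns "food", matching the classic example above.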
My plan is to get the closest synonym to every noun and verb in a sentence and then use RAKE or another algorithm to pick which few nouns and verbs are used the most. I will then train a Machine Learning algorithm that will be able to determine the best order to place each noun and verb to create the subject line for each email.
To get the synonyms of a word, I use WordNet, a lexical database for English developed at Princeton. NLTK provides an interface to WordNet, which is what I am using to get all the synonyms for a given word. In this database, words are grouped into 'Synsets', sets of synonyms that share a common meaning. Each Synset, a specific definition of a specific word, contains 'Lemmas', which refer to the synonyms of that specific word.
After implementing this, I came across another problem. WordNet was giving me all possible synonyms for a given word regardless of the context. This is due to the lexical ambiguity I discussed in my second blog post. Words have multiple meanings in different contexts and I was getting synonyms for all possible contexts. I had to find a way to only get the synonyms for the word based on the context it was being used in.
This is where I stumbled upon the Lesk algorithm, a word sense disambiguation algorithm that uses WordNet. Based on the sentence or phrase the word is used in, the Lesk algorithm returns the Synset whose definition has the most words overlapping with the context.
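A stripped-down sketch of that idea (NLTK ships a real implementation as nltk.wsd.lesk; the sense names and paraphrased glosses below are just illustrative):

```python
def simplified_lesk(context: str, senses: dict) -> str:
    """Pick the sense whose dictionary gloss shares the most words
    with the surrounding context: the core idea of the Lesk algorithm.
    `senses` maps a sense name to its gloss."""
    context_words = set(context.lower().split())
    def overlap(sense):
        return len(set(senses[sense].lower().split()) & context_words)
    return max(senses, key=overlap)

# Two WordNet-style senses of "bank" (glosses paraphrased)
bank_senses = {
    "bank.n.01": "a financial institution that accepts deposits of money",
    "bank.n.02": "sloping land beside a body of water",
}
```

For the context "I deposited money at the bank", the word "money" overlaps with the first gloss, so the financial sense wins; a river-related context would tip it the other way.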
This was extremely helpful as I have now begun working on how to implement the Lesk algorithm and hopefully will soon be heading on my way towards the Machine Learning part of my project, my next big challenge!
Over the past couple of weeks, my project goals have slightly shifted. Initially, I wanted my final product to be a program that could combine two sentences correctly, but as I worked on the project for my internship, I came to believe the Twitter project is a more in-depth and applicable approach to my goal. I want to explore the world of machine learning and natural language processing, and I have realized my internship project is much better suited to helping me do that.
This week I delved into the world of natural language processing. After meeting with my external advisor, I was given the task of taking the emails I had created from the Twitter API and auto-generating a subject line for each one. This involves analyzing each tweet and determining its meaning. On Wednesday, I looked into text classification programs (avoiding the Google API because my external advisor said that was the simple solution that didn't require any skill on my part). I came across the Natural Language Toolkit (NLTK) and GloVe, among many other natural language processing resources. NLTK is a platform, written for Python, that helps programmers analyze text. GloVe is a machine learning algorithm, written by professors at Stanford, that converts words into 100-dimensional vectors. When plotted, the distance between vectors represents the correlation between two words! More information on GloVe can be found here: https://nlp.stanford.edu/projects/glove/
Realizing that text classification sorts text into predetermined categories instead of using the "language" of the text, I turned toward text summarization. I looked into numerous ways to summarize text, but I found nothing that could convert a sentence or two into a phrase. Most text summarization programs were written to summarize paragraphs or whole documents into sentences. I wanted phrases for the subject line, not grammatically correct sentences, because the tweets themselves were barely over a sentence or two long.
After even more research, I came across RAKE (Rapid Automatic Keyword Extraction). This algorithm uses NLTK to determine which words/phrases in a body of text are the most important. This works perfectly for what I want since I can take the most important words/phrases and set them as the subject line of the email. I am currently working on implementing this algorithm and will then spend some time reading through the source code and understanding how it works. Additional information on RAKE can be found here: https://pypi.python.org/pypi/rake-nltk
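The mechanics behind RAKE can be sketched in a few lines (this is a toy version with a tiny made-up stopword list, not the rake-nltk package itself): candidate phrases are the runs of words between stopwords and punctuation, each word is scored by degree divided by frequency, and a phrase scores the sum of its word scores.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "for", "with"}

def rake_keywords(text, top_n=3):
    """Toy RAKE: extract candidate phrases between stopwords, then
    rank them by summed word scores (degree / frequency)."""
    words = re.split(r"[^a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if not w or w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    freq, degree = {}, {}
    for phrase in phrases:
        for w in phrase:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(phrase)  # co-occurrence degree

    def score(phrase):
        return sum(degree[w] / freq[w] for w in phrase)

    phrases.sort(key=score, reverse=True)
    return [" ".join(p) for p in phrases[:top_n]]
```

Because longer runs of content words accumulate more degree, RAKE naturally favors multi-word phrases, which is exactly what a subject line wants.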
I also spent some time this week refactoring my code for pulling tweets from Twitter, converting them to emails, and saving them to a file. Initially, I had a single script, but after taking some advice from my external advisor, I turned my program into a series of classes and methods so that it would be easier for others to understand and modify my code. Hopefully, sometime next week, I will be posting my code on GitHub and updating it as I progress through my project.
Closing in on the end of the second week of my internship, I have learned so much more. I began working on my project, which entails creating a repository of emails by converting tweets and replies on Twitter into emails. I spent the first couple of days creating a Twitter account and learning how to interact with the Twitter API. Since I am using Python, I downloaded a library called Tweepy, which helps me write code to access Twitter's API.
I came across some challenges during this time. I found out that Twitter places a rate limit on how many times I can access their API search call (150 times in 15 minutes). This limited me to obtaining only 15,000 tweets every 15 minutes, since each call retrieved 100 tweets. After further research, I found a hidden method that lets me access close to 45,000 tweets every 15 minutes. Although this may seem like a lot, the limit has slowed me down, since the programs I am testing take around a minute to run and then I have to wait 15 minutes before I can run them again.
Another challenge I ran into was obtaining replies to certain tweets. Since I want to convert tweets into emails, I also need the replies to those tweets to convert them into email replies. Unfortunately, the Twitter API does not have a function that allows a program to retrieve all the replies to a given tweet. This produces a significant challenge for me. Although I have found a logical way to solve this problem, I am currently struggling to implement it.
So far, my program can access the Twitter API and pull almost every tweet posted over the last 7 days that include a given set of search words. Since this would take hours to run due to the rate limit, I keep the maximum tweet limit to a couple hundred for testing. The following is a screenshot of my code:
I can also convert each tweet into an email using the MIME format (Multipurpose Internet Mail Extensions). I then store the emails in an array of dictionaries so that I can later write them to a file in JSON format. This way I can use the cyber attack simulation code I wrote in my second week to insert fake threats into these emails.
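A condensed sketch of that pipeline using Python's standard email and json modules (the addresses, domain, and field names are placeholders, not the actual ones I used):

```python
import json
from email.mime.text import MIMEText

def tweet_to_email(tweet_text: str, author_handle: str) -> dict:
    """Wrap a tweet's text in a MIME email and return a dictionary
    ready to be collected into a list and dumped as JSON."""
    msg = MIMEText(tweet_text)
    msg["From"] = author_handle + "@twitter-sim.example.com"
    msg["To"] = "inbox@twitter-sim.example.com"
    msg["Subject"] = "(subject generated later by NLP)"
    return {"author": author_handle, "raw": msg.as_string()}

emails = [tweet_to_email("Excited to learn some NLP this week!", "jane_doe")]
serialized = json.dumps(emails, indent=2)  # ready to write to a file
```

Storing the serialized MIME string inside a dictionary keeps the emails easy to write out as JSON while still letting the threat insertion code parse them back into full messages later.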
Today ends the first week of my internship. I was introduced to the Enron email database, which contains over half a million emails from Enron that were released to the public when the company was investigated by the FBI. This database is the largest publicly available collection of emails between real people and is used all over the world as test data.
I downloaded the database in JSON and have been writing a Python script over the past couple of days to manipulate it. My project was to simulate cyber attacks in this database by switching out email addresses while keeping the author's name the same, as well as inserting malicious URLs and attachments into the emails. This was my first step into using MIME, a standard that extends emails so that they can support other types of data such as images, PDFs, and even non-traditional characters and symbols.
I also learned how to use the argument parser in Python so that I can run a Python script from the Terminal and pass set inputs and parameters to the script.
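In Python that's the argparse module; a minimal sketch with made-up flags (parse_args is given an explicit list here so the example runs without a terminal):

```python
import argparse

# Hypothetical flags for a threat-insertion style script
parser = argparse.ArgumentParser(
    description="Insert simulated threats into an email database")
parser.add_argument("--input", required=True,
                    help="path to the Enron JSON file")
parser.add_argument("--num-threats", type=int, default=10,
                    help="how many threat emails to insert")

# Normally parse_args() reads sys.argv; passing a list works for testing
args = parser.parse_args(["--input", "enron.json", "--num-threats", "25"])
```

From the Terminal this would look like `python insert_threats.py --input enron.json --num-threats 25` (the script name is hypothetical).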
This project took me to the end of the week, when I met with my external advisor to discuss my next project. It involves using the Twitter API and MIME to convert tweets, retweets, and replies into emails like those in the Enron database. This way, I will be able to create a new repository of emails with the click of a button and then use those emails to simulate cyber attacks.