Formal Languages

As always, life in Budapest is very busy, but I’ve managed to produce another blog post. This was my first Thanksgiving away from my family, but all the students at AIT Budapest (all 40 of us) got together with the professors for a massive feast with traditional Hungarian and American dishes. Many of the Hungarian students had never celebrated Thanksgiving before, so it was a lot of fun to experience it with them.

In my Algorithms for Bioinformatics class, we were introduced to the abstract concept of formal grammars in computer science and their interesting applications in real biological systems. A grammar is essentially a set of production rules for generating strings in a language. The rules build syntactically valid strings from the language’s alphabet. However, the output strings have no inherent meaning; they are simply valid for the language. A formal grammar is this set of rules accompanied by a start symbol that initializes the process. Formal language theory is used in theoretical computer science, theoretical linguistics, formal semantics and mathematical logic. Graduate students at MIT created a program that generates random research papers that are syntactically correct but, when read over, have no real-world meaning. Below is a paper that was generated for me. Take a look!

MIT Paper Generator
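
To make the idea of production rules and a start symbol a bit more concrete, here is a minimal sketch of a random sentence generator. The toy grammar is made up for illustration; it is not the grammar the MIT program uses, just an example of how syntactically valid but meaningless strings can fall out of a handful of rules.

```python
import random

# Hypothetical toy grammar, invented for this example.
# Keys are nonterminals; anything that is not a key is a terminal word.
GRAMMAR = {
    "S": [["NP", "VP"]],                      # S is the start symbol
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N": [["algorithm"], ["genome"], ["grammar"]],
    "V": [["parses"], ["generates"], ["models"]],
}


def generate(symbol, rng):
    """Expand `symbol` by recursively applying randomly chosen production rules."""
    if symbol not in GRAMMAR:
        return [symbol]                       # terminal: emit the word as-is
    production = rng.choice(GRAMMAR[symbol])  # pick one rule for this nonterminal
    words = []
    for part in production:
        words.extend(generate(part, rng))
    return words


# Start from the start symbol and print one random (but valid) sentence.
print(" ".join(generate("S", random.Random())))
```

Every sentence it prints follows the rules, but nothing stops it from saying that the genome parses the algorithm.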

A couple of these papers have actually been accepted at conferences (very low-ranking conferences, but conferences nonetheless!). For biological systems, a Lindenmayer system (L-system) is a type of formal grammar that consists of an alphabet of symbols, a collection of production rules, an initial axiom string and a mechanism for translating the generated strings into geometric structures. From these generated strings, scientists can actually generate accurate predictions of plant structure, like the one created below.

Language-Generated Trees
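
The string-rewriting half of a Lindenmayer system (L-system) fits in a few lines. The sketch below uses Lindenmayer's classic algae example (axiom `A`, rules `A -> AB` and `B -> A`), not the rules behind the trees pictured here, and it leaves out the step that translates strings into geometry. The function name `rewrite` is just my own choice.

```python
def rewrite(axiom, rules, generations):
    """Apply every production rule to every symbol at once, `generations` times."""
    current = axiom
    for _ in range(generations):
        # Symbols without a rule are copied through unchanged.
        current = "".join(rules.get(symbol, symbol) for symbol in current)
    return current


# Lindenmayer's algae system: axiom "A", rules A -> AB and B -> A.
ALGAE_RULES = {"A": "AB", "B": "A"}

for n in range(5):
    print(n, rewrite("A", ALGAE_RULES, n))
# 0 A
# 1 AB
# 2 ABA
# 3 ABAAB
# 4 ABAABABA
```

To get a plant out of it, each symbol in the final string would be read as a drawing instruction (move forward, turn, branch), which is the translation mechanism mentioned above.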
Besides modeling plant growth, context-free grammars can model protein folding as well. Here, the strings of the language are sequences of amino acids, and the production rules create folds resulting in alpha helices and beta sheets, which accurately resemble real-world protein structure.

Jo Napot Kivanok

Budapest is finally starting to admit that summer is indeed over, and the city is transitioning to crisp autumn weather. The outdoor Turkish baths are shutting down, forcing bathers indoors to the ornate swimming halls. Every day I try to pick up a little more Hungarian, but my vocabulary and conversations are currently limited to ordering food and describing myself (Amerikai diak vagyok). Even though the Iron Curtain fell many years ago, it is fascinating to see everyday throwbacks to how life was back then (oppressive grey apartment buildings, people pushing wheelbarrows of hundreds of potatoes down a busy street). The city of Budapest is actually incredibly developed, with a better public transit system than I’ve seen anywhere in the States. While here, I found a lacrosse team to play with, and we traveled this past weekend to Serbia to compete in a multinational tournament that we ended up winning! The team is filled with a bunch of goofballs:

 

bolasz

 

Everyone is super friendly and willing to let me practice my weak Hungarian on them. Most people here, not just on the team, actually speak very good English.

In my classes, we are talking about large datasets gathered from biology, such as genome and protein sequencing, and the issues that arise from data management and analysis.

The cost of sequencing an entire human genome has fallen drastically (under $5000, and projected to approach $1000), and so has the time needed to perform the sequencing. But with this great technology comes the burden of overwhelming amounts of data. Scientists are now working not only on improving biological reading techniques, but also on better ways to manage the data. The most pressing issues are data transfer, standardization of data formats, access control and data integration.

One approach to the problems presented above is a concept known as cluster computing. The goal is to achieve supercomputer performance without actually owning a supercomputer. Many computers on a single local network are linked together so that they can function as one single computer. This method is extremely cost-effective and enables supercomputer performance for a fraction of the price. However, the other costs associated with this method (specialized facilities and hardware, as well as extremely knowledgeable IT support) present potential drawbacks.

To overcome some of these issues, many companies are switching to cloud computing for their data storage and analysis. In the cloud, an on-demand shared pool of computing resources is available whenever needed for a very low cost. This is especially effective when the task doesn’t require the data to be continuously accessed, but instead read for one-off tasks. Cloud computing comes with its own set of drawbacks, such as privacy concerns about keeping health records in a public space and the network bandwidth restrictions associated with uploading large datasets into the cloud.

Similar to both cloud and cluster computing is the method of grid computing. In grid computing, tasks are distributed to ‘loosely’ connected computers (as opposed to a single network of computers in cluster computing). These computers could be anywhere in the world, in different companies, or even volunteers’ laptops at home. This enables companies to muster huge computational power at almost no cost to them. Like cloud computing, grid computing suffers when transferring or uploading data. Additionally, there is minimal control over the hardware that the programs are actually running on. One way of speeding grid computing up comes from the practice of heterogeneous computing. These computers utilize accelerators, such as GPUs, to turn one computer into a cluster computer.
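
The common thread in cluster and grid computing is splitting one big job into pieces, handing the pieces to separate workers, and combining the results. Here is a rough sketch of that pattern, assuming a made-up task (counting G and C bases in a DNA string). Threads on a single machine stand in for the separate computers, and the helper names are hypothetical. A real cluster or grid would use a job scheduler or message passing instead.

```python
from concurrent.futures import ThreadPoolExecutor


def split(sequence, n_chunks):
    """Cut `sequence` into at most `n_chunks` roughly equal pieces."""
    size = max(1, -(-len(sequence) // n_chunks))  # ceiling division, at least 1
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def count_gc(chunk):
    """The work one 'node' does: count G and C bases in its piece."""
    return sum(1 for base in chunk if base in "GC")


def distributed_gc_count(sequence, n_workers=4):
    """Scatter the chunks to workers, then gather and sum the partial counts."""
    chunks = split(sequence, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_gc, chunks))
    return sum(partial_counts)


genome = "ATGCGCTA" * 1000          # stand-in for a real sequence
print(distributed_gc_count(genome))  # 4000
```

In this toy version the data never leaves one machine. In a real grid, shipping each chunk to its worker is exactly the transfer cost described above.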