How To Program In An Unknown Language (And Domain)

2023-12-18

This is a little experience report from my daily life: facing a completely foreign programming language to implement a simple task. Hopefully my thought process helps to showcase how to learn a new language. Also be warned: there will be a lot of ranting for comedic purposes.

Background

Recently I had to write a very simple script for my girlfriend. It evaluates a matrix of Elements where (Element1, Element2) = (Element2, Element1) and each (Element1, Element2) is some Value between -0.5 and 1 (cosine similarity). The script has to fill in the missing values and list, for each Element, the highest and lowest matches, displaying the value as well as the item name as a table.

Drumroll...

in R.

A language designed to be used by data scientists to model statistical problems and prototype research workflows. Also a language that tries its best to make the entire ecosystem as opaque as possible, and to make the trivial task of outputting nicely formatted custom tables as painful as possible.

I approached it as I would any new language: get familiar with the environment, understand how to execute a script, and then get familiar with the syntax.
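To make the task concrete, here is a minimal sketch in base R of what it boils down to. It is not the script I actually wrote: the item names and scores are invented, and fill_symmetric and top_matches are hypothetical helper names chosen for illustration.

```r
# Toy similarity matrix; NA marks the missing half (all values are made up).
items <- c("apple", "pear", "plum")
sim <- matrix(NA_real_, nrow = 3, ncol = 3, dimnames = list(items, items))
sim["apple", "pear"] <- 0.8
sim["apple", "plum"] <- -0.3
sim["pear", "plum"] <- 0.5

# Hypothetical helper: copy each known value into its mirrored cell,
# because (Element1, Element2) = (Element2, Element1).
fill_symmetric <- function(m) {
  missing <- is.na(m)
  m[missing] <- t(m)[missing]
  m
}

# Hypothetical helper: the n best (or worst) matches for one item as a table.
top_matches <- function(m, item, n = 2, lowest = FALSE) {
  row <- m[item, colnames(m) != item]      # drop the item itself
  row <- sort(row, decreasing = !lowest)   # sort() also drops NA entries
  row <- head(row, n)
  data.frame(item = item, match = names(row), value = unname(row))
}

full <- fill_symmetric(sim)
print(top_matches(full, "plum"))
print(top_matches(full, "plum", lowest = TRUE))
```

The diagonal stays NA in this sketch, since the similarity of an item with itself is not interesting for a list of matches.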

Environment

First of all, the default environment: RStudio. It is basically a Jupyter Notebook. If I create a new .r file, I would expect a normal script that can be run from top to bottom. But no, as far as I could tell RStudio only lets you run the selected lines, so you first need to select all lines and then execute them. While I understand that this kind of progressive evaluation might be beneficial when writing R Markdown to embed code in documentation, it seems very unnecessary in a context where I want to execute the logic in one sitting.

Then every value you evaluate is stored in an environment that you can inspect on the side, which is very cool. It caches your history, and every name you defined holds its value as of the last evaluation. However, this led to all sorts of problems. Sometimes this environment trips up and you can't overwrite those values; it throws a very cryptic message along the lines of "RCache token, bla is unrecognized or can't be overwritten", or something like that. The only solution: restarting the program. RStudio. Not the script. Not ideal.

So instead of banging my head against it as if it were a "normal" script, I set up an R Markdown document and added little annotations before each major logic block to make it reproducible and to add some additional info. With that out of the way, I moved on to syntax.

Syntax and Semantics

I mean, I have seen many programming languages and language families, and with the exception of the LISP languages every language (even the terse Haskell) uses "=" for assignment.
Yet for some funny reason the R language designers thought it was appropriate to use "<-" instead. Other than that, everything looked "C"-like in its syntax, with code blocks in curly braces. But everything, including return and typeof (a meta-statement), is used as a function call... Finally, we have square brackets for indexing. Which brings me to the next point.
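For readers who have never seen R, a tiny sketch of the syntax points just mentioned. The names threshold, is_match and scores are made up for illustration.

```r
# Assignment with the arrow; `=` is also accepted at the top level.
threshold <- 0.5

# A function body sits in curly braces; return() is itself a function call.
is_match <- function(value) {
  return(value > threshold)
}

# typeof() is likewise an ordinary function call, not a keyword.
kind <- typeof(threshold)

# Square brackets index, and indexing starts at 1.
scores <- c(0.9, -0.2, 0.4)
first <- scores[1]

print(kind)             # "double"
print(is_match(first))  # TRUE
```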

Indexing is crazy. When I figured out that "read.csv" reads a CSV and produces a dataframe as output, I faced the next problem: what is a dataframe internally, and how can I work with it? The issue is gloriously under-documented. I guess for most workflows you could abstract over these details and just feed the data into matrices, etc... But in my case those details were exactly what I needed.

So after some research and analysis (who defined typeof() as a function call, really?) I found out that a dataframe is a list. Columns, to be exact, are named lists that contain the row entries, also as lists. Now this is where the mindf*** begins. A list is usually a linked list: something that contains a head and its tail recursively, so it can be extended indefinitely. Yet in the case of R (from what I understood) a list is that, mixed with a hash table (commonly known as a lookup table), where each value has an associated name and you can use the name to index the value as well.
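The kind of poking around that leads to this picture can be sketched as follows. The inline CSV is invented so that the snippet is self-contained; read.csv, class, typeof and names are base R functions.

```r
# Inline CSV so the sketch runs on its own; names and scores are made up.
csv_text <- "item,apple,pear
apple,1.0,0.8
pear,0.8,1.0"
df <- read.csv(text = csv_text)

print(class(df))    # "data.frame"
print(typeof(df))   # "list": under the hood it is a list of columns
print(names(df))    # the column names

# A column can be reached by name in two equivalent ways ...
by_dollar <- df$pear
by_bracket <- df[["pear"]]

# ... or by position, just like any other list element.
by_position <- df[[3]]
```

The name lookup is the "hash table" half of the picture, the positional lookup is the "list" half.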

Nice! Also not so nice if you have multiple instances of the same name in one entry, as in the case of my matched groups, where one entry could appear multiple times. So I thought: OK, this "value" is a composite type like in Haskell or similar languages. I was desperately trying to extract the "name" component from my value, without any success.

Also, a list is a vector. What is a vector? A list of unnamed entries of the same type, so essentially an array. It makes sense in a mathematical domain language to call arrays something like 1-dimensional tensors. What does not make sense is the difference in syntactic and semantic support between lists and vectors. A plain vector can be sorted with sort. A list can't: something about "X is not atomic" -> ???. A vector can be sliced with [x:y]; a list, at least the way I tried it, can't.
Then I assumed that the defining difference between vector and list was that a list has named values of different types and a vector does not. But when I cast my list to a vector with as.vector (having the casting options collected under the as prefix was syntactically and semantically refreshing, and a little saving grace for this otherwise gruesome experience), that vector still contained the names bound to those values. So, wtf? That added greatly to the confusion.
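A small sketch for reproducing this in a console. The toy values are invented, and unlist is my suggestion of a base R function that would have helped here, not something from the original script.

```r
scores <- list(apple = 0.8, plum = -0.3, pear = 0.5)

print(is.list(scores))     # TRUE
print(is.vector(scores))   # TRUE: a list also counts as a vector

# as.vector() leaves a list as a list, names included.
still_list <- as.vector(scores)
print(names(still_list))

# unlist() flattens it into an atomic (single-type) vector; names survive too.
flat <- unlist(scores)
print(is.atomic(flat))     # TRUE
print(flat[["plum"]])
```

So names are not what separates the two: both the list and the flattened vector carry them.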

So my next logical step was to explore the available libraries and solutions. From my experience with other high-level languages, I wanted something like filter or map over lists to avoid explicit iteration. It turns out that list.map from the rlist package lets you access each entry of a list with ".", the index with ".i", and the name with ".name". Nice, exactly what I needed, except that this only works on the "content" level of my list, meaning I could not create a nested list of name + value and then later merge them all together. I realized this when I had list(., name) in my final output (in the csv, for crying out loud). To get the names in the conventional way, your only option is the function names(), which returns the name of each value in the same order as the values.
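For comparison, names() combined with base R's Map and vapply can pair names with values without an extra package. This is a sketch of an alternative with invented toy values, not what I ended up using.

```r
scores <- list(apple = 0.8, plum = -0.3, pear = 0.5)

# names() returns the names in the same order as the values.
item_names <- names(scores)

# Map() walks both sequences in parallel and returns a list of pairs.
pairs <- Map(function(name, value) list(name = name, value = value),
             item_names, scores)

# vapply() then turns each pair into one formatted string.
labels <- vapply(pairs, function(p) paste(p$name, p$value), character(1))
print(unname(labels))
```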

So at the end of my patience, and to save you from more struggles with this poorly documented language, I simply used a mutable vector and a for loop over the list. I used a combination of c (which creates vectors... but also merges them... and sometimes also works with lists, huh?!) and paste, which is R-gibberish for simply appending strings together, like += in most weak languages or <> in Haskell. (Can we talk about the unorthodox naming everything follows? Let's better not open that door as well. Whatever.) I then looped over the range of entries, indexing both names and values from their respective vectors and merging them progressively.
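Roughly, the loop had the following shape. This is a reconstruction with invented values and variable names, not the original script.

```r
scores <- list(apple = 0.8, plum = -0.3, pear = 0.5)
item_names <- names(scores)

# Start with an empty character vector and grow it with c().
rows <- character(0)
for (i in seq_along(scores)) {
  # [[i]] pulls out the bare value; paste() glues name and value together.
  entry <- paste(item_names[i], scores[[i]], sep = ": ")
  rows <- c(rows, entry)
}
print(rows)
```

Growing a vector with c() inside a loop is not the fastest approach, but for a handful of entries it does the job.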

This taught me a valuable lesson. Which brings me to the takeaway of this post.

Takeaway

When learning a new language, first get to know the environment and the intended workflow/purpose of said environment. Adapt to it quickly and go with the flow. Get familiar with the syntax. Don't try to carry over the programming conventions of other languages. Scratch your understanding of data structures and other constructs you know from other languages too. Don't add to your confusion with additional libraries. In the end, all you need is to:

Understand your input and output
Find and understand the necessary data structures and how to manipulate them
Learn the mechanisms of iteration, sequencing and selection (the essence of all programming)

With that, you can implement a solution to most problems. It may not be idiomatic, efficient, elegant or scalable, but it is enough to get you started!