Neural networks and deep learning — self-study and 2 presentations

Last month, after mentioning “deep learning” a few times to some professors, I suddenly found myself in a position where I had to prepare three talks about “deep learning” within a single month… :sweat_smile: Not that I am complaining. I genuinely enjoy studying the relevant theory, applying it to interesting datasets, and presenting what I have learned. Besides, teaching may be the best way to learn. Still, it is quite funny. :laughing: The deep learning hype is too real. :trollface:

Read More

Probabilistic interpretation of AUC

Unfortunately this was not taught in any of my statistics or data analysis classes at university (wtf, it so needs to be :scream_cat:). So it took me some time to learn that the AUC has a nice probabilistic interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.
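As a quick illustration, here is a minimal R sketch of my own (not code from the post) that checks this equivalence numerically: the pairwise-comparison probability agrees with the classical rank-based (Wilcoxon–Mann–Whitney) computation of the AUC.

```r
# The AUC equals P(score of a random positive case > score of a random
# negative case). Estimate that probability directly and compare it to
# the rank-based formula.
set.seed(1)
scores_pos <- rnorm(100, mean = 1)  # scores of positive cases
scores_neg <- rnorm(100, mean = 0)  # scores of negative cases

# Proportion of positive-negative pairs ranked correctly (ties count 1/2):
auc_prob <- mean(outer(scores_pos, scores_neg, FUN = ">") +
                 0.5 * outer(scores_pos, scores_neg, FUN = "=="))

# Equivalent rank-based computation (Wilcoxon-Mann-Whitney statistic):
r <- rank(c(scores_pos, scores_neg))
n_pos <- length(scores_pos)
n_neg <- length(scores_neg)
auc_rank <- (sum(r[1:n_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

all.equal(auc_prob, auc_rank)  # TRUE
```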

Read More

Mining USPTO full-text patent data - Analysis of machine learning and AI-related patents granted in 2017 so far - Part 1

The United States Patent and Trademark Office (USPTO) provides immense amounts of data (the data I used come in the form of XML files). After coming across these datasets, I thought it would be a good idea to explore where and how my areas of interest, namely machine learning (ML), data science, and artificial intelligence (AI), fit into the intellectual property space.
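For a flavor of what working with such data looks like, here is a minimal sketch of my own (not the post's actual code) using the xml2 package. The file name and XPath expressions are assumptions based on the USPTO grant XML format; note that the weekly bulk files concatenate many XML documents into one file, so they would need to be split into individual documents first.

```r
# Minimal sketch, assuming a single well-formed patent grant XML document.
# The file name is hypothetical; the element names follow the USPTO
# grant XML schema.
library(xml2)

doc <- read_xml("patent_grant.xml")
invention_title <- xml_text(xml_find_first(doc, "//invention-title"))
abstract <- xml_text(xml_find_first(doc, "//abstract"))
cat(invention_title, "\n")
```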

Read More

Freedman's paradox

Recently I came across the classic 1983 paper A note on screening regression equations by David Freedman. Freedman strikingly demonstrates the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into a second procedure performed on the same data. As a concrete example, Freedman considers the practice of performing variable selection first, and then fitting another model using only the selected variables, on the same data that were used to select them in the first place. Because of the unexpected severity of the problem, this phenomenon became known as “Freedman’s paradox”. In his paper, Freedman also derives asymptotic estimates for the resulting errors.
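A tiny simulation in the spirit of Freedman's setup (my own sketch, not his code) makes the effect tangible: regress pure noise on pure noise, screen the predictors at a lenient p-value cutoff, then refit on the same data, and some coefficients suddenly look convincingly significant.

```r
# Pure-noise data: the response is independent of all predictors.
set.seed(1)
n <- 100
p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# Step 1: fit the full model and keep predictors with p-value < 0.25.
fit_full <- lm(y ~ X)
pvals <- summary(fit_full)$coefficients[-1, 4]  # drop the intercept row
keep <- which(pvals < 0.25)

# Step 2: refit using only the screened predictors, on the same data.
fit_screen <- lm(y ~ X[, keep])
summary(fit_screen)  # several coefficients now appear "significant"
```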

Read More

5 ways to measure running time of R code

A reviewer asked me to report detailed running times for all (so many :scream:) computations performed in one of my papers, so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found.
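To give a flavor of two of the options, here is a minimal sketch (not the post's full comparison): base R's system.time() for a one-off measurement, and the microbenchmark package for repeated timings of small expressions.

```r
# Base R: time a single evaluation of an expression.
system.time({
  x <- replicate(100, sort(runif(1e4)))
})

# The microbenchmark package repeats a small expression many times for
# more reliable timings (install.packages("microbenchmark") if needed).
library(microbenchmark)
microbenchmark(sort(runif(1e4)), times = 100)
```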

Read More

The Lean PhD Student — Can The Lean Startup principles be applied to personal productivity in graduate school?

The lean startup methodology consists of a set of principles proposed and popularized by Eric Ries in the book The Lean Startup (and elsewhere). He believes that startup success can be engineered by following the lean startup methodology. Ries defines a startup as “a human institution designed to deliver a new product or service under conditions of extreme uncertainty”. If we replace “product or service” with “research result”, that sounds awfully similar to what a PhD student has to do. Indeed, the similarities between being a junior researcher, such as a PhD student, and running a startup have often been pointed out (for example: [1], [2], [3]). In light of this, I propose that the lean startup methodology can also be applied to a PhD student’s academic pursuits. Below, I adapt some of the most important lean startup concepts for application to a junior researcher’s personal productivity and academic success.1

  1. Please note that I’m writing from the point of view of mathematical, statistical, and computational sciences, rather than from the viewpoint of experimental sciences. 

Read More

Understanding the Tucker decomposition, and compressing tensor-valued data (with R code)

In many applications, data naturally form an n-way tensor with n > 2, rather than a “tidy” table. As mentioned at the beginning of my last blog post, a tensor is essentially a multi-dimensional array:

  • a tensor of order one is a vector, which is simply a column of numbers,
  • a tensor of order two is a matrix, which is basically numbers arranged in a rectangle,
  • a tensor of order three looks like numbers arranged in a rectangular box (or a cube, if all modes have the same dimension),
  • an nth-order (or n-way) tensor looks like numbers arranged in an n-dimensional hyperrectangle… you get the idea (see the short R sketch after this list).
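For concreteness, here is a minimal sketch of my own (not the post's code): base R represents an order-3 tensor as a multi-dimensional array, and the commented-out Tucker decomposition call is an assumption based on the rTensor package.

```r
# An order-3 tensor in base R is just a 3-dimensional array.
tnsr <- array(rnorm(3 * 4 * 5), dim = c(3, 4, 5))  # a 3 x 4 x 5 tensor
dim(tnsr)      # 3 4 5
tnsr[2, 3, 1]  # a single entry of the "box" of numbers

# A Tucker decomposition of that array, assuming the rTensor package:
# library(rTensor)
# tk <- tucker(as.tensor(tnsr), ranks = c(2, 2, 2))  # core tk$Z, factors tk$U
```
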
Read More

A comprehensive Linux command cheat sheet

I think it doesn’t matter what operating system you use, as long as you know your OS of choice well! :bowtie: This is a Linux command cheat sheet covering a wide range of topics. I cannot guarantee that the information is fully up-to-date or even correct; use at your own risk :stuck_out_tongue:. It is intended primarily as a future reference for myself. I learned most of the material covered below a couple of years ago in LinuxFoundationX’s Introduction to Linux course offered through edx.org.

Read More