Figure caption: Models/prompts and their mean ratings along multiple rating dimensions, assigned by me in a (to the extent possible) blinded fashion. The models/prompts are sorted by “would I use a similar Q-A pair in my deck?”, which for my purposes is the main quality indicator for the generated flashcards.
:sparkles: :sparkles: :sparkles:
Note on the open-source models used: I ran the experiments described below at the end of November 2023 using a single GPU and models available at that time. Given the pace at which LLMs develop, you should probably take the results with a grain of salt. Moreover, the locally run open-source models that I used may not have been the very best even at that time (even considering my hardware constraints).
:sparkles: :sparkles: :sparkles:
Recently in my reading I came across the statistical metric Fleiss’ kappa, which I had seen before but could no longer remember the definition of. This is exactly the type of scenario where I would like to include at least the definition of this statistical assessment measure in my spaced repetition database (Anki) – or, in other words, to “ankify” the concept.
I learned about spaced repetition and Anki about 6 years ago from several blog posts by Michael Nielsen, who specifically also covers the topic of creating flashcards for sophisticated mathematical topics (Nielsen, 2018; Nielsen, 2019). Indeed, I have noticed beneficial effects of my use of Anki as a knowledge worker over the last 5-6 years, as well as for some of my hobbies. I just wish sometimes that I were more consistent and disciplined in my use of Anki. But let’s reserve this discussion for another occasion, since the benefits and challenges of spaced repetition are not the topic of this blog post.
I have also been playing around with LLMs for a little while. But the vast majority of it was using the OpenAI API (mostly via the excellent ShellGPT and sometimes from Python directly), and I was looking for a good excuse to play around with LLMs more and to try out some open-source models that can run locally on my computer, such as those based on Llama 2 (unfortunately the Mixtral models were not released yet at the time). So, it seemed it would be a great idea to use different LLMs to generate a bunch of suggested Anki flashcards based on articles about Fleiss’ kappa, and I based my prompts to the LLMs in part on Michael Nielsen’s articles referenced above (see below for details about my prompting strategies).
As the primary goal of this exercise, I wanted to compare the outputs from different LLMs in a systematic way on this task of personal interest to me, as I had no idea how open-source LLMs running on my local computer would stack up against something like ChatGPT.
For the main part of this blog post, I will go through the models/LLMs and prompts first, and then describe the analysis and the results. So, overall this is what we are doing here:
AI-based flashcard generation
I used GPT-4 Turbo and GPT-3.5 Turbo via the OpenAI API, and two open-source LLMs running on my local computer (after trying several others), in combination with several prompting strategies – in total 10 different LLM-prompt combinations – to generate 100 Anki cards (question-answer pairs).1
LLM/prompt performance analysis “study”
There are the following components to this “study”:
There may be many imperfections in the performance “study” and it could be considered simplistic, but luckily we aren’t looking at a peer-reviewed scientific publication here but rather just a blog post that I’m quickly writing on a Sunday afternoon ([Narrator’s voice]: Well, it actually took much more than that one Sunday afternoon, with several intense writing sessions and extended breaks in between).
An excellent overview on how to use the OpenAI API and how to deploy local LLM models on your own hardware is provided in a Youtube video lecture by Jeremy Howard, and a substantial portion of the code that was used for this blog post originates from that video: A Hackers’ Guide to Large Language Models
In this section, I will go through each model and each prompt that I used, as well as my rationale behind it.
For my first attempt I used the GPT-4 Turbo model (gpt-4-1106-preview
), which had been released shortly before I started these experiments.
(GPT4 in the results figure)

My initial prompt for the task of interest here was:
prompt = "Create flashcards for a spaced repetition system on the topic of Fleiss' Kappa for me based on the wikipedia articles that I include below (after the string '--- Wikipedia articles ---'). You should mostly ignore your previous knowledge about Fleiss' Kappa and rely on the information provided in the Wikipedia articles below."
which was followed in the Python code by:
prompt = prompt + "\n\n--- Wikipedia articles ---\n\n"
prompt = prompt + "\n\n" + wiki_fleiss_kappa
prompt = prompt + "\n\n" + wiki_cohens_kappa
prompt = prompt + "\n\n" + wiki_scotts_pi
where wiki_fleiss_kappa
, wiki_cohens_kappa
, and wiki_scotts_pi
are copies of the respective Wikipedia articles (Wikipedia, 2023; Wikipedia, 2023; Wikipedia, 2023), which I scraped within my Python code using the Wikipedia-API package.
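The scraping snippet itself isn’t shown here; a minimal sketch of how the article texts can be fetched with the Wikipedia-API package could look like this (the user-agent string is just a placeholder):

```python
import wikipediaapi

# A user agent is required by the Wikipedia API policy; the string here is a placeholder
wiki = wikipediaapi.Wikipedia(user_agent="anki-flashcard-experiments/0.1", language="en")

wiki_fleiss_kappa = wiki.page("Fleiss' kappa").text
wiki_cohens_kappa = wiki.page("Cohen's kappa").text
wiki_scotts_pi = wiki.page("Scott's Pi").text
```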
The GPT-4 Turbo model with this prompt returned 9 flashcards, which generally were pretty decent, such as:
(…)
Flashcard 3: Applicability of Fleiss’ Kappa
- Front: Can Fleiss’ Kappa be used with various types of data?
- Back: Yes, Fleiss’ Kappa can be used with binary, nominal, or ordinal data, but for ordinal data, statistics that account for ordering, like Kendall’s coefficients, are usually more appropriate.
Flashcard 4: Formula for Fleiss’ Kappa
- Front: What is the formula for calculating Fleiss’ Kappa?
- Back: κ = (P̄ - P̄e) / (1 - P̄e), where P̄ is the mean of the extent to which raters agree for each subject, and P̄e is the mean of the proportion of all assignments which were to each category by chance.
(…)
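As a quick sanity check of the formula in “Flashcard 4” (this code is not part of the model output), the definition can be implemented directly; the toy counts below are made up:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories matrix of rating counts.

    counts[i, j] = number of raters who assigned subject i to category j;
    every row must sum to the same number of ratings n.
    """
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape                                         # N subjects, k categories
    n = counts[0].sum()                                         # ratings per subject
    p_j = counts.sum(axis=0) / (N * n)                          # proportion of assignments per category
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))   # agreement per subject
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example (made-up counts): 4 subjects, 3 categories, 5 raters per subject
ratings = [[5, 0, 0],
           [2, 3, 0],
           [0, 0, 5],
           [1, 2, 2]]
print(round(fleiss_kappa(ratings), 3))
```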
However, I wanted to get the model to generate more sophisticated question-answer pairs that would tease out more of the mathematical subtleties on the topic and quiz me for a deeper understanding of the concepts.
(GPT4_cont1 in the results figure)

To “teach” the model how I want it to go about creating Anki cards for me, and about the purpose of the Anki cards (what I want to get out of my spaced repetition practice), I decided to first feed it two articles on the topic (Nielsen, 2018; Nielsen, 2019):
prompt = "I want to you to learn about spaced repetition systems (SRS) such as Anki, so that you can act as a professional Anki card creator, with a particular expertise at creating Anki cards for topics in mathematics and statistics. Below I provide you first with an introductory text about spaced repetition systems by Michael Nielsen (starting after the string '--- FIRST TEXT ---' and ending with the string '--- FIRST TEXT END ---'). Then I provide you with another article by Michael Nielsen about creating Anki cards for mathematical topics (starting after the string '--- SECOND TEXT ---' and ending with the string '--- SECOND TEXT END ---'). Based on this reading material please explain what process you will follow, as a professional Anki card creator, to create Anki cards for me based on other articles, papers or notes that I will provide in the future."
which was followed by:
prompt = prompt + '\n\n--- FIRST TEXT ---\n\n' + text_srs
prompt = prompt + '\n\n--- FIRST TEXT END ---\n\n'
prompt = prompt + '\n\n--- SECOND TEXT ---\n\n' + text_srs_math
prompt = prompt + '\n\n--- SECOND TEXT END ---\n\n'
where text_srs
and text_srs_math
are plain text versions of (Nielsen, 2018; Nielsen, 2019), downloaded from within my Python code with the Beautiful Soup library.
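The download code isn’t shown either; a minimal sketch with requests and Beautiful Soup could look like the following (the URLs are my assumption of where the two Nielsen articles live):

```python
import requests
from bs4 import BeautifulSoup

def page_to_text(url):
    """Download a web page and return its visible text content."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

text_srs = page_to_text("http://augmentingcognition.com/ltm.html")           # Nielsen, 2018
text_srs_math = page_to_text("https://cognitivemedium.com/srs-mathematics")  # Nielsen, 2019
```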
After the model had output an explanation of the Anki card generation process based on this prompt, I followed up the conversation with the actual request to create Anki flashcards for me:
prompt2 = "Create Anki cards on the topic of Fleiss' Kappa for me based on the wikipedia articles that I include below (after the string '--- Wikipedia articles ---'). You should mostly ignore your previous knowledge about Fleiss' Kappa and rely on the information provided in the Wikipedia articles below."
where again I appended the prompt with the three articles (Wikipedia, 2023; Wikipedia, 2023; Wikipedia, 2023).
This approach provided 12 AI-suggested flashcards, which indeed seemed to go more in the direction I wanted. Here are a couple of examples of cards I like from this output:
Calculation of Agreement Proportions
Q: How do you calculate the proportion of assignments to the j-th category (pj) in Fleiss’ kappa? A: pj = (1 / Nn) ∑ from i=1 to N nij, with N being the total number of subjects, n the number of ratings per subject, and nij the number of raters who assigned the i-th subject to the j-th category.
Q: Define Pi in the context of Fleiss’ kappa. A: Pi is the extent to which raters agree for the i-th subject, calculated as (1 / n(n-1)) * ∑ from j=1 to k nij(nij-1).
(GPT4_cont2 in the results figure)

Trying to get even more useful (to me) Anki cards, I continued the same chat with the following prompt:
prompt3 = "Please create additional cards for deeper mathematical understanding, including more detailed breakdown of the mathematical definitions and formulas, the influence of different numbers of categories on kappa values, and comparisons with related statistics such as Scott's Pi and Cohen's kappa."
This gave 11 additional suggested question-answer pairs. Some were rephrased repeats of the last output, some were new and interesting, but generally I felt that GPT-4 was getting to its limits on the task here, likely because the Wikipedia articles I fed it simply didn’t provide enough content to go into more depth on the topic.
With the GPT-4 Turbo output in hand, an important question is whether GPT-3.5 Turbo can perform just as well, given that, compared to GPT-4, it is cheaper or even free to use. From my Python code I used the “gpt-3.5-turbo-1106” model through the OpenAI API. I had initially tried the “gpt-3.5-turbo” model, but it couldn’t handle the context length of my prompt, which was the same as my “initial prompt” for GPT-4 described above.
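The API calls themselves are not shown in this post; a minimal sketch using the current openai Python client (the exact client version I used may differ) looks roughly like:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",   # or "gpt-4-1106-preview" for the GPT-4 Turbo runs
    messages=[{"role": "user", "content": prompt}],  # `prompt` assembled as described above
)
print(response.choices[0].message.content)
```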
(GPT3 in the results figure)

I started with the same prompt as the “initial prompt” for GPT-4 described above.
(GPT3_cont1 in the results figure)

For the “longer more complex prompt” (described in detail in the GPT-4 section above), the model couldn’t handle both supplied articles (Nielsen, 2018; Nielsen, 2019) due to context length limitations. So, I only fed it one of the two articles (Nielsen, 2019), leaving the rest of the prompt unchanged.
(GPT3_cont2 in the results figure)

The prompt here was identical to the one used in the respective GPT-4 section above.
Next I wanted to try out a few open-source models, running locally on my computer, to perform the same flashcard generation task.
There is an overwhelming number of options for open-source models that can be downloaded from Huggingface (or perhaps one should call them “open-weight models” for more precise terminology). So there is a lot to choose from, and there are multiple leaderboards that can guide the choice, such as the Huggingface “Open LLM Leaderboard” or the “Chatbot Arena”. However, I haven’t yet dedicated time to thoroughly understand the metrics and construction of such leaderboards. For that reason, I didn’t base my model choices on any leaderboards for now. What I did instead was try out a few different models that I had seen mentioned in other people’s experiments on the internet. I then chose to stick with a couple of those models that would run on my hardware given the prompts I was using, and that seemed to provide usable output for the task in question. The computer I was using for this is basically a gaming PC with an Nvidia RTX 4090 graphics card and, other than that, somewhat older mid-level components.
Note that I ran the experiments described below at the end of November 2023. The available open-source models/solutions may have improved considerably since then, or may not have been the very best open models for the given task in the first place (even for my hardware constraints). I would appreciate any hints about superior open models for the task that can run offline on my local machine (for instance on a single Nvidia RTX 4090 or a comparable gaming GPU, or possibly CPU-only but I’m impatient).
Running an LLM on your own Nvidia GPU is made relatively easy by the Huggingface’s Transformers library in conjunction with PyTorch.
The open-source models that I tried initially (various derivatives of the Llama 2 LLM) tended to run out of GPU memory, given the prompts I was using (recall that I need to pass in at least most of the Wikipedia article on Fleiss’ Kappa as part of the prompt (Wikipedia, 2023)), although I had shortened the prompts considerably compared to what I used for GPT-4 and GPT-3.5 above. So, I had to leverage derivatives of popular LLMs that are more memory-efficient through the use of quantization techniques. Specifically, for the results presented below, I ended up using the model TheBloke/OpenOrca-Platypus2-13B-GPTQ, which is a GPTQ quantized version of OpenOrca-Platypus2-13B, which in turn is a merge of two fine-tuned models based on LLaMA2-13B by Meta. The reason I chose that specific model for the experiments is partly due to it being one of the models used by Jeremy Howard in the video referenced above (if I recall correctly), and also based on the initial experimentation with multiple other models.
(OrcaPlatypus2-13B_prompt1 in the results figures below)

Due to context length limitations, I used the following shorter prompt (compared to the GPT prompts above):
Spaced repetition is an evidence-based learning technique that is usually performed with flashcards, which are essentially question-answer pairs. Newly introduced and more difficult flashcards are shown more frequently, while older and less difficult flashcards are shown less frequently in order to exploit the psychological spacing effect. The use of spaced repetition has been proven to increase the rate of learning.
Given the text below (after the string 'TEXT'), suggest flashcards (i.e. questions and the corresponding answers) for a spaced repetition system, in order to help an undergraduate student to learn the presented information. Please provide your suggested flashcards as question-answer pairs (Q: ..., A: ...).
\n\n
TEXT.
where only the Wikipedia article on Fleiss’ kappa (Wikipedia, 2023) was appended to the prompt, but unlike previously, not the articles on Cohen’s Kappa and Scott’s Pi (Wikipedia, 2023; Wikipedia, 2023):
prompt = prompt + "\n\n" + wiki_fleiss_kappa
Then I converted this prompt into the instruction-response template format of base Platypus2. By “template format” I mean a standardized prompt formatting that can look something like, ### Instruction: ... ### Response:
, or User: ... <|end_of_turn|>Assistant:
, etc., which is needed for the open-source LLMs (not sure if all of them though) to ensure that they provide an actual response to my query rather than treat it as a piece of text to be extended in an arbitrary manner with some additional text.
To be honest, I’m not quite sure which prompt template format would have been best to use, but this seemed to work well enough.
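For completeness, here is a minimal sketch of how the model can be loaded and queried with the Transformers library (assuming the optimum/auto-gptq integration for GPTQ weights is installed; the template below is one plausible format, the model card may specify a slightly different one):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/OpenOrca-Platypus2-13B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # GPTQ weights on the GPU

def to_instruction_template(user_prompt):
    # One plausible instruction-response template; check the model card for the exact format
    return f"### Instruction:\n\n{user_prompt}\n\n### Response:\n\n"

# `prompt` is the flashcard-generation prompt assembled above (with the Wikipedia text appended)
inputs = tokenizer(to_instruction_template(prompt), return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```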
This query gave me 7 reasonable-looking flashcards as question-answer pairs. After that the output started to deteriorate: first a flashcard that was grammatically mostly correct but didn’t make any sense, then disconnected sentence fragments, and finally a repeated sequence of characters. Parts of the example output are provided below for illustration:
1. Question: What is the most suitable measure for assessing agreement between a fixed number of raters when assigning categorical ratings to a number of items or classifying items?
Answer: Fleiss’ kappa.
2. Question: When should we use Cohen’s kappa over Fleiss’ kappa?
Answer: Fleiss’ kappa works for any number of raters, while Cohen’s kappa only works when assessing the agreement between two raters or the intra-rater reliability.
(…)
[Five other reasonable flashcards (not shown)]
[Then the output starts to deteriorate:]
8. Question: If the sum of all cells is equal to 1440, what does it mean?
Answer: If the sum of all cells equals 40 to 40 cells, then it would mean that this value is used to maintain the consist of data40. The rater and Cells and classified data’s.
9. In this manner, but the consistent with the Pearson and consistent with Pearson and in terms of the analysis and classified data within correlation and the agreement, data. The data.
9.5. Pearson and the data. The Pearson and the agreement4. Each rater and correlation and in the agreement with the data.5. The data. correlation analysis, with6. Each raters and data.5.5. The data on the ratio and correlation and the data. Each entry rate. The data. Each rater and correlation and data.
9. The data.1. The agreement and the data; that helps in6. The data.5. Each pair of data.4.0. The data. Analysis in terms5.02. Each. the data. The data.5.
6. The more. Each.6. The data.7.11. the data.5.7.5.5.0.6.5.5. the6.6.5.5.5.6.6.5.data.5.6.5.5.5.6.6.6.6.6.5.6.5.6.6.6.5.5.6.6.6.6.5.6.5.5.6.6.5.6.5.6.6.6. .5.6.6,6.5.6.5.5.5.6.5.6.5.5.6.5.5.5.5.6.5.5.5.6.5. 6.5.5.5. 5.5.5.5.5.5.5.5. 5.6.6. 6 to 5.5.5.5. 5.5.6. 6.5.5.
(OrcaPlatypus2-13B_prompt2 in the results figures below)

After the partially successful attempt above, I decided to try replacing the scraped Wikipedia article on Fleiss’ Kappa with a somewhat more manually curated text about Fleiss’ Kappa. That “more manually curated text” was me copy-pasting only the relevant parts of the Wikipedia article, with better formatting than what I had obtained previously with the Wikipedia-API Python package in an automated fashion.
In addition, the copy-pasted Wikipedia excerpts were prepended by a very simple sentence describing the task:
Given the text below, suggest questions and the corresponding answers to use in a quiz or exam for a class of undergraduate students.
In my experience, using that simple description of the task, which doesn’t even mention “spaced repetition” or “flashcards”, helped to improve the output for some other locally run LLMs that I tried (not shown) – predominantly smaller models, which otherwise tended to not address the right task (for example, suggesting questions about the spaced repetition concept rather than the desired topic) or to produce many hallucinations.
The result of this indeed seemed better compared to the last prompt, and also didn’t exhibit artifacts like the ones illustrated in the last subsection. However, from a practical perspective the value of this practice is a little questionable: if I can take the time to manually curate a better input text for the LLM, I could just as well have used that same time to create the Anki flashcards manually, without using the LLM.
(OrcaPlatypus2-13B_prompt3 in the results figures below)

For the OpenOrca-Platypus2-13B model I have also made a variation of the same prompt, referred to as OrcaPlatypus2-13B_prompt3, where the flashcard generation task for spaced repetition was described in somewhat more detail:
Spaced repetition is an evidence-based learning technique that is usually performed with flashcards, which are essentially question-answer pairs. Newly introduced and more difficult flashcards are shown more frequently, while older and less difficult flashcards are shown less frequently in order to exploit the psychological spacing effect. The use of spaced repetition has been proven to increase the rate of learning.
Given the text below (after the string 'TEXT'), suggest flashcards (i.e. questions and the corresponding answers) for a spaced repetition system, in order to help an undergraduate student to learn the presented information.
\n\n
TEXT
where the “more curated” text on Fleiss’ Kappa was appended after “TEXT”.
(llama.cpp-Llama2-7B in the results figures)

Finally, I also wanted to explore the use of LLMs without an Nvidia GPU, i.e., running on the CPU of my computer, by utilizing the llama.cpp Python package. While llama.cpp allows you to run Meta’s LLaMA models on different kinds of hardware, I used the default Linux install, which runs on CPU only.
The specific model I used in conjunction with llama.cpp was llama-2-7b-chat.Q5_K_M.gguf (again, something I saw in Jeremy Howard’s “A Hackers’ Guide to Large Language Models” video if I remember correctly).
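The llama.cpp invocation isn’t shown in this post; with the llama-cpp-python bindings, a minimal sketch (model path and generation parameters are placeholders) could be:

```python
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q5_K_M.gguf", n_ctx=4096)  # runs on the CPU by default

# `prompt` is the flashcard-generation prompt described below, with the Wikipedia text appended
output = llm(prompt, max_tokens=1024, temperature=0.7)
print(output["choices"][0]["text"])
```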
It was difficult to experiment with prompts for llama.cpp, because, when using CPU only, text generation is slow. So, in the following, I analyze results only for a single prompt:
Spaced repetition is an evidence-based learning technique that is usually performed with flashcards, which are essentially question-answer pairs. Newly introduced and more difficult flashcards are shown more frequently, while older and less difficult flashcards are shown less frequently in order to exploit the psychological spacing effect. The use of spaced repetition has been proven to increase the rate of learning.
Given the text below (after the string 'TEXT'), suggest flashcards (i.e. questions and the corresponding answers) for a spaced repetition system, in order to help an undergraduate student to learn the presented information. Please provide your suggested flashcards as question-answer pairs (Q: ..., A: ...).
\n\n
TEXT
where as before the Wikipedia article about Fleiss’ Kappa was appended to the prompt, and no additional information was appended due to context length limitations.
The output contained 8 nicely formatted suggested flashcards, without anything completely nonsensical or hallucinated.
To remind you, as discussed at the top of this post, there are the following main components to this performance evaluation “study”:
As mentioned in the introduction, there could be many imperfections and limitations in this assessment “study” of LLM performance, but we aren’t going to worry about that, since this is just a random experiment I’m doing in my spare time.
After creating the LLM-generated flashcards, I put them in random order into a spreadsheet, which also excluded any indication of the models or prompts used for each output. Then I put this project aside for a week, which allowed me to mostly forget which question-answer pair was suggested by which LLM/prompt. This one-week break can be viewed as slightly analogous to a wash-out period (albeit a very short one) in reader studies for performance evaluation in diagnostic medicine, as I was taking a deliberate extended break with the goal of forgetting what I knew about the data. When I came back to this project, I rated each AI-suggested flashcard along the following five dimensions:
- Is the question self-contained?
- Is the question atomic?
- Is the answer truthful?
- Does the Q-A pair make sense?
- Would I use a similar Q-A pair in my deck?
As mentioned above, I blinded myself to the models/prompts used for each generation, and leveraged my forgetfulness by taking a one-week break between generating the flashcards and rating them. Within each category I assigned ratings on a scale of 0, 0.5, or 1; that is, I sometimes gave partial credit. For example, for truthfulness, an AI-suggested answer to a flashcard could have two parts, where one may be correct while the other is false; such a two-part answer flashcard would likely get a 0.5 in the “truthfulness” category and a 0 for “atomicity”.
Finally, I created flashcards for my actual Anki deck, using 21 of the suggested Q-A pairs as the basis for my final Anki cards (only one of the final cards matches a suggested card exactly).
In addition, I looked at embeddings of each AI-generated flashcard, and compared how closely they match the embeddings of the 21 Anki cards that I actually ended up including in my deck (manually modified flashcards based on some of the AI-generated ones).
For this I used OpenAI’s embedding model text-embedding-ada-002
.
I used cosine similarity, a metric similar to the widely known Pearson correlation coefficient, to compare the text embedding of each of the 100 AI-generated cards with the embeddings of each of my 21 human-curated flashcards. For each AI-generated flashcard I then recorded the maximum of the 21 cosine similarity values, which I denote as max_cos_sim
. The max_cos_sim
values can be used as another approach to compare the generative models in this experiment, attempting to evaluate how similar the output of each model is to the flashcards that I eventually deemed worthy of including in my spaced repetition deck.
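The embedding code isn’t included above; a minimal sketch of how max_cos_sim can be computed (v1-style openai client; the two card lists are placeholders for the generated and the curated Q-A strings) could look like:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def embed(texts):
    """Return one embedding vector per input text as a 2D numpy array."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ai_emb = embed(ai_generated_cards)   # the 100 AI-suggested Q-A pairs (placeholder variable)
my_emb = embed(my_final_cards)       # the 21 curated Q-A pairs (placeholder variable)

max_cos_sim = [max(cos_sim(a, m) for m in my_emb) for a in ai_emb]
```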
I simply compared the means per model/prompt. No sophisticated statistical analysis was performed at this time, because of my time limits for this blog post and complications due to the small sample sizes and various sources of potential bias or variability that would need to be accounted for. More detail on the analysis of each specific rating categorization as well as the embeddings is provided in the subsections under “Results” below.
If I have time and interest in the future, I may update the analysis with:
The mean and standard deviation values of the max_cos_sim
metric (described in the section “Calculating embeddings” above) provide numerical measures of how similar the AI-generated flashcards are to the ones I ultimately added to my Anki deck. However, I found that this is a poor way of comparing LLMs on this task, for the reasons outlined below. The breakdown per model/prompt is as follows:
| model/prompt | max_cos_sim: mean (std. dev.) |
|---|---|
| GPT4 | 0.964849 (0.037430) |
| GPT3 | 0.957126 (0.031817) |
| GPT3_cont1 | 0.956216 (0.033335) |
| GPT4_cont1 | 0.953869 (0.037121) |
| GPT3_cont2 | 0.951629 (0.046560) |
| GPT4_cont2 | 0.942788 (0.033649) |
| OrcaPlatypus2-13B_prompt2 | 0.938141 (0.044916) |
| OrcaPlatypus2-13B_prompt3 | 0.931788 (0.047142) |
| OrcaPlatypus2-13B_prompt1 | 0.905019 (0.053023) |
| llama.cpp-Llama2-7B | 0.891498 (0.030630) |
The GPT models are in front according to this metric. But note that this does not account for factors such as diversity of the generated flashcards, how sophisticated they are, different numbers of cards generated by each model, etc.
Generally, the max_cos_sim
metric turned out to not be very informative for reasons including:
An appropriately designed custom metric could be used instead of the max_cos_sim metric proposed above to account for some of these issues. For instance, some problems with max_cos_sim stem from it evaluating a given model’s generated flashcards as independent samples, when in fact the task is to come up with an optimal set of flashcards for studying a given topic (i.e., a set with sufficient coverage of the topic, the desired amount of breadth and depth, little repetition, etc.). Therefore, an appropriate specialized performance metric would likely have to be constructed so that it compares the entire set of each model’s generated cards (considered as a whole rather than as independent cards) against the entire set of the flashcards that were ultimately included in my deck.
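To illustrate the idea (this was not actually computed for this post), one possible set-level score could measure how well a model’s generated cards cover the curated set as a whole:

```python
import numpy as np

def set_coverage(gen_emb, ref_emb):
    """Mean, over the reference (curated) cards, of the best cosine similarity to any
    generated card -- a rough sketch of a set-level coverage score, not a validated metric."""
    sims = gen_emb @ ref_emb.T / (
        np.linalg.norm(gen_emb, axis=1)[:, None] * np.linalg.norm(ref_emb, axis=1)[None, :]
    )
    return float(sims.max(axis=0).mean())
```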
However, not wanting to spend even more time on this project, I didn’t investigate these aspects further.
Another drawback of comparing text embeddings that I want to highlight is that subtle word changes can make a huge change in the overall quality of a flashcard while the embeddings will stay very similar. By changing a word to another word that has a related but somewhat different meaning, a flashcard can turn from something providing a lot of insight to something that’s unclear or even false. For example compare the following question that was generated by GPT-3.5 Turbo in this experiment:
Why may Fleiss’ kappa not be suited for cases where all raters rate all items?
with the slightly modified question:
Why is Fleiss’ kappa not suited for cases where all raters rate all items?
The small change of the verb from “may be” to “is” makes a big difference for the scientific meaning of the sentence, but the similarity between embeddings is very high with a cosine similarity of 0.9925.
That is not to say that the similarity between embeddings makes no sense at all – I in fact do observe that the specific question-answer pairs, which were actually used as the basis for my Anki cards, have generally higher max_cos_sim
values than those questions which I didn’t end up using, as shown in this table:
| | max_cos_sim: mean (std. dev.) | max_cos_sim: median |
|---|---|---|
| Q-A pairs I used | 0.984527 (0.026339) | 0.995037 |
| Q-A pairs I didn’t use | 0.928203 (0.040223) | 0.938435 |
Is the question self-contained?
Here, I was rating whether the question can be understood without any additional explanations (such as definitions), beyond some kind of common knowledge (of course, there is some room for interpretation of what I consider not needing a definition).
Here is an example of a question from the GPT4_cont2
output which I rated as not self-contained:
Describe how to compute \(P_i\) for subject \(i\).
Here is a bar graph of the results per model/prompt:
The most striking observation here is that llama.cpp-Llama2-7B received very low scores, with a large separation from the other models. But note that this is a bit unfair towards llama.cpp, because it generated flashcards under headings/topics that describe the context to a certain extent (for example: “Topic 1: Classification agreement between raters (P i )”), but when I rated the generated cards, I only included the questions and answers (without any such headings), in order to blind myself to the models/prompts by keeping the same format for all of them.
Is the question atomic?
For this category, I was rating, essentially, whether a given flashcard is testing for recall of a single concept. That is, a two-part answer would likely not be atomic.
Here is an example of a question suggested by llama.cpp-Llama2-7B
that I rated as not “atomic”, because it is asking to list “some” disadvantages of Kappa, rather than asking about one specific disadvantage:
What are some disadvantages of Kappa?
Here is the breakdown of results per model/prompt:
Is the answer truthful?
Note that high truthfulness may not necessarily imply a high-quality spaced repetition flashcard, because, for example, the question-answer pair may be far too obvious or far too difficult. Also note that the truthfulness property isn’t necessarily very important for this task (at least to me), because the AI-augmented spaced repetition card creation process would involve checking and/or adjusting each AI-suggested flashcard before adding it to the deck.
Here is the bar graph of the results:
Does the Q-A pair make sense?
There is some conceptual similarity between this rating dimension and the one titled “would I use a similar Q-A pair in my deck?” (results below), because both of them can be viewed as measures of the overall quality of a flashcard. However, the crucial difference between them is that I considered the “would I use” category as something personalized to me, while the ratings in the “makes sense” category are intended to assess whether it could be a good flashcard for somebody in general. That is, I might rate a given flashcard high on the “makes sense” dimension even though the specific question and answer aren’t something that I would want to include in my own deck (for example, it could be a great question, just not on an aspect that I personally find interesting or important to know).
Also, because I actually went through the process of creating new flashcards based on the AI-suggested ones, I can answer the question “am I going to use a similar flashcard?” with far more certainty than the question “does this make sense?”. One can observe this on how the following bar graph differs in range and separation of models/prompts compared to the graph in the next section.
Would I use a similar Q-A pair in my deck?
I consider this to be the main quality indicator for the generated flashcards, since this entire exercise is about generating cards for my own Anki deck. After observing the limitations of the other rating categorizations and of the text embedding approach, it also seems the most suitable choice for the primary metric. Moreover, this variable in a way combines all the other rating dimensions I used.
Here is the breakdown of the results:
We see that (somewhat more) manually curating the input to the offline open-source models improved performance substantially (see OrcaPlatypus2-13B_prompt3
vs. the other open-source models). The same is true for several other rating dimensions above. But, as I have mentioned before, this has limited utility, because if I need to manually curate the input text, I could just as well create the flashcards without the help from AI.
Interestingly, I don’t see any particularly strong correlation of the “Would I use a similar flashcard?” ratings with any of the other rating dimensions. This implies that whether I will use an AI-generated flashcard as the basis for new cards in my Anki deck might depend to a large degree on other, uncaptured factors. Perhaps I didn’t capture some other important dimensions of flashcard quality, such as the uniqueness or originality of an AI-generated flashcard, or something else that I didn’t think of. Or it could be just highly personal and strongly dependent on my specific background, interests, and taste, which are much harder to quantify or measure.
There is discussion of the results in the individual subsections above, including some concluding remarks, and I don’t want to repeat that information. But here are some key takeaways from the experiments.
On another note, the AI-augmented spaced repetition flashcard generation process strongly encouraged me to have a much deeper look at the topic of Fleiss’ Kappa and similar measures than I would have otherwise. Because some of the suggested flashcards are very interesting, but, at the same time, are missing important context or explanations (or proofs in case of mathematics) which weren’t sufficiently covered/explained in the Wikipedia articles that I fed to the models, it forced me to read (parts of) several academic research papers. So, the AI-augmented process sort of strongly motivates you to look more deeply into the topic by providing short intriguing bits of information as concise question-answer pairs.
While I presented a semi-rigorous comparison along multiple rating dimensions, I may have failed to capture some other important aspects of the quality of AI-generated flashcards. Moreover, some of my assigned ratings could be highly subjective, because there is naturally a considerable level of subjectivity in what constitutes a high-quality flashcard for spaced repetition, depending on personal taste and experience, the topic and one’s prior knowledge of it or around it, various aspects of the context of the spaced repetition practice, etc. Therefore, it would be hard to justify the amount of time and effort needed to improve the prompt, input material, and models to achieve the optimal flashcards for one’s exact personal preferences and conditions (which, in addition, will probably vary from one occasion to another or over time). So, overall, it seems that for now it’s probably best to use LLMs only to get some sub-optimal flashcards to use as a starting point for manual editing. Some AI-generated cards may even highlight interesting aspects which the user would have overlooked otherwise.
There are many excellent resources on LLMs to get started – some that have helped me are:
In my case, I basically watched the two videos linked above, and then was able to figure out how to do anything that I wanted to do by just reading the explanations and examples in the docs of the relevant Python packages and of models at Huggingface.
Footnotes:
Different model-prompt combinations resulted in different numbers of AI-generated flashcards, but somehow by random luck the total summed to exactly 100. ↩
It is New Year’s Day 2023 :sweat_smile:. Happy New Year!!! :fireworks: I am currently driving with my family coast-to-coast on a road trip through the United States, but for New Year’s Eve and New Year’s Day we stayed in one place. Taking advantage of the driving-free days, my 4-year-old son and I had some great fun with the open-source stable diffusion models; in particular, the Text-Guided Image Inpainting techniques.
Basically, inpainting allows you to replace or transform image areas of your choice with something else AI-generated based on a text prompt. You can see some of my results in the collage above. The top left panel shows the original (real) image. That’s a photo I took of my son during breakfast at a restaurant this morning, and he found it absolutely hilarious how we can drastically modify it with the computer – the text prompts we used were based on his suggestions to a large part.
I had already played around a few times with image generation with stable diffusion in Python, and with textual inversion for representing a specific artistic style. I was immediately (and still am) positively surprised by how easy and pleasant the developers made it to use stable diffusion via the Huggingface diffusers library in Python. But I hadn’t looked at inpainting techniques until today. I learned a lot from great tutorials about stable diffusion, such as the FastAI notebook “Stable Diffusion Deep Dive”, but I hadn’t specifically seen examples of inpainting so far (though I hadn’t looked :stuck_out_tongue:). So, I’m providing some relevant code snippets here.
There are two clear ways in which inpainting could be applied to the image I started with (top left in the collage above). Either replace/transform the boy, or replace/transform the drawing that he is holding.
However, first, one has to define an image mask:
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Mask array of shape (height, width): 0 = keep, 255 = region to be inpainted
mask = np.zeros(init_image.size).T
mask[270:, :] = 255       # everything from row 270 down is to be inpainted...
mask[550:, 400:] = 0      # ...except this bottom-right region, which is kept as-is
mask = Image.fromarray(np.uint8(mask)).convert('RGB')
plt.imshow(mask)
Then the inpainting pipeline can be loaded:
import torch
from diffusers import DiffusionPipeline

# Load the Stable Diffusion 2 inpainting pipeline in half precision on the GPU
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    revision="fp16",
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_attention_slicing()  # to save some gpu memory in exchange for a small speed decrease
Before applying the models, I resized and square-padded all images to 512x512 pixels (I saw the recommendation for square-padding in someone else’s stable diffusion inpainting code, I don’t remember where exactly, and didn’t do any experiments without square-padding).
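The square_padding helper is not included in this post; a plausible minimal version (padding with black; the function I actually used may differ) is:

```python
from PIL import Image

def square_padding(img, fill=(0, 0, 0)):
    """Pad a PIL image with a solid color so that it becomes square."""
    w, h = img.size
    side = max(w, h)
    padded = Image.new("RGB", (side, side), fill)
    padded.paste(img, ((side - w) // 2, (side - h) // 2))
    return padded
```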
Using the above model, I was able to generate images with code like:
import torch
torch.manual_seed(2023)
inp_img = square_padding(init_image) # my own function, init_image is loaded with PIL.Image
mask = square_padding(mask)
inp_img = inp_img.resize((512, 512))
mask = mask.resize((512, 512))
prompt = "something..."
negative_prompt = "something..."
result = pipe(prompt, image = inp_img, mask_image = mask, negative_prompt=negative_prompt,
num_inference_steps = 50, guidance_scale = 11).images
result[0] # this is the generated image
Alternatively, the generated image can be created in an image-to-image fashion. For this, I adapted an example from the huggingface/diffusers repository, along the lines of:
from diffusers import DiffusionPipeline
import torch
torch.manual_seed(2023)
inp_img = my_input_image # loaded with PIL.Image
mask = my_image_mask # also PIL.Image
inner_image = inp_img.convert("RGBA")
pipe = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-inpainting",
custom_pipeline="img2img_inpainting",
torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing() # to save some gpu memory in exchange for a small speed decrease
prompt = "something..."
negative_prompt = "something..."
result = pipe(prompt=prompt, image=inp_img, inner_image=inner_image,
mask_image=mask, negative_prompt=negative_prompt,
num_inference_steps = 50, guidance_scale = 10).images
result[0] # this is the generated image
Assume that the random two-dimensional vector \((Y_1, Y_2)\) follows a bivariate distribution with mean \(\E(Y_1, Y_2) = (\mu_1, \mu_2)\), and covariance matrix with entries \(\mathrm{Var}(Y_1) = \sigma_1^2\), \(\mathrm{Var}(Y_2) = \sigma_2^2\) and \(\mathrm{Cov}(Y_1, Y_2) = \sigma_{12}\).
The squared perpendicular distance of the random point \((Y_1, Y_2)\) from the 45-degree line is
\[\begin{equation*} D^2 = \frac{(Y_1 - Y_2)^2}{2}, \end{equation*}\]see the figure below. Thus, the expected value of the squared perpendicular distance times two (for notational convenience) is given by,
\[\begin{align} \E\left[ 2D^2 \right] &= \E\left[ (Y_1 - Y_2)^2 \right] \nonumber \\ &= \E\left[ \left( (Y_1-\mu_1) - (Y_2-\mu_2) + \mu_1-\mu_2 \right)^2 \right] \nonumber \\ &= \E\left[ \left((Y_1-\mu_1) - (Y_2-\mu_2) \right)^2 \right] + (\mu_1-\mu_2)^2 \nonumber \\ &= (\mu_1-\mu_2)^2 + \sigma_1^2 + \sigma_2^2 - 2\sigma_{12} \label{eq:decomp1} \\ &= (\mu_1-\mu_2)^2 + (\sigma_1 - \sigma_2)^2 + 2[1 - \rho] \sigma_1 \sigma_2. \nonumber \end{align}\]To answer the question raised above, we can estimate the value of equation \(\eqref{eq:decomp1}\) based on \(n\) pairs of observations \((y_{11}, y_{12}), (y_{21}, y_{22}), \dots, (y_{n1}, y_{n2})\) substituting the respective sample mean, sample variance, and covariance estimates for \(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12}\) respectively.
That’s great, but why should we spend any time thinking about the expected distance from the 45-degree line? What’s interesting about it?
Apart from delighting in the pure joy of doing mathematics and taking pleasure in the experience of mathematical beauty… :joy: :stuck_out_tongue_closed_eyes: … A measure of distance from the 45-degree line naturally quantifies the (dis)agreement between the two sets of observations. For example, we may have measured the same target entities using two different measurement instruments, and may want to know if and to what extent they agree.
Towards quantifying the extent of the (dis)agreement between two sets of observations it is natural to try to scale (or normalize) the quantity of equation \(\eqref{eq:decomp1}\) to the range \([0, 1]\). However, it turns out that, rather than scaling to a \([0, 1]\) range, it is customary to scale this quantity to the range from -1 to 1 as follows,
\[\begin{equation} \mathrm{CCC} := 1 - \frac{\E\left[ (Y_1 - Y_2)^2 \right]}{(\mu_1-\mu_2)^2 + \sigma_1^2 + \sigma_2^2} = \frac{2\sigma_{12}}{(\mu_1-\mu_2)^2 + \sigma_1^2 + \sigma_2^2}. \label{eq:ccc} \end{equation}\]This expression, first introduced by (Lin, 1989), is known as the Concordance Correlation Coefficient, abbreviated as CCC hereafter.
The scaling into the range from -1 to 1 may have been motivated by the fact that the Pearson correlation coefficient \(\rho\) also falls within the \([-1, 1]\) range. In fact, analogous to how a Pearson correlation coefficient \(\rho=1\) signifies perfect positive correlation, a CCC of 1 designates that the paired observations fall exactly on the line of perfect concordance (i.e., the 45-degree diagonal line).
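As a concrete illustration (not part of the original derivation), a sample estimate of the CCC is obtained by plugging the sample moments into equation \(\eqref{eq:ccc}\):

```python
import numpy as np

def concordance_correlation(y1, y2):
    """Sample estimate of Lin's concordance correlation coefficient (CCC)."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    m1, m2 = y1.mean(), y2.mean()
    s1, s2 = y1.var(), y2.var()              # 1/n variances
    s12 = np.mean((y1 - m1) * (y2 - m2))     # 1/n covariance
    return 2.0 * s12 / ((m1 - m2) ** 2 + s1 + s2)

t = np.linspace(0, 10, 50)
print(concordance_correlation(t, t))         # perfect concordance gives CCC = 1
```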
Further aspects of the relationship to the Pearson correlation coefficient \(\rho\) become visible if we rewrite the CCC further into the following set of equations.
\[\begin{equation} \mathrm{CCC} = \rho C, \label{eq:ccc2} \end{equation}\]where
\[\begin{equation} C = \frac{2}{v + \frac{1}{v} + u^2}, \quad v = \frac{\sigma_1}{\sigma_2}, \quad u = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1 \sigma_2}}. \label{eq:c} \end{equation}\]From equations \(\eqref{eq:ccc2}\) and \(\eqref{eq:c}\) we observe that:
Now it turns out that the Pearson correlation coefficient \(\rho\) has one major shortcoming when assessing reproducibility of measurements, such as when comparing two instruments that measure the same target entity.
:point_right: Unlike CCC, \(\rho\) is invariant to additive or multiplicative shifts by a constant value, referred to as location shift and scale shift respectively in the following set of figures:
Looking at the above figures we see that the magnitude of the Pearson correlation coefficient \(\rho\) does not change under location and scale shift (though the sign may flip). The CCC on the other hand quantifies the deviation from the 45-degree line, which is due to location and scale shifts in these examples, rather well.
This makes the CCC a better metric when we want to assess how well one measurement can reproduce another (i.e., how close the measurement pairs fall to the 45-degree line), while we would use \(\rho\) if what we want is quantifying to what extent the measurement pairs can be described by a linear equation (with any intercept and slope).
The following figures show the same examples where both the \(x\) and the \(y\) coordinates are augmented with Gaussian noise (mean 0, standard deviation 15; the same realization of the random noise is used within each subfigure). We see that both \(\rho\) and CCC move further away from the extreme values of \(-1\), \(0\), and \(1\) as noise is added.
As hinted above, you may want to compare two instruments that aim to measure the same target entity, or two assays that aim to measure the same analyte, or other quantitative measurement procedures or devices. For example, one set of measurements may be obtained by what’s considered the “gold standard”, while the other set of measurements may be collected by a new instrument/assay/device that may be cheaper or in some other way preferable to the “gold standard” instrument/assay/device. Then one would wish to demonstrate that the collected two sets of measurements are equivalent. (Lin, 1989) refers to this type of agreement or similarity between two sets of measurements as reproducibility of measurements. The paper considers the following two illustrative examples:
(1) Can a “Portable $ave” machine (actual name withheld) reproduce a gold-standard machine in measuring total bilirubin in blood?
(2) Can an in-vitro assay for screening the toxicity of biomaterials reproduce from trial to trial?
And indeed this type of reproducibility assessment is a task where CCC has some clear advantages over the Pearson correlation coefficient, as seen in the figures above, as well as over some other approaches, as discussed in (Lin, 1989) in detail. A couple of shortcomings of common statistical approaches (when applied to the reproducibility assessment problem in question) are the following:
I will end here. However, if you want to go deeper into the topic I invite you to check out the original paper by Lin for a more thorough discussion of the merits of the CCC as well as for its statistical properties. Moreover, since the publication of (Lin, 1989) there of course has been follow-up work, which I didn’t read (so, I may update this blog post in the future).
Let \(X\) and \(Y\) be two real-valued random variables.
For a fixed set \(B\) (Feller, 1966, p. 157) defines conditional probability of an event \(\{Y \in B\}\) for given \(X\) as follows.
By \(\prob(Y \in B \vert X)\) (in words, “a conditional probability of the event \(\{Y \in B\}\) for given \(X\)”) is meant a function \(q(X, B)\) such that for every set \(A \subseteq \mathbb{R}\)
\[\prob(X \in A, Y \in B) = \int_A q(x, B) \mu(dx)\]where \(\mu\) is the marginal distribution of \(X\).
(where \(A\) and \(B\) are both Borel sets on \(\R\).)
That is, the conditional probability can be defined as something that, when integrated with respect to the marginal distribution of \(X\), results in the joint probability of \(X\) and \(Y\).
Moreover, note that if \(A = \R\) then the above formula yields \(\prob(Y \in B)\), the marginal probability of the event \(\{ Y \in B \}\).
For example, if the joint distribution of two random variables \(X\) and \(Y\) is the following bivariate normal distribution
\[\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma^2_X & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma^2_Y \end{pmatrix} \right),\]then by sitting down with a pen and paper for some amount of time, it is not hard to verify that the function
\[q(x, B) = \int_B \frac{1}{\sqrt{2\pi(1-\rho^2)}\sigma_Y} \exp\left(-\frac{\left(y - \mu_Y+\frac{\sigma_Y}{\sigma_X}\rho( x - \mu_X)\right)^2}{2(1-\rho^2)\sigma_Y^2}\right) \mathrm{d}y\]in this case satisfies the above definition of \(\prob(Y \in B \vert X)\).
Later on (Feller, 1966, p. 159) follows up with the notion of conditional probability distribution:
By a conditional probability distribution of \(Y\) for given \(X\) is meant a function \(q\) of two variables, a point \(x\) and a set \(B\), such that
for a fixed set \(B\)
\[q(X, B) = \prob(Y \in B \vert X )\]is a conditional probability of the event \(\{Y \in B\}\) for given \(X\).
\(q\) is for each \(x\) a probability distribution.
It is also pointed out that
In effect a conditional probability distribution is a family of ordinary probability distributions and so the whole theory carries over without change.
(Feller, 1966)
When I first came across this viewpoint, I found it incredibly enlightening to regard the conditional probability distribution as a family of ordinary probability distributions. :smile:
For example, assume that \(X\) is an integer-valued and non-negative random variable, and that the conditional probability distribution of \(Y\) for given \(X\) is an F-distribution (denoted \(\mathrm{F}(d_1, d_2)\)) with \(d_1 = e^X\) and \(d_2 = 2^X\) degrees of freedom. Then the conditional probability distribution of \((Y \vert X)\) can be regarded as a family of probability distributions \(\mathrm{F}(e^x, 2^x)\) for \(x = 0, 1, 2, \dots\), whose probability density functions look like this:
In addition, as pointed out above, if we know the marginal distribution of \(X\), then the conditional probability distribution of \((Y \vert X)\) can be used to obtain the marginal probability distribution of \(Y\), or to randomly sample from the marginal distribution. Practically it means that if we randomly generate a value of \(X\) according to its probability distribution, and use this value to randomly generate a value of \(Y\) according to the conditional distribution of \(Y\) for the given \(X\), then the observations resulting from this procedure follow the marginal distribution of \(Y\). Continuing the previous example, assume that \(X\) follows a binomial distribution with parameters \(n = 5\) and \(p = 0.5\). Then the described simulation procedure estimates the following shape for the probability density function of \(\prob(Y)\), the marginal distribution of \(Y\):
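A small simulation sketch of this procedure (not from the original text), using the example’s distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw X ~ Binomial(5, 0.5), then Y | X = x ~ F(e^x, 2^x);
# the resulting y values follow the marginal distribution of Y.
x = rng.binomial(n=5, p=0.5, size=100_000)
y = rng.f(dfnum=np.exp(x), dfden=2.0 ** x)

# A density histogram of y, e.g. np.histogram(y, bins=200, density=True),
# then approximates the marginal density of Y shown in the figure.
```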
Finally, (Feller, 1966, p. 159) introduces the notion of conditional expectation. By the above, for given a value \(x\) we have that
\[q(x, B) = \prob(Y \in B \vert X = x), \quad\forall B\in\mathcal{B}\](here \(\mathcal{B}\) denotes the Borel \(\sigma\)-algebra on \(\R\)), and therefore, a conditional probability distribution can be viewed as a family of ordinary probability distributions (represented by \(q\) for different \(x\)s). Thus, as (Feller, 1966, p. 159) points out, if \(q\) is given then the conditional expectation “introduces a new notation rather than a new concept.”
A conditional expectation \(E(Y \vert X)\) is a function of \(X\) assuming at \(x\) the value
\[\E(Y \vert X = x) = \int_{-\infty}^{\infty} y q(x, dy)\]provided the integral converges.
Note that, because \(\E(Y \vert X)\) is a function of \(X\), it is a random variable, whose value at an individual point \(x\) is given by the above definition. Moreover, from the above definitions of conditional probability and conditional expectation it follows that
\[\E(Y) = \E(\E(Y \vert X)).\]We continue with the last example. From the properties of the F-distribution we know that under this example’s assumptions on the conditional distribution, it holds that
\[\E(Y \vert X = x) = \begin{cases} \frac{d_2}{d_2 - 2} = \frac{2^x}{2^x - 2}, \quad x > 1,\\ \infty, \quad x \leq 1. \end{cases}\]A rather boring strictly decreasing function of \(x\) converging to \(1\) as \(x\to\infty\).
Thus, under the example’s assumption on the distribution of \(X\), the conditional expectation \(\E(Y \vert X)\) is a discrete random variable, which has non-zero probability mass at the values \(2, 4/3, 8/7, 16/15,\) and \(\infty\).
An alternative approach is to define the conditional expectation first, and then to define conditional probability as the conditional expectation of the indicator function. This approach seems less intuitive to me. However, it is more flexible and more general, as we see below.
Let \(X\) and \(Y\) be two real-valued random variables, and let \(\mathcal{B}\) denote the Borel \(\sigma\)-algebra on \(\R\). Recall that \(X\) and \(Y\) can be represented as mappings \(X: \Omega \to \R\) and \(Y: \Omega \to \R\) over some measure space \((\Omega, \mathcal{A}, \prob)\). We can define \(\mathrm{E}(Y \vert X=x)\), the conditional expectation of \(Y\) given \(X=x\), as follows.
A \(\mathcal{B}\)-measurable function \(g(x)\) is the conditional expectation of \(Y\) for given \(x\), i.e.,
\[\mathrm{E}(Y \vert X=x) = g(x),\]if for all sets \(B\in\mathcal{B}\) it holds that
\[\int_{X^{-1}(B)} Y(\omega) d\prob(\omega) = \int_{B} g(x) d\prob^X(x),\]where \(\prob^X\) is the marginal probability distribution of \(X\).
If \(X\) and \(Y\) are real-valued one-dimensional, then the pair \((X,Y)\) can be viewed as a random vector in the plane. Each set \(\{X \in A\}\) consists of parallels to the \(y\)-axis, and we can define a \(\sigma\)-algebra induced by \(X\) as the collection of all sets \(\{X \in A\}\) on the plane, where \(A\) is a Borel set on the line. The collection of all such sets forms a \(\sigma\)-algebra \(\mathcal{A}\) on the plane, which is contained in the \(\sigma\)-algebra of all Borel sets in \(\R^2\). \(\mathcal{A}\) is called the \(\sigma\)-algebra generated by the random variable \(X\).
Then \(\mathrm{E}(Y \vert X)\) can be equivalently defined as a random variable such that
\[\mathrm{E}(Y\cdot I_{A}) = \mathrm{E}(\mathrm{E}(Y \vert X) \cdot I_{A}), \quad \forall A\in\mathcal{A},\]where \(I_{A}\) denotes the indicator function of the set \(A\).
The last paragraph illustrates that one could generalize the definition of the conditional expectation of \(Y\) given \(X\) to the conditional expectation of \(Y\) given an arbitrary \(\sigma\)-algebra \(\mathcal{B}\) (not necessarily the \(\sigma\)-algebra generated by \(X\)). This leads to the following general definition, which is stated in (Feller, 1966, pp. 160-161) in a slightly different notation.
Let \(Y\) be a random variable, and let \(\mathcal{B}\) be a \(\sigma\)-algebra of sets.
A random variable \(U\) is called a conditional expectation of \(Y\) relative to \(\mathcal{B}\), or \(U = \E(Y \vert \mathcal{B})\), if it is \(\mathcal{B}\)-measurable and
\[\E(Y\cdot I_{B}) = \E(U \cdot I_{B}), \quad \forall B\in\mathcal{B}.\]If \(\mathcal{B}\) is the \(\sigma\)-algebra generated by a random variable \(X\), then \(\E(Y \vert X) = \E(Y \vert \mathcal{B})\).
Let \(I_{\{Y \in A\}}\) be a random variable that is equal to one if and only if \(Y\in A\). The conditional probability of \(\{Y \in A\}\) given \(X = x\) can be defined in terms of a conditional expectation as
\[\prob(Y \in A \vert X = x) = \E(I_{\{Y \in A\}} \vert X = x).\]Under certain regularity conditions the above defines the conditional probability distribution of \((Y \vert X)\).
There are a number of very helpful tutorials online on how to set up an HTTPS static site using AWS S3 and CloudFront. Of course, as is always the case with blog articles, they may be outdated, incomplete, and generally not as trustworthy as the official AWS documentation on the topic, which is pretty good too; but it is also somewhat fragmented and inconvenient to follow. So I wrote my own summary to refer to in the future.
Relevant AWS docs: How to create a static website on AWS S3; How to use a custom domain with AWS S3; Setting up Amazon CloudFront; SSL certificate instructions.
First, we set up a static HTTP site without a custom domain on AWS S3:
- Create an S3 bucket named example.com (obviously replace example.com with your own domain).
- Upload an index.html to the bucket (you can keep all options for the upload at their default values). Then go to http://example.com.s3-website-us-east-1.amazonaws.com/ (where you need to replace example.com with the bucket name, and us-east-1 with your bucket’s region), and see if the contents of index.html show up.

Yay :laughing: we have a working website!! …without a custom domain or https yet :sweat_smile:
The www subdomain: Now prepare another S3 bucket for the subdomain “www.example.com” to be later redirected to the root domain “example.com” (btw, if you so wish, www.example.com
can be the main S3 bucket and the example.com
bucket can be configured to redirect — just swap their roles in this entire writeup):
- Create an S3 bucket named www.example.com (all options can be left at their defaults; this bucket doesn’t need to be publicly readable).
- Configure the bucket www.example.com to redirect all requests to example.com following Step 2.3 from the AWS docs at https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html.
- Check that the redirect works by going to http://www.example.com.s3-website-us-east-1.amazonaws.com/ (as before, replace the bucket name and region accordingly).

Map the domain and subdomain to their S3 buckets:
Amazon Route 53 is the AWS service that maintains the mapping between your domain’s alias records and the S3 bucket (website endpoint). You need to follow Step 3 from the AWS docs at https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html.
Configuration with your domain name registrar:
:smile: So we have a website with a custom domain!! …though without CloudFront (so loading may be rather slow) and without HTTPS.
Now it seems a good idea to create a new user that has full read-write permission to the `example.com` bucket and full permission to CloudFront, but does not have any further AWS permissions.
A suitable IAM policy document can be found at: https://github.com/laurilehmijoki/s3_website/blob/master/additional-docs/setting-up-aws-credentials.md
Make sure to save the new user’s access key ID and secret access key somewhere in a private place.
Well, I typically use Jekyll to make my static sites (because it’s awesome!).
The Ruby gem `s3_website` can be used to push the website to, or to synchronize it with, the S3 bucket.
The `s3_website` documentation is easy to follow.
I have found it convenient to use the `dotenv` gem to keep the access key ID and the secret access key of the user (that was just created) locally in a `.env` file (don’t commit/push it to github!!!).
At this point you may also choose to allow `s3_website` to set up CloudFront for the website, to save some time later (though without the SSL certificate, which will still have to be added manually; see below).
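For concreteness, here is a rough sketch of what the two files might look like (the `s3_id`, `s3_secret`, `s3_bucket`, and `cloudfront_distribution_id` keys and the ERB-style environment lookups follow the `s3_website` documentation; all values are placeholders, and the environment variable names are just my own choice):
# .env -- keep this file out of version control!
S3_ID=<your-access-key-id>
S3_SECRET=<your-secret-access-key>

# s3_website.yml -- reads the credentials from the environment
s3_id: <%= ENV['S3_ID'] %>
s3_secret: <%= ENV['S3_SECRET'] %>
s3_bucket: example.com
# added later, once the CloudFront distribution exists (see below)
cloudfront_distribution_id: <your-distribution-id>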
We need an SSL certificate to enable HTTPS for the custom domain when it is accessed through CloudFront.
Follow the AWS docs at https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html to request a public certificate for your domain. Some important points:
- Add both `example.com` and `*.example.com` to the certificate.

I encountered one caveat in this process:
To use an ACM Certificate with CloudFront, you must request or import the certificate in the US East (N. Virginia) region.
(from http://docs.aws.amazon.com/acm/latest/userguide/acm-services.html); i.e., change region to US East N. Virginia if needed (top right corner within the AWS interface).
Follow these AWS docs to create a CloudFront distribution: https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-cloudfront-walkthrough.html; unless a CloudFront distribution was already created by `s3_website` (see one of the previous optional steps), in which case it merely needs to be edited (add the SSL certificate to it, and update “Alternate Domain Names” with `yourdomain.com` and `www.yourdomain.com` if necessary).
Note the assigned CloudFront distribution domain, which should look similar to `vtrlj8ubh2k69.cloudfront.net`. Once everything is set up, the website should appear under that domain.
A few points I found noteworthy:
- Set the “Origin” of the CloudFront distribution to the S3 website endpoint `example.com.s3-website.us-east-1.amazonaws.com`, and not the bucket name `example.com.s3.amazonaws.com` (btw `s3_website` does this correctly). Note that when setting up, the drop-down menu offers only the bucket name to be picked rather than the correct endpoint; so don’t use the drop-down menu; type it in yourself.1

Update A records in Route 53, and update the `s3_website` configs:
- In Route 53, point the A record aliases for `example.com` and `www.example.com` to the CloudFront distribution domain (something like `vtrlj8ubh2k69.cloudfront.net`).
- In `s3_website`, check or set the `cloudfront_distribution_id` property in `s3_website.yml` to the correct distribution ID (something like `SY9Q4DHIOUG7A`).

That’s it — the site should now be accessible under https://example.com and https://www.example.com. :tada: :tada: :tada:
It is not exactly clear to me what difference it makes to set the “Origin” to `example.com.s3.amazonaws.com` vs `example.com.s3-website.us-east-1.amazonaws.com`. However, it solved one of my issues. At first I set the “Origin” value to the bucket name, similar to `example.com.s3.amazonaws.com`, since that is what was offered by the drop-down menu in CloudFront. The landing page of the website was working just fine under the custom domain. However, when I navigated to subdirectories of my domain, similar to `example.com/about/`, the server did not seem to understand that it needed to look for the `index.html` within the `about` directory, and produced an error. Once I edited the “Origin” record to the S3 website endpoint, similar to `example.com.s3-website.us-east-1.amazonaws.com`, all pages of the website started to display perfectly fine. ↩
In this post I want to share my presentation slides (see below), some other resources, and some thoughts, in case any of that can be helpful to other deep learning beginners.1
Neural networks (NNs) and deep learning (DL, also deep NNs, or DNNs) are not my research area, but currently it is one of my main side-interests. (D)NNs are truly fascinating to somebody with substantial experience in statistics or the more conventional machine learning (like myself). Initially it seems counterintuitive how these extremely overparametrized models are even supposed to work, but then you fit those models, and their performance is so good that it seems to border on magic. :crystal_ball:
These HTML slides were created with the excellent `reveal.js`.
I was able to give the above presentations because I did a good amount of self-study on NNs and DL in my free time. Here are some of the resources that I have used, and that I highly recommend:
These resources have worked very well for me. My background is mostly academic, and includes experience in statistical modeling, (non-deep) machine learning, an all-but-dissertation status in a math PhD program, and some domain knowledge in medical imaging. While some of that background helps, none of it is really essential. Though some math is definitely needed, it does not need to be at a PhD level. Medical or biological knowledge helps only if those are the DL applications you seek out (which I do). Understanding some basic machine learning and data science practices certainly helps, but the relevant material is covered in all DL courses that I have tried. What helps immensely in any case, however, is proficiency with git, GitHub, and Linux, as well as general programming and data processing skills.
I hope that still being close to the beginning of my DL journey makes me in some way more helpful to the absolute beginner (which I too was just a few months ago)… Maybe right now I have some perspective that may get lost should I become a DL expert… ↩
AUC is the area under the ROC curve, where ROC stands for receiver operating characteristic; that is, AUC is simply the area between the ROC curve and the x-axis. So, to understand AUC we first need to look at the concept of an ROC curve.
Consider:
Then:
Here is some R code for clarification:
# load some data, fit a logistic regression classifier
data(iris)
versicolor_virginica <- iris[iris$Species != "setosa", ]
# drop the unused "setosa" factor level, so that glm() treats the response
# as binary with versicolor as failure (0) and virginica as success (1)
versicolor_virginica$Species <- droplevels(versicolor_virginica$Species)
logistic_reg_fit <- glm(Species ~ Sepal.Width + Sepal.Length,
                        data = versicolor_virginica,
                        family = "binomial")
y <- ifelse(versicolor_virginica$Species == "versicolor", 0, 1)
y_pred <- logistic_reg_fit$fitted.values
# get TPR and FPR at different values of the decision threshold
threshold <- seq(0, 1, length = 100)
FPR <- sapply(threshold,
function(thresh) {
sum(y_pred >= thresh & y != 1) / sum(y != 1)
})
TPR <- sapply(threshold,
function(thresh) {
sum(y_pred >= thresh & y == 1) / sum(y == 1)
})
# plot an ROC curve
plot(FPR, TPR)
lines(FPR, TPR)
A rather ugly ROC curve emerges:
The area under the ROC curve, or AUC, seems like a nice heuristic to evaluate and compare the overall performance of classification models independent of the exact decision threshold chosen. \(\mathrm{AUC} = 1.0\) signifies perfect classification accuracy, while \(\mathrm{AUC} = 0.5\) corresponds to making classification decisions via coin toss (or rather a continuous coin that outputs values in \([0,1]\)…). Most classification algorithms will result in an AUC somewhere between these two values. But there’s more to it.
As above, assume that we are looking at a dataset where we want to distinguish data points of type 0 from those of type 1. Consider a classification algorithm that assigns to a random observation \(\mathbf{x}\in\mathbb{R}^p\) a score (or probability) \(\hat{p}(\mathbf{x}) \in [0,1]\) signifying membership in class 1. If the final classification between class 1 and class 0 is determined by a decision threshold \(t\in[0, 1]\), then the true positive rate (a.k.a. sensitivity or recall) can be written as a conditional probability
\[T(t) := P[\hat{p}(\mathbf{x}) > t \,|\, \mathbf{x}\,\text{belongs to class 1}],\]and the false positive rate (or 1 - specificity) can be written as
\[F(t) := P[\hat{p}(\mathbf{x}) > t \,|\, \mathbf{x}\,\text{does not belong to class 1}].\]For brevity of notation let’s say \(y(\mathbf{x}) = 1\) instead of “\(\mathbf{x}\) belongs to class 1”, and \(y(\mathbf{x})=0\) instead of “\(\mathbf{x}\) doesn’t belong to class 1”.
The ROC curve simply plots \(T(t)\) against \(F(t)\) while varying \(t\) from 0 to 1. Thus, if we view \(T\) as a function of \(F\), the AUC can be rewritten as follows.
\[\begin{eqnarray} \mathrm{AUC} &=& \int_0^1 T(F_0) \,\mathrm{d}F_0 \nonumber \\ &=& \int_0^1 P[\hat{p}(\mathbf{x}) > F^{-1}(F_0) \,|\, y(\mathbf{x}) = 1] \,\mathrm{d}F_0 \nonumber \\ &=& \int_1^0 P[\hat{p}(\mathbf{x}) > F^{-1}(F(t)) \,|\, y(\mathbf{x}) = 1] \cdot \frac{\partial F(t)}{\partial t} \,\mathrm{d}t \nonumber \\ &=& \int_0^1 P[\hat{p}(\mathbf{x}) > t \,|\, y(\mathbf{x}) = 1] \cdot P[\hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x^{\prime}}) = 0] \,\mathrm{d}t \nonumber \\ &=& \int_0^1 P[\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}}) \,\&\, \hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x}) = 1 \,\&\, y(\mathbf{x^{\prime}}) = 0] \,\mathrm{d}t \nonumber \\ &=& P[\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}}) \,|\, y(\mathbf{x}) = 1 \,\&\, y(\mathbf{x^{\prime}}) = 0], \nonumber \end{eqnarray}\]where we used the fact that the probability density function
\[P[\hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x^{\prime}}) = 0] =: f(t)\]is the derivative with respect to \(t\) of the cumulative distribution function
\[P[\hat{p}(\mathbf{x^{\prime}}) \leq t \,|\, y(\mathbf{x^{\prime}}) = 0] = 1-F(t).\]So, given a randomly chosen observation \(\mathbf{x}\) belonging to class 1, and a randomly chosen observation \(\mathbf{x^{\prime}}\) belonging to class 0, the AUC is the probability that the evaluated classification algorithm will assign a higher score to \(\mathbf{x}\) than to \(\mathbf{x^{\prime}}\), i.e., the conditional probability of \(\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}})\).
An alternative purely geometric proof can be found in the Scatterplot Smoothers blog.
In other words, if the classification algorithm distinguishes “positive” and “negative” examples (e.g., disease status), then
AUC is the probability of correct ranking of a random “positive”-“negative” pair.
The above probabilistic interpretation suggests a simple formula to compute AUC on a finite sample:
Among all “positive”-“negative” pairs in the dataset compute the proportion of those which are ranked correctly by the evaluated classification algorithm.
Here is an inefficient implementation using results from the above logistic regression example:
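# count correctly ranked "positive"-"negative" pairs, giving half credit to ties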
s <- 0
for (i in which(y == 1)) {
for (j in which(y == 0)) {
if (y_pred[i] > y_pred[j]) {
s <- s + 1
} else if (y_pred[i] == y_pred[j]) {
s <- s + 0.5
}
}
}
s <- s / (sum(y == 1) * sum(y == 0))
s
# [1] 0.7918
The proportion of correctly ranked “positive”-“negative” pairs yields estimated \(\mathrm{AUC} = 0.7918\).
We can compare this value to the area under the ROC curve computed with the trapezoidal rule.
s <- 0
for (i in 1:(length(FPR) - 1)) {
dFPR <- abs(FPR[i+1] - FPR[i])
s <- s + 0.5 * dFPR * (TPR[i+1] + TPR[i])
}
s
# [1] 0.7922
The trapezoidal rule yields an estimated \(\mathrm{AUC} = 0.7922\). The difference of \(0.0004\) can be explained by the fact that we evaluated the ROC curve at only 100 threshold values.
Since there is a minor disagreement, let’s use some standard R package to compute AUC.
library(ROCR)
pred <- prediction(y_pred, y)
auc <- as.numeric(performance(pred, measure = "auc")@y.values)
auc
# [1] 0.7918
Same as the proportion of correctly ranked pairs! :grin:
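As a side note, the small trapezoidal-rule gap above is indeed just a discretization artifact. Here is a quick, unpolished check (my own addition, reusing `y`, `y_pred`, and the same FPR/TPR definitions as above): if we place a threshold at every observed score, so that no vertex of the empirical ROC curve is missed, the trapezoidal rule should essentially reproduce the pair-counting estimate.
# evaluate the ROC curve at a threshold equal to every observed score
# (plus the endpoints 0 and 1), so that no step of the curve is skipped
threshold_fine <- c(0, sort(unique(y_pred)), 1)
FPR_fine <- sapply(threshold_fine,
                   function(thresh) sum(y_pred >= thresh & y != 1) / sum(y != 1))
TPR_fine <- sapply(threshold_fine,
                   function(thresh) sum(y_pred >= thresh & y == 1) / sum(y == 1))
# trapezoidal rule over the refined set of points
s <- 0
for (i in 1:(length(FPR_fine) - 1)) {
  s <- s + 0.5 * abs(FPR_fine[i+1] - FPR_fine[i]) * (TPR_fine[i+1] + TPR_fine[i])
}
s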
By analysing the probabilistic meaning of AUC, we not only got a practically relevant interpretation of this classification performance metric, but we also obtained a simple formula to estimate the AUC of a trained classification algorithm. Well, it turns out that the proportion of correctly ranked “positive”-“negative” pairs, used as an estimate of the AUC, is equivalent to the Wilcoxon-Mann-Whitney test statistic (normalized by the number of pairs). This fact can also be easily demonstrated in a couple of lines of R code.
y_is_1 <- which(y == 1)
y_is_0 <- which(y == 0)
n_pairs <- length(y_is_1) * length(y_is_0)
WMW_test <- wilcox.test(y_pred[y_is_1], y_pred[y_is_0])
WMW_test$statistic / n_pairs
# W
# 0.7918
Same answer!
Besides, and arguably more importantly, as a measure of classification performance the AUC has many advantages compared to other “single number” performance measures:
I started this exploration by downloading the full text data (excluding images) for all patents that were assigned by the USPTO within the year 2017 up to the time of writing (Patent Grant Full Text Data/XML for the year 2017 through the week of Sept 12 from the USPTO Bulk Data Storage System).
In this blog post I address questions such as: How many ML and AI related patents were granted? Who are the most prolific inventors? The most frequent patent assignees? Where are inventions made? And when? Is the number of ML and AI related patents increasing over time? How long does it take to obtain a patent for a ML or AI related invention? Is the patent examination time shorter for big tech companies? Etc.
First, I curated a patent full text dataset consisting of “machine learning and AI related” patents. I am not just looking for instances where actual machine learning or AI algorithms were patented; I am looking for inventions which are related to ML or AI in any/some capacity. That is, I am interested in patents where machine learning, data mining, predictive modeling, or AI is utilized as a part of the invention in any way whatsoever. The subset of relevant patents was determined by a keyword search as specified by the following definition.
Definition: For the purposes of this blog post, a machine learning or AI related patent is a patent that contains at least one of the keywords “machine learning”, “deep learning”, “neural network”, “artificial intelligence”, “statistical learning”, “data mining”, or “predictive model” in its invention title, description, or claims text (while of course accounting for capitalization, pluralization, etc.).1
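To illustrate the kind of keyword matching I mean, here is a minimal, purely hypothetical R sketch (the `patent_texts` vector and its toy entries are made up for illustration; the actual analysis ran over the parsed USPTO full text XML):
# toy stand-in for the corpus: one string per patent, concatenating the
# invention title, description, and claims text
patent_texts <- c(
  "A system employing a convolutional neural network for image recognition.",
  "A mechanical fastening device comprising a bolt and a matching nut.",
  "Predictive models are trained via machine learning on sensor data."
)
keywords <- c("machine learning", "deep learning", "neural network",
              "artificial intelligence", "statistical learning",
              "data mining", "predictive model")
# case-insensitive matching; e.g. "predictive model" also covers the plural
match_matrix <- sapply(keywords,
                       function(kw) grepl(kw, patent_texts, ignore.case = TRUE))
# a patent counts as ML/AI related if at least one keyword matches
is_ml_ai_related <- rowSums(match_matrix) > 0
sum(is_ml_ai_related)  # number of selected patents (2 in this toy example)
colSums(match_matrix)  # number of patents matched by each keyword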
With this keyword matching approach a total of 6665 patents were selected. The bar graph below shows how many times each keyword was matched.
Interestingly the term “neural network” is even more common than the more general terms “machine learning” and “artificial intelligence”.
Here are three (randomly chosen) patents from the resulting dataset. For each, the invention title, the patent assignee, and one instance of the keyword match within the patent text are printed.
And here are three examples of (randomly picked) patents that contain the relevant keywords directly in their invention title.
The first question I would like to address is who files most of the machine learning and AI related patents.
Each patent specifies one or several inventors, i.e., the individuals who made the patented invention, and a patent assignee which is typically the inventors’ employer company that holds the rights to the patent. The following bar graph visualizes the top 20 most prolific inventors and the top 20 most frequent patent assignees among the analyzed ML and AI related patents.
It isn’t surprising to see this list of companies. The likes of IBM, Google, Amazon, Microsoft, Samsung, and AT&T rule the machine learning and AI patent space. I have to admit that I don’t recognize any of the inventors’ names (but it might just be me not being familiar enough with the ML and AI community).
There are a number of interesting follow-up questions which for now I leave unanswered (hard to answer without additional data):
Even though the examined patents were filed in the US, some of the inventions may have been made outside of the US. In fact, the data includes specific geographic locations for each patent, so I can map in which cities within the US and the world inventors are most active. The following figure is based on where the inventors are from, and shows the most active spots. Each point corresponds to the total number of inventions made at that location (though note that the color axis is a log10 scale, and so is the point size).
The results aren’t that surprising. However, we see that most (ML and AI related) inventions patented with the USPTO were made in the US. I wonder if inventors in other countries prefer to file patents in their home countries’ patent offices rather than in the US.
Alternatively, we can also map the number of patents per inventors’ origin countries.
Sadly, entire groups of countries (e.g., almost the entire African continent) seem to be excluded from the USPTO’s patent game, at least with respect to machine learning and AI related inventions. Whether this is due to a lack of access, infrastructure, education, political treaties, or something else is an intriguing question.
Each patent has a date of filing and an assignment date attached to it. Based on the provided dates one can try to address questions such as: When were these patents filed? Is the number of ML and AI related patents increasing over time? How long did it usually take from patent filing to assignment? And so on.
For the set of ML and AI related patents that were granted between Jan 3 and Sept 12 2017 the following figure depicts…
The patent publication dates plot suggests that the number of ML and AI related patents increased slightly throughout the year 2017. The patent application dates plot suggests that the patent examination phase for the considered patents takes about 2.5 years. In fact the average time from patent filing to approval is 2.83 years with a standard deviation of 1.72 years in this dataset (that is, among the considered ML and AI related patents granted in 2017). However, the range is quite wide, spanning 0.24-12.57 years.
The distribution of the duration from patent filing date to approval is depicted by the following figure.
So, what are some of the inventions that took longest to get approved? Here are the five patents with the longest examination periods:
Each of these patents is related to either gaming or biotech. I wonder if that’s a coincidence…
We can also look at the five patents with the shortest approval time:
Interestingly, the patent approved in the shortest amount of time among all 6665 analysed (ML and AI related) patents is some smart home thingy from Andy Rubin’s hyped up company Essential.
The following figure separates the patent approval times according to the respective assignee company, considering several of the most well known tech giants.
Indeed some big tech companies, such as AT&T or Samsung, manage to push their patent applications through the USPTO process much faster than most other companies. However, there are other tech giants, such as Microsoft, which on average take longer to get their patent applications approved than even the companies in the category “Other”. Also noteworthy is the fact that big tech companies tend to have fewer outliers regarding the patent examination process duration than companies in the category “Other”.
Of course it would also be interesting to categorize all patent assignees into categories like “Startup”, “Big Tech”, “University”, or “Government”, and compare the typical duration of the patent examination process between such groups. However, it’s not clear to me how to establish such categories without collecting additional data on each patent assignee, which at this point I don’t have time for :stuck_out_tongue:.
There is definitely a lot of promise in the USPTO full text patent data. Here I have barely scratched the surface, and I hope that I will find the time to play around with these data some more. The end goal is, of course, to replace the patent examiner with an AI trained on historical patent data. :stuck_out_tongue_closed_eyes:
This work (blog post and included figures) is licensed under a Creative Commons Attribution 4.0 International License.
There are two main aspects to my reasoning as to this particular choice of keywords. (1) I wanted to keep the list relatively short in order to have a more specific search, and (2) I tried to avoid keywords which may generate false positives (e.g., the term “AI” would match all sorts of codes present in the patent text, such as “123456789 AI N1”). In no way am I claiming that this is a perfect list of keywords to identify ML and AI related patents, but I think that it’s definitely a good start. ↩
The 1983 paper presents a simulation with only 10 repetitions. But in the present day it is very easy (both in terms of computational time and implementation difficulty) to reproduce the simulation with many more repetitions (even my phone’s computational power is probably higher than that of the high performance computer that Freedman used in the 80’s). We also have more convenient ways to visualize the results than in the 80’s. So let’s do it.
I am going to use a few R packages (most notably the package `broom` to fit and analyze many, many linear models in a single step).
library(dplyr)
library(broom)
library(ggplot2)
library(tidyr)
set.seed(20170605)
The considered data structure is the following:
Instead of Freedman’s 10 repetitions we perform 1000. So let’s generate all 1000 datasets at once as stacked in a large data frame:
n_row <- 100
# n_col is set to 51 because the 51st column will serve as y
n_col <- 51
n_rep <- 1000
# a stack of matrices for all n_rep repetitions is generated...
X <- matrix(rnorm(n_rep * n_row * n_col), n_rep * n_row, n_col)
colnames(X) <- paste0("X", 1:n_col)
# ...and then transformed to a data frame with a repetition number column
X_df <- as_data_frame(X) %>% mutate(repetition = rep(1:n_rep, each = n_row))
The data are analyzed with two successive linear models, the second (illegally) reusing the results of the first.
The first model fit. After the 1000 ordinary linear models are fit to the data, we record for each of them the R squared, the F test statistic with corresponding p-value, and the t test statistics with p-values for the individual regression coefficients.
Using functions from the `broom` package we can fit and extract information from all 1000 models at once.
# all models can be fit at once...
models_df = X_df %>% group_by(repetition) %>%
do(full_model = lm(X51 ~ . + 0, data = select(., -repetition)))
# ...then the results are extracted
model_coefs <- tidy(models_df, full_model)
model_statistics <- glance(models_df, full_model)
model_statistics$data_reuse <- rep(FALSE, nrow(model_statistics))
The second model fit. For each one of the first 1000 models, the corresponding second linear model is fit using only those variables which have p-values significant at the 25% level in the first model. That is, the second model uses the first model for variable selection.
This gives us 1000 reduced re-fitted linear models. We record the same model statistics (R squared, F, and t tests) as for the first group of models.
reduced_models <- list()
for (i in 1:n_rep) {
full_data <- X_df %>% filter(repetition == i)
significant_coefs <- model_coefs %>%
filter(repetition == i & p.value < 0.25)
reduced_data <- select(full_data,
one_of(unlist(significant_coefs[ , "term"])), X51)
reduced_models[[i]] <- lm(X51 ~ . + 0, data = reduced_data)
tmp_df <- glance(reduced_models[[i]])
tmp_df$repetition <- i
tmp_df$data_reuse <- TRUE
model_statistics <- bind_rows(model_statistics, tmp_df)
}
Finally let’s look at the results. The figure shows the distributions of the considered model statistics across the 1000 repetitions for model fits with and without data reuse (the code producing this figure is given at the bottom of this post):
Well, the R squared statistic shows a moderate change between the models with and without data reuse (an average of 0.3093018 with data reuse vs. 0.5001641 without). The F test statistic, however, grows immensely to an average of 3.2806118 (from 1.0480097), and the p-values fall after data reuse to an average of 0.0112216 (from 0.5017696), below the widely used (but arbitrary) 5% significance level.
Obviously the model with data reuse is highly misleading here, because in fact there are absolutely no relationships between the predictor variables and the response (as per the data generation procedure).
In fact, Freedman derived asymptotic estimates for the magnitudes of change in the considered model statistics, and they indeed match the above observations. However I’m too lazy to summarize them here. So I refer to the primary source.
This code generates the above figure:
model_statistics %>%
select(r.squared, p.value, statistic, repetition, data_reuse) %>%
mutate(data_reuse = ifelse(data_reuse, "With Data Reuse", "Without Data Reuse")) %>%
mutate(data_reuse = factor(data_reuse, levels = c("Without Data Reuse", "With Data Reuse"),
ordered = TRUE)) %>%
rename("F-statistic" = statistic, "p-value" = p.value, "R squared" = r.squared) %>%
gather(stat, value, -repetition, -data_reuse) %>%
ggplot(aes(x = stat, y = value)) +
geom_violin(aes(fill = stat), scale = "width", draw_quantiles = c(0.25, 0.5, 0.75)) +
geom_hline(yintercept = 0.05, linetype = 2, size = 0.3) +
facet_wrap(~data_reuse) +
theme_linedraw() +
scale_y_continuous(breaks = c(0.05, 2, 4, 6)) +
ggtitle(paste(n_rep, "repetitions of an LM fit with", n_row, "rows,", n_col, "columns"))
A quick online search revealed at least three R packages for benchmarking R code (`rbenchmark`, `microbenchmark`, and `tictoc`). Additionally, base R provides at least two methods to measure the running time of R code (`Sys.time` and `system.time`). In the following I briefly go through the syntax of using each of the five options, and present my conclusions at the end.
The run time of a chunk of code can be measured by taking the difference between the time at the start and at the end of the code chunk. Simple yet flexible :sunglasses:.
sleep_for_a_minute <- function() { Sys.sleep(60) }
start_time <- Sys.time()
sleep_for_a_minute()
end_time <- Sys.time()
end_time - start_time
# Time difference of 1.000327 mins
The functions `tic` and `toc` are used in the same manner for benchmarking as the just demonstrated `Sys.time`. However, `tictoc` adds a lot more convenience to the whole procedure.
The most recent development version of `tictoc` can be installed from github:
devtools::install_github("jabiru/tictoc")
One can time a single code chunk:
library(tictoc)
tic("sleeping")
print("falling asleep...")
sleep_for_a_minute()
print("...waking up")
toc()
# [1] "falling asleep..."
# [1] "...waking up"
# sleeping: 60.026 sec elapsed
Or nest multiple timers:
tic("total")
tic("data generation")
X <- matrix(rnorm(50000*1000), 50000, 1000)
b <- sample(1:1000, 1000)
y <- runif(1) + X %*% b + rnorm(50000)
toc()
tic("model fitting")
model <- lm(y ~ X)
toc()
toc()
# data generation: 3.792 sec elapsed
# model fitting: 39.278 sec elapsed
# total: 43.071 sec elapsed
One can time the evaluation of an R expression using `system.time`. For example, we can use it to measure the execution time of the function `sleep_for_a_minute` (defined above) as follows.
system.time({ sleep_for_a_minute() })
# user system elapsed
# 0.004 0.000 60.051
But what exactly are the reported times `user`, `system`, and `elapsed`? :confused:
Well, clearly `elapsed` is the wall clock time taken to execute the function `sleep_for_a_minute`, plus some benchmarking code wrapping it (that’s why it took slightly more than a minute to run, I guess).
As for the `user` and `system` times, William Dunlap has posted a great explanation to the r-help mailing list:
“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.
:grinning:
The documentation of the function `benchmark` from the `rbenchmark` R package describes it as “a simple wrapper around `system.time`”. However, it adds a lot of convenience compared to bare `system.time` calls. For example, it requires just one `benchmark` call to time multiple replications of multiple expressions, and the returned results are conveniently organized in a data frame.
I installed the development version of the `rbenchmark` package from github:
devtools::install_github("eddelbuettel/rbenchmark")
For example purposes, let’s compare the time required to compute linear regression coefficients using three alternative computational procedures: (1) the built-in `lm`, (2) the matrix pseudoinverse, and (3) directly solving the normal equations as a linear system.
library(rbenchmark)
benchmark("lm" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- lm(y ~ X + 0)$coef
},
"pseudoinverse" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- solve(t(X) %*% X) %*% t(X) %*% y
},
"linear system" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- solve(t(X) %*% X, t(X) %*% y)
},
replications = 1000,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
# test replications elapsed relative user.self sys.self
# 3 linear system 1000 0.167 1.000 0.208 0.240
# 1 lm 1000 0.930 5.569 0.952 0.212
# 2 pseudoinverse 1000 0.240 1.437 0.332 0.612
Here, the meaning of `elapsed`, `user.self`, and `sys.self` is the same as described above in the section about `system.time`, and `relative` is simply the time ratio relative to the fastest test. Interestingly, `lm` is by far the slowest here.
The most recent development version of `microbenchmark` can be installed from github:
devtools::install_github("olafmersmann/microbenchmarkCore")
devtools::install_github("olafmersmann/microbenchmark")
Much like `benchmark` from the package `rbenchmark`, the function `microbenchmark` can be used to compare running times of multiple R code chunks. But it offers a great deal of convenience and additional functionality.
I find that one particularly nice feature of `microbenchmark` is the ability to automatically check the results of the benchmarked expressions with a user-specified function. This is demonstrated below, where we again compare three methods of computing the coefficient vector of a linear model.
library(microbenchmark)
set.seed(2017)
n <- 10000
p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- X %*% rnorm(p) + rnorm(n)  # one standard normal noise term per observation
check_for_equal_coefs <- function(values) {
tol <- 1e-12
max_error <- max(c(abs(values[[1]] - values[[2]]),
abs(values[[2]] - values[[3]]),
abs(values[[1]] - values[[3]])))
max_error < tol
}
mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef },
"pseudoinverse" = {
b <- solve(t(X) %*% X) %*% t(X) %*% y
},
"linear system" = {
b <- solve(t(X) %*% X, t(X) %*% y)
},
check = check_for_equal_coefs)
mbm
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lm 96.12717 124.43298 150.72674 135.12729 188.32154 236.4910 100 c
# pseudoinverse 26.61816 28.81151 53.32246 30.69587 80.61303 145.0489 100 b
# linear system 16.70331 18.58778 35.14599 19.48467 22.69537 138.6660 100 a
We used the function argument `check` to check for equality (up to a maximal error of `1e-12`) of the results returned by the three methods. If the results weren’t equal, `microbenchmark` would return an error message.
Another great feature is the integration with `ggplot2` for plotting `microbenchmark` results.
library(ggplot2)
autoplot(mbm)
The given demonstration of the different benchmarking functions is surely not exhaustive. Nevertheless, I made some conclusions for my personal benchmarking needs:
- The `Sys.time` approach as well as the `tictoc` package can be used for timing (potentially nested) steps of a complicated algorithm (that’s often my use case). However, `tictoc` is more convenient, and (most importantly) foolproof.
- `microbenchmark` returns other types of measurements than `benchmark`, and I think that in most situations the `microbenchmark` measurements are of a higher practical significance :stuck_out_tongue:.
- `microbenchmark` is the only benchmarking package that has visualizations built in :+1:.

For these reasons I will go with `microbenchmark` and `tictoc`. :bowtie: