Figure caption: Models/prompts and their mean ratings along multiple rating dimensions, assigned by me in a (to the extent possible) blinded fashion. The models/prompts are sorted by “would I use a similar Q-A pair in my deck?”, which for my purposes is the main quality indicator for the generated flashcards.
:sparkles: :sparkles: :sparkles:
Note on the open-source models used: I ran the experiments described below at the end of November 2023 using a single GPU and models available at that time. Given the pace at which LLMs develop, you should probably take the results with a grain of salt. Moreover, the locally run open-source models that I used may not have been the very best even at that time (even considering my hardware constraints).
:sparkles: :sparkles: :sparkles:
Recently in my reading I came across the statistical metric Fleiss’ kappa, which I had seen before but could no longer remember the definition of. This is exactly the type of scenario where I would like to include at least the definition of this statistical assessment measure in my spaced repetition database (Anki) – or, in other words, to “ankify” the concept.
I learned about spaced repetition and Anki about 6 years ago from several blog posts by Michael Nielsen, who specifically also covers the topic of creating flashcards for sophisticated mathematical topics (Nielsen, 2018; Nielsen, 2019). Indeed, I have noticed beneficial effects of my use of Anki as a knowledge worker over the last 5-6 years, as well as for some of my hobbies. I just wish sometimes that I were more consistent and disciplined in my use of Anki. But let’s reserve this discussion for another occasion, since the benefits and challenges of spaced repetition are not the topic of this blog post.
I have also been playing around with LLMs for a little while. But the vast majority of it was using the OpenAI API (mostly via the excellent ShellGPT and sometimes from Python directly), and I was looking for a good excuse to play around with LLMs more and to try out some open-source models that can run locally on my computer, such as those based on Llama 2 (unfortunately the Mixtral models were not released yet at the time). So, it seemed it would be a great idea to use different LLMs to generate a bunch of suggested Anki flashcards based on articles about Fleiss’ kappa, and I based my prompts to the LLMs in part on Michael Nielsen’s articles referenced above (see below for details about my prompting strategies).
As the primary goal of this exercise, I wanted to compare the outputs from different LLMs in a systematic way on this task of personal interest to me, as I had no idea how open-source LLMs running on my local computer would stack up against something like ChatGPT.
For the main part of this blog post, I will go through the models/LLMs and prompts first, and then describe the analysis and the results. So, overall this is what we are doing here:
AI-based flashcard generation
I used GPT-4 Turbo and GPT-3.5 Turbo via the OpenAI API, and two open-source LLMs running on my local computer (after trying several others), in combination with several prompting strategies – in total 10 different LLM-prompt combinations – to generate 100 Anki cards (question-answer pairs).1
LLM/prompt performance analysis “study”
There are the following components to this “study”:
There may be many imperfections in the performance “study” and it could be considered simplistic, but luckily we aren’t looking at a peer-reviewed scientific publication here but rather just a blog post that I’m quickly writing on a Sunday afternoon ([Narrator’s voice]: Well, it actually took much more than that one Sunday afternoon, with several intense writing sessions and extended breaks in between).
An excellent overview on how to use the OpenAI API and how to deploy local LLM models on your own hardware is provided in a Youtube video lecture by Jeremy Howard, and a substantial portion of the code that was used for this blog post originates from that video: A Hackers’ Guide to Large Language Models
In this section, I will go through each model and each prompt that I used, as well as my rationale behind it.
For my first attempt I used the GPT-4 Turbo model (gpt-4-1106-preview
), which had been released shortly before I started these experiments.
(GPT4 in the results figure)

My initial prompt for the task of interest here was:
prompt = "Create flashcards for a spaced repetition system on the topic of Fleiss' Kappa for me based on the wikipedia articles that I include below (after the string '--- Wikipedia articles ---'). You should mostly ignore your previous knowledge about Fleiss' Kappa and rely on the information provided in the Wikipedia articles below."
which was followed in the Python code by:
prompt = prompt + "\n\n--- Wikipedia articles ---\n\n"
prompt = prompt + "\n\n" + wiki_fleiss_kappa
prompt = prompt + "\n\n" + wiki_cohens_kappa
prompt = prompt + "\n\n" + wiki_scotts_pi
where wiki_fleiss_kappa
, wiki_cohens_kappa
, and wiki_scotts_pi
are copies of the respective Wikipedia articles (Wikipedia, 2023; Wikipedia, 2023; Wikipedia, 2023), which I scraped within my Python code using the Wikipedia-API package.
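The scraping snippet itself isn’t shown here; a minimal sketch of how the article texts can be fetched with the Wikipedia-API package could look like this (the user-agent string is just a placeholder):

```python
import wikipediaapi

# A user agent is required by the Wikipedia API policy; the string here is a placeholder
wiki = wikipediaapi.Wikipedia(user_agent="anki-flashcard-experiments/0.1", language="en")

wiki_fleiss_kappa = wiki.page("Fleiss' kappa").text
wiki_cohens_kappa = wiki.page("Cohen's kappa").text
wiki_scotts_pi = wiki.page("Scott's Pi").text
```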
The GPT-4 Turbo model with this prompt returned 9 flashcards, which generally were pretty decent, such as:
(…)
Flashcard 3: Applicability of Fleiss’ Kappa
- Front: Can Fleiss’ Kappa be used with various types of data?
- Back: Yes, Fleiss’ Kappa can be used with binary, nominal, or ordinal data, but for ordinal data, statistics that account for ordering, like Kendall’s coefficients, are usually more appropriate.
Flashcard 4: Formula for Fleiss’ Kappa
- Front: What is the formula for calculating Fleiss’ Kappa?
- Back: κ = (P̄ - P̄e) / (1 - P̄e), where P̄ is the mean of the extent to which raters agree for each subject, and P̄e is the mean of the proportion of all assignments which were to each category by chance.
(…)
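As a quick sanity check of the formula in “Flashcard 4” (this code is not part of the model output), the definition can be implemented directly; the toy counts below are made up:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories matrix of rating counts.

    counts[i, j] = number of raters who assigned subject i to category j;
    every row must sum to the same number of ratings n.
    """
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape                                         # N subjects, k categories
    n = counts[0].sum()                                         # ratings per subject
    p_j = counts.sum(axis=0) / (N * n)                          # proportion of assignments per category
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))   # agreement per subject
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example (made-up counts): 4 subjects, 3 categories, 5 raters per subject
ratings = [[5, 0, 0],
           [2, 3, 0],
           [0, 0, 5],
           [1, 2, 2]]
print(round(fleiss_kappa(ratings), 3))
```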
However, I wanted to get the model to generate more sophisticated question-answer pairs that would tease out more of the mathematical subtleties on the topic and quiz me for a deeper understanding of the concepts.
(GPT4_cont1 in the results figure)

To “teach” the model how I want it to go about creating Anki cards for me, and about the purpose of the Anki cards (what I want to get out of my spaced repetition practice), I decided to first feed it two articles on the topic (Nielsen, 2018; Nielsen, 2019):
prompt = "I want to you to learn about spaced repetition systems (SRS) such as Anki, so that you can act as a professional Anki card creator, with a particular expertise at creating Anki cards for topics in mathematics and statistics. Below I provide you first with an introductory text about spaced repetition systems by Michael Nielsen (starting after the string '--- FIRST TEXT ---' and ending with the string '--- FIRST TEXT END ---'). Then I provide you with another article by Michael Nielsen about creating Anki cards for mathematical topics (starting after the string '--- SECOND TEXT ---' and ending with the string '--- SECOND TEXT END ---'). Based on this reading material please explain what process you will follow, as a professional Anki card creator, to create Anki cards for me based on other articles, papers or notes that I will provide in the future."
which was followed by:
prompt = prompt + '\n\n--- FIRST TEXT ---\n\n' + text_srs
prompt = prompt + '\n\n--- FIRST TEXT END ---\n\n'
prompt = prompt + '\n\n--- SECOND TEXT ---\n\n' + text_srs_math
prompt = prompt + '\n\n--- SECOND TEXT END ---\n\n'
where text_srs
and text_srs_math
are plain text versions of (Nielsen, 2018; Nielsen, 2019), downloaded from within my Python code with the Beautiful Soup library.
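The download code isn’t shown either; a minimal sketch with requests and Beautiful Soup could look like the following (the URLs are my assumption of where the two Nielsen articles live):

```python
import requests
from bs4 import BeautifulSoup

def page_to_text(url):
    """Download a web page and return its visible text content."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

text_srs = page_to_text("http://augmentingcognition.com/ltm.html")           # Nielsen, 2018
text_srs_math = page_to_text("https://cognitivemedium.com/srs-mathematics")  # Nielsen, 2019
```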
After the model had output an explanation of the Anki card generation process based on this prompt, I followed up the conversation with the actual request to create Anki flashcards for me:
prompt2 = "Create Anki cards on the topic of Fleiss' Kappa for me based on the wikipedia articles that I include below (after the string '--- Wikipedia articles ---'). You should mostly ignore your previous knowledge about Fleiss' Kappa and rely on the information provided in the Wikipedia articles below."
where again I appended the prompt with the three articles (Wikipedia, 2023; Wikipedia, 2023; Wikipedia, 2023).
This approach provided 12 AI-suggested flashcards, which indeed seemed to go more in the direction I wanted. Here are a couple of examples of cards I like from this output:
Calculation of Agreement Proportions
Q: How do you calculate the proportion of assignments to the j-th category (pj) in Fleiss’ kappa? A: pj = (1 / Nn) ∑ from i=1 to N nij, with N being the total number of subjects, n the number of ratings per subject, and nij the number of raters who assigned the i-th subject to the j-th category.
Q: Define Pi in the context of Fleiss’ kappa. A: Pi is the extent to which raters agree for the i-th subject, calculated as (1 / n(n-1)) * ∑ from j=1 to k nij(nij-1).
(GPT4_cont2 in the results figure)

Trying to get even more useful (to me) Anki cards, I continued the same chat with the following prompt:
prompt3 = "Please create additional cards for deeper mathematical understanding, including more detailed breakdown of the mathematical definitions and formulas, the influence of different numbers of categories on kappa values, and comparisons with related statistics such as Scott's Pi and Cohen's kappa."
This gave 11 additional suggested question-answer pairs. Some were rephrased repeats of the last output, some were new and interesting, but generally I felt that GPT-4 was getting to its limits on the task here, likely because the Wikipedia articles I fed it simply didn’t provide enough content to go into more depth on the topic.
With the GPT-4 Turbo output in hand, an important question is whether GPT-3.5 Turbo can perform just as well, given that, compared to GPT-4, it is cheaper or even free to use. From my Python code I used the “gpt-3.5-turbo-1106” model through the OpenAI API. I had initially tried the “gpt-3.5-turbo” model, but it couldn’t handle the context length of my prompt, which was the same as my “initial prompt” for GPT-4 described above.
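The API calls themselves are not shown in this post; a minimal sketch using the current openai Python client (the exact client version I used may differ) looks roughly like:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",   # or "gpt-4-1106-preview" for the GPT-4 Turbo runs
    messages=[{"role": "user", "content": prompt}],  # `prompt` assembled as described above
)
print(response.choices[0].message.content)
```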
(GPT3 in the results figure)

I started with the same prompt as the “initial prompt” for GPT-4 described above.
(GPT3_cont1 in the results figure)

For the “longer more complex prompt” (described in detail in the GPT-4 section above), the model couldn’t handle both supplied articles (Nielsen, 2018; Nielsen, 2019) due to context length limitations. So, I only fed it one of the two articles (Nielsen, 2019), leaving the rest of the prompt unchanged.
(GPT3_cont2 in the results figure)

The prompt here was identical to the one used in the respective GPT-4 section above.
Next I wanted to try out a few open-source models, running locally on my computer, to perform the same flashcard generation task.
There is an overwhelming number of options for open-source models that can be downloaded from Huggingface (or perhaps one should call them “open-weight models” for more precise terminology). So there is a lot to choose from, and there are multiple leaderboards that can guide the choice, such as the Huggingface “Open LLM Leaderboard” or the “Chatbot Arena”. However, I haven’t yet dedicated time to thoroughly understand the metrics and construction of such leaderboards. For that reason, I didn’t base my model choices on any leaderboards for now. What I did instead was try out a few different models that I had seen mentioned in other people’s experiments on the internet. I then chose to stick with a couple of those models that would run on my hardware given the prompts I was using, and that seemed to provide usable output for the task in question. The computer I was using for this is basically a gaming PC with an Nvidia RTX 4090 graphics card and, other than that, somewhat older mid-level components.
Note that I ran the experiments described below at the end of November 2023. The available open-source models/solutions may have improved considerably since then, or may not have been the very best open models for the given task in the first place (even for my hardware constraints). I would appreciate any hints about superior open models for the task that can run offline on my local machine (for instance on a single Nvidia RTX 4090 or a comparable gaming GPU, or possibly CPU-only but I’m impatient).
Running an LLM on your own Nvidia GPU is made relatively easy by the Huggingface’s Transformers library in conjunction with PyTorch.
The open-source models that I tried initially (various derivatives of the Llama 2 LLM) tended to run out of GPU memory, given the prompts I was using (recall that I need to pass in at least most of the Wikipedia article on Fleiss’ Kappa as part of the prompt (Wikipedia, 2023)), although I had shortened the prompts considerably compared to what I used for GPT-4 and GPT-3.5 above. So, I had to leverage derivatives of popular LLMs that are more memory-efficient through the use of quantization techniques. Specifically, for the results presented below, I ended up using the model TheBloke/OpenOrca-Platypus2-13B-GPTQ, which is a GPTQ quantized version of OpenOrca-Platypus2-13B, which in turn is a merge of two fine-tuned models based on LLaMA2-13B by Meta. The reason I chose that specific model for the experiments is partly due to it being one of the models used by Jeremy Howard in the video referenced above (if I recall correctly), and also based on the initial experimentation with multiple other models.
(OrcaPlatypus2-13B_prompt1 in the results figures below)

Due to context length limitations, I used the following shorter prompt (compared to the GPT prompts above):
Spaced repetition is an evidence-based learning technique that is usually performed with flashcards, which are essentially question-answer pairs. Newly introduced and more difficult flashcards are shown more frequently, while older and less difficult flashcards are shown less frequently in order to exploit the psychological spacing effect. The use of spaced repetition has been proven to increase the rate of learning.
Given the text below (after the string 'TEXT'), suggest flashcards (i.e. questions and the corresponding answers) for a spaced repetition system, in order to help an undergraduate student to learn the presented information. Please provide your suggested flashcards as question-answer pairs (Q: ..., A: ...).
\n\n
TEXT.
where only the Wikipedia article on Fleiss’ kappa (Wikipedia, 2023) was appended to the prompt, but unlike previously, not the articles on Cohen’s Kappa and Scott’s Pi (Wikipedia, 2023; Wikipedia, 2023):
prompt = prompt + "\n\n" + wiki_fleiss_kappa
Then I converted this prompt into the instruction-response template format of base Platypus2. By “template format” I mean a standardized prompt formatting that can look something like, ### Instruction: ... ### Response:
, or User: ... <|end_of_turn|>Assistant:
, etc., which is needed for the open-source LLMs (not sure if all of them though) to ensure that they provide an actual response to my query rather than treat it as a piece of text to be extended in an arbitrary manner with some additional text.
To be honest, I’m not quite sure which prompt template format would have been best to use, but this seemed to work well enough.
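For completeness, here is a minimal sketch of how the model can be loaded and queried with the Transformers library (assuming the optimum/auto-gptq integration for GPTQ weights is installed; the template below is one plausible format, the model card may specify a slightly different one):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/OpenOrca-Platypus2-13B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # GPTQ weights on the GPU

def to_instruction_template(user_prompt):
    # One plausible instruction-response template; check the model card for the exact format
    return f"### Instruction:\n\n{user_prompt}\n\n### Response:\n\n"

# `prompt` is the flashcard-generation prompt assembled above (with the Wikipedia text appended)
inputs = tokenizer(to_instruction_template(prompt), return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```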
This query gave me 7 reasonable-looking flashcards as question-answer pairs. After that the output started to deteriorate: first a flashcard that was grammatically mostly correct but didn’t make any sense, then disconnected sentence fragments, and finally a repeated sequence of characters. Parts of the example output are provided below for illustration:
1. Question: What is the most suitable measure for assessing agreement between a fixed number of raters when assigning categorical ratings to a number of items or classifying items?
Answer: Fleiss’ kappa.
2. Question: When should we use Cohen’s kappa over Fleiss’ kappa?
Answer: Fleiss’ kappa works for any number of raters, while Cohen’s kappa only works when assessing the agreement between two raters or the intra-rater reliability.
(…)
[Five other reasonable flashcards (not shown)]
[Then the output starts to deteriorate:]
8. Question: If the sum of all cells is equal to 1440, what does it mean?
Answer: If the sum of all cells equals 40 to 40 cells, then it would mean that this value is used to maintain the consist of data40. The rater and Cells and classified data’s.
9. In this manner, but the consistent with the Pearson and consistent with Pearson and in terms of the analysis and classified data within correlation and the agreement, data. The data.
9.5. Pearson and the data. The Pearson and the agreement4. Each rater and correlation and in the agreement with the data.5. The data. correlation analysis, with6. Each raters and data.5.5. The data on the ratio and correlation and the data. Each entry rate. The data. Each rater and correlation and data.
9. The data.1. The agreement and the data; that helps in6. The data.5. Each pair of data.4.0. The data. Analysis in terms5.02. Each. the data. The data.5.
6. The more. Each.6. The data.7.11. the data.5.7.5.5.0.6.5.5. the6.6.5.5.5.6.6.5.data.5.6.5.5.5.6.6.6.6.6.5.6.5.6.6.6.5.5.6.6.6.6.5.6.5.5.6.6.5.6.5.6.6.6. .5.6.6,6.5.6.5.5.5.6.5.6.5.5.6.5.5.5.5.6.5.5.5.6.5. 6.5.5.5. 5.5.5.5.5.5.5.5. 5.6.6. 6 to 5.5.5.5. 5.5.6. 6.5.5.
(OrcaPlatypus2-13B_prompt2 in the results figures below)

After the partially successful attempt above, I decided to try replacing the scraped Wikipedia article on Fleiss’ Kappa with a somewhat more manually curated text about Fleiss’ Kappa. That “more manually curated text” was me copy-pasting only the relevant parts of the Wikipedia article, with better formatting than what I had obtained previously with the Wikipedia-API Python package in an automated fashion.
In addition, the copy-pasted Wikipedia excerpts were prepended by a very simple sentence describing the task:
Given the text below, suggest questions and the corresponding answers to use in a quiz or exam for a class of undergraduate students.
In my experience, using that simple description of the task, which doesn’t even mention “spaced repetition” or “flashcards”, helped to improve the output for some other locally run LLMs that I tried (not shown) – predominantly smaller models, which otherwise tended to not address the right task (for example, suggesting questions about the spaced repetition concept rather than the desired topic) or to produce many hallucinations.
The result of this indeed seemed better compared to the last prompt, and also didn’t exhibit artifacts like the ones illustrated in the last subsection. However, from a practical perspective the value of this practice is a little questionable: if I can take the time to manually curate a better input text for the LLM, I could just as well have used that same time to create the Anki flashcards manually, without using the LLM.
(OrcaPlatypus2-13B_prompt3 in the results figures below)

For the OpenOrca-Platypus2-13B model I have also made a variation of the same prompt, referred to as OrcaPlatypus2-13B_prompt3, where the flashcard generation task for spaced repetition was described in somewhat more detail:
Spaced repetition is an evidence-based learning technique that is usually performed with flashcards, which are essentially question-answer pairs. Newly introduced and more difficult flashcards are shown more frequently, while older and less difficult flashcards are shown less frequently in order to exploit the psychological spacing effect. The use of spaced repetition has been proven to increase the rate of learning.
Given the text below (after the string 'TEXT'), suggest flashcards (i.e. questions and the corresponding answers) for a spaced repetition system, in order to help an undergraduate student to learn the presented information.
\n\n
TEXT
where the “more curated” text on Fleiss’ Kappa was appended after “TEXT”.
(llama.cpp-Llama2-7B in the results figures)

Finally, I also wanted to explore the use of LLMs without an Nvidia GPU, i.e., running on the CPU of my computer, by utilizing the llama.cpp Python package. While llama.cpp allows you to run Meta’s LLaMA models on different kinds of hardware, I used the default Linux install, which runs on CPU only.
The specific model I used in conjunction with llama.cpp was llama-2-7b-chat.Q5_K_M.gguf (again, something I saw in Jeremy Howard’s “A Hackers’ Guide to Large Language Models” video if I remember correctly).
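The llama.cpp invocation isn’t shown in this post; with the llama-cpp-python bindings, a minimal sketch (model path and generation parameters are placeholders) could be:

```python
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q5_K_M.gguf", n_ctx=4096)  # runs on the CPU by default

# `prompt` is the flashcard-generation prompt described below, with the Wikipedia text appended
output = llm(prompt, max_tokens=1024, temperature=0.7)
print(output["choices"][0]["text"])
```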
It was difficult to experiment with prompts for llama.cpp, because, when using CPU only, text generation is slow. So, in the following, I analyze results only for a single prompt:
Spaced repetition is an evidence-based learning technique that is usually performed with flashcards, which are essentially question-answer pairs. Newly introduced and more difficult flashcards are shown more frequently, while older and less difficult flashcards are shown less frequently in order to exploit the psychological spacing effect. The use of spaced repetition has been proven to increase the rate of learning.
Given the text below (after the string 'TEXT'), suggest flashcards (i.e. questions and the corresponding answers) for a spaced repetition system, in order to help an undergraduate student to learn the presented information. Please provide your suggested flashcards as question-answer pairs (Q: ..., A: ...).
\n\n
TEXT
where as before the Wikipedia article about Fleiss’ Kappa was appended to the prompt, and no additional information was appended due to context length limitations.
The output contained 8 nicely formatted suggested flashcards, without anything completely nonsensical or hallucinated.
To remind you, as discussed at the top of this post, there are the following main components to this performance evaluation “study”:
As mentioned in the introduction, there could be many imperfections and limitations in this assessment “study” of LLM performance, but we aren’t going to worry about that, since this is just a random experiment I’m doing in my spare time.
After creating the LLM-generated flashcards, I put them in random order into a spreadsheet, which also excluded any indication of the models or prompts used for each output. Then I put this project aside for a week, which allowed me to mostly forget which question-answer pair was suggested by which LLM/prompt. This one-week break can be viewed as slightly analogous to a wash-out period (albeit a very short one) in reader studies for performance evaluation in diagnostic medicine, as I was taking a deliberate extended break with the goal of forgetting what I knew about the data. When I came back to this project, I rated each AI-suggested flashcard along the following five dimensions:
- Is the question self-contained?
- Is the question atomic?
- Is the answer truthful?
- Does the Q-A pair make sense?
- Would I use a similar Q-A pair in my deck?
As mentioned above, I blinded myself to the models/prompts used for each generation, and leveraged my forgetfulness by taking a one-week break between generating the flashcards and rating them. Within each category I assigned ratings on a scale of 0, 0.5, or 1; that is, I sometimes gave partial credit. For example, for truthfulness, an AI-suggested answer to a flashcard could have two parts, where one may be correct while the other is false; such a two-part answer flashcard would likely get a 0.5 in the “truthfulness” category and a 0 for “atomicity”.
Finally, I created flashcards for my actual Anki deck, using 21 of the suggested Q-A pairs as the basis for my final Anki cards (only one of the final cards matches a suggested card exactly).
In addition, I looked at embeddings of each AI-generated flashcard, and compared how closely they match the embeddings of the 21 Anki cards that I actually ended up including in my deck (manually modified flashcards based on some of the AI-generated ones).
For this I used OpenAI’s embedding model text-embedding-ada-002
.
I used cosine similarity, a metric similar to the widely known Pearson correlation coefficient, to compare the text embedding of each of the 100 AI-generated cards with the embeddings of each of my 21 human-curated flashcards. For each AI-generated flashcard I then recorded the maximum of the 21 cosine similarity values, which I denote as max_cos_sim
. The max_cos_sim
values can be used as another approach to compare the generative models in this experiment, attempting to evaluate how similar the output of each model is to the flashcards that I eventually deemed worthy of including in my spaced repetition deck.
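The embedding code isn’t included above; a minimal sketch of how max_cos_sim can be computed (v1-style openai client; the two card lists are placeholders for the generated and the curated Q-A strings) could look like:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def embed(texts):
    """Return one embedding vector per input text as a 2D numpy array."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ai_emb = embed(ai_generated_cards)   # the 100 AI-suggested Q-A pairs (placeholder variable)
my_emb = embed(my_final_cards)       # the 21 curated Q-A pairs (placeholder variable)

max_cos_sim = [max(cos_sim(a, m) for m in my_emb) for a in ai_emb]
```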
I simply compared the means per model/prompt. No sophisticated statistical analysis was performed at this time, because of my time limits for this blog post and complications due to the small sample sizes and various sources of potential bias or variability that would need to be accounted for. More detail on the analysis of each specific rating categorization as well as the embeddings is provided in the subsections under “Results” below.
If I have time and interest in the future, I may update the analysis with:
The mean and standard deviation values of the max_cos_sim
metric (described in the section “Calculating embeddings” above) provide numerical measures of how similar the AI-generated flashcards are to the ones I ultimately added to my Anki deck. However, I found that this is a poor way of comparing LLMs on this task, for the reasons outlined below. The breakdown per model/prompt is as follows:
| model/prompt | max_cos_sim: mean (std. dev.) |
|---|---|
| GPT4 | 0.964849 (0.037430) |
| GPT3 | 0.957126 (0.031817) |
| GPT3_cont1 | 0.956216 (0.033335) |
| GPT4_cont1 | 0.953869 (0.037121) |
| GPT3_cont2 | 0.951629 (0.046560) |
| GPT4_cont2 | 0.942788 (0.033649) |
| OrcaPlatypus2-13B_prompt2 | 0.938141 (0.044916) |
| OrcaPlatypus2-13B_prompt3 | 0.931788 (0.047142) |
| OrcaPlatypus2-13B_prompt1 | 0.905019 (0.053023) |
| llama.cpp-Llama2-7B | 0.891498 (0.030630) |
The GPT models are in front according to this metric. But note that this does not account for factors such as diversity of the generated flashcards, how sophisticated they are, different numbers of cards generated by each model, etc.
Generally, the max_cos_sim
metric turned out to not be very informative for reasons including:
An appropriately designed custom metric could be used instead of the max_cos_sim metric proposed above to account for some of these issues. For instance, some problems with max_cos_sim stem from it evaluating a given model’s generated flashcards as independent samples, when in fact the task is to come up with an optimal set of flashcards for studying a given topic (i.e., a set with sufficient coverage of the topic, the desired amount of breadth and depth, little repetition, etc.). Therefore, an appropriate specialized performance metric would likely have to be constructed so that it compares the entire set of each model’s generated cards (considered as a whole rather than as independent cards) against the entire set of the flashcards that were ultimately included in my deck.
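To illustrate the idea (this was not actually computed for this post), one possible set-level score could measure how well a model’s generated cards cover the curated set as a whole:

```python
import numpy as np

def set_coverage(gen_emb, ref_emb):
    """Mean, over the reference (curated) cards, of the best cosine similarity to any
    generated card -- a rough sketch of a set-level coverage score, not a validated metric."""
    sims = gen_emb @ ref_emb.T / (
        np.linalg.norm(gen_emb, axis=1)[:, None] * np.linalg.norm(ref_emb, axis=1)[None, :]
    )
    return float(sims.max(axis=0).mean())
```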
However, not wanting to spend even more time on this project, I didn’t investigate these aspects further.
Another drawback of comparing text embeddings that I want to highlight is that subtle word changes can make a huge change in the overall quality of a flashcard while the embeddings will stay very similar. By changing a word to another word that has a related but somewhat different meaning, a flashcard can turn from something providing a lot of insight to something that’s unclear or even false. For example compare the following question that was generated by GPT-3.5 Turbo in this experiment:
Why may Fleiss’ kappa not be suited for cases where all raters rate all items?
with the slightly modified question:
Why is Fleiss’ kappa not suited for cases where all raters rate all items?
The small change of the verb from “may be” to “is” makes a big difference for the scientific meaning of the sentence, but the similarity between embeddings is very high with a cosine similarity of 0.9925.
That is not to say that the similarity between embeddings makes no sense at all – I in fact do observe that the specific question-answer pairs, which were actually used as the basis for my Anki cards, have generally higher max_cos_sim
values than those questions which I didn’t end up using, as shown in this table:
| | max_cos_sim: mean (std. dev.) | max_cos_sim: median |
|---|---|---|
| Q-A pairs I used | 0.984527 (0.026339) | 0.995037 |
| Q-A pairs I didn’t use | 0.928203 (0.040223) | 0.938435 |
Is the question self-contained?
Here, I was rating whether the question can be understood without any additional explanations (such as definitions), beyond some kind of common knowledge (of course, there is some room for interpretation of what I consider not needing a definition).
Here is an example of a question from the GPT4_cont2
output which I rated as not self-contained:
Describe how to compute \(P_i\) for subject \(i\).
Here is a bar graph of the results per model/prompt:
The most striking observation here is that llama.cpp-Llama2-7B received very low scores, with a large separation from the other models. But note that this is a bit unfair towards llama.cpp, because it generated flashcards under headings/topics that describe the context to a certain extent (for example: “Topic 1: Classification agreement between raters (P i )”), but when I rated the generated cards, I only included the questions and answers (without any such headings), in order to blind myself to the models/prompts by keeping the same format for all of them.
Is the question atomic?
For this category, I was rating, essentially, whether a given flashcard is testing for recall of a single concept. That is, a two-part answer would likely not be atomic.
Here is an example of a question suggested by llama.cpp-Llama2-7B
that I rated as not “atomic”, because it is asking to list “some” disadvantages of Kappa, rather than asking about one specific disadvantage:
What are some disadvantages of Kappa?
Here is the breakdown of results per model/prompt:
Is the answer truthful?
Note that high truthfulness may not necessarily imply a high-quality spaced repetition flashcard, because, for example, the question-answer pair may be far too obvious or far too difficult. Also note that the truthfulness property isn’t necessarily very important for this task (at least to me), because the AI-augmented spaced repetition card creation process would involve checking and/or adjusting each AI-suggested flashcard before adding it to the deck.
Here is the bar graph of the results:
Does the Q-A pair make sense?
There is some conceptual similarity between this rating dimension and the one titled “would I use a similar Q-A pair in my deck?” (results below), because both of them can be viewed as measures of the overall quality of a flashcard. However, the crucial difference between them is that I considered the “would I use” category as something personalized to me, while the ratings in the “makes sense” category are intended to assess whether it could be a good flashcard for somebody in general. That is, I might rate a given flashcard high on the “makes sense” dimension even though the specific question and answer aren’t something that I would want to include in my own deck (for example, it could be a great question, just not on an aspect that I personally find interesting or important to know).
Also, because I actually went through the process of creating new flashcards based on the AI-suggested ones, I can answer the question “am I going to use a similar flashcard?” with far more certainty than the question “does this make sense?”. One can observe this on how the following bar graph differs in range and separation of models/prompts compared to the graph in the next section.
Would I use a similar Q-A pair in my deck?
I consider this to be the main quality indicator for the generated flashcards, since this entire exercise is about generating cards for my own Anki deck. After observing the limitations of the other rating categorizations and of the text embedding approach, it also seems the most suitable choice for the primary metric. Moreover, this variable in a way combines all the other rating dimensions I used.
Here is the breakdown of the results:
We see that (somewhat more) manually curating the input to the offline open-source models improved performance substantially (see OrcaPlatypus2-13B_prompt3
vs. the other open-source models). The same is true for several other rating dimensions above. But, as I have mentioned before, this has limited utility, because if I need to manually curate the input text, I could just as well create the flashcards without the help from AI.
Interestingly, I don’t see any particularly strong correlation of the “Would I use a similar flashcard?” ratings with any of the other rating dimensions. This implies that whether I will use an AI-generated flashcard as the basis for new cards in my Anki deck might depend to a large degree on other, uncaptured factors. Perhaps I didn’t capture some other important dimensions of flashcard quality, such as the uniqueness or originality of an AI-generated flashcard, or something else that I didn’t think of. Or it could be just highly personal and strongly dependent on my specific background, interests, and taste, which are much harder to quantify or measure.
There is discussion of the results in the individual subsections above, including some concluding remarks, and I don’t want to repeat that information. But here are some key takeaways from the experiments.
On another note, the AI-augmented spaced repetition flashcard generation process strongly encouraged me to have a much deeper look at the topic of Fleiss’ Kappa and similar measures than I would have otherwise. Because some of the suggested flashcards are very interesting, but, at the same time, are missing important context or explanations (or proofs in case of mathematics) which weren’t sufficiently covered/explained in the Wikipedia articles that I fed to the models, it forced me to read (parts of) several academic research papers. So, the AI-augmented process sort of strongly motivates you to look more deeply into the topic by providing short intriguing bits of information as concise question-answer pairs.
While I presented a semi-rigorous comparison along multiple rating dimensions, I may have failed to capture some other important aspects of the quality of AI-generated flashcards. Moreover, some of my assigned ratings could be highly subjective, because there is naturally a considerable level of subjectivity in what constitutes a high-quality flashcard for spaced repetition, depending on personal taste and experience, the topic and one’s prior knowledge of it or around it, various aspects of the context of the spaced repetition practice, etc. Therefore, it would be hard to justify the amount of time and effort needed to improve the prompt, input material, and models to achieve the optimal flashcards for one’s exact personal preferences and conditions (which, in addition, will probably vary from one occasion to another or over time). So, overall, it seems that for now it’s probably best to use LLMs only to get some sub-optimal flashcards to use as a starting point for manual editing. Some AI-generated cards may even highlight interesting aspects which the user would have overlooked otherwise.
There are many excellent resources on LLMs to get started – some that have helped me are:
In my case, I basically watched the two videos linked above, and then was able to figure out how to do anything that I wanted to do by just reading the explanations and examples in the docs of the relevant Python packages and of models at Huggingface.
Footnotes:
Different model-prompt combinations resulted in different numbers of AI-generated flashcards, but somehow by random luck the total summed to exactly 100. ↩
It is New Year’s Day 2023 :sweat_smile:. Happy New Year!!! :fireworks: I am currently driving with my family coast-to-coast on a road trip through the United States, but for New Year’s Eve and New Year’s Day we stayed in one place. Taking advantage of the driving-free days, my 4-year-old son and I had some great fun with the open-source stable diffusion models; in particular, the Text-Guided Image Inpainting techniques.
Basically, inpainting allows you to replace or transform image areas of your choice with something else AI-generated based on a text prompt. You can see some of my results in the collage above. The top left panel shows the original (real) image. That’s a photo I took of my son during breakfast at a restaurant this morning, and he found it absolutely hilarious how we can drastically modify it with the computer – the text prompts we used were based on his suggestions to a large part.
I had already played around a few times with image generation with stable diffusion in Python, and with textual inversion for representing a specific artistic style. I was immediately (and still am) positively surprised by how easy and pleasant the developers made it to use stable diffusion via the Huggingface diffusers library in Python. But I hadn’t looked at inpainting techniques until today. I learned a lot from great tutorials about stable diffusion, such as the FastAI notebook “Stable Diffusion Deep Dive”, but I hadn’t specifically seen examples of inpainting so far (though I hadn’t looked :stuck_out_tongue:). So, I’m providing some relevant code snippets here.
There are two clear ways in which inpainting could be applied to the image I started with (top left in the collage above). Either replace/transform the boy, or replace/transform the drawing that he is holding.
However, first, one has to define an image mask:
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Mask array of shape (height, width): 0 = keep, 255 = region to be inpainted
mask = np.zeros(init_image.size).T
mask[270:, :] = 255       # everything from row 270 down is to be inpainted...
mask[550:, 400:] = 0      # ...except this bottom-right region, which is kept as-is
mask = Image.fromarray(np.uint8(mask)).convert('RGB')
plt.imshow(mask)
Then the inpainting pipeline can be loaded:
import torch
from diffusers import DiffusionPipeline

# Load the Stable Diffusion 2 inpainting pipeline in half precision on the GPU
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    revision="fp16",
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_attention_slicing()  # to save some gpu memory in exchange for a small speed decrease
Before applying the models, I resized and square-padded all images to 512x512 pixels (I saw the recommendation for square-padding in someone else’s stable diffusion inpainting code, I don’t remember where exactly, and didn’t do any experiments without square-padding).
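The square_padding helper is not included in this post; a plausible minimal version (padding with black; the function I actually used may differ) is:

```python
from PIL import Image

def square_padding(img, fill=(0, 0, 0)):
    """Pad a PIL image with a solid color so that it becomes square."""
    w, h = img.size
    side = max(w, h)
    padded = Image.new("RGB", (side, side), fill)
    padded.paste(img, ((side - w) // 2, (side - h) // 2))
    return padded
```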
Using the above model, I was able to generate images with code like:
import torch
torch.manual_seed(2023)
inp_img = square_padding(init_image) # my own function, init_image is loaded with PIL.Image
mask = square_padding(mask)
inp_img = inp_img.resize((512, 512))
mask = mask.resize((512, 512))
prompt = "something..."
negative_prompt = "something..."
result = pipe(prompt, image = inp_img, mask_image = mask, negative_prompt=negative_prompt,
num_inference_steps = 50, guidance_scale = 11).images
result[0] # this is the generated image
Alternatively, the generated image can be created in an image-to-image fashion. For this, I adapted an example from the huggingface/diffusers repository, along the lines of:
from diffusers import DiffusionPipeline
import torch
torch.manual_seed(2023)
inp_img = my_input_image # loaded with PIL.Image
mask = my_image_mask # also PIL.Image
inner_image = inp_img.convert("RGBA")
pipe = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-inpainting",
custom_pipeline="img2img_inpainting",
torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
pipe.enable_attention_slicing() # to save some gpu memory in exchange for a small speed decrease
prompt = "something..."
negative_prompt = "something..."
result = pipe(prompt=prompt, image=inp_img, inner_image=inner_image,
mask_image=mask, negative_prompt=negative_prompt,
num_inference_steps = 50, guidance_scale = 10).images
result[0] # this is the generated image
Assume that the random two-dimensional vector \((Y_1, Y_2)\) follows a bivariate distribution with mean \(\E(Y_1, Y_2) = (\mu_1, \mu_2)\), and covariance matrix with entries \(\mathrm{Var}(Y_1) = \sigma_1^2\), \(\mathrm{Var}(Y_2) = \sigma_2^2\) and \(\mathrm{Cov}(Y_1, Y_2) = \sigma_{12}\).
The squared perpendicular distance of the random point \((Y_1, Y_2)\) from the 45-degree line is
\[\begin{equation*} D^2 = \frac{(Y_1 - Y_2)^2}{2}, \end{equation*}\]see the figure below. Thus, the expected value of the squared perpendicular distance times two (for notational convenience) is given by,
\[\begin{align} \E\left[ 2D^2 \right] &= \E\left[ (Y_1 - Y_2)^2 \right] \nonumber \\ &= \E\left[ \left( (Y_1-\mu_1) - (Y_2-\mu_2) + \mu_1-\mu_2 \right)^2 \right] \nonumber \\ &= \E\left[ \left((Y_1-\mu_1) - (Y_2-\mu_2) \right)^2 \right] + (\mu_1-\mu_2)^2 \nonumber \\ &= (\mu_1-\mu_2)^2 + \sigma_1^2 + \sigma_2^2 - 2\sigma_{12} \label{eq:decomp1} \\ &= (\mu_1-\mu_2)^2 + (\sigma_1 - \sigma_2)^2 + 2[1 - \rho] \sigma_1 \sigma_2. \nonumber \end{align}\]To answer the question raised above, we can estimate the value of equation \(\eqref{eq:decomp1}\) based on \(n\) pairs of observations \((y_{11}, y_{12}), (y_{21}, y_{22}), \dots, (y_{n1}, y_{n2})\) substituting the respective sample mean, sample variance, and covariance estimates for \(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12}\) respectively.
That’s great, but why should we spend any time thinking about the expected distance from the 45-degree line? What’s interesting about it?
Apart from delighting in the pure joy of doing mathematics and taking pleasure in the experience of mathematical beauty… :joy: :stuck_out_tongue_closed_eyes: … A measure of distance from the 45-degree line naturally quantifies the (dis)agreement between the two sets of observations. For example, we may have measured the same target entities using two different measurement instruments, and may want to know if and to what extent they agree.
Towards quantifying the extent of the (dis)agreement between two sets of observations it is natural to try to scale (or normalize) the quantity of equation \(\eqref{eq:decomp1}\) to the range \([0, 1]\). However, it turns out that, rather than scaling to a \([0, 1]\) range, it is customary to scale this quantity to the range from -1 to 1 as follows,
\[\begin{equation} \mathrm{CCC} := 1 - \frac{\E\left[ (Y_1 - Y_2)^2 \right]}{(\mu_1-\mu_2)^2 + \sigma_1^2 + \sigma_2^2} = \frac{2\sigma_{12}}{(\mu_1-\mu_2)^2 + \sigma_1^2 + \sigma_2^2}. \label{eq:ccc} \end{equation}\]This expression, first introduced by (Lin, 1989), is known as the Concordance Correlation Coefficient, abbreviated as CCC hereafter.
The scaling into the range from -1 to 1 may have been motivated by the fact that the Pearson correlation coefficient \(\rho\) also falls within the \([-1, 1]\) range. In fact, analogous to how a Pearson correlation coefficient \(\rho=1\) signifies perfect positive correlation, a CCC of 1 designates that the paired observations fall exactly on the line of perfect concordance (i.e., the 45-degree diagonal line).
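As a concrete illustration (not part of the original derivation), a sample estimate of the CCC is obtained by plugging the sample moments into equation \(\eqref{eq:ccc}\):

```python
import numpy as np

def concordance_correlation(y1, y2):
    """Sample estimate of Lin's concordance correlation coefficient (CCC)."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    m1, m2 = y1.mean(), y2.mean()
    s1, s2 = y1.var(), y2.var()              # 1/n variances
    s12 = np.mean((y1 - m1) * (y2 - m2))     # 1/n covariance
    return 2.0 * s12 / ((m1 - m2) ** 2 + s1 + s2)

t = np.linspace(0, 10, 50)
print(concordance_correlation(t, t))         # perfect concordance gives CCC = 1
```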
Further aspects of the relationship to the Pearson correlation coefficient \(\rho\) become visible if we rewrite the CCC further into the following set of equations.
\[\begin{equation} \mathrm{CCC} = \rho C, \label{eq:ccc2} \end{equation}\]where
\[\begin{equation} C = \frac{2}{v + \frac{1}{v} + u^2}, \quad v = \frac{\sigma_1}{\sigma_2}, \quad u = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1 \sigma_2}}. \label{eq:c} \end{equation}\]From equations \(\eqref{eq:ccc2}\) and \(\eqref{eq:c}\) we observe that:
Now it turns out that the Pearson correlation coefficient \(\rho\) has one major shortcoming when assessing reproducibility of measurements, such as when comparing two instruments that measure the same target entity.
:point_right: Unlike CCC, \(\rho\) is invariant to additive or multiplicative shifts by a constant value, referred to as location shift and scale shift respectively in the following set of figures:
Looking at the above figures we see that the magnitude of the Pearson correlation coefficient \(\rho\) does not change under location and scale shift (though the sign may flip). The CCC on the other hand quantifies the deviation from the 45-degree line, which is due to location and scale shifts in these examples, rather well.
This makes the CCC a better metric when we want to assess how well one measurement can reproduce another (i.e., how close the measurement pairs fall to the 45-degree line), while we would use \(\rho\) if what we want is quantifying to what extent the measurement pairs can be described by a linear equation (with any intercept and slope).
The following figures show the same examples where both the \(x\) and the \(y\) coordinates are augmented with Gaussian noise (mean 0, standard deviation 15; the same realization of the random noise is used within each subfigure). We see that both \(\rho\) and CCC move further away from the extreme values of \(-1\), \(0\), and \(1\) as noise is added.
As hinted above, you may want to compare two instruments that aim to measure the same target entity, or two assays that aim to measure the same analyte, or other quantitative measurement procedures or devices. For example, one set of measurements may be obtained by what’s considered the “gold standard”, while the other set of measurements may be collected by a new instrument/assay/device that may be cheaper or in some other way preferable to the “gold standard” instrument/assay/device. Then one would wish to demonstrate that the collected two sets of measurements are equivalent. (Lin, 1989) refers to this type of agreement or similarity between two sets of measurements as reproducibility of measurements. The paper considers the following two illustrative examples:
(1) Can a “Portable $ave” machine (actual name withheld) reproduce a gold-standard machine in measuring total bilirubin in blood?
(2) Can an in-vitro assay for screening the toxicity of biomaterials reproduce from trial to trial?
And indeed this type of reproducibility assessment is a task where CCC has some clear advantages over the Pearson correlation coefficient, as seen in the figures above, as well as over some other approaches, as discussed in (Lin, 1989) in detail. A couple of shortcomings of common statistical approaches (when applied to the reproducibility assessment problem in question) are the following:
I will end here. However, if you want to go deeper into the topic I invite you to check out the original paper by Lin for a more thorough discussion of the merits of the CCC as well as for its statistical properties. Moreover, since the publication of (Lin, 1989) there of course has been follow-up work, which I didn’t read (so, I may update this blog post in the future).
Let \(X\) and \(Y\) be two real-valued random variables.
For a fixed set \(B\) (Feller, 1966, p. 157) defines conditional probability of an event \(\{Y \in B\}\) for given \(X\) as follows.
By \(\prob(Y \in B \vert X)\) (in words, “a conditional probability of the event \(\{Y \in B\}\) for given \(X\)”) is meant a function \(q(X, B)\) such that for every set \(A \subseteq \mathbb{R}\)
\[\prob(X \in A, Y \in B) = \int_A q(x, B) \mu(dx)\]where \(\mu\) is the marginal distribution of \(X\).
(where \(A\) and \(B\) are both Borel sets on \(\R\).)
That is, the conditional probability can be defined as something that, when integrated with respect to the marginal distribution of \(X\), results in the joint probability of \(X\) and \(Y\).
Moreover, note that if \(A = \R\) then the above formula yields \(\prob(Y \in B)\), the marginal probability of the event \(\{ Y \in B \}\).
For example, if the joint distribution of two random variables \(X\) and \(Y\) is the following bivariate normal distribution
\[\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma^2_X & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma^2_Y \end{pmatrix} \right),\]then by sitting down with a pen and paper for some amount of time, it is not hard to verify that the function
\[q(x, B) = \int_B \frac{1}{\sqrt{2\pi(1-\rho^2)}\sigma_Y} \exp\left(-\frac{\left(y - \mu_Y+\frac{\sigma_Y}{\sigma_X}\rho( x - \mu_X)\right)^2}{2(1-\rho^2)\sigma_Y^2}\right) \mathrm{d}y\]in this case satisfies the above definition of \(\prob(Y \in B \vert X)\).
Later on (Feller, 1966, p. 159) follows up with the notion of conditional probability distribution:
By a conditional probability distribution of \(Y\) for given \(X\) is meant a function \(q\) of two variables, a point \(x\) and a set \(B\), such that
for a fixed set \(B\)
\[q(X, B) = \prob(Y \in B \vert X )\]is a conditional probability of the event \(\{Y \in B\}\) for given \(X\).
\(q\) is for each \(x\) a probability distribution.
It is also pointed out that
In effect a conditional probability distribution is a family of ordinary probability distributions and so the whole theory carries over without change.
(Feller, 1966)
When I first came across this viewpoint, I found it incredibly enlightening to regard the conditional probability distribution as a family of ordinary probability distributions. :smile:
For example, assume that \(X\) is an integer-valued and non-negative random variable, and that the conditional probability distribution of \(Y\) for given \(X\) is an F-distribution (denoted \(\mathrm{F}(d_1, d_2)\)) with \(d_1 = e^X\) and \(d_2 = 2^X\) degrees of freedom. Then the conditional probability distribution of \((Y \vert X)\) can be regarded as a family of probability distributions \(\mathrm{F}(e^x, 2^x)\) for \(x = 0, 1, 2, \dots\), whose probability density functions look like this:
In addition, as pointed out above, if we know the marginal distribution of \(X\), then the conditional probability distribution of \((Y \vert X)\) can be used to obtain the marginal probability distribution of \(Y\), or to randomly sample from the marginal distribution. Practically it means that if we randomly generate a value of \(X\) according to its probability distribution, and use this value to randomly generate a value of \(Y\) according to the conditional distribution of \(Y\) for the given \(X\), then the observations resulting from this procedure follow the marginal distribution of \(Y\). Continuing the previous example, assume that \(X\) follows a binomial distribution with parameters \(n = 5\) and \(p = 0.5\). Then the described simulation procedure estimates the following shape for the probability density function of \(\prob(Y)\), the marginal distribution of \(Y\):
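A small simulation sketch of this procedure (not from the original text), using the example’s distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw X ~ Binomial(5, 0.5), then Y | X = x ~ F(e^x, 2^x);
# the resulting y values follow the marginal distribution of Y.
x = rng.binomial(n=5, p=0.5, size=100_000)
y = rng.f(dfnum=np.exp(x), dfden=2.0 ** x)

# A density histogram of y, e.g. np.histogram(y, bins=200, density=True),
# then approximates the marginal density of Y shown in the figure.
```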
Finally, (Feller, 1966, p. 159) introduces the notion of conditional expectation. By the above, for given a value \(x\) we have that
\[q(x, B) = \prob(Y \in B \vert X = x), \quad\forall B\in\mathcal{B}\](here \(\mathcal{B}\) denotes the Borel \(\sigma\)-algebra on \(\R\)), and therefore, a conditional probability distribution can be viewed as a family of ordinary probability distributions (represented by \(q\) for different \(x\)s). Thus, as (Feller, 1966, p. 159) points out, if \(q\) is given then the conditional expectation “introduces a new notation rather than a new concept.”
A conditional expectation \(E(Y \vert X)\) is a function of \(X\) assuming at \(x\) the value
\[\E(Y \vert X = x) = \int_{-\infty}^{\infty} y q(x, dy)\]provided the integral converges.
Note that, because \(\E(Y \vert X)\) is a function of \(X\), it is a random variable, whose value at an individual point \(x\) is given by the above definition. Moreover, from the above definitions of conditional probability and conditional expectation it follows that
\[\E(Y) = \E(\E(Y \vert X)).\]We continue with the last example. From the properties of the F-distribution we know that under this example’s assumptions on the conditional distribution, it holds that
\[\E(Y \vert X = x) = \begin{cases} \frac{d_2}{d_2 - 2} = \frac{2^x}{2^x - 2}, \quad x > 1,\\ \infty, \quad x \leq 1. \end{cases}\]A rather boring strictly decreasing function of \(x\) converging to \(1\) as \(x\to\infty\).
Thus, under the example’s assumption on the distribution of \(X\), the conditional expectation \(\E(Y \vert X)\) is a discrete random variable, which has non-zero probability mass at the values \(2, 4/3, 8/7, 16/15,\) and \(\infty\).
An alternative approach is to define the conditional expectation first, and then to define conditional probability as the conditional expectation of the indicator function. This approach seems less intuitive to me. However, it is more flexible and more general, as we see below.
Let \(X\) and \(Y\) be two real-valued random variables, and let \(\mathcal{B}\) denote the Borel \(\sigma\)-algebra on \(\R\). Recall that \(X\) and \(Y\) can be represented as mappings \(X: \Omega \to \R\) and \(Y: \Omega \to \R\) over some measure space \((\Omega, \mathcal{A}, \prob)\). We can define \(\mathrm{E}(Y \vert X=x)\), the conditional expectation of \(Y\) given \(X=x\), as follows.
A \(\mathcal{B}\)-measurable function \(g(x)\) is the conditional expectation of \(Y\) for given \(x\), i.e.,
\[\mathrm{E}(Y \vert X=x) = g(x),\]if for all sets \(B\in\mathcal{B}\) it holds that
\[\int_{X^{-1}(B)} Y(\omega) d\prob(\omega) = \int_{B} g(x) d\prob^X(x),\]where \(\prob^X\) is the marginal probability distribution of \(X\).
If \(X\) and \(Y\) are real-valued one-dimensional, then the pair \((X,Y)\) can be viewed as a random vector in the plane. Each set \(\{X \in A\}\) consists of parallels to the \(y\)-axis, and we can define a \(\sigma\)-algebra induced by \(X\) as the collection of all sets \(\{X \in A\}\) on the plane, where \(A\) is a Borel set on the line. The collection of all such sets forms a \(\sigma\)-algebra \(\mathcal{A}\) on the plane, which is contained in the \(\sigma\)-algebra of all Borel sets in \(\R^2\). \(\mathcal{A}\) is called the \(\sigma\)-algebra generated by the random variable \(X\).
Then \(\mathrm{E}(Y \vert X)\) can be equivalently defined as a random variable such that
\[\mathrm{E}(Y\cdot I_{A}) = \mathrm{E}(\mathrm{E}(Y \vert X) \cdot I_{A}), \quad \forall A\in\mathcal{A},\]where \(I_{A}\) denotes the indicator function of the set \(A\).
The last paragraph illustrates that one could generalize the definition of the conditional expectation of \(Y\) given \(X\) to the conditional expectation of \(Y\) given an arbitrary \(\sigma\)-algebra \(\mathcal{B}\) (not necessarily the \(\sigma\)-algebra generated by \(X\)). This leads to the following general definition, which is stated in (Feller, 1966, pp. 160-161) in a slightly different notation.
Let \(Y\) be a random variable, and let \(\mathcal{B}\) be a \(\sigma\)-algebra of sets.
A random variable \(U\) is called a conditional expectation of \(Y\) relative to \(\mathcal{B}\), or \(U = \E(Y \vert \mathcal{B})\), if it is \(\mathcal{B}\)-measurable and
\[\E(Y\cdot I_{B}) = \E(U \cdot I_{B}), \quad \forall B\in\mathcal{B}.\]If \(\mathcal{B}\) is the \(\sigma\)-algebra generated by a random variable \(X\), then \(\E(Y \vert X) = \E(Y \vert \mathcal{B})\).
Let \(I_{\{Y \in A\}}\) be a random variable that is equal to one if and only if \(Y\in A\). The conditional probability of \(\{Y \in A\}\) given \(X = x\) can be defined in terms of a conditional expectation as
\[\prob(Y \in A \vert X = x) = \E(I_{\{Y \in A\}} \vert X = x).\]Under certain regularity conditions the above defines the conditional probability distribution of \((Y \vert X)\).
There are a number of very helpful tutorials online on how to set up an HTTPS static site using AWS S3 and CloudFront. Of course, as is always the case with blog articles, they may be outdated, incomplete, and generally not as trustworthy as the official AWS documentation on the topic, which is pretty good too; but it is also somewhat fragmented and inconvenient to follow. So I wrote my own summary to refer to in the future.
Relevant AWS docs: How to create a static website on AWS S3; How to use a custom domain with AWS S3; Setting up Amazon CloudFront; SSL certificate instructions.
First, we set up a static HTTP site without a custom domain on AWS S3:
- Create an S3 bucket named example.com (obviously replace example.com with your own domain).
- Upload an index.html to the bucket (you can keep all options for the upload at their default values). Then go to http://example.com.s3-website-us-east-1.amazonaws.com/ (where you need to replace example.com with the bucket name, and us-east-1 with your bucket’s region), and see if the contents of index.html show up.

Yay :laughing: we have a working website!! …without a custom domain or https yet :sweat_smile:
The www subdomain: Now prepare another S3 bucket for the subdomain “www.example.com” to be later redirected to the root domain “example.com” (btw, if you so wish, www.example.com
can be the main S3 bucket and the example.com
bucket can be configured to redirect — just swap their roles in this entire writeup):
- Create an S3 bucket named www.example.com (all options can be left at their defaults; this bucket doesn’t need to be publicly readable).
- Configure the bucket www.example.com to redirect all requests to example.com following Step 2.3 from the AWS docs at https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html.
- Check that the redirect works by going to http://www.example.com.s3-website-us-east-1.amazonaws.com/ (as before, replace the bucket name and region accordingly).

Map the domain and subdomain to their S3 buckets:
Amazon Route 53 is the AWS service that maintains the mapping between your domain’s alias records and the S3 bucket (website endpoint). You need to follow Step 3 from the AWS docs at https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html.
Configuration with your domain name registrar:
:smile: So we have a website with a custom domain!! …though without CloudFront (so loading may be rather slow) and without HTTPS.
Now it seems a good idea to create a new user that has full read-write permission to the `example.com` bucket and full permission to CloudFront, but does not have any further AWS permissions.
A suitable IAM policy document can be found at: https://github.com/laurilehmijoki/s3_website/blob/master/additional-docs/setting-up-aws-credentials.md
Make sure to save the new user’s access key ID and secret access key somewhere in a private place.
Well, I typically use Jekyll to make my static sites (because it’s awesome!).
The Ruby gem `s3_website` can be used to push the website to, or to synchronize it with, the S3 bucket.
The `s3_website` documentation is easy to follow.
I have found it convenient to use the `dotenv` gem to keep the access key ID and the secret access key of the user (that was just created) locally in a `.env` file (don’t commit/push it to github!!!).
At this point you may also choose to allow `s3_website` to set up CloudFront for the website, to save some time later (though without the SSL certificate, which will still have to be added manually; see below).
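For concreteness, here is a rough sketch of what the two files might look like (the `s3_id`, `s3_secret`, `s3_bucket`, and `cloudfront_distribution_id` keys and the ERB-style environment lookups follow the `s3_website` documentation; all values are placeholders, and the environment variable names are just my own choice):
# .env -- keep this file out of version control!
S3_ID=<your-access-key-id>
S3_SECRET=<your-secret-access-key>

# s3_website.yml -- reads the credentials from the environment
s3_id: <%= ENV['S3_ID'] %>
s3_secret: <%= ENV['S3_SECRET'] %>
s3_bucket: example.com
# added later, once the CloudFront distribution exists (see below)
cloudfront_distribution_id: <your-distribution-id>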
We need an SSL certificate to enable HTTPS for the custom domain when it is accessed through CloudFront.
Follow the AWS docs at https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html to request a public certificate for your domain. Some important points:
- Add both `example.com` and `*.example.com` to the certificate.

I encountered one caveat in this process:
To use an ACM Certificate with CloudFront, you must request or import the certificate in the US East (N. Virginia) region.
(from http://docs.aws.amazon.com/acm/latest/userguide/acm-services.html); i.e., change region to US East N. Virginia if needed (top right corner within the AWS interface).
Follow these AWS docs to create a CloudFront distribution: https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-cloudfront-walkthrough.html; unless a CloudFront distribution was already created by `s3_website` (see one of the previous optional steps), in which case it merely needs to be edited (add the SSL certificate to it, and update “Alternate Domain Names” with `yourdomain.com` and `www.yourdomain.com` if necessary).
Note the assigned CloudFront distribution domain, which should look similar to `vtrlj8ubh2k69.cloudfront.net`. Once everything is set up, the website should appear under that domain.
A few points I found noteworthy:
- Set the “Origin” of the CloudFront distribution to the S3 website endpoint `example.com.s3-website.us-east-1.amazonaws.com`, and not the bucket name `example.com.s3.amazonaws.com` (btw `s3_website` does this correctly). Note that when setting up, the drop-down menu offers only the bucket name to be picked rather than the correct endpoint; so don’t use the drop-down menu; type it in yourself.1

Update A records in Route 53, and update the `s3_website` configs:
- In Route 53, point the A record aliases for `example.com` and `www.example.com` to the CloudFront distribution domain (something like `vtrlj8ubh2k69.cloudfront.net`).
- In `s3_website`, check or set the `cloudfront_distribution_id` property in `s3_website.yml` to the correct distribution ID (something like `SY9Q4DHIOUG7A`).

That’s it — the site should now be accessible under https://example.com and https://www.example.com. :tada: :tada: :tada:
It is not exactly clear to me what difference it makes to set the “Origin” to `example.com.s3.amazonaws.com` vs `example.com.s3-website.us-east-1.amazonaws.com`. However, it solved one of my issues. At first I set the “Origin” value to the bucket name, similar to `example.com.s3.amazonaws.com`, since that is what was offered by the drop-down menu in CloudFront. The landing page of the website was working just fine under the custom domain. However, when I navigated to subdirectories of my domain, similar to `example.com/about/`, the server did not seem to understand that it needed to look for the `index.html` within the `about` directory, and produced an error. Once I edited the “Origin” record to the S3 website endpoint, similar to `example.com.s3-website.us-east-1.amazonaws.com`, all pages of the website started to display perfectly fine. ↩
In this post I want to share my presentation slides (see below), some other resources, and some thoughts, in case any of that can be helpful to other deep learning beginners.1
Neural networks (NNs) and deep learning (DL, also deep NNs, or DNNs) are not my research area, but currently it is one of my main side-interests. (D)NNs are truly fascinating to somebody with substantial experience in statistics or the more conventional machine learning (like myself). Initially it seems counterintuitive how these extremely overparametrized models are even supposed to work, but then you fit those models, and their performance is so good that it seems to border on magic. :crystal_ball:
These HTML slides were created with the excellent `reveal.js`.
I was able to give the above presentations because I did a good amount of self-study on NNs and DL in my free time. Here are some of the resources that I have used, and that I highly recommend:
These resources have worked very well for me. My background is mostly academic, and includes experience in statistical modeling, (non-deep) machine learning, an all-but-dissertation status in a math PhD program, and some domain knowledge in medical imaging. While some of that background helps, none of it is really essential. Though some math is definitely needed, it does not need to be at a PhD level. Medical or biological knowledge helps only if those are the DL applications you seek out (which I do). Understanding some basic machine learning and data science practices certainly helps, but the relevant material is covered in all DL courses that I have tried. What helps immensely in any case, however, is proficiency with git, GitHub, and Linux, as well as general programming and data processing skills.
I hope that still being close to the beginning of my DL journey makes me in some way more helpful to the absolute beginner (which I too was just a few months ago)… Maybe right now I have some perspective that may get lost should I become a DL expert… ↩
AUC is the area under the ROC curve, where ROC stands for receiver operating characteristic; that is, AUC is simply the area between the ROC curve and the x-axis. So, to understand AUC we first need to look at the concept of an ROC curve.
Consider:
Then:
Here is some R code for clarification:
# load some data, fit a logistic regression classifier
data(iris)
versicolor_virginica <- iris[iris$Species != "setosa", ]
# drop the unused "setosa" factor level, so that glm() treats the response
# as binary with versicolor as failure (0) and virginica as success (1)
versicolor_virginica$Species <- droplevels(versicolor_virginica$Species)
logistic_reg_fit <- glm(Species ~ Sepal.Width + Sepal.Length,
                        data = versicolor_virginica,
                        family = "binomial")
y <- ifelse(versicolor_virginica$Species == "versicolor", 0, 1)
y_pred <- logistic_reg_fit$fitted.values
# get TPR and FPR at different values of the decision threshold
threshold <- seq(0, 1, length = 100)
FPR <- sapply(threshold,
function(thresh) {
sum(y_pred >= thresh & y != 1) / sum(y != 1)
})
TPR <- sapply(threshold,
function(thresh) {
sum(y_pred >= thresh & y == 1) / sum(y == 1)
})
# plot an ROC curve
plot(FPR, TPR)
lines(FPR, TPR)
A rather ugly ROC curve emerges:
The area under the ROC curve, or AUC, seems like a nice heuristic to evaluate and compare the overall performance of classification models independent of the exact decision threshold chosen. \(\mathrm{AUC} = 1.0\) signifies perfect classification accuracy, while \(\mathrm{AUC} = 0.5\) corresponds to making classification decisions via coin toss (or rather a continuous coin that outputs values in \([0,1]\)…). Most classification algorithms will result in an AUC somewhere between these two values. But there’s more to it.
As above, assume that we are looking at a dataset where we want to distinguish data points of type 0 from those of type 1. Consider a classification algorithm that assigns to a random observation \(\mathbf{x}\in\mathbb{R}^p\) a score (or probability) \(\hat{p}(\mathbf{x}) \in [0,1]\) signifying membership in class 1. If the final classification between class 1 and class 0 is determined by a decision threshold \(t\in[0, 1]\), then the true positive rate (a.k.a. sensitivity or recall) can be written as a conditional probability
\[T(t) := P[\hat{p}(\mathbf{x}) > t \,|\, \mathbf{x}\,\text{belongs to class 1}],\]and the false positive rate (or 1 - specificity) can be written as
\[F(t) := P[\hat{p}(\mathbf{x}) > t \,|\, \mathbf{x}\,\text{does not belong to class 1}].\]For brevity of notation let’s say \(y(\mathbf{x}) = 1\) instead of “\(\mathbf{x}\) belongs to class 1”, and \(y(\mathbf{x})=0\) instead of “\(\mathbf{x}\) doesn’t belong to class 1”.
The ROC curve simply plots \(T(t)\) against \(F(t)\) while varying \(t\) from 0 to 1. Thus, if we view \(T\) as a function of \(F\), the AUC can be rewritten as follows.
\[\begin{eqnarray} \mathrm{AUC} &=& \int_0^1 T(F_0) \,\mathrm{d}F_0 \nonumber \\ &=& \int_0^1 P[\hat{p}(\mathbf{x}) > F^{-1}(F_0) \,|\, y(\mathbf{x}) = 1] \,\mathrm{d}F_0 \nonumber \\ &=& \int_1^0 P[\hat{p}(\mathbf{x}) > F^{-1}(F(t)) \,|\, y(\mathbf{x}) = 1] \cdot \frac{\partial F(t)}{\partial t} \,\mathrm{d}t \nonumber \\ &=& \int_0^1 P[\hat{p}(\mathbf{x}) > t \,|\, y(\mathbf{x}) = 1] \cdot P[\hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x^{\prime}}) = 0] \,\mathrm{d}t \nonumber \\ &=& \int_0^1 P[\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}}) \,\&\, \hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x}) = 1 \,\&\, y(\mathbf{x^{\prime}}) = 0] \,\mathrm{d}t \nonumber \\ &=& P[\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}}) \,|\, y(\mathbf{x}) = 1 \,\&\, y(\mathbf{x^{\prime}}) = 0], \nonumber \end{eqnarray}\]where we used the fact that the probability density function
\[P[\hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x^{\prime}}) = 0] =: f(t)\]is the derivative with respect to \(t\) of the cumulative distribution function
\[P[\hat{p}(\mathbf{x^{\prime}}) \leq t \,|\, y(\mathbf{x^{\prime}}) = 0] = 1-F(t).\]So, given a randomly chosen observation \(\mathbf{x}\) belonging to class 1, and a randomly chosen observation \(\mathbf{x^{\prime}}\) belonging to class 0, the AUC is the probability that the evaluated classification algorithm will assign a higher score to \(\mathbf{x}\) than to \(\mathbf{x^{\prime}}\), i.e., the conditional probability of \(\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}})\).
An alternative purely geometric proof can be found in the Scatterplot Smoothers blog.
In other words, if the classification algorithm distinguishes “positive” and “negative” examples (e.g., disease status), then
AUC is the probability of correct ranking of a random “positive”-“negative” pair.
The above probabilistic interpretation suggests a simple formula to compute AUC on a finite sample:
Among all “positive”-“negative” pairs in the dataset compute the proportion of those which are ranked correctly by the evaluated classification algorithm.
Here is an inefficient implementation using results from the above logistic regression example:
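# count correctly ranked "positive"-"negative" pairs, giving half credit to ties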
s <- 0
for (i in which(y == 1)) {
for (j in which(y == 0)) {
if (y_pred[i] > y_pred[j]) {
s <- s + 1
} else if (y_pred[i] == y_pred[j]) {
s <- s + 0.5
}
}
}
s <- s / (sum(y == 1) * sum(y == 0))
s
# [1] 0.7918
The proportion of correctly ranked “positive”-“negative” pairs yields estimated \(\mathrm{AUC} = 0.7918\).
We can compare this value to the area under the ROC curve computed with the trapezoidal rule.
s <- 0
for (i in 1:(length(FPR) - 1)) {
dFPR <- abs(FPR[i+1] - FPR[i])
s <- s + 0.5 * dFPR * (TPR[i+1] + TPR[i])
}
s
# [1] 0.7922
The trapezoidal rule yields an estimated \(\mathrm{AUC} = 0.7922\). The difference of \(0.0004\) can be explained by the fact that we evaluated the ROC curve at only 100 threshold values.
Since there is a minor disagreement, let’s use some standard R package to compute AUC.
library(ROCR)
pred <- prediction(y_pred, y)
auc <- as.numeric(performance(pred, measure = "auc")@y.values)
auc
# [1] 0.7918
Same as the proportion of correctly ranked pairs! :grin:
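As a side note, the small trapezoidal-rule gap above is indeed just a discretization artifact. Here is a quick, unpolished check (my own addition, reusing `y`, `y_pred`, and the same FPR/TPR definitions as above): if we place a threshold at every observed score, so that no vertex of the empirical ROC curve is missed, the trapezoidal rule should essentially reproduce the pair-counting estimate.
# evaluate the ROC curve at a threshold equal to every observed score
# (plus the endpoints 0 and 1), so that no step of the curve is skipped
threshold_fine <- c(0, sort(unique(y_pred)), 1)
FPR_fine <- sapply(threshold_fine,
                   function(thresh) sum(y_pred >= thresh & y != 1) / sum(y != 1))
TPR_fine <- sapply(threshold_fine,
                   function(thresh) sum(y_pred >= thresh & y == 1) / sum(y == 1))
# trapezoidal rule over the refined set of points
s <- 0
for (i in 1:(length(FPR_fine) - 1)) {
  s <- s + 0.5 * abs(FPR_fine[i+1] - FPR_fine[i]) * (TPR_fine[i+1] + TPR_fine[i])
}
s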
By analysing the probabilistic meaning of AUC, we not only got a practically relevant interpretation of this classification performance metric, but we also obtained a simple formula to estimate the AUC of a trained classification algorithm. Well, it turns out that the proportion of correctly ranked “positive”-“negative” pairs, used as an estimate of the AUC, is equivalent to the Wilcoxon-Mann-Whitney test statistic (normalized by the number of pairs). This fact can also be easily demonstrated in a couple of lines of R code.
y_is_1 <- which(y == 1)
y_is_0 <- which(y == 0)
n_pairs <- length(y_is_1) * length(y_is_0)
WMW_test <- wilcox.test(y_pred[y_is_1], y_pred[y_is_0])
WMW_test$statistic / n_pairs
# W
# 0.7918
Same answer!
Besides, and arguably more importantly, as a measure of classification performance the AUC has many advantages compared to other “single number” performance measures:
I started this exploration by downloading the full text data (excluding images) for all patents that were assigned by the USPTO within the year 2017 up to the time of writing (Patent Grant Full Text Data/XML for the year 2017 through the week of Sept 12 from the USPTO Bulk Data Storage System).
In this blog post I address questions such as: How many ML and AI related patents were granted? Who are the most prolific inventors? The most frequent patent assignees? Where are inventions made? And when? Is the number of ML and AI related patents increasing over time? How long does it take to obtain a patent for a ML or AI related invention? Is the patent examination time shorter for big tech companies? Etc.
First, I curated a patent full text dataset consisting of “machine learning and AI related” patents. I am not just looking for instances where actual machine learning or AI algorithms were patented; I am looking for inventions which are related to ML or AI in any/some capacity. That is, I am interested in patents where machine learning, data mining, predictive modeling, or AI is utilized as a part of the invention in any way whatsoever. The subset of relevant patents was determined by a keyword search as specified by the following definition.
Definition: For the purposes of this blog post, a machine learning or AI related patent is a patent that contains at least one of the keywords “machine learning”, “deep learning”, “neural network”, “artificial intelligence”, “statistical learning”, “data mining”, or “predictive model” in its invention title, description, or claims text (while of course accounting for capitalization, pluralization, etc.).1
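To illustrate the kind of keyword matching I mean, here is a minimal, purely hypothetical R sketch (the `patent_texts` vector and its toy entries are made up for illustration; the actual analysis ran over the parsed USPTO full text XML):
# toy stand-in for the corpus: one string per patent, concatenating the
# invention title, description, and claims text
patent_texts <- c(
  "A system employing a convolutional neural network for image recognition.",
  "A mechanical fastening device comprising a bolt and a matching nut.",
  "Predictive models are trained via machine learning on sensor data."
)
keywords <- c("machine learning", "deep learning", "neural network",
              "artificial intelligence", "statistical learning",
              "data mining", "predictive model")
# case-insensitive matching; e.g. "predictive model" also covers the plural
match_matrix <- sapply(keywords,
                       function(kw) grepl(kw, patent_texts, ignore.case = TRUE))
# a patent counts as ML/AI related if at least one keyword matches
is_ml_ai_related <- rowSums(match_matrix) > 0
sum(is_ml_ai_related)  # number of selected patents (2 in this toy example)
colSums(match_matrix)  # number of patents matched by each keyword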
With this keyword matching approach a total of 6665 patents were selected. The bar graph below shows how many times each keyword was matched.
Interestingly the term “neural network” is even more common than the more general terms “machine learning” and “artificial intelligence”.
Here are three (randomly chosen) patents from the resulting dataset. For each, the invention title, the patent assignee, and one instance of the keyword match within the patent text are printed.
And here are three examples of (randomly picked) patents that contain the relevant keywords directly in their invention title.
The first question I would like to address is who files most of the machine learning and AI related patents.
Each patent specifies one or several inventors, i.e., the individuals who made the patented invention, and a patent assignee which is typically the inventors’ employer company that holds the rights to the patent. The following bar graph visualizes the top 20 most prolific inventors and the top 20 most frequent patent assignees among the analyzed ML and AI related patents.
It isn’t surprising to see this list of companies. The likes of IBM, Google, Amazon, Microsoft, Samsung, and AT&T rule the machine learning and AI patent space. I have to admit that I don’t recognize any of the inventors’ names (but it might just be me not being familiar enough with the ML and AI community).
There are a number of interesting follow-up questions which for now I leave unanswered (hard to answer without additional data):
Even though the examined patents were filed in the US, some of the inventions may have been made outside of the US. In fact, the data includes specific geographic locations for each patent, so I can map in which cities within the US and the world inventors are most active. The following figure is based on where the inventors are from, and shows the most active spots. Each point corresponds to the total number of inventions made at that location (though note that the color axis is a log10 scale, and so is the point size).
The results aren’t that surprising. However, we see that most (ML and AI related) inventions patented with the USPTO were made in the US. I wonder if inventors in other countries prefer to file patents in their home countries’ patent offices rather than in the US.
Alternatively, we can also map the number of patents per inventors’ origin countries.
Sadly, entire groups of countries (e.g., almost the entire African continent) seem to be excluded from the USPTO’s patent game, at least with respect to machine learning and AI related inventions. Whether this is due to a lack of access, infrastructure, education, political treaties, or something else is an intriguing question.
Each patent has a date of filing and an assignment date attached to it. Based on the provided dates one can try to address questions such as: When were these patents filed? Is the number of ML and AI related patents increasing over time? How long did it usually take from patent filing to assignment? And so on.
For the set of ML and AI related patents that were granted between Jan 3 and Sept 12 2017 the following figure depicts…
The patent publication dates plot suggests that the number of ML and AI related patents increased slightly throughout the year 2017. The patent application dates plot suggests that the patent examination phase for the considered patents takes about 2.5 years. In fact the average time from patent filing to approval is 2.83 years with a standard deviation of 1.72 years in this dataset (that is, among the considered ML and AI related patents granted in 2017). However, the range is quite wide, spanning 0.24-12.57 years.
The distribution of the duration from patent filing date to approval is depicted by the following figure.
So, what are some of the inventions that took longest to get approved? Here are the five patents with the longest examination periods:
Each of these patents is related to either gaming or biotech. I wonder if that’s a coincidence…
We can also look at the five patents with the shortest approval time:
Interestingly, the patent approved in the shortest amount of time among all 6665 analysed (ML and AI related) patents is some smart home thingy from Andy Rubin’s hyped up company Essential.
The following figure separates the patent approval times according to the respective assignee company, considering several of the most well known tech giants.
Indeed some big tech companies, such as AT&T or Samsung, manage to push their patent applications through the USPTO process much faster than most other companies. However, there are other tech giants, such as Microsoft, which on average take longer to get their patent applications approved than even the companies in the category “Other”. Also noteworthy is the fact that big tech companies tend to have fewer outliers regarding the patent examination process duration than companies in the category “Other”.
Of course it would also be interesting to categorize all patent assignees into categories like “Startup”, “Big Tech”, “University”, or “Government”, and compare the typical duration of the patent examination process between such groups. However, it’s not clear to me how to establish such categories without collecting additional data on each patent assignee, which at this point I don’t have time for :stuck_out_tongue:.
There is definitely a lot of promise in the USPTO full text patent data. Here I have barely scratched the surface, and I hope that I will find the time to play around with these data some more. The end goal is, of course, to replace the patent examiner with an AI trained on historical patent data. :stuck_out_tongue_closed_eyes:
This work (blog post and included figures) is licensed under a Creative Commons Attribution 4.0 International License.
There are two main aspects to my reasoning as to this particular choice of keywords. (1) I wanted to keep the list relatively short in order to have a more specific search, and (2) I tried to avoid keywords which may generate false positives (e.g., the term “AI” would match all sorts of codes present in the patent text, such as “123456789 AI N1”). In no way am I claiming that this is a perfect list of keywords to identify ML and AI related patents, but I think that it’s definitely a good start. ↩
The 1983 paper presents a simulation with only 10 repetitions. But in the present day it is very easy (both in terms of computational time and implementation difficulty) to reproduce the simulation with many more repetitions (even my phone’s computational power is probably higher than that of the high performance computer that Freedman used in the 80’s). We also have more convenient ways to visualize the results than in the 80’s. So let’s do it.
I am going to use a few R packages (most notably the package `broom` to fit and analyze many, many linear models in a single step).
library(dplyr)
library(broom)
library(ggplot2)
library(tidyr)
set.seed(20170605)
The considered data structure is the following:
Instead of Freedman’s 10 repetitions we perform 1000. So let’s generate all 1000 datasets at once as stacked in a large data frame:
n_row <- 100
# n_col is set to 51 because the 51st column will serve as y
n_col <- 51
n_rep <- 1000
# a stack of matrices for all n_rep repetitions is generated...
X <- matrix(rnorm(n_rep * n_row * n_col), n_rep * n_row, n_col)
colnames(X) <- paste0("X", 1:n_col)
# ...and then transformed to a data frame with a repetition number column
X_df <- as_data_frame(X) %>% mutate(repetition = rep(1:n_rep, each = n_row))
The data are analyzed with two successive linear models, the second (illegally) reusing the results of the first.
The first model fit. After the 1000 ordinary linear models are fit to the data, we record for each of them the R squared, the F test statistic with corresponding p-value, and the t test statistics with p-values for the individual regression coefficients.
Using functions from the `broom` package we can fit and extract information from all 1000 models at once.
# all models can be fit at once...
models_df = X_df %>% group_by(repetition) %>%
do(full_model = lm(X51 ~ . + 0, data = select(., -repetition)))
# ...then the results are extracted
model_coefs <- tidy(models_df, full_model)
model_statistics <- glance(models_df, full_model)
model_statistics$data_reuse <- rep(FALSE, nrow(model_statistics))
The second model fit. For each one of the first 1000 models, the corresponding second linear model is fit using only those variables which have p-values significant at the 25% level in the first model. That is, the second model uses the first model for variable selection.
This gives us 1000 reduced re-fitted linear models. We record the same model statistics (R squared, F, and t tests) as for the first group of models.
reduced_models <- list()
for (i in 1:n_rep) {
full_data <- X_df %>% filter(repetition == i)
significant_coefs <- model_coefs %>%
filter(repetition == i & p.value < 0.25)
reduced_data <- select(full_data,
one_of(unlist(significant_coefs[ , "term"])), X51)
reduced_models[[i]] <- lm(X51 ~ . + 0, data = reduced_data)
tmp_df <- glance(reduced_models[[i]])
tmp_df$repetition <- i
tmp_df$data_reuse <- TRUE
model_statistics <- bind_rows(model_statistics, tmp_df)
}
Finally let’s look at the results. The figure shows the distributions of the considered model statistics across the 1000 repetitions for model fits with and without data reuse (the code producing this figure is given at the bottom of this post):
Well, the R squared statistic shows a moderate change between the models with and without data reuse (an average of 0.3093018 with data reuse vs. 0.5001641 without). The F test statistic, however, grows immensely to an average of 3.2806118 (from 1.0480097), and the p-values fall after data reuse to an average of 0.0112216 (from 0.5017696), below the widely used (but arbitrary) 5% significance level.
Obviously the model with data reuse is highly misleading here, because in fact there are absolutely no relationships between the predictor variables and the response (as per the data generation procedure).
In fact, Freedman derived asymptotic estimates for the magnitudes of change in the considered model statistics, and they indeed match the above observations. However I’m too lazy to summarize them here. So I refer to the primary source.
This code generates the above figure:
model_statistics %>%
select(r.squared, p.value, statistic, repetition, data_reuse) %>%
mutate(data_reuse = ifelse(data_reuse, "With Data Reuse", "Without Data Reuse")) %>%
mutate(data_reuse = factor(data_reuse, levels = c("Without Data Reuse", "With Data Reuse"),
ordered = TRUE)) %>%
rename("F-statistic" = statistic, "p-value" = p.value, "R squared" = r.squared) %>%
gather(stat, value, -repetition, -data_reuse) %>%
ggplot(aes(x = stat, y = value)) +
geom_violin(aes(fill = stat), scale = "width", draw_quantiles = c(0.25, 0.5, 0.75)) +
geom_hline(yintercept = 0.05, linetype = 2, size = 0.3) +
facet_wrap(~data_reuse) +
theme_linedraw() +
scale_y_continuous(breaks = c(0.05, 2, 4, 6)) +
ggtitle(paste(n_rep, "repetitions of an LM fit with", n_row, "rows,", n_col, "columns"))
A quick online search revealed at least three R packages for benchmarking R code (`rbenchmark`, `microbenchmark`, and `tictoc`). Additionally, base R provides at least two methods to measure the running time of R code (`Sys.time` and `system.time`). In the following I briefly go through the syntax of using each of the five options, and present my conclusions at the end.
The run time of a chunk of code can be measured by taking the difference between the time at the start and at the end of the code chunk. Simple yet flexible :sunglasses:.
sleep_for_a_minute <- function() { Sys.sleep(60) }
start_time <- Sys.time()
sleep_for_a_minute()
end_time <- Sys.time()
end_time - start_time
# Time difference of 1.000327 mins
The functions `tic` and `toc` are used in the same manner for benchmarking as the just demonstrated `Sys.time`. However, `tictoc` adds a lot more convenience to the whole procedure.
The most recent development version of `tictoc` can be installed from github:
devtools::install_github("jabiru/tictoc")
One can time a single code chunk:
library(tictoc)
tic("sleeping")
print("falling asleep...")
sleep_for_a_minute()
print("...waking up")
toc()
# [1] "falling asleep..."
# [1] "...waking up"
# sleeping: 60.026 sec elapsed
Or nest multiple timers:
tic("total")
tic("data generation")
X <- matrix(rnorm(50000*1000), 50000, 1000)
b <- sample(1:1000, 1000)
y <- runif(1) + X %*% b + rnorm(50000)
toc()
tic("model fitting")
model <- lm(y ~ X)
toc()
toc()
# data generation: 3.792 sec elapsed
# model fitting: 39.278 sec elapsed
# total: 43.071 sec elapsed
One can time the evaluation of an R expression using `system.time`. For example, we can use it to measure the execution time of the function `sleep_for_a_minute` (defined above) as follows.
system.time({ sleep_for_a_minute() })
# user system elapsed
# 0.004 0.000 60.051
But what exactly are the reported times `user`, `system`, and `elapsed`? :confused:
Well, clearly `elapsed` is the wall clock time taken to execute the function `sleep_for_a_minute`, plus some benchmarking code wrapping it (that’s why it took slightly more than a minute to run, I guess).
As for the `user` and `system` times, William Dunlap has posted a great explanation to the r-help mailing list:
“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.
:grinning:
The documentation of the function `benchmark` from the `rbenchmark` R package describes it as “a simple wrapper around `system.time`”. However, it adds a lot of convenience compared to bare `system.time` calls. For example, it requires just one `benchmark` call to time multiple replications of multiple expressions, and the returned results are conveniently organized in a data frame.
I installed the development version of the `rbenchmark` package from github:
devtools::install_github("eddelbuettel/rbenchmark")
For example purposes, let’s compare the time required to compute linear regression coefficients using three alternative computational procedures: (1) the built-in `lm`, (2) the matrix pseudoinverse, and (3) directly solving the normal equations as a linear system.
library(rbenchmark)
benchmark("lm" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- lm(y ~ X + 0)$coef
},
"pseudoinverse" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- solve(t(X) %*% X) %*% t(X) %*% y
},
"linear system" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- solve(t(X) %*% X, t(X) %*% y)
},
replications = 1000,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
# test replications elapsed relative user.self sys.self
# 3 linear system 1000 0.167 1.000 0.208 0.240
# 1 lm 1000 0.930 5.569 0.952 0.212
# 2 pseudoinverse 1000 0.240 1.437 0.332 0.612
Here, the meaning of `elapsed`, `user.self`, and `sys.self` is the same as described above in the section about `system.time`, and `relative` is simply the time ratio relative to the fastest test. Interestingly, `lm` is by far the slowest here.
The most recent development version of `microbenchmark` can be installed from github:
devtools::install_github("olafmersmann/microbenchmarkCore")
devtools::install_github("olafmersmann/microbenchmark")
Much like `benchmark` from the package `rbenchmark`, the function `microbenchmark` can be used to compare running times of multiple R code chunks. But it offers a great deal of convenience and additional functionality.
I find that one particularly nice feature of `microbenchmark` is the ability to automatically check the results of the benchmarked expressions with a user-specified function. This is demonstrated below, where we again compare three methods of computing the coefficient vector of a linear model.
library(microbenchmark)
set.seed(2017)
n <- 10000
p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- X %*% rnorm(p) + rnorm(n)  # one standard normal noise term per observation
check_for_equal_coefs <- function(values) {
tol <- 1e-12
max_error <- max(c(abs(values[[1]] - values[[2]]),
abs(values[[2]] - values[[3]]),
abs(values[[1]] - values[[3]])))
max_error < tol
}
mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef },
"pseudoinverse" = {
b <- solve(t(X) %*% X) %*% t(X) %*% y
},
"linear system" = {
b <- solve(t(X) %*% X, t(X) %*% y)
},
check = check_for_equal_coefs)
mbm
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lm 96.12717 124.43298 150.72674 135.12729 188.32154 236.4910 100 c
# pseudoinverse 26.61816 28.81151 53.32246 30.69587 80.61303 145.0489 100 b
# linear system 16.70331 18.58778 35.14599 19.48467 22.69537 138.6660 100 a
We used the function argument `check` to check for equality (up to a maximal error of `1e-12`) of the results returned by the three methods. If the results weren’t equal, `microbenchmark` would return an error message.
Another great feature is the integration with `ggplot2` for plotting `microbenchmark` results.
library(ggplot2)
autoplot(mbm)
The given demonstration of the different benchmarking functions is surely not exhaustive. Nevertheless, I made some conclusions for my personal benchmarking needs:
- The `Sys.time` approach as well as the `tictoc` package can be used for timing (potentially nested) steps of a complicated algorithm (that’s often my use case). However, `tictoc` is more convenient, and (most importantly) foolproof.
- `microbenchmark` returns other types of measurements than `benchmark`, and I think that in most situations the `microbenchmark` measurements are of a higher practical significance :stuck_out_tongue:.
- `microbenchmark` is the only benchmarking package that has visualizations built in :+1:.

For these reasons I will go with `microbenchmark` and `tictoc`. :bowtie: