Skip to main content Link Search Menu Expand Document (external link)

Latest posts in English

See also all posts ordered by category or date.

LLMs and RAG in education

This summer, I was in Palermo where the Educational Data Mining, Learning at Scale & AI in Education conferences were held.

I presented Anav Agrawal’s RAG tool, which we use in our course.

Anav Agrawal, Jill-Jênn Vie. AlgoAce: Retrieval-Augmented Generation for Assistance in Competitive Programming. CSEDM 2025 - 9th Educational Data Mining in Computer Science Education Workshop, Jul 2025, Palermo, Italy. [paper] [code]

I’ll tell you about what some colleagues from Cornell University and Berkeley University presented because it’s truly impressive.

Cornell

The professor (Rene Kizilcec) has 270 students.
He uploads his PDFs to some website hita.ai developed by former students.
Students can ask questions anonymously (it sources the answers to the pages of the course slides or to specific points in a course video) and the professor sees the anonymous conversations (this part is the most valuable).
He has automatic analytics on the most frequently asked questions.
The (6k?) students have asked 64k questions on the platform (local LLM) in a few years.
The professor uses it to verify that students are actually reading the articles he assigns (students have to debate a question with an AI that has read the article).

He told the anecdote that his colleague noticed that 5 students were cheating and told them: “Wow, I’m really impressed by the richness of your reasoning, how about you come and discuss it in my office?” and then the students said “Nooo, sorry, we cheated.”

Berkeley

Narges Norouzi has 900-1700 students per cohort in their first CS1 course, required for all students in the computer science and engineering departments.
Students are allowed to use an autograder for their homework but not for graded labs or projects https://sp25.datastructur.es/policies/ see also, from a different u, https://eecs-autograder.github.io/autograder.io
They reuse the (1 million) code submissions from students over the previous (7) years to help students debug, see what types of hints are useful, which directly feeds into their (the professors’) research. They have 105,000 queries from 2,000 students from the 2023-2024 academic year to their local LLM.
They are not allowed to conduct randomized controlled trials; their ethics committee (IRB) prohibits it, due to the impactness of this course and its scale.
They show that students complete homework faster with AI (unsurprisingly) but not practical assignments faster (they even have some negative results on practical assignments).
They had two papers nominated for best paper at AIED 2025 (below), one of which uses the curriculum to automatically determine students’ progress in the course (knowledge tracing) and make appropriate recommendations.

Modeling Student Knowledge Progression in Intelligent Tutoring Interactions Abigail O’Neill, Kanav Mittal, Hanna Schlegel, Gireeja Ranade and Narges Norouzi https://drive.google.com/file/d/1As0EAEXOeyTnDqaMCOQqYY__BaN_d9gq/view?usp=sharing

Askademia: A Real-Time AI System for Automatic Responses to Student Questions Gaurav Tyagi, Meenakshi Mittal, Azalea Bailey, Gireeja Ranade and Narges Norouzi https://drive.google.com/file/d/1TOAuYZutWz8IWmWj_NLneL8atbUEcoVK/view?usp=drive_link

Longest increasing subsequence

If we want to find the longest (strictly) increasing subsequence of an array $a$ of size $n$, of course we can assume that $dp[i]$ is the answer for the first $i$ elements and then, as a LIS of size $n$ contains a LIS of size $n - 1$:

\[dp[i] = \max_{\substack{j < i\\ a_j < a_i}} dp[j] + 1\]

This gives a first algorithm in $O(n^2)$. Can we do better?

  • We do not need to remember all LIS of size $\ell$. We just need to remember the smallest end for a LIS of size $\ell$, called $t[\ell]$.
  • The list of smallest ends happens to be increasing (but not necessarily a subsequence). Each $t[\ell]$ forces $t[\ell - 1]$ to be lower.

In this case, we can binary search for the opt LIS to which we can add one element. This gives $O(n \log n)$. I think this is an example of Divide and Conquer DP.

I was wondering what was the $O(n \log n)$ that relies on a segment tree. A blog post on Codeforces had the answer (of course!). We need to define a new dp:
$dp[i, v]$ is the LIS using first $i$ elements finishing in $v$. Then we just need to do a min query over $dp[i, 1:v - 1]$.

(A notebook is on the way.)

Approximations of the Euclidean metric traveling salesman problem

Back in the good ol’ agrégation days, I remember I used as développement1 a nice 2-approx algorithm for the traveling salesman problem where the weights on the edges are given by the Euclidean distance between nodes.

  1. This must mean nothing to non-French people but anyway. 

Count particular rectangles in a matrix

Given a matrix with distinct values, a rectangle consists of 4 cells at the intersection of two distinct rows and two distinct columns. It is good if the largest 2 values of the 4, are on the same row or the same column. Count the number of good rectangles in linear time, in the size of the matrix.

Suffix Array

Given a string s, sort all cyclic shifts of s. Formally produce a table p such that p[j]=i if s[i:]+s[:i] has rank j among all cyclic shifts.

PC Trees

A data structure representing all permutations satisfying constraints of the form: for a given set $S\subseteq\{0,1,\ldots,n-1\}$ the elements of the permutations on $\{0,1,\ldots,n-1\}$ have to be consecutive in circular manner.

Largest rectangle under an histogram

You are given an histogram and want to identify the area of the largest rectangle that fits under the histogram.

AI and LLM in education

In my new role as scientific advisor at the French Ministry of Education, I have recently been looking for meta-analyses and RCTs about impact of AI (and LLMs) in education.

Cover trees

Cover a tree with paths or caterpillars.

SWERC 2022 Practice Session - Bloggers

Abstract: Maintain two tables $t_0$, $t_1$. Updates: given $i,j,c$ increment $t_c$ between indices $i$ and $j$. After every update output $\sum_k \max{t_0[k],t_1[k]}$.