mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Puzzles (https://www.mersenneforum.org/forumdisplay.php?f=18)

 henryzz 2021-06-02 12:44

Matching problem

Suppose you have n letters and n addressed envelopes that you need to match. You can find matches by pairing all letters with envelopes and a friend will tell you how many matches there are(not which).

What is the expected number of times you need to pair the letters with envelopes to get all matching?

What is the maximum number of pairings you could need to find all matches?

What would your strategy be to minimize the above numbers? Do they have the same solutions?

For easier discussion let's use n=10 as an example although a general formula would be great.

 Viliam Furik 2021-06-02 14:32

If we assume the envelopes are coded 1 to n, and letters are also coded 1 to n, but the pairings of these numbers are unknown, the maximum number of tries you need should be n! because you can simply try all the possible arrangements of the letters while fixing the envelopes.

When you are certain that some letter is in its envelope, leave it there and never move it until you're done.

That should do it. I am not sure if it's the best way to do it, but it should be a fairly good way to do it.

As for the expected number of times you bother your friend, I don't know, and it's probable that someone else here will find a better algorithm or the number before I do either.

 Viliam Furik 2021-06-02 16:38

I will demonstrate what I think is the worst-case scenario for n=5, using a bit of strategy in choosing the switches, such that I will try to not place the same letter twice to where I know it doesn't belong.

([x] denotes that letter is for sure in its correct envelope)

Begin
Step 1 -> 2, 4, 5, 1, 3
Friend -> 0 (call 1)
Step 2 -> 2, 3, 5, 1, 4 (3-4 switch)
Friend -> 0 (call 2)
Step 2 -> 5, 3, 2, 1, 4 (2-5 switch)
Friend -> 0 (call 3)
Step 2 -> 5, 1, 2, 3, 4 (1-3 switch)
Friend -> 0 (call 4)
Step 2 -> 5, 1, 4, 3, 2 (2-4 switch)
Friend -> 0 (call 5)
Step 2 -> 3, 1, 4, 5, 2 (3-5 switch)
Friend -> 0 (call 6)
Step 2 -> 3, 1, 4, 2, 5 (2-5 switch)
Friend -> 1 (call 7)
Step 2 -> 3, 5, 4, 2, 1 (1-5 switch)
Friend -> 0 (call 8)
Step 2 -> 3, 1, 4, 2, [5] (rollback)
Step 2 -> 3, 4, 1, 2, [5] (1-4 switch)
Friend -> 1 (call 9)
Step 2 -> 1, 4, 3, 2, [5] (1-3 switch)
Friend -> 3 (call 10)
Step 2 -> [1], 2, [3], 4, [5] (2-4 switch)
At this point, you don't even need to call your friend, because you are certain you got it correct - If your correctness number is n-2, you know you just have to switch the two remaining letters you didn't yet mark as correct.

That is 10 calls for n=5 in the worst case. If someone finds a mistake in my example, please, point it out.

 Walter 2021-06-02 19:50

4 Attachment(s)
My approach would be to formulate the problem as a 0-1 linear program.

We will have variables [TEX]$x_{ij} \in \{0,1\}$[/TEX] which are 1 if letter [TEX]$i$[/TEX] matches with envelope [TEX]$j$[/TEX] and 0, otherwise.

Clearly, every assignment of letters to envelopes would fulfill the following constraints:

[TEX]$\sum_{i\in I} x_{ij} = 1 \: \forall j \in J$[/TEX]
[TEX]$\sum_{j\in J} x_{ij} = 1 \: \forall i \in I$[/TEX]

If we solve this problem we will get a unique assignment and we can receive the number of correct guesses, which we can denote as [TEX]$c$[/TEX]. With that, we can add one new constraint every time this hint is given. Assuming our previous solution is given as a set of [TEX]$ij$[/TEX] pairs, denoted as [TEX]$S$[/TEX], this additional constraint can be expressed as follows:

[TEX]$\sum_{ij \in S} x_{ij} == c$[/TEX]

This constraint essentially says that the correct solution includes [TEX]$c$[/TEX] of the matchings in [TEX]$S$[/TEX].

I believe this is the strategy that uses the given information the most efficient way possible and minimizes the number of attempts needed to uncover all of the correct pairs. Unlike using only the information from the last (or maybe last two "rounds") it combines all knowledge it retrieves in this process.

It is pretty hard to calculate the expected number of "rounds", but a simulation can help to quantify it. I have implemented this quickly in Python to get an idea. I have added a sample run to illustrate how this method choses the pairs in each round. It is clearly not what a human would do or in any way similar to the previous answer, but seems to be quite efficient.

I'm attaching a graph showing the frequencies (+cumulative frequencies) of the problem being "solved" after a given number of rounds. It can be seen that, if we have 10 letters and 10 envelopes, the expected number of iterations seems to be somewhere around 10. In my experiment (2000 random experiments), the highest number of tries I needed was 15 (just once), but the most common number seems to be 10 (473 times out of 2000 experiments).

I am also attaching the Python code (simulate.txt - seems like I can't upload .py files) in case anyone wants to experiment with it and adjust it to chose a different strategy.

Out of curiosity, I have also done a quick experiment and have adjusted the number of letter/ envelope pairs to 20. Obviously, there are a lot more possible assignments. Nevertheless, the expected numbers seems to be somewhere around 27, which is still quite decent.

 Viliam Furik 2021-06-02 20:37

I am looking at the code, but I still can't get how the heck does it decide what's best to continue with... How do the constraints work and how do they affect the next "prediction"?

 Walter 2021-06-02 20:50

2 Attachment(s)
I like your approach, Viliam. I would be really curious to see how it performs in a Monte Carlo simulation, but it isn't as easy to implement as my method.

You can even improve the performance a bit if you keep track of pairings that are wrong. For example, in step 2c, you learn about two numbers that are wrong and it would never make sense to switch either of them back again. In the example that you provided, you are doing one move that could be avoided:

You already know (from the very first step) that 4 can't be in the second position, but you are doing a swap later in the process that brings it back into that position.

By the way, after 8 steps, the solution is actually clear if you combine all of the knowledge you have gained in the process. To show this, I have created a 5x5 grid representing all possible pairs of letters/ envelopes. Whenever your "friend" tells you that a solution has 0 correct choices, you cross out a possible solution.

After the first 8 steps, you got the 0-correct scenario 7 times, so the grid is quite full already (see attached screenshot). Based on this, you can already deduce the correct solution for letter 4 and 5 and don't need to continue trying with those. Furthermore, you have had one solution that gave you 1 correct position:

3, 1, 4, 2, 5

Since from all the 0-correct scenarios you know that the solution will be x, x, x, 4, 5, you can now deduce that 3, 1, 4, 2 is also incorrect. That crosses out one more square and tells you that the solution will be x, x, 3, 4, 5. Furthermore, for position 1, the candidates are either 1 or 4, but 4 is already taken. From that, the whole solution is know.

Nevertheless, a pretty good strategy, but you can combine it with additional logic reasoning to reduce the number of steps.

 uau 2021-06-02 21:14

[QUOTE=Viliam Furik;579808]I am looking at the code, but I still can't get how the heck does it decide what's best to continue with... How do the constraints work and how do they affect the next "prediction"?[/QUOTE]
I'm not familiar with docplex which seems to be what the code uses to generate the specific guesses, but it seems like it probably just chooses some arbitrary permutation which has the given number of matches with every previous try - if your first try got 4 matches and second 5, then on third try it'll try some permutation that has 4 choices in common with the first and 5 with second.

 Walter 2021-06-02 21:24

[QUOTE=Viliam Furik;579808]I am looking at the code, but I still can't get how the heck does it decide what's best to continue with... How do the constraints work and how do they affect the next "prediction"?[/QUOTE]

Viliam, not sure if you are familiar with linear programming/ constraint programming, but it might be a bit confusing at first and requires a slightly different way of thinking. This is certainly not the easiest example to get started with.

The whole idea behind my approach is to describe the correct solution in terms of the knowledge we gain in every step. The knowledge you gain in one single step isn't valuable by itself, but the knowledge you gain in multiple steps combined, tells you a lot about the solution. Think of it as [B]ruling out[/B] solutions, rather than moving towards a better solution at every step. At the beginning, the possible solution space is [TEX]$n!$[/TEX]. With every iteration and every constraint that we add, we cut off some of the solutions and therefore reduce the solution space, until we are left with only one single unique solution.

Let's look at a much easier example with n=3; let's assume the correct solution is [1, 2, 3]

If our first guess is [3,1,2], the number of correct guesses is 0.

Based on this, we can not only rule out [3,1,2], but also [3,2,1], [1,3,2] and [2,3,1] (because each of them has one position in common with a solution we know is completely wrong), and that's exactly what adding that constraint would do.

About choosing the next solution: as I said, all that the constraints do is to reduce the solution space from [TEX]$n!$[/TEX] to a smaller number of solutions. So after adding a constraint, we just let the MIP solver chose one of the remaining solutions in the solution space at random. The whole idea isn't necessarily to chose a better solution in every iteration, so "predict the next solution" isn't quite the right wording.

A common problem that is solved with a binary integer program is Sudoku. It is certainly much easier to understand than this. If you are curious about the method I am using here, this would probably be a good starting point: [url]https://www.mathworks.com/help/optim/ug/sudoku-puzzles-problem-based.html[/url]

 Walter 2021-06-02 21:26

[QUOTE=uau;579812]I'm not familiar with docplex which seems to be what the code uses to generate the specific guesses, but it seems like it probably just chooses some arbitrary permutation which has the given number of matches with every previous try - if your first try got 4 matches and second 5, then on third try it'll try some permutation that has 4 choices in common with the first and 5 with second.[/QUOTE]

Yes, you are completely right. And I realize this is a much more intuitive explanation than my attempt :bow:

Think of docplex as a black box. There are plenty of other libraries I could have used: CBC, Gurobi, OR-Tools, one of many SAT solvers etc.

 uau 2021-06-02 21:32

Trying something that could be the correct answer given all previous ones (which I believe is what Walter's code does) is not always the optimal answer, either for average or worst-case behavior (at least worst case for remaining tries in a particular situation, I don't show this to necessarily affect the global worst case):

Suppose that n = 1000000. Your first random guess gets 999998 matches right (quite lucky). Now, if you keep trying guesses that could be the correct answer, they can only swap one pair from your initial guess. Since you can only change two places at once, you'd expect to need around 146000 tries to even try changing one of the incorrect positions (you'd try changing twice that number of places, and 50% chance that both the incorrect positions are outside that set). Some kind of bisection strategy should be able to do much better than that.

 Walter 2021-06-02 22:38

[QUOTE=uau;579816]Trying something that could be the correct answer given all previous ones (which I believe is what Walter's code does) is not always the optimal answer, either for average or worst-case behavior (at least worst case for remaining tries in a particular situation, I don't show this to necessarily affect the global worst case):

Suppose that n = 1000000. Your first random guess gets 999998 matches right (quite lucky). Now, if you keep trying guesses that could be the correct answer, they can only swap one pair from your initial guess. Since you can only change two places at once, you'd expect to need around 146000 tries to even try changing one of the incorrect positions (you'd try changing twice that number of places, and 50% chance that both the incorrect positions are outside that set). Some kind of bisection strategy should be able to do much better than that.[/QUOTE]

Interesting case! Would definitely be curious to see thoughts on how to handle this kind of situation better. I am not so sure. isn't 9999998 correct matches just as bad as only 2 correct matches in terms of information value?

I did a few additional tests. Unfortunately it isn't practical to try my method with such a high n because the matrix gets too big, but if I use n = 100 and force exactly this kind of situation, I am on average only using about 30 tries. My feeling is telling me that that's pretty decent compared to what you would expect as an average runtime.

So, with n = 1M, it is true that my approach would probably use over 100k tries in this worst case scenario (it needs to "touch" each wrong position at least once). However, what would the average case scenario look like? My gut feeling is telling me it would be somewhere in that range too. But might completely off on that estimate. It is pretty hard to imagine this game with such a lot of letters and envelopes...

 chalsall 2021-06-02 22:44

[QUOTE=Walter;579823]It is pretty hard to imagine this game with such a lot of letters and envelopes...[/QUOTE]

It's trivial to imagine.

 R. Gerbicz 2021-06-02 22:49

[QUOTE=henryzz;579773]
What is the expected number of times you need to pair the letters with envelopes to get all matching?[/QUOTE]

You need at least (1-eps)*n guesses on average (for fixed eps>0 and for that large n).
Proof:
On the k-th level (after k guesses) there are at most (n+1)^k nodes, because each answer could be only 0..n. If we found the permutation, then that node is a leaf in the tree.

let L=argmax( h: 1+(n+1)+...+(n+1)^h<=n! )
ofcourse L=(1-eps)*n.

Then for the average expected number of guesses: F(n)>=((L+1)*(n!-sum(h=0,L,(n+1)^h))+sum(k=0,L,k*(n+1)^k))/n!
and from this F(n)>(1-eps2)*n.
(it could be possible to improve this estimation a very little).

My wild guess is that you can get the permutation with only O(n*log(n)) guesses,
as your ideas, for this you only need to halve the search space, what is not very unlikely
because we have much more, (n+1) different answers on the guesses.
[ we know: log2(n!)=O(n*log(n)) ].

ps. even n=10 could be a very hard problem.

 uau 2021-06-02 22:54

[QUOTE=Walter;579823]So, with n = 1M, it is true that my approach would probably use over 100k tries in this worst case scenario (it needs to "touch" each wrong position at least once). However, what would the average case scenario look like? My gut feeling is telling me it would be somewhere in that range too. But might completely off on that estimate. It is pretty hard to imagine this game with such a lot of letters and envelopes...[/QUOTE]

By "bisection strategy", I meant something like trying to rotate large blocks of guesses (if your permutation was 012345, rotating 3 first gets 120345). If the block was originally all correct, it's now all incorrect, so the number of matches drops by size of the rotated block. If one of the incorrect pair was there, it drops by less. This allows you to bisect the location of the relevant two entries, and solve the problem a lot faster.

The issue with trying only attempts that could be correct given previous ones is that while it may be optimal use of previous information to maximize the chance of solving the problem in one more try, it is not an optimal way to gain the maximal amount of new information from the try if it turns out to not be correct.

 chalsall 2021-06-02 23:00

[QUOTE=R. Gerbicz;579825]ps. even n=10 could be a very hard problem.[/QUOTE]

Like I said... Trivial to imagine...

 Dr Sardonicus 2021-06-02 23:39

It's not clear to me what is allowed. Apparently you are allowed to switch pairs, and keep track of which pairs you've switched. Given n envelopes and n letters, a given pairing-up of letters and envelopes corresponds to one of the n! permutations of n elements.

Assuming you eventually reach complete matching up (identity permutation) by switching one pair at a time, this would correspond to expressing the (inverse of) the original permutation as a product of transpositions. I suppose it is possible to determine, by repeated switching, [i]which[/i] of the letters are in the correct envelopes. Perhaps it is possible to aim for a nested series of subgroups of S[sub]n[/sub].

Amusingly, if you suppose that each attempt is totally from scratch, and any of the n! permutations is equally likely, the "expected" number of tries before you match them all up (identity permutation) is n!. This is actually not hard to prove. Unfortunately, n! grows quite rapidly with n; 10! is already 3,628,800 so this approach is likely to take a while.

In fact, the probability that the desired matching has [i]not[/i] been found in n! steps is (1 - 1/n!)^n!, which for n >=10 is very close to 1/e = .367879+

The number of "derangements" - permutations with no fixed points - [i]none[/i] of the letters is in the right envelope - is $$n!\sum_{i=0}^n\frac{(-1)^{i}}{i!}$$ - sometimes denoted !n or n¡ or "subfactorial n". If n is at all large, this is very close to n!/e.

 Viliam Furik 2021-06-02 23:51

I was watching Netflix and returned back to this thread only now.

Walter: So your program is basically doing what you explained with the Excel pictures? If so, now I get it; even though still not quite educated in your code, I get the idea.

To explain my 1-4 switch - I wanted it to be as many steps as possible whilst preserving the minimalism. In the situation before the 9th call, there were only four switches possible. 2-3 was absolutely bad because both 2 and 3 would get to already known bad positions. 3-1 and 2-4 were, as I realize now, the best choices. I went with 1-4 because 4 would get to a known bad position, but 1 would not. Thus I considered it acceptable as I was trying to get the worst-case scenario. I didn't realize that I should have kept my focus on making the worst best switches based on my information. (worst by randomness, best by strategy)

 uau 2021-06-03 00:03

[QUOTE=R. Gerbicz;579825]My wild guess is that you can get the permutation with only O(n*log(n)) guesses,
as your ideas, for this you only need to halve the search space, what is not very unlikely
because we have much more, (n+1) different answers on the guesses.
[ we know: log2(n!)=O(n*log(n)) ].[/QUOTE]
I'm pretty sure O(n*log(n)) is possible on average. For that it's enough that you can find a single correct position in O(log(n)) tries on average, and then repeat that for the remaining problem of size n-1. I believe this is possible with a bisection-like strategy. If you change some subset of the elements, and the number of correct matches changes, then you know the subset has at least one correct element in the version with more matches.

You can get at least one correct element overall by trying random permutations. I think the expected number of tries for just a simple strategy of "pick some half of remaining elements, randomly permute them" before you see a change in the number of matches would have a finite upper bound over all n. That'd allow you to bisect to the location of one correct element in O(log(n)).

 uau 2021-06-05 18:42

It occurred to me that you can also get O(n*log(n)) by kind of the opposite strategy: instead of proving for one element at a time what its correct value must be, you can also collect permutations with zero matches that prove where every element can not be, and get enough of those in O(n*log(n)) to rule out any wrong value.

With each element having 1/n chance of being correct, all being wrong is about 1/e probability. So just trying random permutations will get you zero-matches ones which rule all values in that permutation. You can repeat this until you get mostly repeats of already-ruled-out values. With a constant multiple of n tries, you can rule out any particular percentage of possible values. So with O(n) tries, you can assume that for element in the permutation, about 90% of values have been ruled out, and n/10 possible values are left.

To proceed further, you can specifically select some of the remaining possible values in a permutation. If each has one in n/10 chance of being correct, you can again place n/10 of them in the same permutation and still have a decent chance of there being zero matches overall. Select all other places with known-false values (should be pretty easy with 90% possibilities being known false). So you get one new known-wrong value for each of n/10 places, and need to do this 10 times to get one more for all n.

So how many tries total? When there are k possible values remaining, you can try about k places in parallel, and need n/k to process all n. Total is sum of n/k for k from 2 to n (or n/10, if you get to 90% ruled out by totally random permutations as above), which is O(n*log(n)).

 R. Gerbicz 2021-06-06 00:36

An offline version of the puzzle, where you need to give all of your guesses in advance,
so you can profit nothing from the previous answers.
For n=10 here it is a solution that is using "only" 23 guesses, which is quite good if you compare this to log2(10!)=21.79
( proving nothing, so we could get even better than log2(n!) ).
Hence there is no two different permutations of 1..10 where you give the same 23 answers.
[CODE]
P=[[3,10,6,4,8,9,2,7,5,1],\
[3,10,1,7,9,6,5,2,4,8],\
[5,3,6,2,4,10,7,8,1,9],\
[2,6,5,8,7,1,9,4,3,10],\
[8,10,3,7,1,5,4,9,6,2],\
[6,5,10,8,4,2,3,1,7,9],\
[3,8,7,2,4,5,6,10,9,1],\
[4,1,7,5,2,3,10,9,8,6],\
[8,4,10,5,1,2,3,6,7,9],\
[10,9,8,2,5,3,6,1,4,7],\
[4,9,7,2,8,6,3,1,5,10],\
[4,5,1,9,3,10,8,7,6,2],\
[3,9,7,10,4,2,5,8,6,1],\
[1,4,2,6,10,8,5,7,3,9],\
[3,4,6,9,7,1,8,10,2,5],\
[2,10,9,4,6,1,5,8,7,3],\
[1,9,8,6,5,10,7,4,3,2],\
[1,10,3,2,9,8,4,7,5,6],\
[7,3,8,4,2,10,6,5,1,9],\
[1,5,10,8,9,6,7,3,4,2],\
[7,3,5,4,6,1,9,10,2,8],\
[9,6,2,1,4,3,7,5,8,10],\
[7,2,9,8,6,1,4,10,3,5]];
[/CODE]

 Citrix 2021-06-06 05:55

In worst case scenario this can be solved in ~ N^2 tries.

The faster practical solution (not sure if this is allowed) is to create a hash table and assign each unique letter a number.
Set the letter and envelopes set in ascending order based on the same hash table.
Then this can then be solved in ~k*N trials. (depending on the definition of the problem- the problem is slightly vague)

 All times are UTC. The time now is 04:34.