20-40 samples can find significant improvements in 10,000+ examples. Wanna know how?

Open Problems in Active Learning for Multi-Objective Optimization

1 The general problem:

Given rows of data with many \(X\) independent values and many \(Y\) goals where \(Y=f(X)\):
- Look at the fewest \(Y\) values
- To find what \(X\) values predict for the best \(Y\) values
Currently, we are stuck at around 30-40 labels (and have been for about a year). Can we get this down to 15-20?
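To make the problem concrete, here is a minimal sketch of a label-budgeted search. It is not the current tool: the names `budgeted_search`, `distance_to_ideal`, `label`, and `ideal` are all hypothetical, rows are assumed to be tuples of numbers, the \(Y\) goals are assumed to be already normalized and minimized, and the "pick the unlabelled row nearest the best one so far" rule is just a stand-in for a real acquisition function.

```python
import math
import random

def distance_to_ideal(ys, ideal):
    """Normalized Euclidean distance from goal values to the ideal point (smaller is better)."""
    return math.sqrt(sum((y - i) ** 2 for y, i in zip(ys, ideal)) / len(ys))

def budgeted_search(xs, label, ideal, budget=20, warmup=4, seed=1):
    """Spend at most `budget` calls to `label` (the expensive Y = f(X) oracle)."""
    rng = random.Random(seed)
    todo = list(range(len(xs)))
    rng.shuffle(todo)
    labelled = {}  # row index -> distance to ideal
    # Warm up with a few random labels.
    for i in todo[:warmup]:
        labelled[i] = distance_to_ideal(label(xs[i]), ideal)
    todo = todo[warmup:]
    while len(labelled) < budget and todo:
        best = min(labelled, key=labelled.get)
        # Assumed heuristic: label the unlabelled row closest (in X) to the best row so far.
        nxt = min(todo, key=lambda j: math.dist(xs[j], xs[best]))
        todo.remove(nxt)
        labelled[nxt] = distance_to_ideal(label(xs[nxt]), ideal)
    best = min(labelled, key=labelled.get)
    return best, labelled
```

The only number that matters here is `budget`: the open question is how small it can get before the returned row stops being competitive with the best row in the whole data set.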

Full dendrogram generation, then reflection across the whole structure.
- e.g. sample rows at a frequency equal to leaf diversity
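One possible reading of "sample at a frequency equal to leaf diversity" is sketched below. Everything here is an assumption: the tree is built with simple median splits on the widest dimension (a stand-in for whatever clustering the dendrogram really uses), only the leaves are kept, "diversity" is taken to mean the mean distance of a leaf's rows from the leaf centroid, and the names `split`, `diversity`, and `sample_by_diversity` are hypothetical. Rows are assumed to be tuples of numbers.

```python
import math
import random

def split(rows, min_leaf=8):
    """Recursively bisect rows at the median of their widest dimension; return the leaves."""
    if len(rows) < 2 * min_leaf:
        return [rows]
    dims = range(len(rows[0]))
    widest = max(dims, key=lambda d: max(r[d] for r in rows) - min(r[d] for r in rows))
    rows = sorted(rows, key=lambda r: r[widest])
    mid = len(rows) // 2
    return split(rows[:mid], min_leaf) + split(rows[mid:], min_leaf)

def diversity(leaf):
    """Mean distance of a leaf's rows from the leaf centroid (an assumed definition)."""
    centroid = tuple(sum(col) / len(leaf) for col in zip(*leaf))
    return sum(math.dist(r, centroid) for r in leaf) / len(leaf)

def sample_by_diversity(rows, budget=20, min_leaf=8, seed=1):
    """Pick `budget` rows (with replacement), favoring the more diverse leaves."""
    rng = random.Random(seed)
    leaves = split(rows, min_leaf)
    weights = [diversity(leaf) for leaf in leaves]
    if sum(weights) == 0:  # all leaves identical: fall back to uniform sampling
        weights = [1] * len(leaves)
    chosen = rng.choices(leaves, weights=weights, k=budget)
    return [rng.choice(leaf) for leaf in chosen]
```

Sampling is with replacement to keep the sketch short, so a real version would need to deduplicate before spending labels.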
Divide larger data sets in two (at random)
- Does a model learned from the first part work for the second part?
- How large must the first part be before we can learn a model that is stable for the second part?
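The split-in-two experiment could be run along these lines. This is a sketch only: the k-nearest-neighbor regressor stands in for whatever model is actually learned, rows are assumed to be `(x, y)` pairs with `x` a tuple of numbers and `y` a single number, and the names `knn_predict`, `holdout_error`, and `learning_curve` are hypothetical.

```python
import math
import random

def knn_predict(train, x, k=3):
    """Predict y as the mean y of the k training rows nearest to x."""
    near = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(y for _, y in near) / len(near)

def holdout_error(train, test, k=3):
    """Mean absolute error on the held-out part."""
    return sum(abs(knn_predict(train, x, k) - y) for x, y in test) / len(test)

def learning_curve(rows, sizes, seed=1):
    """Split rows in two at random; train on growing prefixes of part one, test on part two."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    half = len(rows) // 2
    first, second = rows[:half], rows[half:]
    return [(n, holdout_error(first[:n], second)) for n in sizes if n <= half]
```

"Stable" could then be read as the point where the error stops changing much between successive sizes, which is a judgment call this sketch leaves open.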

Early stopping? Track progress to date then stop early
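A simple way to track progress and stop early is a patience rule, sketched below. The function name `labels_until_stall` and the parameters `patience` and `min_gain` are hypothetical, and scores are assumed to be "smaller is better" values recorded after each new label.

```python
def labels_until_stall(scores, patience=5, min_gain=0.01):
    """Return (labels used, best score), stopping after `patience` labels with no real gain."""
    scores = list(scores)
    best = float("inf")
    stale = 0
    for used, score in enumerate(scores, start=1):
        if score < best - min_gain:
            best, stale = score, 0   # meaningful improvement: reset the counter
        else:
            stale += 1               # no meaningful improvement on this label
        if stale >= patience:
            return used, best
    return len(scores), best
```

For example, with `patience=3` the sequence `[0.9, 0.5, 0.4, 0.41, 0.4, 0.42, 0.39, 0.4, 0.3]` stops after 6 labels with a best score of 0.4, so it never sees the later 0.3. That trade-off is exactly what would need tuning.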
Pool learning:
- Current tool is a “pool” learner that can access all the unlabelled examples.
- An alternate approach is a “stream” learner that eats the data in “eras” of size (say) 1000
- So instead of starting from zero all the time
  - Restart each era with the model learned from era[-1]
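The stream idea can be sketched as follows. This is an illustration under stated assumptions, not the current tool: the "model" is just the best few labelled rows seen so far, `label` is a hypothetical oracle returning a single score to minimize, rows are assumed to be tuples of numbers, and the names `eras` and `stream_learn` are made up for this example.

```python
import math
import random

def eras(rows, size=1000):
    """Yield the data in consecutive chunks ("eras") of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def stream_learn(rows, label, era_size=1000, labels_per_era=4, keep=8, seed=1):
    """Label a few rows per era, carrying the model over from the previous era."""
    rng = random.Random(seed)
    model = []  # sorted list of (score, x); carried across eras
    used = 0
    for era in eras(rows, era_size):
        if model:
            # Warm start: use the best row so far to choose what to label next.
            best_x = model[0][1]
            picks = sorted(era, key=lambda x: math.dist(x, best_x))[:labels_per_era]
        else:
            # First era only: nothing learned yet, so start from random picks.
            picks = rng.sample(era, min(labels_per_era, len(era)))
        for x in picks:
            model.append((label(x), x))
            used += 1
        model = sorted(model)[:keep]
    return model, used
```

Only the first era starts from zero; every later era begins with the model from `era[-1]`, so the total label count is roughly `labels_per_era` times the number of eras.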
Why does the faster heuristic work? Can we use that to build a better algorithm?

How much of fig7 of https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2011-8.pdf can we cover?
Some of the goals we are exploring are a little dull. Can we do better?
- For https://arxiv.org/pdf/2311.17483.pdf, fig 9, what support can you add for (say) 5 of the left-hand side requirements?
- This one is challenging. How would you generate the data to explore?
  - But wait! We only need fewer than 40 examples. Does that help us?