20-40 samples can find significant improvements in data sets of 10,000+ examples. Wanna know how?
August 22, 2024
For very fast decision-making, there is a cognitive science case that we work from fewer than a dozen examples:
While first proposed in 1981, this STM/LTM theory remains relevant [10]. This theory can be used to explain both expert competency and incompetency in software engineering tasks such as understanding code [11].
How fast can we gather expert opinion?
Evidence from “cost estimation”
Evidence from “Repertory Grids”
Advice on how long to fill in a rep grid?
Overall, for reflective labels on data, we get:
One commonly cited rule of thumb [^call] is to have at least 10 times as many training data instances as attributes [16] [17].
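As a minimal sketch of that rule (the function name and the 10:1 default are just the rule of thumb above, not a law):

```python
# Rule-of-thumb check: want at least `ratio` training instances
# per attribute before trusting a learned model.
def enough_data(n_rows: int, n_attributes: int, ratio: int = 10) -> bool:
    return n_rows >= ratio * n_attributes

print(enough_data(n_rows=200, n_attributes=21))  # False: 200 < 10*21
```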
Data is spread out across a \(d\)-dimensional chessboard where each dimension is divided into \(b\) bins [21].
The target is some subset of the data that falls into some of the chessboard cells:
Richard Hamlet, Probable correctness theory, 1987 [22].
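In brief (my paraphrase of Hamlet's argument): if one random sample hits the target cells with probability \(p\), then \(n\) independent samples all miss with probability \((1-p)^n\). Demanding confidence \(c\) of at least one hit gives

\[ 1-c = (1-p)^n \quad\Longrightarrow\quad n(c,p) = \frac{\log(1-c)}{\log(1-p)} \]

which is the \(n(c,p)\) used in the table below.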
Some what-ifs:

- If we apply Cohen’s rule (things are indistinguishable if less than \(d{\times}\sigma\) apart),
- and if variables are Gaussian ranging \(-3 \le x \le 3\) (a range of \(6\sigma\)),
- then that space divides into regions of size \(p=\frac{d}{6}\) (e.g. \(d=0.35\) gives \(p\approx 0.06\)).
scenario | d | p | c | n(c,p) | \(\log_2(n(c,p))\)
---|---|---|---|---|---
medium effect, non-safety critical | 0.35 | 0.06 | 0.95 | 50 | 6
small effect, safety critical | 0.2 | 0.03 | 0.9999 | 272 | 8
tiny effects, ultra-safety critical | n/a | one in a million | six sigma (0.999999) | 13,815,504 | 24
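A minimal sketch (Python) reproducing those numbers, assuming the table was computed with \(p=d/6\) at full precision rather than the rounded \(p\) shown:

```python
from math import ceil, log

def n(c, p):
    """Hamlet's probable correctness: samples needed to see an event
    of probability p at least once, with confidence c."""
    return ceil(log(1 - c) / log(1 - p))

for d, c in [(0.35, 0.95), (0.2, 0.9999)]:
    p = d / 6                                # Cohen's rule over a 6-sigma range
    print(n(c, p), round(log(n(c, p), 2)))   # -> 50 6, then 272 8

p, c = 1e-6, 0.999999                        # one-in-a-million target, six sigma
print(n(c, p), round(log(n(c, p), 2)))       # -> 13815504 24
```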
Note that the above table makes some very optimistic assumptions about the problem (e.g. that samples are drawn independently and at random, and that hitting a target cell even once is enough).
But it also tells us that the only way we can reason about safety-critical systems is via some sorting heuristic (so we can get the \(\log_2\) effect); see the sketch below.

[^call]: Application of machine learning techniques in small sample clinical studies, from StackExchange.com: https://stats.stackexchange.com/questions/1856/application-of-machine-learning-techniques-in-small-sample-clinical-studies
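A minimal sketch of that \(\log_2\) effect (Python; the `is_positive` labeling oracle, and the assumption that items can be sorted so labels flip exactly once, are mine, for illustration):

```python
def find_boundary(items, is_positive):
    """Binary search for the first positive item. Assumes `items` are
    sorted so that is_positive() flips from False to True exactly once;
    then we need O(log2 n) calls to the (expensive) labeling oracle,
    not n."""
    lo, hi = 0, len(items) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_positive(items[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo  # index of the first positive item

# e.g. 13,815,504 sorted candidates need only ~24 labels.
```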
In the following, the author says LLMs are not learners but, given the results of this subject, I think an edit is in order:
Need another name
Generalize to new tasks via a sequence of prompts composed of natural language instructions.
Few-shot learning is a subfield of machine learning and deep learning that aims to teach AI models how to learn from only a small number of labeled training examples.
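For instance, here is a minimal sketch of k-shot prompting (the issue reports and labels are hypothetical; any chat-style LLM could consume `prompt`):

```python
# Build a k-shot prompt (k=2 here) from a handful of labeled examples.
shots = [("The build fails on Windows", "bug"),
         ("Please add dark mode", "feature")]
query = "App crashes when saving a file"

prompt = "Classify each issue report as 'bug' or 'feature'.\n\n"
for text, label in shots:
    prompt += f"Issue: {text}\nLabel: {label}\n\n"
prompt += f"Issue: {query}\nLabel:"   # the model completes the label

print(prompt)
```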
More generally, “n-shot learning” is a category of artificial intelligence that also includes zero-shot learning (no labeled examples) and one-shot learning (a single labeled example per class).
Applications:
Methods:
March 2024: Google query: “few-shot learning and ‘software engineering’”
In the first 100 returns, after paper 70, there were no more published few-shot learning papers in SE.
Across those 70 papers:
year | citations | venue | type (j=journal; c=conf; w=workshop) | title | data
---|---|---|---|---|---
2023 | 1 | ICSE NLBSE | w | Few-Shot Learning for Issue Report Classification | 200 + 200
2023 | 2 | SSBSE | c | Search-based Optimisation of LLM Learning Shots for Story Point Estimation | 6 to 10
2023 | 2 | ICSE | c | Log Parsing with Prompt-based Few-shot Learning | 4 to 128; most improvement before 16
2023 | 3 | AST | c | FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning | 400+
2023 | 5 | ICSE | c | Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning | 6-7 (code generation); 40 to 50 (code repair)
2022 | 7 | Soft.Lang.Eng | c | Neural Language Models and Few Shot Learning for Systematic Requirements Processing in MDSE | 8 to 11
2023 | 12 | ICSE | c | Towards using Few-Shot Prompt Learning for Automating Model Completion | 212 classes
2020 | 15 | IEEE Access | j | Few-Shot Learning Based Balanced Distribution Adaptation for Heterogeneous Defect Prediction | 100s - 1000s
2019 | 21 | Big Data | j | Exploring the applicability of low-shot learning in mining software repositories | 100 => 70% accuracy; 100s => 90% accuracy
2021 | 27 | ESEM | c | An Empirical Examination of the Impact of Bias on Just-in-time Defect Prediction | 10^3 samples of defects
2020 | 29 | ICSE | c | Unsuccessful Story about Few Shot Malware Family Classification and Siamese Network to the Rescue | 10,000s?
2022 | 65 | ASE | c | Few-shot training LLMs for project-specific code-summarization | 10 samples
2022 | 101 | FSE | c | Less Training, More Repairing Please: Revisiting Automated Program Repair via Zero-Shot Learning | ?
P. Norvig. (2011). The Unreasonable Effectiveness of Data. YouTube. https://www.youtube.com/watch?v=yvDCzhbjYWs↩︎
F. Rahman, D. Posnett, I. Herraiz, and P. Devanbu, “Sample size vs. bias in defect prediction,” in Proceedings of the 2013 9th joint meeting on foundations of software engineering. ACM, 2013, pp. 147–157.↩︎
S. Amasaki, “Cross-version defect prediction: use historical data, cross-project data, or both?” Empirical Software Engineering, pp. 1–23, 2020.↩︎
S. McIntosh and Y. Kamei, “Are fix-inducing changes a moving target? a longitudinal case study of just-in-time defect prediction,” IEEE Transactions on Software Engineering, vol. 44, no. 5, pp. 412–428, 2017.↩︎
N. C. Shrikanth, S. Majumder and T. Menzies, “Early Life Cycle Software Defect Prediction. Why? How?,” 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, ES, 2021, pp. 448-459, doi: 10.1109/ICSE43902.2021.00050.↩︎
Jill Larkin, John McDermott, Dorothea P. Simon, and Herbert A. Simon. 1980. Expert and Novice Performance in Solving Physics Problems. Science 208, 4450 (1980), 1335–1342. DOI:http://dx.doi.org/10.1126/science.208.4450.1335 arXiv:http://science.sciencemag.org/content/208/4450/1335.full.pdf↩︎
N. Cowan. 2001. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behav Brain Sci 24, 1 (Feb 2001), 87–114.↩︎
George A Miller. 1956. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological review 63, 2 (1956), 81.↩︎
Recently, Ma et al. [^wei14] used evidence from neuroscience and functional MRIs to argue that STM capacity might be better measured using other factors than “number of items”. But even they conceded that “the concept of a limited (STM) has considerable explanatory power for behavioral data”.↩︎
Susan Wiedenbeck, Vikki Fix, and Jean Scholtz. 1993. Characteristics of the mental representations of novice and expert programmers: an empirical study. International Journal of Man-Machine Studies 39, 5 (1993), 793–812.↩︎
Valerdi, Ricardo. “Heuristics for systems engineering cost estimation.” IEEE Systems Journal 5.1 (2010): 91-98.↩︎
Kington, Alison. “Defining Teachers’ Classroom Relationships.” (2009). https://eprints.worc.ac.uk/1885/1/Kington%202009.pdf↩︎
Easterby-Smith, Mark. “The Design, Analysis and Interpretation of Repertory Grids.” Int. J. Man Mach. Stud. 13 (1980): 3-24.↩︎
Helen M. Edwards, Sharon McDonald, S. Michelle Young, The repertory grid technique: Its place in empirical software engineering research, Information and Software Technology, Volume 51, Issue 4, 2009, Pages 785-798, ISSN 0950-5849.↩︎
Austin PC, Steyerberg EW. Events per variable (EPV) and the relative performance of different strategies for estimating the out-of-sample validity of logistic regression models. Stat Methods Med Res. 2017 Apr;26(2):796-808. doi: 10.1177/0962280214558972. Epub 2014 Nov 19. PMID: 25411322; PMCID: PMC5394463.↩︎
Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996 Dec;49(12):1373-9. doi: 10.1016/s0895-4356(96)00236-3. PMID: 8970487.↩︎
Alvarez, L., & Menzies, T. (2023). Don’t Lie to Me: Avoiding Malicious Explanations With STEALTH. IEEE Software, 40(3), 43-53.↩︎
Zhu, X., Vondrick, C., Fowlkes, C.C. et al. Do We Need More Training Data?. Int J Comput Vis 119, 76–92 (2016). https://doi.org/10.1007/s11263-015-0812-2↩︎
Menzies, T., Turhan, B., Bener, A., Gay, G., Cukic, B., & Jiang, Y. (2008). Implications of ceiling effects in defect predictors. In Proceedings of the 4th International Workshop on Predictor Models in Software Engineering (pp. 47-54).↩︎
J. Nam, W. Fu, S. Kim, T. Menzies and L. Tan, “Heterogeneous Defect Prediction,” in IEEE Transactions on Software Engineering, vol. 44, no. 9, pp. 874-896, 1 Sept. 2018, doi: 10.1109/TSE.2017.2720603.↩︎
Hamlet, Richard G. “Probable correctness theory.” Information processing letters 25.1 (1987): 17-25.↩︎