
20-40 samples can find significant improvements in 10,000+ examples. Wanna know how?


How Much Data Needed for Learning?

Tim Menzies

August 22, 2024

0.1 How Much Data Do We Need for Learning?

0.2 A more informed position: The question is wrong


0.3 Another question: How much data can you handle?

For very fast decision making, there is a cognitive science case that we work from fewer than a dozen examples:

While first proposed in 1981, this STM/LTM theory remains relevant [10]. This theory can be used to explain both expert competency and incompetency in software engineering tasks such as understanding code [11].

0.4 Another question: How much data can you get?

How fast can we gather expert opinion?

Evidence from “cost estimation”

Evidence from “Repertory Grids”

Advice on how long to fill in a rep grid?

Overall, we get, for reflective labels on data:

0.5 Advice from Mathematics

One commonly cited rule of thumb [^call] is to have at least 10 times as many training data instances as attributes [16] [17].
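As a sketch, that rule of thumb is easy to operationalize (the function name and its default are mine, not from the cited papers):

```python
def min_training_size(n_attributes, instances_per_attribute=10):
    """Rule-of-thumb minimum training set size:
    at least 10 instances per attribute."""
    return n_attributes * instances_per_attribute

# e.g. a defect-prediction table with 20 attributes
print(min_training_size(20))  # 200
```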

0.6 Historically, how much data was enough?

0.7 Maths

0.7.1 Chess board model

Data is spread out across a \(d\)-dimensional chessboard where each dimension is divided into \(b\) bins [21].

The target is some subset of the data that falls into some of the chessboard cells:
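Assuming this chessboard model, the number of cells grows as \(b^d\), so the chance that a random example lands in the target shrinks fast as dimensions are added. A minimal sketch (function names are mine):

```python
def n_cells(b, d):
    """Cells in a d-dimensional grid with b bins per dimension."""
    return b ** d

def hit_chance(target_cells, b, d):
    """Chance a uniformly drawn example lands in a target
    occupying `target_cells` of the grid's cells."""
    return target_cells / n_cells(b, d)

print(n_cells(5, 3))        # 125
print(hit_chance(5, 5, 3))  # 0.04
```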

0.7.2 Probable Correctness Theory

Richard Hamlet, Probable Correctness Theory, 1987 [22].

Some what-ifs:

- If we apply Cohen’s rule (things are indistinguishable if less than \(d{\times}\sigma\) apart),
- and if variables are Gaussian ranging \(-3 \le x \le 3\),
- then that space divides into regions of size \(p=\frac{d}{6}\).

| scenario | d | p | C | n(c,p) | \(\log_2(n(c,p))\) |
|---|---|---|---|---|---|
| medium effect, non-safety critical | 0.35 | 0.06 | 0.95 | 50 | 6 |
| small effect, safety critical | 0.2 | 0.03 | 0.9999 | 272 | 8 |
| tiny effects, ultra-safety critical | n/a | one in a million | six sigma (0.999999) | 13,815,504 | 24 |

Note the above table makes some very optimistic assumptions about the problem:

But it also tells us that the only way we can reason about safety-critical systems is via some sorting heuristic (so we can get the \(\log_2\) effect).

[^call]: Application of machine learning techniques in small sample clinical studies, from StackExchange.com https://stats.stackexchange.com/questions/1856/application-of-machine-learning-techniques-in-small-sample-clinical-studies
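Under the assumptions above, the \(n(c,p)\) column follows from the probable correctness formula \(n \ge \log(1-C)/\log(1-p)\): the number of random tests needed to be \(C\) confident of seeing at least one event of probability \(p\). A sketch that reproduces the table (with \(p=d/6\)):

```python
import math

def n_samples(C, p):
    """Random tests needed to be C confident of seeing at least
    one event of probability p (probable correctness theory)."""
    return math.ceil(math.log(1 - C) / math.log(1 - p))

print(n_samples(0.95, 0.35 / 6))   # 50        (medium effect)
print(n_samples(0.9999, 0.2 / 6))  # 272       (small effect, safety critical)
print(n_samples(0.999999, 1e-6))   # 13815504  (six sigma)
print(math.ceil(math.log2(n_samples(0.999999, 1e-6))))  # 24
```

Note how the last line shows the \(\log_2\) effect: thirteen million random tests collapse to a few dozen steps if some heuristic lets us binary-chop the space.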

1 Few-Shot Learning

In the following, the author says LLMs are not learners; but given the results of this subject, I think an edit is in order:

Need another name

Generalize to new tasks via a sequence of prompts composed of natural language instructions.

Few-shot learning is a subfield of machine learning and deep learning that aims to teach AI models how to learn from only a small number of labeled training examples.
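A minimal sketch of how a few-shot prompt is typically assembled for an LLM (the helper and the Input/Output format are illustrative assumptions, not from any cited paper):

```python
def few_shot_prompt(instruction, examples, query):
    """Build an n-shot prompt: a natural-language instruction,
    n labeled examples, then the unlabeled case to complete."""
    lines = [instruction]
    for x, y in examples:
        lines += [f"Input: {x}", f"Output: {y}"]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify the issue report as bug or feature.",
    [("App crashes on save", "bug"),
     ("Add dark mode", "feature")],
    "Login button unresponsive")
print(prompt)
```

The model is then asked to continue the text after the final "Output:", so the labeled examples do the work that training data does in conventional learning.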

More generally, “n-shot learning” is a category of artificial intelligence that also includes:

Applications:

Methods:

1.0.1 Few Shot Learning in SE

March 2024: Google query: “few-shot learning and ‘software engineering’”

In the first 100 returns, after paper 70, there were no more published few-shot learning papers in SE.

In those 70 papers:

| year | citations | venue | type (j=journal; c=conf; w=workshop) | title | pdf | data |
|---|---|---|---|---|---|---|
| 2023 | 1 | ICSE NLBSE | w | Few-Shot Learning for Issue Report Classification | pdf | 200 + 200 |
| 2023 | 2 | SSBSE | c | Search-based Optimisation of LLM Learning Shots for Story Point Estimation | pdf | 6 to 10 |
| 2023 | 2 | ICSE | c | Log Parsing with Prompt-based Few-shot Learning | pdf | 4 to 128; most improvement before 16 |
| 2023 | 3 | AST | c | FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning | pdf | 400+ |
| 2023 | 5 | ICSE | c | Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning | pdf | 6-7 (code generation); 40 to 50 (code repair) |
| 2022 | 7 | Soft.Lang.Eng | c | Neural Language Models and Few Shot Learning for Systematic Requirements Processing in MDSE | pdf | 8 to 11 |
| 2023 | 12 | ICSE | c | Towards using Few-Shot Prompt Learning for Automating Model Completion | pdf | 212 classes |
| 2020 | 15 | IEEE Access | j | Few-Shot Learning Based Balanced Distribution Adaptation for Heterogeneous Defect Prediction | pdf | 100s - 1000s |
| 2019 | 21 | Big Data | j | Exploring the applicability of low-shot learning in mining software repositories | pdf | 100 => 70% accuracy; 100s => 90% accuracy |
| 2021 | 27 | ESEM | c | An Empirical Examination of the Impact of Bias on Just-in-time Defect Prediction | | 10^3 samples of defects |
| 2020 | 29 | ICSE | c | Unsuccessful Story about Few Shot Malware Family Classification and Siamese Network to the Rescue | pdf | 10,000s? |
| 2022 | 65 | ASE | c | Few-shot training LLMs for project-specific code-summarization | pdf | 10 samples |
| 2022 | 101 | FSE | c | Less Training, More Repairing Please: Revisiting Automated Program Repair via Zero-Shot Learning | pdf | ? |

  1. P. Norvig. (2011) The Unreasonable Effectiveness of Data. Youtube. https://www.youtube.com/watch?v=yvDCzhbjYWs↩︎

  2. F. Rahman, D. Posnett, I. Herraiz, and P. Devanbu, “Sample size vs. bias in defect prediction,” in Proceedings of the 2013 9th joint meeting on foundations of software engineering. ACM, 2013, pp. 147–157.↩︎

  3. S. Amasaki, “Cross-version defect prediction: use historical data, crossproject data, or both?” Empirical Software Engineering, pp. 1–23, 2020.↩︎

  4. S. McIntosh and Y. Kamei, “Are fix-inducing changes a moving target? a longitudinal case study of just-in-time defect prediction,” IEEE Transactions on Software Engineering, vol. 44, no. 5, pp. 412–428, 2017.↩︎

  5. S. N.C., S. Majumder and T. Menzies, “Early Life Cycle Software Defect Prediction. Why? How?,” 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, ES, 2021, pp. 448-459, doi: 10.1109/ICSE43902.2021.00050.↩︎

  6. Jill Larkin, John McDermott, Dorothea P. Simon, and Herbert A. Simon. 1980. Expert and Novice Performance in Solving Physics Problems. Science 208, 4450 (1980), 1335–1342. DOI:http://dx.doi.org/10.1126/science.208.4450.1335 arXiv:http://science.sciencemag.org/content/208/4450/1335.full.pdf↩︎

  7. N. Cowan. 2001. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behav Brain Sci 24, 1 (Feb 2001), 87–114.↩︎

  8. George A Miller. 1956. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological review 63, 2 (1956), 81.↩︎

  9. Jill Larkin, John McDermott, Dorothea P. Simon, and Herbert A. Simon. 1980. Expert and Novice Performance in Solving Physics Problems. Science 208, 4450 (1980), 1335–1342. DOI:http://dx.doi.org/10.1126/science.208.4450.1335 arXiv:http://science.sciencemag.org/content/208/4450/1335.full.pdf↩︎

  10. Recently, Ma et al. [^wei14] used evidence from neuroscience and functional MRIs to argue that STM capacity might be better measured using other factors than “number of items”. But even they conceded that “the concept of a limited (STM) has considerable explanatory power for behavioral data”.↩︎

  11. Susan Wiedenbeck, Vikki Fix, and Jean Scholtz. 1993. Characteristics of the mental representations of novice and expert programmers: an empirical study. International Journal of Man-Machine Studies 39, 5 (1993), 793–812.↩︎

  12. Valerdi, Ricardo. “Heuristics for systems engineering cost estimation.” IEEE Systems Journal 5.1 (2010): 91-98.↩︎

  13. Kington, Alison. “Defining Teachers’ Classroom Relationships.” (2009). https://eprints.worc.ac.uk/1885/1/Kington%202009.pdf↩︎

  14. Easterby-Smith, Mark. “The Design, Analysis and Interpretation of Repertory Grids.” Int. J. Man Mach. Stud. 13 (1980): 3-24.↩︎

  15. Helen M. Edwards, Sharon McDonald, S. Michelle Young, The repertory grid technique: Its place in empirical software engineering research, Information and Software Technology, Volume 51, Issue 4, 2009, Pages 785-798, ISSN 0950-5849,↩︎

  16. Austin PC, Steyerberg EW. Events per variable (EPV) and the relative performance of different strategies for estimating the out-of-sample validity of logistic regression models. Stat Methods Med Res. 2017 Apr;26(2):796-808. doi: 10.1177/0962280214558972. Epub 2014 Nov 19. PMID: 25411322; PMCID: PMC5394463.↩︎

  17. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996 Dec;49(12):1373-9. doi: 10.1016/s0895-4356(96)00236-3. PMID: 8970487.↩︎

  18. Alvarez, L., & Menzies, T. (2023). Don’t Lie to Me: Avoiding Malicious Explanations With STEALTH. IEEE Software, 40(3), 43-53.↩︎

  19. Zhu, X., Vondrick, C., Fowlkes, C.C. et al. Do We Need More Training Data?. Int J Comput Vis 119, 76–92 (2016). https://doi-org.prox.lib.ncsu.edu/10.1007/s11263-015-0812-2↩︎

  20. Menzies, T., Turhan, B., Bener, A., Gay, G., Cukic, B., & Jiang, Y. (2008). Implications of ceiling effects in defect predictors. In Proceedings of the 4th international workshop on Predictor models in software engineering (pp. 47-54).↩︎

  21. J. Nam, W. Fu, S. Kim, T. Menzies and L. Tan, “Heterogeneous Defect Prediction,” in IEEE Transactions on Software Engineering, vol. 44, no. 9, pp. 874-896, 1 Sept. 2018, doi: 10.1109/TSE.2017.2720603.↩︎

  22. Hamlet, Richard G. “Probable correctness theory.” Information processing letters 25.1 (1987): 17-25.↩︎