| INFO: 2024-10-17 07:12:53,947: llmtf.base.evaluator: Starting eval on ['darumeru/multiq'] | |
| INFO: 2024-10-17 07:12:53,947: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:12:53,947: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:13:01,539: llmtf.base.darumeru/MultiQ: Loading Dataset: 7.59s | |
| INFO: 2024-10-17 07:18:20,829: llmtf.base.darumeru/MultiQ: Processing Dataset: 319.29s | |
| INFO: 2024-10-17 07:18:20,829: llmtf.base.darumeru/MultiQ: Results for darumeru/MultiQ: | |
| INFO: 2024-10-17 07:18:20,830: llmtf.base.darumeru/MultiQ: {'f1': 0.3485719410941241, 'em': 0.24282982791587} | |
| INFO: 2024-10-17 07:18:20,835: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:18:20,835: llmtf.base.evaluator: | |
| mean darumeru/MultiQ | |
| 0.296 0.296 | |
| INFO: 2024-10-17 07:18:30,261: llmtf.base.evaluator: Starting eval on ['darumeru/parus'] | |
| INFO: 2024-10-17 07:18:30,261: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:18:30,261: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:18:34,809: llmtf.base.darumeru/PARus: Loading Dataset: 4.55s | |
| INFO: 2024-10-17 07:18:39,184: llmtf.base.darumeru/PARus: Processing Dataset: 4.37s | |
| INFO: 2024-10-17 07:18:39,184: llmtf.base.darumeru/PARus: Results for darumeru/PARus: | |
| INFO: 2024-10-17 07:18:39,194: llmtf.base.darumeru/PARus: {'acc': 0.68} | |
| INFO: 2024-10-17 07:18:39,194: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:18:39,195: llmtf.base.evaluator: | |
| mean darumeru/MultiQ darumeru/PARus | |
| 0.488 0.296 0.680 | |
| INFO: 2024-10-17 07:18:48,257: llmtf.base.evaluator: Starting eval on ['darumeru/rcb'] | |
| INFO: 2024-10-17 07:18:48,258: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:18:48,258: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:18:52,169: llmtf.base.darumeru/RCB: Loading Dataset: 3.91s | |
| INFO: 2024-10-17 07:18:57,742: llmtf.base.darumeru/RCB: Processing Dataset: 5.57s | |
| INFO: 2024-10-17 07:18:57,742: llmtf.base.darumeru/RCB: Results for darumeru/RCB: | |
| INFO: 2024-10-17 07:18:57,745: llmtf.base.darumeru/RCB: {'acc': 0.5272727272727272, 'f1_macro': 0.47584611730940257} | |
| INFO: 2024-10-17 07:18:57,746: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:18:57,747: llmtf.base.evaluator: | |
| mean darumeru/MultiQ darumeru/PARus darumeru/RCB | |
| 0.492 0.296 0.680 0.502 | |
| INFO: 2024-10-17 07:19:07,388: llmtf.base.evaluator: Starting eval on ['darumeru/ruopenbookqa'] | |
| INFO: 2024-10-17 07:19:07,388: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:19:07,388: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:19:13,124: llmtf.base.darumeru/ruOpenBookQA: Loading Dataset: 5.74s | |
| INFO: 2024-10-17 07:20:12,666: llmtf.base.darumeru/ruOpenBookQA: Processing Dataset: 59.54s | |
| INFO: 2024-10-17 07:20:12,666: llmtf.base.darumeru/ruOpenBookQA: Results for darumeru/ruOpenBookQA: | |
| INFO: 2024-10-17 07:20:12,678: llmtf.base.darumeru/ruOpenBookQA: {'acc': 0.7207903780068728, 'f1_macro': 0.7206838429510474} | |
| INFO: 2024-10-17 07:20:12,689: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:20:12,690: llmtf.base.evaluator: | |
| mean darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/ruOpenBookQA | |
| 0.549 0.296 0.680 0.502 0.721 | |
| INFO: 2024-10-17 07:20:21,945: llmtf.base.evaluator: Starting eval on ['darumeru/ruworldtree'] | |
| INFO: 2024-10-17 07:20:21,945: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:20:21,945: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:20:25,640: llmtf.base.darumeru/ruWorldTree: Loading Dataset: 3.69s | |
| INFO: 2024-10-17 07:20:28,309: llmtf.base.darumeru/ruWorldTree: Processing Dataset: 2.67s | |
| INFO: 2024-10-17 07:20:28,310: llmtf.base.darumeru/ruWorldTree: Results for darumeru/ruWorldTree: | |
| INFO: 2024-10-17 07:20:28,312: llmtf.base.darumeru/ruWorldTree: {'acc': 0.8952380952380953, 'f1_macro': 0.8944916936662219} | |
| INFO: 2024-10-17 07:20:28,313: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:20:28,314: llmtf.base.evaluator: | |
| mean darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/ruOpenBookQA darumeru/ruWorldTree | |
| 0.619 0.296 0.680 0.502 0.721 0.895 | |
| INFO: 2024-10-17 07:20:37,966: llmtf.base.evaluator: Starting eval on ['darumeru/rwsd'] | |
| INFO: 2024-10-17 07:20:37,967: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:20:37,967: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:20:42,582: llmtf.base.darumeru/RWSD: Loading Dataset: 4.62s | |
| INFO: 2024-10-17 07:20:47,988: llmtf.base.darumeru/RWSD: Processing Dataset: 5.41s | |
| INFO: 2024-10-17 07:20:47,988: llmtf.base.darumeru/RWSD: Results for darumeru/RWSD: | |
| INFO: 2024-10-17 07:20:47,989: llmtf.base.darumeru/RWSD: {'acc': 0.5343137254901961} | |
| INFO: 2024-10-17 07:20:47,989: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:20:47,990: llmtf.base.evaluator: | |
| mean darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/ruOpenBookQA darumeru/ruWorldTree | |
| 0.605 0.296 0.680 0.502 0.534 0.721 0.895 | |
| INFO: 2024-10-17 07:20:57,317: llmtf.base.evaluator: Starting eval on ['daru/treewayextractive'] | |
| INFO: 2024-10-17 07:20:57,317: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:20:57,317: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:21:13,664: llmtf.base.daru/treewayextractive: Loading Dataset: 16.35s | |
| INFO: 2024-10-17 07:24:01,803: llmtf.base.daru/treewayextractive: Processing Dataset: 168.14s | |
| INFO: 2024-10-17 07:24:01,803: llmtf.base.daru/treewayextractive: Results for daru/treewayextractive: | |
| INFO: 2024-10-17 07:24:02,038: llmtf.base.daru/treewayextractive: {'r-prec': 0.3983020202020202} | |
| INFO: 2024-10-17 07:24:02,084: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:24:02,085: llmtf.base.evaluator: | |
| mean daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/ruOpenBookQA darumeru/ruWorldTree | |
| 0.575 0.398 0.296 0.680 0.502 0.534 0.721 0.895 | |
| INFO: 2024-10-17 07:24:11,344: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/rummlu'] | |
| INFO: 2024-10-17 07:24:11,345: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:24:11,345: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:29:12,497: llmtf.base.nlpcoreteam/ruMMLU: Loading Dataset: 301.15s | |
| INFO: 2024-10-17 07:35:18,210: llmtf.base.nlpcoreteam/ruMMLU: Processing Dataset: 365.71s | |
| INFO: 2024-10-17 07:35:18,210: llmtf.base.nlpcoreteam/ruMMLU: Results for nlpcoreteam/ruMMLU: | |
| INFO: 2024-10-17 07:35:18,279: llmtf.base.nlpcoreteam/ruMMLU: metric | |
| subject | |
| abstract_algebra 0.330000 | |
| anatomy 0.422222 | |
| astronomy 0.625000 | |
| business_ethics 0.580000 | |
| clinical_knowledge 0.592453 | |
| college_biology 0.506944 | |
| college_chemistry 0.340000 | |
| college_computer_science 0.540000 | |
| college_mathematics 0.370000 | |
| college_medicine 0.549133 | |
| college_physics 0.431373 | |
| computer_security 0.570000 | |
| conceptual_physics 0.536170 | |
| econometrics 0.385965 | |
| electrical_engineering 0.531034 | |
| elementary_mathematics 0.515873 | |
| formal_logic 0.333333 | |
| global_facts 0.390000 | |
| high_school_biology 0.670968 | |
| high_school_chemistry 0.487685 | |
| high_school_computer_science 0.660000 | |
| high_school_european_history 0.733333 | |
| high_school_geography 0.696970 | |
| high_school_government_and_politics 0.569948 | |
| high_school_macroeconomics 0.523077 | |
| high_school_mathematics 0.429630 | |
| high_school_microeconomics 0.521008 | |
| high_school_physics 0.443709 | |
| high_school_psychology 0.706422 | |
| high_school_statistics 0.523148 | |
| high_school_us_history 0.642157 | |
| high_school_world_history 0.729958 | |
| human_aging 0.587444 | |
| human_sexuality 0.641221 | |
| international_law 0.694215 | |
| jurisprudence 0.638889 | |
| logical_fallacies 0.533742 | |
| machine_learning 0.419643 | |
| management 0.650485 | |
| marketing 0.726496 | |
| medical_genetics 0.550000 | |
| miscellaneous 0.629630 | |
| moral_disputes 0.575145 | |
| moral_scenarios 0.248045 | |
| nutrition 0.614379 | |
| philosophy 0.643087 | |
| prehistory 0.546296 | |
| professional_accounting 0.358156 | |
| professional_law 0.373533 | |
| professional_medicine 0.500000 | |
| professional_psychology 0.495098 | |
| public_relations 0.500000 | |
| security_studies 0.665306 | |
| sociology 0.701493 | |
| us_foreign_policy 0.700000 | |
| virology 0.433735 | |
| world_religions 0.672515 | |
| INFO: 2024-10-17 07:35:18,289: llmtf.base.nlpcoreteam/ruMMLU: metric | |
| subject | |
| STEM 0.496176 | |
| humanities 0.566481 | |
| other (business, health, misc.) 0.541724 | |
| social sciences 0.592209 | |
| INFO: 2024-10-17 07:35:18,294: llmtf.base.nlpcoreteam/ruMMLU: {'acc': 0.549147460511024} | |
| INFO: 2024-10-17 07:35:18,341: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:35:18,343: llmtf.base.evaluator: | |
| mean daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/ruOpenBookQA darumeru/ruWorldTree nlpcoreteam/ruMMLU | |
| 0.572 0.398 0.296 0.680 0.502 0.534 0.721 0.895 0.549 | |
| INFO: 2024-10-17 07:35:27,953: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/enmmlu'] | |
| INFO: 2024-10-17 07:35:27,953: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:35:27,953: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:37:30,758: llmtf.base.nlpcoreteam/enMMLU: Loading Dataset: 122.80s | |
| INFO: 2024-10-17 07:43:10,625: llmtf.base.nlpcoreteam/enMMLU: Processing Dataset: 339.87s | |
| INFO: 2024-10-17 07:43:10,626: llmtf.base.nlpcoreteam/enMMLU: Results for nlpcoreteam/enMMLU: | |
| INFO: 2024-10-17 07:43:10,691: llmtf.base.nlpcoreteam/enMMLU: metric | |
| subject | |
| abstract_algebra 0.370000 | |
| anatomy 0.622222 | |
| astronomy 0.697368 | |
| business_ethics 0.670000 | |
| clinical_knowledge 0.709434 | |
| college_biology 0.701389 | |
| college_chemistry 0.450000 | |
| college_computer_science 0.570000 | |
| college_mathematics 0.360000 | |
| college_medicine 0.670520 | |
| college_physics 0.480392 | |
| computer_security 0.720000 | |
| conceptual_physics 0.655319 | |
| econometrics 0.500000 | |
| electrical_engineering 0.565517 | |
| elementary_mathematics 0.539683 | |
| formal_logic 0.357143 | |
| global_facts 0.370000 | |
| high_school_biology 0.800000 | |
| high_school_chemistry 0.561576 | |
| high_school_computer_science 0.670000 | |
| high_school_european_history 0.763636 | |
| high_school_geography 0.772727 | |
| high_school_government_and_politics 0.849741 | |
| high_school_macroeconomics 0.679487 | |
| high_school_mathematics 0.440741 | |
| high_school_microeconomics 0.756303 | |
| high_school_physics 0.450331 | |
| high_school_psychology 0.849541 | |
| high_school_statistics 0.643519 | |
| high_school_us_history 0.813725 | |
| high_school_world_history 0.835443 | |
| human_aging 0.695067 | |
| human_sexuality 0.763359 | |
| international_law 0.768595 | |
| jurisprudence 0.787037 | |
| logical_fallacies 0.779141 | |
| machine_learning 0.464286 | |
| management 0.805825 | |
| marketing 0.884615 | |
| medical_genetics 0.750000 | |
| miscellaneous 0.784163 | |
| moral_disputes 0.650289 | |
| moral_scenarios 0.270391 | |
| nutrition 0.718954 | |
| philosophy 0.717042 | |
| prehistory 0.737654 | |
| professional_accounting 0.496454 | |
| professional_law 0.458931 | |
| professional_medicine 0.672794 | |
| professional_psychology 0.668301 | |
| public_relations 0.681818 | |
| security_studies 0.718367 | |
| sociology 0.810945 | |
| us_foreign_policy 0.790000 | |
| virology 0.487952 | |
| world_religions 0.812865 | |
| INFO: 2024-10-17 07:43:10,700: llmtf.base.nlpcoreteam/enMMLU: metric | |
| subject | |
| STEM 0.563340 | |
| humanities 0.673223 | |
| other (business, health, misc.) 0.667000 | |
| social sciences 0.736716 | |
| INFO: 2024-10-17 07:43:10,705: llmtf.base.nlpcoreteam/enMMLU: {'acc': 0.6600696360837558} | |
| INFO: 2024-10-17 07:43:10,741: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:43:10,743: llmtf.base.evaluator: | |
| mean daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/ruOpenBookQA darumeru/ruWorldTree nlpcoreteam/enMMLU nlpcoreteam/ruMMLU | |
| 0.582 0.398 0.296 0.680 0.502 0.534 0.721 0.895 0.660 0.549 | |
| INFO: 2024-10-17 07:43:20,115: llmtf.base.evaluator: Starting eval on ['daru/treewayabstractive'] | |
| INFO: 2024-10-17 07:43:20,115: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:43:20,115: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:43:24,372: llmtf.base.daru/treewayabstractive: Loading Dataset: 4.26s | |
| INFO: 2024-10-17 07:47:01,407: llmtf.base.daru/treewayabstractive: Processing Dataset: 217.03s | |
| INFO: 2024-10-17 07:47:01,407: llmtf.base.daru/treewayabstractive: Results for daru/treewayabstractive: | |
| INFO: 2024-10-17 07:47:01,408: llmtf.base.daru/treewayabstractive: {'rouge1': 0.32720307606797727, 'rouge2': 0.10857945570692258} | |
| INFO: 2024-10-17 07:47:01,409: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:47:01,410: llmtf.base.evaluator: | |
| mean daru/treewayabstractive daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/ruOpenBookQA darumeru/ruWorldTree nlpcoreteam/enMMLU nlpcoreteam/ruMMLU | |
| 0.545 0.218 0.398 0.296 0.680 0.502 0.534 0.721 0.895 0.660 0.549 | |
| INFO: 2024-10-17 07:47:10,811: llmtf.base.evaluator: Starting eval on ['darumeru/cp_para_ru'] | |
| INFO: 2024-10-17 07:47:10,811: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [131508] | |
| INFO: 2024-10-17 07:47:10,811: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['<|im_end|>'] | |
| INFO: 2024-10-17 07:47:15,676: llmtf.base.darumeru/cp_para_ru: Loading Dataset: 4.86s | |
| INFO: 2024-10-17 07:49:51,029: llmtf.base.darumeru/cp_para_ru: Processing Dataset: 155.35s | |
| INFO: 2024-10-17 07:49:51,030: llmtf.base.darumeru/cp_para_ru: Results for darumeru/cp_para_ru: | |
| INFO: 2024-10-17 07:49:51,031: llmtf.base.darumeru/cp_para_ru: {'symbol_per_token': 3.76859951568896, 'len': 0.9950709951674359, 'lcs': 0.9} | |
| INFO: 2024-10-17 07:49:51,031: llmtf.base.evaluator: Ended eval | |
| INFO: 2024-10-17 07:49:51,032: llmtf.base.evaluator: | |
| mean daru/treewayabstractive daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/cp_para_ru darumeru/ruOpenBookQA darumeru/ruWorldTree nlpcoreteam/enMMLU nlpcoreteam/ruMMLU | |
| 0.578 0.218 0.398 0.296 0.680 0.502 0.534 0.900 0.721 0.895 0.660 0.549 | |
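
The summary rows in the log above appear to be reproducible from the per-task result dictionaries. The sketch below is a minimal reconstruction, not the `llmtf` implementation. It assumes that each task's score is the unweighted mean of its reported metrics and that the `mean` column is the unweighted mean over tasks. For `darumeru/cp_para_ru` only `lcs` is used, which is also an assumption, made because the summary shows 0.900 for that task. The helper names `task_score` and `summarize` are hypothetical.

```python
from statistics import mean

# Per-task metrics copied from the result lines in the log above.
metrics = {
    "darumeru/MultiQ": {"f1": 0.3485719410941241, "em": 0.24282982791587},
    "darumeru/PARus": {"acc": 0.68},
    "darumeru/RCB": {"acc": 0.5272727272727272, "f1_macro": 0.47584611730940257},
    "darumeru/ruOpenBookQA": {"acc": 0.7207903780068728, "f1_macro": 0.7206838429510474},
    "darumeru/ruWorldTree": {"acc": 0.8952380952380953, "f1_macro": 0.8944916936662219},
    "darumeru/RWSD": {"acc": 0.5343137254901961},
    "daru/treewayextractive": {"r-prec": 0.3983020202020202},
    "nlpcoreteam/ruMMLU": {"acc": 0.549147460511024},
    "nlpcoreteam/enMMLU": {"acc": 0.6600696360837558},
    "daru/treewayabstractive": {"rouge1": 0.32720307606797727, "rouge2": 0.10857945570692258},
    # Assumption: only 'lcs' feeds the summary for this task (the log shows 0.900);
    # 'symbol_per_token' and 'len' are left out on purpose.
    "darumeru/cp_para_ru": {"lcs": 0.9},
}


def task_score(task_metrics):
    """Hypothetical helper: unweighted mean of one task's metrics."""
    return mean(list(task_metrics.values()))


def summarize(all_metrics):
    """Hypothetical helper: per-task scores plus their unweighted mean."""
    scores = {task: task_score(m) for task, m in all_metrics.items()}
    return mean(list(scores.values())), scores


overall, scores = summarize(metrics)
print(f"mean {overall:.3f}")
for task in sorted(scores):
    print(f"{task} {scores[task]:.3f}")
```

Under these assumptions the computed overall mean rounds to 0.578, which agrees with the last summary row of the log.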