Judging LLM Through Ruby Scope



import marimo

__generated_with = "0.13.15"
app = marimo.App(width="medium")


@app.cell
def _(mo, prompt):
    mo.md(
        rf"""
    # Judging LLM through Ruby Scope

    One of the prompts I use to judge the quality of a model checks whether it understands Ruby scope. This is the exact prompt I use:

    \---
    {prompt}
    \---

    Since `a` is defined on `main`, outside `hello`, and `def` starts a new local scope, the method has no access to it. Unlike the equivalent Python, the code raises: `undefined local variable or method 'a' for main (NameError)`.

    I write Ruby professionally; if a model can't figure this out, there's no point in using it for other Ruby coding tasks. And since Ruby is a much smaller language than Python, the result also hints at how diverse the model's training set is.

    Generally, the bigger the model, the better it does on this task. Most models under 20B parameters fail. Some around that size return the right answer some of the time. I wanted an accurate picture of which of my local models (installed through Ollama) get it right and how often, so I wrote a simple eval script that queries each model a few times, asks an LLM to check the answers, and keeps a counter.

    First, we need a list of available models. Ollama provides an endpoint that returns exactly that. This isn't an exhaustive test, since I don't have all of the models installed locally. I also believe there can be discrepancies between what Ollama makes available and what other providers make available under the same name, so keep that in mind.
    """
    )
    return


@app.cell
def _(mo):
    mo.md(r"""### Get list of all locally installed models through ollama""")
    return


@app.cell
def _(requests):
    # get the list of all locally installed models through ollama
    result = requests.get("http://localhost:11434/api/tags").json()
    models = [model["model"] for model in result["models"]]
    models
    return (models,)


@app.cell
def _(json, remove_think_tags, requests):
    # helper methods to make requests
    prompt = """
    ```ruby
    a = 1

    def hello
      puts a
    end

    hello
    ```

    What would this return? Be concise and return only the answer.
    """


    def make_request(model_name):
        # send the Ruby prompt to one model and return the parsed JSON reply
        data = {
            "model": model_name,
            "prompt": prompt,
            "stream": False,
            # sampling parameters go under "options" in Ollama's native API
            "options": {"temperature": 0},
        }
        return requests.post(
            "http://localhost:11434/api/generate", json=data
        ).json()


    def make_requests(models):
        # map each model name to its answer, with any <think> block removed
        responses = {}
        for m in models:
            r = make_request(m)
            # print(f"'{m}' responded: {r['response']}")
            responses[m] = remove_think_tags(r["response"])

        return responses


    # ask a judge model whether each answer is correct.
    # The tallies are only as good as this judge's response.
    def eval_request(responses, model="mistral-small:24b"):
        prompt = f"""
    Input:
    {responses}

    The given input is a list of responses from different LLM models.
    For each of the key value pairs, if the value contains 'NameError' or talks about how this code will fail, it's considered a success. If not, it's a failure.
    Please iterate through the input and categorize the successes and failures of the models in the following JSON format:

    {{
        "success": ["model1", "model2"],
        "failures": ["model3", "model4"]
    }}

    Only return JSON and only return in the above format
    """
        data = {
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0},
            # Ollama's native JSON mode; the OpenAI-style "response_format"
            # key is not a field of this endpoint
            "format": "json",
        }
        try:
            response = requests.post(
                "http://localhost:11434/api/generate", json=data
            ).json()

            # strip a Markdown code fence in case the judge wraps its JSON in one
            return json.loads(
                response["response"]
                .strip()
                .replace("```json", "")
                .replace("```", "")
            )
        except Exception:
            # request failed or the reply was not valid JSON: count nothing for this round
            return {"success": [], "failures": []}
    return eval_request, make_requests, prompt


@app.cell
def _(eval_request, make_requests, models):
    eval_outputs = []

    # query every model 10 times and have the judge classify each round
    for i in range(10):
        responses = make_requests(models)
        eval_output_json = eval_request(responses)
        eval_outputs.append(eval_output_json)

    # count how many rounds each model landed in each bucket
    model_count = {"success": {}, "failures": {}}
    for e in eval_outputs:
        try:
            for s in e["success"]:
                model_count["success"][s] = model_count["success"].get(s, 0) + 1

            for s in e["failures"]:
                model_count["failures"][s] = model_count["failures"].get(s, 0) + 1

        except Exception as ee:
            print(f"Error iterating through eval response: {ee} {e}")
    return (model_count,)


@app.cell
def _(mo, model_count):
    mo.md(
        rf"""
    This is the result of sending each model the same prompt 10 times:

    Success:
    {model_count["success"]}

    Failures:
    {model_count["failures"]}

    [`mistral-small:24b`](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) and the `deepseek` models consistently return the correct answer; however, the `deepseek` models are very slow. `gemma` surprisingly always returns the wrong answer for this particular prompt, even though `gemma2:9b` is my go-to model for any kind of extraction or parsing job.

    I used `mistral-small:24b` as the LLM for evaluating the results as well. It was generally pretty good, but I noticed it would at times not return JSON, or not follow the specified JSON schema.

    Ollama does support structured outputs (a JSON schema passed as `format`); perhaps that might fix the schema issue.
    """
    )
    return


@app.cell(hide_code=True)
def _():
    import re


    def remove_think_tags(text):
        # drop <think>...</think> reasoning blocks emitted by some models
        return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return (remove_think_tags,)


@app.cell(hide_code=True)
def _():
    import marimo as mo
    import requests
    import json
    return json, mo, requests


if __name__ == "__main__":
    app.run()
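
The closing note about structured outputs can be made concrete. What follows is only a sketch, not what the notebook runs: it assumes an Ollama version recent enough that `/api/generate` accepts a JSON schema in its `format` field, and the names `VERDICT_SCHEMA`, `build_judge_payload`, `parse_verdict`, and `tally` are hypothetical helpers introduced for illustration. The schema mirrors the `success` / `failures` shape the judge prompt asks for, and the parser drops any model name the judge may have made up.

```python
import json
from collections import Counter

# Hypothetical schema mirroring the {"success": [...], "failures": [...]} shape
# requested in the judge prompt.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "success": {"type": "array", "items": {"type": "string"}},
        "failures": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["success", "failures"],
}


def build_judge_payload(judge_prompt, model="mistral-small:24b"):
    """Build a request body for /api/generate (assumes schema support in `format`)."""
    return {
        "model": model,
        "prompt": judge_prompt,
        "stream": False,
        "format": VERDICT_SCHEMA,
        "options": {"temperature": 0},
    }


def parse_verdict(raw_text, known_models):
    """Parse the judge's reply, keeping only names that were actually queried."""
    empty = {"success": [], "failures": []}
    try:
        verdict = json.loads(raw_text)
    except json.JSONDecodeError:
        return empty
    if not isinstance(verdict, dict):
        return empty
    cleaned = {}
    for key in ("success", "failures"):
        names = verdict.get(key, [])
        if not isinstance(names, list):
            names = []
        cleaned[key] = [n for n in names if isinstance(n, str) and n in known_models]
    return cleaned


def tally(verdicts):
    """Count, per model, how many rounds ended in each bucket."""
    counts = {"success": Counter(), "failures": Counter()}
    for verdict in verdicts:
        for key in counts:
            counts[key].update(verdict[key])
    return counts
```

In use, the payload from `build_judge_payload` would be posted to the same `/api/generate` endpoint, and the `response` field of the reply handed to `parse_verdict` together with the list of installed models. Filtering against the known names keeps a hallucinated model name from showing up in the final counts.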
