LLMs Struggle with Essay Grading, but Excel at Generative Tasks
A new study finds that large language models (LLMs) like GPT and Llama perform poorly at grading essays compared to human raters. However, AI excels at generative tasks like image, video, and text generation.
Why it matters
This research highlights the limitations of current LLMs in subjective evaluation tasks and the need for human oversight, while also showcasing AI's strengths in generative applications.
Key Points
- LLMs tend to over-score short essays and under-score longer essays with minor errors
- LLMs apply their own internal signals rather than following explicit grading rubrics
- LLMs' performance varies significantly based on essay length and style
- LLMs can assist human graders but cannot replace them for subjective evaluation
- AI APIs like NexaAPI provide powerful generative capabilities at scale and low cost
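The length-related bias in the first point can be checked empirically. A minimal sketch, using hypothetical data (the word counts and score gaps below are illustrative, not from the study): correlate essay length with the gap between the LLM's score and the human score. A negative correlation means the model over-scores short essays and under-scores long ones.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-essay data: word counts and (LLM score - human score).
lengths = [120, 250, 400, 650, 900]
llm_minus_human = [1.5, 0.5, 0.0, -0.5, -1.0]

r = pearson(lengths, llm_minus_human)
# A strongly negative r indicates the length-dependent bias described above.
```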
Details
The research paper, published on arXiv, concludes that agreement between LLM scores and human scores on essay grading remains relatively weak. LLMs struggle with subjective evaluation, rubric-based scoring, and consistency across essay types. While they follow coherent internal patterns, those patterns diverge significantly from how human raters think. The paper suggests that LLMs can assist human graders but cannot replace their nuanced judgment. By contrast, AI excels at generative and creative tasks such as image generation, video synthesis, text-to-speech, and large-scale text generation. Services like NexaAPI provide access to these capabilities through a unified API at low cost.
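Rater agreement of the kind the paper reports is commonly measured with quadratic weighted kappa (QWK), which penalizes large score disagreements more than near-misses. A self-contained sketch of the standard QWK formula (the sample ratings are hypothetical; the study's actual data and metric choice may differ):

```python
def quadratic_weighted_kappa(a, b, min_r, max_r):
    """QWK between two lists of integer ratings in [min_r, max_r].

    1.0 = perfect agreement; 0.0 = chance-level; negative = worse than chance.
    """
    n = max_r - min_r + 1
    total = len(a)
    # Observed confusion matrix and marginal histograms.
    observed = [[0] * n for _ in range(n)]
    hist_a, hist_b = [0] * n, [0] * n
    for x, y in zip(a, b):
        observed[x - min_r][y - min_r] += 1
        hist_a[x - min_r] += 1
        hist_b[y - min_r] += 1
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2      # quadratic disagreement weight
            expected = hist_a[i] * hist_b[j] / total  # chance-level count
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

# Hypothetical ratings on a 1-4 scale.
human = [2, 3, 4, 1, 3, 2]
llm = [3, 3, 4, 2, 4, 2]
kappa = quadratic_weighted_kappa(human, llm, 1, 4)
```

A "relatively weak" agreement in this framing would be a kappa well below the ~0.7 level often treated as acceptable for automated essay scoring.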