Automating Data Labeling and Annotation Pipelines for Large Language Models (LLMs) in the Financial Industry using Machine Learning
Abstract
The growing volume and complexity of financial data pose considerable challenges for machine learning systems, especially Large Language Models (LLMs), which require large, high-quality labeled datasets for training. Conventional manual labeling is inefficient and expensive, which limits the scalability of LLMs in the finance sector. This research introduces an automated data labeling and annotation framework that uses Principal Component Analysis (PCA) and Decision Trees (DT), two well-established machine learning methods, to streamline and improve the labeling of financial data. PCA reduces dimensionality and helps identify the most informative features and trends in financial datasets, while DTs classify the data and automate the annotation step. The proposed system aims to improve the accuracy, efficiency, and scalability of data labeling, ultimately supporting the ongoing training of LLMs on contextually relevant, labeled financial data.
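As a rough illustration of the PCA-plus-Decision-Tree idea described in the abstract, the following minimal sketch chains the two steps with scikit-learn. It is not the authors' implementation. The synthetic feature matrix, the helper names `build_labeler` and `auto_label`, the hyperparameters, and the confidence threshold that routes uncertain rows to human review are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_labeler(n_components=5, max_depth=4, seed=0):
    """Standardize -> PCA -> decision tree, as one scikit-learn pipeline.

    The hyperparameter defaults are illustrative assumptions.
    """
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components, random_state=seed),
        DecisionTreeClassifier(max_depth=max_depth, random_state=seed),
    )


def auto_label(labeler, X_unlabeled, min_confidence=0.9):
    """Return (labels, accepted_mask).

    Rows whose top class probability falls below min_confidence are
    flagged as not accepted; sending them to human review is an
    assumption of this sketch, not something the abstract specifies.
    """
    proba = labeler.predict_proba(X_unlabeled)
    labels = labeler.classes_[proba.argmax(axis=1)]
    accepted = proba.max(axis=1) >= min_confidence
    return labels, accepted


# Synthetic stand-in for numeric features derived from financial records
# (hypothetical data; real inputs would come from the labeling pipeline).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# A small hand-labeled seed set trains the labeler; the rest is the pool.
X_seed, y_seed = X[:200], y[:200]
X_pool = X[200:]

labeler = build_labeler().fit(X_seed, y_seed)
labels, accepted = auto_label(labeler, X_pool)
print(f"auto-labeled {accepted.sum()} of {len(X_pool)} rows; "
      f"{(~accepted).sum()} left for review")
```

Wrapping both steps in a single pipeline means the PCA projection is fitted only on the seed set and then reused unchanged on the unlabeled pool, which avoids leaking information from the pool into the dimensionality reduction.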