Automating Data Labeling and Annotation Pipelines for Large Language Models (LLMs) in the Financial Industry using Machine Learning

Sai Arundeep Aetukuri

Published: Dec 31, 2024

Keywords:

Large Language Models (LLMs), Banking Industry, Machine Learning (ML), data-driven technologies, data governance, data quality, Principal Component Analysis (PCA), Decision Trees (DT), predictive accuracy, data integrity, data security.

Abstract

The growing volume and complexity of financial data pose considerable challenges for machine learning systems, especially Large Language Models (LLMs), which require large, high-quality labeled datasets for training. Conventional manual labeling is inefficient and expensive, which limits the scalability of LLMs in the finance sector. This research introduces an automated data labeling and annotation framework that uses two established machine learning methods, Principal Component Analysis (PCA) and Decision Trees (DT), to streamline the labeling of financial data. PCA reduces dimensionality and helps identify the key features and trends in financial datasets, while DTs classify the data and automate the annotation process. The proposed system aims to improve the precision, effectiveness, and scalability of data labeling, ultimately supporting the ongoing training of LLMs with contextually relevant, labeled financial data.
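
The abstract does not give implementation details, so the following is only a minimal sketch of how a PCA-then-decision-tree labeling step might look, not the author's implementation. Everything specific in it is an assumption made for illustration: the toy feature vectors, the labels "routine" and "review", and all function names are hypothetical. To stay dependency-free, PCA is reduced to the first principal component (found by power iteration on the covariance matrix) and the decision tree is limited to depth 1 (a decision stump). A real pipeline would more likely rely on a library such as scikit-learn.

```python
# Illustrative sketch only: toy data, hypothetical names, standard library only.

def column_means(rows):
    """Per-feature means, used to center the data before PCA."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def first_principal_component(rows, means, iters=200):
    """Top eigenvector of the sample covariance matrix via power iteration."""
    d, n = len(means), len(rows)
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def project(rows, means, component):
    """Reduce each row to one score: its coordinate along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def majority(labels):
    return max(set(labels), key=labels.count)

def fit_stump(scores, labels):
    """Depth-1 decision tree: pick the threshold with the lowest weighted Gini."""
    best = None
    ordered = sorted(set(scores))
    for lo, hi in zip(ordered, ordered[1:]):
        t = (lo + hi) / 2
        left = [y for s, y in zip(scores, labels) if s <= t]
        right = [y for s, y in zip(scores, labels) if s > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or impurity < best[0]:
            best = (impurity, t, majority(left), majority(right))
    if best is None:
        raise ValueError("need at least two distinct scores to split on")
    return best[1:]  # (threshold, label_if_below, label_if_above)

def auto_label(rows, means, component, stump):
    """Assign a label to each unlabeled row using the fitted stump."""
    threshold, below, above = stump
    return [below if s <= threshold else above
            for s in project(rows, means, component)]

# Hypothetical seed set: [amount in thousands, transactions per day,
# share of cross-border transfers], each with a manually assigned label.
seed_rows = [
    [0.12, 1, 0.0], [0.30, 2, 0.1], [0.25, 1, 0.0], [0.40, 3, 0.1],
    [9.5, 14, 0.8], [7.8, 11, 0.7], [8.9, 12, 0.9], [10.2, 15, 0.6],
]
seed_labels = ["routine"] * 4 + ["review"] * 4

means = column_means(seed_rows)
component = first_principal_component(seed_rows, means)
stump = fit_stump(project(seed_rows, means, component), seed_labels)

# New, unlabeled records are annotated automatically.
unlabeled = [[0.2, 2, 0.0], [9.0, 13, 0.8]]
predicted = auto_label(unlabeled, means, component, stump)
print(predicted)
```

In this sketch a small manually labeled seed set fits both stages, and the fitted projection and threshold are then reused to annotate new records. Keeping more than one principal component and growing a deeper tree would be the natural extensions for realistic financial data.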

How to Cite

Sai Arundeep Aetukuri. (2024). Automating Data Labeling and Annotation Pipelines for Large Language Models (LLMs) in the Financial Industry using Machine Learning. International Journal on Recent and Innovation Trends in Computing and Communication, 12(2), 1092–1104. Retrieved from https://www.ijritcc.org/index.php/ijritcc/article/view/11457