As the saying goes, if you can't measure it, you can't improve it. In this section, we cover different ways to measure and ultimately validate Llama, so it's possible to determine the improvements provided by different fine tuning techniques.
The focus of these techniques is to gather objective metrics that can be compared easily during and after each fine tuning run, and to provide quick feedback on whether the model is performing as expected. The main metrics collected are loss and perplexity.
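As a reference, the sketch below shows how perplexity can be derived from the logged loss, assuming the loss is a mean per-token cross-entropy, as is typical for causal language models.

```python
# Minimal sketch: perplexity is the exponential of the mean per-token
# cross-entropy loss, so it can be derived directly from the loss values
# already logged during fine tuning.
import math

def perplexity_from_loss(mean_cross_entropy_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss into perplexity."""
    return math.exp(mean_cross_entropy_loss)

# Example: a validation loss of 2.0 corresponds to a perplexity of ~7.39.
print(perplexity_from_loss(2.0))
```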
K-fold cross-validation consists of dividing the dataset into k subsets, or folds, and then fine tuning the model k times. On each run, a different fold is used as the validation dataset, with the rest used for training. The performance results of each run are averaged for the final report. This provides a more accurate measure of the model's performance across the complete dataset, as every entry serves for both validation and training. While it produces the most accurate prediction of how a model will generalize after fine tuning on a given dataset, it is computationally expensive and better suited for small datasets.
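The following sketch illustrates the procedure using scikit-learn's KFold to produce the splits; `fine_tune_and_evaluate` is a hypothetical stand-in for a full fine tuning run that returns the validation loss for a given split.

```python
# Illustrative sketch of k-fold cross-validation over a fine tuning dataset.
from sklearn.model_selection import KFold
import numpy as np

def cross_validate(dataset, fine_tune_and_evaluate, k: int = 5) -> float:
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    fold_losses = []
    for fold, (train_idx, val_idx) in enumerate(kfold.split(dataset)):
        train_split = [dataset[i] for i in train_idx]
        val_split = [dataset[i] for i in val_idx]
        # Hypothetical helper: runs one fine tuning pass and returns val loss.
        loss = fine_tune_and_evaluate(train_split, val_split)
        print(f"fold {fold}: validation loss {loss:.4f}")
        fold_losses.append(loss)
    # The averaged metric estimates how the model generalizes across the
    # whole dataset, since every entry is used for validation exactly once.
    return float(np.mean(fold_losses))
```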
In our recipes, we use a simple holdout during fine tuning. Using the logged loss values for both the train and validation datasets, the two curves are plotted to analyze the results of the process. Given the setup in the recipe, the expected behavior is a plot of the logged values showing both the train and validation loss decreasing as training progresses.
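A minimal plotting sketch is shown below, assuming the train and validation losses have already been parsed from the logs into two lists with one value per epoch.

```python
# Minimal sketch: plot logged train/validation losses to inspect the curves.
import matplotlib.pyplot as plt

def plot_loss_curves(train_losses, val_losses):
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, label="train loss")
    plt.plot(epochs, val_losses, label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.title("Fine tuning loss curves")
    plt.show()
```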
If the validation curve starts going up while the train curve continues to decrease, the model is overfitting and not generalizing well. Some alternatives to try when this happens are early stopping, verifying that the validation dataset is statistically representative of the train dataset, data augmentation, using parameter efficient fine tuning, or using k-fold cross-validation to better tune the hyperparameters.
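As an illustration, the sketch below implements a simple early-stopping check driven by the validation loss; the patience value and how it hooks into the training loop are assumptions.

```python
# Illustrative early-stopping check based on the validation loss.
class EarlyStopping:
    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Stop once the validation loss has not improved for `patience`
        # epochs, the point at which the model is likely starting to overfit.
        return self.bad_epochs >= self.patience
```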
Manually evaluating a fine tuned model will vary according to the fine tuning objective and the available resources. Here we provide general guidelines on how to accomplish it.
With a dataset prepared for fine tuning, a portion of it can be set aside as a manual test subset, which can be further augmented with general knowledge questions that are relevant to the specific use case. In addition to these questions, we recommend running standard evaluations as well and comparing the fine tuned model's results against the baseline.
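One possible way to set this up is sketched below; `build_manual_test_set`, `generate_answer` and the holdout fraction are hypothetical choices, not part of the recipes.

```python
# Sketch of carving out a manual test subset and collecting side-by-side
# answers from the baseline and fine tuned models.
import random

def build_manual_test_set(dataset, holdout_fraction=0.05, extra_questions=()):
    data = list(dataset)
    random.shuffle(data)
    cutoff = max(1, int(len(data) * holdout_fraction))
    # Held-out entries plus any general knowledge questions added by hand.
    return data[:cutoff] + list(extra_questions)

def compare_models(test_set, baseline_model, finetuned_model, generate_answer):
    rows = []
    for question in test_set:
        rows.append({
            "question": question,
            "baseline": generate_answer(baseline_model, question),
            "finetuned": generate_answer(finetuned_model, question),
        })
    return rows
```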
To rate the results, clear evaluation criteria should be defined that are relevant to the dataset being used. Example criteria are accuracy, coherence and safety. Create a rubric for each criterion and define what an output must contain to receive a specific score.
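As an example, the rubric can be kept as a simple mapping from each criterion to the requirements for each score; the criteria names and wording below are illustrative placeholders.

```python
# One possible encoding of the rubric: each criterion maps a score to the
# requirements an output must meet to earn it (placeholder wording).
RUBRIC = {
    "accuracy": {
        1: "Answer is factually wrong or unrelated to the question.",
        3: "Answer is partially correct but incomplete or imprecise.",
        5: "Answer is fully correct and addresses the question.",
    },
    "coherence": {
        1: "Response is disorganized or contradicts itself.",
        3: "Response is understandable but loosely structured.",
        5: "Response is clear, well structured and easy to follow.",
    },
    "safety": {
        1: "Response contains harmful or policy-violating content.",
        3: "Response is borderline or missing needed caveats.",
        5: "Response is safe and appropriately qualified.",
    },
}
```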
With these guidelines in place, distribute the test questions among a diverse set of reviewers to obtain multiple data points for each question. With multiple data points per question and per criterion, a final score can be calculated for each query, allowing the scores to be weighted according to the preferred focus for the final model.
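A small sketch of the aggregation step follows, assuming a 1-5 rubric; the weights are placeholders for whatever focus is chosen for the final model.

```python
# Sketch of aggregating reviewer scores into a weighted final score per query.
from statistics import mean

def final_score(scores: dict[str, list[float]], weights: dict[str, float]) -> float:
    """scores: criterion -> reviewer scores; weights: criterion -> weight."""
    total_weight = sum(weights[c] for c in scores)
    return sum(mean(scores[c]) * weights[c] for c in scores) / total_weight

# Example: three reviewers scored one query on a 1-5 rubric.
example = {"accuracy": [5, 4, 5], "coherence": [4, 4, 3], "safety": [5, 5, 5]}
print(final_score(example, {"accuracy": 0.5, "coherence": 0.2, "safety": 0.3}))
```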