I’m new to the field of large language models (LLMs) and I’m really interested in learning how to train and use my own models for qualitative analysis. However, I’m not sure where to start or what resources would be most helpful for a complete beginner. Could anyone provide some guidance and advice on the best way to get started with LLM training and usage? Specifically, I’d appreciate insights on learning resources or tutorials, tips on preparing datasets, common pitfalls or challenges, and any other general advice or words of wisdom for someone just embarking on this journey.

Thanks!

  • halcyon@slrpnk.net · 5 months ago

    How much do you know about natural language processing? If you aren’t already familiar, you’ll probably want to start with some basics like tokenizing, lemmatizing, stemming, identifying stop words, determining parts of speech, spell checking, and vectorizing. Putting together a clean, normalized training set will require some or most of these, and knowing them should give you some context for what you’re feeding the model.

    I’m most familiar with Python’s Natural Language Toolkit (NLTK) for most of those. sklearn also has some text tools (a vectorizer for sure), or you could jump right into Keras/TensorFlow.
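To make a few of the steps listed above concrete (tokenizing, stop-word removal, stemming, vectorizing), here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the tiny stop-word list, the suffix-stripping rule, and the function names are hypothetical stand-ins, not how NLTK or sklearn implement these steps. In practice you would likely reach for those libraries instead.

```python
import re
from collections import Counter

# Toy stop-word list (an assumption for illustration; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "are", "was", "were", "it"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes. A rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def vectorize(docs):
    """Build a sorted vocabulary and one bag-of-words count vector per document."""
    processed = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in processed for t in doc})
    vectors = []
    for doc in processed:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


docs = [
    "The participants were talking about trust.",
    "Trust is earned, and participants talked.",
]
vocab, vectors = vectorize(docs)
print(vocab)
print(vectors)
```

Each document ends up as a list of counts aligned with `vocab`, which is the same basic idea behind a bag-of-words vectorizer. Note how "talking" and "talked" collapse to the same stem, so the two documents overlap more than their raw words suggest.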

    After that, look into the concept of transformer models. The first tutorial below covers some of the basic cleanup steps, although I’d still want to understand them better than just copy-pasting its code and regexes:

    https://machinelearningmastery.com/building-transformer-models-with-attention-crash-course-build-a-neural-machine-translator-in-12-days/

    https://machinelearningmastery.com/what-are-large-language-models/

    • 🐝bownage [they/he]@beehaw.org · 5 months ago

      Good recommendations! I’d suggest doing some spaCy tutorials as well for the topics in your first paragraph. But arguably it’s possible nowadays to just start at transformers without any NLP knowledge, e.g. using Hugging Face’s AutoTrain or something similar. I wouldn’t recommend it, but you definitely could.