Large Language Models

Large language models, or LLMs, represent a notable advancement in the field of artificial intelligence and are a central topic in any discussion of generative AI. This overview aims to introduce their workings and significance: understanding LLMs involves grasping their fundamental principles and the components that enable their capabilities.

LLMs are built upon foundational models. These foundational models serve as the backbone for generative AI models, making the creation of new content possible. They possess a broad knowledge base, which is helpful for effective transfer learning. This transfer learning capability means they can be used to generate new content that is contextually appropriate across diverse domains. Foundational models represent a unified approach, where a single model can produce various outputs, and their extensive training contributes to strong performance. They are also good at identifying nuances like grammar and spelling mistakes.

LLMs are generative AI models with the ability to (seemingly) understand and generate text that resembles human writing based on a given input. They are considered a significant step forward in AI development. These models learn patterns in human language through training on a vast amount of text data, sourced from materials like books, articles, and websites. Developing and maintaining them can be complex, requiring considerable data, computing power, and engineering effort.

The process by which LLMs generate text involves predicting the probability of the next word, considering the words already used in the text. They learn to create coherent and contextually relevant sentences by adjusting their internal parameters to reduce the difference between their predictions and the actual outcomes observed in the training data. When generating text, the model typically selects the word with the highest probability as the next output and repeats this process.
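As an illustration of this loop, the following sketch mimics greedy next-word prediction over a tiny made-up vocabulary. The vocabulary, the hard-coded scores, and the greedy selection rule are all assumptions for demonstration; a real LLM computes its scores from billions of learned parameters rather than a lookup table.

```python
import math

# Toy vocabulary and a fake "model" that scores each word given the text so far.
# In a real LLM these scores (logits) come from the trained network; here they are made up.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def fake_logits(context):
    scores = {
        (): [2.0, 0.1, 0.1, 0.1, 0.1, 0.1],
        ("the",): [0.1, 2.5, 0.3, 0.1, 1.0, 0.1],
        ("the", "cat"): [0.1, 0.1, 2.8, 0.2, 0.1, 0.1],
        ("the", "cat", "sat"): [0.2, 0.1, 0.1, 2.6, 0.1, 0.3],
        ("the", "cat", "sat", "on"): [2.4, 0.1, 0.1, 0.1, 0.9, 0.1],
        ("the", "cat", "sat", "on", "the"): [0.1, 0.6, 0.1, 0.1, 2.7, 0.1],
    }
    return scores.get(tuple(context), [0.1, 0.1, 0.1, 0.1, 0.1, 2.0])

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Greedy decoding: repeatedly pick the highest-probability next word.
text = []
while len(text) < 10:
    probs = softmax(fake_logits(text))
    next_word = VOCAB[probs.index(max(probs))]
    text.append(next_word)
    if next_word == ".":
        break

print(" ".join(text))  # -> "the cat sat on the mat ."
```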

LLMs are presented as foundational models adapted for natural language processing (NLP) and language generation tasks. Because they are general-purpose, they can handle various tasks without needing specific training data for each one. For example, with appropriate input, they can answer questions, compose essays, summarize texts, translate languages, and even generate code. These capabilities make them applicable across numerous industries for tasks such as summarization, classification, creating Q&A chatbots, generating content, analyzing data, and extracting entities.

The architecture that makes these foundational models and LLMs possible is the transformer architecture. A relevant consideration when working with LLMs is their training cutoff. This refers to the point in time after which the model does not have any new information. Their understanding and knowledge are limited to the data they were trained on up to this date. This means they cannot access or utilize information that has emerged since their last training.

There are different types of LLMs. These include base LLMs, which are versatile and capable of handling many tasks without additional training. However, for specific tasks or domains, particularly in an enterprise context, they might not achieve the desired level of accuracy or reliability. Another type is instruction-based LLMs, sometimes referred to as instruction-tuned LLMs. These involve using a base LLM and providing explicit instructions within the prompt input. Examples include instructing the model to translate text or summarize an article. A third type is fine-tuned LLMs. This process involves taking a base LLM and training it further on a task where it might not perform optimally, often within a specific domain. For instance, training a model on medical literature could help it understand medical topics, or training it on customer service interactions could help it respond to customer inquiries in a particular industry. Fine-tuning can enhance the model’s accuracy or helpfulness for specific tasks or domains, but it requires additional data and training time.

An emerging trend is small language models (SLMs). SLMs are scaled-down versions of LLMs, offering many of the benefits of larger models while being more resource-efficient and accessible. They differ from LLMs like GPT-4 in aspects such as size, complexity, computational resource needs, training and operational costs, and performance quality. Techniques like knowledge distillation and transfer learning contribute to SLMs excelling in tasks like analysis, translation, and summarization with faster training. In some instances, SLMs can even match or surpass the performance of larger LLMs. A significant factor in the success of some SLMs, like the Phi series, has been strategic data selection. This approach prioritizes high-quality data over sheer quantity, incorporating textbook-quality data, synthetic datasets, and carefully chosen web data. This strategic data selection provides a foundation of common-sense reasoning and general knowledge, supporting strong performance across various tasks.

We differentiate between open source and commercial LLMs. Commercial offerings such as Azure OpenAI and OpenAI are known for their stability and readiness for enterprise use; thousands of enterprises reportedly depend on and use these platforms. Examples of other LLMs and providers, including open-source models, are also covered to show how the concepts are similar and how their APIs and SDKs are comparable to those of OpenAI.

All LLMs operate in similar ways, and we perform similar functions when we use them. Examples of these are described in the following list.

  • Prompts—A prompt is the way a user communicates with these models: text that describes the desired task in natural language. The output generated by the models is also text. The ability to express intentions in natural language, rather than conforming to machine input restrictions, is what characterizes prompts. The design and crafting of the text in a prompt is referred to as prompt engineering, which can be thought of as a new paradigm for programming the model. Prompt engineering involves understanding the capabilities of the AI model, its underlying training data, and how it responds to different inputs. Effective prompt engineering can improve the usefulness of AI models for various tasks. Prompts define the task a model should perform and can include instructions, the main content, examples, cues, and supporting documentation. Prompt engineering is often an iterative process, involving analyzing the model’s output and adjusting the prompt accordingly; a minimal example of sending a crafted prompt appears after this list.

  • Tokens—Tokens are the units that LLMs work with. They play a part in determining computational costs, as cost is related to the token count of both the prompt and the generated response. Managing tokens is relevant, especially to ensure that the input and output fit within the model’s maximum limit. Being aware of the total number of tokens in a conversation, including both input and output, is part of managing tokens, and may require truncating, omitting, or shortening text if the total exceeds the model’s limit. Unintentionally lengthy prompts can also lead to unexpectedly long and costly responses. A token-counting sketch appears after this list.

  • Embeddings—Embeddings are machine-learning representations used for large inputs such as words and longer pieces of text. They capture semantic similarities within a vector space, which is a collection of vectors. Embeddings make it possible to determine whether two chunks of text convey the same meaning by providing a similarity score. The underlying idea is that words with similar meanings should have vector representations that are close to each other, as measured by their distance: vectors separated by smaller distances indicate a higher degree of relatedness, while larger distances suggest low relatedness. These vectors, which are lists of floating-point numbers, are learned during the training process and are used to capture the meaning of words or phrases. They can serve as input for various machine learning tasks. Embeddings are also relevant in the context of retrieval-augmented generation (RAG), where they represent text data such as documents, phrases, and words, distilling the complexity of language into machine-interpretable mathematical vectors that encapsulate semantic essence and enable tasks like semantic search. A short similarity example appears after this list.

  • Model configuration—refers to parameters that can influence the model’s output. Parameters like temperature and top_p steer the model’s generation and its randomness. Setting a lower top_p value means the model considers a smaller set of the most probable words for the next output, resulting in more predictable text; a higher top_p value allows the model to consider a wider range of less likely words, potentially leading to more diverse generation. While temperature can go up to a value of 2, values above 1.2 are generally not recommended, as the model may start producing nonsensical text; temperature values around 0.8, and at most 1.2, are suggested for more creativity. Other parameters include the frequency penalty and the presence penalty. The frequency penalty reduces the chance of repeating a token proportionally to how often it has already appeared, helping prevent the repetition of the same text. The presence penalty reduces the chance of repeating any token that has appeared in the text so far, which can increase the likelihood of introducing new topics; a high presence penalty makes the model less likely to reuse the same token, encouraging new topics. The penalty value is subtracted from the log probability of a token each time it is generated. These penalties help steer the model and potentially improve generation results. The logit_bias parameter is another powerful tool to guide the model’s output by steering the likelihood of specific tokens. It can be used to decrease the likelihood of undesirable tokens (effectively banning them at a value of -100) or to increase the likelihood of preferred tokens (making them exclusive at a value of 100); smaller adjustments increase or decrease their probability in the output. Using logit_bias requires knowing which tokens correspond to specific words, and it needs care: excessive or inappropriate use can lead to nonsensical, overly repetitive, or biased outputs. An example request that sets several of these parameters appears after this list.

  • The context window—refers to the range of tokens or words surrounding a particular word or token that an LLM considers when making predictions. The context window helps the model understand the dependencies and relationships between words, which assists in generating more accurate and coherent predictions. LLMs have a maximum limit on the number of tokens they can process within their context window. This limit varies depending on the specific model. In practical applications like RAG, the usable length of the context window is shorter because space is needed for the generated output.

  • Model adaptation—involves customizing generative AI models, such as fine-tuning, to better suit specific use cases.
  • Emergent behavior—refers to capabilities or behaviors that a model exhibits that were not explicitly programmed or anticipated.
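
To make the prompt concept concrete, the following sketch sends a prompt that combines an instruction, the main content, and a cue for the expected output format, using the OpenAI Python SDK. The model name and the review text are assumptions for illustration; any comparable chat-completion endpoint could be used instead.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Instruction + main content + a cue ("Summary:") hinting at the expected output.
prompt = (
    "Summarize the following customer review in one sentence and state whether "
    "the sentiment is positive, negative, or neutral.\n\n"
    "Review: The checkout process was quick, but the package arrived two days late.\n\n"
    "Summary:"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute one you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```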
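
Counting tokens before calling a model helps keep requests within the context window and makes costs predictable. The sketch below uses the tiktoken tokenizer; the encoding name, the context limit, and the budget reserved for the output are assumptions that depend on the specific model.

```python
import tiktoken

# cl100k_base is one of tiktoken's encodings; pick the one matching your model.
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the main risks of rolling out a customer-facing chatbot."
tokens = encoding.encode(prompt)
print(f"Prompt uses {len(tokens)} tokens")

# Hypothetical budget: the prompt plus the space reserved for the answer must fit
# within the model's context window.
CONTEXT_LIMIT = 8192
RESERVED_FOR_OUTPUT = 1024
max_prompt_tokens = CONTEXT_LIMIT - RESERVED_FOR_OUTPUT

if len(tokens) > max_prompt_tokens:
    # Keep the most recent tokens and drop the oldest ones (one possible strategy).
    tokens = tokens[-max_prompt_tokens:]
    prompt = encoding.decode(tokens)
```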
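
The next sketch requests embeddings for three sentences and compares them with cosine similarity. The embedding model name is an assumption; the point is that the two sentences about a late delivery should score closer to each other than either does to the unrelated one.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def embed(text):
    # Assumed embedding model name; any embedding endpoint returning a vector works.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def cosine_similarity(a, b):
    # Scores near 1.0 indicate closely related text; lower scores indicate low relatedness.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

late_1 = embed("The delivery arrived two days late.")
late_2 = embed("My package showed up later than promised.")
other = embed("The soup was far too salty.")

print(cosine_similarity(late_1, late_2))  # expected to be the higher score
print(cosine_similarity(late_1, other))
```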
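
Finally, the request below shows where the configuration parameters discussed above are set in an OpenAI-style chat completion call. The specific values and the banned token ID are illustrative assumptions, not recommendations for every task.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4o-mini",             # assumed model name
    messages=[{"role": "user", "content": "Write a tagline for a coffee shop."}],
    temperature=0.8,                  # a moderately creative setting
    top_p=0.9,                        # sample only from the most probable tokens
    frequency_penalty=0.5,            # discourage repeating the same tokens
    presence_penalty=0.3,             # nudge the model toward new topics
    max_tokens=60,                    # cap the length (and cost) of the answer
    # logit_bias maps token IDs (not words) to adjustments; -100 effectively bans
    # the token. The ID here is a placeholder; look up real IDs with a tokenizer.
    logit_bias={12345: -100},
)
print(response.choices[0].message.content)
```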

Understanding these elements of how large language models work is a step toward applying generative AI techniques effectively.