LLM-CodeSlim: Automatic Code Minification and Cleanup for Efficient Use with LLMs

2024-09-14 at 5:51, admin, categories: large language models, Programming

Large language models (LLMs) have a limited context window. A whole project often does not fit into a single prompt, and the relevant code is usually spread across several files that have to be gathered in one place before you can ask a question about it.

To deal with this, I wrote a script that minifies a project's source code: it strips tabs, leading and trailing whitespace, blank lines, comment lines, and test code (everything after a stop word such as #[cfg(test)]). The script gathers all project files, or only selected ones, into a single file.

To use it, run the script in your project directory. It generates out.txt, a minified file containing the collected code, ready to be used with a large language model.

Before running the script, edit the following arrays to suit your project: folders_to_ignore, extensions_to_search, filenames_to_search, comment_chars, and stop_words.

Example configuration for a Rust project (all *.rs files are included in out.txt):

folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea" )   # Folders to ignore
extensions_to_search=( "rs" )                              # File extensions to search for
filenames_to_search=("Cargo.toml")                       # Filenames to search for
comment_chars=("#" "//" "/*")                            # Characters that denote comments
stop_words=("#[cfg(test)]")                              # Stop words after which to ignore the remaining lines in the file

Example configuration for a Rust project (only specific files are included in out.txt):

folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea" )   # Folders to ignore
extensions_to_search=( )                              # File extensions to search for
filenames_to_search=("Cargo.toml" "lib.rs" "core.rs")                       # Filenames to search for
comment_chars=("#" "//" "/*")                            # Characters that denote comments
stop_words=("#[cfg(test)]")                              # Stop words after which to ignore the remaining lines in the file
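To make the first configuration concrete: with those arrays, the Bash script shown below assembles one find command that prunes the ignored folders and then matches files by extension or by name. A minimal sketch of the resulting command follows. The wrapper function list_project_files and its optional directory argument are hypothetical and only there for illustration; the original script always searches the current directory.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of the original script) around the find
# command that the first example configuration expands to.
list_project_files() {
    local root="${1:-.}"
    (
        # Run in a subshell so the caller's working directory is untouched.
        cd "$root" || exit 1
        find . \
            \( -path './target' -prune -o -path './.git' -prune \
               -o -path './.github' -prune -o -path './.gitignore' -prune \
               -o -path './.idea' -prune \) \
            -o \( -name '*.rs' -o -name 'Cargo.toml' \) -type f -print
    )
}
```

Calling list_project_files with a project directory prints one relative path per line. Because each -path pattern starts with ./, only top-level folders with those names are pruned; a nested folder that happens to be called target would still be searched.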

Bash version of the script:

#!/bin/bash

# Remove existing out.txt if it exists
rm -f out.txt

# Arrays
folders_to_ignore=("target" ".git" ".github" ".gitignore" ".idea" )   # Folders to ignore
extensions_to_search=( "rs" )                              # File extensions to search for
filenames_to_search=("Cargo.toml" "core.rs" "text.rs" "json.rs")                       # Filenames to search for
comment_chars=("#" "//" "/*")                            # Characters that denote comments
stop_words=("#[cfg(test)]")                              # Stop words after which to ignore the remaining lines in the file

# Build the 'find' command

# Start with the basic 'find' command
find_cmd="find ."

# Add folders to ignore
if [ ${#folders_to_ignore[@]} -gt 0 ]; then
    ignore_dir_expr=""
    for dir in "${folders_to_ignore[@]}"; do
        if [ -n "$ignore_dir_expr" ]; then
            ignore_dir_expr+=" -o "
        fi
        ignore_dir_expr+="-path './$dir' -prune"
    done
    find_cmd+=" \( $ignore_dir_expr \) -o"
fi

# Add conditions to search for files
find_cmd+=" \( "

name_patterns=()

# Add file extensions
for ext in "${extensions_to_search[@]}"; do
    name_patterns+=("-name '*.$ext'")
done

# Add filenames
for fname in "${filenames_to_search[@]}"; do
    name_patterns+=("-name '$fname'")
done

# Combine all patterns using -o
for ((i=0; i<${#name_patterns[@]}; i++)); do
    find_cmd+=" ${name_patterns[$i]}"
    if [ $i -lt $((${#name_patterns[@]} - 1)) ]; then
        find_cmd+=" -o"
    fi
done

find_cmd+=" \) -type f -print"

# Print the final command for debugging (you can comment out this line)
# echo "Running command: $find_cmd"

# Build the regular expression for comments
comment_pattern=""
for ((i=0; i<${#comment_chars[@]}; i++)); do
    # Escape special characters in comment characters
    escaped_char=$(printf '%s\n' "${comment_chars[$i]}" | sed 's/[][(){}.*+?^$\|/]/\\&/g')
    if [ $i -eq 0 ]; then
        comment_pattern="$escaped_char"
    else
        comment_pattern="$comment_pattern|$escaped_char"
    fi
done

# Execute the 'find' command and process the results
while IFS= read -r filepath; do
    printf '\n#### %s ####\n' "$filepath" >> out.txt
    stop=false
    # Process the file line by line
    # The extra test keeps a final line that has no trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "$stop" = true ]; then
            break
        fi
        # Remove tabs
        line="${line//$'\t'/}"
        # Remove leading spaces
        line="${line#"${line%%[![:space:]]*}"}"
        # Remove trailing spaces
        line="${line%"${line##*[![:space:]]}"}"
        # Skip lines that are empty or contain only spaces
        if [[ -z "$line" ]]; then
            continue
        fi
        # Check for stop words
        for stop_word in "${stop_words[@]}"; do
            if [[ "$line" == "$stop_word" ]]; then
                stop=true
                break
            fi
        done
        if [ "$stop" = true ]; then
            break
        fi
        # Skip lines that are comments
        if [[ "$line" =~ ^($comment_pattern) ]]; then
            continue
        fi
        # Write the processed line to out.txt
        echo "$line" >> out.txt
    done < "$filepath"
done < <(eval "$find_cmd")
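The per-line rules are easier to follow when tried on a small input. The sketch below repeats the same steps (remove tabs, trim, skip blank lines, stop at a stop word, skip comment lines) in a hypothetical function slim_stream that reads standard input instead of walking the project. The function name and the hard-coded comment_pattern are assumptions made for the illustration; the script above derives the pattern from comment_chars.

```shell
#!/bin/bash
# Hypothetical helper: applies the script's per-line rules to stdin.
comment_pattern='#|//|/\*'        # hand-written equivalent of comment_chars
stop_words=("#[cfg(test)]")

slim_stream() {
    local line stop_word
    while IFS= read -r line || [ -n "$line" ]; do
        line="${line//$'\t'/}"                      # remove tabs
        line="${line#"${line%%[![:space:]]*}"}"     # trim leading whitespace
        line="${line%"${line##*[![:space:]]}"}"     # trim trailing whitespace
        if [ -z "$line" ]; then
            continue                                # skip blank lines
        fi
        for stop_word in "${stop_words[@]}"; do
            if [ "$line" = "$stop_word" ]; then
                return 0                            # drop the rest of the input
            fi
        done
        if [[ "$line" =~ ^($comment_pattern) ]]; then
            continue                                # skip comment lines
        fi
        printf '%s\n' "$line"
    done
    return 0
}

# Small demonstration on an inline Rust snippet
slim_stream <<'EOF'
// entry point
#[derive(Debug)]
struct Point { x: i32 }

fn main() {
    let p = Point { x: 1 };   // trailing comment
    println!("{:?}", p);
}
#[cfg(test)]
mod tests {}
EOF
```

Two things are worth noticing in the output. Only lines that start with a comment marker are dropped, so the trailing comment after the let statement survives. And because # is in the comment list (it is there for Cargo.toml), the #[derive(Debug)] attribute is dropped as well; remove "#" from comment_chars if Rust attributes have to be kept.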

PowerShell version of the script:

# Remove existing out.txt if it exists
if (Test-Path -Path "out.txt") {
    Remove-Item -Path "out.txt" -Force
}

# Define arrays

# Folders and files to ignore during the search
$foldersToIgnore = @("target", ".git", ".github", ".gitignore", ".idea")

# File extensions to search for
$extensionsToSearch = @("rs")

# Specific filenames to search for
$filenamesToSearch = @("Cargo.toml", "core.rs", "text.rs", "json.rs")

# Characters that denote comments in the files
$commentChars = @("#", "//", "/*")

# Words that, when encountered, will stop processing the current file
$stopWords = @("#[cfg(test)]")

# Function to build file filtering based on provided criteria
function Get-FilteredFiles {
    param (
        [string[]]$IgnoreFolders,
        [string[]]$Extensions,
        [string[]]$Filenames
    )

    # Build a regex pattern for ignored folders
    if ($IgnoreFolders.Count -gt 0) {
        $ignorePattern = ($IgnoreFolders | ForEach-Object { [regex]::Escape($_) }) -join '|'
    } else {
        $ignorePattern = ""
    }

    # Build a list of filters for extensions and filenames
    $nameFilters = @()
    foreach ($ext in $Extensions) {
        $nameFilters += "*.$ext"
    }
    foreach ($fname in $Filenames) {
        $nameFilters += $fname
    }

    # Get all files with the specified extensions or filenames
    Get-ChildItem -Path . -Recurse -File -Include $nameFilters | Where-Object {
        if ($ignorePattern) {
            # Check if the full path contains any of the ignored folders
            # (the name must sit between two path separators, \ or /)
            -not ($_.FullName -match "[\\/]($ignorePattern)[\\/]")
        } else {
            $true
        }
    }
}

# Build a regex pattern for comments
$escapedCommentChars = $commentChars | ForEach-Object { [regex]::Escape($_) }
$commentPattern = $escapedCommentChars -join '|'

# Get the list of files to process
$files = Get-FilteredFiles -IgnoreFolders $foldersToIgnore -Extensions $extensionsToSearch -Filenames $filenamesToSearch

# Process each file
foreach ($file in $files) {
    # Add file header to out.txt
    "`n#### $($file.FullName) ####" | Out-File -FilePath "out.txt" -Append -Encoding utf8

    $stop = $false

    # Read the file line by line
    Get-Content -Path $file.FullName | ForEach-Object {
        if ($stop) {
            return
        }

        $line = $_

        # Remove tabs
        $line = $line -replace "`t", ""

        # Trim leading and trailing spaces
        $line = $line.Trim()

        # Skip empty lines
        if ([string]::IsNullOrWhiteSpace($line)) {
            return
        }

        # Check for stop words
        foreach ($stopWord in $stopWords) {
            if ($line -eq $stopWord) {
                $stop = $true
                break
            }
        }
        if ($stop) {
            return
        }

        # Skip lines that are comments
        if ($line -match "^($commentPattern)") {
            return
        }

        # Write the processed line to out.txt
        $line | Out-File -FilePath "out.txt" -Append -Encoding utf8
    }
}

P.S.

Copy the contents of out.txt to the clipboard and paste them as text into the LLM input box. Do not attach out.txt to the question as a file. For optimization reasons, an LLM usually processes an attached file by extracting a summary from it and then answers from that summary. If you instead paste the contents of out.txt into the input box and then ask your question, the model answers on the basis of the entire contents of out.txt.
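The copy step can be done from the terminal as well. The following is a minimal sketch, assuming one of the common clipboard tools is installed: pbcopy (macOS), wl-copy (Wayland), xclip (X11), or clip.exe (Windows, for example under WSL). The function name copy_to_clipboard is hypothetical and is not part of the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: copy a file (out.txt by default) to the system
# clipboard with whichever common clipboard tool is available.
copy_to_clipboard() {
    local file="${1:-out.txt}"
    if command -v pbcopy >/dev/null 2>&1; then        # macOS
        pbcopy < "$file"
    elif command -v wl-copy >/dev/null 2>&1; then     # Linux, Wayland
        wl-copy < "$file"
    elif command -v xclip >/dev/null 2>&1; then       # Linux, X11
        xclip -selection clipboard < "$file"
    elif command -v clip.exe >/dev/null 2>&1; then    # Windows (WSL, Git Bash)
        clip.exe < "$file"
    else
        echo "No clipboard tool found; open $file and copy it manually." >&2
        return 1
    fi
}
```

After the minifier has finished, running copy_to_clipboard without arguments in the project directory puts out.txt on the clipboard, ready to be pasted into the input box.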

The source code of the scripts is on GitHub. If you have improvements, open a pull request.

Author: igumnov
