Summary

This post builds a locally hosted voice assistant for security operations from open-source components. The model reasons over a set of defined tools and invokes them on demand, with no training or fine-tuning required.

Component         Tool                          Role
----------------  ----------------------------  --------------------------------------
Speech-to-text    faster-whisper                Transcribe voice input locally
LLM               Ollama (Llama 3.1 / Phi-4)    Tool selection and response generation
Text-to-speech    Kokoro                        Speak results back
Browser control   Playwright                    Read authenticated dashboards
Ticket search     Jira REST API                 Query tickets by keyword/person
Display control   PowerShell via subprocess     Toggle scheduled power tasks

Out of scope for this post: multi-turn memory, authentication from scratch, any UI.


1. Architecture

The assistant runs as a background process and follows this loop on every interaction:

Microphone input
      │
      ▼
Wake word detection  ──  keyword match on transcription (or Porcupine)
      │
      ▼
Speech-to-text  ──  faster-whisper
      │
      ▼
LLM with tool calling  ──  Ollama (Llama 3.1 or Phi-4)
      │
      ├──► Tool: Display Control  ──  PowerShell via subprocess
      ├──► Tool: Jira Search      ──  Jira REST API
      ├──► Tool: Browser Read     ──  Playwright
      └──► Tool: Web Fetch        ──  httpx + BeautifulSoup
      │
      ▼
Text-to-speech  ──  Kokoro
      │
      ▼
Speaker output

For requests that need live data or an action, the LLM does not answer directly. It selects a tool and its arguments, Python executes the tool and passes the result back, and the model then formulates a spoken response.

Model Selection

Criteria: 8 GB VRAM target, native tool/function calling, response time under 3 seconds.

  • Llama 3.1 8B Q4_K_M — best general-purpose option, strong tool calling
  • Phi-4 Mini — faster, smaller, good for CPU-only deployments
ollama pull llama3.1
# or
ollama pull phi4-mini

2. Implementation

2.1 Dependencies

pip install ollama faster-whisper pyaudio kokoro playwright httpx \
            beautifulsoup4 requests soundfile sounddevice pvporcupine
playwright install chromium

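PyAudio depends on PortAudio. Recent PyAudio releases publish Windows wheels on PyPI with PortAudio bundled, so the pip install above is usually enough. If no wheel exists for your Python version, the older workaround is pip install pipwin && pipwin install pyaudio, or a prebuilt wheel from Christoph Gohlke's repository; both may no longer be maintained.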

2.2 Speech-to-Text

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cuda", compute_type="float16")
# CPU: WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(audio_path: str) -> str:
    segments, _ = model.transcribe(audio_path, beam_size=5)
    return " ".join(s.text.strip() for s in segments)

2.3 Audio Capture

Energy-based voice activity detection (VAD). Recording stops once the level stays below SILENCE_DB for SILENCE_S seconds:

import pyaudio, wave
import numpy as np

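RATE       = 16000   # sample rate in Hz; Whisper models operate on 16 kHz audio
CHUNK      = 1024    # frames per read
SILENCE_DB = 40      # threshold in dB relative to an int16 amplitude of 1
SILENCE_S  = 1.5     # seconds of continuous silence that end the recording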

def record_command(output_path="command.wav") -> str:
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1,
                     rate=RATE, input=True, frames_per_buffer=CHUNK)
    frames, silent_chunks = [], 0
    silence_limit = int(RATE / CHUNK * SILENCE_S)

    while True:
        data = stream.read(CHUNK)
        frames.append(data)
        rms = np.sqrt(np.mean(np.frombuffer(data, dtype=np.int16).astype(np.float32) ** 2))
        db  = 20 * np.log10(rms + 1e-6)
        silent_chunks = silent_chunks + 1 if db < SILENCE_DB else 0
        if silent_chunks >= silence_limit and len(frames) > silence_limit:
            break

    stream.stop_stream(); stream.close(); pa.terminate()
    with wave.open(output_path, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
    return output_path
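
To make the SILENCE_DB threshold concrete, the sketch below repeats the level calculation from record_command using only the standard library. The helper name chunk_db is hypothetical and not part of the assistant's code; it assumes 16-bit mono PCM in native byte order, as produced by the recording loop above.

```python
import array
import math

def chunk_db(data: bytes) -> float:
    """Level of a 16-bit PCM chunk in dB relative to an amplitude of 1."""
    samples = array.array("h", data)          # "h" = signed 16-bit integers
    if not samples:
        return 20 * math.log10(1e-6)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms + 1e-6)        # same epsilon as record_command

# Three synthetic chunks: digital silence, a signal at the threshold,
# and a louder signal well above it.
quiet  = array.array("h", [0] * 1024).tobytes()
border = array.array("h", [100] * 1024).tobytes()
speech = array.array("h", [1000, -1000] * 512).tobytes()

for name, chunk in (("quiet", quiet), ("border", border), ("speech", speech)):
    print(name, round(chunk_db(chunk), 1))
```

With SILENCE_DB = 40, anything quieter than an RMS amplitude of about 100 (out of a maximum of 32767) counts as silence. A noisy room may need a higher threshold.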

2.4 Wake Word Detection

WAKE_WORD = "jarvis"

def contains_wake_word(text: str) -> bool:
    return text.lower().strip().startswith(WAKE_WORD)

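def strip_wake_word(text: str) -> str:
    stripped = text.strip()
    if stripped.lower().startswith(WAKE_WORD):
        # Drop the wake word and any punctuation Whisper attaches to it
        # ("Jarvis, ..."), keeping the original casing of the command.
        return stripped[len(WAKE_WORD):].lstrip(" ,.!?:;-")
    return stripped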

Porcupine by Picovoice can detect the wake word before Whisper is invoked, which lowers latency and CPU use. The free tier supports one custom wake word.

2.5 Tool Definitions

Tools follow the OpenAI function-calling schema that Ollama uses:

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "control_displays",
            "description": (
                "Turn the SOC display monitors on or off. "
                "Use when the user asks to start, wake, turn on, shut down, or turn off monitors."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "action": {
                        "type": "string",
                        "enum": ["on", "off"],
                        "description": "Whether to turn displays on or off."
                    }
                },
                "required": ["action"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_jira",
            "description": (
                "Search Jira tickets by keyword, topic, or person. "
                "Use when the user asks about a ticket or what someone said about a topic."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query, e.g. 'blocking traffic on weekends Andrew'"
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_dashboard",
            "description": (
                "Read a metric from an authenticated security dashboard. "
                "Use for asset counts, alert counts, or any metric on the SOC displays."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "dashboard": {
                        "type": "string",
                        "enum": ["sentinelone", "jira", "darktrace"],
                        "description": "Which dashboard to read from."
                    },
                    "metric": {
                        "type": "string",
                        "description": "What to look for, e.g. 'total enrolled assets'"
                    }
                },
                "required": ["dashboard", "metric"]
            }
        }
    }
]
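
Small models occasionally return arguments that do not match the declared schema, such as an action outside the enum. A minimal sketch of a check that could run before dispatch follows. validate_args is a hypothetical helper, and it only covers the required and enum keywords used in the definitions above, not full JSON Schema.

```python
def validate_args(tool: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments look usable."""
    params = tool["function"]["parameters"]
    props = params.get("properties", {})
    problems = []
    for name in params.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems

# Trimmed copy of the control_displays definition, for illustration only.
DISPLAY_TOOL = {
    "type": "function",
    "function": {
        "name": "control_displays",
        "parameters": {
            "type": "object",
            "properties": {"action": {"type": "string", "enum": ["on", "off"]}},
            "required": ["action"],
        },
    },
}

print(validate_args(DISPLAY_TOOL, {"action": "off"}))    # no problems
print(validate_args(DISPLAY_TOOL, {"action": "sleep"}))  # enum violation
```

Returning the problems as text, rather than raising, lets the caller pass them back to the model as the tool result so it can retry or explain.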

2.6 Tool Implementations

Display control

import subprocess

def control_displays(action: str) -> str:
    if action == "off":
        # Note: `shutdown /s` powers off the whole host after 60 seconds, not only the monitors.
        subprocess.run(["powershell", "-Command",
            "Disable-ScheduledTask -TaskName 'SOC-Display-Shutdown'; shutdown /s /t 60"])
        return "Displays shutting down in 60 seconds."
    elif action == "on":
        subprocess.run(["powershell", "-Command",
            "Enable-ScheduledTask -TaskName 'SOC-Display-Launch'"])
        return "Display launch task re-enabled."
    return "Unknown action."

Jira search

import requests, os

JIRA_BASE  = os.getenv("JIRA_BASE_URL")
JIRA_USER  = os.getenv("JIRA_USER")
JIRA_TOKEN = os.getenv("JIRA_API_TOKEN")

def search_jira(query: str) -> str:
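    # Escape backslashes and quotes so spoken input cannot break out of the JQL string.
    safe = query.replace("\\", "\\\\").replace('"', '\\"')
    jql  = f'text ~ "{safe}" ORDER BY updated DESC'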
    resp = requests.get(
        f"{JIRA_BASE}/rest/api/3/search",
        params={"jql": jql, "maxResults": 3, "fields": "summary,description,comment"},
        auth=(JIRA_USER, JIRA_TOKEN),
        timeout=10
    )
    if resp.status_code != 200:
        return f"Jira search failed: {resp.status_code}"
    issues = resp.json().get("issues", [])
    if not issues:
        return "No matching tickets found."
    results = []
    for issue in issues:
        comments = issue["fields"].get("comment", {}).get("comments", [])
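        # API v3 returns the comment body as an Atlassian Document Format dict,
        # not plain text, so it is stringified explicitly here.
        last     = str(comments[-1]["body"]) if comments else "No comments."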
        results.append(f"{issue['key']}: {issue['fields']['summary']}\nLatest comment: {last}")
    return "\n\n".join(results)
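
Jira Cloud's v3 API returns comment bodies as Atlassian Document Format (ADF), a nested JSON tree, rather than plain text. The sketch below flattens such a tree into a string suitable for speech. adf_to_text is a hypothetical helper, and it assumes the common shape where text lives in nodes of type "text" under nested content lists; mentions, emoji, and other node types are ignored.

```python
def adf_to_text(node) -> str:
    """Collect the text of an ADF node and its children, depth-first."""
    if isinstance(node, str):          # plain-text body, as older API versions return
        return node
    if not isinstance(node, dict):
        return ""
    if node.get("type") == "text":
        return node.get("text", "")
    parts = [adf_to_text(child) for child in node.get("content", [])]
    # Join block-level children with spaces, inline children without.
    sep = " " if node.get("type") in ("doc", "bulletList", "orderedList") else ""
    return sep.join(p for p in parts if p)

# Hand-written sample document, for illustration only.
sample = {
    "type": "doc", "version": 1,
    "content": [
        {"type": "paragraph", "content": [
            {"type": "text", "text": "Blocked on "},
            {"type": "text", "text": "weekends only."}]},
        {"type": "paragraph", "content": [
            {"type": "text", "text": "Rule updated."}]},
    ],
}
print(adf_to_text(sample))  # Blocked on weekends only. Rule updated.
```

In search_jira, a helper like this would wrap comments[-1]["body"] before the text is handed back to the model.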

Browser read

from playwright.sync_api import sync_playwright
import os

PROFILE = os.path.expandvars(r'%LOCALAPPDATA%\Microsoft\Edge\User Data')
DASHBOARD_URLS = {
    "sentinelone": "https://your-s1-console.com",
    "jira":        "https://yourorg.atlassian.net",
    "darktrace":   "https://your-darktrace.com",
}

def read_dashboard(dashboard: str, metric: str) -> str:
    url = DASHBOARD_URLS.get(dashboard)
    if not url:
        return f"Unknown dashboard: {dashboard}"
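    with sync_playwright() as p:
        ctx = p.chromium.launch_persistent_context(
            user_data_dir=PROFILE, channel='msedge', headless=True)
        try:
            page = ctx.new_page()
            page.goto(url, wait_until="networkidle")
            text = page.inner_text("body")
        finally:
            ctx.close()  # always release the profile, even if navigation fails
    # `metric` is not used for extraction here: the first 3000 characters of page
    # text go back to the LLM, which picks out the requested value.
    return text[:3000]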

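headless=True is fine for reads and does not open any visible window alongside the displays. Flip to False and run in a separate process if the session requires a visible browser to stay authenticated. Chromium locks a profile directory while a browser is using it, so if the display windows run Edge on this same profile the launch may fail; pointing PROFILE at a dedicated copy of the profile is one way around that.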
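
read_dashboard accepts a metric argument but returns the start of the page regardless of what was asked. One possible refinement, sketched here with a hypothetical helper named lines_matching, keeps only the lines of page text that mention a word from the metric, plus a little surrounding context. The keyword matching is deliberately naive and is an assumption; real dashboards may label values differently.

```python
def lines_matching(text: str, metric: str, context: int = 1, limit: int = 3000) -> str:
    """Keep lines containing any metric word, with `context` lines around each hit."""
    words = [w for w in metric.lower().split() if len(w) > 2]
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    keep = set()
    for i, line in enumerate(lines):
        if any(w in line.lower() for w in words):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    if not keep:                       # nothing matched: fall back to the page start
        return text[:limit]
    return "\n".join(lines[i] for i in sorted(keep))[:limit]

# Made-up page text, for illustration only.
page = "Home\nSettings\nTotal enrolled assets\n1,284\nOpen alerts\n17\nFooter"
print(lines_matching(page, "enrolled assets"))
```

Sending the model three relevant lines instead of 3000 characters of navigation text should also shorten the second LLM call.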

2.7 LLM Loop

import ollama

MODEL = "llama3.1"
TOOL_FUNCTIONS = {
    "control_displays": control_displays,
    "search_jira":      search_jira,
    "read_dashboard":   read_dashboard,
}
SYSTEM_PROMPT = (
    "You are a concise voice assistant for a security operations center. "
    "Answer in one or two sentences. Summarize tool results clearly and briefly. "
    "Do not repeat the user's question."
)

def run_agent(command: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": command},
    ]
    response = ollama.chat(model=MODEL, messages=messages, tools=TOOLS)
    msg = response["message"]

    if not msg.get("tool_calls"):
        return msg["content"]

    # Record the assistant's tool-call message once, then one tool message per call.
    messages.append(msg)
    for call in msg["tool_calls"]:
        fn     = TOOL_FUNCTIONS.get(call["function"]["name"])
        result = fn(**call["function"]["arguments"]) if fn else "Tool not found."
        messages.append({"role": "tool", "content": str(result)})

    return ollama.chat(model=MODEL, messages=messages)["message"]["content"]
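
If a tool raises, run_agent as written lets the exception propagate to the caller and nothing is spoken. A minimal sketch of a dispatcher that turns such failures into text the model can relay follows. dispatch_tool is a hypothetical name, and the call is assumed to be a plain dict in the same shape run_agent indexes into.

```python
def dispatch_tool(call: dict, registry: dict) -> str:
    """Run one tool call and always return a string for the 'tool' message."""
    name = call["function"]["name"]
    args = call["function"].get("arguments") or {}
    fn = registry.get(name)
    if fn is None:
        return f"Tool not found: {name}"
    try:
        return str(fn(**args))
    except TypeError as exc:           # wrong or missing arguments from the model
        return f"Bad arguments for {name}: {exc}"
    except Exception as exc:           # network errors, timeouts, and the like
        return f"{name} failed: {exc}"

# Stand-in tools for illustration; the real registry would be TOOL_FUNCTIONS.
def fake_displays(action: str) -> str:
    return f"displays {action}"

def fake_jira(query: str) -> str:
    raise RuntimeError("connection refused")

registry = {"control_displays": fake_displays, "search_jira": fake_jira}
print(dispatch_tool({"function": {"name": "control_displays",
                                  "arguments": {"action": "on"}}}, registry))
print(dispatch_tool({"function": {"name": "search_jira",
                                  "arguments": {"query": "vpn"}}}, registry))
```

Because every path returns a string, the second ollama.chat call always has a tool result to summarize, even when the result is an error.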

2.8 Text-to-Speech (Kokoro)

import numpy as np
import sounddevice as sd
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" = American English
SAMPLE_RATE = 24000                  # Kokoro outputs 24 kHz audio

def speak(text: str):
    # Kokoro yields one audio chunk per text segment. Collect all of them so
    # longer responses are not cut off after the first segment.
    generator = pipeline(text, voice="af_heart", speed=1.0)
    chunks = [np.asarray(audio) for _, _, audio in generator]
    if not chunks:
        return
    sd.play(np.concatenate(chunks), SAMPLE_RATE); sd.wait()

Kokoro uses CUDA when available, falls back to CPU. On 8GB VRAM, short responses generate in under a second. For CPU-only deployments, Piper is faster.

2.9 Main Loop

import time

def main():
    while True:
        try:
            text = transcribe(record_command())
            if not contains_wake_word(text):
                continue
            command = strip_wake_word(text)
            if not command:
                speak("Yes?")
                command = transcribe(record_command())
            speak(run_agent(command))
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"Error: {e}")
            time.sleep(1)

if __name__ == "__main__":
    main()

2.10 Task Scheduler Registration

$action = New-ScheduledTaskAction `
    -Execute  "python.exe" `
    -Argument "C:\SOC\assistant.py"
$trigger = New-ScheduledTaskTrigger -AtLogOn
Register-ScheduledTask `
    -TaskName "SOC-AI-Assistant" `
    -Action   $action `
    -Trigger  $trigger `
    -RunLevel Highest `
    -Force

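Add time.sleep(10) at the top of assistant.py so the Ollama service has time to start before the first request. The model itself is loaded on that first request, so the first answer after logon will be slower than later ones.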
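
A fixed sleep either waits too long or not long enough. An alternative is to poll until the service responds. The sketch below is generic: wait_until_ready is a hypothetical helper, and the probe passed to it is an assumption. In assistant.py the probe could be something like lambda: ollama.list(), which should raise while the server is unreachable.

```python
import time

def wait_until_ready(probe, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call `probe` until it stops raising, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            probe()
            return True
        except Exception:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)

# Illustration with a stand-in probe that fails twice before succeeding.
attempts = {"n": 0}

def flaky_probe():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("not up yet")

print(wait_until_ready(flaky_probe, timeout=5, interval=0.01))  # True
```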


3. Future Extensions

  • Persistent context — carry last N turns for natural follow-up questions
  • Porcupine wake word — replace always-transcribing loop with on-device keyword detection
  • More tools — CVE lookups, firewall rule queries, alert triage
  • Scheduled briefings — summarize overnight alerts and open tickets at shift start
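
The first extension, carrying the last N turns, can be sketched with a bounded deque. ConversationHistory and its methods are hypothetical names; the message dicts follow the role/content shape already used in run_agent.

```python
from collections import deque

class ConversationHistory:
    """Keep the system prompt plus the most recent `max_turns` user/assistant pairs."""

    def __init__(self, system_prompt: str, max_turns: int = 3):
        self.system = {"role": "system", "content": system_prompt}
        self.turns = deque(maxlen=max_turns)   # oldest pair drops off automatically

    def add_turn(self, user_text: str, assistant_text: str) -> None:
        self.turns.append((user_text, assistant_text))

    def messages_for(self, command: str) -> list[dict]:
        msgs = [self.system]
        for user_text, assistant_text in self.turns:
            msgs.append({"role": "user", "content": user_text})
            msgs.append({"role": "assistant", "content": assistant_text})
        msgs.append({"role": "user", "content": command})
        return msgs

# Made-up exchange, for illustration only.
history = ConversationHistory("You are a concise voice assistant.", max_turns=2)
history.add_turn("how many open alerts", "Seventeen.")
history.add_turn("and yesterday", "Twelve.")
history.add_turn("who owns the oldest", "Andrew.")
print(len(history.messages_for("what did he say about it")))  # 6
```

Capping the number of turns keeps the prompt short, which matters for the response-time target on an 8 GB GPU.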