Building a Local, Privacy-First Tool-Calling Agent with Gemma 4 and Ollama

The landscape of open-weight large language models has been significantly reshaped by the recent introduction of the Gemma 4 model family. Developed by Google, these models aim to deliver advanced capabilities while adhering to a permissive Apache 2.0 license, a move that grants machine learning practitioners enhanced control over their infrastructure and a stronger guarantee of data privacy. This new generation of models, ranging from the substantial 31B and architecturally intricate 26B Mixture of Experts (MoE) variants to more streamlined, edge-optimized versions, brings native support for agentic workflows to the forefront. Crucially, Gemma 4 models have been fine-tuned to reliably generate structured JSON outputs and directly invoke function calls based on system instructions. This capability transforms them from purely generative text engines into practical systems capable of executing workflows and interacting with external APIs in a localized environment.
The evolution of language models from closed-loop conversationalists to dynamic agents hinges on the development of "tool calling," also known as "function calling." Historically, if a language model was asked for real-time information, such as current sensor readings or live market data, its responses were limited to apologies for its inability to access external data or, in some cases, speculative or "hallucinated" answers. Tool calling addresses this fundamental limitation by acting as a bridge that empowers static models to become dynamic, autonomous agents.
The Mechanism of Tool Calling
When tool calling is enabled, a language model evaluates a user’s prompt against a predefined registry of available programmatic tools. These tools are typically described using a JSON schema, which outlines their purpose, expected inputs, and output formats. Instead of attempting to synthesize an answer solely from its internal knowledge base, the model pauses its inference process. It then formats a structured request specifically designed to trigger an external function, awaiting the result. Once the external function executes and returns its output, this result is processed by the host application and fed back to the model. The model then synthesizes this new, live context to formulate a grounded and accurate final response. This iterative process allows language models to interact with the real world in a controlled and predictable manner.
Setting the Stage: Ollama and Gemma 4:E2B
To construct a truly local and privacy-first tool-calling system, the integration of Ollama as the local inference runner with the gemma4:e2b model (Edge 2 billion parameters) is key. Ollama is a popular platform designed to simplify the process of downloading, running, and managing large language models locally. Its user-friendly interface and efficient resource management make it an ideal choice for developers looking to experiment with and deploy LLMs without the complexities of cloud-based infrastructure.
The gemma4:e2b model, specifically engineered for mobile devices and Internet of Things (IoT) applications, represents a significant advancement in on-device AI capabilities. It achieves an effective 2 billion parameter footprint during inference, reducing system memory usage and enabling low-latency execution. By operating entirely offline, this model eliminates concerns about rate limits and API costs, while simultaneously ensuring strict data privacy. Despite its remarkably small size, Google has ensured that gemma4:e2b inherits the multimodal properties and native function-calling capabilities found in its larger Gemma 4 counterparts. This makes it an excellent foundation for developing fast, responsive desktop agents, and crucially, it allows for the exploration of the new model family’s capabilities without the necessity of high-end GPU hardware.
The Codebase: Orchestrating the Agent
The implementation of this agent prioritizes a zero-dependency philosophy, relying solely on standard Python libraries such as urllib and json. This approach ensures maximum portability and transparency, while also preventing unnecessary bloat. The complete codebase for this tutorial is available on GitHub, providing a comprehensive resource for developers.

The architectural flow of the application operates as follows:
- User Input: The user provides a natural language query to the agent.
- Initial Model Call: The query is sent to the gemma4:e2b model via Ollama, along with a defined set of available tools described by their JSON schemas.
- Tool Identification: If the model determines that an external tool is necessary to answer the query, it generates a JSON output specifying the tool to be called and its required arguments.
- Tool Execution: The Python application parses this JSON, identifies the corresponding Python function for the requested tool, and executes it with the provided arguments.
- Result Injection: The output from the executed Python function is then formatted as a "tool" role message and added back to the conversation history.
- Final Response Generation: The updated conversation history, including the tool’s result, is sent back to the gemma4:e2b model. The model then synthesizes this information to generate a coherent and contextually relevant natural language response to the user.
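The `call_ollama` helper that appears in the later snippets is not shown in full in this article. A minimal, standard-library-only sketch, assuming Ollama's default local chat endpoint at `localhost:11434`, could look like this (`build_request` is a hypothetical helper name introduced here for clarity):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default chat endpoint

def build_request(payload: dict) -> urllib.request.Request:
    """Serializes a chat payload into a JSON POST request for the Ollama API."""
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_ollama(payload: dict) -> dict:
    """Sends the payload to the local Ollama server and parses the JSON reply."""
    with urllib.request.urlopen(build_request(payload)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the serialization logic easy to inspect and test without a running Ollama instance.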
Crafting the Tools: The get_current_weather Function
The effectiveness of any tool-calling agent is directly proportional to the quality and utility of its underlying functions. Our initial tool, get_current_weather, is designed to retrieve real-time weather data for a specified location using the open-source Open-Meteo API.
import urllib.parse
import urllib.request
import json

def get_current_weather(city: str, unit: str = "celsius") -> str:
    """Gets the current temperature for a given city using open-meteo API."""
    try:
        # Geocode the city to get latitude and longitude
        geo_url = f"https://geocoding-api.open-meteo.com/v1/search?name={urllib.parse.quote(city)}&count=1"
        geo_req = urllib.request.Request(geo_url, headers={'User-Agent': 'Gemma4ToolCalling/1.0'})
        with urllib.request.urlopen(geo_req) as response:
            geo_data = json.loads(response.read().decode('utf-8'))
        if "results" not in geo_data or not geo_data["results"]:
            return f"Could not find coordinates for city: {city}."
        location = geo_data["results"][0]
        lat = location["latitude"]
        lon = location["longitude"]
        country = location.get("country", "")
        # Fetch the weather
        temp_unit = "fahrenheit" if unit.lower() == "fahrenheit" else "celsius"
        weather_url = (
            f"https://api.open-meteo.com/v1/forecast?latitude={lat}&longitude={lon}"
            f"&current=temperature_2m,wind_speed_10m&temperature_unit={temp_unit}"
        )
        weather_req = urllib.request.Request(weather_url, headers={'User-Agent': 'Gemma4ToolCalling/1.0'})
        with urllib.request.urlopen(weather_req) as response:
            weather_data = json.loads(response.read().decode('utf-8'))
        if "current" in weather_data:
            current = weather_data["current"]
            temp = current["temperature_2m"]
            wind = current["wind_speed_10m"]
            temp_unit_str = weather_data["current_units"]["temperature_2m"]
            wind_unit_str = weather_data["current_units"]["wind_speed_10m"]
            return (f"The current weather in {city.title()} ({country}) is {temp}{temp_unit_str} "
                    f"with wind speeds of {wind} {wind_unit_str}.")
        else:
            return f"Weather data for {city} is unavailable from the API."
    except Exception as e:
        return f"Error fetching weather for {city}: {e}"
This Python function employs a two-stage API interaction. Standard weather APIs typically require precise geographical coordinates, so the function first takes the city name provided by the language model and uses a geocoding service to convert it into latitude and longitude. Once these coordinates are obtained, the function queries the weather forecast endpoint and then synthesizes the retrieved data into a concise, natural language string.
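The fields the function extracts from the geocoding stage can be seen in a trimmed-down sample of an Open-Meteo geocoding response (the values below are illustrative, not live data; the real response carries many more fields per result):

```python
# Illustrative shape of an Open-Meteo geocoding response for "Ottawa"
sample_geo = {
    "results": [
        {"name": "Ottawa", "latitude": 45.42, "longitude": -75.69, "country": "Canada"}
    ]
}

# The function reads the first result and pulls out the coordinates it needs
location = sample_geo["results"][0]
lat, lon = location["latitude"], location["longitude"]
country = location.get("country", "")
```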
However, simply defining the Python function is insufficient. The language model must be explicitly informed about this capability. This is achieved by mapping the Python function into an Ollama-compliant JSON schema dictionary:
{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Gets the current temperature for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city name, e.g. Tokyo"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["city"]
        }
    }
}
This structured blueprint is critical. It precisely defines the expected data types, specifies strict string enumerations (like "celsius" or "fahrenheit"), and identifies required parameters. These specifications guide the gemma4:e2b model to reliably generate syntactically correct function calls.
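On the host side, the names declared in these schemas must be bound to actual callables. A minimal dispatch sketch, with a stub standing in for the real get_current_weather implementation, might look like this (`dispatch` is a hypothetical helper introduced here for illustration):

```python
def get_current_weather(city: str, unit: str = "celsius") -> str:
    # Stub standing in for the real implementation shown earlier
    return f"Stub weather for {city} in {unit}"

# Registry mapping the tool names the model sees to Python callables
TOOL_FUNCTIONS = {"get_current_weather": get_current_weather}

def dispatch(tool_call: dict) -> str:
    """Executes one entry from the model's tool_calls array."""
    name = tool_call["function"]["name"]
    arguments = tool_call["function"]["arguments"]
    if name not in TOOL_FUNCTIONS:
        return f"Unknown function: {name}"
    return TOOL_FUNCTIONS[name](**arguments)
```

Because the schema constrains the model's output, the arguments dictionary can usually be splatted directly into the function call, with a TypeError guard for malformed arguments.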
The Inner Workings of Tool Calling
The core of the autonomous workflow is managed by the main loop orchestrator. When a user initiates a prompt, an initial JSON payload is constructed for the Ollama API. This payload explicitly designates the gemma4:e2b model and includes a tools array containing the JSON schema definitions of the available toolkit.
# Initial payload to the model
messages = [{"role": "user", "content": user_query}]
payload = {
    "model": "gemma4:e2b",
    "messages": messages,
    "tools": available_tools,
    "stream": False
}
try:
    response_data = call_ollama(payload)
except Exception as e:
    print(f"Error calling Ollama API: {e}")
    return
message = response_data.get("message", {})
Upon receiving the response to the initial web request, it is imperative to analyze the structure of the returned message block. The application does not simply assume the presence of text. Instead, it checks for a tool_calls array, which the model will attach to signal its intention to utilize an available tool.

If tool_calls are present, the standard response synthesis workflow is temporarily suspended. The application parses the requested function name from each tool_calls entry and executes the corresponding Python tool with the dynamically parsed arguments. The live data returned by the executed tool is then injected back into the conversation history.
# Check if the model decided to call tools
if "tool_calls" in message and message["tool_calls"]:
    # Add the model's tool calls to the chat history
    messages.append(message)
    # Execute each tool call
    for tool_call in message["tool_calls"]:
        function_name = tool_call["function"]["name"]
        arguments = tool_call["function"]["arguments"]
        if function_name in TOOL_FUNCTIONS:
            func = TOOL_FUNCTIONS[function_name]
            try:
                # Execute the underlying Python function
                result = func(**arguments)
                # Add the tool response to the messages history
                messages.append({
                    "role": "tool",
                    "content": str(result),
                    "name": function_name
                })
            except TypeError as e:
                print(f"Error calling function {function_name}: {e}")
        else:
            print(f"Unknown function: {function_name}")
    # Send the tool results back to the model to get the final answer
    payload["messages"] = messages
    try:
        final_response_data = call_ollama(payload)
        print("[RESPONSE]")
        print(final_response_data.get("message", {}).get("content", "") + "\n")
    except Exception as e:
        print(f"Error calling Ollama API for final response: {e}")
This secondary interaction is crucial. After the tool result is appended with a "tool" role, the entire messages history is bundled again and sent back to the API. This second pass enables the gemma4:e2b model to reason over the live results of the calls it previously requested, bridging the gap between raw tool output and a logical, human-readable answer.
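Concretely, by the time of the second pass the messages array has roughly the following shape (the contents here are illustrative, not a captured transcript):

```python
# Illustrative conversation history at the time of the second API call
messages = [
    {"role": "user", "content": "What is the weather in Ottawa?"},
    {"role": "assistant",  # the model's tool request from the first pass
     "tool_calls": [{"function": {"name": "get_current_weather",
                                  "arguments": {"city": "Ottawa"}}}]},
    {"role": "tool", "name": "get_current_weather",
     "content": "The current weather in Ottawa (Canada) is 3°C ..."},
]

roles = [m["role"] for m in messages]
```

The user turn, the model's own tool request, and the injected tool result are all present, so the model can ground its final answer in the live data.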
Expanding Capabilities with Additional Tools
With the foundational architecture in place, extending the agent’s capabilities is a matter of modularly adding new Python functions. Following the same methodology, three additional live tools have been incorporated:
- convert_currency: This function leverages a currency exchange rate API to convert monetary values between different currencies.
- get_current_time: This tool fetches the current time for a specified city, utilizing a time API.
- get_latest_news: This function integrates with a news API to retrieve the latest headlines relevant to a particular location or topic.
Each of these new capabilities is processed through the JSON schema registry, thereby expanding the baseline model’s utility without requiring complex external orchestration or heavy dependencies.
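For instance, a schema for convert_currency would follow the same pattern as the weather tool. The parameter names below are a hypothetical sketch, not the repository's exact definitions:

```python
# Hypothetical schema for the convert_currency tool, mirroring the weather tool's shape
CURRENCY_TOOL = {
    "type": "function",
    "function": {
        "name": "convert_currency",
        "description": "Converts an amount between two currencies using live rates.",
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "from_currency": {"type": "string", "description": "ISO code, e.g. CAD"},
                "to_currency": {"type": "string", "description": "ISO code, e.g. EUR"},
            },
            "required": ["amount", "from_currency", "to_currency"],
        },
    },
}
```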
Demonstrating Tool Calling in Action
To validate the implementation, several test queries were performed.
The first query, "What is the weather in Ottawa?", successfully invoked the get_current_weather function. The system returned a detailed weather report for Ottawa, confirming the accurate retrieval and formatting of data.
Next, the query "Given the current currency exchange rate, how much is 1200 Canadian dollars in euros?" demonstrated the convert_currency tool. The agent accurately performed the conversion, providing the equivalent value in euros.

The real test of the system’s robustness came with a more complex, multi-part query: "I am going to France next week. What is the current time in Paris? How many euros would 1500 Canadian dollars be? what is the current weather there? what is the latest news about Paris?"
This single prompt triggered four distinct tool calls: get_current_time for Paris, convert_currency for the dollar-to-euro conversion, get_current_weather for Paris, and get_latest_news related to Paris. The agent successfully processed each of these requests sequentially, providing a comprehensive response that addressed all aspects of the user’s query. This demonstration is particularly noteworthy given that it was executed on a 4 billion parameter model with only half its parameters active during inference, highlighting the efficiency and power of Gemma 4.
Extensive testing over the weekend revealed the model’s consistent performance. Hundreds of prompts, even with variations in phrasing, did not lead to reasoning failures, underscoring the reliability of Gemma 4’s tool-calling capabilities.
Broader Implications and Future Directions
The advent of robust tool-calling behavior within open-weight models represents one of the most significant and practical advancements in local AI development. The release of Gemma 4, in particular, empowers users to build complex, autonomous systems securely offline, free from the constraints of cloud-based services and API limitations. By architecturally integrating direct access to web resources, local file systems, raw data processing logic, and localized APIs, even low-powered consumer devices can now operate with a degree of autonomy previously exclusive to high-end cloud infrastructure.
This paradigm shift opens doors to a wide array of applications, from sophisticated personal assistants and automated data analysis tools to custom workflow automations for both individuals and businesses. The ability to run these sophisticated agents locally ensures enhanced data privacy, reduced operational costs, and greater control over the AI’s behavior.
The next logical step in this development trajectory is the creation of a fully agentic system. This would involve enabling the agent to not only respond to user queries but also to proactively identify tasks, plan sequences of actions, and interact with its environment dynamically and autonomously. The foundation laid by Gemma 4 and Ollama provides a powerful and accessible platform for exploring these advanced agentic capabilities, promising further innovation in the field of decentralized and privacy-preserving artificial intelligence.