The rapid growth of web content makes it challenging to extract and summarize relevant information efficiently. In this tutorial, we show how to scrape web pages with Firecrawl and process the extracted data with AI models such as Google Gemini. By integrating these tools in Google Colab, we build an end-to-end workflow that crawls websites, extracts meaningful content, and generates concise summaries using state-of-the-art language models. Whether you want to automate research, extract insights from articles, or build AI-powered applications, this tutorial offers a robust and adaptable solution.
!pip install google-generativeai firecrawl-py
First, we install google-generativeai and firecrawl-py, the two libraries required for this tutorial. google-generativeai provides access to the Google Gemini API for AI-driven text generation, while firecrawl-py handles web scraping, fetching content from websites in a structured (Markdown) format.
import os
from getpass import getpass
# Enter your API key (input is hidden as you type)
os.environ["FIRECRAWL_API_KEY"] = getpass("Enter your Firecrawl API key: ")
Next, we securely store the Firecrawl API key as an environment variable in Google Colab. getpass() prompts for the key without echoing it to the screen, keeping it confidential. Storing it in os.environ lets us authenticate Firecrawl's scraping functions easily throughout the session.
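If you rerun cells in Colab, it can be convenient to reuse a key that is already stored instead of being prompted again. Below is a minimal, hypothetical helper (not part of the original tutorial) that checks the environment first and only falls back to getpass() when the key is missing:

```python
import os
from getpass import getpass

def get_api_key(env_var: str) -> str:
    """Return the key from the environment if present; otherwise prompt for it
    securely and cache it in os.environ for the rest of the session."""
    key = os.environ.get(env_var)
    if not key:
        key = getpass(f"Enter your {env_var}: ")
        os.environ[env_var] = key
    return key
```

Usage would look like `firecrawl_key = get_api_key("FIRECRAWL_API_KEY")`; the same helper works for the Gemini key later on.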
from firecrawl import FirecrawlApp
firecrawl_app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
target_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
result = firecrawl_app.scrape_url(target_url)
page_content = result.get("markdown", "")
print("Scraped content length:", len(page_content))
We initialize Firecrawl by creating a FirecrawlApp instance with the stored API key. We then scrape the content of a specific website (in this case, the Wikipedia page for the Python programming language) and extract it in Markdown format. Finally, we print the length of the scraped content, letting us verify that the download succeeded before further processing.
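Scraped pages are often much longer than what we later send to the model, and a hard slice like `page_content[:4000]` can cut a sentence in half. As a small sketch (a hypothetical helper, not part of the Firecrawl API), we can trim the Markdown at a paragraph or word boundary instead:

```python
def truncate_for_prompt(text: str, limit: int = 4000) -> str:
    """Trim text to a character budget, cutting at a paragraph boundary
    where possible (falling back to a word boundary) so the model
    receives complete sentences."""
    if len(text) <= limit:
        return text
    cut = text.rfind("\n\n", 0, limit)   # prefer a paragraph break
    if cut == -1:
        cut = text.rfind(" ", 0, limit)  # fall back to a word break
    return text[:cut] if cut != -1 else text[:limit]
```

You could then pass `truncate_for_prompt(page_content)` to the summarization step instead of a raw slice.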
import google.generativeai as genai
from getpass import getpass
# Securely input your Gemini API Key
GEMINI_API_KEY = getpass("Enter your Google Gemini API Key: ")
genai.configure(api_key=GEMINI_API_KEY)
We initialize the Google Gemini API, capturing the API key securely with getpass() so it is never displayed in plain text. The genai.configure(api_key=GEMINI_API_KEY) call configures the API client, enabling interaction with Google's Gemini models for text generation and summarization. This ensures the client is authenticated before any requests are sent to the model.
for model in genai.list_models():
    print(model.name)
We list the models available through the Google Gemini API using genai.list_models() and print their names. This helps verify which models are accessible with your API key so you can pick the right one for the task, such as text generation or summarization. If a desired model is not found, this step helps you debug and choose an alternative.
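Model availability varies by API key and region, so rather than hard-coding a single model name, you could select from the listed names with a fallback. The helper below is a small sketch (the function name and preference list are our own, not from the SDK); it operates on plain strings, so it works on the output of the listing step above:

```python
def pick_model(available: list[str], preferred: list[str]) -> str:
    """Return the first available model whose name contains a preferred
    substring; fall back to the first available model if none match."""
    for wanted in preferred:
        for full_name in available:
            if wanted in full_name:
                return full_name
    if not available:
        raise RuntimeError("No models are available for this API key")
    return available[0]
```

With the Gemini SDK, this might be called as `pick_model([m.name for m in genai.list_models()], ["gemini-1.5-pro", "gemini-1.5-flash"])`.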
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(f"Summarize this:\n\n{page_content[:4000]}")
print("Summary:\n", response.text)
Finally, we instantiate the Gemini 1.5 Pro model with genai.GenerativeModel("gemini-1.5-pro") and send it a request to summarize the scraped content. We limit the input to the first 4,000 characters to stay within API restrictions. The model processes the request and returns a concise summary, which we then print, giving a structured, AI-generated overview of the extracted website content.
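For pages longer than the 4,000-character budget, a common pattern is to split the text into chunks, summarize each chunk, and then summarize the combined partial summaries (a map-reduce style approach). Below is a minimal chunking sketch under those assumptions; the function name is our own and the chunks would each be passed to model.generate_content() in a loop:

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    preferring to break at paragraph boundaries."""
    chunks = []
    while len(text) > max_chars:
        cut = text.rfind("\n\n", 0, max_chars)
        if cut <= 0:                 # no paragraph break found; hard cut
            cut = max_chars
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each chunk can then be summarized independently, and the concatenated partial summaries summarized once more for a final result. Keep in mind this multiplies the number of API calls, which matters under quota limits.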
In summary, by combining Firecrawl and Google Gemini we have created an automated pipeline that scrapes web content and generates meaningful summaries with minimal effort. The approach is flexible, allowing you to substitute other AI models depending on API availability and quota limits. Whether you are working on NLP applications, research automation, or content aggregation, this workflow enables efficient extraction and summarization of data at scale.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of the AI media platform Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.