Asking GPT-4 to write Python that parses HTML and saves the data as JSON: the PTT Tennis board, the WWDC video list, and a Wikipedia page as examples

彼得潘的 iOS App Neverland
16 min read · Jul 8, 2023

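We can now ask GPT-4 not only to write code but also to run it, which makes some handy workflows possible. For example, we can upload an HTML file we are interested in, ask GPT-4 to write a Python scraper and run it, and then download the JSON file the program produces.

Let's try asking the AI to parse a few interesting web pages.

  • The PTT Tennis board article list.
  • The Apple WWDC video list.
  • The Agatha Christie bibliography on Wikipedia.
  • The Kebuke (可不可) drink menu.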

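PTT Tennis board article list

Let's parse information about the Wimbledon matches that are currently in full swing.

Save the HTML of the PTT Tennis board article list

Open the PTT Tennis board article list in the browser.

Choose File > Save As.

For Format, choose Page Source.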
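The manual save steps above can also be done from a script. The following is only a minimal sketch using the standard library; the function name `save_page_source`, the browser-like `User-Agent` value, and the 10-second timeout are illustrative assumptions, not part of the original workflow.

```python
import urllib.request
from pathlib import Path


def save_page_source(url: str, dest: str) -> int:
    """Download url, write the raw bytes to dest, and return the byte count."""
    # Some sites reject urllib's default User-Agent, so send a browser-like one
    # (an assumption; the target site may not require it).
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read()
    Path(dest).write_bytes(html)
    return len(html)
```

A call might look like `save_page_source("https://www.ptt.cc/bbs/Tennis/index.html", "tennis.html")`, where both the index URL and the output file name are assumed for illustration. The saved file can then be uploaded exactly like one saved from the browser.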

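In GPT-4, select Advanced data analysis

Click the + in the input box to upload a file

Select the HTML file to upload

Ask it to write a Python program that parses the HTML and outputs the article list in JSON format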

import json

from bs4 import BeautifulSoup

# Load the uploaded page source. This setup step is not shown in the original
# excerpt; the file name 'tennis.html' is an assumption.
with open('tennis.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Assumption: each article on the board index is a <div class="r-ent"> row
rows = soup.find_all('div', class_='r-ent')

# Function to extract information from a row
def extract_info(row):
    # Initialize a dictionary to store the information
    info = {}

    # Extract the title and link (deleted posts have a title but no link)
    title_div = row.find('div', class_='title')
    title_link = title_div.find('a')
    if title_link is not None:
        info['title'] = title_link.text.strip()
        info['link'] = "https://www.ptt.cc" + title_link.get('href')
    else:
        info['title'] = title_div.text.strip()
        info['link'] = None

    # Extract the author
    author_div = row.find('div', class_='author')
    info['author'] = author_div.text.strip()

    # Extract the date
    date_div = row.find('div', class_='date')
    info['date'] = date_div.text.strip()

    # Extract the number of recommendations (empty when none are shown)
    nrec_div = row.find('div', class_='nrec')
    nrec_span = nrec_div.find('span')
    if nrec_span is not None:
        info['nrec'] = nrec_span.text.strip()
    else:
        info['nrec'] = ''

    return info

# Loop through all the rows and extract information
articles = [extract_info(row) for row in rows]

# Convert the list of dictionaries to a JSON string
# (ensure_ascii=False keeps the Chinese titles readable)
json_articles = json.dumps(articles, ensure_ascii=False, indent=4)

json_articles

Next, ask it to save the JSON as a file

Click to download articles.json.
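For reference, saving the list to a file takes only a few lines. This is a minimal sketch rather than the code GPT-4 actually ran; the helper names `save_articles` and `load_articles` are made up for illustration, and only the file name articles.json comes from the download step.

```python
import json


def save_articles(articles: list, path: str = "articles.json") -> None:
    """Write the article list to path as UTF-8 encoded JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese titles readable instead of \uXXXX escapes
        json.dump(articles, f, ensure_ascii=False, indent=4)


def load_articles(path: str = "articles.json") -> list:
    """Read the article list back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `save_articles(articles)` and then `load_articles()` should return the same list, which is a quick way to confirm the file is valid JSON.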

We now have the Tennis board articles as JSON, produced by the scraper.

[
    {
        "title": "[情報] 溫網 博斯轉播場次(7/06)",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688621022.A.104.html",
        "author": "jack86326",
        "date": "7/06",
        "nrec": "31"
    },
    {
        "title": "[新聞] 36歲仍宰制網壇 喬帥:內心保持年輕",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688633099.A.A3F.html",
        "author": "Gbywhite6969",
        "date": "7/06",
        "nrec": "35"
    },
    {
        "title": "[LIVE] Andrey Rublev VS Aslan Karatsev",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688637313.A.ACD.html",
        "author": "lowes",
        "date": "7/06",
        "nrec": "17"
    },
    {
        "title": "[LIVE] Wawrinka VS Etcheverry",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688638246.A.DFA.html",
        "author": "monkeyqoo",
        "date": "7/06",
        "nrec": "爆"
    },
    {
        "title": "[LIVE] Liam Broady VS Casper Ruud",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688646302.A.852.html",
        "author": "lowes",
        "date": "7/06",
        "nrec": "70"
    },
    {
        "title": "[LIVE] Alexander Zverev VS Gijs Brouwer",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688647713.A.01A.html",
        "author": "thnlkj0665",
        "date": "7/06",
        "nrec": "12"
    },
    {
        "title": "[新聞] 瓦文卡過關預約夢幻對決 對手約克維奇:",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688651779.A.F97.html",
        "author": "Supergreen",
        "date": "7/06",
        "nrec": "74"
    },
    {
        "title": "[LIVE] Caroline Garcia vs. Leylah Fernandez",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688655557.A.419.html",
        "author": "thnlkj0665",
        "date": "7/06",
        "nrec": "27"
    },
    {
        "title": "[新聞] 大家16歲在幹嘛? 俄國天才少女連兩場大",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688659192.A.85E.html",
        "author": "iamshana",
        "date": "7/06",
        "nrec": "19"
    },
    {
        "title": "[LIVE] Alize Cornet VS Elena Rybakina",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688660702.A.A02.html",
        "author": "lowes",
        "date": "7/07",
        "nrec": "72"
    },
    {
        "title": "[LIVE] Andy Murray VS Stefanos Tsitsipas",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688668622.A.95B.html",
        "author": "lowes",
        "date": "7/07",
        "nrec": "爆"
    },
    {
        "title": "[LIVE] Daniil Medvedev VS Adrian Mannarino",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688670948.A.4D9.html",
        "author": "thnlkj0665",
        "date": "7/07",
        "nrec": "8"
    },
    {
        "title": "[情報] Anett Kontaveit 正式結束球員生涯",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688697669.A.85D.html",
        "author": "sthho",
        "date": "7/07",
        "nrec": "32"
    },
    {
        "title": "[情報] 溫網 博斯轉播場次(7/07)",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688720389.A.295.html",
        "author": "jack86326",
        "date": "7/07",
        "nrec": "52"
    },
    {
        "title": "[LIVE] Aliaksandra Sasnovich VS Petra Kvitova",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688723702.A.F5D.html",
        "author": "lowes",
        "date": "7/07",
        "nrec": "15"
    },
    {
        "title": "[新聞] 溫網》大滿貫350勝 喬帥史上第3位",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688728426.A.7D8.html",
        "author": "hvariables",
        "date": "7/07",
        "nrec": "20"
    },
    {
        "title": "[LIVE] Carlos Alcaraz VS Alexandre Muller",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688732702.A.B01.html",
        "author": "lowes",
        "date": "7/07",
        "nrec": "49"
    },
    {
        "title": "[LIVE] Aryna Sabalenka vs. Varvara Gracheva",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688734185.A.BCF.html",
        "author": "thnlkj0665",
        "date": "7/07",
        "nrec": "25"
    },
    {
        "title": "[情報] 胖卡轉推:囧直落三最有可能啦",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688741383.A.F37.html",
        "author": "unred",
        "date": "7/07",
        "nrec": "45"
    },
    {
        "title": "[LIVE] 謝淑薇/Strycova VS Blinkova/Gracheva",
        "link": "https://www.ptt.cc/bbs/Tennis/M.1688746009.A.7CD.html",
        "author": "thnlkj0665",
        "date": "7/08",
        "nrec": "82"
    }
]
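Note that nrec is stored as a string and is not always a number: two of the entries above carry "爆". To sort or filter by popularity, the value needs to be normalized first. The sketch below is one possible way to do that; the helper names and the choice to treat "爆" as 100 are assumptions about PTT's display convention, and any other non-numeric label is simply left as None.

```python
from typing import Optional


def nrec_to_int(nrec: str) -> Optional[int]:
    """Map an nrec label to a number, or None if the label is not recognised."""
    if nrec == "":
        return 0  # no recommendation count shown
    if nrec == "爆":
        return 100  # assumption: "爆" is shown for 100 or more
    if nrec.isdecimal():
        return int(nrec)
    return None  # any other label is left unhandled in this sketch


def hot_articles(articles, threshold=50):
    """Return the articles whose normalized nrec is at least threshold."""
    result = []
    for article in articles:
        score = nrec_to_int(article["nrec"])
        if score is not None and score >= threshold:
            result.append(article)
    return result


# Three records copied from the JSON output (other fields omitted)
sample = [
    {"title": "[情報] 溫網 博斯轉播場次(7/06)", "nrec": "31"},
    {"title": "[LIVE] Wawrinka VS Etcheverry", "nrec": "爆"},
    {"title": "[LIVE] Liam Broady VS Casper Ruud", "nrec": "70"},
]

for article in hot_articles(sample):
    print(article["title"])
```

With the default threshold of 50, the two LIVE threads are printed and the broadcast schedule post is filtered out.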

Apple WWDC video list

Save the HTML of the Apple WWDC video list

Upload the HTML and ask for a Python program that parses it and saves the data as a JSON file

# video_sections is the list of per-video elements located in an earlier
# step that is not shown in this excerpt
video_data = []

# Loop through each video section and extract information
for video in video_sections:
    # Extract video title
    title_section = video.find('h4', {'class': 'no-margin-bottom video-title'})
    title = title_section.text.strip() if title_section else None

    # Extract video description
    description_section = video.find('p', {'class': 'description'})
    description = description_section.text.strip() if description_section else None

    # Extract video URL
    video_url_section = video.find('a', {'class': 'video-image-link'})
    video_url = video_url_section['href'] if video_url_section else None

    # Extract image URL
    image_url_section = video.find('img', {'class': 'video-image medium-scale'})
    image_url = image_url_section['src'] if image_url_section else None

    # Extract video duration
    duration_section = video.find('span', {'class': 'video-duration'})
    duration = duration_section.text.strip() if duration_section else None

    # Extract WWDC year
    wwdc_year_section = video.find('span', {'class': 'smaller'})
    wwdc_year = wwdc_year_section.text.strip() if wwdc_year_section else None

    # Append the video data to the list
    video_data.append({
        'title': title,
        'description': description,
        'video_url': video_url,
        'image_url': image_url,
        'duration': duration,
        'wwdc_year': wwdc_year,
    })

# Show the first few items to verify extraction
video_data[:5]

We now have the WWDC videos as JSON, produced by the scraper. The file contains 1096 videos.

https://raw.githubusercontent.com/AppPeterPan/WWDCJSON/main/video_data.json
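Once the JSON is available, a quick sanity check is to count the records per wwdc_year. The sketch below assumes each record has the keys produced by the extraction code above; the function name `count_by_year` and the sample records are hypothetical and do not reflect the real file's values.

```python
from collections import Counter


def count_by_year(videos):
    """Count videos per 'wwdc_year' value; missing values are grouped as 'unknown'."""
    return Counter(video.get("wwdc_year") or "unknown" for video in videos)


# Hypothetical records for illustration only (not taken from video_data.json)
sample_videos = [
    {"title": "Sample session A", "wwdc_year": "WWDC23"},
    {"title": "Sample session B", "wwdc_year": "WWDC23"},
    {"title": "Sample session C", "wwdc_year": "WWDC22"},
    {"title": "Sample session D", "wwdc_year": None},
]

for year, count in count_by_year(sample_videos).most_common():
    print(year, count)
```

For the real data, load the downloaded file with `json.load` and pass the resulting list to `count_by_year`; a large "unknown" bucket would suggest that the year selector missed some entries.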

Agatha Christie bibliography on Wikipedia

Our goal is to turn the list of novels in the bibliography's table into JSON.

The generated JSON.

https://raw.githubusercontent.com/AppPeterPan/Wiki/main/agatha_christie_novels.json
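Turning an HTML table into JSON comes down to reading the header row and pairing it with each data row. The following is a minimal standard-library sketch of that idea, not the code from the conversation. The names `TableParser` and `table_to_records` and the two-column sample table are assumptions, and the sketch ignores rowspan and colspan, which real Wikipedia tables often use.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Use the first row as keys and return one dict per remaining row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical two-column table; the real page's columns may differ
sample_html = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td><i>The Mysterious Affair at Styles</i></td><td>1920</td></tr>
  <tr><td><i>The Secret Adversary</i></td><td>1922</td></tr>
</table>
"""

print(json.dumps(table_to_records(sample_html), ensure_ascii=False, indent=4))
```

Text inside nested tags such as `<i>` is kept because the parser collects all character data while a cell is open.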

The conversation with GPT-4.

Kebuke (可不可) drink menu

Our goal is to turn the drink list into JSON.

The generated JSON.

https://raw.githubusercontent.com/AppPeterPan/kebuke/main/menu.json

The conversation with GPT-4.

Instructor of the courses 彼得潘的iOS App程式設計入門 and 文組生的iOS App程式設計入門; author of 彼得潘的 Swift 程式設計入門 and App程式設計入門. http://apppeterpan.strikingly.com