Never stop learning, approaching the AI

Text cleaning

import json
import os
import re  # regular expressions
import string
import unicodedata

json_o_filename = './output_files/json_file_original'
text_o_filename = './output_files/text_original.txt'
cleaned_text_filename = './output_files/cleaned_tweet_text.txt'
# Read tweets from a JSON file
def read_json_file(json_filename, json_file_number):
    json_file = json_filename + '_' + str(json_file_number) + '.txt'
    # Fail early with a clear error instead of returning an unbound variable
    if not os.path.exists(json_file):
        raise FileNotFoundError(json_file)
    with open(json_file, 'r', encoding="utf-8") as f:
        return json.load(f)

# Text-cleaning functions
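def remove_at(text_sentence):
    # Remove @mentions (raw string avoids an invalid-escape warning for \S)
    text_out = re.sub(r"@\S+", '', text_sentence)
    return text_out

def remove_hashtag(text_sentence):
    # Remove #hashtags
    text_out = re.sub(r"#\S+", '', text_sentence)
    return text_out

def remove_url(text_sentence):
    # Remove http:// and https:// links without touching words that merely start with "http"
    text_out = re.sub(r"https?://\S+", '', text_sentence)
    return text_out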

def remove_punctuation(text_sentence):
    text_out = re.sub('[%s]' % re.escape(string.punctuation),'',text_sentence)
    return text_out

def remove_number(text_sentence):
    text_out = re.sub(r'\w*\d+\w*','',text_sentence)
    return text_out

def remove_space(text_sentence):
    # Collapse runs of whitespace into one space so adjacent words are not merged
    text_out = re.sub(r'\s{2,}', ' ', text_sentence)
    text_out = text_out.strip()
    return text_out

def remove_others(text_sentence):
    # Chain the replacements so each one builds on the previous result
    text_out = text_sentence.replace('\r\n', '')   # Windows line ending (CR LF)
    text_out = text_out.replace('\r', '')          # carriage return
    text_out = text_out.replace('\t', ' ')         # horizontal tab
    text_out = text_out.replace('\f', ' ')         # form feed
    return text_out

def remove_unicode(text_sentence):
    # Drop every character that cannot be encoded as ASCII (emoji, accented letters, CJK)
    text_out = text_sentence.encode('ascii', 'ignore').decode()
    return text_out

def join_multi_line(text_sentence):
    # Works for a string (drops every '\n') or for a list of lines
    return ''.join(line.strip('\n') for line in text_sentence)

def clear_text(text):
    # Run every cleaning step in order, printing the intermediate result
    steps = [remove_at, remove_hashtag, remove_url, remove_punctuation,
             remove_number, remove_space, remove_unicode, remove_others,
             join_multi_line]
    print('--------------- begin to clean ---------------\n')
    for number, step in enumerate(steps, start=1):
        print('\n%d- %s\n' % (number, step.__name__))
        text = step(text)
        print(text)
    print('\n--------------- end clean ----------------\n')
    # Return the result so callers can use the cleaned text
    return text

def save_o_text(filename, json_parsed, option):
    with open(filename, option, encoding="utf-8") as f:
        title = 'number | created time | text \n'
        f.write(title)
        for number, data in enumerate(json_parsed, start=1):
            create_time = data['created_at']
            tweet_text = remove_space(data['text'])
            # Keep only the date part (first 10 characters) of the timestamp
            text = str(number) + ' | ' + create_time[:10] + ' | ' + tweet_text + '\n'
            f.write(text)
            print(text)
            cleaned_text = clear_text(text)
            # End the cleaned line with a newline so the next row starts on its own line
            f.write('\n' + cleaned_text + '\n')

# start processing
json_file_number = 1
json_parsed = read_json_file(json_o_filename, json_file_number)
# print(json.dumps(json_parsed, indent=4, sort_keys=True))

save_o_text(text_o_filename, json_parsed['data'], 'w')
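
To try the cleaning steps without the JSON input files, the following is a minimal standalone sketch that applies the same kinds of substitutions to one string. The function name `clean_tweet` and the sample tweet are hypothetical and are not part of the script above. The sketch also makes two assumptions of its own: it matches URLs with `https?://`, and it collapses whitespace as the last step so that gaps left by earlier removals are cleaned up.

```python
import re
import string


def clean_tweet(text):
    """Apply the cleaning steps in sequence and return the cleaned string."""
    text = re.sub(r"@\S+", "", text)                # @mentions
    text = re.sub(r"#\S+", "", text)                # #hashtags
    text = re.sub(r"https?://\S+", "", text)        # URLs
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\w*\d+\w*", "", text)           # tokens that contain digits
    text = text.encode("ascii", "ignore").decode()  # non-ASCII characters
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace last
    return text


# Hypothetical sample input, for illustration only
sample = "@alice Loving the new #AI model 🚀 see https://example.com/v2 now!!  100 times\r\nbetter"
print(clean_tweet(sample))
# → Loving the new model see now times better
```

Running the whitespace step last means line breaks, tabs, and doubled spaces are all reduced to single spaces in one pass, so separate handling for `\r\n` and `\t` may not be needed in this variant.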

Last updated 3 years ago
